Abstract by Joshua Greaves

Personal Information

Presenter's Name

Joshua Greaves

Degree Level



Kolby Nottingham

Abstract Information


Computer Science

Faculty Advisor

David Wingate


Sample-Efficient Distributional Policy Gradients


Policy gradient algorithms are desirable because they are conceptually simple and often easy to implement. However, a major drawback is their sample inefficiency. One solution is to use off-policy policy gradients, but these are prone to instability outside of a trust region. We present a distributional actor-critic policy gradient algorithm that maintains this simplicity while using off-policy data to improve sample efficiency. We use a distributional critic trained over multiple policies to capture more information about the structure of the task, and we backpropagate through this critic to train the actor. We show that this makes the actor updates more sample-efficient, since the actor is trained with off-policy data that incorporates information from multiple ways of interacting with the environment.
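The core mechanism described above — a distributional critic whose expected value is differentiated to update the actor — can be sketched in a few lines. This is not the authors' implementation; it is a minimal illustrative example assuming a C51-style categorical critic over a fixed support of returns, a toy linear actor, and a finite-difference approximation standing in for backpropagation through the critic. All shapes, names, and hyperparameters here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed categorical support for the return distribution (C51-style).
N_ATOMS = 51
V_MIN, V_MAX = -10.0, 10.0
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Toy linear critic: maps concatenated (state, action) features
# to logits over the return atoms.
W = rng.normal(scale=0.1, size=(N_ATOMS, 4))

def critic_dist(state, action):
    feats = np.concatenate([state, action])
    return softmax(W @ feats)            # probability mass over atoms

def expected_q(state, action):
    # Scalar value recovered from the critic's return distribution.
    return float(critic_dist(state, action) @ atoms)

# Toy deterministic actor: action = theta * state (elementwise).
theta = np.zeros(2)

def actor(state):
    return theta * state

def actor_grad(state, eps=1e-5):
    # Ascend dE[Q]/dtheta. A real implementation would backpropagate
    # through the critic network; central finite differences stand in
    # for that here so the sketch stays dependency-free.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[i] += eps
        t_minus[i] -= eps
        g[i] = (expected_q(state, t_plus * state)
                - expected_q(state, t_minus * state)) / (2 * eps)
    return g

state = np.array([1.0, -0.5])
q_before = expected_q(state, actor(state))
theta = theta + 0.05 * actor_grad(state)  # one gradient-ascent step
q_after = expected_q(state, actor(state))
```

In the full algorithm, the critic would be trained from off-policy transitions gathered by multiple policies (e.g. with a distributional Bellman backup), so the actor's gradient reflects information from several ways of interacting with the environment rather than only its own recent rollouts.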