Sample-Efficient Distributional Policy Gradients
Abstract by Joshua Greaves
Policy gradient algorithms are desirable because they are conceptually simple and often easy to implement. However, a major drawback is that they are sample-inefficient. One solution is to use off-policy policy gradients, but these are prone to instability outside of a trust region. We present a distributional actor-critic policy gradient algorithm that maintains this simplicity while using off-policy data to increase sample efficiency. We use a distributional critic over multiple policies to capture more information about the structure of the task, and backpropagate through this critic to train the actor. We show that this makes the actor updates more sample-efficient, since the actor is trained with off-policy data that incorporates information from multiple ways of interacting with the environment.
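The "backpropagate through the critic" step can be sketched concretely. Below is a minimal, illustrative example: a distributional critic represented by a fixed support of atoms and action-dependent categorical probabilities, with the actor's action nudged in the direction that raises the critic's expected return. The linear-softmax critic, the feature layout, and all names and shapes here are assumptions made for the sketch, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Distributional critic: a categorical distribution over a fixed support of
# atoms, with probabilities produced from (state, action) features.
n_atoms = 11
atoms = np.linspace(-10.0, 10.0, n_atoms)   # support of the return distribution

# Illustrative assumption: critic logits are linear in the concatenated
# (state, action) feature vector of length 3 (2 state dims + 1 action dim).
W = rng.normal(size=(n_atoms, 3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def critic_probs(feat):
    """Probabilities the critic assigns to each return atom."""
    return softmax(W @ feat)

def expected_return(feat):
    """Expected value of the critic's return distribution."""
    return critic_probs(feat) @ atoms

def grad_expected_return(feat):
    """Gradient of the expected return w.r.t. the input features,
    computed by chaining the softmax Jacobian through the linear logits."""
    p = critic_probs(feat)
    jac = (np.diag(p) - np.outer(p, p)) @ W   # d p / d feat
    return atoms @ jac

state = np.array([0.5, -0.2])
action = np.array([0.1])
feat = np.concatenate([state, action])

# Deterministic-policy-gradient-style actor step: ascend the gradient of the
# critic's expected value with respect to the action component.
lr = 0.01
g = grad_expected_return(feat)[-1]
new_action = action + lr * g
```

In a full algorithm the critic would be a neural network trained on off-policy transitions from multiple behavior policies, and the same chain rule would be applied by an autodiff framework rather than by hand.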