Pictured: A simulation of the learning algorithm in the CartPole-v0 environment provided by OpenAI Gym.

This past week I started playing around with TensorFlow for reinforcement learning. I implemented the policy gradients with parameter-based exploration (PGPE) algorithm, described in [1], to learn a policy computed by a simple two-layer neural network controller. I built the network in TensorFlow; to train it, I extracted all of its parameters and passed them to the PGPE algorithm, which I implemented in NumPy. PGPE maintains a multivariate Gaussian distribution over the parameters and samples parameter vectors from it to test against the environment. It then uses the returns from those simulated rollouts to estimate a gradient and update the distribution's mean and covariance matrix.
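To make the update concrete, here is a minimal NumPy sketch of one PGPE step with symmetric sampling, loosely following the update form in [1]. It is not my actual training code. It assumes a diagonal Gaussian (one standard deviation per parameter) rather than a full covariance matrix, and it replaces the Gym rollout with a hypothetical `toy_return` function. The names `pgpe_step`, `episode_return`, and the learning rates are illustrative choices, and a practical version would also normalize the returns.

```python
import numpy as np


def pgpe_step(mu, sigma, episode_return, rng, n_pairs=20, lr_mu=0.2, lr_sigma=0.02):
    """One PGPE update using symmetric sampling around the mean.

    mu, sigma: per-parameter mean and standard deviation (1-D arrays).
    episode_return: hypothetical callable mapping a parameter vector
        to a scalar return (in practice, one rollout in the environment).
    """
    # Draw perturbations and evaluate the mirrored samples mu + eps and mu - eps.
    eps = rng.normal(0.0, 1.0, size=(n_pairs, mu.size)) * sigma
    r_plus = np.array([episode_return(mu + e) for e in eps])
    r_minus = np.array([episode_return(mu - e) for e in eps])

    # Mean update direction: perturbations weighted by each pair's return difference.
    grad_mu = np.mean(eps * ((r_plus - r_minus) / 2.0)[:, None], axis=0)

    # Std-dev update direction: pair-average return relative to a mean baseline.
    r_avg = (r_plus + r_minus) / 2.0
    baseline = r_avg.mean()
    grad_sigma = np.mean(
        ((eps**2 - sigma**2) / sigma) * (r_avg - baseline)[:, None], axis=0
    )

    new_mu = mu + lr_mu * grad_mu
    # Assumption: clamp sigma so it always stays strictly positive.
    new_sigma = np.maximum(sigma + lr_sigma * grad_sigma, 1e-3)
    return new_mu, new_sigma


# Toy stand-in for an environment: the return is highest when theta == target.
target = np.array([1.0, -2.0, 0.5, 3.0])


def toy_return(theta):
    return -np.sum((theta - target) ** 2)


rng = np.random.default_rng(0)
mu, sigma = np.zeros(4), np.ones(4)
for _ in range(200):
    mu, sigma = pgpe_step(mu, sigma, toy_return, rng)
print("distance to target:", np.linalg.norm(mu - target))
```

Note that the network itself never appears in the update: PGPE only needs a way to turn a parameter vector into a return, which is why the same loop could in principle be pointed at any set of parameters.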

However, running this algorithm on an OpenAI Gym environment did not produce good results. One possible cause is that the hyperparameters of the learning algorithm needed more tuning, especially the initial variances of the Gaussian. After some more research, I learned that a more state-of-the-art method, proximal policy optimization (PPO) [2], might be better suited to training the controller, since it is fairly simple to backpropagate gradients through the neural network anyway. With PGPE, I am merely training a Gaussian distribution over the parameters without ever differentiating through the network, which seems unnecessary when the network is differentiable in the first place. Geometric parameters are a different matter: unlike the network parameters, they are not as easy to learn with backpropagation, because domain-specific knowledge would be needed to backpropagate gradients through them.

Therefore, it makes sense to use PPO in conjunction with PGPE to co-optimize the geometry and control of parameterized robots: PPO for the differentiable controller, and PGPE for the geometric parameters. Schaff et al. [3] have done exactly this with promising results. My goal for the next few weeks is to explore ways of using PPO and Multi-SyS-PGPE [4] together for this task. Conveniently, Tensorforce has an easy-to-use PPO agent. I will need to see whether I can subclass this agent to include PGPE for the geometric parameters.

[1] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.

[2] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[3] Charles Schaff, David Yunis, Ayan Chakrabarti, Matthew R. Walter. Jointly Learning to Construct and Control Agents using Deep Reinforcement Learning. arXiv:1801.01432, 2018.

[4] Frank Sehnke, Alex Graves, Christian Osendorfer, and Jürgen Schmidhuber. Multimodal parameter-exploring policy gradients. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on, pages 113–118. IEEE, 2010.