This week the ORCAS team finished the communication layer between Gazebo and the reinforcement learning algorithm. With this in place, we were able to start training robots. We created a parameterized four-legged robot and started training a dense neural network (DNN) controller for it. Video
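For context, the bridge amounts to something like the following gym-style wrapper. This is only a minimal sketch under assumed names: `GazeboClient` and its methods stand in for whatever transport actually carries commands and state, and the 8-joint action size is a guess at the parameterized robot.

```python
import numpy as np
import gym


class QuadrupedEnv(gym.Env):
    """Gym-style wrapper that shuttles actions and observations to Gazebo."""

    def __init__(self, client, reward_fn):
        self.client = client        # hypothetical connection to the simulator
        self.reward_fn = reward_fn  # e.g. the x-velocity reward described next
        # Unbounded boxes mirror the unconstrained controller outputs.
        self.action_space = gym.spaces.Box(-np.inf, np.inf, shape=(8,))
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(28,))

    def reset(self):
        self.client.reset_world()
        return self.client.get_state()

    def step(self, action):
        self.client.send_joint_commands(action)  # no clipping applied yet
        self.client.advance()                    # step the physics engine
        state = self.client.get_state()
        return state, self.reward_fn(state), False, {}
```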
In this simulation, we had not yet specified the robot's inertial properties, which led to some odd physics behavior. We also placed no constraints on the controller outputs, which caused major problems. The reward function was simply the X-velocity of the robot.
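In code, that reward amounts to something like this (the `state.base_linear_velocity` field is an assumed name, not the bridge's actual API):

```python
# The reward as described: just forward velocity along the world X-axis.
def x_velocity_reward(state):
    return float(state.base_linear_velocity[0])  # unbounded above
```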
I observed some really interesting behaviors during training. At one point, the robot learned to plant three legs and rotate the fourth in a circular motion as a form of locomotion. It then modified this behavior to plant the two back legs and use the front two for locomotion. However, since the joint velocities were unconstrained and the inertial properties were not properly specified, the controller learned to output extremely large impulses, which made the robot twitch and start flying into the air. Eventually, the robot learned to simply launch itself into the air and keep flying, since this maximized the reward. The reward values grew so large that the simulation crashed.
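A standard mitigation for this kind of reward hacking is to clamp the raw network outputs to plausible actuator limits before they ever reach the simulator. A minimal sketch, with made-up limits:

```python
import numpy as np

# Hypothetical actuator limits; real values would come from the robot's
# spec or the joint definitions in its model file.
MAX_JOINT_VELOCITY = 5.0   # rad/s
MAX_JOINT_EFFORT = 20.0    # N*m


def clamp_commands(velocity_cmd, effort_cmd):
    """Clip raw network outputs so a single action cannot inject the
    huge impulses that sent the robot airborne."""
    velocity_cmd = np.clip(velocity_cmd, -MAX_JOINT_VELOCITY, MAX_JOINT_VELOCITY)
    effort_cmd = np.clip(effort_cmd, -MAX_JOINT_EFFORT, MAX_JOINT_EFFORT)
    return velocity_cmd, effort_cmd
```

Specifying realistic inertial properties in the robot's model would address the other half of the problem.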