The 37 Implementation Details of Proximal Policy Optimization
Huang, Shengyi; Dossa, Rousslan Fernand Julien; Raffin, Antonin; Kanervisto, Anssi; Wang, Weixun
Tags: proximal-policy-optimization, reproducibility, reinforcement-learning, implementation-details, tutorial

Jon is a first-year master’s student who is interested in reinforcement learning (RL). In his eyes, RL seemed fascinating because he could use RL libraries such as Stable-Baselines3 (SB3) to train agents to play all kinds of games. He quickly recognized Proximal Policy Optimization (PPO) as a fast and versatile algorithm and wanted to implement PPO himself as a learning experience.

Upon reading the paper, Jon thought to himself, “huh, this is pretty straightforward.” He then opened a code editor and started writing PPO. CartPole-v1 from Gym was his chosen simulation environment, and before long, Jon made PPO work with CartPole-v1. He had a great time and felt motivated to make his PPO work with more interesting environments, such as the Atari games and MuJoCo robotics tasks.

Making PPO work with Atari and MuJoCo seemed more challenging than anticipated. Jon then looked for reference implementations online but was shortly overwhelmed: unofficial repositories all appeared to do things differently, whereas he just could not read the TensorFlow 1.x code in the official repo. Fortunately, Jon stumbled across two recent papers that explain PPO’s implementations. “This is it!” he grinned.

Failing to control his excitement, Jon started running around in the office, accidentally bumping into Sam, who Jon knew was working on RL. They then had the following conversation:

- “Hey, I just read the implementation details matter paper and the what matters in on-policy RL paper. I knew PPO wasn’t that easy!” Jon exclaimed.
- “Oh yeah! PPO is tricky, and I love these two papers that dive into the nitty-gritty details,” Sam answered.
- “You have been working with PPO, right? Quiz me on PPO!” Jon inquired enthusiastically.
- “If you run the official PPO with the Atari game Breakout, the agent would get ~400 game scores in about 4 hours.”
- “Hmm… That’s actually a good question. I don’t think the two papers explain that.”
- “The procgen paper contains experiments conducted using the official PPO with LSTM.”
- “Ehh… I haven’t read too much on PPO + LSTM,” Jon admitted.
- “The official PPO also works with the MultiDiscrete action space, where you can use multiple discrete values to describe an action.” (See the sketch after this conversation.)
- “Lastly, if you have only the standard tools (e.g., numpy, gym) and a neural network library (e.g., torch, jax), could you code up PPO from scratch?”
- “Ooof, I guess it’s going to be difficult. Prior papers analyzed PPO implementation details but didn’t show how these pieces are coded together. Also, I now realize their conclusions are drawn from MuJoCo tasks and do not necessarily transfer to other games such as Atari.”
- “If anything helps, I have been making video tutorials on implementing PPO from scratch and a blog post explaining things in more depth!”
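As a concrete aside on Sam’s MultiDiscrete remark: here is a minimal sketch — ours, not code from the official implementation, with made-up space sizes and dummy network outputs — of what a MultiDiscrete action space looks like in Gym, and one common way a policy can factorize it into a Categorical distribution per component:

```python
import gym
import torch
from torch.distributions import Categorical

# A hypothetical MultiDiscrete space: one action is described by three
# discrete values, e.g. a 5-way d-pad, a 2-way fire button, a 3-way switch.
action_space = gym.spaces.MultiDiscrete([5, 2, 3])
print(action_space.sample())  # e.g. array([3, 0, 2])

# One Categorical head per component; the joint log-probability of an
# action is the sum of the per-component log-probabilities.
logits = [torch.randn(5), torch.randn(2), torch.randn(3)]  # dummy policy outputs
dists = [Categorical(logits=l) for l in logits]
action = torch.stack([d.sample() for d in dists])
logprob = sum(d.log_prob(a) for d, a in zip(dists, action))
print(action, logprob)
```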
And the blog post is here! Instead of doing ablation studies and making recommendations on which details matter, this blog post takes a step back and focuses on reproductions of PPO’s results in all accounts. Specifically, this blog post complements prior work in the following ways:

- Genealogy Analysis: we establish what it means to reproduce the official PPO implementation by examining its historical revisions in the openai/baselines GitHub repository (the official repository for PPO). As we will show, the code in the openai/baselines repository has undergone several refactorings that could produce results different from the original paper, so it is important to recognize which version of the official implementation is worth studying.
- Video Tutorials and Single-file Implementations: we make video tutorials on re-implementing PPO in PyTorch from scratch, matching details in the official PPO implementation to handle classic control tasks, Atari games, and MuJoCo tasks. Notably, we adopt single-file implementations in our code base, making the code quicker and easier to read.
- Implementation Checklist with References: during our re-implementation, we have compiled an implementation checklist containing 37 details as follows: 13 core implementation details, 9 Atari-specific implementation details, 9 details for continuous action domains (e.g., MuJoCo tasks), 5 LSTM implementation details, and 1 MultiDiscrete action space implementation detail. For each implementation detail, we display the permanent link to its code (which is not done in academic papers) and point out its literature connection.
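To give a flavor of where “coding up PPO from scratch” starts, below is a minimal sketch — ours, with illustrative function and tensor names, not a quote from the official implementation — of the clipped surrogate objective from the PPO paper in PyTorch. The 37 details in the checklist are largely about everything around this objective: advantage estimation, normalization, vectorized environments, and so on.

```python
import torch

def ppo_clip_loss(new_logprob: torch.Tensor,
                  old_logprob: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_coef: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective from the PPO paper, negated for minimization."""
    ratio = (new_logprob - old_logprob).exp()  # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    # Maximizing the surrogate is minimizing its negation.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random stand-in data (real inputs come from a rollout):
loss = ppo_clip_loss(torch.randn(64), torch.randn(64), torch.randn(64))
print(loss)
```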