[DL Paper Reading Group] What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study (arXiv’20)

June 30, 20

#deep learning #Deep Learning #Reinforcement Learning #PPO #TRPO #Parameter Tuning

Slide overview

2020/06/19
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Deep Learning JP

Text of each slide

DEEP LEARNING JP [DL Papers]
What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study (arXiv’20)
Presenter: Masanori Misono (Univ. Tokyo)
http://deeplearning.jp/
2020-06-19

Bibliographic information
• What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
• Marcin Andrychowicz, Anton Raichuk, Piotr Stanczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, Olivier Bachem (Google Research, Brain Team)
• arXiv [Submitted on 10 Jun 2020]
• URL - https://arxiv.org/abs/2006.05990
• A large-scale study of more than 50 choices in on-policy RL
• Costa Huang, The 32 Implementation Details of Proximal Policy Optimization (PPO) Algorithm, 2020, https://costa.sh/blog-the-32-implementation-details-of-ppo.html
• Denny Britz, AI Research, Replicability and Incentives, 2020, https://dennybritz.com/blog/ai-replication-incentives/

Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO (ICLRʼ20)
• Experimentally, the performance advantage of PPO over TRPO comes not from clipping but from implementation details that are not described in the paper (code-level optimizations, listed below)
  1. Value function clipping
  2. Reward scaling
  3. Orthogonal initialization and layer scaling
  4. Adam learning rate annealing
  5. Reward clipping (to [-5, 5] or [-10, 10])
  6. Observation normalization (to zero mean, unit variance)
  7. Observation clipping (to [-10, 10])
  8. Hyperbolic tan activations (policy)
  9. Global gradient clipping (on the global l2 norm of the policy and value network gradients)
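
As an illustration of item 9, global gradient clipping can be sketched as follows. This is a minimal sketch in plain Python, not the code from either paper: the function name `clip_by_global_norm`, the toy gradients, and the threshold `max_norm=0.5` are assumptions chosen for the example.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradient vectors so that their joint l2 norm is at most max_norm."""
    # The norm is taken over the concatenation of every gradient vector.
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= max_norm:
        return [list(vec) for vec in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm

# Hypothetical gradients for a policy network and a value network.
policy_grad = [3.0, 0.0]
value_grad = [0.0, 4.0]
clipped, norm_before = clip_by_global_norm([policy_grad, value_grad], max_norm=0.5)
```

Because a single scale factor is applied to every vector, the direction of the joint gradient is preserved and only its length changes.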

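Motivation and contributions of this paper
• Results in reinforcement learning are particularly sensitive to the parameters used during training
• Many implementations have been published recently, but that alone is not enough
  - Papers often compare algorithms that are built on different implementations
  - It is then impossible to tell precisely whether a performance gap comes from the implementation or from the algorithm
• Using a unified implementation of on-policy algorithms, the authors run large-scale experiments over more than 50 settings (choices)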

Experimental setup
• On-policy RL for continuous control
• Environments: Hopper-v1, Walker2d-v1, HalfCheetah-v1, Ant-v1, Humanoid-v1
• Built on SEED (https://github.com/google-research/seed_rl), the authors created an on-policy training framework in which more than 50 settings can be changed
  - They confirmed that a configuration matching OpenAI's PPO settings gives equivalent results

Experiment flow
• Trying every combination is impossible
• Fully random sampling is not good either
  - Some settings are correlated (e.g., batch size and learning rate)
• So the choices are first divided into groups
  - Policy losses, Network architecture, Normalization and clipping, Advantage estimation, Training setup, Timesteps handling, Optimizers, Regularization
  - When one group of choices is studied, the choices in that group are sampled at random while the choices in all other groups are held fixed
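
The group-wise sampling described above can be sketched as follows. This is a minimal sketch, not the paper's tooling: the group names, choice names, candidate values, and defaults below are hypothetical placeholders, not the grids used in the study.

```python
import random

# Hypothetical choice groups and candidate values (not the paper's actual grids).
CHOICE_GROUPS = {
    "optimizers": {
        "optimizer": ["adam", "rmsprop"],
        "learning_rate": [1e-4, 3e-4, 1e-3],
    },
    "advantage_estimation": {
        "gae_lambda": [0.8, 0.9, 0.95, 0.99],
    },
}

# Hypothetical fixed values used for every group that is not under study.
DEFAULTS = {"optimizer": "adam", "learning_rate": 3e-4, "gae_lambda": 0.9}

def sample_config(group, rng):
    """Randomize the choices of one group; keep all other choices at their defaults."""
    config = dict(DEFAULTS)
    for name, values in CHOICE_GROUPS[group].items():
        config[name] = rng.choice(values)
    return config

rng = random.Random(0)
configs = [sample_config("optimizers", rng) for _ in range(5)]
```

Sampling the choices of a group jointly keeps interacting settings, such as the optimizer and its learning rate, in the same experiment, while the fixed defaults keep the search space small.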

Choices (excerpt) [table not captured in the text; the only surviving label is C68]

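Evaluation metrics
• Each hyperparameter configuration is trained for 1M steps on Hopper, HalfCheetah, and Walker2d and for 2M steps on Ant and Humanoid, each with different random seeds
• The score is the average of evaluations taken every 1000 steps, each being the undiscounted episode return over 100 episodes
• The median over the 3 seeds is taken as the final score
• Conditional 95th percentile
• Distribution of choice within top 5% configurations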
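
The aggregation above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the percentile uses linear interpolation (the paper's exact percentile convention is not given in the slides), and the numbers are made up.

```python
import statistics

def seed_score(eval_returns):
    """Average of the periodic evaluation returns for one seed."""
    return statistics.mean(eval_returns)

def config_score(per_seed_eval_returns):
    """Median over seeds of the per-seed scores."""
    return statistics.median([seed_score(r) for r in per_seed_eval_returns])

def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def conditional_95th(configs, scores, choice, value):
    """95th percentile of scores among configurations where config[choice] == value."""
    selected = [s for c, s in zip(configs, scores) if c[choice] == value]
    return percentile(selected, 95)

# Made-up evaluation curves for three seeds of one configuration.
score = config_score([[100.0, 200.0], [0.0, 100.0], [300.0, 500.0]])
```

Conditioning on one value of a choice and looking at a high percentile asks how well that value can do once the remaining choices are tuned, rather than how it does on average.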

1. Policy Losses (95-th percentile of the average policy score)
• Recommendation: Use PPO with initial clipping threshold set to 0.25
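
For reference, the clipping threshold enters the standard PPO clipped surrogate as shown below. This is a minimal per-sample sketch, not the study's implementation; the function name is hypothetical, and only the default `epsilon=0.25` comes from the recommendation above.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.25):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); epsilon is the clipping threshold.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

gain_when_ratio_too_high = ppo_clip_objective(ratio=1.5, advantage=2.0)
gain_when_ratio_too_low = ppo_clip_objective(ratio=0.5, advantage=-1.0)
```

In a training loop the objective would be averaged over a minibatch and negated to form a loss.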

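2. Network architecture
• Network size, activation functions, initialization, etc.
• It is better to use separate networks for the value function and the policy
• The best policy MLP width depends on the complexity of the environment, whereas making the value MLP wider causes few problems
• The policy initialization has a large effect on training (!)
  → It appears to help if the initial action distribution is centered at 0 (the action space is [-1, 1])
• Recommendation: initialize the weights of the last policy layer to small values, and after the softplus use a negative offset to make the initial standard deviation smaller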
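
The standard-deviation part of the recommendation can be sketched as follows. This is a minimal sketch assuming the offset is added to the softplus input; the function names and the target value 0.5 are hypothetical and would need tuning.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def inverse_softplus(y):
    """Return x such that softplus(x) == y, for y > 0."""
    return math.log(math.expm1(y))

def action_std(raw_output, offset=0.0):
    """Map a raw network output to a positive action standard deviation."""
    return softplus(raw_output + offset)

# With small last-layer weights the raw output is close to 0 at initialization,
# so the offset alone determines the initial standard deviation.
target_initial_std = 0.5  # hypothetical target
offset = inverse_softplus(target_initial_std)
initial_std = action_std(0.0, offset)
```

Without an offset the initial standard deviation would be softplus(0) = ln 2 ≈ 0.69, so any smaller target requires a negative offset.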

3. Normalization and clipping
• Input normalization is important
• Value function normalization also has a large effect on performance (!)
• Recommendation: always use observation normalization, and check whether value function normalization helps
  - Gradient clipping may help, but it is of secondary importance
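
Observation normalization is commonly implemented with running statistics. The following is a minimal sketch using Welford's algorithm, not the study's code: the class name, the `eps` term, and the clip range of 10 (taken from the observation clipping range mentioned earlier in the slides) are assumptions.

```python
import math

class RunningNormalizer:
    """Tracks a running mean and variance per dimension (Welford's algorithm)."""

    def __init__(self, dim, clip=10.0, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.clip = clip
        self.eps = eps

    def update(self, obs):
        """Fold one observation into the running statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Standardize an observation and clip it to [-clip, clip]."""
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / self.count if self.count > 0 else 1.0
            z = (x - self.mean[i]) / math.sqrt(var + self.eps)
            out.append(max(-self.clip, min(z, self.clip)))
        return out

normalizer = RunningNormalizer(dim=1)
for observation in ([1.0], [3.0]):
    normalizer.update(observation)
normalized = normalizer.normalize([3.0])
```

The statistics are updated online, so the normalizer can be fed observations as they arrive from the environments.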

4. Advantage Estimation
• Choice of advantage estimator; GAE λ
• Recommendation: Use GAE with λ = 0.9 but neither Huber loss nor PPO-style value loss clipping
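
For reference, GAE with a given λ can be computed with a single backward pass. This is a minimal sketch for one trajectory with no episode terminations inside it; the function name and argument layout are assumptions, and only λ = 0.9 comes from the recommendation above.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.9):
    """Generalized Advantage Estimation for one trajectory without terminations.

    values[t] is V(s_t); last_value is V(s_T), used to bootstrap the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

advantages = gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
```

Setting λ = 0 reduces the estimator to the one-step TD error, while λ = 1 gives the Monte Carlo return minus the value baseline.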

5. Training setup
• Number of environments, number of epochs, minibatch size, how the data is split, etc.

6. Timesteps handling
• Discount factor, frame skip
• Recommendation: the discount factor γ is one of the most important hyperparameters
  - Start from γ = 0.99 and tune it
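
To see why γ matters so much, it helps to look at how it weights future rewards. This is a minimal sketch with hypothetical function names; the 1 / (1 - γ) "effective horizon" is a common rule of thumb, not a quantity taken from the slides.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t, accumulated backwards over the trajectory."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that contribute noticeably: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
horizon_at_default = effective_horizon(0.99)
```

Under this rule of thumb, moving γ from 0.99 to 0.999 stretches the horizon from roughly 100 to roughly 1000 steps, which is one way to see why small changes in γ can change behavior substantially.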

7. Optimizers
• Adam and RMSprop
• The difference between optimizers is small
• The learning rate has a large effect on performance
• Recommendation: Use Adam with β1 = 0.9 and tune the learning rate (0.0003 is a safe default)
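
To show where β1 and the learning rate enter, here is one Adam update step in plain Python. This is a minimal sketch, not the study's optimizer code: `beta2=0.999` and `eps=1e-8` are the usual Adam defaults and are assumptions here, as are the function name and the toy parameters.

```python
import math

def adam_step(params, grads, state, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a flat list of parameters; state holds m, v, and t."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = state["m"][i] / (1.0 - beta1 ** t)
        v_hat = state["v"][i] / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

params = [1.0, -2.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
params = adam_step(params, grads=[0.5, -0.5], state=state)
```

On the first step the update has magnitude close to the learning rate regardless of the gradient scale, which is one reason the learning rate is the parameter worth tuning.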

8. Regularization
• Except on HalfCheetah, regularization shows little benefit
• Because the PPO policy loss already imposes a trust region, the regularization terms may be redundant

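Summary
• A large-scale experimental study of the effect of more than 50 settings in on-policy RL
• The paper describes the experiments in detail
• Useful as a reference for experimental setup and evaluation
• In practice, it seems sensible to compare each Recommendation against your own experimental setup and adjust parameters as needed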
