[DL Reading Group] Decision Transformer: Reinforcement Learning via Sequence Modeling

July 9, 2021

Slide overview

2021/07/09
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Text of each slide

DEEP LEARNING JP [DL Papers] Decision Transformer: Reinforcement Learning via Sequence Modeling XIN ZHANG, Matsuo Lab http://deeplearning.jp/

Bibliographic information
● Title:
○ Decision Transformer: Reinforcement Learning via Sequence Modeling
● Authors:
○ Lili Chen*, Kevin Lu*, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas*, Igor Mordatch
● Institutions: UC Berkeley, Facebook AI Research, Google Brain
● 12 Jun 2021
● Summary:
○ Proposes treating RL as a sequence modeling problem using a Transformer.
○ Matches the performance of state-of-the-art model-free offline RL baselines.

1. Introduction

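Transformer
● Can the powerful Transformer be used for RL?
● Self-attention looks well suited to RL over long sequences.
Offline RL (figure from CS 285)
● Error accumulation and value-function overestimation are the open problems.
● A natural setting for applying a Transformer.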

http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-15.pdf

3. Method

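Decision Transformer (DT)
● GPT architecture
○ Predicts the next action
○ Discrete actions: cross-entropy loss
○ Continuous actions: mean-squared error
● Returns-to-go (the sum of rewards from the current timestep onward):
○ An action at a given timestep affects only the rewards that come after it.
○ Needed in order to predict the action.
● Feed the last K timesteps (3K tokens; 1 timestep = return-to-go, state, action)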
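
The return-to-go and the 3K-token input described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names `returns_to_go` and `build_tokens`, the string placeholders for states and actions, and the tagged-tuple token format are all assumptions made for the example.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Return R_t = r_t + r_{t+1} + ... + r_T for every timestep t."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step adds its own reward to the future sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def build_tokens(
    states: Sequence[Any],
    actions: Sequence[Any],
    rewards: Sequence[float],
    K: int,
) -> List[Tuple[str, Any]]:
    """Interleave (return-to-go, state, action) for the last K timesteps."""
    rtg = returns_to_go(rewards)
    start = max(0, len(states) - K)
    tokens: List[Tuple[str, Any]] = []
    for t in range(start, len(states)):
        tokens.append(("rtg", rtg[t]))
        tokens.append(("state", states[t]))
        tokens.append(("action", actions[t]))
    return tokens


# Hypothetical 4-step trajectory.
rewards = [1.0, 0.0, 2.0, 1.0]
states = ["s0", "s1", "s2", "s3"]
actions = ["a0", "a1", "a2", "a3"]

print(returns_to_go(rewards))  # [4.0, 3.0, 3.0, 1.0]
tokens = build_tokens(states, actions, rewards, K=2)
print(len(tokens))  # 6, i.e. 3K tokens for K=2
```

In a real model each token would be embedded per modality before entering the Transformer; the tagged tuples here only show the ordering.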

https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

DT Algorithm
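
The algorithm figure for this slide did not survive extraction. As a rough sketch of the evaluation-time loop described in the paper (condition on a target return, subtract each received reward from it, and keep only the last K timesteps as context), the following may help. `ToyEnv` and `greedy_stub` are hypothetical stand-ins invented for this example; `greedy_stub` replaces the trained Transformer and is not part of the method.

```python
from typing import Callable, List, Tuple


class ToyEnv:
    """Hypothetical environment: action 1 pays reward 1.0, any other action pays 0.0."""

    def __init__(self, horizon: int = 5) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool]:
        reward = 1.0 if action == 1 else 0.0
        self.t += 1
        return self.t, reward, self.t >= self.horizon


Policy = Callable[[List[float], List[int], List[int]], int]


def rollout(env: ToyEnv, policy: Policy, target_return: float, K: int) -> float:
    """Run one episode, conditioning the policy on the remaining return."""
    rtgs = [target_return]
    states = [env.reset()]
    actions: List[int] = []
    total = 0.0
    done = False
    while not done:
        # The policy only sees the most recent K entries of each stream.
        action = policy(rtgs[-K:], states[-K:], actions[-K:])
        state, reward, done = env.step(action)
        total += reward
        rtgs.append(rtgs[-1] - reward)  # decrease return-to-go by the reward received
        states.append(state)
        actions.append(action)
    return total


def greedy_stub(rtgs: List[float], states: List[int], actions: List[int]) -> int:
    """Stand-in for the trained model: act while some return is still owed."""
    return 1 if rtgs[-1] > 0 else 0


print(rollout(ToyEnv(horizon=5), greedy_stub, target_return=3.0, K=2))  # 3.0
```

The point of the sketch is the bookkeeping: the return-to-go shrinks as rewards arrive, so the same model can be steered toward different outcomes by changing `target_return`.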

Illustrative example
❏ If the training data contains something similar to the current state and the desired reward, DT outputs the corresponding action.
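
The intuition on this slide can be mimicked with a lookup table. The sketch below is a deliberately crude nearest-match lookup, swapped in for the Transformer purely for illustration; the graph, the reward of -1 per step, the `DATASET` contents, and the name `lookup_action` are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical logged (state, return-to-go, action) tuples.
# Rewards are -1 per step, so a return-to-go of -2 means "2 steps from the goal".
DATASET: List[Tuple[str, float, str]] = [
    ("A", -3.0, "to_B"),     # long route: A -> B -> C -> goal
    ("A", -2.0, "to_C"),     # short route: A -> C -> goal
    ("B", -2.0, "to_C"),
    ("C", -1.0, "to_goal"),
]


def lookup_action(
    state: str,
    target_rtg: float,
    dataset: List[Tuple[str, float, str]] = DATASET,
) -> str:
    """Return the logged action whose state matches and whose
    return-to-go is closest to the requested one."""
    candidates = [
        (abs(rtg - target_rtg), action)
        for s, rtg, action in dataset
        if s == state
    ]
    if not candidates:
        raise KeyError(f"state {state!r} not found in dataset")
    return min(candidates)[1]


print(lookup_action("A", -2.0))  # to_C
print(lookup_action("A", -3.0))  # to_B
```

Asking for a higher return from the same state selects the shorter route, which is the behavior the slide describes. A real DT generalizes beyond exact matches, which this table cannot do.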

4. Evaluations on Offline RL Benchmarks

4.1 Atari (Breakout, Qbert, Pong, Seaquest)
❏ On par with CQL overall, but weak on Qbert.
❏ K=30 (except K=50 for Pong)

4.2 OpenAI Gym (HalfCheetah, Hopper, Walker, Reacher)
❏ DT wins on most of the OpenAI Gym tasks.
❏ K=20 (except K=5 for Reacher)

5. Discussion

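5.1 Does DT perform BC on a subset of the data?
❏ Percentile BC: behavior cloning on only the best data (unrealistic, since in practice we do not know which data is best).
❏ The authors aim to show how DT differs from BC.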
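
To make the Percentile BC baseline concrete, here is a minimal sketch of the data-selection step only, assuming the dataset is a list of trajectories ranked by total return. The function name `top_percent` and the dictionary layout are hypothetical, and the exact selection granularity used in the paper may differ.

```python
import math
from typing import Dict, List


def top_percent(trajectories: List[Dict], percent: float) -> List[Dict]:
    """Keep the top `percent`% of trajectories ranked by total return."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ranked = sorted(
        trajectories,
        key=lambda traj: sum(traj["rewards"]),
        reverse=True,
    )
    # Always keep at least one trajectory.
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]


# Ten hypothetical trajectories with total returns 0, 1, ..., 9.
data = [{"id": i, "rewards": [float(i)]} for i in range(10)]

best = top_percent(data, 10)
print([traj["id"] for traj in best])  # [9]
print(len(top_percent(data, 50)))     # 5
```

Behavior cloning would then be trained on the selected subset only. The catch noted on the slide is visible here: choosing `percent` well requires already knowing how good the data is.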

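5.2 How well does DT model the distribution of returns?
❏ The target reward specifies which behavior to produce, not only the "optimal" one.
❏ On the other hand, an appropriate reward must be supplied as input, which is a problem when it is unknown.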

5.3 What is the benefit of using a longer context length?
❏ When K = 1, as in standard RL, DT performs poorly.
❏ The choice of K matters; it varies by task, so it becomes a hyperparameter.

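5.4 Does DT perform effective long-term credit assignment?
Key-to-Door example (the paper has no figure!)
- Pick up the key in the key room (left)
- Empty room (middle)
- Open the door (blue) in the door room (right)
❏ In the Key-to-Door setting, DT captures what matters.
❏ With more data, BC can also solve it.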

5.5 Can DT be accurate critics in sparse reward settings?
❏ DT's attention works well.
❏ (Though the experiment seems designed to play to DT's strengths.)

5.6 Does DT perform well in sparse reward settings?
❏ Delayed reward: the setting where all reward is received together at the end.
❏ Decision Transformer suffers the least damage.
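
The delayed-reward transformation can be written out to show one plausible reason DT is robust to it: moving all reward to the final step leaves the episode's initial return-to-go unchanged. This is an illustrative sketch; the helper names `delay_rewards` and `returns_to_go` are assumptions, not the authors' code.

```python
from typing import List, Sequence


def delay_rewards(rewards: Sequence[float]) -> List[float]:
    """Move the whole episode return onto the final timestep."""
    if not rewards:
        return []
    return [0.0] * (len(rewards) - 1) + [float(sum(rewards))]


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Return the sum of rewards from each timestep to the end of the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


dense = [1.0, 0.0, 2.0, 1.0]
delayed = delay_rewards(dense)

print(delayed)                 # [0.0, 0.0, 0.0, 4.0]
print(returns_to_go(dense))    # [4.0, 3.0, 3.0, 1.0]
print(returns_to_go(delayed))  # [4.0, 4.0, 4.0, 4.0]
```

The conditioning signal at the first timestep is 4.0 in both cases, whereas a method that bootstraps from per-step rewards loses its intermediate learning signal entirely.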

6. Related Work

6.1 Offline and supervised reinforcement learning
I. Distribution shift in offline RL
A. Constrain the policy action space.
B. Incorporate value pessimism.
C. Incorporate pessimism into learned dynamics models.
II. Learning wide behavior distributions
A. Learning a task-agnostic set of skills, either with likelihood-based approaches
B. or by maximizing mutual information.
III. Return conditioning / "supervised RL"
A. Similar to DT. DT benefits from the use of long contexts for behavior modeling and long-term credit assignment.
❏ A great deal of research tackles the distribution shift problem in offline RL!
❏ Research that treats reinforcement learning as supervised learning.

6.2 Credit assignment
1. Self-Attentional Credit Assignment for Transfer in Reinforcement Learning
2. Hindsight Credit Assignment
3. Counterfactual credit assignment in model-free reinforcement learning
❏ Reward has to be attributed to the most important steps; these works study how to find that assignment.
❏ The experiments suggest that a Transformer is well suited to this.

6.3 Conditional language generation
6.4 Attention and transformer models
❏ There is a large body of related work on conditional language generation, Transformers, and attention.

7. Conclusion

Offline RL, sequence modeling, goal conditioning by reward.
Decision Transformer
- Applies the GPT architecture in the offline RL setting.
- Given an appropriate target reward, outputs actions that achieve it.
- Compares well against a model-free method (CQL).
Future work
- Stochastic Decision Transformer
  - Conditioning on return distributions to model stochastic settings instead of deterministic returns.
- Model-based Decision Transformer
  - Transformer models can also be used to model the state evolution of trajectories.
For real-world application
- Augmenting RL.
❏ The idea is interesting, and I expect a lot of follow-up work.
❏ Not knowing the appropriate reward is a problem, so I would like to think about ideas that could solve it.

Appendix - Explanation by YouTuber Yannic

https://youtu.be/-buULmf7dec