[DL Reading Group] Learning to Reach Goals via Iterated Supervised Learning

March 19, 21

Slide overview

2021/03/12
Deep Learning JP:
http://deeplearning.jp/seminar-2/

DEEP LEARNING JP Learning to Reach Goals via Iterated Supervised Learning [DL Papers] XIN ZHANG, Matsuo Lab http://deeplearning.jp/

Contents
1. Bibliographic information
2. Introduction
3. Preliminaries
4. Learning Goal-Conditioned Policies with Self-imitation
5. Related Works
6. Experiment Evaluation
7. Discussion

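Bibliographic information
• Title:
  – Learning to Reach Goals via Iterated Supervised Learning
• Authors:
  – Dibya Ghosh, Abhishek Gupta, Ashwin Reddy, Justin Fu, Coline Devin, Benjamin Eysenbach*, Sergey Levine
• Affiliation: UC Berkeley (*: Carnegie Mellon University)
• Posted: 2020/10/02 (arXiv), ICLR Oral (7788)
• Summary:
  – Supervised imitation learning can cope with sparse rewards, but demonstrations are expensive to create
  – The paper proposes a method that obtains a policy by imitating the agent's own trajectories, with no demonstrations and no value function
  – Trajectories are labeled in hindsight as "my trajectory was correct", and the policy is updated at every iteration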

https://arxiv.org/pdf/1912.06088.pdf

Introduction: the sample-efficiency problem
- Deep reinforcement learning is sample-inefficient and unstable
- Learning from demonstrations looks promising, but creating the demonstrations is costly

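Goal-conditioned supervised learning (GCSL)
Proposes a method for the setting in which a policy is learned conditioned on a given goal
- For a trajectory produced by running the policy, the goal is changed so that the trajectory becomes a correct one
- This lets the agent reinterpret its own trajectories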

Goal-conditioned supervised learning (GCSL)
- In other words:
- Human: Go to A.
- Robot: (actually ends up at B)
- Human: I meant for you to go to B, so well done!
- Human: Remember how to get to B!
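The relabeling in this dialogue can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `hindsight_relabel`, the string-valued states, and the choice to treat every later state (not only the final one) as a reached goal are assumptions made for the example.

```python
def hindsight_relabel(states, actions):
    """Turn one trajectory into supervised examples by relabeling goals.

    states  : [s_0, ..., s_T]   (one more entry than actions)
    actions : [a_0, ..., a_{T-1}]
    Returns (state, relabeled_goal, action, steps_to_goal) tuples, where the
    relabeled goal is a state the trajectory actually reached later on.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")
    examples = []
    for t, action in enumerate(actions):
        for k in range(t + 1, len(states)):
            examples.append((states[t], states[k], action, k - t))
    return examples


# The robot was told to go to "A" but ended up at "B".
# After relabeling, the same actions count as correct behavior for reaching "B".
states = ["start", "mid", "B"]
actions = ["right", "right"]
examples = hindsight_relabel(states, actions)
for example in examples:
    print(example)
```

The commanded goal "A" never appears in the output: only the states that were actually visited are used as goals.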

GCSL: the author's blog post (with toy data and code)

https://dibyaghosh.com/blog/rl/gcsl.html

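Goal-conditioned supervised learning (GCSL)
- Advantages:
  - simpler
  - more stable
  - less sensitive to hyperparameters than value-based methods
  - needs no value function or reward
- Disadvantages:
  - exploration efficiency: learning proceeds with the agent's own trajectories as the only training data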

Supplement: formulation and theoretical proofs
• Objective function for goal reaching
• Objective function for goal-conditioned RL
• Objective function for goal-conditioned imitation learning
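For reference, objectives of these three kinds are commonly written along the following lines. The notation is assumed for illustration and may differ from the paper's: $p(g)$ is a goal distribution, $T$ the horizon, $r$ a goal-dependent reward, $\gamma$ a discount factor, and $\mathcal{D}_{\mathrm{demo}}$ a set of demonstrations.

```latex
% Goal reaching: probability of ending the episode at the commanded goal
\[
J(\pi) = \mathbb{E}_{g \sim p(g)} \, \mathbb{E}_{\tau \sim \pi(\cdot \mid g)}
         \left[ \mathbb{1}\{ s_T = g \} \right]
\]

% Goal-conditioned RL: expected discounted return under a goal-dependent reward
\[
J_{\mathrm{RL}}(\pi) = \mathbb{E}_{g \sim p(g)} \, \mathbb{E}_{\tau \sim \pi(\cdot \mid g)}
         \left[ \sum_{t=0}^{T} \gamma^{t} \, r(s_t, a_t, g) \right]
\]

% Goal-conditioned imitation learning: log-likelihood of demonstrated actions
\[
J_{\mathrm{IL}}(\pi) = \mathbb{E}_{(\tau, g) \sim \mathcal{D}_{\mathrm{demo}}}
         \left[ \sum_{t=0}^{T-1} \log \pi(a_t \mid s_t, g) \right]
\]
```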

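Supplement: GCSL formulation and theory (see Appendix B)
• The quantity that GCSL maximizes
• The objective optimized by GCSL is a lower bound on the ordinary RL objective
• In environments with deterministic transitions, optimizing J_GCSL(π) leads to an improvement in true performance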
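As a rough illustration of the quantity being maximized, the sketch below assumes it has the behavioral-cloning form: the log-likelihood of the agent's own actions given hindsight-relabeled goals. The tabular policy, the helper names, and the four-example dataset are hypothetical, and the sketch does not reproduce the lower-bound argument from Appendix B.

```python
import math
from collections import Counter, defaultdict


def fit_tabular_policy(examples):
    """Maximum-likelihood tabular policy: action frequencies per (state, goal)."""
    counts = defaultdict(Counter)
    for state, goal, action in examples:
        counts[(state, goal)][action] += 1
    return {
        key: {a: n / sum(c.values()) for a, n in c.items()}
        for key, c in counts.items()
    }


def mean_log_likelihood(policy_prob, examples):
    """Average of log pi(a | s, g) over the relabeled examples."""
    total = sum(math.log(policy_prob(s, g, a)) for s, g, a in examples)
    return total / len(examples)


# Hypothetical relabeled data: (state, achieved goal, action taken).
examples = [(0, 2, +1), (0, 2, +1), (0, 2, -1), (1, 2, +1)]

fitted = fit_tabular_policy(examples)
uniform_ll = mean_log_likelihood(lambda s, g, a: 0.5, examples)
fitted_ll = mean_log_likelihood(lambda s, g, a: fitted[(s, g)][a], examples)

print("pi(. | s=0, g=2):", fitted[(0, 2)])
print("uniform policy  :", uniform_ll)
print("fitted policy   :", fitted_ll)
```

The fitted policy scores a higher log-likelihood than the uniform one on this data, which is all the supervised step asks for. Whether that also raises the goal-reaching rate is what the theoretical results on this slide address.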

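Related Works
• Hindsight Experience Replay (HER) (reading group slides)
  – In a discrete reward setting, relabels rewards and other quantities to maximize data efficiency
  – However, it struggles with value-function estimation
  – GCSL does not maintain or estimate a value function
• Supervised imitation learning
  – Similar papers exist, but they relabel data generated by humans
  – GCSL learns from its own trajectories
• Direct policy search
  – Uses rewards and value functions
• Self-imitation algorithms
  – Use well-shaped rewards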
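The difference from HER can be made concrete by looking at what each method stores after relabeling. This is a simplified sketch under stated assumptions: only the final state is used as the relabeled goal, the 0/1 reward is a hypothetical sparse reward chosen for the example, and both function names are made up here.

```python
def her_style_transitions(states, actions):
    """Value-based view: relabeled transitions carry a recomputed sparse reward."""
    goal = states[-1]
    return [
        (s, a, 1.0 if s_next == goal else 0.0, s_next, goal)
        for s, a, s_next in zip(states[:-1], actions, states[1:])
    ]


def gcsl_style_examples(states, actions):
    """Supervised view: relabeled examples are plain (input, label) pairs."""
    goal = states[-1]
    return [((s, goal), a) for s, a in zip(states[:-1], actions)]


states = [0, 1, 2]
actions = [+1, +1]
her_data = her_style_transitions(states, actions)
gcsl_data = gcsl_style_examples(states, actions)
print(her_data)   # (state, action, reward, next_state, goal) for a critic
print(gcsl_data)  # ((state, goal), action) for a classifier or regressor
```

The first format still needs a value function to turn rewards into a policy; the second can be fed directly to a supervised learner.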

https://www.slideshare.net/DeepLearningJP2016/dlhindsight-experience-replay

Experiments
1. Does GCSL effectively learn goal-conditioned policies from scratch?
2. Can GCSL learn behaviors more effectively than standard RL?
3. Is GCSL less sensitive to hyperparameters than value-based methods?
4. Can GCSL incorporate demonstration data more effectively than value-based methods?

Experiments: Learning Goal-Conditioned Policies

Experiments: Analysis of Learning Progress and Learned Behaviors

Experiments: Robustness to Hyperparameters
Is GCSL less sensitive to hyperparameters than value-based methods?

Experiments: Initializing with Demonstrations

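Discussion
GCSL learns successfully by treating its own data as training data
- It is simple because it needs none of the following:
  - reward design
  - expert trajectories
  - value-function learning
- It is compared against PPO and TD3-HER in several experiments and shown to be effective
- In theory as well, under specified conditions it comes with a lower bound on the RL objective and performance guarantees
- However, GCSL is limited in exploration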
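The overall loop (collect trajectories with the current policy, relabel them in hindsight, refit the policy by supervised learning) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's setup: a 5-state chain with deterministic transitions, three actions, a count-based tabular policy standing in for a neural network, and relabeling that treats every later state as a goal. No reward or value function appears anywhere in the loop.

```python
import random
from collections import defaultdict

N_STATES = 5          # chain states 0..4 (hypothetical toy environment)
ACTIONS = (-1, 0, 1)  # move left, stay, move right
HORIZON = 6


def step(state, action):
    """Deterministic transition, clipped to the ends of the chain."""
    return min(max(state + action, 0), N_STATES - 1)


def action_probs(counts, state, goal):
    """Tabular policy: normalized action counts for (state, goal)."""
    c = counts[(state, goal)]
    total = sum(c)
    return [x / total for x in c]


def rollout(counts, goal, rng):
    """Run the current policy toward `goal`, always starting from state 0."""
    states, actions = [0], []
    for _ in range(HORIZON):
        probs = action_probs(counts, states[-1], goal)
        action = rng.choices(ACTIONS, weights=probs)[0]
        actions.append(action)
        states.append(step(states[-1], action))
    return states, actions


def gcsl_update(counts, states, actions):
    """Supervised step: count each action under every goal reached later."""
    for t, action in enumerate(actions):
        for k in range(t + 1, len(states)):
            counts[(states[t], states[k])][ACTIONS.index(action)] += 1


def success_rate(counts, rng, episodes=200):
    """Fraction of episodes whose final state equals the commanded goal."""
    hits = 0
    for _ in range(episodes):
        goal = rng.randrange(1, N_STATES)
        states, _ = rollout(counts, goal, rng)
        hits += states[-1] == goal
    return hits / episodes


rng = random.Random(0)
# One pseudo-count per action, so the initial policy is uniformly random.
counts = defaultdict(lambda: [1.0, 1.0, 1.0])
before = success_rate(counts, rng)
for _ in range(500):
    goal = rng.randrange(1, N_STATES)
    states, actions = rollout(counts, goal, rng)
    gcsl_update(counts, states, actions)
after = success_rate(counts, rng)
print(f"success rate before: {before:.2f}, after: {after:.2f}")
```

Because the policy only ever trains on states it has already visited, this sketch also shows where the exploration limitation comes from: a goal the policy never stumbles on never becomes a training label.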

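Impressions
- Solving an important problem (such as the cost of creating demonstrations) with a simple idea struck me as a well-conceived approach
- Question: how are trajectories that revisit the same state handled?
- Related papers of interest:
  - Goal-conditioned Imitation Learning
  - Go-Explore
  - First return then explore (a new version of Go-Explore)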

References
- https://zhuanlan.zhihu.com/p/313667439
- Explanation using the author's toy-data implementation: https://dibyaghosh.com/blog/rl/gcsl.html
