[DL Reading Group] DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies


September 24, 2021

#deep learning #Deep Learning #Reinforcement Learning #DisCo RL #Goal-Conditioned RL #Policy Learning

Slide overview

2021/09/24
Deep Learning JP:
http://deeplearning.jp/seminar-2/


Text of each slide

DEEP LEARNING JP [DL Papers] DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies Presenter: Mitsuhiko Nakamoto, B4, The University of Tokyo http://deeplearning.jp/


Bibliographic information
- Title: DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies
- Project page: https://sites.google.com/view/disco-rl
- Authors: Soroush Nasiriany*, Vitchyr H. Pong*, Ashvin Nair*, Alexander Khazatsky, Glen Berseth, Sergey Levine (University of California, Berkeley)
- Conference: 2021 International Conference on Robotics and Automation (ICRA 2021)
- Summary: proposes Distribution-Conditioned RL, which learns a policy conditioned on a distribution over goal states.


Goal-Conditioned RL
- Standard reinforcement learning learns a policy π(a | s).
- Goal-conditioned RL uses a policy π(a | s, g) conditioned on a goal state g.
- The policy is trained to reach the goal state g.

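Limitations of Goal-Conditioned RL
- A task cannot always be expressed as a single goal state.
- Example: placing various items inside a box. What matters is the position of each item relative to the box, which no single state can express.
- A more general notion of goal is needed, so a distribution over goal states is used instead.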

Proposed method: Distribution-Conditioned RL (DisCo RL)
- Learn a policy conditioned on a distribution over goal states rather than on a single goal state.

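Goal Distribution
- Let $\omega$ denote the parameters of the goal distribution.
- Goal distribution: $p_g(s; \omega) : \mathcal{S} \mapsto \mathbb{R}_+$
- Policy: $\pi(\cdot \mid s, \omega)$
- Objective of DisCo RL: reach states $s_t$ that have high log-likelihood under the goal distribution:

$$\max_{\pi} \; \mathbb{E}_{\tau \sim \pi(\cdot \mid s, \omega)} \left[ \sum_t \gamma^t \log p_g(s_t; \omega) \right]$$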
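The objective can be made concrete with a small sketch. It assumes a Gaussian goal distribution with a diagonal covariance (a simplification chosen here, not something the slides state), and the function names and toy trajectories are hypothetical.

```python
import math

def gaussian_log_density(state, mean, var):
    """Log-density of a diagonal Gaussian; `var` holds per-dimension variances."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(state, mean, var)
    )

def discounted_log_likelihood_return(states, mean, var, gamma=0.99):
    """Sum over t of gamma**t * log p_g(s_t; omega) along one trajectory."""
    return sum(
        (gamma ** t) * gaussian_log_density(s, mean, var)
        for t, s in enumerate(states)
    )

# Hypothetical 2-D example: the goal distribution is centred at (1, 0).
goal_mean, goal_var = [1.0, 0.0], [0.1, 0.1]
approaching = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]  # moves toward the goal mean
stationary = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # never moves

ret_approaching = discounted_log_likelihood_return(approaching, goal_mean, goal_var)
ret_stationary = discounted_log_likelihood_return(stationary, goal_mean, goal_var)
```

Under this reward, the trajectory that moves toward the high-density region of the goal distribution collects the larger return, which is the behaviour the objective asks the policy to maximize.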

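Examples of goal distributions
1. Gaussian: $\omega = (\mu, \Sigma)$
2. Gaussian mixture model (4 components): $\omega$ consists of each component's $(\mu, \Sigma)$ and the mixture weights
3. Latent variable model: $\omega = (\mu_z, \sigma_z)$, with

$$p_g(s; \omega) = \int_{\mathcal{Z}} \mathcal{N}(z; \mu_z, \sigma_z) \, p_{\psi_d}(s \mid z) \, dz$$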
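For the mixture case, $\log p_g(s; \omega)$ is a log-sum-exp over the components. The sketch below assumes diagonal covariances and uses two components instead of four to keep the example short; both choices and all names are illustrative assumptions.

```python
import math

def diag_gaussian_logpdf(s, mean, var):
    """Log-density of a diagonal Gaussian with per-dimension variances `var`."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(s, mean, var)
    )

def gmm_logpdf(s, weights, means, variances):
    """log sum_c w_c N(s; mu_c, Sigma_c), computed stably via log-sum-exp."""
    terms = [
        math.log(w) + diag_gaussian_logpdf(s, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical 1-D mixture with two equally weighted modes at -2 and +2.
weights = [0.5, 0.5]
means = [[-2.0], [2.0]]
variances = [[0.25], [0.25]]

at_mode = gmm_logpdf([2.0], weights, means, variances)
between_modes = gmm_logpdf([0.0], weights, means, variances)
```

A state at either mode scores higher than a state between them, so a mixture can describe a goal that is satisfied by several distinct configurations.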

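Estimating the goal distribution parameters
- The parameters $\omega$ of the goal distribution are estimated from data.
- Gaussian / Gaussian mixture model: maximum likelihood estimation on an example set of $K$ observations $s_k$ in which the goal is achieved:

$$\omega^* = \arg\max_{\omega \in \Omega} \sum_{k=1}^{K} \log p_g(s_k; \omega)$$

- Latent variable model:

$$\omega^* = \arg\min_{\omega \in \Omega} \frac{1}{K} \sum_{k} D_{\mathrm{KL}}\left( q_{\psi_e}(z; s_k) \,\|\, p(z; \omega) \right)$$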
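For a single Gaussian, the maximum likelihood estimate has a closed form: the sample mean and the $1/K$-normalized sample covariance. The sketch below uses NumPy; the `jitter` term and the synthetic example set are assumptions of this sketch, not details from the slides.

```python
import numpy as np

def fit_gaussian_goal_distribution(examples, jitter=1e-6):
    """Return (mu, Sigma) maximizing sum_k log N(s_k; mu, Sigma).

    `jitter` adds a small diagonal term so Sigma stays invertible; it is an
    assumption of this sketch.
    """
    examples = np.asarray(examples, dtype=float)
    mu = examples.mean(axis=0)
    centered = examples - mu
    sigma = centered.T @ centered / len(examples)  # MLE divides by K, not K - 1
    return mu, sigma + jitter * np.eye(examples.shape[1])

# Hypothetical example set of K = 40 states: x varies freely, y is always near 1.
rng = np.random.default_rng(0)
example_set = np.column_stack(
    [rng.uniform(-1.0, 1.0, 40), 1.0 + 0.01 * rng.standard_normal(40)]
)
mu, sigma = fit_gaussian_goal_distribution(example_set)
```

In this toy example set the fitted covariance is wide along x and narrow along y, so states that differ only in x receive similar values of $\log p_g$, while deviations in y are penalized heavily.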

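Overall Algorithm
Relabeling Strategy (RS): for a pair $(s, \omega)$, relabel $\omega = (\mu, \Sigma)$ with parameters that assign high $p_g(s; \omega)$ to $s$:

$$\mathrm{RS}(s, (\mu, \Sigma)) = (s, \Sigma')$$

- The mean is replaced with $s$.
- $\Sigma'$ is a covariance sampled at random from those observed in the replay buffer.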
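The relabeling rule can be sketched in a few lines. Representing covariances as nested lists and passing the pool of observed covariances as a plain list are assumptions made for illustration; the function name is hypothetical.

```python
import random

def relabel(state, omega, observed_covariances, rng=random):
    """RS(s, (mu, Sigma)) = (s, Sigma').

    The mean is replaced with the state itself, and Sigma' is drawn uniformly
    from covariances previously seen in the replay buffer.
    """
    _old_mean, _old_cov = omega  # the original parameters are discarded
    if not observed_covariances:
        raise ValueError("need at least one observed covariance to sample from")
    return (list(state), rng.choice(observed_covariances))

# Hypothetical replay-buffer contents and one transition to relabel.
buffer_covariances = [
    [[0.1, 0.0], [0.0, 0.1]],
    [[1.0, 0.0], [0.0, 0.01]],
]
old_omega = ([5.0, 5.0], [[2.0, 0.0], [0.0, 2.0]])
new_mean, new_cov = relabel(
    [0.3, -0.2], old_omega, buffer_covariances, rng=random.Random(0)
)
```

Because the relabeled mean equals the state, the state sits at the mode of the relabeled Gaussian, which is what gives it a high $p_g(s; \omega)$.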


Experimental environments


Experiment 1: the goal distribution parameters are estimated in advance from 30-50 examples.


Experiment 1:
- Task: place the red block in the tray.
- Video: https://sites.google.com/view/disco-rl?fbclid=IwAR1lnGWSIgpScundDTnZf3SMkuOiDLnKJFihIIOT̲qoXnYLOi3RfYcbVagg


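Conditional Distributions
- Decompose a complex final task $\mathcal{T}$ into multiple sub-tasks.
- Conditional distribution $p_{g_i}(s \mid \mathcal{T})$: the distribution over goal states $s$ for sub-task $i$, conditioned on the final task $\mathcal{T}$.
- Example (shown in a figure on the original slide): the sub-task is to place the red box at the same position it occupies in the final task.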


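Parameters of the conditional distributions
- $s^{(k)}$: an example state in which sub-task $i$ has been achieved while attempting the final task $\mathcal{T}^{(k)}$ (with $\mathcal{T} = s_f$).
- Model the joint distribution of $s$ and $s_f$ as a Gaussian and fit its parameters by maximum likelihood:

$$\mu^*, \Sigma^* = \arg\max_{\mu, \Sigma} \sum_{k=1}^{K} \log p_{S, S_f}\left(s^{(k)}, s_f^{(k)}; \mu, \Sigma\right)$$

- Given $\mu^*, \Sigma^*$ and a new $s_f$, the Gaussian conditional distribution $p_g(s \mid s_f; \mu^*, \Sigma^*)$ has parameters

$$\bar{\mu} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (s_f - \mu_2), \qquad \bar{\Sigma} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$

  where $\mu_1$ and $\mu_2$ are the first and second halves of $\mu^*$, and similarly for the covariance blocks.
- With $M$ sub-tasks, the parameters of each $p_{g_i}(s \mid s_f)$ are determined in this way.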
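The conditioning step is standard Gaussian conditioning and can be sketched with NumPy. The sketch assumes $s$ and $s_f$ have the same dimension, so each occupies one half of the joint vector; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def condition_gaussian(mu, sigma, s_f):
    """Parameters of p(s | s_f) when [s, s_f] is jointly Gaussian with (mu, sigma).

    Assumes s and s_f have equal dimension, so each is one half of the joint vector.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0] // 2
    mu1, mu2 = mu[:d], mu[d:]
    s11 = sigma[:d, :d]
    s21, s22 = sigma[d:, :d], sigma[d:, d:]
    # Equals Sigma_12 @ inv(Sigma_22) because sigma is symmetric; solve avoids
    # forming the inverse explicitly.
    gain = np.linalg.solve(s22, s21).T
    cond_mean = mu1 + gain @ (np.asarray(s_f, dtype=float) - mu2)
    cond_cov = s11 - gain @ s21
    return cond_mean, cond_cov

# Hypothetical 1-D case: s and s_f are strongly correlated (0.9).
joint_mu = [0.0, 0.0]
joint_sigma = [[1.0, 0.9], [0.9, 1.0]]
cond_mean, cond_cov = condition_gaussian(joint_mu, joint_sigma, s_f=[2.0])
```

With a strong correlation between $s$ and $s_f$, observing the final goal pulls the sub-task mean toward it and shrinks the variance, so the sub-task distribution adapts to each new final task.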


Overall Algorithm


Experiment 2: Sub-Task Decomposition
- With $M$ sub-tasks, the policy is conditioned on each $p_{g_i}(s \mid s_f)$ in turn, switching every $H/M$ time steps.
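The switching schedule can be sketched as a function from time step to sub-task index. This assumes $H$ denotes the episode horizon and that any leftover steps (when $H$ is not divisible by $M$) go to the last sub-task; both are assumptions of this sketch, and the function name is hypothetical.

```python
def subtask_index(t, horizon, num_subtasks):
    """0-based index of the sub-task distribution that conditions the policy at step t."""
    if num_subtasks <= 0 or horizon < num_subtasks:
        raise ValueError("need at least one time step per sub-task")
    steps_per_subtask = horizon // num_subtasks
    # Clamp so leftover steps at the end of the episode stay on the last sub-task.
    return min(t // steps_per_subtask, num_subtasks - 1)

# Hypothetical episode with H = 100 and M = 4: the policy switches every 25 steps.
schedule = [subtask_index(t, horizon=100, num_subtasks=4) for t in (0, 24, 25, 99)]
```

At each step, the policy would then be conditioned on the parameters of the distribution selected by this index.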


Experiment 2: Sub-Task Decomposition


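Summary
- Proposes DisCo RL, which learns a policy conditioned on a distribution over goal states.
- It can achieve more general goals than conventional goal-conditioned RL.
- Conditional DisCo RL decomposes complex tasks into sub-tasks and learns them efficiently.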