[DL Reading Group] Why Is Deep Reinforcement Learning Difficult? Why Deep RL fails? A brief survey of recent works.

January 15, 2021

Slide overview

Published on Jan 15, 2021

2021/01/15
Deep Learning JP:
http://deeplearning.jp/seminar-2/


Text of each page
1.

DEEP LEARNING JP [DL Papers] Why Deep RL fails? A brief survey of recent works. Presenter: Kei Ota (@ohtake_i). http://deeplearning.jp/

2.

• Deep reinforcement learning (DRL) is often unstable compared with classification and regression problems.
• This talk gives a broad, shallow tour of several papers that study, theoretically and experimentally, what causes the instability of DRL and how it can be mitigated.
• Papers covered:
– Deep Reinforcement Learning that Matters (AAAI)
– Deep Reinforcement Learning and the Deadly Triad (arXiv)
– Diagnosing Bottlenecks in Deep Q-learning Algorithms (ICML)
– Revisiting Fundamentals of Experience Replay (ICML)
– Implicit under-parameterization inhibits data-efficient deep reinforcement learning (ICLR)
– D2RL: Deep Dense Architectures in Reinforcement Learning (arXiv)

3.

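Deep Reinforcement Learning that Matters
• Shows experimentally that DRL has the following properties:
– Results change substantially with hyperparameters, activation functions, and implementation details.
– Unlike regression and classification problems, adding layers does not necessarily improve performance.
Why is Deep RL difficult?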

4.

Deadly Triad
• Deadly Triad: the failures of deep Q-learning (DQL) stem from three methods used in training:
1. Bootstrapping (TD learning)
2. Function approximation
3. Off-policy learning

5.
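Deadly Triad
• An example in which the Deadly Triad makes the value function diverge:
– Two states with features φ(s1) = 1, φ(s2) = 2
– Linear value function v(s) = w × φ(s), so v(s1) = w and v(s2) = 2w
– The update is applied to v(s1)
• The estimate diverges when γ > 1/2
• Intuitive interpretation of this problem: [not recoverable from the extracted text]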
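The divergence can be reproduced numerically. The sketch below is one reading of the example under stated assumptions: it takes the two-state setup (φ(s1) = 1, φ(s2) = 2, v(s) = w × φ(s)), assumes a reward of 0 and that only the transition s1 → s2 is ever updated (the off-policy part), and applies the standard semi-gradient TD(0) update. The function name `run_w2w`, the step size `alpha`, and the number of steps are illustrative choices, not taken from the slides.

```python
def run_w2w(gamma, alpha=0.1, steps=50, w0=1.0):
    """Semi-gradient TD(0) on the single transition s1 -> s2 with reward 0.

    Features: phi(s1) = 1, phi(s2) = 2, so v(s1) = w and v(s2) = 2w.
    """
    phi_s1, phi_s2 = 1.0, 2.0
    w = w0
    for _ in range(steps):
        # TD error: r + gamma * v(s2) - v(s1), with r = 0 (assumed)
        td_error = 0.0 + gamma * (w * phi_s2) - (w * phi_s1)
        # The gradient of v(s1) with respect to w is phi(s1)
        w += alpha * td_error * phi_s1
    return w


w_small = run_w2w(gamma=0.4)  # 2*gamma - 1 < 0: w shrinks toward 0
w_large = run_w2w(gamma=0.9)  # 2*gamma - 1 > 0: w keeps growing
print(w_small, w_large)
```

Each update multiplies w by 1 + α(2γ − 1), so the sign of 2γ − 1 decides between decay and growth, which matches the γ > 1/2 condition.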

6.

Deadly Triad
• Can the Deadly Triad be avoided in the first place?
• Bootstrapping
• Function approximation
• Off-policy
• Escaping the Deadly Triad looks difficult, so we want to understand its properties.

7.

Deep Reinforcement Learning and the Deadly Triad
• The paper varies the influence of each component of the Deadly Triad:
• Bootstrapping
• Function approximation
• Off-policy
• Under these settings, it observes how training behavior changes as each component is modified, and investigates experimentally when the algorithm becomes unstable and how the components relate to performance.

8.

Deep Reinforcement Learning and the Deadly Triad
• Before the experiments, the following hypotheses were posed, and the results were then checked against them.

9.

Deep Reinforcement Learning and the Deadly Triad
• (F) With DQL, unbounded divergence (Q-values growing without limit) is unlikely.
– γ = 0.99, so 1/(1 − γ) = 100
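The value 100 on this slide is the geometric-series bound on a discounted return. The short check below assumes rewards are clipped to [−1, 1]; that assumption and the name `r_max` are not stated in the surviving slide text. Under it, no true Q-value can exceed 1/(1 − γ) in magnitude, so learned estimates far beyond 100 indicate divergence rather than a plausible value.

```python
gamma = 0.99
r_max = 1.0  # assumption: rewards are clipped to [-1, 1]

# Geometric series: sum over t >= 0 of gamma^t * r_max = r_max / (1 - gamma)
q_bound = r_max / (1.0 - gamma)

# Numerical check with a long truncated sum
partial = sum(r_max * gamma**t for t in range(5000))
print(q_bound, partial)
```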

10.

Deep Reinforcement Learning and the Deadly Triad
• (B) Bootstrapping with a target network makes divergence less likely.
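A minimal sketch of the target-network idea, assuming a hard (periodic copy) update: bootstrapped targets are computed from a frozen copy of the parameters, and the copy is refreshed only every `period` steps. The helper names `td_target` and `maybe_sync`, the parameter dictionary, and the stand-in "gradient step" are hypothetical and only illustrate the mechanism.

```python
import copy


def td_target(reward, next_q_values_target, gamma, done):
    """Bootstrapped target computed from the target network's Q-values."""
    if done:
        return reward
    return reward + gamma * max(next_q_values_target)


def maybe_sync(step, online_params, target_params, period):
    """Hard update: copy online parameters into the target every `period` steps."""
    if step % period == 0:
        return copy.deepcopy(online_params)
    return target_params


online = {"w": [0.0, 0.0]}
target = copy.deepcopy(online)
for step in range(1, 8):
    online["w"][0] += 1.0  # stand-in for a gradient step on the online network
    target = maybe_sync(step, online, target, period=3)

# The target lags behind the online parameters between syncs.
print(online["w"][0], target["w"][0])
```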

11.

Deep Reinforcement Learning and the Deadly Triad
• (B) Correcting the overestimation of Q-values makes divergence less likely.
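One common correction for Q-value overestimation is the Double Q-learning target, which the closing summary of this deck also lists. The sketch below contrasts it with the standard target; the function names and the example Q-values are made up for illustration.

```python
def q_learning_target(reward, q_next_target, gamma):
    """Standard target: the same values both select and evaluate the action."""
    return reward + gamma * max(q_next_target)


def double_q_target(reward, q_next_online, q_next_target, gamma):
    """Double Q-learning target: online values select, target values evaluate."""
    best = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return reward + gamma * q_next_target[best]


q_online = [1.0, 3.0, 2.0]  # hypothetical noisy estimates from the online network
q_target = [1.2, 2.1, 2.6]  # hypothetical noisy estimates from the target network

standard = q_learning_target(0.0, q_target, 0.99)
double = double_q_target(0.0, q_online, q_target, 0.99)
print(standard, double)
```

Because the evaluated action is not necessarily the one with the largest target value, the Double Q-learning target can never exceed the standard one for the same inputs.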

12.

Deep Reinforcement Learning and the Deadly Triad
• (B) Longer multi-step returns make divergence less likely (smaller bias, larger variance).
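For reference, a multi-step (n-step) return sums n observed rewards before bootstrapping, so the bootstrapped estimate is weighted by γ^n instead of γ. The sketch below is a plain implementation of that definition; the function name and the example numbers are illustrative.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """G = r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * bootstrap_value."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


one_step = n_step_return([1.0], bootstrap_value=10.0, gamma=0.5)
three_step = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
print(one_step, three_step)
```

With n = 3 the bootstrapped value contributes with weight 0.5³ = 0.125 rather than 0.5, so errors in the value estimate feed back into the target less strongly, at the cost of more variance from the sampled rewards.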

13.

Deep Reinforcement Learning and the Deadly Triad
• (F) Larger networks are less prone to divergence.

14.

Deep Reinforcement Learning and the Deadly Triad
• (O) Stronger prioritization in a prioritized replay buffer makes divergence more likely.
– Prioritization parameter ∈ {0, 1, 2}
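To make "stronger prioritization" concrete, the sketch below uses the usual prioritized-replay sampling rule P(i) = p_i^α / Σ_k p_k^α. Treating the set {0, 1, 2} on the slide as this exponent α is an assumption (the symbol did not survive extraction), and the priorities are made-up numbers.

```python
def sampling_probabilities(priorities, alpha):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


td_errors = [0.1, 0.5, 2.0]  # hypothetical |TD errors| used as priorities
probs = {alpha: sampling_probabilities(td_errors, alpha) for alpha in (0, 1, 2)}
for alpha, p in probs.items():
    print(alpha, p)
```

As α grows, sampling concentrates on the transitions with the largest errors, which moves the training distribution further from the one the data was collected under.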

15.

Deep Reinforcement Learning and the Deadly Triad
Summary: the influence of the Deadly Triad's components can be mitigated by several techniques.
• Bootstrapping
• Function approximation
• Off-policy

16.

Diagnosing Bottlenecks in Deep Q-learning Algorithms
• To investigate latent problems in DQL, the paper runs "unit tests" and answers the following questions experimentally.

17.

Diagnosing Bottlenecks in Deep Q-learning Algorithms
• (F) How does the function approximator affect convergence?

18.

Diagnosing Bottlenecks in Deep Q-learning Algorithms
• (B) Does overfitting occur?

19.

Diagnosing Bottlenecks in Deep Q-learning Algorithms
• (B) How can overfitting be mitigated?
– Early stopping
– A parameter varied over {0.5, 1.0, 2.0, 4.0, 8.0}

20.

Diagnosing Bottlenecks in Deep Q-learning Algorithms
• (B) What is the effect of non-stationary regression targets?

21.

Diagnosing Bottlenecks in Deep Q-learning Algorithms
• Summary: the components of the Deadly Triad can be mitigated by several techniques.
• Bootstrapping
• Function approximation
• Off-policy

22.

Revisiting Fundamentals of Experience Replay
• Investigates how the parameters of the replay buffer affect RL (this corresponds to the "O" of the Deadly Triad).

23.

Revisiting Fundamentals of Experience Replay
• (O) How do the replay buffer's hyperparameters relate to RL performance?

24.

Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• When the policy and value function are trained with TD learning, the effective rank decreases and control performance degrades.
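The effective rank can be computed from the singular values of a learned feature matrix. The sketch below assumes the threshold-style definition (the smallest k whose top-k singular values account for at least a 1 − δ share of their sum, with δ = 0.01); the slide's own formula did not survive extraction, so both the definition and the name `effective_rank` should be read as assumptions. The singular values themselves would come from an SVD, for example `numpy.linalg.svd(features, compute_uv=False)`.

```python
def effective_rank(singular_values, delta=0.01):
    """Smallest k whose top-k singular values hold at least (1 - delta) of the total.

    Assumes at least one singular value is non-zero.
    """
    s = sorted(singular_values, reverse=True)
    total = sum(s)
    running = 0.0
    for k, sigma in enumerate(s, start=1):
        running += sigma
        if running / total >= 1.0 - delta:
            return k
    return len(s)


spread_out = effective_rank([1.0, 1.0, 1.0, 1.0])     # all directions matter equally
collapsed = effective_rank([10.0, 0.05, 0.03, 0.02])  # one direction dominates
print(spread_out, collapsed)
```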

25.

Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• (B) How does the network's effective rank relate to control performance (return)?

26.

Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• (B) How does the network's effective rank relate to control performance (return)?

27.

Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• (B) Is bootstrapping (TD learning) to blame?

28.

Implicit under-parameterization inhibits data-efficient deep reinforcement learning
• Would it help to add a loss term that keeps the effective rank from dropping?

29.

D2RL: Deep Dense Architectures in Reinforcement Learning
• Adopts DenseNet-style connections for the policy and value function.
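A rough sketch of what a densely connected MLP can look like, assuming the variant in which the raw input is concatenated to the output of every hidden layer before it is passed to the next hidden layer. The layer sizes, the all-ones weights, and the helper names are placeholders chosen so the arithmetic is easy to follow; they are not taken from the slides or the paper.

```python
def linear(x, weights, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def dense_mlp_forward(x, hidden_layers, output_layer):
    """MLP in which the raw input x is re-concatenated to each hidden activation."""
    h = relu(linear(x, *hidden_layers[0]))
    for weights, bias in hidden_layers[1:]:
        h = relu(linear(h + x, weights, bias))  # list concatenation: [hidden ; input]
    return linear(h, *output_layer)


x = [1.0, 2.0]
hidden = [
    ([[1.0] * 2] * 3, [0.0] * 3),  # layer 1: 2 inputs -> 3 units
    ([[1.0] * 5] * 3, [0.0] * 3),  # layer 2: 3 hidden + 2 input -> 3 units
]
out = ([[1.0] * 3], [0.0])         # output layer: 3 -> 1
y = dense_mlp_forward(x, hidden, out)
print(y)
```

The second hidden layer takes five inputs rather than three, which is the only structural difference from a plain MLP in this sketch.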

30.

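• Starting from cases where Deep RL (DQL in particular) fails to learn well, this talk looked at why learning fails and what causes the instability.
• In particular, it showed that the combination of three elements known as the "Deadly Triad" destabilizes learning, and introduced research that can mitigate the influence of these elements:
– Bootstrapping
• Adopting multi-step returns, target networks, Double Q-learning, DenseNet, and early stopping
• Visualizing performance through the effective rank
– Function approximation
• Adopting larger networks
– Off-policy
• Using samples that are more on-policy and more diverse
• However, these depend (in some cases strongly) on the environment, the RL algorithm, and the hyperparameters, so they should be chosen appropriately for each situation.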

31.

DEEP LEARNING JP [DL Papers] Why Deep RL fails? A brief survey of recent works. Presenter: Kei Ota. http://deeplearning.jp/