【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation


June 23, 2023

Slide overview

2023/6/23
Deep Learning JP
http://deeplearning.jp/seminar-2/


Text of each slide
1.

Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Shohei Taniguchi, Matsuo Lab

2.

Deep Transformers without Shortcuts
Bibliographic information
• Authors: Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, Yee Whye Teh (DeepMind)
Overview
• Modifies the Transformer so that it can be trained without layer normalization or skip connections
• Accepted at ICLR 2023

3.

Outline
• Background
• Related work
• Method
• Experimental results
• Summary

4.

Background: the Transformer
• A Transformer is a repeated stack of attention and MLP blocks
• It is standard to apply a skip connection and layer normalization in each module
• What role these components actually play is currently unclear
• They are positioned as techniques for making training work well

5.

Related work: normalization-free networks
• For MLPs and CNNs, methods are known for training deep networks without skip connections or normalization
• Basic idea: if the weights are initialized appropriately so that gradients neither vanish nor explode, normalization and the like are unnecessary
• The concept of dynamical isometry is especially important

6.

Isometry
• Consider an L-layer MLP with x_l = φ(h_l), h_l = W_l x_{l-1} + b_l
• The Jacobian J from input to output is the product of the per-layer matrices:

J = ∂x_L/∂h_0 = ∏_{l=1}^{L} D_l W_l

where D_l is the diagonal matrix with (D_l)_{ij} = φ'((h_l)_i) δ_{ij}

7.

Isometry

J = ∂x_L/∂h_0 = ∏_{l=1}^{L} D_l W_l

• If this Jacobian neither vanishes nor explodes, training should be stable, i.e. the singular values of the matrix should be close to 1
• W_l satisfies isometry when the mean of its singular values is 1
• As for D_l, it is isometric if the activation function is close to the identity around the origin (e.g. tanh)

8.

Dynamical isometry

J = ∂x_L/∂h_0 = ∏_{l=1}^{L} D_l W_l

• If, moreover, all singular values equal 1, dynamical isometry holds
• This is satisfied when the weights are orthogonal matrices
• With orthogonal initialization, gradients neither vanish nor explode!
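As a quick numerical check of this claim (my own illustration, not from the slides), the sketch below builds the Jacobian J = ∏_l D_l W_l of a deep tanh MLP with orthogonally initialized weights and verifies that its singular values stay near 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 20

def orthogonal(d):
    # QR of a Gaussian matrix yields an orthogonal matrix (all singular values 1)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

x = 0.1 * rng.standard_normal(d)   # small inputs keep tanh near its linear regime
J = np.eye(d)
for _ in range(depth):
    W = orthogonal(d)
    h = W @ x
    D = np.diag(1.0 - np.tanh(h) ** 2)   # D_l = diag(tanh'(h_l))
    J = D @ W @ J                        # accumulate J = D_L W_L ... D_1 W_1
    x = np.tanh(h)

# All singular values of J remain close to 1: no vanishing/exploding gradients
s = np.linalg.svd(J, compute_uv=False)
```

With a Gaussian (non-orthogonal) initialization the same product spreads its singular values over many orders of magnitude at this depth.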

9.

Related work [1]
• CIFAR-10 classification with an MLP
• Orthogonal initialization + tanh converges faster than the alternatives

10.

Related work [2]: CNNs
• CNNs can also be trained deep without normalization if initialized to satisfy dynamical isometry
• Only the center of each convolutional kernel is orthogonally initialized; all other entries are initialized to 0
• i.e., an orthogonally initialized 1x1 conv, zero-padded out to the full kernel
• Viewed as one big matrix operation, the whole convolution is then also an orthogonal matrix
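The kernel construction above can be sketched as follows (an illustrative reimplementation of the delta-orthogonal idea in [2], not the authors' code):

```python
import numpy as np

def delta_orthogonal(channels, k, rng):
    """k x k kernel: orthogonal matrix at the spatial center, zeros elsewhere."""
    w = np.zeros((channels, channels, k, k))
    q, _ = np.linalg.qr(rng.standard_normal((channels, channels)))  # orthogonal
    w[:, :, k // 2, k // 2] = q
    return w

rng = np.random.default_rng(0)
w = delta_orthogonal(16, 3, rng)

# The center tap is an orthogonal matrix and every other tap is exactly zero,
# so at initialization the conv acts as an orthogonal 1x1 convolution.
center = w[:, :, 1, 1]
```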

11.

Related work [2]: CNNs
• MNIST trained with a 4,000-layer CNN
• No normalization or skip connections
• Training is faster than with Gaussian initialization

12.

Related work [2]: CNNs
• Models of various depths trained on MNIST and CIFAR-10
• Training still works even at 10,000 layers
• However, test accuracy on CIFAR-10 drops
• Suggests that normalization and skip connections contribute more to generalization than to stabilizing training

13.

Related work [3]: ReZero
• Even when skip connections are used, initializing so that dynamical isometry holds should improve performance further:

x_{i+1} = x_i + α_i F(x_i)

• Normally α_i = 1; ReZero instead initializes α_i = 0 and treats α_i as a learnable parameter
• At initialization x_{i+1} = x_i, so dynamical isometry clearly holds
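A minimal sketch of the ReZero update (illustrative, not the authors' code): each residual branch is gated by a scalar α initialized to 0, so the whole stack is exactly the identity at initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 32, 16
blocks = [{"W": rng.standard_normal((d, d)) / np.sqrt(d), "alpha": 0.0}
          for _ in range(depth)]

def forward(x):
    for b in blocks:
        f = np.tanh(b["W"] @ x)    # residual branch F(x_i)
        x = x + b["alpha"] * f     # alpha_i = 0 at init -> x_{i+1} = x_i exactly
    return x

x0 = rng.standard_normal(d)
out = forward(x0)   # identical to x0 at initialization
```

During training the α_i become ordinary parameters, so each block gradually "turns on" its residual branch.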

14.

Related work [3]: ReZero
• 32-layer MLP trained on CIFAR-10
• Training is considerably faster even without normalization

15.

Related work [3]: ReZero
• ResNet trained on CIFAR-10
• Training is faster, and performance also improves

16.

Related work [4]: the ReLU case
• With ReLU, dynamical isometry can be achieved by flipping the sign of part of an orthogonal weight matrix
• Intuitively, ReLU cuts all negative input signals to 0, so flipping the signs of a mirrored copy lets it cancel that loss
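The intuition can be verified in a few lines (my own illustration of the idea, not the construction in [4]): a sign-flipped copy of an orthogonal pre-activation recovers exactly what ReLU cuts off, since relu(h) − relu(−h) = h.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal weight

x = rng.standard_normal(d)
h = q @ x
# relu on h keeps the positive part; relu on the sign-flipped -h keeps the
# negative part; their difference reconstructs h with no information lost.
recovered = np.maximum(h, 0) - np.maximum(-h, 0)
```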

17.

Related work [5]: rank collapse in Transformers
• For an attention-only Transformer without MLPs, skip connections, or LayerNorm, it can be shown theoretically that the matrix representing the whole model loses rank exponentially in the number of layers, already at initialization
• Suggests that a Transformer cannot be trained with attention alone

18.

Deep Transformers without Shortcuts
• Can a Transformer, too, be trained without normalization or skip connections? With enough effort, yes
• Simply removing normalization and skips makes the gradients explode
• With the proposed method, they are kept largely under control

19.

Deep Transformers without Shortcuts

Attn(X) = A(X) V(X),  A(X) = softmax(M ∘ (Q(X) K(X)^⊤ / √d_k) − Γ(1 − M))

where Γ is a sufficiently large constant
• This paper targets causal masked attention as used in GPT-style models
• The mask M_{i,j} = 1_{i≥j} prevents attending to future positions in the sequence
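A minimal sketch of this masked attention, following the slide's formula (my own illustration; shapes and the value of Γ are arbitrary):

```python
import numpy as np

def causal_attention(Q, K, V, gamma=1e9):
    T, d_k = Q.shape
    M = np.tril(np.ones((T, T)))                      # M_{ij} = 1 iff i >= j
    # Masked entries get -gamma, which softmax sends to (numerically) zero
    logits = M * (Q @ K.T / np.sqrt(d_k)) - gamma * (1 - M)
    logits -= logits.max(axis=-1, keepdims=True)      # numerically stable softmax
    A = np.exp(logits)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out, A = causal_attention(Q, K, V)
# A is row-stochastic and strictly lower-triangular above the diagonal is 0:
# no position attends to the future.
```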

20.

Deep Transformers without Shortcuts
• First, consider an attention-only model without MLPs. The features at layer L are

X_L = [A_L A_{L−1} ⋯ A_1] X_0 W,  W = ∏_{l=1}^{L} W_l^V W_l^O

• Writing Σ_l = X_l X_l^⊤ and Π_l = A_l A_{l−1} ⋯ A_1, when W is orthogonal,

Σ_l = Π_l · Σ_0 · Π_l^⊤

21.

Deep Transformers without Shortcuts
• Writing Σ_l = X_l X_l^⊤ and Π_l = A_l A_{l−1} ⋯ A_1, when W is orthogonal, Σ_l = Π_l · Σ_0 · Π_l^⊤
• If Σ_l stays close to the identity matrix, the gradients are stable, so we want to design the A_l to make that happen
• Constraint: each A_l must be a lower-triangular matrix with nonnegative entries

22.

Deep Transformers without Shortcuts
• Setting A_l = L_l L_{l−1}^{−1}, we get Σ_l = L_l L_l^⊤, provided that L_0^{−1} Σ_0 L_0^{−⊤} = I_T
• This corresponds to a Cholesky factorization
• Design a suitable Σ_l and compute its Cholesky factor L_l; the resulting A_l then satisfies the conditions
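The construction can be checked numerically (my own illustration, using a U-SPA-style target Σ of the form defined on the next slide): picking two target covariances, taking their Cholesky factors, and forming A_1 = L_1 L_0^{−1} gives a lower-triangular map with A_1 Σ_0 A_1^⊤ = L_1 L_1^⊤ = Σ_1.

```python
import numpy as np

T = 6

def u_spa_sigma(rho):
    # Target covariance: ones on the diagonal, rho everywhere else
    return (1 - rho) * np.eye(T) + rho * np.ones((T, T))

L0 = np.linalg.cholesky(u_spa_sigma(0.1))
L1 = np.linalg.cholesky(u_spa_sigma(0.3))
A1 = L1 @ np.linalg.inv(L0)   # A_l = L_l L_{l-1}^{-1}

# A1 is lower triangular (product of lower-triangular matrices), and since
# L0^{-1} Sigma_0 L0^{-T} = I by construction, it maps Sigma_0 to Sigma_1.
check = A1 @ u_spa_sigma(0.1) @ A1.T
```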

23.

Deep Transformers without Shortcuts: U-SPA

Σ_l(ρ_l) = (1 − ρ_l) I_T + ρ_l 1 1^⊤

• A matrix whose diagonal entries are 1 and whose off-diagonal entries are ρ_l
• The conditions are satisfied if 0 ≤ ρ_0 ≤ ρ_1 ≤ ⋯ ≤ ρ_L < 1
• Also prevents rank collapse

24.

Deep Transformers without Shortcuts: E-SPA

Σ_l(γ_l)_{i,j} = exp(−γ_l |i − j|)

• A matrix whose diagonal entries are 1 and whose off-diagonal entries are determined by the distance from the diagonal
• The conditions are satisfied if γ_0 ≥ γ_1 ≥ ⋯ ≥ γ_L > 0
• Also prevents rank collapse
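The E-SPA target can be built and sanity-checked in a few lines (my own illustration): it has unit diagonal and, being symmetric positive definite for γ > 0, admits the Cholesky factor the construction needs.

```python
import numpy as np

def e_spa_sigma(gamma, T):
    # Sigma_{i,j} = exp(-gamma * |i - j|): unit diagonal, off-diagonal entries
    # decay with distance from the diagonal
    idx = np.arange(T)
    return np.exp(-gamma * np.abs(idx[:, None] - idx[None, :]))

S = e_spa_sigma(0.5, 5)
chol = np.linalg.cholesky(S)   # succeeds: S is positive definite for gamma > 0
```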

25.

Deep Transformers without Shortcuts: redefining attention
• Decompose the A obtained by working backwards from the Σ above as A = D P
• D is a positive diagonal matrix; P is a lower-triangular matrix whose rows each sum to 1
• Setting B = log(P), redefine attention as

Attn(X) = D P(X) V(X),  P(X) = softmax(M ∘ [Q(X) K(X)^⊤ / √d_k + B] − Γ(1 − M))

• Initializing the weight W^Q of Q(X) to 0 makes Σ take the desired form at initialization
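A sketch of this reparameterization at initialization (my own illustration; the stand-in A below is an arbitrary positive lower-triangular matrix rather than one from the Cholesky construction): with W^Q = 0 the Q(X)K(X)^⊤ term vanishes, the softmax of B alone recovers exactly P, and D P reproduces the designed A.

```python
import numpy as np

rng = np.random.default_rng(0)
T, GAMMA = 5, 1e9

# Stand-in for a designed target A: lower triangular with positive entries
A = np.tril(rng.uniform(0.1, 1.0, (T, T)))

D = np.diag(A.sum(axis=1))                 # positive diagonal part
P = A / A.sum(axis=1, keepdims=True)       # row-stochastic lower-triangular part
M = np.tril(np.ones((T, T)))
B = np.log(np.where(M > 0, P, 1.0))        # B = log(P) on the unmasked entries

# W^Q = 0 at init, so the data-dependent logits are zero and only B remains
logits = M * B - GAMMA * (1 - M)
logits -= logits.max(axis=-1, keepdims=True)
P_init = np.exp(logits)
P_init /= P_init.sum(axis=-1, keepdims=True)
# D @ P_init equals the designed A at initialization
```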

26.

Experiments: WikiText-103
• 36-layer Transformer
• Naively removing the skips fails to train at all
• The proposed method trains properly
• However, it trains much more slowly than the usual model with skip + LN

27.

Experiments: the C4 dataset
• 32-layer Transformer
• With longer training, it reaches the performance of the skip + LN model
• This takes roughly 5x as long
• Suggests that, in Transformers, skips and LN mainly contribute to speeding up training?

28.

Experiments on the C4 dataset
• Adding skip connections back makes the proposed method beat the skip + LN baseline
• Is the skip connection, after all, what matters most in Transformers?

29.

Summary
• For MLPs and CNNs, initializing to satisfy dynamical isometry makes it possible to train deep networks without normalization or skip connections
• The paper shows that a Transformer, too, can be trained without skips or LN if the initialization is done carefully in the same spirit
• However, training takes considerably longer

30.

Impressions
• It does feel somewhat forced
• In the end, the point should simply be to make the attention at initialization close to the identity matrix
• There may well be a simpler way to achieve that
• It is not really clear where the slowdown in training comes from

31.

References
[1] Pennington, Jeffrey, Samuel Schoenholz, and Surya Ganguli. "Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice." Advances in Neural Information Processing Systems 30 (2017).
[2] Xiao, Lechao, et al. "Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks." International Conference on Machine Learning. PMLR, 2018.
[3] Bachlechner, Thomas, et al. "ReZero is all you need: Fast convergence at large depth." Uncertainty in Artificial Intelligence. PMLR, 2021.

32.

References (cont.)
[4] Burkholz, Rebekka, and Alina Dubatovka. "Initialization of ReLUs for dynamical isometry." Advances in Neural Information Processing Systems 32 (2019).
[5] Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. "Attention is not all you need: Pure attention loses rank doubly exponentially with depth." International Conference on Machine Learning. PMLR, 2021.
[6] He, Bobby, et al. "Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation." The Eleventh International Conference on Learning Representations. 2023.