[DL Reading Group] Convolutional Sequence to Sequence Learning

DL Reading Group: Convolutional Sequence to Sequence Learning
2017/05/19, Matsuo Lab, M1 (first-year master's student), 中川大海

Agenda
1. Information
2. Introduction
3. Related Work
4. Proposed Model
5. Experiments & Results
6. Conclusion

1. Information
• Authors
  – Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
  – FAIR (Facebook AI Research)
• Submitted to arXiv on 8 May 2017
• Summary
  – A fully convolutional seq2seq model
  – Uses mechanisms such as GLU, multi-hop attention, and residual connections
  – More accurate than GNMT and roughly 9x faster
  – The implementation is available on GitHub (https://github.com/facebookresearch/fairseq)

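2. Introduction
• GNMT (Google Neural Machine Translation) is the talk of the translation community
  – encoder-decoder, bi-directional encoder, attention, and stacked LSTM blocks with residual layers to prevent vanishing gradients
• Meanwhile, in NLP more broadly, models based on CNNs, which can be computed in parallel, have recently become popular
  – RNN: cannot be parallelized, and gradients tend to vanish as sequences get longer
  – CNN: can be parallelized and therefore sped up, and relationships between distant elements of a sequence are easier to learn
• CNN-based methods have existed for a while and have progressed roughly as follows
  1. Cannot match the accuracy, but computation is faster
  2. Can win on limited datasets: [Bradbury et al. (2016), Kalchbrenner et al. (2016)]
  3. Can win on a variety of datasets: [Gehring et al. (2016), Dauphin et al. (2016)]
     • not fully convolutional
     • not a generative model like seq2seq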

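3. Related Work: Tasks in natural language processing
• Discriminative tasks
  – language modeling
  – sentence classification
  – sentiment analysis
  – etc.
• Generative tasks
  – sequence to sequence learning (translation, summarization)
  – caption generation
  – etc.
• Evaluation metrics
  – Accuracy
  – PPL (Perplexity)
    • the average branching factor over words
    • $2^{H}$, where $H$ is the per-word entropy
    • measures how hard it is to pin down the next word (smaller is better)
  – BLEU (Bilingual Evaluation Understudy)
    • a similarity-style score between the reference (a professional translation) and the prediction
    • larger is better
  – ...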
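
The relation "perplexity = 2 to the power of the per-word entropy" can be sketched in a few lines of Python. This is only an illustration: the function name `perplexity` and the toy probabilities are assumptions, not something taken from the slides, and the probabilities stand in for whatever a language model assigns to the correct tokens.

```python
import math

def perplexity(token_probs):
    """Perplexity = 2 ** (average per-token entropy in bits)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log2-probability per token (cross-entropy in bits).
    avg_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2.0 ** avg_bits

# A model that always hesitates uniformly between 4 candidates
# has a branching factor, and hence a perplexity, of 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25])

# A more confident model gets a lower (better) perplexity.
confident_ppl = perplexity([0.9, 0.8, 0.95])
```

The uniform case makes the "average branching factor" reading concrete: four equally likely choices per word give a perplexity of 4.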

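3. Related Work: GNMT [Wu et al. 2016]
• Encoder-decoder model
  – encode: from the source language to a latent state
  – decode: from the latent state to the target language
• Bi-directional encoder in the first layer only
  – context can be taken into account even for words near the start of the sentence
  – the quality of the encoder also affects how well attention works
• Attention
  – also learns which part of the input sequence to focus on while translating
  – adds computation time, but is especially effective for long sequences
• Residual connections on every layer
  – learn the residual function F(x) = H(x) - x instead of the output H(x)
  – gradients vanish less easily even as layers are added
  – can be implemented simply by adding the input to the output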
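
The last point, that a residual layer only needs to add its input to its output, can be illustrated with a minimal sketch. The names `residual_block` and `small_update` are hypothetical, and plain Python lists stand in for the tensors a real system such as GNMT would use.

```python
def residual_block(x, f):
    """Return H(x) = F(x) + x, where f computes the residual F."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same dimension as x")
    return [fi + xi for fi, xi in zip(fx, x)]

# Hypothetical residual function: a small perturbation of the input.
def small_update(x):
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 3.0]
h = residual_block(x, small_update)

# If F outputs zeros, the block reduces to the identity mapping,
# so stacking more layers cannot make the signal disappear.
identity = residual_block(x, lambda v: [0.0 for _ in v])
```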

3. Related Work: NLP models using CNNs
• Sentence Classification [Kim, 2014]
• Character-level Text Classification [Zhang et al. 2015]
• Quasi-RNN [Bradbury et al. 2016]
  – LSTM-like pooling
• There are many others
  – http://ksksksks2.hatenadiary.jp/entry/20170122/1485082800
  – http://deeplearning.hatenablog.com/entry/neural_machine_translation_theory#seq2seq
  – https://www.slideshare.net/sheemap/convolutional-neural-netwoks
• Computation gets faster, but accuracy is sometimes better and sometimes worse than LSTM-based models, and the datasets on which these models work well are limited

3. Related Work: NLP models using CNNs
• Language Modeling with Gated CNN [Dauphin et al. 2016]
  – Introduces Gated Linear Units as the gating function
  – Uses residual connections
  – SoTA PPL on the WikiText-103 task
  – 20x the speed of LSTM-based models

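3. Related Work: NMT using CNNs
• Language Modeling with Gated CNN [Dauphin et al. 2016]
  – Introduces Gated Linear Units as the gating function
  – "allows the model to select which words or features are relevant to predict the next word."
  – Given the translation so far, the model can learn a gating function that expresses, for example, whether to focus on a specific part of the context at that point or to look at it broadly
  – Gradients vanish less easily than with tanh-based gating functions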
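
The claim about vanishing gradients can be made concrete by comparing the two gradients, as Dauphin et al. (2016) do. The equations below follow that paper's simplified notation, in which the same $X$ feeds both the linear path and the gate; in the actual model the two halves come from separate projections.

```latex
\begin{align*}
\nabla\,[\tanh(X) \otimes \sigma(X)]
  &= \tanh'(X)\,\nabla X \otimes \sigma(X) + \sigma'(X)\,\nabla X \otimes \tanh(X) \\
\nabla\,[X \otimes \sigma(X)]
  &= \nabla X \otimes \sigma(X) + X \otimes \sigma'(X)\,\nabla X
\end{align*}
```

In the tanh-gated (LSTM-style) case every term is scaled by a derivative $\tanh'(X)$ or $\sigma'(X)$, both of which shrink towards zero as the units saturate. The GLU gradient keeps the term $\nabla X \otimes \sigma(X)$, which has no such downscaling for the active gates.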

4. Proposed Model
• What the model does
  1. Embed the input, convolve it, and pass it through a GLU
     • the decoder side does the same
  2. Compute multi-hop attention
     • "allow machines to reference different parts of text to build understanding during encoding."
  3. Predict from the attention-weighted input and the decoder contexts

4. Proposed Model
→ The next slides go through each of the three steps above in more detail

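4. Proposed Model
• Step 1: embed the input, convolve it, and pass it through a GLU (the decoder side does the same)
• Position Embedding
  – The input $x = (x_1, \dots, x_m)$ is embedded as $w = (w_1, \dots, w_m)$, $w_j \in \mathbb{R}^f$, by the embedding matrix $D \in \mathbb{R}^{V \times f}$. The position embedding $p = (p_1, \dots, p_m)$ is then combined with it to give $e = (w_1 + p_1, \dots, w_m + p_m)$. In the paper the two are added element-wise rather than concatenated.
  – This carries information about which part of the sentence the input or output is currently dealing with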
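
A minimal sketch of this embedding step, assuming the element-wise sum $e_j = w_j + p_j$ described in the paper. The function name `embed_with_positions`, the table sizes, and the random initialisation are all illustrative assumptions; in the real model both tables are learned parameters.

```python
import random

def embed_with_positions(token_ids, word_table, pos_table):
    """Return e_j = w_j + p_j for each position j of the input."""
    if len(token_ids) > len(pos_table):
        raise ValueError("sequence longer than the position table")
    out = []
    for j, tok in enumerate(token_ids):
        w_j = word_table[tok]   # word embedding, looked up by token id
        p_j = pos_table[j]      # position embedding, looked up by index
        out.append([w + p for w, p in zip(w_j, p_j)])
    return out

# Hypothetical tiny setup: vocabulary V=5, embedding size f=4, max length 8.
rng = random.Random(0)
V, f, max_len = 5, 4, 8
word_table = [[rng.uniform(-1, 1) for _ in range(f)] for _ in range(V)]
pos_table = [[rng.uniform(-1, 1) for _ in range(f)] for _ in range(max_len)]

e = embed_with_positions([2, 2, 4], word_table, pos_table)

# The same token at different positions gets different representations.
same_token_differs = e[0] != e[1]
```

Without the position term, a convolution over the sequence would see identical vectors for a repeated word and could not tell where in the sentence it is.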

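4. Proposed Model
• Step 1: embed the input, convolve it, and pass it through a GLU (the decoder side does the same)
• Convolution
  – The input $X \in \mathbb{R}^{k \times d}$ ($k$ input elements of dimension $d$, concatenated) is convolved with a kernel $W \in \mathbb{R}^{2d \times kd}$, $b_w \in \mathbb{R}^{2d}$, giving $Y \in \mathbb{R}^{2d}$.
  – A residual connection is applied at every hidden layer: $h_i^l = v(W^l [h_{i-k/2}^{l-1}, \dots, h_{i+k/2}^{l-1}] + b_w^l) + h_i^{l-1}$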
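
The shape bookkeeping of this convolution (a window of $k$ vectors of size $d$ mapped to one vector of size $2d$) is sketched below in plain Python. The helper `conv1d`, the toy sizes, and the all-ones kernel are assumptions made for illustration. The symmetric zero padding corresponds to the encoder side, where the output length must match the input length.

```python
def conv1d(seq, W, b, k):
    """Map each window of k input vectors (dim d) to one vector (dim 2d).

    seq: list of m vectors of size d
    W:   2d rows, each of length k*d (the kernel, W in R^{2d x kd})
    b:   bias of length 2d
    k:   odd kernel width; the input is zero-padded so the output has length m
    """
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel width")
    d = len(seq[0])
    pad = [[0.0] * d for _ in range(k // 2)]
    padded = pad + seq + pad
    out = []
    for i in range(len(seq)):
        # Concatenate the k vectors in the window into one vector of size k*d.
        window = [v for vec in padded[i:i + k] for v in vec]
        out.append([sum(w * x for w, x in zip(row, window)) + bias
                    for row, bias in zip(W, b)])
    return out

# Hypothetical toy sizes: d=2, k=3, so W is 4 x 6.
d, k = 2, 3
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
# Every weight is 1, so each output entry is the sum over its window.
W = [[1.0] * (k * d) for _ in range(2 * d)]
b = [0.0] * (2 * d)
y = conv1d(seq, W, b, k)
```

The output has twice the input dimension on purpose: the following gating step halves it again, which is what lets the residual addition line up with the layer input.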

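4. Proposed Model
• Step 1: embed the input, convolve it, and pass it through a GLU (the decoder side does the same)
• Gated Linear Units
  – Transforms $Y = [A\ B] \in \mathbb{R}^{2d}$ into $v([A\ B]) = A \otimes \sigma(B) \in \mathbb{R}^{d}$.
  – "σ(B) controls which inputs A of the current context are relevant"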
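
A minimal sketch of the gating step, assuming the convolution output is split into a first half A and a second half B. The function names `glu` and `glu_with_residual` are illustrative only, and the residual helper assumes the layer input already has size d.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glu(y):
    """v([A B]) = A * sigmoid(B): split a 2d vector into halves A and B."""
    if len(y) % 2 != 0:
        raise ValueError("GLU input must have even length (2d)")
    d = len(y) // 2
    a, b = y[:d], y[d:]
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]

def glu_with_residual(y, h_prev):
    """Gate the convolution output y (size 2d) and add the layer input (size d)."""
    return [g + h for g, h in zip(glu(y), h_prev)]

# A = [2.0, 3.0]; B = [0.0, 50.0] -> the gates are 0.5 and (almost) 1.
gated = glu([2.0, 3.0, 0.0, 50.0])
```

In the example, the first component of A is halved while the second passes through almost unchanged, which is the sense in which σ(B) decides how much of each entry of A is let through.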

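4. Proposed Model
• Step 2: compute multi-hop attention
  – decoder state summary $d_i^l$: computed from the current decoder state $h_i^l$ and the previous target element $g_i$
      $d_i^l = W_d^l h_i^l + b_d^l + g_i$
  – attention score $a_{ij}^l$: computed from $d_i^l$ and the output of the last encoder block $z_j^u$
      $a_{ij}^l = \exp(d_i^l \cdot z_j^u) / \sum_{t=1}^{m} \exp(d_i^l \cdot z_t^u)$
  – conditional input $c_i^l$: computed from the attention $a_{ij}^l$ and $z_j^u + e_j$
      $c_i^l = \sum_{j=1}^{m} a_{ij}^l (z_j^u + e_j)$
• z: large input context; e: point information
• z acts as the key and z + e as the value, so this apparently works like a key-value memory network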
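
The key/value reading of this attention (scores from $z$, weighted sum over $z + e$) can be sketched for a single decoder position as follows. The function `attend` and the toy vectors are assumptions, and the decoder state summary `d_i` is taken as given rather than computed from $h_i^l$ and $g_i$.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(d_i, z, e):
    """One attention step for a single decoder position.

    d_i: decoder state summary (size d)
    z:   encoder outputs z_j of the last block (m vectors, the keys)
    e:   input element embeddings e_j (m vectors); z_j + e_j are the values
    Returns (attention weights a_ij, conditional input c_i).
    """
    a = softmax([dot(d_i, z_j) for z_j in z])
    values = [[zj + ej for zj, ej in zip(z_j, e_j)] for z_j, e_j in zip(z, e)]
    c = [sum(a_j * v[t] for a_j, v in zip(a, values)) for t in range(len(d_i))]
    return a, c

# Hypothetical toy inputs: two source positions, dimension 2.
z = [[1.0, 0.0], [0.0, 1.0]]
e = [[0.5, 0.5], [-0.5, 0.5]]
d_i = [10.0, 0.0]          # strongly aligned with z[0]
a, c = attend(d_i, z, e)
```

Because `d_i` points almost entirely at the first key, the conditional input ends up close to the first value `z[0] + e[0]`. In the model this step is repeated in every decoder layer, which is where the "multi-hop" name comes from.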

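4. Proposed Model
• Step 2: compute multi-hop attention
  – Decoder layer $h_i^l$ has access to the attention history of the $k-1$ previous time steps, because the conditional inputs $c_{i-k}^{l-1}, \dots, c_i^{l-1}$ are part of $h_{i-k}^{l-1}, \dots, h_i^{l-1}$, which are the input to $h_i^l$
  – This makes it easy to reflect past attention information
    • with an RNN this information tends to vanish
  – https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/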

4. Proposed Model
• Step 3: predict from the attention-weighted input and the decoder contexts
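
In the paper, this final step turns the top decoder output $h_i^L$ into a distribution over the next target token via a linear layer followed by a softmax, $p(y_{i+1} \mid y_1, \dots, y_i, x) = \mathrm{softmax}(W_o h_i^L + b_o)$. A small sketch under that assumption, with a hypothetical function name and toy weights:

```python
import math

def next_token_distribution(h, W_o, b_o):
    """softmax(W_o h + b_o): one probability per vocabulary entry."""
    logits = [sum(w * x for w, x in zip(row, h)) + b
              for row, b in zip(W_o, b_o)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [v / total for v in exps]

# Hypothetical toy sizes: hidden size 2, vocabulary of 3 tokens.
h = [1.0, -1.0]
W_o = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b_o = [0.0, 0.0, 0.0]
p = next_token_distribution(h, W_o, b_o)

# Index of the most likely next token under this toy distribution.
best = max(range(len(p)), key=p.__getitem__)
```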

5. Experiments & Results
• Translation task
  – Datasets
    • WMT'16 English-Romanian, WMT'14 English-German, WMT'14 English-French
  – Experiment 1: Recurrent vs. Convolutional
    • compared against LSTM-based models, ByteNet, and GNMT
  – Experiment 2: Generation speed vs. GNMT
  – Experiment 3: Effect of individual architecture choices

5. Experiments & Results
• Summarization task
  – Datasets
    • Abstractive summarization (Gigaword corpus)
  – Compare accuracy with RNN SoTA models
    • Shen et al., 2016
    • Suzuki & Nagata, 2017

5. Results: Translation task
• 1. Recurrent vs. Convolutional
  – The proposed model achieves the best BLEU on every dataset

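5. Results: Translation task
• 2. Generation speed vs. GNMT
  – The proposed model on a GPU (K40) is more accurate than GNMT on a GPU (K80) and 9.3x faster
    • a K80 is roughly equivalent to two K40s
    • "We did not have such a GPU available"
  – Widening the beam search width (b) slows generation somewhat, but BLEU improves
  – CPU results are reportedly not comparable because the core counts differ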
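
To make the role of the beam width b concrete, here is a simplified beam search over a toy model. Everything in it is an assumption for illustration (the function names, the `</s>` end-of-sentence token, and the hand-written probability table); it is not the decoding code used in the paper or in fairseq.

```python
import math

def beam_search(step_log_probs, beam_width, max_len, eos):
    """Keep the beam_width best partial hypotheses at every step.

    step_log_probs(prefix) -> dict mapping next token to its log-probability.
    """
    beams = [((), 0.0)]          # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp in candidates[:beam_width]:
            (finished if hyp[0][-1] == eos else beams).append(hyp)
        if not beams:
            break
    finished.extend(beams)       # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

def toy_model(prefix):
    # Hypothetical next-token distributions, keyed by the prefix so far.
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"</s>": math.log(0.5), "c": math.log(0.5)},
        ("b",): {"</s>": 0.0},
        ("a", "c"): {"</s>": 0.0},
    }
    return table[prefix]

greedy = beam_search(toy_model, beam_width=1, max_len=5, eos="</s>")
wide = beam_search(toy_model, beam_width=2, max_len=5, eos="</s>")
```

With b = 1 the search commits to the locally best first token "a" and ends with a sequence of probability 0.3. With b = 2 it also keeps "b" alive and finds the sequence of probability 0.4. The wider beam scores more candidates per step, which is the speed versus quality trade-off the slide refers to.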

5. Results: Translation task
• 3. Effect of position embedding
  – Position embeddings have little effect on the results

5. Results: Translation task
• 3. Effect of multi-step attention
  – Applying attention in every decoder layer works best
  – There is almost no computational overhead

5. Results: Translation task
• 3. Effect of kernel size & depth
  – Narrow kernels with deep stacks work well
  – The encoder can be made quite deep
  – Extra depth in the decoder brings little benefit

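5. Results: Summarization task
• Accuracy
  – The proposed model wins, or at least does not lose by much (apparently)
  – The models it is compared against apply various task-specific processing
    • even so, the proposed model reportedly reaches about the same accuracy without any such modification
  – The same processing could also be applied to the proposed model (apparently)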

6. Conclusion
• Proposes a fully convolutional seq2seq model
  – makes use of mechanisms such as GLU, residual connections, multi-hop attention (and position embeddings)
• Achieves SoTA accuracy among seq2seq models and about 9x the speed of GNMT

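Impressions
• CNNs are impressive
• I had not really understood NMT or attention before, so this was a good learning experience
  – going through the reference materials should make things much clearer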

References
• Gehring, Jonas, et al. "Convolutional Sequence to Sequence Learning." arXiv preprint arXiv:1705.03122 (2017).
• Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).
• Zhang, Xiang, Junbo Zhao, and Yann LeCun. "Character-level convolutional networks for text classification." Advances in neural information processing systems. 2015.
• Bradbury, James, et al. "Quasi-Recurrent Neural Networks." arXiv preprint arXiv:1611.01576 (2016).
• Wu, Yonghui, et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv preprint arXiv:1609.08144 (2016).
• Dauphin, Yann N., et al. "Language Modeling with Gated Convolutional Networks." arXiv preprint arXiv:1612.08083 (2016).
• Shen, Shiqi, et al. "Neural Headline Generation with Sentence-wise Optimization." arXiv preprint arXiv:1604.01904 (2016).
• Suzuki, Jun, and Masaaki Nagata. "Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization." EACL 2017 (2017): 291.

References (further reading)
• Explanation by Facebook AI Research
  – https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/
• General NMT background
  – http://deeplearning.hatenablog.com/entry/neural_machine_translation_theory#seq2seq
• Explanations of GNMT
  – http://smerity.com/articles/2016/google_nmt_arch.html
  – http://www.yasuhisay.info/entry/2016/11/23/000000
• Explanation of residual layers
  – http://terada-h.hatenablog.com/entry/2016/12/13/192940
• Explanations of attention
  – https://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention
  – An interactive explanation: http://distill.pub/2016/augmented-rnns/
