[DL Reading Group] Diffusion Models Without Attention

January 26, 24

#Diffusion Models #Attention #Efficient State Space Models #S4D #Image Generation

Deep Learning JP

@DeepLearning2023

DEEP LEARNING JP [DL Papers] Diffusion Models Without Attention Yuta Oshima, Matsuo Lab http://deeplearning.jp/

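Diffusion Models Without Attention
Bibliographic information
Authors
• Jing Nathan Yan*, Jiatao Gu*, Alexander M. Rush (*Equal contribution)
Overview
• Conventional diffusion models rely on attention, which is computationally expensive.
• Replacing attention with an SSM reduces the computational cost.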

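Background: Denoising Diffusion Probabilistic Model (DDPM) [1]
• Gaussian noise is gradually added to the data x_0 until it becomes pure Gaussian noise x_T ~ N(0, I) (the diffusion process).
• Generation traces the diffusion process backwards (the reverse process). The reverse process is modeled with a neural network.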
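As a minimal sketch of the diffusion process described above, the following samples x_t directly from x_0 using the closed form from DDPM [1], x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε. The linear beta schedule (1e-4 to 0.02 over 1000 steps) follows the DDPM paper; the function names and the 8x8 toy "image" are assumptions made for illustration and do not come from the slides.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in one step."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))     # toy stand-in for an image (assumption)
x_early = q_sample(x0, 10, alpha_bar, rng)   # still close to x0
x_late = q_sample(x0, 999, alpha_bar, rng)   # almost pure Gaussian noise
```

Because ᾱ_t decreases monotonically toward zero, the signal term fades and the noise term dominates as t approaches T.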

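Background: Denoising Diffusion Probabilistic Model (DDPM) [1]
• In practice a U-Net (left figure) [2] is used, but inserting spatial attention (right figure) [3] before the convolution layers at each resolution is necessary for high generation quality.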

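Background: Scalable Diffusion Models with Transformers (DiT) [4]
• A diffusion model built from a Vision Transformer (ViT).
• Motivated by the scalability of ViT, in addition to the contribution of attention to generation quality noted earlier.
• Patchification lets it handle high-resolution images efficiently, but attention still costs quadratically in the number of patches.
• The baseline for this work.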
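To make the quadratic cost concrete, here is a small arithmetic sketch that counts tokens and token pairs for several patch sizes. The 32x32 input grid and the helper names are hypothetical and chosen only for illustration.

```python
def num_patches(height, width, patch_size):
    """Number of tokens after cutting a grid into non-overlapping P x P patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("patch_size must divide both height and width")
    return (height // patch_size) * (width // patch_size)

def attention_pairs(n_tokens):
    """Self-attention scores every token against every token."""
    return n_tokens * n_tokens

for p in (1, 2, 4, 8):
    n = num_patches(32, 32, p)   # 32x32 grid is an assumed example size
    print(f"P={p}: {n} tokens, {attention_pairs(n)} attention pairs")
```

Doubling P divides the token count by 4 and the pair count by 16, which is why patchification helps so much, and also why removing it is costly for attention.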

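Background: Efficient State Space Models (SSMs) [5] [6]
• Meanwhile, SSMs have recently become an active research area as a promising alternative to attention.
• In particular, careful initialization and training have enabled them to capture long-range dependencies in sequences (e.g., S4 [5], S4D [6]) [7]. For details, see the reading group materials by Nonaka-san and Iwasawa-san.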

https://www.docswell.com/s/DeepLearning2023/5VV8M3-dlefficiently-modeling-long-sequences-with-structured-state-spaces

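Background: Efficient State Space Models (SSMs) [5] [6]
• SSMs such as S4 are attracting interest because, despite their strong ability to capture long-range dependencies, computation along the sequence dimension can be parallelized, and both compute and memory costs are low.
• This work uses S4D [6], one such SSM, to propose a diffusion model that does not use attention.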
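The parallelism mentioned above comes from the fact that a linear state space recurrence can also be written as a convolution. The sketch below shows this for a simplified discrete diagonal SSM with real-valued parameters. It is not the S4D parameterization or initialization from [6]; the parameter values and function names are assumptions for illustration.

```python
import numpy as np

def ssm_recurrent(a, b, c, u):
    """Sequential scan: x_k = a * x_{k-1} + b * u_k, y_k = sum(c * x_k)."""
    x = np.zeros_like(a)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a * x + b * u_k
        y[k] = np.dot(c, x)
    return y

def ssm_kernel(a, b, c, length):
    """Convolution kernel K[j] = sum_n c_n * a_n**j * b_n."""
    powers = a[None, :] ** np.arange(length)[:, None]   # shape (length, N)
    return powers @ (b * c)

def ssm_convolutional(a, b, c, u):
    """Same output as the scan, computed as one causal convolution via FFT."""
    length = len(u)
    kernel = ssm_kernel(a, b, c, length)
    n_fft = 2 * length   # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(kernel, n_fft) * np.fft.rfft(u, n_fft), n_fft)
    return y[:length]

rng = np.random.default_rng(0)
state_size, seq_len = 4, 64                   # toy sizes (assumption)
a = rng.uniform(0.5, 0.99, state_size)        # stable diagonal state matrix
b = rng.standard_normal(state_size)
c = rng.standard_normal(state_size)
u = rng.standard_normal(seq_len)

y_scan = ssm_recurrent(a, b, c, u)
y_conv = ssm_convolutional(a, b, c, u)
```

The scan visits the sequence one step at a time, whereas the convolutional form handles all positions at once, which is what makes training-time computation parallel along the sequence.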

Method: Diffusion State Space Model (DiffuSSM)
• DiffuSSM Blocks are stacked, in the same way as Transformer Blocks in ViT.
• The image is split into patches before being fed in. Let P denote the patch size.
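A minimal sketch of the patchification step, assuming non-overlapping P x P patches flattened in row-major order. The array layout (H, W, C) and the helper names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def patchify(image, patch_size):
    """(H, W, C) -> (H/P * W/P, P*P*C), patches ordered row by row."""
    h, w, c = image.shape
    p = patch_size
    if h % p or w % p:
        raise ValueError("patch_size must divide both height and width")
    grid = image.reshape(h // p, p, w // p, p, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (H/P, W/P, P, P, C)
    return grid.reshape((h // p) * (w // p), p * p * c)

def unpatchify(tokens, height, width, patch_size):
    """Inverse of patchify: (H/P * W/P, P*P*C) -> (H, W, C)."""
    p = patch_size
    c = tokens.shape[1] // (p * p)
    grid = tokens.reshape(height // p, width // p, p, p, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (H/P, P, W/P, P, C)
    return grid.reshape(height, width, c)

image = np.arange(32 * 32 * 4, dtype=np.float64).reshape(32, 32, 4)  # toy input
tokens = patchify(image, patch_size=2)            # sequence of 256 tokens, 16 values each
restored = unpatchify(tokens, 32, 32, patch_size=2)
```

With P=1 every pixel (or latent position) becomes its own token, which gives the longest sequence and is the regime where the cost of the sequence mixer matters most.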

Method: Diffusion State Space Model (DiffuSSM)
• DiffuSSM Blocks are stacked, in the same way as Transformer Blocks in ViT.
• S4D is made bidirectional.
• Where an MLP is used, Down-scale and Up-scale layers shrink the sequence along its length. Let M denote the scaling ratio.
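The two ingredients above can be sketched as follows. This is a toy illustration only: the scan is a plain per-channel linear recurrence rather than S4D, the down/up-scaling is done by reshaping M adjacent tokens into one, and the way `toy_block` composes the pieces is a hypothetical arrangement, not the block defined in the paper.

```python
import numpy as np

def causal_scan(a, u):
    """Toy per-channel recurrence h_k = a * h_{k-1} + u_k over a (L, D) sequence."""
    h = np.zeros(u.shape[1])
    out = np.empty_like(u)
    for k in range(u.shape[0]):
        h = a * h + u[k]
        out[k] = h
    return out

def bidirectional_scan(a_fwd, a_bwd, u):
    """Sum of a left-to-right scan and a right-to-left scan."""
    forward = causal_scan(a_fwd, u)
    backward = causal_scan(a_bwd, u[::-1])[::-1]
    return forward + backward

def downscale(tokens, m):
    """(L, D) -> (L/M, M*D): merge every M adjacent tokens (assumed scheme)."""
    length, dim = tokens.shape
    return tokens.reshape(length // m, m * dim)

def upscale(tokens, m):
    """(L/M, M*D) -> (L, D): inverse of downscale."""
    length, dim = tokens.shape
    return tokens.reshape(length * m, dim // m)

def toy_block(tokens, m, w_dense, a_fwd, a_bwd):
    """Hypothetical composition: dense layer on the shortened sequence, then a
    bidirectional scan at full length, with a residual connection."""
    short = np.maximum(downscale(tokens, m) @ w_dense, 0.0)   # ReLU stand-in for the MLP
    return tokens + bidirectional_scan(a_fwd, a_bwd, upscale(short, m))

rng = np.random.default_rng(0)
seq_len, dim, m = 16, 8, 4                                    # toy sizes (assumption)
tokens = rng.standard_normal((seq_len, dim))
w_dense = rng.standard_normal((m * dim, m * dim)) / np.sqrt(m * dim)
a_fwd = rng.uniform(0.8, 0.95, dim)
a_bwd = rng.uniform(0.8, 0.95, dim)
out = toy_block(tokens, m, w_dense, a_fwd, a_bwd)
```

A causal scan at position 0 sees only the first token, while the bidirectional version also receives information from the end of the sequence. This matters for images, where a flattened patch sequence has no natural left-to-right order.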

Method: Diffusion State Space Model (DiffuSSM)
• Computational cost is compared in FLOPs.
• With patch size P=2, DiT scales.
• Without patchification, however, DiffuSSM scales better.
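The trend can be illustrated with asymptotic proxies for the token-mixing cost: roughly L^2 * D for self-attention and L * log2(L) * D for an FFT-based SSM convolution, where L is the sequence length and D the channel width. Constant factors are dropped and these are not the FLOP figures reported in the paper; the grid sizes and function names are assumptions.

```python
import math

def attention_cost(n_tokens, dim):
    """Proxy for self-attention token mixing: grows as L^2 * D."""
    return n_tokens ** 2 * dim

def ssm_fft_cost(n_tokens, dim):
    """Proxy for an FFT-based SSM convolution: grows as L * log2(L) * D."""
    return n_tokens * math.log2(n_tokens) * dim

def tokens_for(side, patch_size):
    """Sequence length for a square side x side grid cut into P x P patches."""
    return (side // patch_size) ** 2

for patch_size in (2, 1):
    for side in (32, 64):            # assumed grid sizes, for illustration only
        n = tokens_for(side, patch_size)
        ratio = attention_cost(n, 1) / ssm_fft_cost(n, 1)
        print(f"P={patch_size}, grid={side}x{side}: L={n}, attention/SSM proxy ratio={ratio:.1f}")
```

The ratio equals L / log2(L), so it grows as the sequence gets longer. Under these proxies, dropping patchification (P=1) or raising the resolution widens the gap in favor of the SSM.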

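Experiments: Conditional generation performance
• Evaluated with DiffuSSM-XL, which stacks 29 DiffuSSM Blocks.
• The computational cost is successfully reduced.
• At 256x256 it outperforms DiT on several metrics; at 512x512 it achieves competitive results with less training.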

Experiments: Unconditional generation performance
• Here too, it achieves better generation performance than existing baselines.

Experiments: Ablation
• Qualitative results for DiffuSSM when varying the patch size P and the sequence scaling ratio M. Increasing P or M visibly distorts the images.

Experiments: Ablation
• Quantitative evaluation of DiffuSSM when varying the S4D hidden dimension, P, and M.

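Summary
• DiffuSSM is a diffusion model that does not require attention.
• It can handle long-range hidden states without resorting to representation compression.
• At 256x256, DiffuSSM outperforms DiT with fewer FLOPs. At higher resolution as well, it achieves competitive results with less training.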

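Impressions
Points of interest
• A diffusion model can be built from SSMs (that is, SSMs can model images).
• The use of bidirectional SSMs suggests that this is probably difficult with unidirectional SSMs.
• The DiffuSSM architecture follows work that applies SSMs to language processing [7]. That a similar architecture works for both language and images suggests SSMs may be more broadly applicable.
Points for further consideration
• I would have liked explicit ablations on points such as the above.
• Since this is probably the first work to use SSMs in a diffusion model, more insight into the architecture would have been welcome.
• In the 512x512 setting it does not win experimentally, so scalability to higher resolutions may be a weakness.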

References
[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[2] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Part III, pp. 234–241. Springer International Publishing, 2015.
[3] Shen et al. Efficient attention: Attention with linear complexities. arXiv preprint arXiv:1812.01243, 2018.
[4] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
[5] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
[6] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.
[7] Junxiong Wang, Jing Nathan Yan, Albert Gu, and Alexander M. Rush. Pretraining without attention. arXiv preprint arXiv:2212.10544, 2022.