[Diffusion Model Study Group] DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

June 04, 24

Deep Learning JP

June 4, 2024, Diffusion Study Group
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
Presenter: 鄭晟徹 (M2, second-year master's student)

Agenda
• Bibliographic information
• Introduction
• Proposed method: DiG
• Experiments
• Summary

Bibliographic information
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
Lianghui Zhu 1,2,◊, Zilong Huang 2, Bencheng Liao 1, Jun Hao Liew 2, Hanshu Yan 2, Jiashi Feng 2, Xinggang Wang 1
1 School of EIC, Huazhong University of Science & Technology
2 ByteDance
Code & Models: hustvl/DiG
• Preprint: https://arxiv.org/abs/2405.18428
• Publication date: 28 May 2024
• Code: https://github.com/hustvl/DiG

Introduction: Problems with ViT-based models
• Because they use self-attention, the computational cost (and GPU memory) grows quadratically with the input sequence length.
• High-resolution image generation and video generation therefore require enormous resources.
• Subquadratic-time methods have appeared: Mamba, RWKV, and the Gated Linear Attention Transformer (GLA).
  ◦ RNN-like architectures combined with hardware-aware algorithms

Introduction: Gated Linear Attention Transformer (GLA)
• Successful in natural language processing.
• Two problems remained for using it in image generation:
  ◦ unidirectional scanning modeling
  ◦ lack of local awareness
• This work proposes a model that resolves them.

Introduction: Contributions of the paper
• Proposes Diffusion GLA (DiG), the first diffusion backbone built on a linear attention Transformer.
• DiG is 2.5× faster than DiT and saves 75.7% of GPU memory at a resolution of 1792 × 1792.
• Extensive experiments on the ImageNet dataset: DiG shows scalable capability and achieves superior performance compared with DiT.

Introduction
[Figure 1, two panels plotted over Resolution (512 to 2048), comparing DiT-S/2, DiS-S/2, and DiG-S/2: (a) Speed Comparison, (b) GPU Memory Comparison. The speed panel includes an OOM (out-of-memory) marker.]
Figure 1: Efficiency comparison among DiT [39], DiS [16], and our DiG model. DiG achieves higher training speed while costs lower GPU memory in dealing with high-resolution images. For example, DiG is 2.5× faster than DiT and saves 75.7% GPU memory with a resolution of 1792 × 1792, i.e., 12544 tokens per image. Patch size for all models is 2.
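
The token count quoted in the Figure 1 caption can be checked with a few lines of Python. This is only a sketch: it assumes a latent autoencoder that downsamples each side by a factor of 8 (consistent with a 32 x 32 x 4 latent for 256 × 256 images, but not stated on this slide), and `num_tokens` and `attention_pairs` are hypothetical helper names.

```python
def num_tokens(image_size, vae_downsample=8, patch_size=2):
    """Tokens per square image; vae_downsample=8 is an assumption."""
    latent_side = image_size // vae_downsample
    return (latent_side // patch_size) ** 2


def attention_pairs(image_size):
    """Query-key pairs scored by full self-attention: quadratic in tokens."""
    return num_tokens(image_size) ** 2


for size in (512, 1024, 1792, 2048):
    print(size, num_tokens(size), attention_pairs(size))
```

Under these assumptions, doubling the side length quadruples the token count and multiplies the number of attention pairs by 16, which is the growth that subquadratic methods aim to avoid.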

Introduction
[Figure 2, two panels plotted over Model Size (S/2, B/2, L/2, XL/2): (a) FPS Comparison w/ image size = 1024, comparing DiS, DiT, Flash-DiT, and DiG; (b) FPS Comparison w/ image size = 2048, comparing DiS, Flash-DiT, and DiG.]
Figure 2: FPS comparison among DiS [16], DiT [39], DiT with Flash Attention-2 (Flash-DiT) [11] and our DiG model varying from different model sizes. We take DiG as a baseline. With a resolution of 1024 × 1024, DiG is 2.0× faster than DiS at small size while 4.2× faster at XL size. Furthermore, DiG-XL/2 is 1.8× faster than the most well-designed high-optimized Flash-DiT-XL/2 with a resolution of 2048 × 2048.

Proposed method
[Architecture diagrams comparing DiT and DiG. In both, the inputs are a Noised Latent (32 x 32 x 4), Timestep t, and Label y; the latent goes through Patchify and Embed, then N blocks, then a final norm and a Linear and Reshape layer that outputs Noise and Σ (each 32 x 32 x 4). An MLP on the Conditioning supplies the scale and shift parameters. Block contents below are listed from input to output as read from the diagrams.]
• Latent Diffusion Transformer, DiT [1] (DiT block with adaLN-Zero): Layer Norm → Scale, Shift (γ1, β1) → Multi-Head Self-Attention → Scale (α1) → Layer Norm → Scale, Shift (γ2, β2) → Pointwise Feedforward → Scale (α2)
• Latent DiG, the proposed method (DiG block): RMS Norm → Scale & Shift (γ1, β1) → Gated Linear Attention → Scale (α1) → RMS Norm → Scale & Shift (γ2, β2) → Feedforward → Scale (α2) → SREM (Spatial Reorient & Enhancement Module)
Figure 4: The overview of the proposed DiG model. The figure presents the whole Latent DiG, DiG block, details of spatial reorient & enhancement module (SREM), and layer-wise DiG scanning directions controlled by the SREM. We mark the scanning order and indices on each patch.
[1] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195-4205, 2023.

Proposed method: Introducing the Gated Linear Attention Transformer (GLA) [2]
• According to the paper, using a linear attention Transformer directly for visual generation degrades performance because of its unidirectional modeling.
Q = XW_Q ∈ R^{L×d_k}, K = XW_K ∈ R^{L×d_k}, V = XW_V ∈ R^{L×d_v}, S_t = Σ_{i=1}^t φ(K_i)V_i
G_t = α_t^T β_t ∈ R^{d_k×d_v}, α = σ(XW_α + b_α)^{1/τ} ∈ R^{L×d_k}, β = σ(XW_β + b_β)^{1/τ} ∈ R^{L×d_v}, (5)
S'_{t-1} = G_t ⊙ S_{t-1} ∈ R^{d_k×d_v}, (6)
S_t = S'_{t-1} + K_t^T V_t ∈ R^{d_k×d_v}, (7)
O_t = Q_t^T S_t ∈ R^{1×d_v}, (8)
R_t = Swish(X_t W_r + b_r) ∈ R^{1×d_v}, (9)
Y_t = (R_t ⊙ LN(O_t))W_O ∈ R^{1×d_v}, (10)
Figure 3: Pipeline of GLA.
[2] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
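
Equations (6) to (8) can be read as a simple recurrence over tokens. The following pure-Python sketch is illustrative only: it takes the per-token gate vectors α_t and β_t as given inputs in [0, 1] rather than computing them from X, treats Q_t, K_t, and V_t as row vectors, and omits the output gate and normalization of Equations (9) and (10). `gla_recurrent` is a hypothetical name, and this is the sequential form, not the chunked hardware-efficient algorithm.

```python
def gla_recurrent(Q, K, V, alpha, beta):
    """Sequential gated linear attention for a single head.

    Q, K, alpha: one length-d_k vector per token.
    V, beta: one length-d_v vector per token.
    Returns one length-d_v output vector per token.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running state, shape d_k x d_v
    outputs = []
    for q, k, v, a, b in zip(Q, K, V, alpha, beta):
        for i in range(d_k):
            for j in range(d_v):
                # Eq. (6): decay the state by G_t = outer(alpha_t, beta_t).
                # Eq. (7): add the outer product of K_t and V_t.
                S[i][j] = a[i] * b[j] * S[i][j] + k[i] * v[j]
        # Eq. (8): read the state out with the query.
        outputs.append(
            [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 2.0], [3.0, 4.0]]
V = [[1.0], [10.0]]
keep_all = ([[1.0, 1.0]] * 2, [[1.0]] * 2)    # gates of 1: nothing is forgotten
forget_all = ([[0.0, 0.0]] * 2, [[1.0]] * 2)  # gates of 0: only the current token
print(gla_recurrent(Q, K, V, *keep_all))
print(gla_recurrent(Q, K, V, *forget_all))
```

With all gates at 1 the recurrence reduces to ungated causal linear attention; with α at 0 each output depends only on the current token. Each token only ever sees earlier tokens, which is the unidirectional behavior the slide refers to.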

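Proposed method: Introducing SREM (Spatial Reorient & Enhancement Module)
• The sequence is reshaped to 2D and passed through a lightweight 3×3 depth-wise convolution (DWConv2d) layer so that local spatial information is perceived.
• With conventional initialization of DWConv2d, the convolution weights are spread over the neighborhood, which slows convergence. To address this, the paper proposes identity initialization: only the center of the convolution kernel is set to 1 and the surrounding entries to 0.
• Finally, every two blocks the 2D token matrix is transposed, and the flattened sequence is flipped, to control the scanning direction of the next block. As the DiG Scanning Directions diagram (lower right of the slide) shows, each layer processes only one scanning direction.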
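
The identity initialization and the scan reorientation can be sketched in pure Python. This is an illustration under assumptions, not the authors' code: it handles a single channel, uses zero padding, and assumes one particular schedule (flip after every block, transpose after every second block), which the slide does not spell out. All function names are hypothetical.

```python
def identity_kernel():
    """3x3 kernel with 1 at the center and 0 elsewhere."""
    return [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]


def dwconv3x3(grid, kernel):
    """3x3 convolution over one channel with zero padding."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += kernel[dr + 1][dc + 1] * grid[rr][cc]
    return out


def transpose_tokens(seq, h, w):
    """Reshape a flat sequence to h x w, transpose it, and flatten again."""
    return [seq[r * w + c] for c in range(w) for r in range(h)]


def scan_orders(h, w, n_blocks):
    """Token order seen by each block under the assumed schedule."""
    seq, orders = list(range(h * w)), []
    for block in range(n_blocks):
        orders.append(seq)
        seq = seq[::-1]                 # flip the flattened sequence
        if block % 2 == 1:              # after every second block, transpose
            seq = transpose_tokens(seq, h, w)
            h, w = w, h
    return orders


grid = [[1.0, 2.0], [3.0, 4.0]]
print(dwconv3x3(grid, identity_kernel()))  # same values as grid
for order in scan_orders(2, 2, 4):
    print(order)
```

At initialization the convolution passes tokens through unchanged, so training starts from the behavior of the block without the convolution. Under the assumed schedule, four consecutive blocks scan in row-major, reversed row-major, column-major, and reversed column-major order, after which the cycle repeats.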

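Proposed method: Analysis of computational efficiency (I have not fully understood this part, so it follows the paper closely)
• To make full use of SRAM and model the sequence in parallel form, DiG follows GLA and splits the whole sequence into many chunks, each of which can be computed entirely in SRAM.
• With chunk size M, the training complexity is O(T/M (M^2 D + MD^2)) = O(TMD + TD^2), which is smaller than the O(T^2 D) complexity of conventional attention when T > D.
T: sequence length, M: chunk size, D: hidden size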
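
The complexity comparison can be made concrete by counting the dominant terms. This sketch assumes T = 12544 (the token count quoted for 1792 × 1792 images), D = 384 (the DiG-S hidden size), and a chunk size of M = 64, which is a hypothetical value chosen only for illustration. Constants and lower-order terms are ignored, and the function names are made up for this example.

```python
def chunked_cost(T, M, D):
    """T / M chunks, each costing M^2 * D + M * D^2 (dominant terms only)."""
    return (T // M) * (M * M * D + M * D * D)


def attention_cost(T, D):
    """Dominant term of conventional self-attention."""
    return T * T * D


T, M, D = 12544, 64, 384  # M = 64 is an assumed chunk size
print(chunked_cost(T, M, D))
print(attention_cost(T, D))
print(attention_cost(T, D) / chunked_cost(T, M, D))
```

When M divides T, the chunked count equals TMD + TD^2 exactly, and the ratio simplifies to T / (M + D), which is 28 for these values. Measured speedups will differ because they depend on constants, memory traffic, and hardware.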

Proposed method: Details of the proposed method
Table 1: Details of DiG models. We follow DiT [39] model configurations for the Small (S), Base (B), Large (L), and XLarge (XL) variants. Given I = 32, p = 4.
Model    Layers N   Hidden Size D   Heads   Parameters (M)   Gflops   Gflops_DiG / Gflops_DiT
DiG-S    12         384             6       31.5             1.09     77.9%
DiG-B    12         768             12      124.6            4.31     77.0%
DiG-L    24         1024            16      443.4            15.54    78.9%
DiG-XL   28         1152            16      644.6            22.53    77.4%

Experiments
[Figure 1, two panels plotted over Resolution (512 to 2048), comparing DiT-S/2, DiS-S/2, and DiG-S/2: (a) Speed Comparison, (b) GPU Memory Comparison.]
Figure 1: Efficiency comparison among DiT [39], DiS [16], and our DiG model. DiG achieves higher training speed while costs lower GPU memory in dealing with high-resolution images. For example, DiG is 2.5× faster than DiT and saves 75.7% GPU memory with a resolution of 1792 × 1792, i.e., 12544 tokens per image. Patch size for all models is 2.

Experiments
[Figure 2, two panels plotted over Model Size (S/2, B/2, L/2, XL/2): (a) FPS Comparison w/ image size = 1024, (b) FPS Comparison w/ image size = 2048.]
Figure 2: FPS comparison among DiS [16], DiT [39], DiT with Flash Attention-2 (Flash-DiT) [11] and our DiG model varying from different model sizes. We take DiG as a baseline. With a resolution of 1024 × 1024, DiG is 2.0× faster than DiS at small size while 4.2× faster at XL size. Furthermore, DiG-XL/2 is 1.8× faster than the most well-designed high-optimized Flash-DiT-XL/2 with a resolution of 2048 × 2048.

Experiments
Table 2: Ablation of the proposed Spatial Reorient & Enhancement Module (SREM). We validate the effectiveness of each SREM component and use the same hyperparameters for all models. The “half right and half wrong symbol” means use DWConv2d without the proposed identity initialization (it appears as ✘ in the DWConv2d column below).
Model     Bidirectional   DWConv2d   Crisscross   Flops (G)   Params (M)   FID-50K
Baseline Method.
DiT-S/2   -               -          -            6.06        33.0         68.4
Ours.
DiG-S/2   -               -          -            4.29        33.0         175.84
DiG-S/2   ✔               -          -            4.29        33.0         69.28
DiG-S/2   ✔               ✘          -            4.30        33.1         96.83
DiG-S/2   ✔               ✔          -            4.30        33.1         63.84
DiG-S/2   ✔               ✔          ✔            4.30        33.1         62.06

Experiments
Table 3: Benchmarking class-conditional image generation on ImageNet 256 × 256. DiG models adopt the same hyperparameters as DiT [39] for fair comparison. We mark the best results in bold.
Model                      FID↓    sFID↓   IS↑      Precision↑   Recall↑
Previous state-of-the-art diffusion methods.
ADM [13]                   10.94   6.02    100.98   0.69         0.63
ADM-U                      7.49    5.13    127.49   0.72         0.63
ADM-G                      4.59    5.25    186.70   0.82         0.52
ADM-G, ADM-U               3.94    6.14    215.84   0.83         0.53
CDM [21]                   4.88    -       158.71   -            -
LDM-8 [46]                 15.51   -       79.03    0.65         0.63
LDM-8-G                    7.76    -       209.52   0.84         0.35
LDM-4-G (cfg=1.25)         3.95    -       178.22   0.81         0.55
LDM-4-G (cfg=1.50)         3.60    -       247.67   0.87         0.48
Baselines and Ours.
DiT-S/2-400K [39]          68.40   -       -        -            -
DiG-S/2-400K               62.06   11.77   22.81    0.39         0.56
DiT-B/2-400K               43.47   -       -        -            -
DiG-B/2-400K               39.50   8.50    37.21    0.51         0.63
DiT-L/2-400K               23.33   -       -        -            -
DiG-L/2-400K               22.90   6.91    59.87    0.60         0.64
DiT-XL/2-400K              19.47   -       -        -            -
DiG-XL/2-400K              18.53   6.06    68.53    0.63         0.64
DiG-XL/2-1200K             11.96   7.39    106.65   0.65         0.67
DiG-XL/2-1200K (cfg=1.5)   2.84    5.47    250.36   0.82         0.56

Experiments
[Figure 5, two panels plotted over Iteration (100K to 400K): (a) Scaling DiG w/ Model Size, with curves for S/2, B/2, L/2, and XL/2; (b) Scaling DiG w/ Patch Size, with curves for DiG-S/8, DiG-S/4, and DiG-S/2.]
Figure 5: The scaling analysis with DiG model sizes and patch sizes.

Experiments
Figure 6: Image results generated from the proposed DiG-XL/2 model.

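Summary and impressions
Summary
• The paper proposes DiG, which incorporates Gated Linear Attention into DiT.
• It is faster than DiT and more efficient in GPU memory.
• Scaling capability is confirmed across model sizes and patch sizes.
• It has not been validated on video generation tasks.
Impressions
• The paper carries out the natural idea that putting linear attention into DiT should work well.
• Could this become a major breakthrough for video generation?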