[DL Reading Group] Learning What and Where to Draw (NIPS’16)

February 15, 2017

deep learning

Slide overview

2017/2/15
Deep learning JP:
http://deeplearning.jp/seminar-2/

Deep Learning JP

@DeepLearning2023

Text of each page

Paper reading: Learning What and Where to Draw (NIPS’16) 2017/1/20

Bibliographic information
• Learning What and Where to Draw
• Scott Reed (Google), Zeynep Akata (MPI), Santosh Mohan (umich), Samuel Tenka (umich), Bernt Schiele (MPI), Honglak Lee (umich)
• NIPS’16 (Conference Event Type: Poster)
• https://papers.nips.cc/paper/6111-learning-what-and-where-to-draw

c.f. Generative Adversarial Text to Image Synthesis
• ICML’16
• http://www.slideshare.net/mmisono/generative-adversarial-text-to-image-synthesis

Generative Adversarial What-Where Network (GAWWN)
• A GAN that lets you specify "what" to draw and "where" to draw it
• Inputs: text, plus a bounding box or keypoints

Bounding-box-conditional text-to-image model
1. Convert the text embedding into an M x M x T tensor
2. Normalize it to fit the bounding box; the area outside the box is filled with zeros (masked with 0)
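
The two steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it uses plain array slicing instead of a spatial transformer, and the function name `text_to_bbox_map`, the `(x, y, w, h)` normalized box format, and the default grid size are all assumptions made for the example.

```python
import numpy as np

def text_to_bbox_map(text_emb, bbox, M=16):
    """Place a T-dim text embedding inside a bounding box on an M x M grid.

    bbox = (x, y, w, h) in normalized [0, 1] image coordinates
    (an assumed format). Returns an (M, M, T) array that is zero
    everywhere outside the box.
    """
    text_emb = np.asarray(text_emb)
    T = text_emb.shape[0]
    x, y, w, h = bbox
    # Convert the normalized box to integer grid cells, clipped to the grid.
    x0 = int(np.clip(np.floor(x * M), 0, M - 1))
    y0 = int(np.clip(np.floor(y * M), 0, M - 1))
    x1 = int(np.clip(np.ceil((x + w) * M), x0 + 1, M))
    y1 = int(np.clip(np.ceil((y + h) * M), y0 + 1, M))
    out = np.zeros((M, M, T), dtype=text_emb.dtype)
    # Broadcast the embedding over every cell inside the box.
    out[y0:y1, x0:x1, :] = text_emb
    return out
```

Cells inside the box all carry the same embedding vector, so the downstream network sees the "what" only at the "where".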

Keypoint-conditional text-to-image model
• Keypoints are specified in grid coordinates; each one corresponds to a part such as head or left foot
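
One way to picture the grid encoding is a binary M x M x K tensor with one channel per part. The sketch below is illustrative only: the function name `keypoints_to_grid`, the normalized `(x, y, v)` row format (v being a visible flag), and the grid size are assumptions, not details confirmed by the slides.

```python
import numpy as np

def keypoints_to_grid(keypoints, M=16):
    """Encode K keypoints as an (M, M, K) binary tensor.

    keypoints: rows of (x, y, v) with x, y normalized to [0, 1] and
    v a visible flag (assumed format). Each visible keypoint sets a
    single 1 in its own channel; invisible ones leave the channel zero.
    """
    keypoints = np.asarray(keypoints, dtype=np.float32)
    K = keypoints.shape[0]
    grid = np.zeros((M, M, K), dtype=np.float32)
    for i, (x, y, v) in enumerate(keypoints):
        if v < 0.5:
            continue  # invisible part: channel stays all-zero
        col = min(int(x * M), M - 1)  # clamp x = 1.0 to the last column
        row = min(int(y * M), M - 1)  # clamp y = 1.0 to the last row
        grid[row, col, i] = 1.0
    return grid
```

Keeping one channel per part means the network can tell a head location from a left-foot location even when they fall in the same grid cell.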

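Conditional keypoint generation model
• Entering every keypoint by hand is tedious
• In these experiments, a bird has 15 keypoints
• Here a conditional GAN generates the keypoints
• Keypoints: x, y are the coordinates and v is a visible flag; if v = 0 then x = y = 0
• Generator: uses a mask s that is 1 at the entries corresponding to the keypoints the user specified
• D: trained to output 1 for real keypoints and 0 for synthesized ones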
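
The generator formula itself is not preserved in the slide text. The sketch below assumes a simple gating form, s * k + (1 - s) * f, where k holds the user-specified keypoints and f is the output of some generator network; both this form and the names `gated_keypoints` and `f_out` are assumptions for illustration. The v = 0 implies x = y = 0 convention from the slide is enforced at the end.

```python
import numpy as np

def gated_keypoints(k_user, s, f_out):
    """Combine user-specified keypoints with generated ones.

    k_user, f_out: (K, 3) rows of (x, y, v).
    s: (K,) mask, 1 where the user specified the keypoint.
    f_out stands in for a generator network's output (hypothetical).
    """
    k_user = np.asarray(k_user, dtype=np.float32)
    f_out = np.asarray(f_out, dtype=np.float32)
    s = np.asarray(s, dtype=np.float32)[:, None]
    # Keep user keypoints where s = 1, fill in generated ones where s = 0.
    out = s * k_user + (1.0 - s) * f_out
    # Binarize the visible flag and zero the coordinates of invisible parts.
    visible = (out[:, 2:3] >= 0.5).astype(np.float32)
    return np.concatenate([out[:, :2] * visible, visible], axis=1)
```

With this kind of gating, the user can pin down a few parts (for example beak and tail) and let the model complete the rest.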

Experiments: Dataset
• CUB Birds dataset
  • 200 bird species, 11,788 images
  • Per image: 10 captions, 1 bounding box, 15 keypoints
• MHP
  • 25k images, 410 kinds of activity
  • 3 captions per image
  • 19k images remain after removing those that show multiple people

Experiments: Misc
• Text encoder: char-CNN-GRU
  • Probably the same as in Generative Adversarial Text to Image Synthesis
• Solver: Adam
  • Batch size 16
  • Learning rate 0.0002
• Implementation: torch
  • Spatial transform: https://github.com/qassemoquab/stnbhwd
  • Loosely based on dcgan.torch

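Conditional bird location via bounding boxes
• The text and noise are the same for all three images
• The backgrounds of the three images are similar but not identical
• The bird's orientation stays the same even when the bounding box changes
• z may be carrying the information that cannot be controlled directly, such as background and orientation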

Conditional individual part locations via keypoints
• Keypoints are fixed to the ground truth (not synthesized)
• The noise differs for each example
• The keypoints are invariant to the noise
• The background and similar details change with the noise

Using keypoints condition
• The beak and tail are specified
• All birds face left (c.f. conditioning on a bounding box)

Generating both bird keypoints and images from text alone
• Keypoints are generated from the text alone, and the image is generated afterwards
• Quality drops when all keypoints are generated

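Comparison with prior work
• Prior work captures the text almost accurately, but parts such as the beak are sometimes missing (64x64)
• The proposed method generates mostly accurate images at 128x128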

Generating Human
• Quality is lower than for birds
• Few examples have similar text, and complex poses are difficult (poses at about the level of yoga come out reasonably well)

Summary
• GAWWN: conditions on where to draw using bounding boxes and keypoints
• On the CUB dataset, it can generate high-quality images at 128x128
• Future work
  • Learn object locations in an unsupervised or weakly supervised way
  • Better text-to-human generation

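Impressions
• The novelty lies in how the "where" information is encoded
  • bounding box
  • keypoints
• Text alone leaves too much ambiguity; supplying location information makes the image easier to generate
• The detailed network architecture is hard to judge because the design choices are not explained
• Some more theoretical grounding would be welcome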