[DL Reading Group] Generative Models of Visually Grounded Imagination


June 02, 17

Slide overview

2017/6/2
Deep Learning JP:
http://deeplearning.jp/seminar-2/


Text of each slide
1.

Generative Models of Visually Grounded Imagination
Presenter: Masahiro Suzuki, 2017/06/02

2.

About this paper
¤ Authors: Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, Kevin Murphy
  ¤ The first author is at Virginia Tech; the others are at Google.
  ¤ Murphy is also well known as the author of "Machine Learning".
¤ Posted to arXiv on 2017/05/30; under submission to NIPS.
¤ Contributions of this paper:
  ¤ A deep generative model for attribute-to-image generation in which the inference distribution given attributes is represented as a product of experts.
  ¤ New criteria (the 3 C's) for measuring how well attribute-to-image generation works.
¤ Why I picked it:
  ¤ The proposed method is almost the same as JMVAE [Suzuki+ 2017] (which is cited and compared against in the paper).
  ¤ Grounding images and attribute information is an interesting topic (in fact, it is my own research topic).

3.

Background
¤ When people hear "big bird", they imagine some image.
¤ Those images vary, because details that the text does not specify get filled in. (big bird)
¤ Adding features, as in "big red bird", defines the concept more precisely. (big red bird)

4.

Attribute-based concept description
¤ There are many ways to describe a concept; this work focuses on attribute-based descriptions.
¤ A concept is defined by one or more attributes.
¤ Attributes can be combined, and organized into a hierarchy (a compositional concept hierarchy), to specify novel concepts.
  ¤ For instance, we may never have seen a bird with "size": small, "color": white, but we can easily imagine such a concept.
  ¤ Each node of the hierarchy is a concept (a partial specification of attribute values); children refine it by specifying more attributes, parents abstract it by unspecifying attributes, and leaf nodes are maximally specific (concrete) concepts.
  ¤ Unlike standard concept hierarchies such as WordNet (the basis of ImageNet), it is created algorithmically by combining primitive attributes in all combinatorially possible ways, "making infinite use of finite means" (Chomsky, 1965).
  (Figure 1: a compositional concept hierarchy for birds derived from two independent attributes, size and color; moving up the graph abstracts a concept, moving down refines it.)
¤ Given a linguistic description, the proposed model "imagines" a concept from this hierarchy and "translates" that internal representation into an image.
¤ The paper calls this ability visually grounded (semantic) imagination.
¤ A good method should satisfy three criteria, the 3 C's, described on the next slide: correctness (match all mentioned attributes), coverage (be indifferent to the unspecified attributes), and compositionality.

5.

The 3 C's: criteria for good "imagination" and "translation"
¤ Correctness:
  ¤ Generated images must contain the specified attributes.
  ¤ Example: if we specify "red bird", we want only images of red birds to be generated.
¤ Coverage:
  ¤ Generated images should be indifferent to the unspecified attributes and should cover all the variation within the concept's set.
  ¤ Example: if we specify "red bird", we want a variety of red birds to be generated, small ones, large ones, and so on.
¤ Compositionality:
  ¤ We want to imagine new concepts by adding or removing attributes we have already learned, and also to generalize from concrete concepts to more abstract ones.
  ¤ Example: having seen big red birds and small white birds, we can imagine big white birds and small red birds; having seen big red birds and big white birds, we can learn to remove the bird's color and imagine what a big bird looks like.

6.

Variational Autoencoder (VAE)
¤ Variational autoencoder [Kingma+ 13]
¤ Models the marginal distribution p(x).
¤ p_θ(x|z) and q_φ(z|x) are each defined by multilayer neural networks.
¤ The lower bound L(x) is given by
  L(x) = -KL(q_φ(z|x), p(z)) + E_{q_φ(z|x)}[log p_θ(x|z)]
  and is maximized.
¤ This work extends the VAE so that it can handle images and attributes.
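A minimal PyTorch sketch of this bound, assuming hypothetical `encoder` / `decoder` callables (the encoder returns the mean and log-variance of q_φ(z|x), the decoder returns Bernoulli logits for p_θ(x|z)); the names and shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def vae_elbo(x, encoder, decoder):
    """One-sample Monte Carlo estimate of
    L(x) = -KL(q(z|x) || p(z)) + E_{q(z|x)}[log p(x|z)]."""
    mu, logvar = encoder(x)                                  # parameters of q(z|x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterization trick
    logits = decoder(z)                                      # parameters of p(x|z)
    log_px = -F.binary_cross_entropy_with_logits(
        logits, x, reduction="none").sum(dim=-1)             # log p(x|z) for Bernoulli pixels
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)  # KL(q(z|x) || N(0, I))
    return (log_px - kl).mean()                              # maximize this bound
```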

7.

Proposed method
¤ Let x be the input image, y the attributes, and z their shared intermediate representation (latent variable).
¤ Model their multimodal joint distribution (a Joint VAE): the decoders p_θ(x|z) and p_θ(y|z) with prior p(z), and the inference networks q_φ(z|x), q_φ(z|y), q_φ(z|x, y).
¤ Features:
  ¤ Three approximate posteriors q(z|x, y), q(z|x), q(z|y) are introduced so that inference is possible even when a modality is missing.
  ¤ p(y|z) is factorized as p(y|z) = ∏_k p(y_k|z), i.e., the attributes are conditionally independent given z.
  ¤ The triple ELBO is proposed as the lower bound.
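A small sketch of the factorized attribute decoder p(y|z) = ∏_k p(y_k|z), with one independent head per attribute. The attribute names and sizes are the MNIST-a ones mentioned later, and the two-layer 128-unit MLP follows the appendix description, but the class itself is only illustrative:

```python
import torch.nn as nn

class AttributeDecoder(nn.Module):
    """p(y|z) = prod_k p(y_k|z): each attribute gets its own softmax head,
    so the attributes are conditionally independent given z."""
    def __init__(self, z_dim, attr_sizes):
        super().__init__()
        # e.g. attr_sizes = {"class": 10, "scale": 2, "orientation": 3, "location": 4}
        self.heads = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, n))
            for name, n in attr_sizes.items()
        })

    def forward(self, z):
        # one set of logits per attribute; each defines p(y_k|z)
        return {name: head(z) for name, head in self.heads.items()}
```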

8.
Product of experts
¤ We also want to perform inference when some of the attributes are missing.
¤ However, if we tried to build a separate inference distribution for every possible pattern of observed attributes, we would need 2^|𝒜| of them for the attribute set 𝒜!
¤ This work therefore proposes a method based on a product of experts (PoE) [Hinton, 2002].
¤ The approximate posterior given attributes has the form
  q(z|y_𝒪) ∝ p(z) ∏_{k∈𝒪} q(z|y_k),
  where q(z|y_k) = N(z | μ_k(y_k), C_k(y_k)) is the k-th Gaussian "expert", p(z) = N(z | 0, I) is the prior, and 𝒪 ⊆ 𝒜 is the set of observed attributes.
¤ If no attributes are specified, the posterior equals the prior.
¤ As more attributes are conditioned on, the posterior becomes narrower, specifying a more precise concept.
¤ Unlike [Hinton, 2002], this work always includes the "universal expert" p(z) = N(0, I) in the product.
¤ The resulting posterior is available in closed form: the (normalized!) product of all the Gaussian experts is q(z|y_𝒪) = N(z | μ, C), with C^{-1} = Σ_k C_k^{-1} and μ = C (Σ_k C_k^{-1} μ_k), where the sum runs over all observed attributes plus the 0 term corresponding to p(z).
¤ (The slide also quotes the paper's discussion of likelihood scaling: for a uni-modality VAE, scaling the likelihood term by λ_x is equivalent to the β-VAE of (Higgins et al., 2017) with β = 1/λ_x, where β > 1 forces the posterior closer to the N(0, I) prior and encourages a "disentangled" representation. In the multi-modality case the scaling terms also control how much each modality influences the latent space; since an image is more informative than an attribute, one needs λ_y > 1 and λ_x ≤ 1, as discussed on the next slide.)
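A sketch of this Gaussian product for diagonal covariances; the per-attribute means and log-variances are assumed to come from expert networks not shown here:

```python
import torch

def poe_posterior(mus, logvars, z_dim, batch_size=1):
    """q(z|y_O) ∝ p(z) * prod_{k in O} q(z|y_k) for diagonal Gaussian experts.
    mus / logvars: lists of (batch_size, z_dim) tensors, one per observed attribute.
    The universal expert p(z) = N(0, I) is always included, so with no observed
    attributes the result is exactly the prior."""
    prior_mu = torch.zeros(batch_size, z_dim)
    prior_logvar = torch.zeros(batch_size, z_dim)            # log(1) = 0
    all_mu = [prior_mu] + list(mus)
    all_logvar = [prior_logvar] + list(logvars)
    precision = torch.stack([lv.neg().exp() for lv in all_logvar]).sum(dim=0)   # sum_k C_k^{-1}
    var = 1.0 / precision                                                       # C
    mu = var * torch.stack([lv.neg().exp() * m
                            for m, lv in zip(all_mu, all_logvar)]).sum(dim=0)   # C * sum_k C_k^{-1} mu_k
    return mu, var.log()
```

With an empty expert list this returns N(0, I) exactly; each additional expert adds precision, so the posterior narrows as more attributes are specified, matching the behaviour described above.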

9.

Triple ELBO
¤ Since p(x, y|z) = p(x|z)p(y|z), observing only an attribute (and no image) makes the conditional likelihood p(y|z), and observing only an image makes it p(x|z).
¤ Each of the three inference networks outputs two vectors: the posterior mean of z and its posterior covariance (assumed diagonal).
¤ To learn the generative distribution together with the three inference distributions, this work introduces the following lower bound for a single datapoint, called the triple ELBO:
  L(x, y) = E_{q(z|x,y)}[λ^xy_x log p(x|z) + λ^yx_y log p(y|z)] - KL(q(z|x,y), p(z))
          + E_{q(z|x)}[λ^x_x log p(x|z)] - KL(q(z|x), p(z))
          + E_{q(z|y)}[λ^y_y log p(y|z)] - KL(q(z|y), p(z))
  where λ^xy_x, λ^yx_y, λ^x_x, λ^y_y are scaling parameters chosen to balance the likelihood contribution of each term; the same decoders p(x|z) and p(y|z) are used in all terms.
¤ Justification: we want the region of latent space occupied by each attribute vector y, i.e. q(z|y), to match the "empirical posterior" q(z|D_xy) = (1/|D_xy|) Σ_{x∈D_xy} q(z|x, y) over the corresponding (x, y) pairs, so that it covers all examples of the concept seen so far.
¤ At the same time, we do not want q(z|y) to put mass only on the observed examples (or worse, on a single observed example), so we also want it to be as entropic as possible, i.e. close to the N(0, I) prior.
¤ The scaling parameters control how strongly each modality is regularized toward the prior and how much it influences the latent space.
¤ Since an image carries more information than an attribute, λ_x is set small.
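A rough sketch of this objective, reusing the assumptions of the earlier snippets. The `encoders` / `log_lik` dictionaries and the lambda key names are made up for illustration, and each encoder is assumed to accept (x, y) and ignore the modality it does not condition on:

```python
import torch

def kl_to_std_normal(mu, logvar):
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)

def triple_elbo(x, y, encoders, log_lik, lam):
    """Triple ELBO sketch: one ELBO-like term per inference network
    q(z|x,y), q(z|x), q(z|y), with likelihoods rescaled by the lambdas,
    e.g. lam = {"xy_x": 1.0, "xy_y": 50.0, "x_x": 1.0, "y_y": 50.0}."""
    terms = {"xy": (True, True), "x": (True, False), "y": (False, True)}
    total = 0.0
    for name, (use_x, use_y) in terms.items():
        mu, logvar = encoders[name](x, y)                          # q(z|.)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterized sample
        ll = 0.0
        if use_x:
            ll = ll + lam[name + "_x"] * log_lik["x"](z, x)        # lambda * log p(x|z)
        if use_y:
            ll = ll + lam[name + "_y"] * log_lik["y"](z, y)        # lambda * log p(y|z)
        total = total + ll - kl_to_std_normal(mu, logvar)
    return total.mean()                                            # maximize
```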

10.

Related work: conditional models
¤ Conditional VAEs and conditional GANs: q_φ(z|x, y) and p_θ(x|y, z), i.e. a stochastic mapping from attributes y to images x.
¤ Because x and y are treated asymmetrically, generation cannot be done in both directions.
¤ They cannot handle missing inputs, so they cannot generate partially abstract concepts such as "big bird" as opposed to "big red bird".

11.

Related work: joint-distribution VAEs
¤ JMVAE [Suzuki+ 2017] and the triple ELBO use the same model; only the lower bound differs.
¤ The JMVAE lower bound:
  E_{q(z|x,y)}[log p(x, y|z)] - KL(q(z|x,y), p(z)) - αKL(q(z|x,y), q(z|y)) - αKL(q(z|x,y), q(z|x))
¤ Because of the third term (the KL between q(z|x,y) and the single-modality posterior), JMVAE's coverage could in principle drop.
¤ Experimentally, however, this problem does not arise.
¤ The bi-VCCA [Wang et al., 2016] lower bound:
  μ (E_{q(z|x)}[log p(x, y|z)] - KL(q(z|x), p(z))) + (1 - μ) (E_{q(z|y)}[log p(x, y|z)] - KL(q(z|y), p(z)))
¤ Its problem is the E_{q(z|y)}[log p(x, y|z)] term: a single y has to "explain" all the different x's it is matched with, so it ends up associated with their average, producing blurry samples and low correctness. Increasing μ compensates only partially, since it weakens the KL(q(z|y), p(z)) penalty needed for q(z|y) to be broad (good coverage). See the paper for details.
¤ The main difference between the proposal and previous joint models is the triple ELBO objective, i.e. how the single-modality inference networks q(z|x) and q(z|y) are trained.
¤ (The slide also quotes the paper's remarks that, with powerful decoders and unaligned multi-modality data, VAEs may learn to ignore z, which cannot happen with paired data because the correlation between modalities can only be explained through the shared latent factors, and that conditional models handle missing inputs only through heuristics such as a special "UNK" value or imputation x̂(y_𝒪) = argmax_x [log p(x|y) + log p(y|y_𝒪)].)
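The extra JMVAE terms are KL divergences between two diagonal Gaussian posteriors, which have a closed form; a sketch (mean / log-variance parameterization of both distributions assumed, inputs are torch tensors of shape (batch, z_dim)):

```python
def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ) for diagonal
    Gaussians, e.g. KL(q(z|x,y) || q(z|x)) in the JMVAE objective."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q - logvar_p).exp()
        + (mu_q - mu_p).pow(2) / logvar_p.exp()
        - 1.0
    ).sum(dim=-1)
```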

12.

VAE and friends
Table 1: Summary of VAE variants. x represents some form of image and y some form of annotation; scaling factors for the ELBO terms are omitted for notational simplicity. The notation is elbo(a|b; c|d; c|e) = E_{q(c|d)}[log p(a|b)] - KL(q(c|d), p(c|e)); if p(c|e) is omitted, the unconditional prior p(c) is used. The objective in (Pandey and Dukkipati, 2017) cannot be expressed in this notation, since it does not correspond to a log likelihood of their model, even after rescaling.

Name | Ref | Model | Objective
VAE | (Kingma et al., 2014) | p(z)p(x|z) | elbo(x|z; z|x)
triple ELBO | This paper | p(z)p(x|z)p(y|z) | elbo(x,y|z; z|x,y) + elbo(x|z; z|x) + elbo(y|z; z|y)
JMVAE | (Suzuki et al., 2017) | p(z)p(x|z)p(y|z) | elbo(x,y|z; z|x,y) - αKL(q(z|x,y), q(z|x)) - αKL(q(z|x,y), q(z|y))
bi-VCCA | (Wang et al., 2016) | p(z)p(x|z)p(y|z) | μ elbo(x,y|z; z|x) + (1-μ) elbo(x,y|z; z|y)
JVAE-Pu | (Pu et al., 2016) | p(z)p(x|z)p(y|z) | elbo(x,y|z; z|x) + elbo(x|z; z|x)
JVAE-Kingma | (Kingma et al., 2014) | p(z)p(y)p(x|z,y) | elbo(x|y,z; z|x,y) + log p(y)
CVAE-Yan | (Yan et al., 2016) | p(z)p(x|y,z) | elbo(x|y,z; z|x,y)
CVAE-Sohn | (Sohn et al., 2015) | p(z|x)p(y|x,z) | elbo(y|x,z; z|x,y; z|x)
CMMA | (Pandey et al., 2017) | p(z|y)p(x|z) | See text.

(Figure 2: graphical-model summary of the different joint VAEs; circles are random variables, downward arrows the generative process, dotted upward arrows the inference process, black squares the "inference factors".)

13.

Related work: disentangled representations
¤ β-VAE and InfoGAN can obtain disentangled representations without using attribute information.
¤ β-VAE obtains them by increasing the coefficient on the KL term.
  (Figure: latent traversals showing rotation, translation, and color being disentangled in the generated data.)
¤ However, it is not clear how these methods learn semantic structure.
¤ To obtain good representations, some labels or attributes are necessary [Soatto and Chiuso, 2016].
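For reference, the only change β-VAE makes to the standard bound is a coefficient β > 1 on the KL term; a tiny sketch with hypothetical inputs (torch tensors):

```python
def beta_vae_bound(log_px, mu, logvar, beta=4.0):
    """log_px: log p(x|z) per example; mu, logvar: parameters of q(z|x).
    A larger beta pulls q(z|x) harder toward the N(0, I) prior."""
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    return (log_px - beta * kl).mean()
```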

14.

Experiments
¤ Experiments on datasets based on MNIST.
¤ MNIST-2bit: MNIST images tagged with "small or large" and "even or odd".
¤ MNIST-a (MNIST with attributes): MNIST images that are affine-transformed and tagged with class label, location, orientation, and scale.
  (Figure 11: example binary MNIST-a images, e.g. "6, small, upright, top-right", "1, big, counter-clockwise, bottom-left", "9, big, clockwise, top-right".)
¤ Evaluation uses the 3 C's (formalized on the next slide).
¤ Architecture details (from the appendix):
  ¤ Image decoder p(x|z): the standard DCGAN architecture of (Radford et al., 2015), taking the latent state of the VAE as input.
  ¤ Label decoder p(y|z): factorized as p(y|z) = ∏_{k∈𝒜} p(y_k|z), each p(y_k|z) a two-layer MLP with 128 hidden units, optionally with L1 regularization on the first layer.
  ¤ Image-and-label encoder q(z|x, y): images and labels are processed separately (convolutions with 32, 64, 128, 16 feature maps and strides 1, 2, 2, 2, with batch normalization), then concatenated and passed through an MLP.
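The exact MNIST-a generation recipe is not given on the slide; the sketch below only illustrates how such attribute tags could be mapped to an affine transform of a 64x64 digit image (the parameter values are guesses, not the paper's):

```python
import random
from torchvision.transforms import functional as TF

SCALES = {"small": 0.5, "big": 0.9}
ORIENTATIONS = {"counter-clockwise": 45, "upright": 0, "clockwise": -45}   # degrees
LOCATIONS = {"top-left": (-16, -16), "top-right": (16, -16),
             "bottom-left": (-16, 16), "bottom-right": (16, 16)}           # pixel offsets

def make_mnist_a_example(digit_img, class_label):
    """digit_img: a 64x64 PIL image (or CxHxW tensor) with a centered MNIST digit."""
    scale = random.choice(list(SCALES))
    orientation = random.choice(list(ORIENTATIONS))
    location = random.choice(list(LOCATIONS))
    img = TF.affine(digit_img,
                    angle=ORIENTATIONS[orientation],
                    translate=list(LOCATIONS[location]),
                    scale=SCALES[scale],
                    shear=0.0)
    attrs = {"class": class_label, "scale": scale,
             "orientation": orientation, "location": location}
    return img, attrs
```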

15.
Formalizing the 3 C's
¤ Rather than checking internal properties such as "disentanglement", the evaluation uses externally observable data: given a description y_𝒪, the model generates a set of N images S(y_𝒪) = {x^(n) ~ p(x|y_𝒪) : n = 1..N}, and a separately trained multi-label classifier (the "observation classifier") predicts an attribute vector ŷ(x) for each generated image.
¤ Correctness: the fraction of specified attributes that each generated image matches,
  correctness(S, y_𝒪) = (1/|S|) Σ_{x∈S} (1/|𝒪|) Σ_{k∈𝒪} I(ŷ_k(x) = y_k)
  ¤ where ŷ is the prediction of the pre-trained classifier.
  ¤ Correctness is computed for a random sample of concrete (leaf-node) concepts, for which all attributes are observed.
¤ Coverage: the diversity of values of the unspecified (missing) attributes,
  coverage(S, y_𝒪) = (1/|ℳ|) Σ_{k∈ℳ} (1 - JS(p_k, q_k))
  ¤ where ℳ = 𝒜 \ 𝒪 is the set of missing attributes.
  ¤ q_k is the empirical distribution over values of attribute k induced by the set S.
  ¤ p_k is the true distribution over values of attribute k for all images in the extension of y_𝒪.
  ¤ JS is the Jensen-Shannon divergence (symmetric, 0 ≤ JS ≤ 1), measuring the difference between q_k and p_k.
  ¤ The intent: measure the diversity of the missing attributes while accounting for correlations with the other attributes (e.g., if most big birds are red, comparing against p_k is better than simply computing the entropy of q_k).
  ¤ Coverage is computed for a random sample of abstract (non-leaf) concepts, where at least one attribute is missing.
¤ Compositionality:
  ¤ iid setting: train on D^train_xy and test on D^test_xy, two disjoint labeled datasets drawn from the same distribution, so the same attribute combinations appear in training and testing; this tests generalization across visual variation within known attribute combinations.
  ¤ comp setting: test on attribute combinations never seen in training. The label space Y is partitioned into disjoint subsets Y1 and Y2 such that no y^(2) in Y2 is identically equal to any y^(1) in Y1, but each shares at least one attribute with some y^(1).
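A sketch of the correctness and coverage computations, assuming a hypothetical observation classifier that returns one attribute dictionary per image (all names here are illustrative):

```python
import numpy as np
from collections import Counter

def correctness(pred_attrs, y_obs):
    """pred_attrs: classifier predictions, one dict per generated image in S;
    y_obs: dict of the attributes specified by the concept."""
    per_image = [np.mean([pred[k] == v for k, v in y_obs.items()])
                 for pred in pred_attrs]
    return float(np.mean(per_image))

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, so 0 <= JS <= 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a[a > 0] * np.log2(a[a > 0] / b[a > 0])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def coverage(pred_attrs, true_attrs, missing_keys, values_per_key):
    """For each unspecified attribute k, compare the empirical distribution q_k over
    the generated set with the reference distribution p_k over real images matching
    the concept; the score is 1 - JS averaged over the missing attributes."""
    def hist(dicts, k):
        counts = Counter(d[k] for d in dicts)
        return [counts[v] / len(dicts) for v in values_per_key[k]]
    scores = [1.0 - js_divergence(hist(true_attrs, k), hist(pred_attrs, k))
              for k in missing_keys]
    return float(np.mean(scores))
```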

16.

Experiment 1-1: the need for attributes
¤ Comparison with β-VAE (a disentangling VAE).
¤ A 2-dimensional latent space is visualized; colors correspond to the attribute information.
  (Figure 3: benefit of semantic annotations for learning a good latent space. (a) β-VAE fit to images without annotations: the red region, corresponding to the concept of large even digits, is almost non-existent. (b) Joint VAE with λ^yx_y = 50.)
¤ With attribute information, the latent space is separated much more cleanly.
¤ To visualize the latent space, each training image x (with label y(x)) is embedded at ẑ(x) = E_{q(z|x)}[z], and the label y(x) is associated with that point.
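A sketch of the visualization procedure described in the caption, assuming an encoder that returns the mean and log-variance of a 2-D q(z|x):

```python
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_latent_space(encoder, images, labels):
    """Embed each image at z_hat(x) = E_{q(z|x)}[z] (the posterior mean)
    and color the point by its attribute label."""
    mu, _ = encoder(images)                 # (N, 2) posterior means
    mu = mu.cpu().numpy()
    plt.scatter(mu[:, 0], mu[:, 1], c=labels, s=4, cmap="coolwarm")
    plt.xlabel("z[0]")
    plt.ylabel("z[1]")
    plt.show()
```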

17.

Experiment 1-2: the need for the universal expert
¤ Is p(z) really needed in the product of experts?
¤ (a): no product of experts; (b): product of experts without p(z); (c): product of experts with p(z).
  (Figure 4: effect of different inference networks, each shown with its best hyperparameters. (a) PoE disabled, λ^x_x = λ^xy_x = 1, λ^y_y = λ^yx_y = 10. (b) PoE enabled but no universal expert, λ^x_x = λ^xy_x = 1, λ^y_y = λ^yx_y = 50. (c) PoE enabled with universal expert, λ^x_x = 0.001, λ^xy_x = 0.1, λ^y_y = λ^yx_y = 100.)
¤ Without p(z), q(z|y_k) stretches far horizontally or vertically (b): when all attributes are present (as during training) the individual Gaussians multiply into a well-behaved posterior, but at test time, when some attributes are missing, the product can be poorly behaved.
¤ With p(z), each long, thin Gaussian expert is multiplied by the universal expert, a fixed-size circle, so the posterior stays within the range of the prior (c), for both concrete and abstract queries.

18.

Experiment 1-3: the need for the scaling parameters
¤ Shows that tuning λ is important.
  (Figure 5: impact of the likelihood scaling terms on the latent space. (a) λ^x_x = λ^xy_x = λ^y_y = λ^yx_y = 1, i.e. the unscaled triple ELBO. (b) λ^x_x = λ^xy_x = 1, λ^y_y = λ^yx_y = 100. (c) λ^x_x = 0.01, λ^xy_x = 1, λ^y_y = λ^yx_y = 100.)
¤ Unless λ^x_x is small, the latent space is not learned well and p(x|z) does not generate well either.

19.

Experiment 1-4: interpreting JMVAE
¤ Examine how JMVAE learns in the latent space (bi-VCCA omitted here).
¤ (c): α = 0.1, (d): α = 10.
  (Figure 6: effect of hyperparameters on bi-VCCA and JMVAE. (a) bi-VCCA, μ = 0.1. (b) bi-VCCA, μ = 0.9. (c) JMVAE, α = 0.1. (d) JMVAE, α = 10.)
¤ When α is small, the model relies almost entirely on q(z|x, y) and the variance of q(z|y_k) becomes large (yet p(x|z) still generates well).
¤ When α is large, q(z|y_k) aligns well with the attribute regions (but the distributions still overlap more than with the triple ELBO).

20.
Experiment 2-1: results on MNIST-a
¤ Compare triple ELBO, JMVAE, and bi-VCCA on MNIST-a.
¤ The scaling coefficients were tuned on a validation set.

Table 2: comparison on the MNIST-a test set. Higher numbers are better; standard errors of the mean in parentheses. For concrete concepts (all 4 attributes specified), the PoE inference network is not used and coverage is not reported.

Method | #Attributes | Coverage (%) | Correctness (%) | PoE? | Training set
triple ELBO | 4 | - | 90.76 (0.11) | N | iid
JMVAE | 4 | - | 86.38 (0.14) | N | iid
bi-VCCA | 4 | - | 80.57 (0.26) | N | iid
triple ELBO | 3 | 90.76 (0.21) | 77.79 (0.30) | Y | iid
JMVAE | 3 | 89.99 (0.20) | 79.30 (0.26) | Y | iid
bi-VCCA | 3 | 85.60 (0.34) | 75.52 (0.43) | Y | iid
triple ELBO | 2 | 90.58 (0.17) | 80.10 (0.47) | Y | iid
JMVAE | 2 | 89.55 (0.30) | 77.32 (0.44) | Y | iid
bi-VCCA | 2 | 85.75 (0.32) | 75.98 (0.78) | Y | iid
triple ELBO | 1 | 91.55 (0.05) | 81.90 (0.48) | Y | iid
JMVAE | 1 | 89.50 (0.09) | 81.06 (0.23) | Y | iid
bi-VCCA | 1 | 87.77 (0.10) | 76.33 (0.67) | Y | iid
triple ELBO | 4 | - | 83.10 (0.07) | N | comp
JMVAE | 4 | - | 79.34 (0.52) | N | comp
bi-VCCA | 4 | - | 75.18 (0.51) | N | comp

¤ bi-VCCA does poorly.
¤ The triple ELBO outperforms JMVAE.
¤ Hyperparameters (paper Sec. 5.2.4): for each model, the label-likelihood weight λ^yx_y ∈ {1, 10, 50} is chosen (λ^xy_x = 1 fixed throughout), along with whether to use PoE for q(z|y), plus one method-specific hyperparameter: α ∈ {0.01, 0.1, 1.0} for JMVAE (the same set of values used in (Suzuki et al., 2017)), μ ∈ {0.3, 0.5, 0.7} for bi-VCCA, and λ^y_y ∈ {1, 50, 100} for the triple ELBO (with λ^x_x = λ^xy_x = 1); thus all methods have the same number of hyperparameters.

21.
My take as the JMVAE author
  (Figure 12: architecture of the q(z|y) network for the MNIST-a JVAE models; images are 64x64x1, class has 10 possible values, scale 2, orientation 3, location 4.)
¤ After hyperparameter tuning, JMVAE's α is set to 1 in every experiment, whereas the triple ELBO's λ^y_y takes a different value from experiment to experiment.
¤ Surely this means the triple ELBO is more hyperparameter-dependent and harder to tune??

Table 3: hyperparameters used for each result in Table 2 (recall that λ^xy_x = 1 for all methods, and λ^x_x = λ^xy_x for the triple ELBO).

Method | #Attributes | λ^yx_y | Private hyperparameter | L1 | PoE? | Training set
triple ELBO | 4 | 10 | λ^y_y = 100 | 5e-05 | N | iid
JMVAE | 4 | 50 | α = 1 | 0 | N | iid
bi-VCCA | 4 | 10 | μ = 0.7 | 5e-07 | N | iid
triple ELBO | 3 | 50 | λ^y_y = 1 | 5e-03 | Y | iid
JMVAE | 3 | 50 | α = 1 | 5e-03 | Y | iid
bi-VCCA | 3 | 50 | μ = 0.7 | 5e-04 | Y | iid
triple ELBO | 2 | 50 | λ^y_y = 1 | 5e-03 | Y | iid
JMVAE | 2 | 50 | α = 1 | 5e-03 | Y | iid
bi-VCCA | 2 | 50 | μ = 0.7 | 5e-04 | Y | iid
triple ELBO | 1 | 50 | λ^y_y = 1 | 5e-03 | Y | iid
JMVAE | 1 | 50 | α = 1 | 5e-06 | Y | iid
bi-VCCA | 1 | 50 | μ = 0.7 | 5e-04 | Y | iid
triple ELBO | 4 | 10 | λ^y_y = 100 | 0 | Y | comp
JMVAE | 4 | 50 | α = 1 | 5e-03 | Y | comp
bi-VCCA | 4 | 10 | μ = 0.7 | 5e-05 | Y | comp

(The paper chooses the best hyperparameters by validation performance: the values maximizing correctness on concrete validation concepts when evaluating concrete test concepts, and the values maximizing coverage on abstract validation concepts when evaluating abstract ones, breaking near-ties, within one standard error, by correctness.)

¤ If JMVAE's α were simply set larger (say 10), I suspect JMVAE would come out ahead.
¤ The paper explicitly states that α ∈ {0.01, 0.1, 1.0} was used "following the original paper".

22.

Experiment 2-2: qualitative evaluation of correctness
¤ Generate images from various attribute descriptions and predict their attributes with the observation classifier.
¤ A red frame marks an image whose predicted attributes are wrong.
  (Figure 7: samples of 2 previously seen concrete concepts, "4, big, upright, bottom-left" and "1, big, counter-clockwise, top-left", for the 3 models; for each concept, 4 samples z_i ~ q(z|y) are drawn and shown as mean images μ_i = E[x|z_i]; the classifier is fed sampled images, not the mean images.)
¤ bi-VCCA does poorly: its images are much blurrier than the other methods' and are judged incorrect by the observation classifier. The blurriness comes from the E_{q(z|y)}[log p(x, y|z)] term; μ = 0.7 (the best value on validation) reduces it but does not eliminate it.
¤ JMVAE's samples look good, but it sometimes makes mistakes or fails to generate cleanly.

23.

Experiment 2-3: qualitative evaluation of coverage
¤ Coverage is evaluated only on abstract concepts, since a concrete concept specifies all attributes and a single sample automatically covers it.
¤ Confirms that the triple ELBO has high coverage: it outperforms JMVAE (although the gap is small), and both outperform bi-VCCA (Table 2).
  (Figure 8: triple ELBO samples for concepts at three levels of abstraction, e.g. the 3-bit concept "big, counter-clockwise, bottom-right" with the digit unspecified, the 2-bit concept "3, big", and a 1-bit concept; for each, 10 samples z_i ~ q(z|y) are drawn, shown as mean images, and the 6 most diverse are picked manually.)
¤ For the attributes that were not given as input, a variety of values is generated: the samples are correct (consistent with the specified attributes) yet relatively diverse, as desired.

24.

Experiment 2-4: qualitative evaluation of compositionality
¤ Test whether unseen combinations of attributes can be generated.
  (Figure 9: samples of 2 compositionally novel concrete concepts, "0, big, upright, top-right" and "2, big, clockwise, bottom-left", for the 3 models; 4 samples z_i ~ q(z|y) each, shown as mean images; a red border means one or more attributes are incorrect according to the observation classifier, a black border means all are correct.)
¤ bi-VCCA does poorly; JMVAE likewise fails to generate these well.
¤ The triple ELBO generates them reasonably well.

25.

Summary
¤ The paper proposes a method that learns to represent the semantic content of images and descriptions as probability distributions over a shared latent space.
  ¤ The triple ELBO.
  ¤ This makes it possible to "imagine" concrete and abstract concepts and to "ground" them in images.
¤ Impressions:
  ¤ The way the evaluation is carried out is impressive, as expected.
  ¤ I would like to reproduce and examine the results of this paper myself.
  ¤ It feels strange to come across my own name while reading a paper.