【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood

>100 Views

July 01, 22

スライド概要

2022/7/1
Deep Learning JP
http://deeplearning.jp/seminar-2/

シェア

またはPlayer版

埋め込む »CMSなどでJSが使えない場合

関連スライド

各ページのテキスト
1.

Unbiased Gradient Estimation for Marginal Log-likelihood Shohei Taniguchi, Matsuo Lab

2.

എ‫ܠ‬ • ਂ૚ੜ੒Ϟσϧͷֶश͸ɼपลର਺໬౓ log pθ (x) ͷ࠷େԽͰఆࣜԽ͞ΕΔ ∫ જࡏม਺Ϟσϧʢe.g., VAE)ɿlog pθ (x) = log pθ (x, z) dz ΤωϧΪʔϕʔεϞσϧɿlog pθ (x) = − Eθ (x) − log exp (−Eθ (x)) dx ∫ • ଟ͘ͷ৔߹Ͱɼपลର਺໬౓͸ղੳతʹ‫͍ͳ͖Ͱࢉܭ‬ 2

3.

എ‫ܠ‬ ྫ̍ɿVAE ର਺पล໬౓ͷม෼ԼքͰۙࣅ 𝔼 log pθ (x) = log pθ (x, z) dz ∫ pθ (x, z) ≥ q(z) log = ℒ (θ, q) [ q (z) ] 3

4.

എ‫ܠ‬ ྫ̎ɿEBM ର਺पล໬౓ͷޯ഑ΛMCMCͰۙࣅ ∇log pθ (x+) = − ∇Eθ (x+) + ≈ − ∇Eθ (x+) + (x ) ∇E [ x−∼pθ(x) θ − ] xT ∼q(xT) [ ∇Eθ (xT)] 𝔼 𝔼 q (xT)͸MCMCΛTεςοϓճͨ͠෼෍ͰɼT → ∞Ͱ౳߸੒ཱ 4

5.

എ‫ܠ‬ • ม෼Լք΍༗‫ݶ‬εςοϓͷMCMCʹΑΔޯ഑ͷۙࣅʹ͸ɼόΠΞε͕͋Δ όΠΞεɿਪఆྔͷ‫ظ‬଴஋ͱਅ஋ͱͷͣΕ ɹɹɹɹɹόΠΞε͕0ͷਪఆྔΛෆภਪఆྔͱ͍͏ • Ͱ͖Ε͹ɼର਺पล໬౓ͷޯ഑ͷෆภਪఆྔΛ࢖ֶͬͯश͍ͨ͠ • ର਺पล໬౓Λෆภਪఆ͢Δख๏ͷྫʢա‫ڈ‬ͷྠಡʣ https://www.slideshare.net/DeepLearningJP2016/dlsumo-unbiased-estimationof-log-marginal-probability-for-latent-variable-models-250013351 5

6.

Outline पลର਺໬౓ͷෆภਪఆ๏ 1. पลର਺໬౓ࣗମΛਪఆ͢Δํ๏ • On Multilevel Monte Carlo Unbiased Gradient Estimation for Deep Latent Variable Models (AISTATS 2021) • Efficient Debiased Evidence Estimation by Multilevel Monte Carlo Sampling (UAI 2021) 2. पลର਺໬౓ͷޯ഑Λਪఆ͢Δํ๏ • Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models (ICLR 2020) 6

7.

Outline पลର਺໬౓ͷෆภਪఆ๏ 1. पลର਺໬౓ࣗମΛਪఆ͢Δํ๏ • On Multilevel Monte Carlo Unbiased Gradient Estimation for Deep Latent Variable Models (AISTATS 2021) • Efficient Debiased Evidence Estimation by Multilevel Monte Carlo Sampling (UAI 2021) 2. पลର਺໬౓ͷޯ഑Λਪఆ͢Δํ๏ • Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models (ICLR 2020) 7

8.

IWAE Importance Weighted Autoencoder pθ (x, z ) 1 log pθ (x) = log z(1),…,z(k)∼q(z) ∑ k [ i=1 q (z (i)) ] k (i) pθ (x, z ) 1 ≥ z(1),…,z(k)∼q(z) log ∑ (i) k [ ] q z ( ) i=1 𝔼 k 𝔼 = z (1),…,z (k)∼q(z) [ℒk (θ, q)] 𝔼 k = 1 ͰVAEͱҰகɼk → ∞ Ͱ౳߸੒ཱ (i)

9.

Δk = ∞ ͱ͓͘ͱɺ‫਺ڃ‬ ∑ k=1 ℒ1 (θ, q) (k = 1) {ℒk (θ, q) − ℒk−1 (θ, q) (k ≥ 2) Δk ͷ‫ظ‬଴஋͸ର਺पล໬౓ͱҰக͢Δ ∞ Δk = ∑ [ k=1 ] 𝔼 𝔼 पลର਺໬౓ͷ‫਺ڃ‬ද‫ه‬ (x) ℒ θ, q = log p ( ) ∞ θ [ ]

10.

Russian Roulette Estimator ҎԼͷΑ͏ͳ ŷ Λߟ͑Δ y ̂ = Δ1 + ∞ ∑k=2 Δk μ ⋅ b, b ∼ Bernoulli (μ) 1. ֬཰ μ Ͱද͕ग़ΔίΠϯΛৼΔ 2. ද͕ग़ͨΒ k = 2 Ҏ߱Λ‫͠ࢉܭ‬ɺμ Ͱׂͬͨ΋ͷΛΔ1ʹ଍͢ ཪ͕ग़ͨΒ Δ1 ͚ͩΛ‫͢ࢉܭ‬Δ

11.

ŷ ͸ ∞ ∑ k=1 Δk ͷෆภਪఆྔͰ͋Δ͜ͱ͕Θ͔Δ ŷ = Δ1 + ∞ ∑k=2 Δk μ [y]̂ = Δ1 + 𝔼 𝔼 Russian Roulette Estimator ⋅ b, b ∼ Bernoulli (b; μ) ∞ ∑k=2 Δk μ ⋅ ∞ [b] = ∑ Δk k=1

12.

Russian Roulette Estimator ಉ͜͡ͱΛ k = 2 Ҏ߱΋‫܁‬Γฦ͢ͱɺҎԼͷ ŷ ΋ ∞ ∑ k=1 Δk ͷෆภਪఆྔʹͳΔ K Δk ŷ = , K ∼ Geometric K; 1 − μ ( ) ∑ μ k−1 k=1 K ͸࠷ॳʹཪ͕ग़Δ·ͰʹίΠϯΛৼͬͨճ਺ʢ‫ز‬Կ෼෍ʹै͏ʣ ͜ͷ ŷ Λ࢖͑͹ɺର਺पล໬౓ͷෆภਪఆྔ͕ಘΒΕΔ

13.

𝔼 Single Sample Estimator ಉ༷ʹͯ͠ɼҎԼͷ ŷ ΋ ∞ ∑ k=1 Δk ͷෆภਪఆྔʹͳΔ ΔK ΔK ŷ = = p (K) μ K−1 (1 − μ) ∞ ∞ ΔK = Δk [y]̂ = ∑ p (K) ⋅ ∑ (K) p K=1 k=1

14.

SUMO Stochastically Unbiased Marginalization Objective K Δk log pθ (x) = K∼p(K) k−1 ∑ μ [ k=1 ] 𝔼 = ℒ1 (θ, qϕ) + VAEͱಉ͡ K K∼p(K) ∑ ℒk (θ, qϕ) − ℒk−1 (θ, qϕ) μ k−1 k=2 𝔼 ิਖ਼߲

15.

SUMOͷ՝୊ • ਪఆྔͷ෼ࢄ͕େ͖͘ͳΓ΍͍͢ • ࠷ѱͷ৔߹ɼ෼ࢄ͕ແ‫ݶ‬େʹൃࢄ͢Δ • ෼ࢄ͸ p (K) ͷબͼํͰ੍‫͖Ͱޚ‬Δ͕ɼ෼ࢄΛԼ͛Α͏ͱ͢Δͱɼ K ͕େ͖͘ͳΓɼ‫͕ྔࢉܭ‬େ͖͘ͳΔ • ෼ࢄ͕༗‫ྔࢉܭ͔ͭݶ‬ͷ‫ظ‬଴஋΋༗‫͋Ͱݶ‬Δ͜ͱ͕ཧ૝ 15

16.

Δk = ∞ ͱ͓͘ͱɺ‫਺ڃ‬ ∑ k=1 ℒ1 (θ, q) (k = 1) {ℒk (θ, q) − ℒk−1 (θ, q) (k ≥ 2) Δk ͷ‫ظ‬଴஋͸ର਺पล໬౓ͱҰக͢Δ ∞ Δk = ∑ [ k=1 ] 𝔼 𝔼 पลର਺໬౓ͷ‫਺ڃ‬ද‫ه‬ʢ࠶ʣ (x) ℒ θ, q = log p ( ) ∞ θ [ ] Δkͷઃ‫ํܭ‬๏͸ଞʹ΋ߟ͑ΒΕΔ

17.

पลର਺໬౓ͷ‫਺ڃ‬ද‫ه‬ʢվʣ Δk = ℒ1 (θ, q) 1 ℒ2k (θ, q)− 2 ∞ ͜ͷ৔߹΋ɼ‫਺ڃ‬ ∑ k=1 (1) ℒ θ, q k−1 ( ) ( 2 + (2) ℒ2k−1 (θ, q)) (k = 1) (k ≥ 2) Δk ͷ‫ظ‬଴஋͸ର਺पล໬౓ͱҰக͢Δ ͜ΕΛ࢖ͬͯߏ੒ͨ͠ਪఆྔ͸ɼ෼ࢄ༗‫ྔࢉܭ͔ͭݶ‬༗‫ݶ‬Λ࣮‫͖Ͱݱ‬Δ

18.

࣮‫ݧ‬ ը૾ੜ੒ ఏҊ๏͸IWAE΍SUMOΑΓ΋ੑೳ͕վળ͢Δ

19.

Outline पลର਺໬౓ͷෆภਪఆ๏ 1. पลର਺໬౓ࣗମΛਪఆ͢Δํ๏ • On Multilevel Monte Carlo Unbiased Gradient Estimation for Deep Latent Variable Models (AISTATS 2021) • Efficient Debiased Evidence Estimation by Multilevel Monte Carlo Sampling (UAI 2021) 2. पลର਺໬౓ͷޯ഑Λਪఆ͢Δํ๏ • Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models (ICLR 2020) 19

20.

EBMͷֶश ର਺पล໬౓ͷޯ഑ΛMCMCͰۙࣅ ∇log pθ (x+) = − ∇Eθ (x+) + ≈ − ∇Eθ (x+) + x−∼pθ(x) [ ∇Eθ (x−)] xT ∼q(xT) [ ∇Eθ (xT)] q (xT)͸MCMCΛTεςοϓճͨ͠෼෍ͰɼT → ∞Ͱ౳߸੒ཱ 𝔼 𝔼 T͕༗‫ͱͩݶ‬ɼޯ഑ͷਪఆʹόΠΞε͕ͷΔ 20

21.

ޯ഑ͷ‫਺ڃ‬ల։ (x ) ∇E [ x−∼pθ(x) θ − ] ∑ t=k+1 ∇E x − ( ) θ t+1 ] xt+1∼q(xt+1) [ ∇E x ( θ t)] xt∼q(xt) [ 𝔼 ∇E x + ( ) θ k ] xk∼q(xk) [ ∞ pθ (x) = q (x∞)Ͱ͋Δ͜ͱΛ༻͍Δͱɼ‫਺ڃ‬ͷ‫ʹܗ‬ม‫͖Ͱܗ‬Δ 𝔼 𝔼 𝔼 = x∞∼q(x∞) [ ∇Eθ (x∞)] 𝔼 = 21

22.

ޯ഑ͷ‫਺ڃ‬ల։ (x ) ∇E [ x−∼pθ(x) θ − ] t=k+1 𝔼 ∑ ∇E x − ( ) θ t+1 ] xt+1∼q(xt+1) [ t=k+1 ∞ 𝔼 ∇E x + ( ) θ k ] xk∼q(xk) [ ∑ ∇E x − ( ) θ t+1 ] xt+1∼q(xt+1) [ 𝔼 ͨͩ͠ɼyt ͸ xt ͱಉ͡पล෼෍ʹै͏ 𝔼 = ∇E x + ( ) θ k ] xk∼q(xk) [ 𝔼 𝔼 𝔼 = ∞ 22 ∇E x ( θ t)] xt∼q(xt) [ ∇E y ( θ t)] yt∼q(yt) [

23.

ޯ഑ͷ‫਺ڃ‬ల։ (x ) ∇E [ x−∼pθ(x) θ − ] t=k+1 ∞ ∇E x − ∇E y ( ) ( ) θ t+1 θ t ( )] ∑ t=k+1 ͨͩ͠ɼyt ͸ xt ͱಉ͡पล෼෍ʹै͏ 𝔼 [ ∇Eθ (xk) + ∑ ∇E x − ( ) θ t+1 ] xt+1∼q(xt+1) [ 𝔼 = ∇E x + ( ) θ k ] xk∼q(xk) [ 𝔼 𝔼 𝔼 = ∞ 23 ∇E y ( θ t)] yt∼q(yt) [

24.

ޯ഑ͷ‫਺ڃ‬ల։ (x ) ∇E [ x−∼pθ(x) θ − ] 𝔼 = [ ∇Eθ (xk) + ∇Eθ (xt+1) − ∇Eθ (yt)) ( ∑ ] t=k+1 τ−1 ∇Eθ (xt+1) − ∇Eθ (yt)) ( ∑ ] t=k+1 𝔼 = [ ∇Eθ (xk) + ∞ 𝔼 ͨͩ͠ɼyt ͸ xt ͱಉ͡पล෼෍ʹै͍ɼt ≥ τ Ͱ xt+1 = yt Ͱ͋Δͱ͢Δ 24

25.

ޯ഑ͷ‫਺ڃ‬ల։ (x ) ∇E [ x−∼pθ(x) θ − ] = [ ∇Eθ (xk) + τ−1 ∇E x − ∇E y ( ) ( ) θ t+1 θ t ( )] ∑ t=k+1 ͔͠͠ɼt ≥ τ Ͱ xt+1 = ytΛຬͨ͢ yt ͸ͲͷΑ͏ʹ࡞Δʁ 𝔼 𝔼 ແ‫਺ڃݶ‬Λ༗‫ݶ‬࿨ʹॻ͖‫͑׵‬ΒΕͨʂ 25

26.

ΧοϓϦϯά • ૬ؔͷ͋Δ2ͭͷMCMCΛճͯ͠ 1εςοϓͣΕͨαϯϓϧಉ͕࢜ Ұக͢Δ·Ͱճ͢ • MCMCΛಠཱʹճͯ͠͠·͏ͱ αϯϓϧಉ͕࢜Ұக͢Δ͜ͱ͕΄ͱΜͲ‫͜ى‬Βͳ͍ͷͰ ͏·͘૬ؔͤ͞Δ͜ͱ͕ٕज़తͳ؊ 26

27.

ΧοϓϦϯά • ξ ͱ η ΛͦΕͧΕp (ξ), q (η)ʹै͏֬཰ม਺ͱ͢Δ • ͨͩ͠ɼ྆ऀ͸ಠཱͰ͋Δඞཁ͸ͳ͍ ∫ min{p(x), q(x)}dx • ͜ͷͱ͖ɼξ ͱ η͕Ұக͢Δ֬཰ P (ξ = η) ʹ͍ͭͯɼҎԼͷෆ౳͕ࣜ੒ཱ P(ξ = η) ≤ 1 − ∥p − q∥TV = min{p(x), q(x)}dx ∫ 27

28.

Maximal Coupling https://commons.wikimedia.org/wiki/File:Total_variation_distance.svg P(ξ = η) ≤ 1 − ∥p − q∥TV = min{p(x), q(x)}dx ∫ • ҎԼͷखॱͰ ξ ͱ η ΛαϯϓϦϯά͢Δͱɼ͜ͷ্քΛୡ੒Ͱ͖Δ 28

29.

Maximal Coupling https://commons.wikimedia.org/wiki/File:Total_variation_distance.svg खॱ 1. ෼෍ p ͔Β ξ ΛαϯϓϦϯά 2. ֬཰ α = min (q (ξ) /p (ξ), 1) Ͱ η = ξ ͱ͢Δ • ਤͷ੺͍෦෼͔Β η Λαϯϓϧ 3. ֬཰ 1 − α Ͱ q̃ (η) ∝ min (q (η) − p (η), 0) ͔Βαϯϓϧ͢Δ • q ͷ࢒Γͷ෦෼͔Β η Λαϯϓϧ 29

30.

Maximal Coupling ྫɿͲͪΒ΋ඪ४ਖ਼‫ن‬෼෍ͷ৔߹ 1. p ͔Βαϯϓϧ ξ ΛಘΔ 2. η = ξ ͱͯ͠ɼq ͔Βͷ αϯϓϧͱ͢Δ https://colcarroll.github.io/couplings/static/maximal_couplings.html • ͜ͷͱ͖ɼξ ͱ η ͸ͱ΋ʹඪ४ਖ਼‫ن‬෼෍ʹै͍ɼ֬཰1Ͱ ξ = η 30

31.

Maximal Coupling ͦͷଞͷྫ https://colcarroll.github.io/couplings/static/maximal_couplings.html 31

32.

ΧοϓϦϯά • MCMCͷ֤εςοϓΛmaximal couplingͰαϯϓϦϯά͢Δ͜ͱͰ ߴ͍֬཰Ͱ2ͭͷαϯϓϧΛҰகͤ͞Δ͜ͱ͕Ͱ͖Δ • ΧοϓϦϯάΛ࢖ͬͨMCMCͷෆภԽͷॳग़͸[Jacob, et al., 2017] https://arxiv.org/abs/1708.03625 • ͜ͷ࿦จ͸ɼ͜ΕΛEBMʢಛʹRBMʣͷ ֶशʹ༻͍ͨ΋ͷ 32

33.

࣮‫ݧ‬ τΠσʔλ ී௨ͷMCMCͰ͸్த͔Βੑೳ͕ྼԽ͢Δ͕ఏҊ๏͸ͦΕ͕‫͜ى‬Βͳ͍ ΧοϓϦϯάʹ͔͔Δεςοϓ਺͸ଟͯ͘10εςοϓఔ౓ 33

34.

࣮‫ݧ‬ ը૾ੜ੒ʢFashion MNISTʣ ը૾ੜ੒Ͱ΋ಉ༷ͷ܏޲ 34

35.

·ͱΊ • पลର਺໬౓͓Αͼͦͷޯ഑Λෆภਪఆ͢ΔͨΊͷςΫχοΫΛ঺հ 1. ϩγΞϯϧʔϨοτਪఆΛ༗‫ݶ‬෼ࢄɾ༗‫ํ͏ߦͰྔࢉܭݶ‬๏ 2. ΧοϓϦϯάΛ࢖ͬͯMCMCΛෆภʹ͢Δํ๏ • ͲͪΒ΋ςΫχΧϧʹ໘ന͘ɼ৭ʑԠ༻͕ޮ͖ͦ͏ • ࣮‫͕ݧ‬খ‫ن‬໛ͳͷͰɼେ͖͍ϞσϧͰ΋࢖͑Δͷ͔͕‫ͳʹؾ‬Δ 35