
September 24, 2017

#nmf
#source separation
#music
#bss
#ica
#ilrma
#generative model
#student's t
#Blind source separation
#Audio signal processing
#Independent Low-Rank Matrix Analysis
#Frequency-domain independent component analysis
#Independent vector analysis

Slide overview

Daichi Kitamura, "Blind source separation based on independent low-rank matrix analysis and its extension to Student's t-distribution," Télécom ParisTech, Invited Lecture, September 4th, 2017.

http://d-kitamura.net/links_en.html

1.

Blind source separation based on independent low-rank matrix analysis and its extension to Student's t-distribution
Daichi Kitamura, Project Research Associate, The University of Tokyo, Japan
Télécom ParisTech visit, September 4th

2.

Self introduction
• Name: Daichi Kitamura
• Age: 27 (born in 1990)
  – Kagawa Pref. in Japan
• Background:
  – NAIST, Japan
    • Master's degree (received in 2014)
  – SOKENDAI, Japan
    • Ph.D. degree (received in 2017)
  – The University of Tokyo, Japan
    • Project Research Associate
• Research topics
  – Acoustic signal processing, statistical signal processing, audio source separation, etc.
(Figure: map of Japan marking Tokyo and Kagawa)

3.

Contents
• Background
  – Blind source separation (BSS) for audio signals
  – Motivation
• Related Methods
  – Frequency-domain independent component analysis (FDICA)
  – Independent vector analysis (IVA)
  – Itakura–Saito nonnegative matrix factorization (ISNMF)
• Independent Low-Rank Matrix Analysis (ILRMA)
  – Employ low-rank TF structures of each source in BSS
  – Gaussian source model with TF-varying variance
  – Relationship between ILRMA and multichannel NMF
  – Student’s t source model with TF-varying scale parameters
• Conclusion


5.

Background
• Blind source separation (BSS) for audio signals
  – separates original audio sources
  – does not require prior information of recording conditions
    • locations of mics and sources, room geometry, timbres, etc.
  – is applicable to many audio applications
• Consider only the “determined” situation
  – # of mics = # of sources
(Figure: a recorded mixture is passed through BSS to obtain, e.g., the separated guitar. Sources → mixing system → observed → demixing system → estimated)

6.

History of BSS for audio signals
• Basic theories and their evolution (*depicting only popular methods)
  – ICA lineage
    • 1994: Independent component analysis (ICA)
    • 1998: Frequency-domain ICA (FDICA)
    • Many permutation solvers for FDICA
    • 2006: Independent vector analysis (IVA)
    • 2011: Auxiliary-function-based IVA (AuxIVA)
    • 2012: Time-varying Gaussian IVA
  – NMF lineage
    • 1999: Nonnegative matrix factorization (NMF)
    • Apply NMF to many tasks, generative models in NMF, many extensions of NMF
    • 2009: Itakura–Saito NMF (ISNMF)
    • 2013: Multichannel NMF
  – 2016: Independent low-rank matrix analysis (ILRMA)

7.

Motivation of ILRMA
• Conventional BSS techniques based on ICA
  – ☺ Minimum distortion (linear demixing)
  – ☺ Relatively fast and stable optimization
    • FastICA [A. Hyvarinen, 1999], natural gradient [S. Amari, 1996], and auxiliary function technique [N. Ono+, 2010], [N. Ono, 2011]
  – Cannot use a “specific” assumption about the sources
    • Only assumes a non-Gaussian p.d.f. for sources
  – Permutation problem is crucial and still difficult to solve
    • IVA often fails, causing a “block permutation problem” [Y. Liang+, 2012]
• Better to use a “specific source model” in TF domain
  – Independent low-rank matrix analysis (ILRMA) employs a low-rank property
(Figure: source signals → frequency-wise mixing matrix → observed signal → frequency-wise demixing matrix → estimated signal, indexed by frequency bins and time frames)


9.

Related methods: ICA
• Independent component analysis (ICA) [P. Comon, 1994]
  – estimates the demixing matrix without knowing the mixing matrix
  – Source model (scalar)
    • Each source is non-Gaussian, and the sources are mutually independent
  – Spatial model
    • Mixing system is a time-invariant matrix
• Mixing system in audio signals
  – Convolutive mixture with room reverberation
(Figure: sources (source model) → mixing matrix (spatial model) → observed → demixing matrix → estimated)

10.

Related methods: FDICA
• Frequency-domain ICA (FDICA) [P. Smaragdis, 1998]
  – estimates frequency-wise demixing matrix
  – Source model (scalar)
    • Each source is complex-valued, non-Gaussian, and mutually independent
  – Spatial model
    • Frequency-wise mixing matrix is time-invariant
      – Instantaneous mixture in each frequency band
      – A.k.a. rank-1 spatial model [N.Q.K. Duong, 2010]
• Permutation problem?
  – Order of estimated signals cannot be determined by ICA
  – Alignment of frequency-wise estimated signals is required
    • Many permutation solvers were proposed
(Figure: spectrograms (frequency bin × time frame) with a separate ICA applied in each frequency bin: ICA 1, ICA 2, …, ICA I)
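The frequency-wise model on this slide can be written out as a short math sketch. The symbols are assumptions chosen for illustration (the slide's own equations did not survive extraction): $i$ indexes frequency bins, $j$ time frames, $\boldsymbol{A}_i$ is the mixing matrix, $\boldsymbol{W}_i$ the demixing matrix, $\boldsymbol{P}_i$ a permutation matrix, and $\boldsymbol{D}_i$ a diagonal scaling matrix.

```latex
\begin{align}
  \boldsymbol{x}_{ij} &= \boldsymbol{A}_i \boldsymbol{s}_{ij}
    && \text{(instantaneous mixture in frequency bin } i\text{)} \\
  \boldsymbol{y}_{ij} &= \boldsymbol{W}_i \boldsymbol{x}_{ij}
    && \text{(linear demixing)} \\
  \boldsymbol{W}_i &\approx \boldsymbol{P}_i \boldsymbol{D}_i \boldsymbol{A}_i^{-1}
    && \text{(ICA recovers } \boldsymbol{A}_i^{-1} \text{ only up to order and scale)}
\end{align}
```

Because $\boldsymbol{P}_i$ may differ from one frequency bin to the next, the separated components have to be re-aligned across frequencies, which is the permutation problem.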

11.

Permutation problem
• FDICA requires signal alignment across all frequencies
  – Order of estimated signals cannot be determined by ICA*
(Figure: sources 1 and 2 → observed 1 and 2 → ICA applied to all frequency components over time → permutation solver → estimated signals 1 and 2)
*Signal scale should also be restored by a back-projection technique

12.

Related methods: IVA
• Independent vector analysis (IVA) [A. Hiroe, 2006], [T. Kim, 2006]
  – extends ICA to a multivariate probabilistic model that treats the sourcewise frequency vector as a variable
  – Source model (vector)
    • Each source vector is multivariate, spherical, complex-valued, non-Gaussian, and mutually independent
  – Spatial model
    • Mixing system is a time-invariant matrix (rank-1 spatial model)
  – Permutation-free estimation of the demixing matrix is achieved!
(Figure: source vectors following a multivariate non-Gaussian dist. whose frequency components have higher-order correlations → mixing matrix → observed vector → demixing matrix → estimated vector)

13.

Higher-order correlation assumed in IVA
• Spherical multivariate distribution [T. Kim, 2007]
  – Two mutually independent Laplace dist.s: x1 and x2 are mutually independent
  – Spherical Laplace dist.: probability depends only on the norm, so x1 and x2 have higher-order correlation
• Why spherical distribution?
  – Frequency bands that have similar activations will be merged together as one source, which avoids the permutation problem
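The contrast between the two bivariate models can be made concrete with a small math sketch. Unit scale and the subscripts "indep" and "sph" are assumptions made for illustration, not notation from the slides.

```latex
\begin{align}
  p_{\mathrm{indep}}(x_1, x_2) &\propto \exp\!\left(-|x_1| - |x_2|\right)
    = \exp\!\left(-|x_1|\right)\exp\!\left(-|x_2|\right), \\
  p_{\mathrm{sph}}(x_1, x_2) &\propto \exp\!\left(-\sqrt{|x_1|^2 + |x_2|^2}\right).
\end{align}
```

The first density factorizes, so $x_1$ and $x_2$ are independent. The second depends only on the norm and does not factorize: the two variables are uncorrelated but still dependent, so a large $|x_1|$ makes a large $|x_2|$ more likely. This is the kind of co-activation across frequency bins that IVA exploits.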

14.

Comparison of source models
• Frequency-domain ICA (FDICA) [P. Smaragdis, 1998]
  – Scalar r.v.s: each source obeys a non-Gaussian dist., and the sources are mutually independent
  – The observed mixture is close to a Gaussian signal because of the CLT
  – Update the separation filter (demixing matrix) so that the estimated signals obey the non-Gaussian distribution we assumed
• Independent vector analysis (IVA) [A. Hiroe, 2006], [T. Kim, 2006]
  – Vector (multivariate) r.v.s: each source obeys a non-Gaussian spherical dist., and the sources are mutually independent
  – Update the separation filter (demixing matrix) so that the estimated signals obey the non-Gaussian distribution we assumed
(Figure: observed spectrograms (frequency × time) → STFT → demixing matrix → estimated spectrograms, with the current empirical dist. compared against the assumed source dist.)

15.

Related method: NMF
• Nonnegative matrix factorization (NMF) [D. D. Lee, 1999]
  – Low-rank decomposition with nonnegative constraint
    • Limited number of nonnegative bases and their coefficients
  – Spectrogram is decomposed in acoustic signal processing
    • Frequently appearing spectral patterns and their activations
(Figure: nonnegative matrix (power spectrogram, frequency × time) ≈ basis matrix (spectral patterns) × activation matrix (time-varying gains). The dimensions are the number of frequency bins, the number of time frames, and the number of bases)
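A minimal NumPy sketch of the decomposition described above, assuming the squared Euclidean cost and the standard multiplicative updates. The function name `nmf_euclid`, the variable names `T` (basis matrix) and `V` (activation matrix), and the toy data are assumptions for illustration only.

```python
import numpy as np

def nmf_euclid(X, n_bases, n_iter=200, eps=1e-12, seed=0):
    """Approximate a nonnegative matrix X (freq x time) by T @ V."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = X.shape
    T = rng.random((n_freq, n_bases)) + eps   # basis matrix (spectral patterns)
    V = rng.random((n_bases, n_time)) + eps   # activation matrix (time-varying gains)
    for _ in range(n_iter):
        # Multiplicative updates keep T and V nonnegative by construction.
        T *= (X @ V.T) / (T @ (V @ V.T) + eps)
        V *= (T.T @ X) / ((T.T @ T) @ V + eps)
    return T, V

# Toy "spectrogram" built from two spectral patterns with time-varying gains.
rng = np.random.default_rng(1)
X = rng.random((8, 2)) @ rng.random((2, 20))
T, V = nmf_euclid(X, n_bases=2)
rel_err = np.linalg.norm(X - T @ V) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3e}")
```

The number of bases controls the rank of the model: a small value forces the factorization to reuse a few spectral patterns across time.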

16.

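Related method: ISNMF
• ISNMF [C. Févotte, 2009]
  – Equivalent to a generative model in which the complex-valued observed signal obeys a circularly symmetric complex Gaussian dist. with nonnegative variance
  – The observed signal can be decomposed using the “stable property” of the Gaussian distribution
    • If the observed signal is defined as a sum of independent Gaussian components, the variance is also decomposed!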

17.

Related method: ISNMF
• Power spectrogram corresponds to variances in TF plane
  – Complex Gaussian distribution with TF-varying variance
  – If we marginalize in terms of time or frequency, the distribution becomes non-Gaussian even though each TF grid is defined by a Gaussian distribution
(Figure: power spectrogram (frequency bin × time frame). Grayscale shows the value of the variance, from small to large power)
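The generative view above can be sketched in NumPy: draw each TF bin from a zero-mean circularly symmetric complex Gaussian whose variance is a low-rank matrix, then fit the power spectrogram with NMF under the Itakura–Saito divergence. The square-root (MM-type) multiplicative updates, the function names, and the toy sizes are assumptions for illustration, not the slide's own equations.

```python
import numpy as np

def is_divergence(P, R):
    """Itakura-Saito divergence between power spectrogram P and model R."""
    ratio = P / R
    return float(np.sum(ratio - np.log(ratio) - 1.0))

def isnmf(P, n_bases, n_iter=100, eps=1e-12, seed=0):
    """Fit P (freq x time) with a low-rank variance model T @ V."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = P.shape
    T = rng.random((n_freq, n_bases)) + 0.1
    V = rng.random((n_bases, n_time)) + 0.1
    for _ in range(n_iter):
        R = T @ V + eps
        T *= np.sqrt(((P / R**2) @ V.T) / ((1.0 / R) @ V.T))
        R = T @ V + eps
        V *= np.sqrt((T.T @ (P / R**2)) / (T.T @ (1.0 / R)))
    return T, V

# Each TF bin: complex Gaussian with variance R0[i, j] (rank-2 structure).
rng = np.random.default_rng(1)
R0 = (rng.random((16, 2)) + 0.1) @ (rng.random((2, 50)) + 0.1)
noise = rng.standard_normal(R0.shape) + 1j * rng.standard_normal(R0.shape)
Z = np.sqrt(R0 / 2.0) * noise          # E[|Z|^2] = R0
P = np.abs(Z) ** 2 + 1e-12             # observed power spectrogram
T, V = isnmf(P, n_bases=2)
print("IS divergence after fitting:", is_divergence(P, T @ V))
```

Minimizing the Itakura–Saito divergence between `P` and `T @ V` is, up to constants, the same as maximizing the likelihood of `Z` under the Gaussian model with variance `T @ V`.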


19.

Extension of source model in IVA
• Source model in IVA
  – has a frequency-uniform scale
    • Multivariate Laplace with fixed scale
  – Almost an NMF with only one basis
    • Since scale cannot be determined, it is not equivalent to the flat spectral basis
• Extend to ISNMF-based source model
  – NMF with arbitrary number of bases
    • can represent complicated TF structures
  – can learn “co-occurrence” of each source in TF domain
    • Co-occurrence is captured as the variance
  – The structure can easily be estimated by NMF
(Figure: two frequency × time spectrogram models, one for the IVA source model and one for the ISNMF-based source model)

20.

Extension of source model in IVA
• Spherical Laplace distribution in IVA
  – Frequency vector (I-dimensional)
  – Frequency-uniform scale
• Gaussian distribution with TF-varying variance in ISNMF [C. Févotte+, 2009]
  – Time-frequency matrix (IJ-dimensional)
  – Complex-valued Gaussian in each TF bin
  – Time-frequency-varying variance
  – Low-rank decomposition with NMF
(Figure: spherical Laplace (bivariate) density)

21.

Cost function in ILRMA and partitioning function
• Negative log-likelihood in ILRMA
  – The cost function consists of two parts
    • Cost function in ICA (estimates demixing matrix), minimized by the update rules in ICA
    • Cost function in ISNMF (estimates low-rank source model), minimized by the update rules in ISNMF
  – All the variables can easily be optimized by alternating updates
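For reference, the negative log-likelihood under a Gaussian source model with NMF-structured variance can be sketched as follows. The notation is an assumption made for illustration, since the slide's equation was not preserved: $\boldsymbol{x}_{ij}$ is the observed vector, $\boldsymbol{W}_i = [\boldsymbol{w}_{i,1} \cdots \boldsymbol{w}_{i,N}]^{\mathsf{H}}$ the demixing matrix, $t_{ik,n}$ and $v_{kj,n}$ the NMF bases and activations of source $n$, and constants are dropped.

```latex
\begin{align}
  y_{ij,n} &= \boldsymbol{w}_{i,n}^{\mathsf{H}} \boldsymbol{x}_{ij}, \qquad
  r_{ij,n} = \sum_{k} t_{ik,n} v_{kj,n}, \\
  \mathcal{L} &= \sum_{i,j} \left[
      \sum_{n} \frac{|y_{ij,n}|^2}{r_{ij,n}}
      + \sum_{n} \log r_{ij,n}
      - 2 \log \left| \det \boldsymbol{W}_i \right|
  \right].
\end{align}
```

With $r_{ij,n}$ fixed, the first and third terms form an ICA-type cost in $\boldsymbol{W}_i$. With $\boldsymbol{W}_i$ fixed, the first and second terms are, up to constants, the Itakura–Saito divergence between $|y_{ij,n}|^2$ and $r_{ij,n}$, which is why the two groups of variables can be updated alternately.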

22.

Update rules of ILRMA
• ML-based iterative update rules
  – Spatial model (demixing matrix): the update rule for the demixing matrix is based on iterative projection [N. Ono, 2011], which uses a one-hot vector that has 1 at the nth element
  – Source model (NMF source model): the update rules for the NMF variables are based on the MM algorithm
  – Pseudo code is available at
    • http://d-kitamura.net/pdf/misc/AlgorithmsForIndependentLowRankMatrixAnalysis.pdf
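The alternating structure can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the author's reference implementation: it covers only the Gaussian model without the partitioning function, omits scale normalization and back-projection, and the function name, array layout, and toy data are hypothetical.

```python
import numpy as np

def ilrma(X, n_bases=2, n_iter=30, eps=1e-12, seed=0):
    """Gaussian ILRMA sketch. X: (n_freq, n_time, n_ch) complex STFT of the mixture."""
    rng = np.random.default_rng(seed)
    I, J, M = X.shape                                  # determined case: sources = channels
    W = np.tile(np.eye(M, dtype=complex), (I, 1, 1))   # demixing matrices (row n is w_n^H)
    T = rng.random((I, n_bases, M)) + 0.1              # NMF bases of each source
    V = rng.random((n_bases, J, M)) + 0.1              # NMF activations of each source
    Y = X.copy()                                       # W is the identity at the start
    cost = []
    for _ in range(n_iter):
        for n in range(M):
            P = np.abs(Y[:, :, n]) ** 2
            # Source model: MM-type multiplicative updates (as in ISNMF).
            R = T[:, :, n] @ V[:, :, n] + eps
            T[:, :, n] *= np.sqrt(((P / R**2) @ V[:, :, n].T) / ((1 / R) @ V[:, :, n].T))
            R = T[:, :, n] @ V[:, :, n] + eps
            V[:, :, n] *= np.sqrt((T[:, :, n].T @ (P / R**2)) / (T[:, :, n].T @ (1 / R)))
            R = T[:, :, n] @ V[:, :, n] + eps
            # Spatial model: iterative projection for the nth demixing filter.
            U = np.einsum('ij,ijm,ijl->iml', 1 / R, X, X.conj()) / J
            w = np.linalg.inv(W @ U)[:, :, n]          # (W U)^{-1} e_n, e_n one-hot
            w /= np.sqrt(np.real(np.einsum('im,iml,il->i', w.conj(), U, w)))[:, None]
            W[:, n, :] = w.conj()
            Y[:, :, n] = np.einsum('im,ijm->ij', W[:, n, :], X)
        R_all = np.einsum('ikn,kjn->ijn', T, V) + eps
        nll = np.sum(np.abs(Y) ** 2 / R_all + np.log(R_all))
        nll -= 2 * J * np.sum(np.linalg.slogdet(W)[1])
        cost.append(float(nll))                        # negative log-likelihood (no constants)
    return Y, W, cost

# Toy determined mixture: two low-rank Gaussian sources, random mixing per frequency.
rng = np.random.default_rng(1)
I, J, N = 16, 60, 2
R0 = np.einsum('ikn,kjn->ijn', rng.random((I, 2, N)) + 0.05, rng.random((2, J, N)) + 0.05)
S = np.sqrt(R0 / 2) * (rng.standard_normal(R0.shape) + 1j * rng.standard_normal(R0.shape))
A = rng.standard_normal((I, N, N)) + 1j * rng.standard_normal((I, N, N))
X = np.einsum('imn,ijn->ijm', A, S)
Y, W, cost = ilrma(X)
print("cost after first / last iteration:", cost[0], cost[-1])
```

In practice the output scale would still need to be fixed, for example by back-projection onto a reference microphone, before listening to or evaluating the separated signals.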

23.

Cost function in ILRMA and partitioning function
• ILRMA with partitioning function
  – Appropriate number of bases for each source can automatically be determined
  – Useful when various types of sources are mixed
    • Ex. drums are very low-rank but vocals are not so low-rank

24.

Update rules of ILRMA
• ML-based iterative update rules
  – Spatial model (demixing matrix): the update rule for the demixing matrix is based on iterative projection [N. Ono, 2011], which uses a one-hot vector that has 1 at the nth element
  – Source model (NMF source model): the update rules for the NMF variables are based on the MM algorithm

25.

Optimization process in ILRMA
• Demixing matrix and source model are alternately updated
  – The precise modeling of low-rank TF structures will improve the estimation accuracy of the demixing matrix
(Figure: loop between estimating the demixing matrix, which turns the mixture into separated signals, and estimating the source model, which updates the NMF variables of each separated signal)

26.

Comparison of source models
• FDICA source model
  – Non-Gaussian scalar variable
• IVA source model
  – Non-Gaussian vector variable with higher-order correlation
• ILRMA source model
  – Non-Gaussian matrix variable with low-rank time-frequency structure
  – The TF matrix of the mixture has a higher rank than the TF matrix of each source

27.

Multichannel extension of NMF
• Multichannel NMF [A. Ozerov+, 2010], [H. Sawada+, 2013]
  – Observed multichannel signal: multichannel vector and spatial covariances in each time-frequency slot (simultaneous spatial covariance)
  – Spatial model: spatial covariances of each source (spatial property of each source)
  – Source model: partitioning function, basis matrix (spectral patterns), and activation matrix (gains), i.e., timbre patterns of all sources

28.

Relationship b/w ILRMA and multichannel NMF
• Difference b/w ILRMA and multichannel NMF?
  – Source distribution: complex Gaussian distribution (same)
  – ILRMA assumes a rank-1 spatial covariance
  – Multichannel NMF assumes full-rank spatial covariance
• Assumption: rank-1 spatial model
  – Spatial covariance of each source is a rank-1 matrix (outer product of the sourcewise steering vector)
  – Equivalent to the instantaneous mixing assumption

29.

Relationship b/w ILRMA and multichannel NMF
• Multichannel NMF with rank-1 spatial model
  – The cost function is written in terms of the observation and the parameters
  – Substitute the rank-1 spatial model

30.

Relationship b/w ILRMA and multichannel NMF
• Multichannel NMF with rank-1 spatial model
  – Substitute into the cost function
  – Transform the variables from the mixing matrix to the demixing matrix
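The substitution can be sketched as a short derivation. The notation is assumed for illustration because the slide equations were lost: $\boldsymbol{a}_{i,n}$ is the steering vector of source $n$, $\boldsymbol{A}_i = [\boldsymbol{a}_{i,1} \cdots \boldsymbol{a}_{i,N}]$, $r_{ij,n}$ is the NMF-modeled variance, $\boldsymbol{D}_{ij} = \mathrm{diag}(r_{ij,1}, \dots, r_{ij,N})$, and $\boldsymbol{W}_i = \boldsymbol{A}_i^{-1}$ in the determined case.

```latex
\begin{align}
  \mathcal{L}_{\mathrm{MNMF}} &= \sum_{i,j} \left[
      \boldsymbol{x}_{ij}^{\mathsf{H}} \boldsymbol{R}_{ij}^{-1} \boldsymbol{x}_{ij}
      + \log \det \boldsymbol{R}_{ij} \right], \qquad
  \boldsymbol{R}_{ij} = \sum_{n} r_{ij,n}\, \boldsymbol{a}_{i,n} \boldsymbol{a}_{i,n}^{\mathsf{H}}
      = \boldsymbol{A}_i \boldsymbol{D}_{ij} \boldsymbol{A}_i^{\mathsf{H}}, \\
  \boldsymbol{x}_{ij}^{\mathsf{H}} \boldsymbol{R}_{ij}^{-1} \boldsymbol{x}_{ij}
    &= (\boldsymbol{W}_i \boldsymbol{x}_{ij})^{\mathsf{H}} \boldsymbol{D}_{ij}^{-1} (\boldsymbol{W}_i \boldsymbol{x}_{ij})
     = \sum_{n} \frac{|y_{ij,n}|^2}{r_{ij,n}}, \\
  \log \det \boldsymbol{R}_{ij}
    &= 2 \log \left| \det \boldsymbol{A}_i \right| + \sum_{n} \log r_{ij,n}
     = -2 \log \left| \det \boldsymbol{W}_i \right| + \sum_{n} \log r_{ij,n}, \\
  \mathcal{L}_{\mathrm{MNMF}} &= \sum_{i,j} \left[
      \sum_{n} \frac{|y_{ij,n}|^2}{r_{ij,n}} + \sum_{n} \log r_{ij,n}
      - 2 \log \left| \det \boldsymbol{W}_i \right| \right].
\end{align}
```

The last line has the same form as the negative log-likelihood of Gaussian ILRMA, so under the rank-1 spatial model estimating the mixing system in multichannel NMF becomes estimating the demixing matrix.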

31.

Relationship b/w MNMF, IVA, and ILRMA
• From multichannel NMF side,
  – Rank-1 spatial model is introduced, which transforms the problem from the estimation of the mixing system to that of the demixing matrix
• From IVA side,
  – Increase the number of spectral bases in source model
(Figure: map with a spatial-model axis and a source-model axis, each running from limited to flexible. Multichannel NMF reaches ILRMA via the rank-1 spatial model, and IVA reaches ILRMA via the NMF source model)

32.

Experimental evaluation
• Conditions
  – Source signals: music signals obtained from SiSEC, convolved with impulse responses (two microphones and two sources)
  – Window length: 512 ms of Hamming window
  – Shift length: 128 ms (1/4 shift)
  – Number of bases: 30 per source (ILRMA w/o partitioning function), 60 for all sources (ILRMA with partitioning function)
  – Evaluation score: improvement of signal-to-distortion ratio (SDR)
• Impulse responses
  – E2A (reverberation time: 300 ms): sources 1 and 2 at 2 m, 50° each, microphone spacing 5.66 cm
  – JR2 (reverberation time: 470 ms): sources 1 and 2 at 2 m, 60° each, microphone spacing 5.66 cm

33.

Results: fort_minor-remember_the_name
(Figure: bar charts of SDR improvement [dB] (higher is better) for violin synth. and vocals, one panel for E2A (300 ms) and one for JR2 (470 ms). Methods compared: directional clustering, IVA, Ozerov’s MNMF, Ozerov’s MNMF with random initialization, Sawada’s MNMF, ILRMA w/o partitioning function, ILRMA with partitioning function, and Sawada’s MNMF initialized by ILRMA)

34.

Results: ultimate_nz_tour
(Figure: bar charts of SDR improvement [dB] (higher is better) for guitar and synth., one panel for E2A (300 ms) and one for JR2 (470 ms). Methods compared: directional clustering, IVA, Ozerov’s MNMF, Ozerov’s MNMF with random initialization, Sawada’s MNMF, ILRMA w/o partitioning function, ILRMA with partitioning function, and Sawada’s MNMF initialized by ILRMA)

35.

Results: bearlin-roads
• Signal length: 14 s
(Figure: SDR improvement [dB] (higher is better) versus iteration steps (0 to 400) for IVA, MNMF, ILRMA without Z, and ILRMA with Z. The curves are annotated with computation times of 15.1 s, 60.7 s, 11.5 s, and 7647.3 s)

36.

Demonstration: music source separation
• Music source separation
  – Listen for the three parts in the mixture: guitar, vocal, and keyboard
• Another demo is available at http://d-kitamura.net/en/index_en.html
(Figure: mixture of keyboard, guitar, and vocal → source separation → separated vocal, keyboard, and guitar)

37.

Stable and Student’s t-distributions
• Source model based on symmetric α-stable (SαS) distribution [A. Liutkus+, 2015], [U. Şimşekli+, 2015], [S. Leglaive+, 2017], [M. Fontaine+, 2017]
  – which justifies decomposing complex-valued r.v.s by decomposing their parameters
  – Heavy tail (sparse) when α approaches 0
• Student’s t-distribution is also used as a source model [C. Févotte+, 2006], [K. Yoshii+, 2016], [K. Kitamura+, 2016], [S. Leglaive+, 2017]
  – that includes Cauchy distribution (ν = 1) and Gaussian distribution (ν → ∞)
(Figure: Student’s t (partially stable) and SαS (stable family) overlap at the Cauchy and Gauss distributions)

38.

Source model of Student’s t-distribution
• Degree-of-freedom parameter ν
  – Heavy tail when ν approaches 0
• Complex Student’s t-dist.
  – Circularly symmetric (phase is assumed to be uniform)
  – Defined in each TF slot
  – Student’s t NMF (t-NMF) [K. Yoshii+ 2016]: scale corresponds to NMF model
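One common way to write a circularly symmetric complex Student’s t density for a single TF slot is sketched below. The symbols are assumptions for illustration: $y$ is the complex coefficient, $\sigma$ the scale, and $\nu$ the degree-of-freedom parameter.

```latex
\begin{align}
  p(y) &= \frac{1}{\pi \sigma^2}
          \left( 1 + \frac{2}{\nu} \frac{|y|^2}{\sigma^2} \right)^{-\frac{2 + \nu}{2}}, \\
  \nu = 1 &: \quad p(y) = \frac{1}{\pi \sigma^2}
          \left( 1 + \frac{2 |y|^2}{\sigma^2} \right)^{-\frac{3}{2}}
          \quad \text{(complex Cauchy)}, \\
  \nu \to \infty &: \quad p(y) \to \frac{1}{\pi \sigma^2}
          \exp\!\left( -\frac{|y|^2}{\sigma^2} \right)
          \quad \text{(complex Gaussian)}.
\end{align}
```

The density depends on $y$ only through $|y|$, which is what makes the phase uniform, and smaller $\nu$ gives heavier tails.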

39.

Motivation for using Student’s t-dist.
• Better separation with t-NMF was reported [K. Yoshii+, 2016]
  – in a very simple experiment using only C4, E4, and G4 piano tones
• NMF with heavy tail distribution
  – tends to provide excessive low-rank approximation
    • Sparse components (which may increase the rank of model data) are considered as outliers
• ILRMA based on Student’s t source model (t-ILRMA)
  – may improve the separation accuracy by forcing NMF source model to be excessively low-rank
  – will be presented at MLSP2017! (preprint is available on arXiv)
    • https://arxiv.org/abs/1708.04795

40.

Source model based on Student’s t-distribution
• pth power spectrogram corresponds to scales in TF plane
  – Complex Student’s t-distribution with TF-varying scale
(Figure: pth power spectrogram (frequency bin × time frame). Grayscale shows the value of the scale, from small to large power)

41.

Cost function in ILRMA based on Student’s t-dist.
• Negative log-likelihood in ILRMA
  – Gaussian ILRMA: modeling power spectrogram by variance
  – Student’s t ILRMA: modeling pth power spectrogram by scale
  – Student’s t ILRMA generalizes both the p.d.f. and the model domain of Gaussian ILRMA
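A sketch of how the two negative log-likelihoods compare, with constants dropped. The notation is assumed for illustration and the slide's own equations may differ in detail: $y_{ij,n}$ is the estimated signal of source $n$, $\boldsymbol{W}_i$ the demixing matrix, $t_{ik,n}$ and $v_{kj,n}$ the NMF variables, $\nu$ the degree-of-freedom parameter, and $p$ the domain parameter.

```latex
\begin{align}
  \text{Gaussian:} \quad
  \mathcal{L} &= \sum_{i,j} \left[
      \sum_{n} \frac{|y_{ij,n}|^2}{r_{ij,n}} + \sum_{n} \log r_{ij,n}
      - 2 \log \left| \det \boldsymbol{W}_i \right| \right],
  & r_{ij,n} &= \sum_{k} t_{ik,n} v_{kj,n}, \\
  \text{Student's } t\text{:} \quad
  \mathcal{L}_t &= \sum_{i,j} \left[
      \sum_{n} \left( 1 + \frac{\nu}{2} \right)
        \log \left( 1 + \frac{2}{\nu} \frac{|y_{ij,n}|^2}{\sigma_{ij,n}^2} \right)
      + 2 \sum_{n} \log \sigma_{ij,n}
      - 2 \log \left| \det \boldsymbol{W}_i \right| \right],
  & \sigma_{ij,n}^{p} &= \sum_{k} t_{ik,n} v_{kj,n}.
\end{align}
```

Setting $p = 2$ and letting $\nu \to \infty$ recovers the Gaussian cost, since $(1 + \nu/2)\log(1 + 2c/\nu) \to c$ and $2\log\sigma_{ij,n} = \log r_{ij,n}$.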

42.

Experimental results: randomized t-ILRMA
• Examples
  – Improved when
  – Stable when but score is not sufficient
  – Root spectrogram ( ) is preferable for speech signals
• In the case of
  – Source model is over-fitted to mixture
(Figure: result panels for music signals and speech signals)

43.

Tempering parameter
• Random initialization (previous result)
  – t-ILRMA (iteration: 200), initialized with an identity matrix and uniform random values
• Initialization based on Gaussian ILRMA (tempering approach of parameter)
  – Gauss ILRMA (iteration: 100), initialized with an identity matrix and uniform random values
  – t-NMF (iteration: 100), initialized with uniform random values
  – t-ILRMA (iteration: 100) with an arbitrary val. of the parameter

44.

Experimental results: initialized t-ILRMA
• Examples
  – Improved for all values of
  – Could avoid overfitting problem in the case
• Best parameter?
  – Completely depending on data
(Figure: result panels for music signals and speech signals)

45.

Average results: music signals

46.

Average results: speech signals


48.

Conclusion
• Independent low-rank matrix analysis (ILRMA)
  – Assumption
    • Statistical independence between sources
    • Low-rank time-frequency structure of each source
  – Equivalent to multichannel NMF
    • when the mixing assumption is valid
• Student’s t-distribution is newly introduced
  – including two symmetric α-stable distributions
    • Complex Cauchy distribution (ν = 1)
    • Complex Gaussian distribution (ν → ∞)
• Further extensions
  – Relaxation of rank-1 spatial model?
  – Employ another distribution?
  – Supervised ILRMA? User-guided ILRMA?