August 31, 2017

Slide overview

Daichi Kitamura, Nobutaka Ono, and Hiroshi Saruwatari, "Experimental analysis of optimal window length for independent low-rank matrix analysis," Proceedings of The 2017 European Signal Processing Conference (EUSIPCO 2017), pp. 1210–1214, Kos, Greece, August 2017 (Invited Special Session).

Presented at 25th European Signal Processing Conference (EUSIPCO) 2017, "SS14: Multivariate Analysis for Audio Signal Source Enhancement," 14:30-16:10, August 30, 2017.

http://d-kitamura.net/links_en.html

1.

25th European Signal Processing Conference (EUSIPCO) 2017
SS14: Multivariate Analysis for Audio Signal Source Enhancement
August 30, 14:30-16:10

Experimental analysis of optimal window length for independent low-rank matrix analysis

Daichi Kitamura (The University of Tokyo, Japan)
Nobutaka Ono (National Institute of Informatics, Japan)
Hiroshi Saruwatari (The University of Tokyo, Japan)

2.

Contents
• Background
  – Blind source separation (BSS) for audio signals
  – Motivation: fundamental limitation in frequency-domain BSS
• Methods
  – Frequency-domain independent component analysis (FDICA)
  – Independent vector analysis (IVA)
  – Independent low-rank matrix analysis (ILRMA)
• Experimental analysis
  – Optimal window length
    • Music signals and speech signals
    • Ideal case and more practical case
• Conclusion

4.

Background
• Blind source separation (BSS) for audio signals
  – separates the original audio sources from a recorded mixture
  – does not require prior information about the recording conditions
    • locations of mics and sources, room geometry, timbres, etc.
  – is applicable to many audio applications
  [Figure: recorded mixture → BSS → separated guitar]
• Consider only the "determined" situation: # of mics = # of sources
  [Figure: sources → mixing system → observed signals → demixing system → estimated signals]

5.

History of BSS for audio signals
• Basic theories and their evolution (*depicting only popular methods)
  – ICA lineage
    • 1994: Independent component analysis (ICA)
    • 1998: Frequency-domain ICA (FDICA)
    • Many permutation solvers for FDICA
    • 2006: Independent vector analysis (IVA)
    • 2011: Auxiliary-function-based IVA (AuxIVA)
    • 2012: Time-varying Gaussian IVA
  – NMF lineage
    • 1999: Nonnegative matrix factorization (NMF)
    • Apply NMF to many tasks; generative models in NMF; many extensions of NMF
    • 2009: Itakura–Saito NMF (ISNMF)
    • 2013: Multichannel NMF
  – 2016: Independent low-rank matrix analysis (ILRMA)

7.

Motivation: fundamental limitation of BSS
• Mixing assumption in frequency-domain BSS
  $\mathbf{x}_{ij} = \mathbf{A}_i \mathbf{s}_{ij}$, where $i$ indexes frequency bins and $j$ indexes time frames
  ($\mathbf{x}_{ij}$: observed multichannel signal, $\mathbf{A}_i$: frequency-wise mixing matrix, $\mathbf{s}_{ij}$: source signals)
  – "Linear time-invariant mixture" or "rank-1 spatial model"
  – Valid only when the window length used in the STFT is longer than the room reverberation
• Too long a window also causes another problem
  – The number of time frames (samples) decreases, so the statistical bias increases and the estimation becomes unstable
  – FDICA suffers from this trade-off
  – What about BSS methods with a structural source model?
    • IVA and ILRMA
• Trade-off between short and long windows [S. Araki+, 2003]
  [Figure: performance vs. window length, with a peak at the optimal length]
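The frequency-wise mixing assumption on this slide can be illustrated with a small NumPy sketch. The array sizes, the random mixing matrices, and the names `A`, `S`, `X`, and `crandn` are assumptions chosen for illustration; they do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, N = 4, 100, 2  # frequency bins, time frames, sources (= mics, determined case)

def crandn(*shape):
    """Circularly symmetric complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

A = crandn(I, N, N)  # one time-invariant N x N mixing matrix per frequency bin
S = crandn(I, J, N)  # source STFT coefficients, one N-vector per TF bin

# Observed signal: the same matrix A[i] is applied to every frame j of bin i
X = np.einsum("inm,ijm->ijn", A, S)
```

Because `A[i]` does not depend on the frame index, the model cannot represent reverberation that spills over from one frame into the next, which is why the window has to cover the reverberation.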

9.

BSS methods: FDICA and IVA
• Frequency-domain ICA (FDICA) [P. Smaragdis, 1998]
  – Source model: scalar random variables that obey a non-Gaussian distribution and are mutually independent
  – The mixture is close to a Gaussian signal because of the central limit theorem (CLT)
  – Update the separation filter (demixing matrix) so that the estimated signals obey the assumed non-Gaussian distribution
  [Figure: observed STFT → demixing matrix → estimated STFT; current empirical distribution vs. non-Gaussian source distribution]
• Independent vector analysis (IVA) [A. Hiroe, 2006], [T. Kim, 2006]
  – Source model: vector (multivariate) random variables over frequency that obey a non-Gaussian spherical distribution and are mutually independent
  – Update the separation filter (demixing matrix) so that the estimated signals obey the assumed non-Gaussian distribution
  [Figure: observed STFT → demixing matrix → estimated STFT; current empirical distribution vs. non-Gaussian spherical source distribution]

10.

Extension of source distribution in IVA
• Spherical Laplace distribution in IVA
  – Defined on the frequency vector (I-dimensional)
  – Frequency-uniform scale
  [Figure: spherical Laplace distribution (bivariate)]
• Extended to a more flexible model: zero-mean complex Gaussian distribution with TF-varying variance (Itakura-Saito NMF) [C. Févotte+, 2009]
  – Zero-mean complex Gaussian in each TF bin
  – Defined on the time-frequency matrix (IJ-dimensional)
  – Time-frequency-varying variance
  – Low-rank decomposition with NMF
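For reference, the two source models named on this slide are commonly written as below. The notation is assumed here rather than taken from the slide: $\mathbf{s}_j$ is the I-dimensional frequency vector of one source at frame $j$, $\sigma$ is the frequency-uniform scale, $r_{ij}$ is the time-frequency-varying variance, and $t_{ik}$, $v_{kj}$ are NMF parameters with $K$ bases.

```latex
% Spherical Laplace model used in IVA (one source, frame j)
p(\mathbf{s}_j) \propto \exp\!\left(-\frac{\lVert \mathbf{s}_j \rVert_2}{\sigma}\right),
\qquad \mathbf{s}_j = (s_{1j}, \dots, s_{Ij})^{\mathsf{T}}

% Zero-mean complex Gaussian model with TF-varying variance (ISNMF)
p(\mathbf{S}) = \prod_{i,j} \frac{1}{\pi r_{ij}} \exp\!\left(-\frac{|s_{ij}|^{2}}{r_{ij}}\right),
\qquad r_{ij} = \sum_{k=1}^{K} t_{ik} v_{kj}
```

The first model shares a single scale across all frequencies, whereas the second allows a separate variance in every TF bin and constrains it through the low-rank product.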

11.

Generative source model in ISNMF
• Complex Gaussian distribution with TF-varying variance
• The power spectrogram corresponds to the variances in the TF plane
  [Figure: power spectrogram (frequency bin vs. time frame); the grayscale shows the value of the variance, from small to large power]
• If we marginalize over time or frequency, the distribution becomes non-Gaussian even though each TF bin follows a Gaussian distribution
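The last point can be checked numerically. The sketch below draws complex Gaussian samples whose variance changes across the TF plane and compares the pooled samples with a constant-variance reference. The gamma-distributed variances, the sizes, and all function names are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 64, 2000  # frequency bins, time frames (arbitrary sizes)

def crandn(*shape):
    """Circularly symmetric complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def normalized_fourth_moment(s):
    """E|s|^4 / (E|s|^2)^2, which equals 2 for a circular complex Gaussian."""
    p = np.abs(s) ** 2
    return np.mean(p ** 2) / np.mean(p) ** 2

# Hypothetical TF-varying variances r_ij (any nonnegative pattern would do)
R = rng.gamma(shape=0.5, scale=1.0, size=(I, J))
S_varying = np.sqrt(R) * crandn(I, J)  # Gaussian in each TF bin with variance r_ij
S_fixed = crandn(I, J)                 # same model with a constant variance

m_fixed = normalized_fourth_moment(S_fixed)      # close to the Gaussian value
m_varying = normalized_fourth_moment(S_varying)  # larger: heavier tails when pooled
```

Pooling all TF bins plays the role of marginalization here; a normalized fourth moment above the Gaussian value indicates a super-Gaussian (heavy-tailed) marginal.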

12.

BSS methods: ILRMA
• Independent low-rank matrix analysis (ILRMA) [D. Kitamura+, 2016]
  – Unification of IVA and ISNMF
  – Update the demixing matrix so that the estimated signals have a low-rank structure in the time-frequency domain
  [Figure: observed STFT → demixing matrix → estimated STFT, with a low-rank decomposition of each estimated spectrogram]
  – Source model in ILRMA
    [Figure: time-frequency matrix of each source represented by a small number of bases]
    • The number of bases can be set to an arbitrary value
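A compact sketch of how such an alternating update could look is given below. It follows the commonly published form of the algorithm (multiplicative Itakura-Saito NMF updates for the source model and an iterative-projection update for the demixing matrix), but it is not the authors' code: the function name `ilrma`, its parameters, the lower bound of 0.1 on the random initial values, and the synthetic test data are all assumptions, and practical steps such as scale normalization and back-projection are omitted.

```python
import numpy as np

def ilrma(X, n_bases=2, n_iter=20, seed=0, eps=1e-12):
    """Sketch of ILRMA. X: complex mixture STFT of shape (I, J, N) with
    I frequency bins, J time frames, N channels (= sources, determined case).
    Returns the separated STFT Y (I, J, N) and demixing matrices W (I, N, N)."""
    n_bins, n_frames, n_src = X.shape
    rng = np.random.default_rng(seed)
    W = np.tile(np.eye(n_src, dtype=complex), (n_bins, 1, 1))  # row n of W[i] is w_in^H
    T = rng.uniform(0.1, 1.0, (n_src, n_bins, n_bases))        # NMF bases per source
    V = rng.uniform(0.1, 1.0, (n_src, n_bases, n_frames))      # NMF activations per source
    Y = np.einsum("inm,ijm->ijn", W, X)
    for _ in range(n_iter):
        P = np.abs(Y) ** 2
        for n in range(n_src):
            Pn = P[:, :, n]
            # Itakura-Saito NMF multiplicative updates of the low-rank source model
            R = T[n] @ V[n]
            T[n] = np.maximum(T[n] * np.sqrt(((Pn / R**2) @ V[n].T) / ((1 / R) @ V[n].T)), eps)
            R = T[n] @ V[n]
            V[n] = np.maximum(V[n] * np.sqrt((T[n].T @ (Pn / R**2)) / (T[n].T @ (1 / R))), eps)
            R = T[n] @ V[n]
            # Iterative projection: variance-weighted covariance, then update row n
            U = np.einsum("ij,ijm,ijk->imk", 1 / R, X, X.conj()) / n_frames
            w = np.linalg.inv(W @ U)[:, :, n]  # (W_i U_in)^{-1} e_n for every bin i
            scale = np.sqrt(np.real(np.einsum("im,imk,ik->i", w.conj(), U, w)))
            W[:, n, :] = (w / scale[:, None]).conj()
        Y = np.einsum("inm,ijm->ijn", W, X)
    return Y, W

# Synthetic example: two sources with rank-2 variance patterns, random mixing
rng = np.random.default_rng(1)
I, J, N = 16, 200, 2
crandn = lambda *s: (rng.standard_normal(s) + 1j * rng.standard_normal(s)) / np.sqrt(2)
var = np.stack([rng.random((I, 2)) @ rng.random((2, J)) for _ in range(N)], axis=2)
S = np.sqrt(var) * crandn(I, J, N)
X = np.einsum("inm,ijm->ijn", crandn(I, N, N), S)
Y, W = ilrma(X)
```

Setting `n_bases=1` with a frequency-flat basis would bring the source model close to the time-varying Gaussian IVA model, which is one way to see the method as a unification of IVA and ISNMF.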

13.

Comparison of source models
• FDICA source model: non-Gaussian scalar variable
• IVA source model: non-Gaussian vector variable with higher-order correlation
• ILRMA source model: non-Gaussian matrix variable with low-rank time-frequency structure
  – Rank of the TF matrix of the mixture > rank of the TF matrix of each source
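The rank relation behind the ILRMA source model can be illustrated with synthetic nonnegative matrices. This is an idealized sketch: the sizes and names are hypothetical, and the mixture is formed by adding the two matrices directly, which real power spectrograms satisfy only approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_frames, n_bases = 50, 80, 3

def low_rank_spectrogram():
    """Hypothetical nonnegative TF matrix built from n_bases bases."""
    return rng.random((n_bins, n_bases)) @ rng.random((n_bases, n_frames))

source_1 = low_rank_spectrogram()
source_2 = low_rank_spectrogram()
mixture = source_1 + source_2  # idealization: powers of the sources add up

ranks = [int(np.linalg.matrix_rank(m)) for m in (source_1, source_2, mixture)]
```

The mixture needs more bases than either source, so pushing each estimated signal toward a low-rank structure favors separated outputs over mixed ones.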

15.

Experimental analysis
• Window length in STFT
  [Figure: the waveform is cut into frames by a window function moved by the shift length; a DFT of each frame gives one column of the spectrogram (frequency vs. time); window length = DFT length]
  – If the window length is too short
    • The mixing assumption does not hold anymore
  – If the window length is too long
    • Estimation becomes unstable (# of time frames decreases)
• Our expectation
  – Full time-frequency modeling of sources in ILRMA may improve the robustness to a decrease in the number of time frames
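A minimal STFT sketch makes the link between window length and the number of time frames concrete. The function `stft`, the 16 kHz sampling rate, the noise input, and the choice to drop the incomplete last frame are assumptions for illustration and are not stated in the slides.

```python
import numpy as np

def stft(x, win_len, shift):
    """Return a (frequency bins, time frames) complex spectrogram."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // shift  # no padding: the tail is dropped
    frames = np.stack(
        [x[j * shift : j * shift + win_len] * window for j in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)  # DFT length equals the window length

fs = 16000  # assumed sampling rate [Hz]
x = np.random.default_rng(0).standard_normal(10 * fs)  # 10 s of noise as a stand-in

short = stft(x, win_len=512, shift=128)     # 32 ms window, quarter shift
long_ = stft(x, win_len=8192, shift=2048)   # 512 ms window, quarter shift
```

Comparing `short.shape` with `long_.shape` shows the trade-off: the longer window yields many more frequency bins but far fewer frames from which to estimate each demixing matrix.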

16.

Experimental analysis
• Dataset: 4 music and 4 speech signals from SiSEC [S. Araki+, 2012]

  Signal | Data name                        | Source (1/2)              | Length [s]
  Music  | bearlin-roads                    | acoustic_guit_main/vocals | 14.6
  Music  | another_dreamer-the_ones_we_love | guitar/vocals             | 25.6
  Music  | fort_minor-remember_the_name     | violins_synth/vocals      | 24.6
  Music  | ultimate_nz_tour                 | guitar/synth              | 18.6
  Speech | dev1_female4                     | src_1/src_2               | 10.0
  Speech | dev1_female4                     | src_3/src_4               | 10.0
  Speech | dev1_male4                       | src_1/src_2               | 10.0
  Speech | dev1_male4                       | src_3/src_4               | 10.0

• Mixing: convolution with room impulse responses (RIRs) in RWCP [S. Nakamura+, 2000]
  – Impulse response E2A (reverberation time: T60 = 300 ms)
    [Figure: layout with Source 1 and Source 2 at a distance of 2 m, angles of 50° and 50°, and a microphone spacing of 5.66 cm]
  – Impulse response JR2 (reverberation time: T60 = 470 ms)
    [Figure: layout with Source 1 and Source 2 at a distance of 2 m, angles of 60° and 60°, and a microphone spacing of 5.66 cm]
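Convolutive mixing of this kind can be sketched as follows. The measured RWCP impulse responses are replaced by a hypothetical stand-in (exponentially decaying noise that reaches -60 dB at T60), and the sampling rate, signal lengths, and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000  # assumed sampling rate [Hz]

def synthetic_rir(t60, length_s=0.5):
    """Hypothetical stand-in for a measured RIR: noise whose envelope
    decays by 60 dB after t60 seconds."""
    t = np.arange(int(length_s * fs)) / fs
    return rng.standard_normal(t.size) * 10 ** (-3 * t / t60)

def mix(sources, rirs):
    """sources: list of N 1-D signals; rirs[m][n]: RIR from source n to mic m.
    Returns an array of shape (mics, samples)."""
    n_out = max(len(s) for s in sources) + max(len(h) for row in rirs for h in row) - 1
    mics = np.zeros((len(rirs), n_out))
    for m, row in enumerate(rirs):
        for s, h in zip(sources, row):
            y = np.convolve(s, h)   # source image at microphone m
            mics[m, : len(y)] += y  # microphone signal is the sum of the images
    return mics

sources = [rng.standard_normal(fs), rng.standard_normal(fs)]  # 1 s placeholders
rirs = [[synthetic_rir(0.3) for _ in range(2)] for _ in range(2)]
x = mix(sources, rirs)
```

With T60 = 0.3 s the impulse response spans several thousand samples, far longer than a 32 ms window, which is the regime where the frequency-wise mixing assumption starts to fail.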

17.

Experimental analysis
• Compared methods
  – FDICA+IPS (ideal permutation solver)
    • Align the permutation of estimated components using the reference (oracle) source spectrogram (upper-limit performance of FDICA)
  – FDICA+DOA (DOA-based permutation solver) [S. Kurita+, 2000]
    • Align the permutation of estimated components using DOA after FDICA
  – IVA [N. Ono, 2011]
    • Uses the auxiliary function method (a.k.a. MM algorithm) for optimization
  – ILRMA [D. Kitamura+, 2016]
    • With several numbers of bases
• Other conditions
  – Window function: Hamming window
  – Window length: 32 ~ 2048 ms
  – Shift length: always a quarter of the window length
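To get a feel for how quickly the amount of data shrinks over this range, the sketch below counts the time frames for a 10 s signal with a quarter-window shift. The 16 kHz sampling rate, the power-of-two grid of window lengths, and the no-padding convention are assumptions; the slides give only the range 32 ~ 2048 ms.

```python
def frame_count(duration_s, win_ms, fs=16000):
    """Number of full frames without padding; fs is an assumed sampling rate."""
    n_samples = int(round(duration_s * fs))
    win = fs * win_ms // 1000
    shift = win // 4  # shift is always a quarter of the window length
    return max(0, 1 + (n_samples - win) // shift)

window_lengths_ms = [32, 64, 128, 256, 512, 1024, 2048]  # assumed grid
counts = {w: frame_count(10.0, w) for w in window_lengths_ms}

for w, c in counts.items():
    print(f"{w:5d} ms window: {c:5d} frames")
```

Under these assumptions the count falls from over a thousand frames at 32 ms to a few tens or fewer at the longest windows, while the number of frequency bins, and therefore of demixing matrices to estimate, grows in proportion to the window length.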

18.

Comparison using ideal initialization: condition
• Set the initial value of the demixing matrix to the oracle value: [equation]
  – This initial value provides the best separation performance under the assumption [equation]
• Set the initial value of the source model to the oracle value, the power spectrogram of each source (only for ILRMA): [equation]
• FDICA+DOA & IVA: spatial oracle initialization
• FDICA+IPS & ILRMA: spatial and spectral oracle initialization
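The exact oracle formulas are not preserved in this transcript. As one plausible reading, the sketch below computes a per-frequency least-squares demixing matrix from reference source spectrograms and takes the reference power spectrograms as the oracle source model. The functions `oracle_demixing` and `oracle_source_model` and the synthetic data are assumptions, not the authors' definitions.

```python
import numpy as np

def oracle_demixing(X, S):
    """Least-squares W_i minimizing sum_j ||s_ij - W_i x_ij||^2 in each bin.
    X, S: (I, J, N) mixture and reference source STFTs. Returns (I, N, N)."""
    Sx = np.einsum("ijn,ijm->inm", S, X.conj())  # sum_j s_ij x_ij^H
    Xx = np.einsum("ijn,ijm->inm", X, X.conj())  # sum_j x_ij x_ij^H
    return Sx @ np.linalg.inv(Xx)

def oracle_source_model(S):
    """Power spectrogram of each reference source, shape (I, J, N)."""
    return np.abs(S) ** 2

# Synthetic check under the frequency-wise mixing assumption
rng = np.random.default_rng(0)
I, J, N = 4, 200, 2
crandn = lambda *s: (rng.standard_normal(s) + 1j * rng.standard_normal(s)) / np.sqrt(2)
A, S = crandn(I, N, N), crandn(I, J, N)
X = np.einsum("inm,ijm->ijn", A, S)

W0 = oracle_demixing(X, S)
Y0 = np.einsum("inm,ijm->ijn", W0, X)
R0 = oracle_source_model(S)
```

When the mixture follows the frequency-wise model exactly, this least-squares solution coincides with the inverse of the mixing matrix, so `Y0` reproduces the references; with real reverberant mixtures it is only the best frequency-wise linear approximation.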

19.

Comparison using ideal initialization: results
[Figure: four result panels: Speech (T60 = 0.30 s), Music (T60 = 0.30 s), Speech (T60 = 0.47 s), Music (T60 = 0.47 s)]

20.

Comparison using random initialization: condition
• Set the initial value of the demixing matrix to the identity matrix
• Set the initial value of the source model to uniform random values in [0, 1] (only for ILRMA)
• FDICA+DOA, IVA, & ILRMA: fully blind methods
• FDICA+IPS: uses the oracle spectrogram

21.

Comparison using random initialization: results
[Figure: four result panels: Speech (T60 = 0.30 s), Music (T60 = 0.30 s), Music (T60 = 0.47 s), Speech (T60 = 0.47 s)]

22.

Conclusion
• In the case of ILRMA with oracle initialization, the robustness to long windows (fewer time frames) can be improved
  – the optimal window length is longer than that in FDICA or IVA
  – this is thanks to employing not only the independence between sources but also a full model of the time-frequency structure for the estimation of the demixing matrix
• In a practical situation (fully blind case),
  – the optimal window length is similar to that in FDICA or IVA
  – this is due to the difficulty of blindly estimating a precise spectral model in ILRMA

Thank you for your attention!