
September 07, 15

Slide overview

Presented at The 2015 European Signal Processing Conference (EUSIPCO 2015, international conference)

Daichi Kitamura, Nobutaka Ono, Hiroshi Sawada, Hirokazu Kameoka, Hiroshi Saruwatari, "Relaxation of rank-1 spatial constraint in overdetermined blind source separation," Proceedings of The 2015 European Signal Processing Conference (EUSIPCO 2015), pp.1271-1275, Nice, France, September 2015 (Invited Special Session).

http://d-kitamura.net/links_en.html

1.

EUSIPCO 2015, 2 Sept., 14:30 - 16:10, SS30 Acoustic scene analysis using microphone array
Relaxation of Rank-1 Spatial Constraint in Overdetermined Blind Source Separation
Daichi Kitamura (SOKENDAI)
Nobutaka Ono (NII/SOKENDAI)
Hiroshi Sawada (NTT)
Hirokazu Kameoka (The Univ. of Tokyo/NTT)
Hiroshi Saruwatari (The Univ. of Tokyo)

2.

Research Background
• Blind source separation (BSS)
  – Estimation of the original sources from the mixture signal
  – We focus only on overdetermined situations
    • Number of sources ≤ number of microphones
    • Ex) Independent component analysis, independent vector analysis
[Figure: original sources → mixing system (unknown) → observation (mixture) → BSS → estimated sources]
• Applications of BSS
  – Acoustic scene analysis, speech enhancement, music analysis, reproduction of sound field, etc.

3.

Problems and Motivations
• For reverberant signals
  – ICA-based methods cannot separate the sources well because they assume a linear time-invariant mixing system (instantaneous mixing in the time-frequency domain)
  – When the number of microphones is greater than the number of sources, PCA is often applied before BSS to remove the weak (reverberant) components of all the sources
[Figure: original sources → mixing → observed signals → PCA → dimension-reduced signals → BSS → estimated sources]
• Reverberation is also important information for analyzing acoustic scenes
  – We should separate the sources together with their own reverberation.
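The PCA step mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes a zero-mean complex STFT stored as an array of shape (frequency bins, time frames, microphones), and the function name `pca_reduce` and the parameter `n_keep` are hypothetical.

```python
import numpy as np

def pca_reduce(X, n_keep):
    """Project each frequency bin onto its n_keep strongest principal axes.

    X: complex STFT, shape (n_freq, n_frames, n_mics), assumed zero-mean.
    Returns an array of shape (n_freq, n_frames, n_keep).
    """
    n_freq, n_frames, _ = X.shape
    Y = np.empty((n_freq, n_frames, n_keep), dtype=complex)
    for i in range(n_freq):
        Xi = X[i]                              # rows are the observations x_t^T
        cov = Xi.T @ Xi.conj() / n_frames      # sample covariance E[x x^H]
        _, U = np.linalg.eigh(cov)             # eigenvalues in ascending order
        U = U[:, ::-1][:, :n_keep]             # keep the strongest axes first
        Y[i] = Xi @ U.conj()                   # y = U^H x, written row-wise
    return Y

# Toy input: 3 frequency bins, 50 frames, 4 microphones, reduced to 2 channels.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50, 4)) + 1j * rng.standard_normal((3, 50, 4))
Y = pca_reduce(X, n_keep=2)
```

The discarded axes are the low-energy ones, which is why this preprocessing tends to throw away weak reverberant components together with noise.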

4.

Conventional Methods (1/4)
• Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006]
  – assumes independence between source vectors
  – assumes a linear time-invariant mixing system
    • The mixing system can be represented by a mixing matrix in each frequency bin.
  – can be optimized efficiently [Ono, 2011]
[Figure: original sources → mixing matrices → observed signals → demixing matrices → estimated sources]

5.

Conventional Methods (2/4)
• Nonnegative matrix factorization (NMF) [Lee, 2001]
  – decomposes a spectrogram into spectral bases
[Figure: observed matrix (power spectrogram, frequency × time) ≈ basis matrix (spectral patterns, frequency × basis) × activation matrix (time-varying gains, basis × time); the dimensions are the number of frequency bins, the number of time frames, and the number of bases]
  – The decomposed bases must be clustered into each source.
    • Very difficult problem
  – A multichannel extension of NMF has been proposed.
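As a rough illustration of the decomposition on this slide, the sketch below runs NMF with the classic multiplicative updates for the squared Euclidean cost. The choice of cost function, the function name `nmf`, and the toy input are assumptions made for the example; the slide does not state which divergence is used.

```python
import numpy as np

def nmf(X, n_bases, n_iter=200, seed=0, eps=1e-12):
    """Factorize a nonnegative matrix X (freq x time) as T @ V.

    T: basis matrix (freq x bases), V: activation matrix (bases x time).
    Uses multiplicative updates for the squared Euclidean distance.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = X.shape
    T = rng.random((n_freq, n_bases)) + eps
    V = rng.random((n_bases, n_frames)) + eps
    for _ in range(n_iter):
        T *= (X @ V.T) / (T @ V @ V.T + eps)   # update spectral patterns
        V *= (T.T @ X) / (T.T @ T @ V + eps)   # update time-varying gains
    return T, V

# Toy "spectrogram" that is exactly a product of two nonnegative matrices.
rng = np.random.default_rng(1)
X = rng.random((64, 5)) @ rng.random((5, 100))
T, V = nmf(X, n_bases=5)
rel_err = np.linalg.norm(X - T @ V) / np.linalg.norm(X)
```

Because the updates are multiplicative, `T` and `V` stay nonnegative as long as they are initialized with nonnegative values.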

6.

Conventional Methods (3/4)
• Multichannel NMF (MNMF) [Ozerov, 2010], [Sawada, 2013]
[Figure: model diagram. Multichannel observation: multichannel vector, instantaneous covariance (time-frequency-wise channel correlations). Spatial model: source-frequency-wise spatial covariances. Source model: cluster indicator, basis matrix (spectral patterns), activation matrix (gains).]

7.

Conventional Methods (4/4)
• MNMF with rank-1 spatial model (Rank-1 MNMF) [Kitamura, ICASSP 2015]
  – Spatial model: the source-frequency-wise spatial covariances are modeled by rank-1 matrices (constraint) = linear mixing assumption, as in IVA; can be optimized by IVA
  – Source model (cluster indicator, basis matrix of spectral patterns, activation matrix of gains): can be optimized by simple NMF
• We can optimize all the variables using the update rules of IVA and simple NMF.

8.

Rank-1 Spatial Constraint
• Rank-1 spatial constraint = linear mixing assumption
  – Instantaneous mixture in the time-frequency domain
  – The mixing system can be represented by a mixing matrix
[Figure: source signal → time-invariant mixing matrix → observed signal; observed spectrogram (frequency vs. time)]
• Assumed conditions
  1. Sources can be modeled as point sources
  2. Reverberation time is shorter than the FFT length
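In symbols, the linear mixing assumption on this slide can be written as below. The notation is an assumption introduced for illustration, since the slide's own equations did not survive: i is the frequency bin, j the time frame, n the source index, and a_{i,n} the steering vector of source n (the n-th column of the mixing matrix A_i).

```latex
\mathbf{x}_{ij} = \mathbf{A}_i \mathbf{s}_{ij}
               = \sum_{n=1}^{N} \mathbf{a}_{i,n}\, s_{ij,n},
\qquad
\mathbf{R}_{i,n} = \mathbf{a}_{i,n} \mathbf{a}_{i,n}^{\mathsf{H}},
\quad \operatorname{rank}(\mathbf{R}_{i,n}) = 1 .
```

Each source then contributes a spatial covariance that is the outer product of a single vector, which is where the name "rank-1 spatial constraint" comes from.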

9.

Problem of Rank-1 Spatial Model
• When the reverberation time is longer than the FFT length,
  – the impulse response becomes long
  – reverberant components leak into the next time frame
[Figure: source signal → observed signal; observed spectrogram (frequency vs. time) with leaked components]
• The mixing system cannot be represented using only the time-invariant mixing matrix.
• The separation performance markedly degrades.

10.

Summary of Conventional Methods
• MNMF [Ozerov, 2010], [Sawada, 2013]
  – Full-rank spatial model
    • does not use the rank-1 spatial constraint
  – High computational cost
  – Strong dependence on initial values
• IVA [Hiroe, 2006], [Kim, 2006] & Rank-1 MNMF [Kitamura, 2015]
  – Rank-1 spatial constraint (linear mixing assumption)
    • Separation performance degrades for reverberant signals
  – Faster and more stable optimization
To achieve good and stable separation even for reverberant signals, relax the rank-1 spatial constraint while maintaining efficient optimization.

11.

Proposed Approach
• Utilize extra observations to model direct and reverberant components simultaneously.
  – More microphones than sources are used
[Figure: conventional pipeline: original sources → mixing → observed signals → PCA → dimension-reduced signals → BSS → estimated sources]
• Dimensionality reduction with principal component analysis (PCA)
  – removes the reverberant components of all the sources
  – But the reverberant components are important!

12.

Proposed Approach
• Utilize extra observations to model direct and reverberant components simultaneously.
  – More microphones than sources are used
[Figure: proposed pipeline: original sources → mixing → observed signals → BSS (IVA or Rank-1 MNMF) → separated components (direct and reverb. components of each source) → reconstruction → estimated sources]
• We assume independence not only between the sources but also between the direct and reverberant components of the same source.

13.

Clustering of Separated Components
• Permutation problem of separated components
  – The order of the separated components depends on the initial values
  – Which separated components belong to which source?
[Figure: separated components → clustering → clustered components (direct and reverb. components of source 1; direct and reverb. components of source 2) → reconstruction → estimated sources]
• We propose two methods to cluster the components
  – 1. Using cross-correlations, for IVA
  – 2. Sharing basis matrices, for Rank-1 MNMF

14.

Clustering Using Spectrogram Correlation
• Direct and reverberant components of the same source have a strong cross-correlation.
[Figure: power spectrograms of the separated components]
• Cross-correlation of two power spectrograms
  – Calculate it for all combinations of separated components
  – Merge the components in descending order of cross-correlation
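The merging procedure above might look like the following sketch. Several details are assumptions made for the example and are not given on the slide: the correlation coefficient of flattened power spectrograms is used as the cross-correlation, merging stops once the number of clusters equals the number of sources, and reconstruction is simplified to summing the components of a cluster. The function name `cluster_by_correlation` is hypothetical.

```python
import numpy as np

def cluster_by_correlation(components, n_sources):
    """Greedily merge separated components with correlated power spectrograms.

    components: array of shape (n_comp, n_freq, n_frames).
    Returns a list of index lists, one per estimated source.
    """
    power = np.abs(components) ** 2
    flat = power.reshape(len(power), -1)
    corr = np.corrcoef(flat)                   # pairwise correlation coefficients
    n_comp = len(flat)
    pairs = sorted(
        ((corr[a, b], a, b) for a in range(n_comp) for b in range(a + 1, n_comp)),
        reverse=True,                          # strongest correlation first
    )
    label = list(range(n_comp))                # every component starts alone
    n_clusters = n_comp
    for _, a, b in pairs:
        if n_clusters == n_sources:
            break
        if label[a] != label[b]:               # merge the two clusters
            old, new = label[b], label[a]
            label = [new if l == old else l for l in label]
            n_clusters -= 1
    clusters = {}
    for idx, l in enumerate(label):
        clusters.setdefault(l, []).append(idx)
    return list(clusters.values())

# Toy data: components 0 and 2 come from source 1, components 1 and 3 from source 2.
rng = np.random.default_rng(0)
src1 = rng.random((16, 40))
src2 = rng.random((16, 40))
components = np.stack([
    src1,
    src2,
    0.5 * src1 + 0.01 * rng.random((16, 40)),
    0.5 * src2 + 0.01 * rng.random((16, 40)),
])
clusters = cluster_by_correlation(components, n_sources=2)
estimates = [components[idx].sum(axis=0) for idx in clusters]
```

In the toy data the "reverberant" components are just scaled, slightly perturbed copies of the "direct" ones, so the expected grouping is known in advance.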

15.

Auto-Clustering by Sharing Basis Matrix
• Direct and reverberant components can be modeled by the same bases (spectral patterns)
• Estimate the signals with Basis-Shared Rank-1 MNMF
[Figure: source model of Basis-Shared Rank-1 MNMF: the direct and reverb. components of source 1 use the shared basis matrix for source 1, and the direct and reverb. components of source 2 use the shared basis matrix for source 2; separated components → reconstruction → estimated sources]
  – Only for Rank-1 MNMF
    • because IVA does not have an NMF source model
  – By imposing the basis-shared source model, Rank-1 MNMF can automatically cluster the components.
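One way to write the basis-shared source model is sketched below. The notation is an assumption for illustration and is not taken from the slides: r denotes the modeled power of a component, t_{ik,n} the k-th shared basis of source n, and v the component-specific activations.

```latex
r^{\mathrm{dir}}_{ij,n} = \sum_{k} t_{ik,n}\, v^{\mathrm{dir}}_{kj,n},
\qquad
r^{\mathrm{rev}}_{ij,n} = \sum_{k} t_{ik,n}\, v^{\mathrm{rev}}_{kj,n} .
```

Under this reading, the two components of a source differ only in their activations, so components that are explained by the same basis matrix are assigned to the same source without a separate clustering step.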

16.

Experiments
• Conditions
  – Original source: professionally produced music signals from the SiSEC database; the JR2 impulse response in the RWCP database is used; two sources and four microphones
  – Sampling frequency: downsampled from 44.1 kHz to 16 kHz
  – FFT length in STFT: 8192 points (128 ms, Hamming window)
  – Shift length in STFT: 2048 points (64 ms)
  – Number of bases: 15 bases for each source (30 bases for all the sources)
  – Number of iterations: 200
  – Number of trials: 10 times with various seeds of random initialization
  – Evaluation criterion: average SDR improvement and its deviation
• JR2 impulse response
  – Reverberation time: 470 ms
  – Microphone spacing: 2.83 cm
[Figure: recording layout with source 1 and source 2; labels: 2 m, 80, 60]

17.

Experiments
• Compared methods (7 methods)
  Conventional methods
  – PCA + 2ch IVA
    • Apply PCA before IVA
  – PCA + 2ch Rank-1 MNMF
    • Apply PCA before Rank-1 MNMF
  Proposed methods
  – 4ch IVA + Clustering
    • Apply IVA without PCA, and cluster the components
  – 4ch Basis-Shared Rank-1 MNMF
    • Apply Basis-Shared Rank-1 MNMF without PCA
  Conventional methods
  – 4ch MNMF-based BF (beamforming)
    • Apply maximum SNR beamforming (time-invariant filtering) using the full-rank covariance estimated by 4ch MNMF
  – 4ch MNMF
    • Apply conventional MNMF (full-rank model), and apply multichannel Wiener filtering (time-variant filtering)
  Reference score
  – Ideal time-invariant filtering
    • The upper limit of time-invariant filtering (supervised)

18.

Experiments
• Results (song: ultimate_nz_tour__snip_43_61)
  – Source 1: Guitar
  – Source 2: Vocals
[Figure: bar chart of SDR improvement [dB] (axis from 0 to 16) for source 1 and source 2. Bars, left to right: PCA + 2ch IVA; PCA + 2ch Rank-1 MNMF; 4ch IVA + Clustering; 4ch Basis-Shared Rank-1 MNMF; 4ch MNMF-based BF; 4ch MNMF; Ideal time-invariant filtering (supervised). Annotations: rank-1 spatial model, time-invariant filter (1/src); rank-1 spatial model, time-invariant filter (2/src); full-rank spatial model, time-invariant filter (1/src); full-rank model, time-variant filter (1/src); upper limit of time-invariant filter (1/src).]

19.

Experiments
• Results (song: bearlin-roads__snip_85_99)
  – Source 1: Acoustic guitar
  – Source 2: Piano
[Figure: bar chart of SDR improvement [dB] (axis from -4 to 12) for source 1 and source 2. Bars, left to right: PCA + 2ch IVA; PCA + 2ch Rank-1 MNMF; 4ch IVA + Clustering; 4ch Basis-Shared Rank-1 MNMF; 4ch MNMF-based BF; 4ch MNMF; Ideal time-invariant filtering (supervised).]

20.

Experiments
• Comparison of computational times
  – Conditions
    • CPU: Intel Core i7-4790 (3.60 GHz)
    • MATLAB 8.3 (64-bit)
    • Song: ultimate_nz_tour__snip_43_61 (18 s, 16 kHz sampling)
  – Results
    • PCA + 2ch IVA: 23.4 s
    • PCA + 2ch Rank-1 MNMF: 29.4 s
    • 4ch IVA + Clustering: 60.1 s
    • 4ch Basis-Shared Rank-1 MNMF: 143.9 s (2.4 min)
    • 4ch MNMF: 3611.8 s (1 h!)
• The proposed methods achieve efficient optimization compared with MNMF (the performance is comparable with MNMF).
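The gap between the methods is easier to judge as a ratio. The short calculation below only re-expresses the times listed on this slide; the dictionary keys are shortened method names.

```python
# Computation times in seconds, as reported on the slide.
times_s = {
    "PCA + 2ch IVA": 23.4,
    "PCA + 2ch Rank-1 MNMF": 29.4,
    "4ch IVA + Clustering": 60.1,
    "4ch Basis-Shared Rank-1 MNMF": 143.9,
    "4ch MNMF": 3611.8,
}

# Ratio of the 4ch MNMF time to each method's time.
speedup = {name: times_s["4ch MNMF"] / t for name, t in times_s.items()}

for name, t in times_s.items():
    print(f"{name}: {t / 60:.1f} min, {speedup[name]:.1f}x relative to 4ch MNMF")
```

With these figures, 4ch Basis-Shared Rank-1 MNMF is roughly 25 times faster than 4ch MNMF, and 4ch IVA + Clustering roughly 60 times faster.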

21.

Conclusion
• For the case of reverberant signals
  – Achieve both good performance and efficient optimization
• The proposed method
  – can be applied when the number of microphones is greater than twice the number of sources
  – separately estimates the direct and reverberant components by utilizing extra observations
  – can be thought of as a relaxation of the rank-1 spatial constraint
• Experimental results show better performance
  – The proposed method outperforms the upper limit of time-invariant filtering in some cases
Thank you for your attention!