
May 30, 17

Slide overview

Lecture at Nagoya University, May 30, 2017 presented by Daichi Kitamura

http://d-kitamura.net/links_en.html

1.

Nagoya University Lecture, May 30, 2017
Audio Source Separation Based on Low-Rank Structure and Statistical Independence
Daichi Kitamura, Research Associate, The University of Tokyo

2.

Introduction
• Daichi Kitamura (北村大地)
• Research Associate, The University of Tokyo
• Academic background
 – Kagawa National College of Technology (2005–2012)
  • B.S. in Engineering (March 2012)
 – Nara Institute of Science and Technology (2012–2014)
  • M.S. in Engineering (March 2014)
 – SOKENDAI (2014–2017)
  • Ph.D. in Informatics (March 2017)
• Research topics
 – Media signal processing
 – Audio source separation

3.

Contents
• Research background
 – Audio source separation and its applications
 – Demonstration
• Structural modeling of audio sources
 – Time-frequency representation
 – Low-rank modeling of audio spectrograms
 – Supervised audio source separation
• Statistical modeling between sources
 – Blind audio source separation
 – Audio distributions and the central limit theorem
 – Maximization of independence
• Conclusion and future works


5.

Research background
• Audio source separation
 – Signal processing
 – Separation of speech, music sounds, background noise, …
 – The cocktail party effect realized by a computer


7.

Research background
• Applications of audio source separation
 – Hearing aids
  • Easier conversation in loud environments
 – Speech recognition systems
  • Siri, Google search, Cortana, Amazon Echo, …
 – Automatic music transcription
  • Musical part separation (Vo., Gt., Ba., …): separate a CD recording, then transcribe each part automatically
 – Remixing of live-recorded music
  • Professional use (improving quality), personal use (DJ remixing), …

8.

Demonstration: music source separation
• Music source separation into vocals, guitar, and keyboard
 – Listen for the three parts in the mixture.

9.

Contents
• Research background
 – Audio source separation and its applications
 – Demonstration
• Structural modeling of audio sources (for monaural signals)
 – Time-frequency representation
 – Low-rank modeling of audio spectrograms
 – Supervised audio source separation
• Statistical modeling between sources (for stereo or multichannel signals)
 – Blind audio source separation
 – Audio distributions and the central limit theorem
 – Maximization of independence
• Conclusion and future works


11.

Time-frequency representation of audio signals
• Audio waveform in the time domain (speech)

12.

Time-frequency representation of audio signals
• Time-varying frequency structure
 – Short-time Fourier transform (STFT): apply the Fourier transform to short windowed segments of the waveform, shifting the window along the time axis (window function, FFT length, and shift length are the analysis parameters)
 – The waveform in the time domain becomes a spectrogram in the time-frequency domain: a complex-valued matrix (frequency × time)
 – The entry-wise absolute value (or its square) gives the power spectrogram, a nonnegative real-valued matrix
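The STFT described above can be sketched in a few lines of NumPy (a minimal hand-rolled version; the Hann window, FFT length, and shift length below are illustrative choices, not necessarily the lecture's settings):

```python
import numpy as np

def stft(x, fft_len=1024, shift=512):
    """Short-time Fourier transform: slice the waveform into
    overlapping windowed frames and Fourier-transform each frame."""
    window = np.hanning(fft_len)
    n_frames = 1 + (len(x) - fft_len) // shift
    spec = np.empty((fft_len // 2 + 1, n_frames), dtype=complex)
    for t in range(n_frames):
        frame = x[t * shift : t * shift + fft_len] * window
        spec[:, t] = np.fft.rfft(frame)  # one column per time frame
    return spec  # complex-valued matrix (frequency x time)

rng = np.random.default_rng(0)
x = rng.normal(size=16000)  # 1 s of noise at 16 kHz sampling
S = stft(x)                 # complex spectrogram
P = np.abs(S) ** 2          # entry-wise power spectrogram (nonnegative)
```

The power spectrogram `P` is the nonnegative real-valued matrix that the later NMF slides operate on.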

13.

Power spectrogram of speech

14.

Power spectrogram of music

15.

Structural properties
• Sparse (for both speech and music)
 – Strong (yellow) components are few
 – Weak (darker) components are dominant
• Continuous contours (only in speech)
 – The spectrum changes continuously and dynamically
• Low rank (especially in music)
 – Contains similar patterns (similar timbres) repeated many times

16.

Comparison of low-rankness
• Spectrograms of drums, guitar, vocals, and speech

17.

Comparison of low-rankness
• Low-rankness (the simplicity of a matrix)
 – Can be measured by the cumulative singular value (CSV)
 – Number of bases when the CSV reaches the 95% line (spectrogram size: 1025×1883): drums 7, guitar 29, vocals and speech around 90
 – Drums and guitar are quite low-rank
  • Vocals and speech are also low-rank to some extent
 – A music spectrogram can be modeled by a few patterns
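The cumulative-singular-value measure can be sketched as follows (a hypothetical helper, `bases_for_csv`, demonstrated on random matrices rather than on the lecture's real spectrograms):

```python
import numpy as np

def bases_for_csv(spectrogram, ratio=0.95):
    """Number of bases needed for the cumulative singular value
    (CSV) to reach the given ratio of the total singular-value sum."""
    s = np.linalg.svd(spectrogram, compute_uv=False)
    csv = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(csv, ratio) + 1)

rng = np.random.default_rng(0)
# A rank-5 nonnegative matrix: very few bases suffice (like drums/guitar)
low_rank = rng.random((100, 5)) @ rng.random((5, 200))
# A full-rank random matrix: many more bases are needed
full_rank = rng.random((100, 200))
```

Applying `bases_for_csv` to both matrices shows the same contrast as the slide: the low-rank matrix reaches the 95% line after only a handful of bases, the full-rank one only after most of its singular values.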

18.

Modeling technique of low-rank structures
• Nonnegative matrix factorization (NMF) [Lee, 1999]
 – A low-rank approximation using a limited number of bases
  • The bases and their coefficients must be nonnegative
 – Can be applied to a power spectrogram
  • Spectral patterns (typical timbres) and their time-varying gains
 – A nonnegative matrix (the power spectrogram, # of frequency bins × # of time frames) is approximated by a basis matrix (spectral patterns) times an activation matrix (time-varying gains), with a small number of bases

19.

Modeling technique of low-rank structures
• Parameter optimization in NMF
 – Minimize a "similarity measure" between the observed matrix and its factorized model
 – An arbitrary similarity measure can be used
  • e.g., the squared Euclidean distance
 – A closed-form solution is still an open problem
 – Iterative calculation can minimize the measure
  • Multiplicative update rules [Lee, 2000] (for the case of the squared Euclidean distance)
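The multiplicative update rules can be sketched in NumPy (a minimal version of the squared-Euclidean-distance updates of [Lee, 2000]; matrix sizes and the iteration count are illustrative):

```python
import numpy as np

def nmf_euclidean(X, n_bases, n_iter=500, eps=1e-12, seed=0):
    """NMF by multiplicative updates for the squared Euclidean
    distance ||X - WH||_F^2; all entries stay nonnegative because
    the updates only ever multiply by nonnegative ratios."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = X.shape
    W = rng.random((n_freq, n_bases))   # basis matrix (spectral patterns)
    H = rng.random((n_bases, n_time))   # activation matrix (time-varying gains)
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return W, H

rng = np.random.default_rng(1)
X = rng.random((64, 5)) @ rng.random((5, 100))  # exactly rank-5, nonnegative
W, H = nmf_euclidean(X, n_bases=5)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

On this synthetic rank-5 matrix, the relative approximation error drops close to zero, illustrating why a low-rank spectrogram can be modeled by a few nonnegative bases.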

20.

Modeling technique of low-rank structures
• Example: Pf. and Cl.
 – The spectrogram is a superposition of rank-1 spectrograms

21.

Modeling technique of low-rank structures
• Example: Pf. and Cl.
 – Pf. and Cl. are separated!
 – Source separation based on NMF is a clustering problem of the obtained spectral bases in the basis matrix
 – But how?

22.

Supervised audio source separation with NMF
• If sourcewise training data is available:
• Supervised NMF [Smaragdis, 2007], [Kitamura, 2014]
 – Training stage: learn a spectral dictionary of the target (e.g., Pf.) from its training data
 – Separation stage: the trained dictionary is given (fixed), and only the remaining matrices (its activations, the other bases, and their activations) are optimized
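The supervised idea can be sketched as follows (a minimal version assuming Euclidean multiplicative updates; the names `F` for the fixed pre-trained dictionary, `G` for its activations, and `H`, `U` for the "other" bases and activations are hypothetical, chosen for this sketch):

```python
import numpy as np

def supervised_nmf(X, F, n_other=10, n_iter=300, eps=1e-12, seed=0):
    """Supervised NMF sketch: the pre-trained dictionary F stays
    fixed; only its activations G and the "other" part (H, U) are
    updated, so that X ~= F G + H U."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = X.shape
    G = rng.random((F.shape[1], n_time))
    H = rng.random((n_freq, n_other))
    U = rng.random((n_other, n_time))
    for _ in range(n_iter):
        R = F @ G + H @ U + eps            # current model
        G *= (F.T @ X) / (F.T @ R + eps)   # update target activations
        R = F @ G + H @ U + eps
        H *= (X @ U.T) / (R @ U.T + eps)   # update "other" bases
        R = F @ G + H @ U + eps
        U *= (H.T @ X) / (H.T @ R + eps)   # update "other" activations
    return F @ G, H @ U                    # target part, residual part

rng = np.random.default_rng(2)
F = rng.random((32, 4))  # pre-trained dictionary (from the training stage)
X = F @ rng.random((4, 50)) + rng.random((32, 3)) @ rng.random((3, 50))
target, other = supervised_nmf(X, F, n_other=3)
```

Because `F` matches the target's spectral patterns, the mixture splits into a target part `F @ G` and a residual part `H @ U`.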

23.

Supervised audio source separation with NMF
• Demonstration
 – Stereo music separation with supervised NMF [Kitamura, 2015]
 – Training sounds of Pf. and Ba. → separated Pf. and Ba. sounds from the original song

24.

Problem of the supervised approach
• Performance will be limited when the timbre difference between the training data and the target source in the mixture becomes large
 – e.g., target: the real Pf. sound in the mixture (actual Pf. & Tb.); training data: a slightly different, artificial Pf. sound produced by MIDI
 – Supervised NMF using the artificial Pf. as training data suffers from this timbre difference in the separated signal

25.

Adaptive supervised audio source separation
• Supervised NMF with basis deformation [Kitamura, 2013]
 – Introduces a deformation term (with positive and negative entries) to adaptively deform the pre-trained bases
 – Training stage: pre-train the dictionary; separation stage: the dictionary is given, and the deformation term absorbs slight timbre differences

26.

Adaptive supervised audio source separation
• Constraint on the deformation term
 – The range of deformation is restricted (e.g., to ±30% of the pre-trained bases)
 – This avoids excessive deformation of the dictionary
 – With the same training data (an artificial Pf. sound), supervised NMF with basis deformation separates the mixture (actual Pf. & Tb.) better than plain supervised NMF
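The ±30% restriction can be sketched as a simple clipping step (a hypothetical helper for illustration only; the actual method optimizes the constrained deformation term inside the NMF updates):

```python
import numpy as np

def clip_deformation(F, D, limit=0.3):
    """Restrict the deformation term D so that every deformed basis
    F + D stays within +/-30% (the slide's example range) of the
    pre-trained basis F; this also keeps F + D nonnegative."""
    return np.clip(D, -limit * F, limit * F)

rng = np.random.default_rng(0)
F = rng.random((8, 3)) + 0.1   # pre-trained nonnegative bases
D = rng.normal(size=(8, 3))    # deformation: positive and negative entries
deformed = F + clip_deformation(F, D)
```

After clipping, every entry of the deformed dictionary lies between 70% and 130% of the corresponding pre-trained entry.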

27.

Adaptive supervised audio source separation
• Demonstration
 – Separate actual instrumental sounds using artificial training data produced by a MIDI synthesizer
 – Training sounds of Sax. and Ba. (produced by MIDI) → separated Sax. and Ba. sounds plus residual sounds, from the original song (actual instruments)
 – Copyright © 2014 Yamaha Corp. All rights reserved.

28.

Contents
• Research background
 – Audio source separation and its applications
 – Demonstration
• Structural modeling of audio sources (for monaural signals)
 – Time-frequency representation
 – Low-rank modeling of audio spectrograms
 – Supervised audio source separation
• Statistical modeling between sources (for stereo or multichannel signals)
 – Blind audio source separation
 – Audio distributions and the central limit theorem
 – Maximization of independence
• Conclusion and future works

29.

Multichannel recording using a microphone array
• Number of microphones and sources
 – Overdetermined situation (# of sources ≤ # of mics.): the sources pass through the mixing system, are observed by a microphone array, and are estimated by a demixing system
 – Underdetermined situation (# of sources > # of mics.): e.g., a stereo signal (2-ch, L-ch and R-ch) from a CD, or a monaural signal (1-ch) from one mic.
• A priori information
 – Training data of the sources, positions of the sources, room geometry, music scores, etc.
 – Blind source separation (BSS): separation without any a priori information

30.

BSS and independent component analysis
• Blind source separation (BSS)
 – Estimate the demixing system without any prior information about the mixing system
• Typical BSS is based on statistical independence
• Independent component analysis (ICA) [Comon, 1994]
 – How do we measure statistical independence?
 – Define a "distribution of audio signals"
 – Find the demixing system that maximizes independence

31.

What is the distribution of audio signals?
• Distribution of a speech waveform
 – The amplitude histogram is spikier and heavier-tailed than the Gaussian (normal) distribution

32.

What is the distribution of audio signals?
• Distribution of a piano waveform
 – Close to a Laplace distribution: spikier and heavier-tailed than the Gaussian distribution

33.

What is the distribution of audio signals?
• Distribution of a drums waveform
 – Close to a Cauchy distribution: spikier and heavier-tailed than the Gaussian distribution

34.

Central limit theorem
• Audio source distributions are basically non-Gaussian
 – But we still do not know the exact source distributions
  • How do we model them for source separation?
• Central limit theorem
 – "A sum of random variables of almost any kind approaches a Gaussian distribution."*
 – e.g., sums of Laplace- or uniform-distributed variables approach a Gaussian
• Can't believe it? Let's see
 *Some r.v.s do not obey the theorem, e.g., Cauchy r.v.s

35.

Central limit theorem
• Consider the pips of a first die and a second die
 – The probability of each pip is always 1/6
 – Results of 1 million trials for each die: both histograms are uniform
 – What about the sum of the two dice?

36.

Central limit theorem
• Results of 1 million trials of the sum of two dice
 – Not a uniform distribution any more

37.

Central limit theorem
• Results of 1 million trials of sums of more dice
 – The histogram becomes increasingly bell-shaped

38.

Central limit theorem
• Results of 1 million trials for each die
 – The sum approaches a Gaussian distribution (central limit theorem)
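The dice experiment above can be reproduced in a few lines; excess kurtosis (0 for a Gaussian, negative for flat, uniform-like shapes) serves here as a rough measure of non-Gaussianity and shrinks toward 0 as more dice are summed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 1_000_000
# Pips of four independent dice, each uniform on {1, ..., 6}
dice = rng.integers(1, 7, size=(4, n_trials))

def excess_kurtosis(v):
    """0 for a Gaussian; about -1.27 for a single fair die."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 4) - 3.0)

one = dice[0]                 # uniform: strongly non-Gaussian
two = dice[:2].sum(axis=0)    # triangular: closer to Gaussian
four = dice.sum(axis=0)       # four dice: already quite bell-shaped
```

For independent summands, the excess kurtosis of the sum is the single-die value divided by the number of dice, so `four` sits much closer to the Gaussian value 0 than `one`.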

39.

Central limit theorem in audio signals
• Consider the n-th speaker's signal (around 3.3 s)
 – Waveforms and amplitude histograms of the individual speech signals

40.

Central limit theorem in audio signals
• Summing the speakers' signals
 – Waveform and amplitude histogram of the sum

41.

Central limit theorem in audio signals
• Summing more speakers' signals
 – Waveforms and amplitude histograms of the sums

42.

Central limit theorem in audio signals
• Summing even more speakers' signals
 – Waveform and amplitude histogram of the sum

43.

Central limit theorem in audio signals
• Summing many speakers' signals
 – The amplitude histogram of the sum is almost a Gaussian distribution (central limit theorem)

44.

Principle of ICA
• What we can say from the central limit theorem
 – The Gaussian distribution is the limit of a mixture of sources
 – If we maximize the non-Gaussianity of all signals, the signals will return to the original sources before they were mixed
 – Mixing: approaching Gaussian (central limit theorem); separation: departing from Gaussian by maximizing non-Gaussianity (ICA)
 – More generally, maximizing independence between components is the basic principle of ICA

45.

Principle of ICA
• Assumptions in ICA
 – 1. The sources are mutually independent
 – 2. Each source distribution is non-Gaussian
 – 3. The mixing system is invertible and time-invariant
• The sources (latent components) are mixed by a mixing matrix into the mixtures (observed signals); the demixing system is its inverse matrix

46.

Principle of ICA
• Indeterminacies in ICA
 – 1. The signal scale (volume) cannot be determined
 – 2. The signal permutation cannot be determined
• ICA estimates the separated signals from the mixtures only up to scale and permutation

47.

Principle of ICA
• Estimation in ICA
 – Maximize independence between the source distributions (minimize the distance between the joint distribution and the product of the marginal distributions)
 – Maximize a log-likelihood function under a non-Gaussian source distribution
 – In general, the source model is set to an appropriate non-Gaussian distribution
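A minimal sketch of maximum-likelihood ICA with natural-gradient updates (a common textbook formulation, not necessarily the exact update in the lecture; `tanh` acts as the score function of an assumed heavy-tailed source model):

```python
import numpy as np

def ica(X, n_iter=1000, lr=0.1, seed=0):
    """Natural-gradient ICA: W <- W + lr * (I - E[phi(y) y^T]) W,
    where phi is the score function of an assumed non-Gaussian
    (heavy-tailed) source distribution."""
    n, T = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        Y = W @ X
        phi = np.tanh(Y)  # score of a heavy-tailed source model
        W += lr * (np.eye(n) - (phi @ Y.T) / T) @ W
    return W

# Two independent, heavy-tailed (Laplace) sources, instantaneously mixed
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 50000))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
W = ica(A @ S)
P = W @ A  # ideally a scaled permutation matrix (ICA's indeterminacies)
```

Each row of `P` is dominated by a single entry, reflecting that ICA recovers the sources only up to scale and permutation.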

48.

ICA-based separation of reverberant mixtures
• Audio mixtures in actual environments
 – Convolutive mixtures with reverberation
  • e.g., an office room has about 300 ms of reverberation, and a concert hall more than 2000 ms (the length of the convolution filter)
 – The mixing coefficients become mixing filters
• How to deconvolve them?
 – 1. Estimate the deconvolution filter in the time domain
  • At 16 kHz sampling, a 300-ms filter has 4800 taps, so the number of parameters to estimate explodes
 – 2. Estimate demixing coefficients in the frequency domain
  • A frequency-wise demixing matrix should be estimated by ICA, at the cost of a permutation problem

49.

ICA-based separation of reverberant mixtures
• Frequency-domain ICA (FDICA) [Smaragdis, 1998]
 – Apply a simple (instantaneous) ICA to each frequency bin of the spectrogram (ICA1, ICA2, ICA3, …)
 – Each frequency-wise demixing matrix is the inverse of the frequency-wise mixing matrix, estimated independently over the time frames of that bin
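The FDICA data flow can be sketched as a loop over frequency bins (the per-bin ICA itself is passed in as a function; an identity "ICA" is used here just to show the shapes):

```python
import numpy as np

def fdica(spec, ica_fn):
    """Frequency-domain ICA skeleton: run an instantaneous ICA
    independently in every frequency bin.
    spec: complex array (channels, frequency bins, time frames).
    ica_fn: maps one bin's (channels x frames) matrix to a
    (channels x channels) demixing matrix for that bin.
    Note: the per-bin outputs still need a permutation solver."""
    n_ch, n_freq, _ = spec.shape
    W = np.empty((n_freq, n_ch, n_ch), dtype=complex)
    out = np.empty_like(spec)
    for f in range(n_freq):
        W[f] = ica_fn(spec[:, f, :])     # frequency-wise demixing matrix
        out[:, f, :] = W[f] @ spec[:, f, :]
    return out, W

# Dummy run with an identity "ICA" just to exercise the data flow
rng = np.random.default_rng(0)
spec = rng.normal(size=(2, 8, 16)) + 1j * rng.normal(size=(2, 8, 16))
out, W = fdica(spec, lambda X_f: np.eye(2, dtype=complex))
```

Because each bin is demixed independently, the source order can differ from bin to bin, which is exactly the permutation problem discussed on the next slide.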

50.

ICA-based separation of reverberant mixtures
• Permutation problem in frequency-domain ICA
 – The order of the separated signals in each frequency bin is messed up*
 – We have to align the order across all frequencies: a permutation solver must consistently assign sources 1 and 2 to separated signals 1 and 2 in every bin
 *The scales are also messed up, but they can easily be fixed.

51.

ICA-based separation of reverberant mixtures
• Popular permutation solvers
 – Based on the direction of arrival (DOA)
  • Frequency-domain ICA + DOA alignment [Saruwatari, 2006]
 – Based on relative correlations among frequencies
  • Independent vector analysis (IVA) [Hiroe, 2006], [Kim, 2006]
 – Based on a low-rank model of each source
  • Independent low-rank matrix analysis (ILRMA) [Kitamura, 2016]
• Demonstration of BSS using ILRMA
 – http://d-kitamura.net/en/demo_rank1_en.htm

52.

Contents
• Research background
 – Audio source separation and its applications
 – Demonstration
• Structural modeling of audio sources
 – Time-frequency representation
 – Low-rank modeling of audio spectrograms
 – Supervised audio source separation
• Statistical modeling between sources
 – Blind audio source separation
 – Audio distributions and the central limit theorem
 – Maximization of independence
• Conclusion and future works

53.

Conclusions and future works
• Audio source separation based on
 – The low-rank property: nonnegative matrix factorization
 – Statistical independence: blind source separation
• For further improvement
 – Separation based on training with a huge dataset
  • Deep learning, denoising autoencoders, etc.
  • But each recording condition occurs just one time
 – Informed source separation
  • Music scores could be powerful information
  • The user can guide the system, which leads to more accurate separation
• Performance is still insufficient
 – Almost there? Not at all! Make our lives better. That's engineering.