On the use of statistical tools for audio processing
Mathieu Lagrange and Juan José Burred
Analyse / Synthèse Team, IRCAM
Mathieu. [email protected] fr
École centrale de Nantes, Filière ISBA

Outline
1. Introduction
   1. Context and challenges
2. Past and Present
   1. Speech
      1. Model
      2. Applications (coding, speaker recognition, speech recognition)
   2. Audio (Music)
      1. Sound models
      2. Applications (classification, similarity)
      3. Limitations
3. Sound source separation
   1. Paradigms, tasks and applications
   2. Mixing models
   3. Methods for the underdetermined case
4. Clustering of Spectral Audio (CoSA)
   1. Auditory Scene Analysis (ASA)
   2. Clustering
Mathieu Lagrange. Statistical Tools for Audio Processing.

Outline
1. Introduction
   1. Context and challenges

Technological Context
« We are drowning in information and starving for knowledge » R. D. Rogers
• Needs:
  o Measurement
  o Transmission
  o Access
• Aim of a numerical representation:
  o Precision
  o Efficiency
  o Relevance
• Means
  o Mechanical biology
  o Psycho-acoustics
  o Cognition

Challenges
« Forty-two! yelled Loonquawl. Is that all you've got to show for seven and a half million years' work? » D. Adams
Music is great to study as it is both:
• an object: an arrangement of sounds and silences over time
• a function: a more or less codified form of expression of:
  o Individual feelings (mood)
  o Collective feelings (party, singing, dance)

Audio Processing: Past and Present

Outline
1. Introduction
   1. Context and challenges
2. Past and Present
   1. Speech
      1. Model
      2. Applications (coding, speaker recognition, speech recognition)

Speech signal
• The speech signal is produced when the air flow coming from the lungs goes through the vocal cords and the vocal tract.
  o The size and shape of the vocal tract, as well as the vocal cord excitation, change relatively slowly.
  o The speech signal can therefore be considered quasi-stationary over short periods of about 20 ms.
• Types of speech production
  o Voiced
  o Unvoiced
  o Plosives

Source / Filter Model
• In the case of an idealized voiced speech signal, the vocal cords produce a perfectly periodic harmonic signal.
• The influence of the vocal tract can be considered as a filtering with a given frequency response whose maxima are called formants.

Source / Filter Coding
• Algorithm:
  o Voiced / unvoiced detection
  o Voiced case: the source signal is approximated with a Dirac comb
    o A Dirac comb whose successive Diracs are spaced by T has a spectrum which is a Dirac comb whose successive Diracs are spaced by 1/T.
    o Parameters: T, gain
  o Unvoiced case: the source signal is approximated by a stochastic signal
    o Parameter: gain
  o The source signal is then filtered
    o Parameters: filter coefficients

« Code-Excited Linear Predictive » (CELP)
For each frame of 20 ms:
  o Auto-regressive (AR) coefficients are computed such that the prediction error is minimized over the entire duration of the frame.
  o The quantized coefficients and an index encoding the error signal are transmitted.
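
The AR fitting step above can be sketched as follows. This is only a minimal illustration of least-squares linear prediction on one frame (autocorrelation method), not a CELP coder: there is no quantization and no codebook search. The function names `lpc_coefficients` and `prediction_residual` and the model order are assumptions made for the example.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """AR coefficients a[0..order-1] such that x[n] ~ sum_k a[k] * x[n-1-k].

    Uses the autocorrelation method: the coefficients minimize the squared
    prediction error over the frame (hypothetical helper, not from the slides).
    """
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation of the frame at lags 0..order
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def prediction_residual(frame, a):
    """Prediction error e[n] = x[n] - sum_k a[k] * x[n-1-k].

    The first len(a) samples have no full history and are left unchanged.
    """
    frame = np.asarray(frame, dtype=float)
    order = len(a)
    residual = frame.copy()
    for n in range(order, len(frame)):
        past = frame[n - order:n][::-1]  # x[n-1], x[n-2], ..., x[n-order]
        residual[n] = frame[n] - np.dot(a, past)
    return residual
```

For a pure sinusoid of angular frequency w, the order-2 coefficients should come out close to (2 cos w, -1), and the residual should be much smaller than the signal, which is what makes the residual cheap to encode.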

« Code-Excited Linear Predictive » (CELP)
[Figure: block diagram with labels Signal, AR Coefficients, Residual, index]

Speaker Recognition
• Classical pattern recognition problem
• Specific problems:
  o Text dependency
  o Identification / Verification
  o Open set / Closed set: rejection problem
• Method
  o Feature extraction: model each speech signal with Mel-Frequency Cepstral Coefficients (MFCCs) and their derivatives.
  o Classification
    o Text independent: Vector Quantization codebooks or Gaussian Mixture Models (GMMs)
    o Text dependent: Dynamic Time Warping (DTW) or Hidden Markov Models (HMMs)
[Figure: labels s1, s2, s3]

Speech recognition
• An Automatic Speech Recognition system is typically decomposed into:
  o Feature extraction: MFCCs
  o Acoustic models: HMMs trained for a set of phones
    o Each phone is modelled with 3 states
  o Pronunciation dictionary: converts a series of phones into a word
  o Language model: predicts the likelihood of specific words occurring one after another, using n-grams
(Fig. from HTK documentation)

MFCCs rules?
Mel-Frequency Cepstral Coefficients are commonly derived as follows:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform (DCT) of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
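
The five steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the mel formula (2595 log10(1 + f/700)), the Hann window, the 26 filters, the 13 coefficients and the unnormalized DCT-II are common choices that the slides do not specify, and all function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel formula (an assumption; several variants exist)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular overlapping filters with edges equally spaced in mel."""
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(frame, sample_rate, n_filters=26, n_coeffs=13):
    frame = np.asarray(frame, dtype=float)
    n_fft = len(frame)
    # 1. Fourier transform of a windowed excerpt (power spectrum)
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    # 2. Map the powers onto the mel scale with triangular windows
    mel_power = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    # 3. Log of the power in each mel band (small offset avoids log(0))
    log_mel = np.log(mel_power + 1e-10)
    # 4. DCT-II of the mel log powers, treated as a signal
    k = np.arange(n_coeffs)
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(k, n + 0.5) / n_filters)
    # 5. The MFCCs are the amplitudes of the resulting "spectrum"
    return basis @ log_mel
```

Calling `mfcc` on each successive frame of a signal yields the feature sequence that the speaker and speech recognition systems described earlier take as input.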

MFCCs rules?

Issues with MFCCs computation steps
• The mel frequency warping:
  o highly criticized from a perceptual point of view (Greenwood)
  o conceptually: periodicity analysis over data that are not periodic anymore (Camacho)
• The cepstral coefficients are COSINE coefficients:
  o they cannot shift with speaker size to capture the shift in formant frequencies that occurs as children grow up and their vocal tracts get longer
• Not a sound representation:
  o no way to provide enhancements such as speaker and channel adaptation, background noise suppression, source separation

Potentials of the DCT step
• Pols observed that the main components capture most of the variance using a few smooth basis functions, smoothing away the pitch ripples.
• The principal components of a collection of vowel spectra on a warped frequency scale are not far from the cosine basis functions.
• The DCT decorrelates the features.
  o This is important because MFCCs are in most cases modelled by Gaussians with diagonal covariance matrices.

Outline
1. Introduction
   1. Context and challenges
2. Past and Present
   1. Speech
      1. Model
      2. Applications (coding, speaker recognition, speech recognition)
   2. Audio (Music)
      1. Sound models
      2. Applications (classification, similarity)
      3. Limitations

Sound Models
• Major classes of sounds
  1. Transients (castanets, …)
  2. Pseudo-periodic (flute, …)
  3. Stochastic (waves, …)
• Models
  1. Impulsive noise
  2. Sum of sinusoids
  3. Wide-band noise

Classification
• Method: [Tzanetakis’02]
  o Agree on a mutually exclusive set of tags (the ontology)
  o Extract features from audio (MFCCs and variations)
  o Train statistical models:
    o Due to the high dimensionality of the feature vectors, discriminative approaches are preferred (SVMs)
• Segmentation
  o Smoothing of the decisions using dynamic programming (DP)
(Fig. from [Ramona 07])
[Tzanetakis’02] G. Tzanetakis and P. Cook. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, 2002.

Multi-Class Discriminative Classification
• Usually performed by combining binary classifiers
• Two approaches:
  o One-vs-all: for each class, build a classifier for that class versus the rest
    o Often very imbalanced classifiers (use asymmetric regularization)
  o All-vs-all: build a classifier for each pair of classes
    o A priori a large number of classifiers to build, but the pairwise classifications are faster and balanced (easier to find the best regularization)
[Figure: labels s1, s2, s3]

Multi-Label Discriminative Classification
• Each object may be tagged using several labels
• Computational approaches
  o Power sets
  o Binary relevance (equivalent to one-vs-all)
• Multiple criteria:
  o « Flattening » the ontology
  o Research trend: considering the ontology structure to benefit from the co-occurrence of labels from different semantic criteria
[Figure: labels C1, C2, C3]

Music Similarity
• Question to solve: « Given a seed song, provide us with the entries of the database which are the most similar »
• Annotation type: Artist / Album
• Method: [Aucouturier’04]
  o Songs are modeled as GMMs of MFCCs
  o The proximity between GMMs is used as the similarity measure:
    o Likelihood (requires access to the MFCCs)
    o Sampling
[Aucouturier’04] J.-J. Aucouturier and F. Pachet. Improving Timbre Similarity: How High is the Sky? Journal of Negative Results in Speech and Audio Sciences, 1(1), 2004.

Cover Version Detection
• Question to solve: « Given a seed song, provide us with the entries of the database which are cover versions »
• Annotation: canonical song
• Method: [Serra’08]
  o Songs are modeled as a time series of chromas
  o Computation of the similarity matrix between the two time series
  o Similarity is measured using dynamic programming local alignment
(Fig. from [Serra’08])
[Serra’08] Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification. IEEE Transactions on Audio, Speech, and Language Processing, 2008.
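
The local alignment step can be illustrated with a generic Smith-Waterman-style recursion over a binary similarity matrix. This is a sketch of the general technique only, not the exact recursion or penalties of [Serra’08]; the function name `local_alignment_score` and the match, mismatch and gap values are assumptions chosen for the example.

```python
import numpy as np

def local_alignment_score(similarity, match=1.0, mismatch=-1.0, gap=-0.5):
    """Best local alignment score between two sequences.

    similarity[i, j] is True when frame i of the first song is considered
    similar to frame j of the second song (binary similarity matrix).
    """
    similarity = np.asarray(similarity, dtype=bool)
    n, m = similarity.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if similarity[i - 1, j - 1] else mismatch
            H[i, j] = max(
                0.0,                    # restart: alignment is local
                H[i - 1, j - 1] + step, # extend along the diagonal
                H[i - 1, j] + gap,      # skip a frame in the first song
                H[i, j - 1] + gap,      # skip a frame in the second song
            )
    return H.max()
```

A long diagonal run of similar frames anywhere in the matrix produces a high score, so two songs sharing a section can be matched even when the rest of the recordings differ.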

Limitations
• Description of audio and music
  o Polyphonic
  o Multiple shapes varying in various ways
• Statistical modeling
  o Curse of dimensionality
  o Sense of structure relevant at multiple levels of temporality

Outline
1. Introduction
   1. Context and challenges
2. Past and Present
   1. Speech
      1. Model
      2. Applications (coding, speaker recognition, speech recognition)
   2. Audio (Music)
      1. Sound models
      2. Applications (classification, similarity)
      3. Limitations
3. Sound source separation
   1. Paradigms, tasks and applications
   2. Mixing models
   3. Methods for the underdetermined case

Sound Source Separation
• “Cocktail party effect”
  o E. C. Cherry, 1953.
  o Ability to concentrate attention on a specific sound source from within a mixture.
  o Even when the interfering energy is close to the energy of the desired source.
• “Prince Shotoku Challenge”
  o The legendary Japanese prince Shotoku (6th century AD) could listen to and understand the petitions of ten people simultaneously.
  o Concentrate attention on several sources at the same time!
  o “Prince Shotoku Computer” (Okuno et al., 1997)
• Both allegories imply an extra step of semantic understanding of the sources, beyond mere acoustical isolation.
[Cherry 53] E. C. Cherry. Some Experiments on the Recognition of Speech, With One and Two Ears. Journal of the Acoustical Society of America, Vol. 25, 1953.
[Okuno 97] H. G. Okuno, T. Nakatani and T. Kawabata. Understanding Three Simultaneous Speeches. Proc. Int. Joint Conference on Artificial Intelligence (IJCAI), Nagoya, Japan, 1997.

The paradigms of Musical Source Separation (based on [Scheirer 00])
• Understanding without separation
  o E.g. music genre classification
  o “Glass ceiling” of traditional methods (MFCC+GMM) [Aucouturier&Pachet 04]
• Separation for understanding
  o First (partially) separate, then extract features
  o Source separation as a way to break the glass ceiling?
• Separation without understanding
  o BSS: Blind Source Separation (ICA, ISA, NMF)
  o Blind means: only very general statistical assumptions are made.
• Understanding for separation
  o Supervised source separation (based on a training database)
[Scheirer 00] E. D. Scheirer. Music-Listening Systems. PhD thesis, Massachusetts Institute of Technology, 2000.
[Aucouturier&Pachet 04] J.-J. Aucouturier and F. Pachet. Improving Timbre Similarity: How High is the Sky? Journal of Negative Results in Speech and Audio Sciences, 1(1), 2004.

Required sound quality
• Audio Quality Oriented (AQO)
  o Aimed at full unmixing at the highest possible quality.
  o Applications:
    o Unmixing, remixing, upmixing
    o Hearing aids
    o Post-production
• Significance Oriented (SO)
  o Separation quality just enough for facilitating semantic analysis of complex signals.
  o Less demanding, more realistic.
  o Applications:
    o Music Information Retrieval
    o Polyphonic transcription
    o Object-based audio coding

Musical Source Separation Tasks
• Classification according to the nature of the mixtures
• Classification according to the available a priori information

Linear mixing model
• Only amplitude scaling before mixing (summing)
• Linear stereo recording setups:
  o XY stereo
  o MS stereo
  o Close miking
  o Direct injection

Delayed mixing model
• Amplitude scaling and delay before mixing
• Delayed stereo recording setups:
  o AB stereo
  o Mixed stereo
  o Close miking with delay
  o Direct injection with delay

Convolutive mixing model
• Filtering between sources and sensors
• Convolutive stereo recording setups:
  o Reverberant environment
  o Binaural
  o Close miking with reverb
  o Direct injection with reverb
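
The three mixing models can be written side by side as follows. The equations of the original slides are not available in this transcript, so this is the standard textbook formulation; the symbols are assumptions: x_m is the m-th mixture channel, s_n the n-th source, a_{mn} an amplitude gain, δ_{mn} a delay and h_{mn} an impulse response between source n and sensor m.

```latex
% Linear (instantaneous) mixing: amplitude scaling only
x_m(t) = \sum_{n=1}^{N} a_{mn}\, s_n(t)

% Delayed mixing: amplitude scaling and delay
x_m(t) = \sum_{n=1}^{N} a_{mn}\, s_n(t - \delta_{mn})

% Convolutive mixing: filtering between sources and sensors
x_m(t) = \sum_{n=1}^{N} (h_{mn} * s_n)(t)
       = \sum_{n=1}^{N} \sum_{\tau} h_{mn}(\tau)\, s_n(t - \tau)
```

Each model contains the previous one as a special case: a delayed mixture with all delays at zero is linear, and a convolutive mixture whose filters are single scaled, delayed impulses is a delayed mixture.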

Some terminology
• System of linear equations: X = AS
  o Usual algebraic methods from high school: X known, A known, S unknown
  o But in source separation: unknown variables (S, sources) AND unknown coefficients (A, mixing matrix)
• Algebra terminology is retained for source separation:
  o More equations (mixtures) than unknowns (sources): overdetermined
  o As many equations (mixtures) as unknowns (sources): determined (square A)
  o Fewer equations (mixtures) than unknowns (sources): underdetermined
• The underdetermined case is the most demanding, but also the most important for music!
  o Music is (still) mostly in stereo, with usually more than 2 instruments
  o Overdetermined and determined situations are only of interest for arrays of sensors or arrays of microphones (localization, tracking)

Binaural Case (1)
• Goal: find a mask M that retrieves one source when used to filter a given time-frequency representation.
  o The mask is applied with the Hadamard (element-wise) product.
• DUET (Degenerate Unmixing Estimation Technique) [Yilmaz&Rickard 04]
  o Histogram of Interchannel Intensity Differences (IID) and Interchannel Phase Differences (IPD)
  o Binary mask created by selecting bins around histogram peaks.
• Drawback of time-frequency masking: “musical noise” or “burbling” artifacts
(Figs. from [Vincent 06] and [Yilmaz&Rickard 04])
[Yilmaz&Rickard 04] Ö. Yilmaz and S. Rickard. Blind Separation of Speech Mixtures via Time-Frequency Masking. IEEE Trans. on Signal Processing, Vol. 52(7), July 2004.

Binaural Case (2)
• Human-assisted time-frequency masking [Vinyes 06]
  o Human-assisted selection of the time-frequency bins out of the DUET-like histogram for creating the unmixing mask
  o Implementation as a VST plugin (“Audio Scanner”)
[Vinyes 06] M. Vinyes, J. Bonada and A. Loscos. Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking. 120th AES Convention, Paris, France, 2006.

Monaural Case
• Classification according to a priori knowledge
  o Supervised
    o Based on training the model with a sound example database
    o Better quality and more demanding situations at the cost of less generality
  o Unsupervised
• Classification according to model type
  o Adaptive basis decompositions (ISA, NMF, NSC)
  o Sinusoidal modeling
• Classification according to mixture type
  o Monaural systems
  o Hybrid systems combining advanced source models with spatial diversity

Independent Subspace Analysis
• Application of ISA to audio: Casey and Westner, 2000.
• Application of ICA to the spectrogram of a mono mixture.
• Each independent component corresponds to an independent subspace of the spectrogram.
• Component-to-source clustering
  o The extracted components usually do not directly correspond to the sources.
  o They must be clustered together according to some similarity criterion.
  o Casey&Westner use a matrix of Kullback-Leibler divergences called the ixegram.
(Fig. from [Casey&Westner 00])
[Casey&Westner 00] M. Casey and A. Westner. Separation of Mixed Audio Sources by Independent Subspace Analysis. Proc. Int. Computer Music Conference (ICMC), Berlin, Germany, 2000.

ICA for Audio (Figs. from Virtanen)

Nonnegative Matrix Factorization
• Matrix factorization imposing non-negativity.
• Needed when using magnitude or power spectrograms.
• NMF does not aim at statistical independence, but:
  o It has been proven that, under some conditions, non-negativity is sufficient for separation.
  o NMF yields components that very closely correspond to the sources.
  o To date, there is no exact theoretical explanation of why that is so!
• Use for transcription:
  o P. Smaragdis and J. C. Brown. Non-Negative Matrix Factorization for Polyphonic Music Transcription. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2003.
• Use for separation:
  o B. Wang and M. D. Plumbley. Musical Audio Stream Separation by Non-Negative Matrix Factorization. Proc. UK Digital Music Research Network (DMRN) Summer Conf., 2005.
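
A minimal sketch of NMF on a magnitude spectrogram, assuming the factorization is written V ≈ WH (the symbols are not given in the transcript) and using the multiplicative updates of Lee and Seung for the Euclidean cost. The function name `nmf`, the iteration count and the random initialization are illustrative choices, not the settings of the cited papers.

```python
import numpy as np

def nmf(V, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix V (freq x time) as W @ H.

    W (freq x components) holds spectral templates and
    H (components x time) holds their time-varying gains.
    """
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Each component k can then be turned back into a spectrogram as `np.outer(W[:, k], H[k])`; deciding which components belong to which source is a separate clustering problem, as with ISA.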

NMF for Audio (Figs. from Virtanen)

NMF for Vision
• By representing signals as a purely additive sum of non-negative sources, we get a parts-based representation
[Lee’99] Lee and Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 1999, 41

Nonnegative Sparse Coding
• Combination of non-negativity and sparsity constraints in the factorization.
• [Virtanen 03]: NSC is optimized with an additional criterion of temporal continuity.
  o Measured by the absolute value of the overall amplitude difference between consecutive frames.
  o [Figure: Mixture, Component 1, Component 2]
• [Virtanen 04]: Convolutive Sparse Coding
  o Improved temporal accuracy by modeling the sources as the convolution of spectrograms with a vector of onsets.
  o [Figure: Mixture, Component 1, Component 2]
[Virtanen 03] T. Virtanen. Sound Source Separation Using Sparse Coding with Temporal Continuity Objective. Proc. Int. Computer Music Conference (ICMC), Singapore, 2003.
[Virtanen 04] T. Virtanen. Separation of Sound Sources by Convolutive Sparse Coding. Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA), Jeju, Korea, 2004.

Sinusoidal Methods
• Sinusoidal modeling: detection and tracking of the sinusoidal partial peaks on the spectrogram.
• Based on Auditory Scene Analysis (ASA) cues of good continuation, common fate and smoothness of sinusoidal tracks.
• Overall, very good reduction of interfering sources, but moderate timbral quality.
  o Appropriate for Significance-Oriented applications
• [Virtanen&Klapuri 02]: model of spectral smoothness of harmonic sounds
  o Based on basis decomposition of harmonic structures
  o Additive resynthesis of partial parameters
• [Every&Szymanski 06]
  o Spectral subtraction instead of additive resynthesis
(Fig. from [Every 06]: Mixture, Separated sources)
[Virtanen&Klapuri 02] T. Virtanen and A. Klapuri. Separation of Harmonic Sounds Using Linear Models for the Overtone Series. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Orlando, USA, 2002.
[Every&Szymanski 06] M. R. Every and J. E. Szymanski. Separation of Synchronous Pitched Notes by Spectral Filtering of Harmonics. IEEE Trans. on Audio, Speech and Language Processing, Vol. 14(5), 2006.

Supervised Methods (1)
• Use of a training database to create a set of source models, each one modeling a specific instrument.
  o Better separation at the cost of generality.
• Supervised sinusoidal methods
  o [Burred&Sikora 07]
  o The source models are compact descriptions of the spectral envelope and its temporal evolution.
  o The detailed temporal evolution makes it possible to ignore harmonicity constraints, so separation of chords and inharmonic sounds is possible.
[Figures: separation of chords (Mixture, Separated sources); inharmonic separation (Mixture, Separated sources)]
[Burred&Sikora 07] J. J. Burred and T. Sikora. Monaural Source Separation from Musical Mixtures Based on Time-Frequency Timbre Models. Proc. Int. Conf. on Music Information Retrieval (ISMIR), Vienna, Austria, September 2007.

Supervised Methods (2)
• Bayesian networks
  o [Vincent 06]
  o Multilayered model describing note probabilities (state layer), spectral decomposition (source layer) and spatial information (mixture layer).
  o Trained on a database of isolated notes.
  o Allows separation of sounds with reverb.
• Learnt priors for Wiener-based separation
  o [Ozerov 05]
  o Single-channel
  o GMM models of singing voice and accompaniment.
[Figure: Mixture, Separated sources]
[Vincent 06] E. Vincent. Musical Source Separation Using Time-Frequency Source Priors. IEEE Trans. on Audio, Speech and Language Processing, Vol. 14(1), 2006.
[Ozerov 05] A. Ozerov, O. Philippe, R. Gribonval and F. Bimbot. One Microphone Singing Voice Separation Using Source-Adapted Models. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2005.

Conclusions
• Still far from a fully general, audio-quality-oriented system.
• More realistic: significance-oriented
  o Separation good enough to facilitate content analysis
• Methods based on adaptive models, time-frequency masking:
  o More realistic mixtures, but more artifacts and interference
• Methods based on sinusoidal modeling:
  o More artificial timbre, but less interference.
• Current polyphony limitations:
  o Mono signals: up to 3 or 4 instruments
  o Stereo signals: up to 5 or 6 instruments

Outline
1. Introduction
   1. Context and challenges
2. Past and Present
   1. Speech
      1. Model
      2. Applications (coding, speaker recognition, speech recognition)
   2. Audio (Music)
      1. Sound models
      2. Applications (classification, similarity)
      3. Limitations
3. Sound source separation
   1. Paradigms, tasks and applications
   2. Mixing models
   3. Methods for the underdetermined case
4. Clustering of Spectral Audio (CoSA)
   1. Auditory Scene Analysis (ASA)
   2. Clustering

Binary Masking with Oracle
• Binary masking is an effective way of performing the separation.
• Using an oracle makes it possible to assess the relevance of Fourier spectrograms as an atomic representation.
• The binary mask is set to 1 if the source of interest is dominant in the considered frequency bin, and 0 otherwise.
[Figure: block diagram with labels STFT, |STFT|, Mask, ISTFT]
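
The oracle mask rule above can be sketched as follows, assuming the spectrograms of the isolated source of interest and of the interference are available (which is what makes it an oracle). The function names are hypothetical, and the STFT and inverse STFT steps are left out.

```python
import numpy as np

def oracle_binary_mask(target_spec, interference_spec):
    """1 where the source of interest dominates the bin, 0 otherwise."""
    return (np.abs(target_spec) > np.abs(interference_spec)).astype(float)

def apply_mask(mixture_spec, mask):
    """Element-wise masking of the (complex) mixture spectrogram."""
    return mask * np.asarray(mixture_spec)
```

The masked spectrogram keeps the phase of the mixture; inverting it with an inverse STFT gives the best separation that any binary mask could achieve on that representation, which is why it serves as an upper bound.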

Auditory Scene Analysis
• Formalism proposed by psycho-acousticians [Bregman’90]
• Main principle:
  o The scene can be decomposed into a set of atoms
  o A first level of structuring clusters atoms into entities (notes)
  o A second one clusters entities into streams (voices)
[Bregman’90] A. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, 1990.

Affinity Cues
• Proximity
  o Time
  o Frequency
  o Amplitude
• Dynamic
• Harmonicity

Vector-based Clustering
• Each object of the dataset is described as a set of features
• For that purpose, “k-means” is widely used due to its efficiency
• Aim: minimize the within-cluster sum of squares (WCSS)
• Two-step iterative method:
  o Assignment step: assign each observation to the cluster with the closest mean
  o Update step: calculate the new means as the centroids of the observations in each cluster
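
The two-step iteration above can be sketched as follows. This is a minimal version that assumes Euclidean distance and initializes the means with k randomly chosen observations; the function name `kmeans`, the seed and the stopping rule are choices made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (labels, means)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: k distinct observations picked at random
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Assignment step: closest mean for each observation
        dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the means are too
        labels = new_labels
        # Update step: each mean becomes the centroid of its cluster
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                means[c] = members.mean(axis=0)
    return labels, means
```

Each iteration can only decrease the WCSS, so the procedure converges, but only to a local minimum that depends on the initialization.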

Graph-based Clustering
• Given a set of data points A, the similarity matrix may be defined as a matrix W where w(i, j) represents a measure of the similarity between points i and j.
• More generic approach, as vector-based descriptions can be trivially converted
• Many existing methods:
  o Agglomerative Hierarchical Clustering (AHC)
  o k-medoids
  o Spectral clustering

Spectral Clustering (1)
• Spectral clustering techniques make use of the spectrum (eigenvectors) of the similarity matrix of the data to perform dimensionality reduction, so that clustering is done in fewer dimensions.

Spectral Clustering (2)
• Method:
  o Compute the similarity between each pair of objects (W)
  o Compute the normalized Laplacian (L), with D the degree matrix, defined as the diagonal matrix with the degrees on the diagonal
  o Select the eigenvectors corresponding to the k largest eigenvalues of the Laplacian and normalize them by rows
  o Use those eigenvectors for clustering with k-means
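
A sketch of the embedding steps above. The transcript does not show the formula for L, so this assumes the Ng-Jordan-Weiss form L = D^{-1/2} W D^{-1/2} with D_ii = Σ_j w(i, j), which is the form consistent with keeping the k largest eigenvalues. The function name `spectral_embedding` is hypothetical.

```python
import numpy as np

def spectral_embedding(W, k):
    """Row-normalized top-k eigenvectors of D^{-1/2} W D^{-1/2}.

    W is a symmetric similarity matrix with strictly positive row sums.
    Each returned row is the k-dimensional representation of one object.
    """
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)                 # diagonal of D
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    # Assumed normalized "Laplacian": L = D^{-1/2} W D^{-1/2}
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigvecs = np.linalg.eigh(L)
    Y = eigvecs[:, -k:]                     # k largest eigenvalues
    # Normalize each row to unit length
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)
```

The rows of the returned matrix are then clustered with k-means. When the similarity graph splits into k disconnected groups, objects of the same group map to the same point and different groups map to orthogonal points, so k-means separates them trivially.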

Spectral Clustering (3)
• Can be viewed as a relaxation of the Normalized Cuts problem

Performance Criterion
• Normalized Mutual Information (NMI)
  o Given 2 sets of labels (the ground truth and the clustering result), estimate the degree of matching between the two
  o The NMI is between 0 and 1, 1 being a perfect match.
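
A sketch of the NMI computation. Several normalizations exist and the slides do not say which one is used; this example assumes the mutual information is divided by the geometric mean of the two label entropies. The function names are hypothetical.

```python
import numpy as np

def label_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def normalized_mutual_information(truth, predicted):
    """NMI between two labelings of the same objects, in [0, 1]."""
    truth = np.asarray(truth)
    predicted = np.asarray(predicted)
    t_vals, t_idx = np.unique(truth, return_inverse=True)
    p_vals, p_idx = np.unique(predicted, return_inverse=True)
    # Joint distribution of (ground-truth label, cluster label)
    joint = np.zeros((len(t_vals), len(p_vals)))
    np.add.at(joint, (t_idx, p_idx), 1.0)
    joint /= len(truth)
    marginals = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / marginals[nz]))
    h_t, h_p = label_entropy(truth), label_entropy(predicted)
    if h_t == 0.0 or h_p == 0.0:
        # Degenerate case (a single label): assumed convention
        return 1.0 if h_t == h_p else 0.0
    return mi / np.sqrt(h_t * h_p)
```

The measure is invariant to a renaming of the clusters: labels `[0, 0, 1, 1]` and `[1, 1, 0, 0]` describe the same partition and give a perfect match, while `[0, 1, 0, 1]` carries no information about `[0, 0, 1, 1]`.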

Clustering of Spectral Audio (CoSA)
• Aim: apply a clustering method to generate the binary mask
• Method:
  o Spectrogram as the atomic representation
  o Prune the representation to retain only high-amplitude components
  o Compute the features related to each retained component
  o Split the spectrogram into contiguous texture windows
  o For each texture window
    o Apply clustering
    o Select the clusters that are likely to belong to the source of interest
  o Apply spectrogram inversion

Performance Criteria
• NMI
  o Does not consider the amplitude of the spectral components
• Spectral-domain Signal-to-Noise Ratio
  o Does not consider phase and framing issues
• Time-domain Signal-to-Noise Ratio
  o Is still not perfect, as it is only vaguely related to perception

Literature
• Very few overview materials on Musical Source Separation
• P. D. O'Grady, B. A. Pearlmutter and S. T. Rickard. Survey of sparse and non-sparse methods in source separation. International Journal of Imaging Systems and Technology, 15(1), 2005.
• E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley and M. E. Davies. Model-based audio source separation. Technical Report C4DM-TR-05-01, Queen Mary University, London, UK, 2006.
• T. Virtanen. Unsupervised Learning Methods for Source Separation in Monaural Music Signals. Chapter in A. Klapuri, M. Davy (Eds.), Signal Processing Methods for Music Transcription, Springer, 2006.
• Stereo Audio Source Separation Evaluation Campaign:
  o http://sassec.gforge.inria.fr
• Tutorial (section 3 is an excerpt)
  o Juan José Burred. Musical Source Separation: Principles and State of the Art. 2nd Int. Workshop on Learning Semantics of Audio Signals (LSAS)