- Number of slides: 49
An Overview of Statistical Pattern Recognition Techniques for Speaker Verification* - Group: Arecio Junior, Isabela Cota, Michelle Moreira, Renan Vilas Novas. Institute of Computing, Unicamp. MO447 Digital Forensics. *[Fazel and Chakrabartty 2011]
Popularity Speaker verification is a popular biometric identification technique used for authenticating and monitoring human subjects through their speech signal. It is attractive for two main reasons: A. It does not require direct contact with the individual, thus avoiding the hurdle of "perceived invasiveness"; B. It does not require deployment of specialized signal transducers, as microphones are now ubiquitous on most portable devices.
Applications of Speaker Verification and Recognition Forensics and tele-commerce: the objective is to automatically authenticate a speaker of interest from their conversation over a voice channel (telephone or wireless phone). Multimedia web portals (Facebook, YouTube, etc.): searching for metadata such as the topic of discussion or participant names and genders in these multimedia documents requires automated technology like speaker verification and recognition.
Types of Speaker Recognition Systems Traditionally, speaker verification systems are classified into two categories based on the constraints imposed on the authentication process: A. Text-dependent systems; B. Text-independent systems
Text-dependent Users are assumed to be "cooperative" and use identical pass-phrases during the training and testing phases. Speech recognition is used in text-dependent speaker verification. Pipeline: Speech Database → Feature Extraction → Speech Recognition → Speaker Verification
Text-independent No vocabulary constraints are imposed on the training and testing phases. The reference (what is spoken in training) and the test (what is uttered in actual use) utterances may have completely different content. Pipeline: Speech Database → Feature Extraction → Speaker Verification
Fundamentals of Speech Based Biometrics Speech is produced when air from the lungs passes through the throat, vocal cords, mouth, and nasal tract. [P. Martins, I. Carbone, A. Pinto, A. Silva and A. Teixeira, "European Portuguese MRI based speech production studies," Speech Commun., vol. 50, no. 11-12, pp. 925-952, 2008.]
Fundamentals of Speech Based Biometrics Different positions of the lips, tongue and palate create different sound patterns and give rise to the physiological and spectral properties of the speech signal, such as pitch, tone and volume.
Fundamentals of Speech Based Biometrics The combination of these properties is typically considered unique to the speaker because they are modulated by the size and shape of the mouth, vocal and nasal tract, along with the size, shape and tension of the vocal cords.
Fundamentals of Speech Based Biometrics Spectrograms corresponding to a sample utterance "fifty-six thirty-five seventy-two" for a male and a female speaker. The horizontal axis represents time, the vertical axis corresponds to frequency, and the color map represents the magnitude of the spectrogram. The parameters of the speech signal that make it unique for each person, pitch, formants (F1-F3), and prosody, are labeled in the figure.
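A spectrogram like the ones described on this slide can be computed directly with scipy. The sketch below uses a synthetic 440 Hz tone in place of the slide's speech recordings, which are not available; the sampling rate and window sizes are illustrative choices, not values from the slides.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                              # assumed sampling rate (Hz)
t = np.arange(fs) / fs                  # 1 second of audio
x = np.sin(2 * np.pi * 440 * t)         # synthetic 440 Hz tone stands in for speech

# Time on the horizontal axis, frequency on the vertical, magnitude as the color map
f, times, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=256)

peak_bin = Sxx.mean(axis=1).argmax()    # strongest frequency bin
print(f[peak_bin])                      # within one FFT bin of 440 Hz
```

With real speech, the ridges of `Sxx` trace the formants (F1-F3) that the slide labels as speaker-distinguishing parameters.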
Architecture of a Speaker Verification System Operation of a speaker verification system typically consists of two phases: A. enrollment: parameters of a speaker-specific statistical model are determined using annotated (pre-labeled) speech data; B. verification: an unknown speech sample is authenticated using the trained speaker-specific model.
Architecture of a Speaker Verification System
Speech Acquisition
Feature Extraction
Importance The heart of a speaker recognition system is feature extraction [Ningaal and Ahmad 2006]
Challenges for Speaker Verification - Limited quantity and quality of training data - Intra-speaker variability - Mismatch between recording conditions during the enrollment and the verification phase
Model of additive and channel noise
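The slide's diagram did not survive extraction. A standard formulation of this model (not taken from the slide itself): clean speech x(t) is convolved with a channel impulse response h(t) and corrupted by additive noise n(t),

```latex
y(t) = x(t) \ast h(t) + n(t)
```

and when the additive noise is small, in the log-spectral domain the stationary channel becomes an additive constant,

```latex
\log \lvert Y(\omega) \rvert \approx \log \lvert X(\omega) \rvert + \log \lvert H(\omega) \rvert ,
```

which is what motivates techniques such as RASTA filtering on the following slides.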
Robust speaker verification systems techniques
Challenges
Techniques Mel-Frequency Cepstral Coefficients (MFCC): features widely used in automatic speech recognition, introduced by Davis and Mermelstein in 1980 and state-of-the-art ever since. RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive): a robust technique which mimics some characteristics of human auditory perception in order to improve the speech signal.
Mel-Frequency Cepstral Coefficients (MFCC) Steps to calculate: 1. Calculate the mel scale; 2. Frame the signal; 3. Windowing; 4. Fast Fourier Transform (FFT); 5. Discrete Cosine Transform (DCT). (Figure: mel filterbank, amplitude vs. frequency, yielding the mel coefficients.)
Mel-Frequency Cepstral Coefficients (MFCC) Steps to calculate: initially it is necessary to obtain the mel-frequency scale, which is linear up to 1000 Hz and logarithmic above (Deller et al. 2000). The formula for converting a frequency f (in Hz) to the mel scale is: mel(f) = 2595 log10(1 + f/700). (Figure: mel scale vs. frequency in Hz.)
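The Hz-to-mel conversion above is a one-liner; the sketch below also shows the property the slide mentions, that the scale is roughly linear below 1 kHz:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(round(hz_to_mel(1000)))  # ~1000: nearly linear below 1 kHz
print(round(hz_to_mel(8000)))  # grows only logarithmically above it
```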
Mel-Frequency Cepstral Coefficients (MFCC) Frame the signal into short frames [Deller et al. 2000]. (Figure: frequency vs. time T, signal split into frames.)
RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) [Schroeder 1976] RASTA (RelAtive SpecTrAl): analyzes the perception of human hearing over time. PLP (Perceptual Linear Predictive): analyzes the frequency response of the communication channel. Combining the two techniques results in a model more robust to simulated channel variation.
RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) [Schroeder 1976] Pipeline: Filtering → Fast Fourier Transform → Bank of linear band-pass filters (1) → Bark frequency power scale (2) → Critical-band power spectrum (3) → Solving a set of linear equations → Cepstral coefficients
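The RASTA filtering stage of this pipeline can be sketched with scipy. The filter coefficients below are the classic RASTA design (an FIR differentiator over a leaky integrator with a pole at 0.98) rather than values given on the slides; the demo shows the key property that motivates RASTA, namely that a fixed channel, which is a constant offset in the log-spectral domain, is filtered out.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spec):
    """Band-pass filter each critical-band trajectory over time (axis 0).
    Classic RASTA coefficients: FIR differentiator / leaky integrator."""
    b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, log_spec, axis=0)

# A stationary channel adds a constant offset in the log-spectral domain;
# the RASTA filter has zero DC gain, so that offset decays to (near) zero.
const_channel = np.ones((1000, 1))
print(abs(rasta_filter(const_channel)[-1, 0]) < 1e-6)  # True
```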
Software for phonetic analyses Praat: a program for doing phonetic analyses and sound manipulations (Boersma and Weenink 2012). It offers an extremely flexible general tool in the 'Edit...' function to visualize and extract information from a sound object. For example: general analysis (waveform, intensity, spectrogram, pitch, duration).
Praat
Praat Duration of Vowels /i/ and /I/ (Gomes 2014) American Speaker Brazilian Speaker
Praat Word stress – Exchange (Gomes 2014) POLICE Brazilian Speaker American Speaker
Speaker Modeling
Types of Models for Speaker Verification Generative models - Based on capturing the statistical properties of speaker-specific speech signals Discriminative models - Optimized to minimize the error on a set of genuine and impostor training samples
Generative Models - Training typically involves data specific to the target speakers - Training focuses on capturing the empirical probability density function of the acoustic feature vectors - Examples: Gaussian Mixture Models (GMM), Hidden Markov Models (HMM)
Discriminative Models - Training involves data corresponding to the target and impostor speakers - Training focuses on estimating the parameters of the manifold which distinguishes the features of the target speakers from the features of the impostor speakers - Examples: Support Vector Machines (SVMs)
Gaussian Mixture Models A GMM is composed of a finite mixture of multivariate Gaussian components. It estimates a general probability density function for speaker verification.
Gaussian Mixture Models Before training, the means of the Gaussians are uniformly spaced and the variances and weights are chosen to be the same. After training, the means and variances of the Gaussians align themselves to the data cluster centers and the weights capture the prior probability of the data.
Gaussian Mixture Models Advantages - Training is relatively fast - Models can be scaled and updated to add new speakers with relative ease Disadvantages - By construction, GMMs are static models that do not take into account the dynamics inherent in the speech vectors
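GMM-based verification as described in these slides can be sketched with scikit-learn: fit one GMM per speaker at enrollment, then compare log-likelihoods at verification. The 2-D Gaussian blobs below are placeholders for real MFCC feature vectors, and the component count is an arbitrary small choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 2-D "feature vectors" for two speakers (placeholders for real MFCCs)
speaker_a = rng.normal(loc=[0, 0], scale=0.5, size=(500, 2))
speaker_b = rng.normal(loc=[3, 3], scale=0.5, size=(500, 2))

# Enrollment: fit one GMM per speaker on that speaker's features
gmm_a = GaussianMixture(n_components=4, random_state=0).fit(speaker_a)
gmm_b = GaussianMixture(n_components=4, random_state=0).fit(speaker_b)

# Verification: average per-frame log-likelihood under the claimed model
test_utt = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))  # really from speaker A
score_a = gmm_a.score(test_utt)
score_b = gmm_b.score(test_utt)
print(score_a > score_b)  # True: claimed identity A is accepted
```

In practice the raw score is compared against a threshold (after score normalization) rather than against another enrolled speaker.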
Support Vector Machines Supervised learning model based on the construction of a set of hyperplanes in a high-dimensional space, which can be used for classification
Support Vector Machines
Support Vector Machines - Provides good verification performance even with relatively few data points in the training set - The learning ability of the classifier is controlled by a regularizer in the SVM training, which determines the trade-off between its complexity and its generalization performance - Good out-of-sample performance
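A discriminative target-vs-impostor classifier of the kind these slides describe can be sketched with scikit-learn's SVC. The synthetic feature vectors and the RBF kernel choice are illustrative assumptions; `C` is the regularizer mentioned above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
target = rng.normal(loc=0.0, scale=1.0, size=(200, 10))    # target speaker features
impostor = rng.normal(loc=1.5, scale=1.0, size=(200, 10))  # pooled impostor features

X = np.vstack([target, impostor])
y = np.array([1] * 200 + [0] * 200)  # 1 = target, 0 = impostor

# C trades off margin complexity against training error (the regularizer)
clf = SVC(kernel='rbf', C=1.0).fit(X, y)

probe = np.zeros((1, 10))            # utterance at the target's feature mean
print(clf.predict(probe)[0])         # 1 -> accept the claimed identity
```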
Score Normalization - Reduce the score variabilities across different channel conditions - Adapt the speaker-dependent threshold - Assumption that the impostors’ scores follow a Gaussian distribution where the mean and the standard deviation depend on the speaker model and/or test utterance
Score Normalization Zero Normalization (Z-norm) - The speaker model is tested against a set of speech signals produced by impostors, resulting in an impostor similarity score distribution - Performed offline Test Normalization (T-norm) - Parameters are estimated using the test utterance - Performed online
Score Normalization Handset Test Normalization (HT-norm) - Parameters are estimated by testing each test utterance against handset-dependent impostor models Channel Normalization (C-norm) - Parameters are estimated by testing each speaker model against a handset- or channel-dependent set of impostors - During testing, the type of handset or channel related to the test utterance is first detected
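All of these normalizations share the same arithmetic under the Gaussian-impostor assumption stated earlier: subtract the impostor mean and divide by the impostor standard deviation. A minimal Z-norm sketch (the impostor scores below are made-up numbers):

```python
import numpy as np

def z_norm(raw_score, impostor_scores):
    """Z-norm: normalize a raw score with the impostor score distribution
    of the claimed speaker model (assumed Gaussian), collected offline."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

# Impostor trials against the claimed model, scored offline at enrollment
impostors = np.array([-2.1, -1.8, -2.5, -1.9, -2.2, -2.0])
print(z_norm(-0.5, impostors) > 3)  # True: a genuine trial stands far above the impostors
```

T-norm uses the same formula but estimates mu and sigma online, from the test utterance scored against a cohort of impostor models.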
Performance of Speaker Verification System False acceptance and false rejection are functions of the decision threshold Detection Cost Function (DCF) National Institute of Standards and Technology (NIST)
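The NIST detection cost function weighs the two threshold-dependent errors mentioned above. The cost weights and target prior below are the classic NIST SRE settings, not values taken from these slides:

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """NIST-style DCF = C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target).
    P_miss = false rejection rate, P_fa = false acceptance rate; both are
    functions of the decision threshold."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)

# Sweeping the threshold trades misses for false acceptances; the operating
# point is chosen to minimize this cost.
print(round(detection_cost(p_miss=0.10, p_fa=0.02), 4))  # 0.0298
```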
Our Implementation - Will not focus on robustness to background noise and channel effects - Dataset composed of 60 people, with 19 audio samples each, captured by the same device - Only Z-norm or T-norm, since we don't have different handsets and channels
Prior Work Campbell, W. M., Sturim, D. E., and Reynolds, D. A. "Support vector machines using GMM supervectors for speaker verification." IEEE Signal Processing Letters 13.5 (2006): 308-311. - MFCC - GMM to generate supervectors - KL divergence (Super L2) - Space inner product (Super Linear)
Prior Work - KL Divergence - Space inner product - Best Result (GMM Super Linear)
References
• Boersma, P. and Weenink, D. (2012). Praat: doing phonetics by computer. Available from http://www.fon.hum.uva.nl/praat.
• Davis, S. and Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366.
• Deller, J. R., et al. (2000). Discrete-Time Processing of Speech Signals. IEEE Press, p. 936.
• Gomes, M. L. C. (2014). O Uso do Programa Praat para Compreensão do "Jeitinho Brasileiro" de Falar Inglês: Uma Experiência de um Grupo de Estudos. IV Congresso Internacional da Abrapui (Associação Brasileira de Professores Universitários de Inglês). Language and Literature in the Age of Technology, Maceió, Alagoas.
• Ningaal, I. Z. and Ahmad, A. M. (2006). The Fundamental of Feature Extraction in Speaker Recognition: A Review. Proceedings of the Postgraduate Annual Research Seminar. Faculty of Computer Science and Information System, University of Technology Malaysia.
• Schroeder, M. R. (1991). Recognition of Complex Acoustic Signals. In Life Sciences Research Report 5, T. H. Bullock, Ed., p. 324, Abakon Verlag, Berlin.
• Zwicker, E. (1961). "Subdivision of the audible frequency range into critical bands." The Journal of the Acoustical Society of America, vol. 33, no. 2, p. 248.


