  • Number of slides: 43

Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson
University of Illinois at Urbana-Champaign, USA
Assistant Professor, Electrical and Computer Engineering Department
Assistant Professor, Beckman Institute for Advanced Science and Technology
Adjunct Professor, Speech and Hearing Sciences Department

Lecture 1: Introduction to Spectrogram Reading
• Review
  – Laplace and Fourier transforms
  – Short-time Fourier transform (STFT) and windowing
  – White noise
  – Periodic signals
• Spectrogram reading: Pitch
  – Wideband and narrowband spectrograms
• Spectrogram reading: Manner
  – Speech physiology
  – Manner classification of phonemes
• Spectrogram reading: Formants
  – Log-linear form of a rational filter

Laplace and Fourier Transforms

Transform Properties

Transforms worth knowing: Impulses

Transforms worth knowing: Filters

Rectangular Window

Hamming & Hanning Windows

Periodic Signals

Random Signals (Noise)

The Short-Time Fourier Transform

The Spectrogram

Narrowband Spectrogram: N > 2T0

Wideband Spectrogram: N < T0
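
The two window-length conditions can be tried out on a synthetic signal. The following is a minimal sketch, not taken from the slides: it assumes an 8 kHz sampling rate, an impulse train with T0 = 80 samples (F0 = 100 Hz), and a Hamming window, and the helper names are made up for illustration. With a long window (N = 4T0) a single frame resolves the pitch harmonics in frequency. With a short window (N = T0/2) the frame energy instead rises and falls once per pitch period.

```python
import cmath
import math

def hamming(n_len):
    """Hamming window of length n_len."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

def frame_level_db(x, start, n_len, freq_hz, fs):
    """Log magnitude (dB) of one Hamming-windowed frame at one frequency."""
    w = hamming(n_len)
    acc = sum(w[n] * x[start + n] * cmath.exp(-2j * math.pi * freq_hz * n / fs)
              for n in range(n_len))
    return 20 * math.log10(abs(acc) + 1e-12)

def frame_energy(x, start, n_len):
    """Energy of one Hamming-windowed frame."""
    w = hamming(n_len)
    return sum((w[n] * x[start + n]) ** 2 for n in range(n_len))

fs = 8000   # sampling rate in Hz (assumed)
t0 = 80     # pitch period in samples, so F0 = fs / t0 = 100 Hz (assumed)
x = [1.0 if n % t0 == 0 else 0.0 for n in range(fs // 2)]  # impulse train

# Narrowband: N = 4*T0 > 2*T0. Compare a harmonic (200 Hz) with the gap
# halfway between two harmonics (250 Hz).
n_narrow = 4 * t0
harmonic_db = frame_level_db(x, 0, n_narrow, 200.0, fs)
gap_db = frame_level_db(x, 0, n_narrow, 250.0, fs)

# Wideband: N = T0/2 < T0. Step the frame by T0/4 and watch the energy.
n_wide = t0 // 2
wide_energy = [frame_energy(x, s, n_wide) for s in range(0, 2 * t0, t0 // 4)]

print(f"narrowband: harmonic {harmonic_db:.1f} dB, gap {gap_db:.1f} dB")
print("wideband frame energies:", [round(e, 3) for e in wide_energy])
```

N = 4T0 is used here rather than a value just above 2T0 because the Hamming main lobe is wide, so harmonics are only barely separated near the N = 2T0 limit.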

Fundamental Frequency
[Figure labels: 10 F0, 4 T0]
Fundamental Frequency (Pitch): F0 = 1/T0
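
The figure labels suggest reading pitch off a spectrogram by measuring a span of several periods (or several harmonics) and dividing, which averages out measurement error. Here is a small sketch of that arithmetic. The function names and the measurements are hypothetical, not values from the slides.

```python
def f0_from_periods(span_seconds, n_periods):
    """F0 = 1/T0, with T0 averaged over several pitch periods (wideband view)."""
    t0 = span_seconds / n_periods
    return 1.0 / t0

def f0_from_harmonics(span_hz, n_harmonics):
    """Harmonics are spaced F0 apart, so n spacings span n * F0 (narrowband view)."""
    return span_hz / n_harmonics

# Hypothetical measurements read from a spectrogram:
print(f0_from_periods(0.040, 4))      # 4 periods spanning 40 ms
print(f0_from_harmonics(1000.0, 10))  # 10 harmonic spacings spanning 1000 Hz
```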

On to New Material: Manner Features, Speech Production, and Landmarks

Anatomy of Speech Production
[Figure labels: Hard Palate, Lips, Nasal Cavity, Oral Cavity, Soft Palate (Open), Pharynx, Tongue Blade, Epiglottis, Tongue Body, Vocal Folds, Jaw, Tongue Root]

Speech sources: Voicing, Turbulence, and Transients
• The vocal folds:
  – A nonlinear, high-impedance oscillator
  – Excitation is like a periodic impulse train
• Turbulence:
  – Vortices striking an obstacle produce white noise
  – Excitation is like white noise
• Transient:
  – High pressure, suddenly released
  – Excitation is like a single loud impulse, δ(t)

The vocal folds: A nonlinear, high-impedance oscillator
Vocal tract “rings” like a bell, shaping the sound produced by the vocal folds. (Cross-sectional area of the vocal tract: 0.5-10 cm².)
The glottis (the opening between the vocal folds, inside the larynx) has an open area of 0.03 cm². In order to get through, air from the lungs must speed up to a high-speed jet.
Vocal folds flap back and forth, driven by the jet, at a rate of 100-200 pulses/second.

Turbulence: Vortices striking an obstacle produce white noise
In a fricative, the area of the tongue constriction is about 0.2 cm². In order to get through, air speeds up into a turbulent jet.
The turbulent jet strikes against downstream obstacles, like the teeth.
The jet contains vortices of all different radii, between 0 mm and 0.2 cm, so the resulting sound contains noise at all frequencies above about 700 Hz.

Transient: High pressure, suddenly released
While the tongue tip is closed, air pressure builds up behind the constriction.
When the constriction is released, there is a sudden change in air flow through the constriction (from 0 to nonzero).
The sudden change in airflow is heard as a “pop.”

The Source-Filter Model of Speech Production
Corresponds to: S(s) = H(s)E(s), where
S(s) = Recorded speech spectrum
E(s) = Source spectrum
H(s) = Transfer function = Filtering by the vocal tract
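
The product S = H·E can be illustrated in discrete time by passing each of the three source types through a filter. This is a rough sketch under stated assumptions, not the author's model: the vocal tract is reduced to a single two-pole resonance at 500 Hz with a 100 Hz bandwidth, the sampling rate is 8 kHz, and all function names are invented for the example. The voiced source has equal energy at every harmonic, so any difference between harmonics in the output comes from the filter.

```python
import cmath
import math
import random

def excitation(kind, n_samples, period=80, seed=0):
    """Source e[n]: 'voiced' impulse train, white 'noise', or one 'transient'."""
    if kind == "voiced":
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    if kind == "noise":
        rng = random.Random(seed)
        return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    if kind == "transient":
        return [1.0] + [0.0] * (n_samples - 1)
    raise ValueError(f"unknown source kind: {kind}")

def resonator(e, center_hz, bandwidth_hz, fs):
    """Filter e[n] through one two-pole resonance (a stand-in for H)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / fs)
    a2 = -r * r
    s, y1, y2 = [], 0.0, 0.0
    for sample in e:
        y = sample + a1 * y1 + a2 * y2   # y[n] = e[n] + a1*y[n-1] + a2*y[n-2]
        s.append(y)
        y1, y2 = y, y1
    return s

def magnitude_at(x, freq_hz, fs):
    """Magnitude of the discrete-time Fourier transform of x at one frequency."""
    return abs(sum(v * cmath.exp(-2j * math.pi * freq_hz * n / fs)
                   for n, v in enumerate(x)))

fs = 8000                              # sampling rate in Hz (assumed)
e = excitation("voiced", 1600)         # F0 = fs / 80 = 100 Hz
s = resonator(e, 500.0, 100.0, fs)     # one assumed formant at 500 Hz

for f in (500.0, 2000.0):              # both are harmonics of 100 Hz
    print(f, magnitude_at(e, f, fs), magnitude_at(s, f, fs))
```

Swapping `"voiced"` for `"noise"` or `"transient"` keeps the same filter and changes only the source, which is the point of the factorization.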

Manner Classification of Phonemes: [continuant]
• [-continuant] = lips or tongue close COMPLETELY on midline of the vocal tract:
  – stops (p, b, t, d, k, g)
  – nasals (m, n, ng)
  – affricates (q, j, ch, zh)
  – syllable-initial lateral (l, e.g., “lake”)
• [+continuant] = no complete closure:
  – fricatives (f, v, s, z, sh, x, Chinese h)
  – glides (w, y, r, English h)
  – vowels (a, e, i, o, u)
  – diphthongs (in “buy,” “bow”)

Manner Classification of Phonemes: [sonorant]
• [+sonorant] = “a sound you can sing” (Latin)
  – nasals (m, n, ng)
  – lateral (l)
  – glides (w, y, r)
  – vowels (a, e, i, o, u)
  – diphthongs (buy, bow)
• [-sonorant] = air pressure builds up behind constriction; voicing amplitude drops (also called an “obstruent consonant”)
  – stops (p, b, t, d, k, g)
  – affricates (q, j, ch, zh)
  – fricatives (f, v, s, z, sh, x)
• Special status of “sonorant” in Chinese:
  – “initial” must be all-sonorant (“liang”) or all-obstruent (“qing”)
  – “final” must be all-sonorant

Sonorant Consonants: Glide, Lateral, Nasal
“layya ton” -- /l/, /y/, /t/, /n/ (the /y/ is [+continuant], the others are [-continuant])
“ame” -- /m/ [-continuant]

Obstruent Consonants: Fricatives, Affricates, and Stops
sa (+continuant)
shi (+continuant)
ba (-continuant)
qe (-continuant)
iji (-continuant)
ita (-continuant)

Place of Primary Articulation
Palatal (Blade): q, j, sh, y, i
Alveolar (Blade): t, d, s, z, n, l
Retroflex (Blade): ch, zh, x, r, er
Dental (Blade): th, dh
Labial (Lips): p, b, f, v, m, w, u, o
Velar (Body): k, g, ng, w, u
Uvular (Body): h, o
Pharyngeal (Body): a, ae
Laryngeal: h

Features of Secondary Articulators: [lateral], [nasal], [affricated], [aspirated]
• [+sonorant, +continuant]: vowels, glides
• [+sonorant, -continuant]:
  – [+nasal] = soft palate is open; air escapes through the nose
  – [+lateral] = tongue is open on the sides; air can escape around edges of tongue
• [-sonorant, +continuant]: fricatives
• [-sonorant, -continuant]:
  – [+affricated]: tongue stays nearly closed after release, causing frication (q, j, ch, zh)
  – [+aspirated]: larynx stays open after release, causing aspiration (p, t, k)
  – [-affricated, -aspirated]: nothing special happens after release; vowel starts immediately (b, d, g)
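
The feature hierarchy on this slide amounts to a small decision table: [sonorant] and [continuant] pick one of four manner classes, and the secondary features refine the two [-continuant] classes. One way to write it down is sketched below. The names `Manner`, `MANNER_CLASSES`, and `classify` are illustrative choices, not part of the lecture.

```python
from typing import NamedTuple

class Manner(NamedTuple):
    sonorant: bool
    continuant: bool

# Four-way manner classes defined by [sonorant] and [continuant].
MANNER_CLASSES = {
    Manner(sonorant=True, continuant=True): "vowel or glide",
    Manner(sonorant=True, continuant=False): "nasal or lateral",
    Manner(sonorant=False, continuant=True): "fricative",
    Manner(sonorant=False, continuant=False): "stop or affricate",
}

def classify(sonorant, continuant, nasal=False, lateral=False,
             affricated=False, aspirated=False):
    """Refine the four-way manner class using the secondary features."""
    base = MANNER_CLASSES[Manner(sonorant, continuant)]
    if sonorant and not continuant:
        if nasal:
            return "nasal"
        if lateral:
            return "lateral"
        return base
    if not sonorant and not continuant:
        if affricated:
            return "affricate"
        if aspirated:
            return "aspirated stop"
        return "unaspirated stop"
    return base

print(classify(sonorant=True, continuant=False, nasal=True))        # e.g. m
print(classify(sonorant=False, continuant=False, affricated=True))  # e.g. q
print(classify(sonorant=False, continuant=False, aspirated=True))   # e.g. p
print(classify(sonorant=False, continuant=False))                   # e.g. b
print(classify(sonorant=False, continuant=True))                    # e.g. s
```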

Sonorant Consonants: Glide, Lateral, Nasal
“layya ton” -- /l/, /y/, /t/, /n/ (the /y/ is [+continuant], the others are [-continuant])
“ame” -- /m/ [-continuant]

Waveforms and Spectrograms: Aspirated and Unaspirated Stops
[Figure panels: Unaspirated: /b/; Aspirated: /t/]

Phonetic Subsegments in the Release of an Aspirated Stop

Waveforms and Spectrograms: Fricatives and Affricates
[Figure panels: sa, shi, qe, iji]

Landmarks: Changes in the features [continuant], [sonorant]
[Figure labels: /t/ release, /l/ release, /t/ closure, /m/ release, /m/ closure, /v/ release, /v/ closure, /k/, /n/ release, /n/ closure]
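
Given a frame-by-frame phone labeling, landmarks in this sense can be located by looking for frames where [sonorant] or [continuant] changes. The sketch below assumes the labels are already known, which sidesteps the actual recognition problem. The feature table covers only a few phones, and the frame sequence for “ame” is hypothetical.

```python
# (sonorant, continuant) per phone, following the manner slides.
FEATURES = {
    "a": (True, True),    # vowel
    "e": (True, True),    # vowel
    "m": (True, False),   # nasal
    "s": (False, True),   # fricative
    "t": (False, False),  # stop
}

def find_landmarks(frame_labels):
    """Return (frame_index, previous_phone, next_phone, changed_features)
    wherever [sonorant] or [continuant] differs between adjacent frames."""
    landmarks = []
    for i in range(1, len(frame_labels)):
        prev, cur = frame_labels[i - 1], frame_labels[i]
        son0, cont0 = FEATURES[prev]
        son1, cont1 = FEATURES[cur]
        changed = []
        if son0 != son1:
            changed.append("sonorant")
        if cont0 != cont1:
            changed.append("continuant")
        if changed:
            landmarks.append((i, prev, cur, changed))
    return landmarks

frames = list("aaammmeee")   # hypothetical frame labels for "ame"
for landmark in find_landmarks(frames):
    print(landmark)
```

For “ame” this finds two landmarks, corresponding to the /m/ closure and the /m/ release. A vowel-to-vowel transition such as a to e produces none, because neither feature changes.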

The Vocal Tract Transfer Function

Log-Spectral Separation of Source and Filter
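
The slide body did not survive extraction. One identity that follows directly from the source-filter product S(s) = H(s)E(s), and that the title presumably refers to, is that taking the log magnitude turns the product into a sum, so the source and filter contributions add in the log spectrum:

```latex
\log |S(j\omega)| = \log |H(j\omega)| + \log |E(j\omega)|
```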

Formant Frequencies = Resonant Frequencies of the Vocal Tract

Formant Frequencies of a Vowel
From Peterson and Barney, “Control Methods Used in a Study of the Vowels,” Journal of the Acoustical Society of America, 1952

Classifying Vowels
F2 = 1200 Hz, F1 = 800 Hz. Therefore the vowel is /AH/.
F2 starts at 1200 Hz and rises to 2000 Hz; F1 starts at 800 Hz and falls to 300 Hz. Therefore the diphthong is /AY/.

Rational Filters: Obstruents

Example: Front Cavity Resonance of /ch/ (q) is near F3 of Following Vowel

Rational Filters: Nasal Consonants

Examples: Nasal Consonants
/m/: This talker makes /m/ with resonances at 1000 Hz and 1800 Hz uncancelled, but with the resonance at 300 Hz cancelled by zeros.
/ng/: This talker makes /ng/ with resonances at 300 Hz and 1000 Hz uncancelled, but with the resonance at 1800 Hz cancelled by zeros.
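
Pole-zero cancellation of this kind can be checked numerically with a rational transfer function. The sketch below is illustrative only: it places pole pairs at the three resonance frequencies quoted above, adds a zero pair at exactly the frequency to be cancelled, and assumes an 8 kHz sampling rate and 100 Hz bandwidths throughout. None of these filter details come from the slides, and a real nasal spectrum would not cancel this cleanly.

```python
import cmath
import math

def conj_pair(freq_hz, bandwidth_hz, fs):
    """Coefficients [1, c1, c2] of (1 - p z^-1)(1 - p* z^-1) for a root at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    return [1.0, -2.0 * r * math.cos(theta), r * r]

def gain_db(freq_hz, pole_hz, zero_hz, fs=8000, bandwidth_hz=100.0):
    """Gain in dB at freq_hz of a filter with the given pole and zero frequencies."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)

    def evaluate(freqs):
        total = 1.0 + 0j
        for f in freqs:
            c0, c1, c2 = conj_pair(f, bandwidth_hz, fs)
            total *= c0 + c1 * z_inv + c2 * z_inv * z_inv
        return total

    h = evaluate(zero_hz) / evaluate(pole_hz)
    return 20.0 * math.log10(abs(h))

poles = [300.0, 1000.0, 1800.0]

for name, zeros in (("no zeros", []), ("/m/-like", [300.0]), ("/ng/-like", [1800.0])):
    levels = [round(gain_db(f, poles, zeros), 1) for f in poles]
    print(name, dict(zip(poles, levels)))
```

With the zero pair at 300 Hz the low peak disappears while the 1000 Hz and 1800 Hz peaks remain, and with the zero pair at 1800 Hz the opposite happens.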

Summary
• Spectrogram is the log magnitude of the STFT.
• Wideband spectrogram: N < T0, pitch shows up in the time domain
• Narrowband spectrogram: N > 2T0, pitch shows up in the frequency domain
• Landmarks occur at changes in the values of the distinctive features [continuant] and [sonorant]:
  – [+continuant, +sonorant]: vowels, glides, diphthongs
  – [+continuant, -sonorant]: fricatives
  – [-continuant, +sonorant]: nasals, laterals
  – [-continuant, -sonorant]: stops, affricates
• Recognition of Vowels and Glides: F1 and F2 are usually enough
• Recognition of Diphthongs: F1 and F2 at two separate points in time (beginning and ending of the vowel).
• Obstruent Consonants: Back cavity formants are cancelled by zeros, leaving only the front cavity formants (e.g., F3 for /sh/, /q/)
• Nasal Consonants: Resonances of the mouth-nose system are often cancelled by zeros, leaving primarily low-frequency energy.