Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA

Lecture 11. Articulatory Phonology
• Surface phonology problems: reduction, assimilation, deletion
• Articulatory Phonology
  – The mental lexicon (our mental storage for words) is made of gestures, not phonemes
  – Overlap among the gestures results in inter-gesture competition; competition can result in reduction and/or assimilation
  – No mental concept of “sequencing”; instead, the mental representation includes pair-wise coupling constraints between gestures
• Speech motor control
  – Constriction area matters more than non-constriction area
  – Motor control model: only control the constrictions
  – Tract variables
  – Task dynamics
• Prosody
  – Units of prosody: phrases and pitch accents
  – Prosodic gestures: spatial scaling, time stretching
  – Prosodic landmark detection

Pronunciation Variability (Read Speech)
• Manner Class Assimilation: /t/ becomes part of the /n/
• Vowel Reduction: /iy/ becomes /ix/

Pronunciation Variability (Read Speech)
• Syllable Merger: “carry an” becomes “carin”
• Vowel Reduction: /iy/ becomes /ax/

Autosegmental Phonology (Goldsmith, 1975)
• Inter-word phonological rules all have a simple form: manner or place assimilation
• Hypothesis: instructions to the speech articulators are arranged in “autosegmental tiers,” i.e., on a kind of musical score with asynchronous rows
• Assimilation = feature spreading
[Figure: feature tiers for /s/ and /sh/, labeled [-nasal], [+strident], [+blade], [+anterior], [-anterior]]

Articulatory Phonology (Browman and Goldstein, 1990)
• A word is composed of “gestures”
• Gestures are MENTAL speech planning units, but they have close correspondence to articulatory controls
• Example: Mental Lexicon Entry for “she”:
  1. TT-OPEN→FRICATIVE (/š/)
  2. TT-LOC→PALATAL (/š/)
  3. TB-OPEN→NARROW (whole word)
  4. TB-LOC→PALATAL (whole word)
  5. GLOTTIS-OPEN→WIDE (/š/) then GLOTTIS-OPEN→CRITICAL (/i/)
[Figure: gestural score with tiers TB-LOC, TT-LOC, LIP-OP, TB-OPEN, TT-OPEN, VOICING, VELUM]

Articulatory Phonology (Browman and Goldstein, 1990)
• Rule-based Phonologies:
  – Reduction and assimilation are CHANGES in the value of a distinctive feature, just like morpho-phonological processes
• Autosegmental Phonologies:
  – Reduction and assimilation are SUBSTITUTIONS of a neighboring phone’s features in place of the current phone’s features
• Articulatory Phonology:
  – “Frozen” word construction processes may result in the deletion or substitution of gestures in the lexicon, but…
  – The process of sequencing words to create a sentence never deletes or changes any gesture; all gestures stay in the mental representation all the time!!
• Reduction and Assimilation can be explained by
  – Overlap among gestures
  – Competition among overlapping gestures, for control of the same articulators

Example: Manner-Class Assimilation, “Don’t Ask”
[Figure: two gestural scores, one for careful speech and one for fast speech. Both use the gestures TT-CLOSED, TB-OPEN, TT-FRIC, TB-CLOSED, GL-CLO, GL-OPEN, and GL-CRIT, aligned with the phone labels /d/ /o/ /n/ /t/ /ae/ /s/ /k/.]

What’s in the Lexicon? (Browman and Goldstein, 2000)
• Experimental Observation: consonant clusters at the beginning of a syllable (/sp/ in “spat”) show less production variability than consonant clusters at the end of a syllable (/ps/ in “taps”)
• Hypothesis: the mental lexicon includes GESTURES and PAIRWISE COUPLING CONSTRAINTS
  – Two kinds of coupling: simultaneous or sequential
  – Coda consonants FOLLOW the vowel, e.g., in “taps”: TB-WIDE→LIP-CLOSED→TT-CRITICAL
  – Onset consonants are produced SIMULTANEOUSLY with the start of the tongue body vowel gesture, so in “spat” there are two couplings to the vowel: TT-CRITICAL→TB-WIDE and LIP-CLOSED→TB-WIDE. Competition between them yields reduced variability in production.
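The hypothesis above treats a lexical entry as a set of gestures plus pairwise coupling constraints. A minimal sketch of that data structure follows. The class names `Coupling` and `LexicalEntry` and the method `partners` are illustrative assumptions, not notation from Browman and Goldstein, and only the consonant clusters discussed on the slide are encoded (the other consonant of each word is left out).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Coupling:
    """A pairwise constraint between two gestures (hypothetical structure)."""
    first: str
    second: str
    kind: str  # "simultaneous" or "sequential"


@dataclass
class LexicalEntry:
    word: str
    gestures: list
    couplings: list = field(default_factory=list)

    def partners(self, gesture):
        """Return (other gesture, coupling kind) for every constraint on `gesture`."""
        out = []
        for c in self.couplings:
            if c.first == gesture:
                out.append((c.second, c.kind))
            elif c.second == gesture:
                out.append((c.first, c.kind))
        return out


# Coda cluster of "taps": the gestures follow one another.
taps = LexicalEntry(
    word="taps",
    gestures=["TB-WIDE", "LIP-CLOSED", "TT-CRITICAL"],
    couplings=[
        Coupling("TB-WIDE", "LIP-CLOSED", "sequential"),
        Coupling("LIP-CLOSED", "TT-CRITICAL", "sequential"),
    ],
)

# Onset cluster of "spat": both consonant gestures are coupled to the vowel.
spat = LexicalEntry(
    word="spat",
    gestures=["TT-CRITICAL", "LIP-CLOSED", "TB-WIDE"],
    couplings=[
        Coupling("TT-CRITICAL", "TB-WIDE", "simultaneous"),
        Coupling("LIP-CLOSED", "TB-WIDE", "simultaneous"),
    ],
)

print(spat.partners("TB-WIDE"))
print(taps.partners("LIP-CLOSED"))
```

In this representation the onset gestures of “spat” both constrain the same vowel gesture, while each coda gesture of “taps” is tied only to its neighbor in the chain.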

Production Planning: Lexical Entry Turned Into a Gestural Score: “SPAT”

From Gestural Score to Acoustics
• Perturbation Theory (Chiba and Kajiyama, 1941) showed that $dF_n \sim d\log A(x) \approx dA(x)/A(x)$
• The audibility of a change $dA(x)$ is proportional to $1/A(x)$
  – Changes near a constriction (small $A(x)$) are very audible
  – Changes elsewhere (large $A(x)$) are not very audible
• Therefore, talkers carefully control $A(x)$ only near a constriction:
  – Inter-utterance variability of $A(x)$ is an increasing function of $1/A(x)$ (Perkell and Nelson, JASA 1985): $E[(A(x)-m_A(x))^2] \sim 1/m_A(x)$, where $m_A(x) \equiv E[A(x)]$
  – Inter-talker variability of $A(x)$ is an increasing function of $1/A(x)$ (Hasegawa-Johnson et al., JSLHR 2003)
  – Inter-talker variability of $\log A(x)$ is independent of $A(x)$ (Hasegawa-Johnson et al., JSLHR 2003): $E[(\log A(x)-m_{\log A}(x))^2] \sim \text{constant}$
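To make the audibility argument concrete, the following sketch applies the first-order relation $d\log A \approx dA/A$ to the same area change at two places in the vocal tract. The function name and the area values (0.2 cm² for a constriction, 4.0 cm² for an open region, and a 0.1 cm² perturbation) are hypothetical numbers chosen for illustration.

```python
def relative_area_change(area_cm2, delta_cm2):
    """First-order change in log area: d(log A) is approximately dA / A."""
    return delta_cm2 / area_cm2


delta = 0.1  # cm^2, hypothetical perturbation applied at both locations

for label, area in (("constriction", 0.2), ("open region", 4.0)):
    change = relative_area_change(area, delta)
    print(f"{label}: A = {area} cm^2, dA/A = {change:.3f}")
```

Under these assumed values the same absolute perturbation is twenty times larger in relative terms at the constriction, which is the sense in which changes near a constriction are the audible ones.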

Constriction Control as a Model of Speech Motor Control (Stevens and House, JASA, 1955)
• Vocal tract shape controlled by just three control parameters:
  – $x_{POS}$ = POSition of tongue constriction
  – $r_{CD}$ = Constriction Degree = radius of the constriction
  – $r_{LIP}$ = effective radius of the lip constriction
• All other vocal tract areas determined by $A(x) = \pi r(x)^2$, where
$$
r(x) = \begin{cases}
0.7 + 0.144\,x^2, & 0 \le x \le 2.75 \quad \text{(larynx)} \\
\min\!\left(1.6,\; r_{CD} + 0.025\,(1.2 - r_{CD})\,(x - x_{POS})^2\right), & 2.75 \le x \le x_{POS} \quad \text{(pharynx)} \\
r_{CD} + 0.025\,(1.2 - r_{CD})\,(x - x_{POS})^2, & x_{POS} \le x \le 17 \quad \text{(mouth)} \\
r_{LIP}, & 17 \le x \le 18 \quad \text{(lips)}
\end{cases}
$$
• $x$ and $r(x)$ are in centimeters, $A(x)$ in cm²
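The three-parameter model above can be sketched as a short function. This is an illustrative sketch, not the authors' code: it assumes the radius grows quadratically away from the constriction (a plus sign on the quadratic term), so that $r_{CD}$ is the minimum radius at $x_{POS}$. The parameter values in the example are hypothetical, loosely in the spirit of a high front vowel, and the names `radius_cm` and `area_cm2` are invented for this example.

```python
import math


def radius_cm(x, x_pos, r_cd, r_lip):
    """Vocal tract radius (cm) at distance x (cm) from the glottis."""
    if not 0.0 <= x <= 18.0:
        raise ValueError("x must lie between 0 and 18 cm")
    if x <= 2.75:                      # larynx: fixed shape
        return 0.7 + 0.144 * x ** 2
    if x >= 17.0:                      # lips: constant radius
        return r_lip
    # Quadratic widening away from the constriction (assumed plus sign).
    widened = r_cd + 0.025 * (1.2 - r_cd) * (x - x_pos) ** 2
    if x <= x_pos:                     # pharynx: radius capped at 1.6 cm
        return min(1.6, widened)
    return widened                     # mouth


def area_cm2(x, x_pos, r_cd, r_lip):
    """Cross-sectional area A(x) = pi * r(x)^2, in cm^2."""
    return math.pi * radius_cm(x, x_pos, r_cd, r_lip) ** 2


# Hypothetical control parameters: constriction 12 cm from the glottis.
params = dict(x_pos=12.0, r_cd=0.3, r_lip=1.0)
for x in (0.0, 4.0, 12.0, 14.0, 17.5):
    print(f"x = {x:5.1f} cm  r = {radius_cm(x, **params):.3f} cm  "
          f"A = {area_cm2(x, **params):.3f} cm^2")
```

With only `x_pos`, `r_cd`, and `r_lip` as inputs, the whole area function is determined, which is the point of the constriction-control view.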

Examples: Vowel /a/

Examples: Vowel /i/

Extending the Model: Tract Variables (Saltzman and Munhall, 1989)
• Languages treat tongue tip and tongue body differently, e.g., both can have constrictions at the same time
  – Therefore split ($x_{POS}$, $r_{CD}$) → (TTPOS, TTCD, TBPOS, TBCD)
• Talkers can independently control lip area and lip length
  – Therefore split ($r_{LIP}$) → (LIPCD, LIPPOS)
• Soft palate (“velum”) control: open vs. closed
  – Therefore we need a control variable VELCD
• Glottis control: open (breathy), critical (voiced), closed (glottal stop)
  – Control variable GLOCD
• The tract variable model: speech is controlled by a mental controller with an 8-dimensional control vector:
  $a(t) = [\text{LIPCD}, \text{LIPPOS}, \text{TTCD}, \text{TTPOS}, \text{TBCD}, \text{TBPOS}, \text{VELCD}, \text{GLOCD}]^T$

Tract Variables

Task Dynamics: Connecting Gestures to Tract Variables (Saltzman and Munhall, 1989)
• Gestures from the lexicon are sequenced into a GESTURAL SCORE
• The Gestural Score is “played” like a musical score. Each Gesture onset is turned into TRACT VARIABLE TARGETS, $a(t)$.
• The relationship between tract variable targets, $a(t)$, and physical articulator positions, $x(t)$, is given by a 2nd-order system:
  $M \, \dfrac{d^2x}{dt^2} = K(t)\,\big(a(t) - x(t)\big) - R \, \dfrac{dx}{dt}$
  – $K(t)$ = effective tract-variable-stiffness matrix; controlled by the talker, but varies more slowly than $a(t)$
  – $M$ = effective mass matrix
  – $R$ = effective damping matrix
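The second-order system above can be integrated numerically to see how a step change in the target pulls the articulators toward it. The sketch below is a simplified illustration rather than the Saltzman and Munhall implementation. It assumes diagonal $M$, $K$, and $R$ (so each variable evolves independently), a constant stiffness, critical damping, and hypothetical values for the natural frequency, the targets, and the two variables simulated.

```python
import math


def simulate_task_dynamics(target_fn, x0, mass, stiffness, damping,
                           dt=0.001, duration=0.5):
    """Integrate M x'' = K (a(t) - x) - R x' with semi-implicit Euler.

    mass, stiffness, and damping hold the diagonals of M, K, and R
    (an assumption made here; the matrices need not be diagonal).
    """
    n_steps = int(round(duration / dt))
    x = list(x0)
    v = [0.0] * len(x)
    trajectory = [list(x)]
    for step in range(n_steps):
        a = target_fn(step * dt)
        for i in range(len(x)):
            accel = (stiffness[i] * (a[i] - x[i]) - damping[i] * v[i]) / mass[i]
            v[i] += dt * accel      # update velocity first
            x[i] += dt * v[i]       # then position, using the new velocity
        trajectory.append(list(x))
    return trajectory


def step_target(t):
    """Hypothetical gesture onset at t = 0.1 s for two tract variables."""
    return [1.0, 0.5] if t >= 0.1 else [0.0, 0.0]


omega = 2.0 * math.pi * 8.0          # hypothetical natural frequency, rad/s
mass = [1.0, 1.0]
stiffness = [omega ** 2, omega ** 2]
damping = [2.0 * omega, 2.0 * omega]  # critical damping for unit mass

trajectory = simulate_task_dynamics(step_target, [0.0, 0.0],
                                    mass, stiffness, damping)
print("state before onset:", trajectory[50])
print("final state:", trajectory[-1])
```

Before the gesture onset the state stays at rest; afterwards it approaches the new target smoothly, at a rate set by the stiffness and damping.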

Production Planning: Lexical Entry Turned Into a Gestural Score: “SPAT”

Speech Motor Control: Gestural Score Drives Task Dynamics

Speech Production: Vocal Tract Shape Determines Acoustics

Prosody: Beyond Words

1. Prosodic Phrases
• Prosodic Phrasing = the PERCEPTUAL grouping of words
• Prosodic phrase boundaries are usually (not always) a subset of SYNTACTIC phrase boundaries
  – “I like ginger | chocolate ice cream | and cigars”
  – “I like ginger-chocolate ice cream | and cigars”
  – “I bought a book from | the old used bookstore downtown”
• A hierarchy of phrases:
  – Intonational phrase = 1-5 accent phrases
  – Intermediate/Accent phrase = 1-5 prosodic words
  – Prosodic word = 1-2 dictionary words, e.g., “the+open | door”
• Acoustic correlates of phrasing
  – Phrase-final syllable is MUCH LONGER (typically 50-100% longer)
  – Intonational phrase is often followed by a PAUSE
  – (Language-dependent): Phrase may end in a PHRASE TONE
    • Intermediate Phrase Tones in English: L-, H- (low and high)
    • Intonational Phrase Tones in English: L-L%, L-H%, H-L%, H-H%

2. Prominence/Pitch Accent
• Prominence: Usually, a listener can tell which syllable in an accent phrase the talker thinks is most important. That syllable is called “prominent.”
• Acoustic correlates of prominence (language-dependent):
  – DURATION:
    • English, Dutch, and other “stress-timed languages”: prominent syllables are longer
    • French, Japanese, and other “syllable-timed languages”: prominent syllables are not longer
  – HYPER-ARTICULATION: prominent syllables are often more clearly pronounced
  – ENERGY: prominent syllables are louder
  – PITCH ACCENT (language-dependent)
    • English:
      – Extra high pitch: H*
      – Extra low pitch: L*
      – Various combinations (H*+L, L+H*, L*+H)
    • Swedish:
      – Single-peaked accents similar to English
      – Double-peaked accents perhaps unique to Swedish
    • Japanese:
      – F0 is high from the beginning of the accent phrase until the prominent syllable, then drops
    • Chinese:
      – Lexical tone is HYPER-ARTICULATED (e.g., 3rd tone dips MORE than usual)

Example: “Massachusetts,” Unaccented vs. Accented
• Accented: /u/ is longer, louder

Example: “(if they think they can drink and drive, and) get away with it, they’ll pay.”
[Figure: probability of voicing and pitch (F0) tracks for the utterance, with labels in order: get away, L*, with it, H-H%, they’ll, HiF0, pay, H*, L-L%]

Do Prominence and Phrasing Affect Tongue Movement? (Fougeron and Keating, 1997)
• Experiment:
  – Design an electropalate for each subject
    • Electropalate = a plastic insert covered with small electrodes
    • When the tongue touches the palate, the touched electrodes detect contact
    • Keep track of the area and shape of tongue-palate contact as a function of time
  – Subjects read carrier sentences, with the target word in different positions
    • “book” Prominent: “the red book holder, not the red basket holder”
    • “book” Non-prominent: “the red book holder, not the blue book holder”
    • “book” Phrase-final: “the red book, Holbert, not the blue book”
• Result:
  – Prominent words: longer + much more tongue-palate contact
  – Phrase-final words: longer; little change in tongue-palate contact

Do Prominence and Phrasing Affect the MFCCs? (Borys, Hasegawa-Johnson, and Cole, 2003)
[Figure: two decision trees for clustering the phone N]
• Clustered Triphones: questions “R Vowel?” and “L Stop?”; leaves N-VOW, STOP+N, N. WER: 36.2%
• Prosody-Dependent Allophones: question “Pitch Accent?”; leaves N, N-VOW, N*. WER: 25.4%
• BUT: WER of baseline Monophone system = 25.1%

Prosody-dependent allophones: ASR clustering matches EPG
[Table: consonant clusters arranged by accent (Accented, Unaccented) and phrase position (Initial, Medial, Final), with cells assigned to Class 1, Class 2, or Class 3]
Fougeron & Keating (1997) EPG Classes:
1. Strengthened
2. Lengthened
3. Neutral

Why is there a relationship between Prosody and Tongue Movement?

What’s the Scale of a Gestural Score?
[Figure: gestural score with gestures TT-CLOSED, TT-OPEN, TT-FRIC, TB-CLOSED, and VEL-OPEN along a time axis t, with two instants marked t1 and t2]
• What is t2 - t1 in seconds?
• How much does the tongue tip open? (How many cm?)

Prosodic Gestures (Byrd and Saltzman)
[Figure: block diagram of gestural score playback]
• Gestural Score, in Relative Time: TT-CLOSED, TT-OPEN, TT-FRIC, TT-OPEN, TB-CLOSED, VEL-OPEN
• Prosodic Gestures $p_S$ (SPATIAL-SCALE-LARGE, REDUCED) set the Spatial Scale for Gesture Playback
• Prosodic Gestures $p_T$ (TIME-SCALE-STRETCHED) set the Time Scale for Gesture Playback
• A Gestural Score “Playback Head” converts the Gestural Score to Tract Variable Targets $a(t)$ in Absolute Time
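One way to picture the playback idea is as a warp from relative time to absolute time plus a scaling of gesture targets. The sketch below is a rough illustration under several assumptions that are not from Byrd and Saltzman: each prosodic gesture stretches time by a constant factor over its own interval, the spatial scale applies to any gesture whose onset falls inside that interval, and all names, times, and target values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gesture:
    name: str
    onset: float    # relative time (arbitrary units)
    offset: float
    target: float   # tract variable target (hypothetical units)


@dataclass
class ProsodicGesture:
    start: float                 # relative time, assumed >= 0
    end: float
    time_stretch: float = 1.0    # > 1 slows playback inside the interval
    spatial_scale: float = 1.0   # > 1 enlarges targets inside the interval


def absolute_time(rel_t, prosodic):
    """Map relative time to absolute time by accumulating the local stretch."""
    t = rel_t
    for p in prosodic:
        overlap = max(0.0, min(rel_t, p.end) - p.start)
        t += (p.time_stretch - 1.0) * overlap
    return t


def play(score, prosodic, neutral=0.0):
    """Return the score in absolute time with spatially scaled targets."""
    played = []
    for g in score:
        scale = 1.0
        for p in prosodic:
            if p.start <= g.onset < p.end:
                scale *= p.spatial_scale
        played.append(Gesture(
            g.name,
            absolute_time(g.onset, prosodic),
            absolute_time(g.offset, prosodic),
            neutral + scale * (g.target - neutral),
        ))
    return played


score = [
    Gesture("TT-CLOSED", 0.0, 1.0, -1.0),
    Gesture("TT-OPEN", 1.0, 2.0, 1.0),
]
prosodic = [ProsodicGesture(1.0, 2.0, time_stretch=1.5, spatial_scale=1.2)]

for g in play(score, prosodic):
    print(g)
```

In this toy score the second gesture lasts 1.5 times as long and reaches a 20% larger target, while the first gesture is played back unchanged.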

Convert Tract Variable Targets to Tract Variables, Then to Acoustics

Prosodic Landmarks: Detecting Pitch Accents from F0 Contour

Prosodic Landmark Detection (Kim, Hasegawa-Johnson and Chen, IEEE Sign. Proc. Letters, 2003)

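The Time-Delay Recursive Neural Network (Kim, Neurocomputing, 1998)
[Figure: network diagram. The inputs F0 and Prob_Voice pass through delay units (D) to form Time-Delayed Inputs. These feed a 1st Hidden Layer, a 2nd Hidden Layer, and an Output Layer with two classes, Pitch Accented and Pitch Unaccented. A Time-Delayed Internal State is fed back through further delay units.]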

Prosodic Landmark Detection (Kim, Hasegawa-Johnson and Chen, IEEE Sign. Proc. Letters, 2003)

Summary
• Surface phonology problems: reduction, assimilation, deletion
• Articulatory Phonology
  – The mental lexicon (our mental storage for words) is made of gestures, not phonemes
  – No mental concept of “sequencing”; instead, the mental representation includes pair-wise coupling constraints between gestures
• Speech motor control
  – Constriction area matters more than non-constriction area
  – Motor control model: only control the constrictions
  – Tract variables
  – Task dynamics
• Prosody
  – Units of prosody: phrases and pitch accents
  – Prosodic gestures: spatial scaling, time stretching
  – Prosodic landmark detection