36c7eded8ae180a4d647be94c3eb681b.ppt
- Number of slides: 75
Phonetic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems
Steven Greenberg and Shuangyu Chang
International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704
{steveng, shawnc}@icsi.berkeley.edu
http://www.icsi.berkeley.edu/~steveng
Large Vocabulary Continuous Speech Recognition Workshop, Maritime Institute of Technology, Linthicum Heights, MD, May 4, 2001
Take Home Messages
• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
  – Many different analyses (to follow) support this conclusion
  – Consonants appear to be more important than vowels
• SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR FOR ACCURATE RECOGNITION
  – The pattern of errors differs across the syllable (onset, nucleus, coda) and exhibits consistent patterns that are difficult to discern with other units of analysis
• STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE, PARTICULARLY FOR UNDERSTANDING THE NATURE OF WORD-DELETION ERRORS
  – Relation among stress-accent, syllable structure, vocalic identity and length
• THE NATURE OF PRONUNCIATION MODELS AND THEIR RELATION TO LEXICAL REPRESENTATIONS IS A POTENTIALLY KEY FACTOR
  – The unit of lexical representation (phones, articulatory features, etc.) is probably of the utmost importance for optimizing ASR performance
• FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS LIKELY TO DEPEND ON DEEP INSIGHT INTO THE NATURE OF SPOKEN LANGUAGE
Structure of the Presentation
• DESCRIPTION OF THE CORPUS MATERIALS FOR THE 2000 AND 2001 EVALUATIONS
  – 2000: Brief (2-17 s) utterances spoken by hundreds of different speakers. No relation to competitive evaluation
  – 2001: A subset of the competitive evaluation
• BRIEF OVERVIEW OF THE ANALYSIS REGIME COMMON TO THE 2000 AND 2001 PHONETIC EVALUATIONS
  – File formats, time-mediated alignment, statistical analysis of the corpora, etc.
  – Details are contained in “Linguistic Dissection …” (in workshop notebook) and in “An Introduction …” (NIST Speech Transcription Workshop, 2000)
• ANALYSES AND PATTERNS COMMON TO BOTH 2000 AND 2001 EVALUATIONS
  – Syllable structure, phonetic segments, articulatory-acoustic features. Details pertaining to the 2000 evaluation are in the papers cited above
• PHONETIC CONFUSION MATRICES FOR THE 2001 EVALUATION
• FUTURE ANALYSIS PLANNED FOR THIS SPRING WHEN REMAINING 2001 SUBMISSIONS ARRIVE
  – Relationship between phonetic classification, pronunciation and language models
Evaluation Material - 2000
• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
  – Switchboard contains informal telephone dialogues
  – 54 minutes of material that was previously phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
  – All of this material was hand-segmented at either the phonetic-segment or syllabic level by the transcribers
  – The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified.
• THE PHONETIC SYMBOL SET AND STP TRANSCRIPTIONS USED IN THE CURRENT PROJECT ARE AVAILABLE ON THE PHONEVAL WEB SITE: http://www.icsi.berkeley.edu/real/phoneval
• THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL ARE AVAILABLE AT: http://www.icsi.berkeley.edu/real/stp
Evaluation Material Details - 2000
• 581 DIFFERENT SPEAKERS
• AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
• BROAD DISTRIBUTION OF UTTERANCE DURATIONS
  – 2-4 sec: 40%, 4-8 sec: 50%, 8-17 sec: 10%
• COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD
• A WIDE RANGE OF DISCUSSION TOPICS
• VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)
[Figures: number of utterances by subjective difficulty; number of utterances by dialect region]
Evaluation Material - 2001
• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
  – Seventy-four minutes of material phonetically labeled by five highly trained phonetics students from UC-Berkeley plus S. Greenberg
  – The material was hand-segmented at the syllabic level by the transcribers
  – The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained originally on 72 minutes of hand-segmented Switchboard material (similar to the process performed the previous year)
• THE PHONETIC SYMBOL SET AND STP TRANSCRIPTIONS USED ARE AVAILABLE ON THE PHONEVAL WEB SITE: http://www.icsi.berkeley.edu/real/phoneval
Evaluation Material Details - 2001
• A SUBSET OF THE HUB-5 COMPETITIVE EVALUATION CORPUS
  – A representative selection from the evaluation set, including an even distribution of data from the three main recording conditions (cellular and 2 land-line conditions)
• 21 SEPARATE CONVERSATIONS (2 speakers per conversation)
• 42 DIFFERENT SPEAKERS
• A TOTAL OF 74 MINUTES OF SPOKEN LANGUAGE MATERIAL
  – (including FILLED PAUSES, JUNCTURES, etc.)
• AVERAGE LENGTH OF SPEECH PER SPEAKER
  – 106 seconds
• RANGE OF LENGTH PER SPEAKER
  – 48 s (least) to 226 s (most)
• STANDARD DEVIATION
  – 38 s
• APPROXIMATELY ONE-THIRD OF THE MATERIAL FROM CELL PHONES
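The per-speaker average quoted on this slide follows from the other two figures on it (74 minutes, 21 two-speaker conversations). A minimal arithmetic check, where the function name is purely illustrative:

```python
def mean_seconds_per_speaker(total_minutes: float, speakers: int) -> float:
    """Average amount of speech per speaker, in seconds."""
    return total_minutes * 60 / speakers


conversations = 21
speakers = conversations * 2          # 2 speakers per conversation
mean_s = mean_seconds_per_speaker(74, speakers)
print(speakers, round(mean_s))        # 42 106
```

The unrounded value is about 105.7 s, consistent with the 106 s stated on the slide.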
Evaluation Sites - 2000
• EIGHT SITES PARTICIPATED IN THE EVALUATION
  – All eight provided material for the unconstrained-recognition phase
  – Six sites also provided sufficient forced-alignment-recognition material (i.e., phone/word labels and segmentation given the word transcript for each utterance) for a detailed analysis
• AT&T (forced-alignment recognition incomplete, not analyzed)
• Bolt, Beranek and Newman
• Cambridge University
• Dragon (forced-alignment recognition incomplete, not analyzed)
• Johns Hopkins University
• Mississippi State University
• SRI International
• University of Washington
Evaluation Sites - 2001
• SEVEN SITES ARE PARTICIPATING IN THE EVALUATION
  – Unconstrained-recognition phase: 6 sites
  – Forced-alignment: 7 sites
  – Phone classification confidence scores: 5 sites
  – Variable condition recognition: 2 sites
  – Phone strings to words: 1 site
• AT&T
• Bolt, Beranek and Newman
• IBM
• Johns Hopkins University
• Mississippi State University
• Philips
• SRI International
Evaluation Data Status - 2001
• However … NOT ALL OF THE MATERIAL REQUIRED TO PERFORM THE ANALYSES HAS MATERIALIZED
  – The tables below summarize the commitments and currently usable data (certain data arrived in not-quite-ready-for-prime-time form)
[Tables: Commitments; Current (usable data)]
Initial Recognition File - Example
Parameter Key
START - Begin time (in seconds) of phone
DUR - Duration (in seconds) of phone
PHN - Hypothesized phone ID
WORD - Hypothesized word ID
The format is the same for all 674 files in the evaluation set (example courtesy of MSU)
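A minimal sketch of how such a file might be read. The slide gives only the four field names, so the whitespace-separated column layout, the `PhoneRecord` type, and the sample rows below are assumptions made for illustration, not the actual MSU submission format.

```python
from dataclasses import dataclass


@dataclass
class PhoneRecord:
    start: float   # START: begin time of the phone, in seconds
    dur: float     # DUR: duration of the phone, in seconds
    phn: str       # PHN: hypothesized phone ID
    word: str      # WORD: hypothesized word ID

    @property
    def end(self) -> float:
        return self.start + self.dur


def parse_recognition_lines(lines):
    """Parse 'START DUR PHN WORD' rows; skip headers and malformed rows."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        try:
            start, dur = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # e.g. a header row such as "START DUR PHN WORD"
        records.append(PhoneRecord(start, dur, fields[2], fields[3]))
    return records


# Hypothetical rows, invented for this example.
sample = ["START DUR PHN WORD", "0.00 0.08 dh the", "0.08 0.05 ax the", "0.13 0.11 k cat"]
records = parse_recognition_lines(sample)
print(len(records), records[0].phn, records[0].word)
```

Keeping begin time and duration as numbers makes the later time-mediated comparison against the reference segmentation straightforward.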
Phone Mapping Procedure
• EACH SUBMISSION SITE USED A (QUASI) CUSTOM PHONE SET
  – Most of the phone sets are available on the PHONEVAL web site
• THE SITES’ PHONE SETS WERE MAPPED TO A COMMON “REFERENCE” PHONE SET
  – The reference phone set is based on the ICSI Switchboard transcription material (STP), but is adapted to match the less granular symbol sets used by the submission sites
  – The set of mapping conventions relating to the STP (and reference) sets is also available on the PHONEVAL web site
• THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE SUBMISSION SITE PHONE SETS
  – This reverse mapping was done in order to ensure that variants of a phone were given due “credit” in the scoring procedure
  – For example, [em] (syllabic nasal) is mapped to [ix] + [m]; the vowel [ix] maps in certain instances to both [ih] and [ax], depending on the specifics of the phone set
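The reverse mapping can be pictured as expanding each reference phone into the set of site-level sequences that should receive credit. The sketch below uses only the two examples given on the slide and assumes a hypothetical site whose phone set lacks both [em] and [ix]; the real mapping tables are those on the PHONEVAL web site, and `REVERSE_MAP`, `variants` and `expand` are illustrative names.

```python
# Reference phone -> alternative site-level sequences (hypothetical site).
REVERSE_MAP = {
    "em": [["ix", "m"]],       # syllabic nasal -> vowel + nasal (from the slide)
    "ix": [["ih"], ["ax"]],    # [ix] may correspond to either vowel (from the slide)
}


def variants(phone):
    """All site-level sequences that one reference phone may map to."""
    if phone not in REVERSE_MAP:
        return [[phone]]
    out = []
    for alternative in REVERSE_MAP[phone]:
        out.extend(expand(alternative))
    return out


def expand(ref_phones):
    """All acceptable site-level renderings of a reference phone sequence."""
    results = [[]]
    for phone in ref_phones:
        results = [prefix + v for prefix in results for v in variants(phone)]
    return results


print(expand(["em"]))            # [['ih', 'm'], ['ax', 'm']]
print(expand(["k", "ix", "t"]))  # [['k', 'ih', 't'], ['k', 'ax', 't']]
```

The expansion assumes the map contains no cycles; a site that does have [ix] would simply omit that entry.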
Phone Scoring Procedures - 2001
• TWO METHODS WERE USED FOR THE 2001 EVALUATION
  – The “UNCOMPENSATED” form is the same as last year’s scoring method. Only common phone ambiguities (such as [ix], [ih], [ah], [ax], etc.) are allowed
  – The “TRANSCRIPTION-COMPENSATED” form allows for certain phones commonly confused among human transcribers to be scored as “correct,” even though they would otherwise be scored as “wrong”
  – The compensated form of transcription lowers the phone “error” by ca. 10-20%
• TIME-MEDIATED SCORING WAS OF TWO VARIETIES
  – A “STRICT” form is identical to that used in last year’s evaluation. There is a severe penalty for deviations from time boundaries for words and phones
  – A “LENIENT” form allows for a much looser fit between time markers associated with words and phones. A weighting of 0.15 (relative to the STRICT form) was used (by modifying the penalty algorithm in SC-Lite). The 0.15 weight reduced the number of phone “errors” by ca. 20% without a significant decline in false-positive responses
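The difference between the two transcription-scoring forms can be sketched as a change in which phone pairs count as a match. Everything specific below is an assumption for illustration: the equivalence classes are placeholders (the first echoes the ambiguity named on the slide, the others are hypothetical), the phone pairs are invented and already aligned, and only substitutions are scored. The actual evaluation used SC-Lite with time-mediated alignment.

```python
# Classes tolerated in both forms (placeholder based on the slide's example).
UNCOMPENSATED_CLASSES = [{"ix", "ih", "ah", "ax"}]
# Extra classes tolerated only in the compensated form (hypothetical).
COMPENSATED_EXTRA_CLASSES = [{"s", "z"}, {"d", "dx"}]


def phones_match(ref, hyp, compensated=False):
    """True if hyp is identical to ref or falls in a tolerated class with it."""
    if ref == hyp:
        return True
    classes = list(UNCOMPENSATED_CLASSES)
    if compensated:
        classes += COMPENSATED_EXTRA_CLASSES
    return any(ref in c and hyp in c for c in classes)


def substitution_error_rate(pairs, compensated=False):
    """Fraction of aligned (ref, hyp) pairs scored as wrong."""
    wrong = sum(1 for ref, hyp in pairs if not phones_match(ref, hyp, compensated))
    return wrong / len(pairs)


# Invented aligned pairs: (reference phone, hypothesized phone).
pairs = [("s", "z"), ("d", "dx"), ("ix", "ax"), ("k", "t"), ("m", "m")]
print(substitution_error_rate(pairs))                    # 0.6
print(substitution_error_rate(pairs, compensated=True))  # 0.2
```

On this toy input the compensated form forgives the two transcriber-style confusions and leaves only the genuine [k]/[t] mismatch.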
Visualization of a 3-D Confusion Matrix
• When the matrix is sparsely coded, as below, it is more efficient to view the pattern as if squashed against a brick wall (see below)
• The diagonal is plotted in a linear plane
[Figure: 3-D confusion matrix and its flattened view]
Interlabeler Agreement (74%) - 3 Transcribers
• Highest for consonants (especially the stops)
• Lowest for vowels (particularly the lax monophthongs)
[Figure: proportion concordance by phonetic segment, vowels and consonants]
Numbers refer to the concordance diagonal in the confusion matrices
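"Proportion concordance" here is the diagonal of a confusion matrix divided by the row total. A minimal sketch, assuming the matrix is stored as nested dictionaries of counts; the counts are invented for illustration and are not the evaluation's data.

```python
def concordance(confusion):
    """confusion[a][b] = number of times label a was paired with label b.

    Returns (per-segment proportion on the diagonal, overall proportion).
    """
    per_segment = {}
    agree = total = 0
    for label, row in confusion.items():
        row_total = sum(row.values())
        hits = row.get(label, 0)
        if row_total:
            per_segment[label] = hits / row_total
        agree += hits
        total += row_total
    overall = agree / total if total else 0.0
    return per_segment, overall


# Invented counts: a stop with high agreement, a lax vowel with low agreement.
toy = {
    "t": {"t": 9, "d": 1},
    "ih": {"ih": 5, "ax": 3, "ix": 2},
}
per_segment, overall = concordance(toy)
print(per_segment, overall)
```

The same computation applies whether the two label sources are two transcribers or a transcriber and an ASR system.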
Interlabeler Disagreement Patterns - 2001
• INTERLABELER DISAGREEMENT PATTERNS WERE DERIVED FROM THE 2000 EVALUATION MATERIAL
  – Several minutes of material transcribed in common by 3 transcribers were analyzed (2 from the 1996-1997 STP, 1 from the 2001 STP)
• THE FOLLOWING PATTERNS WERE OBSERVED IN THE INTERLABELER DISAGREEMENT ANALYSIS
• Consonants
  – Stop and nasal consonants exhibit a small amount of disagreement
  – Fricatives exhibit slightly higher amounts of disagreement
  – Liquids show a moderate amount of disagreement
• Vowels
  – Lax monophthongs exhibit a high amount of disagreement
  – Diphthongs show a relatively small amount of disagreement
  – Tense, low monophthongs show relatively little disagreement (except for [ao], probably a dialect issue)
• Overall Transcriber Agreement was 70%
Interlabeler Disagreement Patterns - 2001
• FROM SUCH PATTERNS THE FOLLOWING FORMS OF TOLERANCES WERE ALLOWED IN “TRANSCRIPTION COMPENSATED” SCORING:
Segment / Uncompensated / Compensated
[d] [d] [dx] [k] [k] [s] [s] [z] [n] [nx] [ng] [en] [r] [r] [axr] [er] [iy] [ix] [ih] [ao] [aa] [ow] [ax] [ah] [aa] [ix] [ih] [ax] [ih] [iy] [ax]
Transcription Compensation Affects Phone Error
• COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS LOWERS THE PHONE “ERROR” APPRECIABLY FOR MOST SITES
[Figure: phone error rate, STRICT time mediation]
Transcription Compensation Affects Phone Error
• COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS LOWERS THE PHONE “ERROR” APPRECIABLY FOR MOST SITES
[Figure: phone error rate, LENIENT time mediation]
Generation of Evaluation Data - 1
CTM File Format for Word Scoring
• EACH SITE’S MATERIAL WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE)
ERROR KEY
C = CORRECT
I = INSERTION
N = NULL ERROR
S = SUBSTITUTION
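For orientation, the word-error score that such an analysis yields is conventionally the sum of substitutions, deletions and insertions divided by the number of reference words. The sketch below states that standard definition only; it does not reproduce SC-Lite's alignment, and the counts in the usage line are invented.

```python
def word_error_rate(substitutions, deletions, insertions, reference_words):
    """Standard WER: (S + D + I) / N, with N the number of reference words."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (substitutions + deletions + insertions) / reference_words


# Invented counts for illustration.
wer = word_error_rate(substitutions=20, deletions=10, insertions=5, reference_words=100)
print(f"{wer:.0%}")  # 35%
```

Because insertions are counted against the reference length, the rate can exceed 100% for very poor hypotheses.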
Generation of Evaluation Data - 2
Summary of Corpus Acoustic Properties
• LEXICAL PROPERTIES
  – Lexical Identity, Unigram Frequency, Number of Syllables in Word, Number of Phones in Word, Word Duration, Speaking Rate, Prosodic Prominence, Energy Level, Lexical Compounds, Non-Words, Word Position in Utterance
• SYLLABLE PROPERTIES
  – Syllable Structure, Syllable Duration, Syllable Energy, Prosodic Prominence, Prosodic Context
• PHONE PROPERTIES
  – Phonetic Identity, Phone Frequency, Position within the Word, Position within the Syllable, Phone Duration, Speaking Rate, Phonetic Context, Contiguous Phones Correct, Contiguous Phones Wrong, Phone Segmentation, Articulatory Features, Articulatory Feature Distance, Phone Confusion Matrices
• OTHER PROPERTIES
  – Speaker (Dialect, Gender), Utterance Difficulty, Utterance Energy, Utterance Duration
Word- and Phone-Centric “Big Lists”
• THE “BIG LISTS” CONTAIN SUMMARY INFORMATION ON 55-65 SEPARATE PARAMETERS ASSOCIATED WITH PHONES, SYLLABLES, WORDS, UTTERANCES AND SPEAKERS, SYNCHRONIZED TO EITHER THE WORD (THIS SLIDE) OR THE PHONE
Generation of Evaluation Data - 3
Phoneval-2000 Web Site
FORCED ALIGNMENT FILES | RECOGNITION FILES
• Converted Submissions | Forced Alignment Files: BBN, JHU, MSU, WASH | ATT, BBN, JHU, MSU, SRI, WASH
• Word Level Recognition Errors | Word-Level Alignment Errors: BBN, CU, JHU, MSU, SRI, WASH | ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone Error (Free Recognition) | Phone Error (Forced Alignment): CU, BBN, JHU, MSU, SRI, WASH | ATT, BBN, JHU, MSU, WASH
• Word Recognition Phone Mapping | Alignment Word-Phone Mapping: BBN, JHU, MSU, WASH | ATT, BBN, JHU, MSU, WASH
BIG LISTS
• Word-Centric: BBN, CU, JHU, MSU, SRI, WASH | ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone-Centric: BBN, JHU, MSU, WASH | ATT, BBN, JHU, MSU, WASH
• Phonetic Confusion Matrices: BBN, JHU, MSU, WASH | ATT, BBN, JHU, MSU, WASH
• Phone Mapping for Each Site: ATT, BBN, JHU, MSU, WASH
• STP-to-Reference Map
• STP Phone-to-Articulatory-Feature Map
• Description of the STP Phone Set
• STP Transcription Material
• Phone-Word Reference
• Syllable-Word Reference
http://www.icsi.berkeley.edu/real/phoneval
A Syllable-Centric Perspective
In this presentation we will “drill down” from the lexical to the phonetic tiers by way of the syllable, the phone and articulatory-acoustic features
[Diagram tiers: Words, Stress-accent, Phonetic segment, Articulatory-Acoustic Features]
Coarse Word and Phone Recognition
• THE FOLLOWING SLIDES PROVIDE DETAILS ABOUT THE COARSE WORD AND PHONE SCORES FOR THE 2000 AND 2001 EVALUATIONS
• ALTHOUGH THE WORD AND PHONE SCORES ARE ROUGHLY COMPARABLE ACROSS YEARS (FOR ANALOGOUS CONDITIONS), THE 2001 EVALUATION HAS FOUR TIMES THE NUMBER OF SCORING CONDITIONS (FOR PHONES), BASED ON THE LENIENT vs. STRICT TIME-MEDIATION AND THE COMPENSATED vs. UNCOMPENSATED TRANSCRIPTION SCORING
Word Recognition Error (2000)
• WORD ERROR RATES VARY BETWEEN 27% AND 43%
  – Substitutions are the major source of word errors
[Figure: word error rate by site and error type]
Prosodic Stress & Word Error Rate (2000)
• The effect of stress is most concentrated among word-deletion errors
[Figure panels: Unstressed, Intermediate Stress, Fully Stressed]
Data represent averages across all eight ASR systems
Syllable Structure & Word Error Rate (2000)
• Vowel-initial forms show the greatest error
• Polysyllabic forms exhibit the lowest error
Data are averaged across all eight sites
C = Consonant, V = Vowel
Syllable Structure & Word Error Rate (2000)
• VOWEL-INITIAL forms exhibit the HIGHEST error
• POLYSYLLABLES have the LOWEST error rate
Word Recognition Error (2001)
• WORD ERROR RATES VARY BETWEEN 33% AND 49%
  – Substitutions are the major source of word errors
[Figure: word error rate by site and error type, STRICT time mediation]
Word Recognition Error (2001)
• WORD ERROR RATES VARY BETWEEN 31% AND 44%
  – Substitutions are the major source of word errors
[Figure: word error rate by site and error type, LENIENT time mediation]
Prosodic Stress & Word Error Rate (2001)
• NOT YET
• PROSODIC LABELING OF THIS MATERIAL REQUIRED FIRST
• ANALYSIS SCHEDULED FOR JUNE, 2001
Syllable Structure & Word Error Rate (2001)
• Vowel-initial forms show the greatest error
• Polysyllabic forms exhibit the lowest error, except for CVCV forms (probably due to forms such as “gonna,” etc.)
Data are averaged across all five sites
Syllable Structure & Word Error Rate (2001)
• VOWEL-INITIAL forms exhibit the HIGHEST error
• POLYSYLLABLES have the LOWEST error rate
Are Word and Phone Errors Related? (2000)
• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
  – The correlation between the two parameters is 0.78
• The differential error rate is probably related to the use of either pronunciation or language models (or both)
[Figure: word and phone error rate by submission site; annotation: “Pronunciation Models?”]
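A correlation of this kind can be computed across sites with the Pearson coefficient. The sketch below shows the calculation only; the slide does not state which coefficient was used, so Pearson is an assumption, and the per-site rates are placeholders, not the evaluation's numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder per-site error rates (percent), invented for this example.
phone_err = [40.0, 45.0, 50.0]
word_err = [30.0, 35.0, 40.0]
r = pearson(phone_err, word_err)
print(round(r, 2))
```

With only a handful of sites, a coefficient like this is suggestive rather than conclusive, which matches the slide's cautious wording.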
Are Word and Phone Errors Related? (2001)
• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
[Figure: word and phone error rate by site, STRICT time mediation, transcription uncompensated; annotation: “Pronunciation Model?”]
Are Word and Phone Errors Related? (2001)
• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
[Figure: word and phone error rate by site, LENIENT time mediation, transcription uncompensated; annotation: “Pronunciation Model?”]
Phonetic - Pronunciation Mismatch
• THERE ARE A FAR GREATER NUMBER OF PRONUNCIATIONS IN THE TRANSCRIPTION MATERIALS THAN IN THE ASR LEXICONS
• GIVEN THAT MOST WORDS ARE CORRECTLY RECOGNIZED, THIS RESULT IMPLIES THAT PHONETIC CLASSIFICATION IN ASR SYSTEMS IS, BY NECESSITY, HIGHLY AGRANULAR
• THUS, UNUSUAL PRONUNCIATIONS ARE UNLIKELY TO BE DECODED CORRECTLY
• THE COARSE NATURE OF THE PRONUNCIATION MODELS ALSO MAKES IT DIFFICULT TO FINE-TUNE THE RELATION BETWEEN THE PHONETIC CLASSIFIER AND PRONUNCIATION MODEL COMPONENTS
Pronunciation Variation in ASR Lexicons
• MOST WORDS IN THE ASR LEXICONS HAVE A SINGLE PRONUNCIATION
• EXCEPTIONS ARE HIGHLY FREQUENT WORDS (SUCH AS “THE” AND “AND”), WHICH HAVE 2 OR 3 PRONUNCIATION VARIANTS. NO WORD HAS MORE THAN 5 PRONUNCIATION VARIANTS (AT LEAST NOT IN THE PHONETIC OUTPUT PROVIDED TO ICSI FOR THE EVALUATION)
Pronunciation Variation in Switchboard (2001)
• THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR THE 100 MOST FREQUENT WORDS IN THE PHONETIC EVALUATION MATERIAL
[Table columns: WORD, INSTANCES, #PRON]
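The INSTANCES and #PRON columns can be derived from a phonetically transcribed corpus by counting tokens and distinct phone strings per word. A minimal sketch, assuming the transcription is available as (word, phone list) pairs; the tokens below are invented for illustration and are not taken from the STP material.

```python
from collections import defaultdict


def pronunciation_counts(tokens):
    """Map each word to (number of instances, number of distinct pronunciations)."""
    instances = defaultdict(int)
    variants = defaultdict(set)
    for word, phones in tokens:
        instances[word] += 1
        variants[word].add(tuple(phones))
    return {word: (instances[word], len(variants[word])) for word in instances}


# Invented tokens: (word, transcribed phone sequence).
tokens = [
    ("and", ["ae", "n", "d"]),
    ("and", ["ae", "n"]),
    ("and", ["en"]),
    ("and", ["ae", "n"]),
    ("the", ["dh", "ax"]),
    ("the", ["dh", "iy"]),
]
counts = pronunciation_counts(tokens)
print(counts)  # {'and': (4, 3), 'the': (2, 2)}
```

Running the same count over an ASR lexicon gives the other side of the comparison made on these slides.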
Phone Error and Word Length (2000)
• For CORRECT words, only one phone (on average) is misclassified
  – Implication: short words are highly tolerant of phone “errors”
• For INCORRECT words, phone errors increase linearly with word length
Data are averaged across all eight sites
Phone Error and Word Length (2001)
• For CORRECT words, only one phone (on average) is misclassified
  – Implication: short words are highly tolerant of phone “errors”
• For INCORRECT words, phone errors increase linearly with word length
Data are averaged across all five sites
Phone Error - Forced Alignment (2000)
• PHONE ERROR RATES VARY BETWEEN 35% AND 49%
  – This, despite having the word transcript!!!
[Figure: phone error rate by site and error type]
AT&T and Dragon did not provide a complete set of forced alignments
Phone Error - Forced Alignment (2001)
• PHONE ERROR RATES VARY BETWEEN 40% AND 50%
  – Same picture for 2001. Suggests a potential mismatch between lexical and phonetic representations
[Figure: phone error rate by site and error type, STRICT time mediation, transcription uncompensated]
Phone Error - Forced Alignment (2001)
• PHONE ERROR RATES VARY BETWEEN 30% AND 44%
  – Still a poor match between phonetic transcripts and lexical representations
[Figure: phone error rate by site and error type, LENIENT time mediation, transcription uncompensated]
Phone Error - Forced Alignment (2001)
• PHONE ERROR RATES VARY BETWEEN 32% AND 38%
  – Still a lack of concordance with a tolerant scoring method
[Figure: phone error rate by site and error type, STRICT time mediation, transcription compensated]
Phone Error - Forced Alignment (2001)
• PHONE ERROR RATES VARY BETWEEN 23% AND 29%
  – With the most tolerant scoring there is still some lack of concordance
[Figure: phone error rate by site and error type, LENIENT time mediation, transcription compensated]
Phonetic Confusion Matrix - CVC Syllables
• Onset consonants tend to be highly concordant with transcription
• Coda consonants are slightly less concordant, particularly some fricatives
[Figure: proportion concordance by phonetic segment (stops, fricatives, nasals, approximants), CVC syllables, forced alignment]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusions - CCVC, CVCC Syllables
• Certain fricatives are problematic in CVCC coda position
• Redo this figure and others - no wrong words, compare CVC, CVC, etc.
[Figure: proportion concordance by phonetic segment (stops, fricatives, nasals, approximants), CCVC and CVCC syllables, forced alignment]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusions - CV and CVC Nuclei
• Diphthongs and tense, low monophthongs tend to be concordant
• Lax monophthongs tend to be less concordant (cf. stress-accent paper)
[Figure: proportion concordance by phonetic segment, CVC and CV nuclei, forced alignment]
Numbers refer to the concordance diagonal in the confusion matrices
Phone Error - Unconstrained Recognition (2000)
• PHONE ERROR RATES VARY BETWEEN 39% AND 55%
  – Phone error is only slightly greater than for forced alignments
[Figure: phone error rate by site and error type]
Phone Error - Unconstrained Recognition (2001)
• PHONE ERROR RATES VARY BETWEEN 44% AND 55%
  – Results similar to 2000 evaluation
  – Condition most analogous to 2000 evaluation
[Figure: phone error rate by site and error type, STRICT time mediation, transcription uncompensated]
Phone Error - Unconstrained Recognition (2001)
• PHONE ERROR RATES VARY BETWEEN 38% AND 48%
  – Relaxing time-mediation brings down the error slightly
[Figure: phone error rate by site and error type, LENIENT time mediation, transcription uncompensated]
Phone Error - Unconstrained Recognition (2001)
• PHONE ERROR RATES VARY BETWEEN 25% AND 39%
  – Transcription compensation also brings down the error
[Figure: phone error rate by site and error type, STRICT time mediation, transcription compensated]
Phone Error - Unconstrained Recognition (2001)
• PHONE ERROR RATES VARY BETWEEN 27% AND 38%
  – Phone errors decline somewhat more with lax scoring
[Figure: phone error rate by site and error type, LENIENT time mediation, transcription compensated]
Phonetic Confusion Matrix - CV Onsets
• ARROWS pinpoint problem segments
• AFFRICATES and FRICATIVES are problematic in CV onset position
• [d] is also problematic
[Figure: proportion concordance by phonetic segment (stops, fricatives, nasals, approximants), correct vs. wrong words, unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CVC Onsets
• Fricatives and affricates are problematic in CVC onset position
[Figure: proportion concordance by phonetic segment (stops, fricatives, nasals, approximants), correct vs. wrong words, unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CCVC Onsets
• Certain fricatives are particularly problematic in CCVC onset position
[Figure: proportion concordance by phonetic segment (stops, fricatives, nasals, approximants), correct vs. wrong words, unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CVC Codas
• Fricatives are particularly problematic in CVC coda position
• Certain stops are also problematic in CVC coda position
[Figure: proportion concordance by phonetic segment (stops, fricatives, nasals, approximants), correct vs. wrong words, unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CVCC Codas
• Certain fricatives are problematic in CVCC coda position
• [d] is also problematic in CVCC coda position
[Figure: proportion concordance by phonetic segment (stops, fricatives, nasals, approximants), correct vs. wrong words, unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CVC Nuclei
• Certain vowels are a problem in CVC nucleus position
• Note that the level of concordance is much lower for vowels than for consonants (in onset or coda position), even for correct words
[Figure: proportion concordance by phonetic segment, correct vs. wrong words, unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CV Nuclei
• Diphthongs and low, tense vowels are more concordant with the transcription than the lax monophthongs (cf. stress-accent paper)
[Figure: proportion concordance by phonetic segment, correct vs. wrong words, unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Consonantal Onsets and AF Errors (2000)
• Syllable onsets are intolerant of AF errors in CORRECT words
• Place and manner AF errors are particularly high in INCORRECT onsets
Data are averaged across all eight sites
Consonantal Onsets and AF Errors (2001)
• Syllable onsets are intolerant of AF errors, particularly place, in CORRECT words
• Place and manner AF errors are particularly high in INCORRECT onsets
• Syllable structure does not have the same effect as in the 2000 analysis
Data are averaged across all five sites
Consonantal Codas and AF Errors (2000)
• Syllable codas exhibit a slightly higher tolerance for error than onsets
• There is a high degree of AF error for wrong words
Data are averaged across all eight sites
Consonantal Codas and AF Errors (2001)
• Syllable codas exhibit a slightly higher tolerance for error than onsets
• There is a high degree of AF error for wrong words
Data are averaged across all five sites
Vocalic Nuclei and AF Errors (2000)
• Nuclei exhibit a much higher tolerance for error than onsets & codas
• There are many more errors than among syllabic onsets & codas
Data are averaged across all eight sites
Vocalic Nuclei and AF Errors (2001)
• Nuclei exhibit a much higher tolerance for error than onsets & codas, particularly for height and front/back
• There are many more errors than among syllabic onsets & codas
Data are averaged across all five sites
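An articulatory-feature (AF) error can be counted by comparing the feature values of the reference and hypothesized phones dimension by dimension. The sketch below is an illustration under stated assumptions: the feature table covers five consonants using textbook place, manner and voicing values, and is not the STP phone-to-articulatory-feature map referred to in this presentation; `AF` and `af_errors` are hypothetical names.

```python
# Small illustrative feature table (textbook values, not the STP map).
AF = {
    "p": {"place": "bilabial", "manner": "stop", "voicing": "voiceless"},
    "b": {"place": "bilabial", "manner": "stop", "voicing": "voiced"},
    "t": {"place": "alveolar", "manner": "stop", "voicing": "voiceless"},
    "s": {"place": "alveolar", "manner": "fricative", "voicing": "voiceless"},
    "z": {"place": "alveolar", "manner": "fricative", "voicing": "voiced"},
}


def af_errors(ref, hyp):
    """Feature dimensions on which the hypothesized phone differs from the reference."""
    return sorted(dim for dim in AF[ref] if AF[ref][dim] != AF[hyp][dim])


print(af_errors("p", "b"))  # ['voicing']
print(af_errors("p", "s"))  # ['manner', 'place']
print(af_errors("t", "t"))  # []
```

The length of the returned list serves as a simple articulatory-feature distance; vowels would need their own dimensions, such as height and front/back.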
Into the (Near) Future …
• WITH THE ARRIVAL OF THE REMAINING FORCED-ALIGNMENT AND UNCONSTRAINED RECOGNITION DATA it will be possible to investigate the relative contribution of the phonetic classification, pronunciation and language models to recognition performance
  – In order to do this, it is necessary to obtain unconstrained recognition, forced alignment and phone-confidence material from each site (to the extent possible) [the phone confidence metric is problematic]
• CUSTOMIZED ANALYSES FOR INDIVIDUAL SITES
  – SRI has different versions of their system (with & w/o adaptation, etc.)
  – AT&T will use phone strings from ICSI transcription material
  – Individual diagnostics for each site (are there significant differences for specific parameters?)
• MOST OF THE DATA FOR THE 2001 EVALUATION WILL BE POSTED ON THE PHONEVAL WEB SITE SHORTLY
• WEB-BASED ORACLE DATABASE APPLICATION IS NEAR COMPLETION
  – Will enable searches over the web of the Phoneval corpus and be able to graph the results (this is the tricky part, given the ugly nature of Oracle Web DB…)
• A PAPER DESCRIBING THE FULL SET OF ANALYSES WILL BE AVAILABLE AT THE END OF JUNE (2001)
Summary and Conclusions
• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
  – Many different analyses (to follow) support this conclusion
  – Consonants appear to be more important than vowels
• SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR FOR ACCURATE RECOGNITION
  – The pattern of errors differs across the syllable (onset, nucleus, coda) and exhibits consistent patterns that are difficult to discern with other units of analysis
• STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE, PARTICULARLY FOR UNDERSTANDING THE NATURE OF WORD-DELETION ERRORS
  – Relation among stress-accent, syllable structure, vocalic identity and length
• THE NATURE OF PRONUNCIATION MODELS AND THEIR RELATION TO LEXICAL REPRESENTATIONS IS A POTENTIALLY KEY FACTOR
  – The unit of lexical representation (phones, articulatory features, etc.) is probably of the utmost importance for optimizing ASR performance
• FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS LIKELY TO DEPEND ON DEEP INSIGHT INTO THE NATURE OF SPOKEN LANGUAGE