- Number of slides: 22
Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages
Lynette Melnar & Chen Liu
Application and Software Research Center, Motorola Labs
Motorola, Schaumburg, IL 60196, USA
Lynette.Melnar, Chen.Liu@motorola.com
LREC 2008, May 28-30, Marrakech, Morocco
Language Resources
Broader, faster language coverage critical!
[Figure: map of language coverage. Current coverage: IE: Germanic, IE: Italic, IE: Slavic, IE: Indic, Afro-Asiatic: Semitic, Altaic: Turkic, Altaic: Korean, Altaic: Japonic, Sinitic: Mandarin, Sinitic: Cantonese, Sinitic: Wu. Candidate additions (marked "?"): IE: Iranian, IE: Caribbean Spanish, IE: Venezuelan Spanish, IE: Rioplatense Spanish, Dravidian, Austro-Asiatic, Niger-Congo, Sinitic: Xiang, Sinitic: Min, Sinitic: Hakka.]
Language Density
• High-density languages:
– Majority languages associated with large, economically advantaged speaker populations, significant proportions of which regularly use computers
– Bear official status, or non-official predominant use, in one or more countries
– Recognized as important by foreign governments
– Supported by writing traditions, and have been studied and well documented in various types of language resources
• In particular, a high-density language is here associated with several commercially available speech resources of various types (and quality)
– Examples of such high-density languages are major dialects of Arabic, English, Mandarin, and Spanish
Language Density
• Low-density languages:
– Speaker populations may be very small and economically disadvantaged, with little or no computer experience
– Have minority status in the countries where they are spoken
– Not judged to be very significant by foreign governments
– Not supported by a writing system, or supported only by nonstandardized writing systems; not well studied or documented in language resources
• Middle-density languages:
– Fall between the extremes exhibited by high- and low-density languages
– E.g., many Chinese, Indian, and African languages in emerging markets
Extend Existing Language Resources
• Goal: extend the value of existing language resources by developing a tool that uses in-house source-language data to automatically predict target-language ASR performance
– Speech DBs are expensive (in time and money)
– Use source DBs to quickly create (monophone) acoustic models for target languages lacking speech data
• Approach: base the tool on a series of linguistic metrics characterizing the articulatory-phonetic and phonological information of phonemes from both the target and source languages
• Benefit: objectively measure phonological similarity among languages to:
– Help direct future multilingual exploration
– Guide a cost-effective DB acquisition strategy
Crosslingual Phoneme Distance (cf. Liu & Melnar Interspeech 2005)
A combined phonetic-phonological crosslingual distance metric (CPP-CD) automatically selects phoneme models from existing source languages as surrogate models for a target language lacking speech data.
[Figure: CPP-CD components. Feature weights and the phonemes from all languages feed a feature matrix and a phonetic distance (×2); the lexica from all languages yield a monophoneme distribution distance and a biphoneme distribution distance; the three components combine into CPP-CD.]
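The slide names a feature matrix and feature weights but does not give the phonetic-distance formula. As a minimal sketch, assuming each phoneme is a row of numeric articulatory features and that the phonetic distance is a weighted city-block distance over them, the computation could look like this. The feature names, values, and weights are all hypothetical.

```python
from typing import Dict

# Hypothetical articulatory feature matrix: one row per phoneme.
# Feature names and values are illustrative, not the CPP-CD feature set.
FEATURES: Dict[str, Dict[str, float]] = {
    "i": {"high": 1.0, "front": 1.0, "round": 0.0, "voiced": 1.0},
    "y": {"high": 1.0, "front": 1.0, "round": 1.0, "voiced": 1.0},
    "e": {"high": 0.5, "front": 1.0, "round": 0.0, "voiced": 1.0},
}

# Hypothetical per-feature weights (the slide shows that weights exist,
# not their values).
WEIGHTS: Dict[str, float] = {"high": 1.5, "front": 1.0, "round": 0.5, "voiced": 0.25}


def phonetic_distance(p: Dict[str, float], q: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Weighted city-block distance over the weighted features.

    Features missing from a row are treated as 0.0 (an assumption).
    """
    return sum(w * abs(p.get(f, 0.0) - q.get(f, 0.0)) for f, w in weights.items())


if __name__ == "__main__":
    for other in ("i", "y", "e"):
        d = phonetic_distance(FEATURES["i"], FEATURES[other], WEIGHTS)
        print(f"/i/ vs /{other}/: {d}")
```

With these toy values /i/ is at distance 0.0 from itself, 0.5 from /y/ (rounding only), and 0.75 from /e/ (height only). A city-block form keeps each feature's contribution additive and easy to inspect; other metrics would serve equally well for illustration.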
CPP-CD (cf. Liu & Melnar Interspeech 2005)
Target language: Latin American Spanish
Target phoneme: /i/
Source-language phonemes:
Italian /i/: CPP-CD = 0.445999111424272 {PhD(2) = 0} + MD(1) = 0.0931453475842772 + BD(1) = 0.352853763839995
Brazilian Portuguese /i/: CPP-CD = 0.970328191371562 {PhD(2) = 0} + MD(1) = 0.420797537723557 + BD(1) = 0.549530653648005
…
→ The pre-existing acoustic models for the two least distant source-language phonemes for each target-language phoneme are selected for the crosslingual target-language model set.
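The combination and selection steps can be sketched as follows, assuming the bracketed numbers in PhD(2), MD(1), BD(1) are multiplicative weights (the example cannot confirm the phonetic weight, since PhD = 0 for both candidates). Only the component values come from the slide; `cpp_cd` and `select_surrogates` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Assumed reading of the slide labels PhD(2), MD(1), BD(1) as weights.
W_PHON, W_MONO, W_BI = 2.0, 1.0, 1.0

# (phonetic, monophoneme, biphoneme) distance components
Components = Tuple[float, float, float]


def cpp_cd(phon: float, mono: float, bi: float) -> float:
    """Weighted sum of the three component distances (smaller = closer)."""
    return W_PHON * phon + W_MONO * mono + W_BI * bi


def select_surrogates(candidates: Dict[str, Components], k: int = 2) -> List[str]:
    """Return the k candidate source phonemes with the smallest CPP-CD."""
    return sorted(candidates, key=lambda name: cpp_cd(*candidates[name]))[:k]


# Component values for target /i/ (Latin American Spanish), copied from the slide.
CANDIDATES_I: Dict[str, Components] = {
    "Italian /i/": (0.0, 0.0931453475842772, 0.352853763839995),
    "Brazilian Portuguese /i/": (0.0, 0.420797537723557, 0.549530653648005),
}

if __name__ == "__main__":
    for name, comps in CANDIDATES_I.items():
        print(f"{name}: CPP-CD = {cpp_cd(*comps):.6f}")
    print(select_surrogates(CANDIDATES_I))
```

The sums reproduce the slide's totals (0.445999 for Italian /i/, 0.970328 for Brazilian Portuguese /i/), so Italian /i/ ranks first. In the real setting the candidate dictionary would hold every source-language phoneme, not just the two shown.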
Model Training (cf. Liu & Melnar ISCA MULTILING 2006)
1. Embed the CPP-CD-identified HMMs corresponding to the target language into the model training of the donating source languages
2. Tag all donor-language phoneme labels that have been selected as target-language surrogates with a target-language ID tag (LT)
– E.g., donor language 1 phoneme inventory: a_LT, e_LT, u_LT, b_LT, …
3. Tag all donor-language phoneme labels that have NOT been selected as target-language surrogates with the donor-language ID tag (L1, L2, L3, …)
– E.g., donor language 1 phoneme inventory: i_L1, o_L1, d_L1, f_L1, …
4. Build a mega phoneme inventory, lexicon, and phoneme transcription set. The ID-tagged mega phoneme inventory includes all the phonemes from the donor languages and is used to retranscribe all the pronunciation lexica and phoneme transcriptions
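Steps 2 to 4 amount to a relabeling pass over the donor inventories and lexica. A minimal sketch follows; `tag_inventory`, `retranscribe`, the second donor L2, and the lexicon entry are hypothetical, and only the L1 labels mirror the slide's example. Letting selected phonemes from different donors share one `_LT` label is our reading of the slide.

```python
from typing import Dict, List, Set


def tag_inventory(donor_id: str, inventory: List[str], surrogates: Set[str],
                  target_tag: str = "LT") -> Dict[str, str]:
    """Map each donor phoneme to an ID-tagged label.

    Phonemes chosen as target-language surrogates get the target tag; all
    others get the donor's own ID, so they remain distinct across donors.
    """
    return {p: f"{p}_{target_tag if p in surrogates else donor_id}" for p in inventory}


def retranscribe(lexicon: Dict[str, List[str]],
                 mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Rewrite each pronunciation with the ID-tagged labels."""
    return {word: [mapping[p] for p in pron] for word, pron in lexicon.items()}


# L1 mirrors the slide's example; L2 and the lexicon entry are hypothetical.
DONORS = {
    "L1": (["a", "e", "u", "b", "i", "o", "d", "f"], {"a", "e", "u", "b"}),
    "L2": (["a", "i", "s"], {"a", "i"}),
}
MAPPINGS = {d: tag_inventory(d, inv, sel) for d, (inv, sel) in DONORS.items()}
# Mega inventory: every tagged label from every donor, deduplicated.
MEGA_INVENTORY = sorted({label for m in MAPPINGS.values() for label in m.values()})

if __name__ == "__main__":
    print(MEGA_INVENTORY)
    print(retranscribe({"boda": ["b", "o", "d", "a"]}, MAPPINGS["L1"]))
```

Here /a/ from both donors collapses into the shared label a_LT, while /i/ stays split into i_L1 (not selected from L1) and i_LT (selected from L2).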
Model Training (cf. Liu & Melnar ISCA MULTILING 2006)
CPP Crosslingual Prediction (CPP-CP)
To predict the performance of the crosslingual model set:
1. Assign each target phoneme an importance weight
– Based on the phoneme's lexical frequency
2. Measure the contribution and interference effects of all the donor phonemes on each target phoneme
– For each target phoneme, define a matching set consisting of the two least distant source-language donor phonemes
– For each target phoneme, define a confusing set consisting of all the other donor-language phonemes
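Step 1 and the set definitions in step 2 might be sketched as below, assuming "lexical frequency" means how often a phoneme occurs across the pronunciations in the target-language lexicon. The function names, the toy lexicon, and the distance values are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def importance_weights(lexicon: Dict[str, List[str]]) -> Dict[str, float]:
    """Relative frequency of each phoneme across all lexicon pronunciations."""
    counts = Counter(p for pron in lexicon.values() for p in pron)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}


def split_sets(distances: Dict[str, float], k: int = 2) -> Tuple[List[str], List[str]]:
    """Matching set: k least distant donor phonemes; confusing set: the rest."""
    ranked = sorted(distances, key=distances.get)
    return ranked[:k], ranked[k:]


# Toy target lexicon and CPP-CD values for target /i/ (invented).
LEXICON = {"si": ["s", "i"], "mi": ["m", "i"]}
DISTANCES_I = {"i_L1": 0.45, "i_L2": 0.97, "e_L1": 1.3, "y_L3": 1.1}

if __name__ == "__main__":
    print(importance_weights(LEXICON))
    print(split_sets(DISTANCES_I))
```

In the toy lexicon /i/ occurs twice out of four phoneme tokens, so it gets weight 0.5; the two closest donors form the matching set and the rest, ordered by distance, form the confusing set.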
CPP Crosslingual Prediction (CPP-CP)
→ The combined contribution and interference effect of donor phonemes on all the target phonemes is derived from the same component distance measures used to derive CPP-CD (phonetic, monophoneme, biphoneme).
[Equation not recoverable from the extraction. Its legend defines the target phoneme, the matching set (lesser CPP-CD distance is better), the confusing set (greater CPP-CD distance is better), and the lexical importance weight.]
CPP Crosslingual Prediction (CPP-CP)
[Figure: for each target phoneme #1 … #N, the CPP-CD distances of the donor phonemes in the matching set give a contribution effect, and those in the contrasting (confusing) set give an interference effect; together they yield a discriminative contribution effect per target phoneme, and these effects are combined into CPP-CP.]
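The actual CPP-CP equation did not survive extraction, so the following is only a hypothetical stand-in that respects the stated directions: lower matching-set distances and higher confusing-set distances both lower the score, and each target phoneme's effect is scaled by its importance weight. Averaging the matching distances, taking the nearest confusing donor as the interference term, and dividing the two are our own choices; the scale of this toy score has no connection to the published CPP-CP threshold.

```python
from typing import Dict, List


def discriminative_effect(matching: List[float], confusing: List[float]) -> float:
    """Hypothetical per-phoneme effect (smaller is better).

    Contribution: mean CPP-CD of the matching set (lower helps).
    Interference: CPP-CD of the nearest confusing donor (higher helps).
    """
    contribution = sum(matching) / len(matching)
    interference = min(confusing)
    return contribution / interference


def cpp_cp_sketch(weights: Dict[str, float],
                  matching: Dict[str, List[float]],
                  confusing: Dict[str, List[float]]) -> float:
    """Importance-weighted sum of per-phoneme effects over all target phonemes."""
    return sum(w * discriminative_effect(matching[t], confusing[t])
               for t, w in weights.items())


# Invented inputs: importance weights plus CPP-CD values per target phoneme.
WEIGHTS = {"i": 0.5, "s": 0.25, "m": 0.25}
MATCHING = {"i": [0.4, 0.6], "s": [0.25, 0.35], "m": [0.3, 0.5]}
CONFUSING = {"i": [1.0, 2.0], "s": [1.5, 0.6], "m": [2.0, 0.8]}

if __name__ == "__main__":
    print(round(cpp_cp_sketch(WEIGHTS, MATCHING, CONFUSING), 6))
```

Doubling every confusing-set distance halves each per-phoneme effect and therefore the whole score, which is the qualitative behavior the legend describes.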
CPP Crosslingual Prediction (CPP-CP)
• Because the prediction score is based on distance, the smaller the CPP-CP score, the higher the predicted performance of the crosslingual models
• For a set of languages, we evaluate CPP-CP scores against recognition phoneme error rates:
– i.e., the recognition results of the crosslingual models on native speech data
→ Where the prediction scores match the general trend of the recognition phoneme error rates, we consider the relative prediction scores reliable
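The slides judge reliability by whether prediction scores follow the general trend of the phoneme error rates. One conventional way to quantify such agreement, our choice and not necessarily the authors', is Spearman's rank correlation. The scores and error rates below are made-up numbers, not results from the study.

```python
from typing import List, Sequence


def ranks(xs: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up illustration: CPP-CP scores and phoneme error rates (%) for five languages.
SCORES = [0.6, 0.8, 0.9, 1.3, 1.5]
PERS = [10.0, 14.0, 20.0, 30.0, 27.0]

if __name__ == "__main__":
    print(spearman(SCORES, PERS))  # 0.9
```

Because a smaller CPP-CP predicts better performance, agreement shows up as a positive correlation with error rates; here one swapped pair lowers rho from 1.0 to 0.9.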
CPP-CP: Known Performance Factors
• Inconsistency in data quality and task complexity across languages
• Sub-optimal native model quality for some languages
• Non-unary mapping between the acoustic and nonacoustic phonological domains
• Biphoneme inventory size
• Target-source language proximity
CPP-CD: Recognition Experiment
1. Compare the CPP-CD approach with an acoustic-distance approach (Bhattacharyya metric) and a native monolingual modeling approach in ASR experiments
2. Test five target languages:
– Latin American Spanish, Italian, Japanese, Danish, and European Portuguese (all high-density, since speech data is required for testing)
3. Use twenty source languages from six major language groups:
– (i) Afro-Asiatic; (ii) Altaic; (iii) IE Germanic; (iv) IE Italic; (v) IE Slavic; and (vi) Sinitic
• For each crosslingual experiment, the target language is left out of the source-language pool used for model selection
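The acoustic baseline uses the Bhattacharyya metric, which has a closed form for two Gaussians with diagonal covariances; it is sketched below together with the leave-one-out source pool. Treating each phoneme model as a single diagonal Gaussian is a simplification we assume here: real HMM states are usually Gaussian mixtures, and the slides do not say how the metric was applied to them. The language list is an illustrative subset, not the study's twenty-language pool.

```python
import math
from typing import List, Sequence


def bhattacharyya_diag(mu1: Sequence[float], var1: Sequence[float],
                       mu2: Sequence[float], var2: Sequence[float]) -> float:
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    D_B = sum_d [ (mu1_d - mu2_d)^2 / (8 v_d) + 0.5 * ln(v_d / sqrt(var1_d * var2_d)) ],
    where v_d = (var1_d + var2_d) / 2.
    """
    total = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = (v1 + v2) / 2.0
        total += (m1 - m2) ** 2 / (8.0 * v) + 0.5 * math.log(v / math.sqrt(v1 * v2))
    return total


def source_pool(languages: List[str], target: str) -> List[str]:
    """Leave the target language out of the pool used for model selection."""
    return [lang for lang in languages if lang != target]


if __name__ == "__main__":
    print(bhattacharyya_diag([0.0], [1.0], [2.0], [1.0]))  # mean shift only
    print(bhattacharyya_diag([0.0], [1.0], [0.0], [4.0]))  # variance mismatch only
    # Illustrative subset of languages, not the study's source pool.
    pool = ["Italian", "Latin American Spanish", "Japanese", "Danish", "Mandarin"]
    print(source_pool(pool, "Italian"))
```

The first term penalizes separated means and the second penalizes mismatched variances, so two models can be acoustically distant even when their means coincide.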
CPP-CD: Recognition Results
Model Performance Comparison (word accuracy %)
Target Language | Native Baseline | Genetic Relation | Biphoneme Inventory | Acoustic Distance | CPP-CD
Italian | 98.42 | Italic (5) | 613 | 98.27 | 98.52
Spanish | 94.49 | Italic (5) | 520 | 88.61 | 93.06
Japanese | 95.36 | Japonic (0) | 643 | 76.72 | 78.76
Portuguese | 96.31 | Italic (5) | 776 | 77.91 | 72.74
Danish | 94.36 | Germ. (5) | 980 | 72.95 | 70.15
Overall, the CPP-CD and Bhattacharyya acoustic approaches perform similarly: the average recognition result using the Bhattacharyya-derived models is 82.89%, while the average CPP-CD result is 82.65%.
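The two averages quoted on this slide can be checked directly from the table's Acoustic Distance and CPP-CD columns; the per-language gaps show where each approach wins.

```python
from statistics import mean

# Word accuracies (%) copied from the recognition-results table.
ACOUSTIC = {"Italian": 98.27, "Spanish": 88.61, "Japanese": 76.72,
            "Portuguese": 77.91, "Danish": 72.95}
CPP_CD = {"Italian": 98.52, "Spanish": 93.06, "Japanese": 78.76,
          "Portuguese": 72.74, "Danish": 70.15}

if __name__ == "__main__":
    print(f"Bhattacharyya average: {mean(ACOUSTIC.values()):.2f}%")  # 82.89%
    print(f"CPP-CD average: {mean(CPP_CD.values()):.2f}%")  # 82.65%
    # Per-language gap (CPP-CD minus acoustic), in percentage points.
    for lang in CPP_CD:
        print(f"{lang}: {CPP_CD[lang] - ACOUSTIC[lang]:+.2f}")
```

CPP-CD is ahead for Italian, Spanish, and Japanese and behind for Portuguese and Danish, which is why the overall averages end up close.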
CPP-CP Confirmation
[Figure: Comparison of phoneme error rates and CPP-CP scores (1); axes: prediction score (based on linguistic distance) and phoneme error rate]
• The trend of the CPP-CP scores matches that of the phoneme error rates
• Results suggest that CPP-CP scores are indicative of actual recognition performance
• Practical-use threshold:
→ Results suggest that target-language crosslingual model sets with a CPP-CP score of less than 1 are likely to achieve acceptable recognition performance levels
– Spanish, Italian, and Japanese CPP-CP models average a word accuracy of 90.11%
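As a quick check, the 90.11% figure matches the mean of the Spanish, Italian, and Japanese CPP-CD word accuracies in the recognition-results table, and the practical-use rule is a simple comparison against 1. The function name and the two example scores are hypothetical.

```python
from statistics import mean

# Practical-use threshold stated on the slide; the name is hypothetical.
PRACTICAL_USE_THRESHOLD = 1.0


def likely_acceptable(cpp_cp_score: float,
                      threshold: float = PRACTICAL_USE_THRESHOLD) -> bool:
    """Smaller CPP-CP is better; scores below the threshold predict acceptable ASR."""
    return cpp_cp_score < threshold


# CPP-CD word accuracies (%) for Spanish, Italian, and Japanese from the results table.
WORD_ACC = {"Spanish": 93.06, "Italian": 98.52, "Japanese": 78.76}

if __name__ == "__main__":
    print(f"{mean(WORD_ACC.values()):.2f}%")  # 90.11%
    # 0.8 and 1.2 are arbitrary example scores, not values from the study.
    print(likely_acceptable(0.8), likely_acceptable(1.2))  # True False
```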
CPP-CP: Experiment
1. Test with lower-density target languages: Brazilian Portuguese, Cantonese, and Shanghainese
• Brazilian Portuguese, Cantonese, and Shanghainese may be considered middle-density languages in the sense that they are relatively underrepresented in terms of available speech resources.
– We consulted the speech resource catalogues of the European Language Resources Association (ELRA), the Linguistic Data Consortium (LDC), and Appen Ltd. to informally evaluate speech database availability. For both Shanghainese and Cantonese only one speech database is available, while Brazilian Portuguese is associated with three publicly available speech databases.
2. Use the same twenty source languages selected for the first experiment, and likewise compare their crosslingual phoneme error rates and CPP-CP scores
CPP-CP: Experiment
Comparison of performance factors among the eight test languages
Target Language | Biphoneme Inventory | Related Languages
Cantonese | 320 | 2
Shanghainese | 428 | 2
Spanish | 520 | 5
Italian | 613 | 5
Japanese | 643 | 0
Eur. Portuguese | 776 | 5
Danish | 980 | 5
Br. Portuguese | 1046 | 5
Average | 665.75 | 4.125
CPP-CP: Experiment
[Figure: Comparison of phoneme error rates and CPP-CP scores (2)]
• The trends of the CPP-CP scores and phoneme error rates correspond
• Among the newly added languages, only Cantonese has a prediction score below 1
– Its crosslingual model set is expected to achieve a practical-use recognition level
• The small Cantonese biphoneme inventory contributes to its low prediction score
Conclusion
→ Completely automatic approach for predicting crosslingual phoneme distance and speech recognition performance
→ Approach requires:
– Source-language speech data and pronunciation lexica
– Target-language pronunciation lexicon
– NO target-language speech data
– NO acoustic measurements
→ Approach is especially useful for creating and validating crosslingual model sets for low- and middle-density languages
→ Tool can be of great assistance in entering an emerging market quickly and cost-effectively
That's all, thank you!