

  • Number of slides: 31

Data Recording, Transcription, and Speech Recognition for Egypt
Tanja Schultz, Carnegie Mellon University
Cairo, Egypt, May 21, 2001

Outline
1. Data Requirements for Speech Recognition
   - Audio data
   - Pronunciation dictionary
   - Text corpus data
   - Recording of audio data
   - Transcription of audio data
2. Initialization of an Egyptian Arabic Speech Recognition Engine
   - Multilingual Speech Recognition
   - Rapid Adaptation to new Languages

Part 1: Requirements for Speech Recognition
- Data requirements
- Audio data
- Pronunciation dictionary
- Text corpus data
- Recording of audio data
- Transcription of audio data
Thanks to Celine Morel and Susanne Burger

Speech Recognition (overview diagram): Speech Input → Preprocessing → Decoding/Search → Postprocessing → Synthesis (TTS). Example: the input "h e l l o" yields hypotheses such as Hello, Hale Bob, Hallo.

Fundamental Equation of SR: P(W|x) = [ P(x|W) * P(W) ] / P(x)
(Diagram: the acoustic model covers sub-phone units such as A-b, A-m, A-e; the pronunciation dictionary maps words to phonemes, e.g. am → AE M, are → AR, I → AI, you → JU, we → VE; the language model covers word sequences such as "I am", "you are", "we are".)
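Restating the slide's equation in standard notation and making the search over word sequences explicit (the "Decoding/Search" step of the previous slide); the last step holds because P(x) does not depend on W:

```latex
\hat{W} = \arg\max_{W} P(W \mid x)
        = \arg\max_{W} \frac{P(x \mid W)\, P(W)}{P(x)}
        = \arg\max_{W} \underbrace{P(x \mid W)}_{\text{acoustic model}} \,
                       \underbrace{P(W)}_{\text{language model}}
```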

SR: Data Requirements (diagram): audio data and the phoneme set train the acoustic model; the pronunciation dictionary maps words to phoneme strings (am → AE M, are → AR, I → AI, you → JU, we → VE); text data trains the language model (word sequences such as "I am", "you are", "we are").

Audio Data
For training and testing the SR engine, a large amount of high-quality data in the target language has to be collected.
- What kind of data is needed: scenario and task
- How to collect the data: recording setup, preparation of information
- Quality of data: sampling rate, resolution
- Amount of data: number of dialogs and speakers
- Transcription of audio data

What Kind of Audio Data?
- C-STAR scenario: travel arrangement (planning a vacation trip, booking a hotel room, ...)
- Scenario is realistic and attractive to the participants
- Dialog between two people: one agent (travel assistant) and one client (traveler who pretends to visit a specific site)
- Speakers get instructions about WHAT task they have to accomplish, but not HOW to do it
- Role-playing setup

How to Collect Audio Data
- Recording setup:
  - The dialog partners can NOT see each other, i.e. no face-to-face contact (in preparation for telephone and web applications)
  - No non-verbal communication
  - Spontaneous speech (noise effects, disfluencies, ... may occur)
  - No push-to-talk; try to avoid crosstalk
  - Balanced dialogs
- Dialog structure, task:
  - Greetings and formalities between the dialog partners
  - Client gives information such as the number of persons traveling, dates of travel (arrival/departure), interests
  - Client asks questions about means of transportation (train, flight), hotel or apartment modalities, visits to sights or cultural events
  - Agent provides information according to the client's questions

Prepare Information for Client and Agent
- Agent: hotel list (3-4 hotels per dialog)
- Agent: transportation list (3-4 flights, train and bus schedules)
- Agent: list of 3-4 cultural events per dialog
- Client: information about the specific task:
  - who is traveling (e.g. client travels with partner + two kids)
  - when is s/he traveling (e.g. 2-week vacation trip in July)
  - where (e.g. trip to Pennsylvania, US)
  - how (e.g. direct flight to Pittsburgh, rental car)
  - what the places of interest are (CMU in Pittsburgh, Liberty Bell in Philadelphia, ...)
- Date and time of the recording may be faked
- The dialog takes place at the recording site
- Example sheets: Celine Morel

Quality and Quantity of Audio Data
- Quality of data:
  - High-quality clean speech
  - Close-speaking microphone, e.g. Sennheiser H-420
  - 16 kHz sampling rate, 16-bit resolution
- Amount of data:
  - Minimum of 10 hours of spoken speech
  - Average dialog length 10-20 minutes, so 10 hours correspond to 30-60 dialogs
- Number of speakers:
  - As many speakers as possible (speaker-independent acoustic models)
  - 30-60 dialogs = a maximum of 120 different speakers
  - Split the speakers/dialogs into three disjoint subsets: training set, development test set, evaluation test set
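A quick check of the dialog and speaker figures above; the only assumption beyond the slide's numbers is the two speakers (agent and client) per dialog from the earlier scenario description:

```python
# Rough check of the dialog/speaker counts for 10 hours of dialog speech.
target_minutes = 10 * 60                  # 10 hours of speech
dialog_len_min, dialog_len_max = 10, 20   # average dialog length range, in minutes

dialogs_max = target_minutes // dialog_len_min   # 60 dialogs if dialogs are short
dialogs_min = target_minutes // dialog_len_max   # 30 dialogs if dialogs are long
speakers_max = dialogs_max * 2                   # agent + client per dialog
print(dialogs_min, dialogs_max, speakers_max)    # 30 60 120
```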

Recording Tool: Total Recorder
- http://www.highcriteria.com/download/totrec.exe
- Registration fee: $11.95
- IBM-compatible PC, sound card (e.g. SoundBlaster)
- Close-speaking microphone (e.g. Sennheiser H-420)
- Win 95, Win 98, Win 2000, WinNT
- (Diagram: sound board → driver → Total Recorder)

Transcription of Audio Data
For training the SR engine, the spoken data need to be transcribed manually.
- Very time consuming (10-20 times real time)
- The more accurately transcribed, the more valuable the data
- Since we already have the pronunciations, only word-based transcriptions are needed
- Transcription conventions by Susanne Burger (describe the notation); download from http://www.cs.cmu.edu/~tanja
- Transcription tool: TransEdit (Burger & Meier)

Transliteration Conventions
Example turn:
tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja and +/cos/+ contains one restart
- Parsability: one turn per line, each turn labeled with an ID (tanja_0001)
- Consistency:
  - tagging of proper names: ~Tanja
  - tagging of numbers
  - special noise markers: +uhm+
  - no capitalization at the beginning of turns
- Filter programs
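A minimal sketch of such a filter program; the tag patterns (+noise+, +/restart/+, ~Name) are taken from the example turn above, while the function names and the decision to drop all markers are assumptions for illustration:

```python
import re

def parse_turn(line):
    """Split a transliteration line into its turn ID and raw text."""
    turn_id, text = line.split(":", 1)
    return turn_id.strip(), text.strip()

def strip_markup(text):
    """Remove transcription markup, keeping only the spoken words."""
    text = re.sub(r"\+/[^+]*?/\+", "", text)   # restarts / false starts, e.g. +/cos/+
    text = re.sub(r"\+[^+]*?\+", "", text)     # noise and filler markers, e.g. +uhm+, +pause+
    text = text.replace("~", "")               # drop the proper-name tag, keep the name
    return " ".join(text.split())

line = "tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja and +/cos/+ contains one restart"
turn_id, raw = parse_turn(line)
print(turn_id)            # tanja_0001
print(strip_markup(raw))  # this sentence was spoken by Tanja and contains one restart
```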

Pronunciation Dictionary
For each word seen in the training set, a pronunciation of this word has to be defined in terms of the phoneme set.
- Define an appropriate phoneme set: the atomic sounds of the language
- Describe each word to be recognized in terms of this phoneme set
- Example in English: I → AI, you → JU
- Strong grapheme-to-phoneme relation in Egyptian/Standard Arabic IF the vocalization is transcribed (romanized transcription)
- A grapheme-to-phoneme tool for Standard Arabic (data collected in Tunisia and Palestine) has already been developed at CMU (master's student Jamal Abu-Alwan)
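A toy sketch of how such a dictionary pairs words with phoneme strings and how a letter-to-phoneme rule table could cover unseen words when the grapheme-to-phoneme relation is strong; the two English entries come from the slide, while the rule table and function names are illustrative placeholders, not the actual CMU tool:

```python
# Toy pronunciation dictionary: every word in the training transcriptions must
# map to a phoneme string drawn from the chosen phoneme set.
pron_dict = {
    "i":   ["AI"],   # example from the slide
    "you": ["JU"],   # example from the slide
}

# With a strong grapheme-to-phoneme relation (as in vocalized, romanized Arabic),
# unseen words can be expanded letter by letter. This rule table is a made-up
# placeholder, not the actual CMU grapheme-to-phoneme tool.
g2p_rules = {"b": "B", "a": "AE", "t": "T"}

def pronounce(word):
    """Look the word up in the dictionary, else expand it letter by letter."""
    return pron_dict.get(word, [g2p_rules.get(ch, ch.upper()) for ch in word])

print(pronounce("you"))   # ['JU']
print(pronounce("bat"))   # ['B', 'AE', 'T']
```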

Phoneme Set (e.g. Standard Arabic)

Text Data
For training the language model we need a huge corpus of text data from the same domain.
- The language model helps guide the search
- Compute probabilities of words, word pairs, and word triples
- Millions of words are needed to estimate these probabilities
- The text corpus should be as close as possible to the given domain
- The writing systems must be the same
- Other text may be useful as background material
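A minimal sketch of the counting behind such a language model (relative frequencies of words, word pairs, and word triples); the toy corpus reuses the word sequences from the earlier diagram, smoothing is omitted, and this illustrates the idea rather than the actual LM toolkit:

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count word n-grams over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

corpus = [["i", "am"], ["you", "are"], ["we", "are"]]   # stand-in for millions of words
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)

def p_bigram(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[(w1,)]

print(p_bigram("you", "are"))   # 1.0 in this toy corpus
```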

Computer Requirements
- Data collection:
  - IBM-compatible PC
  - High-quality sound card such as SoundBlaster
  - Close-speaking microphone such as Sennheiser H-420
  - Operating system: Win 95
  - Large hard disk: 16000 x 2 bytes per sec ≈ 30 kBytes/sec ≈ 2 MB/min ≈ 120 MB/hr ≈ 1.2 GBytes for 10 hours of spoken speech
- Speech recognition:
  - Fast processor, as fast as possible
  - 512 MB RAM
  - Additional 2-4 GBytes for temporary files during training and testing
- Translation: Donna, Lori?
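The disk-space figures follow directly from the recording format (16 kHz sampling, 16-bit samples, one channel, as in the slide's own 16000 x 2 calculation); a quick check of the arithmetic:

```python
# Storage needed for 16 kHz, 16-bit (2 bytes/sample), single-channel recordings.
bytes_per_sec = 16000 * 2
print(bytes_per_sec / 1000)              # 32.0 kBytes/sec   (slide rounds to ~30)
print(bytes_per_sec * 60 / 1e6)          # ~1.9 MB per minute    (~2 MB/min)
print(bytes_per_sec * 3600 / 1e6)        # ~115 MB per hour      (~120 MB/hr)
print(bytes_per_sec * 3600 * 10 / 1e9)   # ~1.15 GB for 10 hours (~1.2 GB)
```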

Discussion
- Speech recognizer for Egyptian Arabic or for Standard Arabic?
- Egyptian Arabic:
  - Spoken, actually used language: more interesting for a human-to-human speech-to-speech translation system?
  - Standardized pronunciation?
  - Large text resources available in Egyptian Arabic?
  - Does the parser output follow Standard Arabic vocalization?
  - Use the Egyptian Arabic CallHome data and pronunciation dictionaries (LDC)?
- Standard Arabic:
  - Useful to a larger community?
  - Canonical pronunciation?
  - Preliminary speech recognizer and data already available at CMU
  - Larger text resources available?
- Do we want monolingual dialogs (agent & client) or multilingual recordings?

Part 2: Initialization of an Egyptian Arabic Speech Recognition Engine
- Multilingual Speech Recognition
- Rapid Adaptation to new Languages

Initialization of an Egyptian Arabic SR Engine
- Rapid initialization of an Egyptian Arabic speech recognizer?
- Pronunciation dictionary: a grapheme-to-phoneme tool is available if vocalization/romanization is provided by the transliteration
- Language model: text corpora, if vocalized
- Apply the Egyptian Arabic parser for vocalization?
- Acoustic models: initialization or adaptation according to our fast adaptation approach, PDTS

GlobalPhone: Multilingual Database
- Languages: Arabic, Chinese (Mandarin), Chinese (Shanghai), English, French, German, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish
- Widespread languages, native speakers, uniformity, broad domain
- Huge text resources: Internet, newspapers
- Total resources: 15 languages so far, 300 hours of speech data

Speech Recognition in Multiple Languages
- Goal: speech recognition in many different languages
- Problem: only few or no training data available (costs, time)
(Diagram of the resources needed per language: speech data (10 hours) → acoustic model (AM); pronunciation rules, e.g. Portuguese ela /e/l/a/, eu /e/u/, sou /s/u/ → lexicon (Lex); text data, e.g. "eu sou", "você é", "ela é" → language model (LM); plus the sound system of the language.)

Speech Recognition in Multiple Languages
(Same resource diagram as above: speech data → AM, pronunciation rules → Lex, text data → LM.)

Multilingual Acoustic Modeling
Step 1:
- Combine acoustic models
- Share data across languages

Multilingual Acoustic Modeling
Sound production is human, not language specific:
- International Phonetic Alphabet (IPA)
- Multilingual acoustic modeling:
  1) Universal sound inventory based on the IPA: 485 sounds are reduced to 162 IPA sound classes
  2) Each sound class is represented by one "phoneme" which is trained by sharing data across languages:
     - m, n, s, l occur in all languages
     - p, b, t, d, k, g, f and i, u, e, a, o occur in almost all languages
     - no sharing of triphthongs and palatal consonants
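A schematic illustration of the data-sharing idea: language-specific phonemes are mapped to shared IPA classes and their training data are pooled. The phoneme-to-IPA table below is a tiny made-up fragment, not the actual 485-to-162 inventory:

```python
from collections import defaultdict

# Hypothetical fragment of a phoneme-to-IPA mapping; the real inventory reduces
# 485 language-specific sounds to 162 shared IPA classes.
to_ipa = {
    ("EN", "M"): "m", ("DE", "M"): "m", ("PT", "M"): "m",   # m occurs in all languages
    ("EN", "IY"): "i", ("DE", "I"): "i",                    # i occurs in almost all languages
}

def pool_training_data(samples):
    """samples: ((language, phoneme), features) pairs -> training data per IPA class."""
    pooled = defaultdict(list)
    for (lang, phone), feats in samples:
        pooled[to_ipa[(lang, phone)]].append(feats)
    return pooled

samples = [(("EN", "M"), "feat_en_1"), (("DE", "M"), "feat_de_1"), (("PT", "M"), "feat_pt_1")]
print(dict(pool_training_data(samples)))   # {'m': ['feat_en_1', 'feat_de_1', 'feat_pt_1']}
```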

Rapid Language Adaptation
Step 2:
- Use the multilingual (ML) acoustic models, borrow data
- Adapt the ML acoustic models to the target language
(Diagram: the same Portuguese resource example as before; pronunciation rules → Lex, text data → LM.)

Rapid Language Adaptation
Model mapping to the target language:
1) Map the multilingual phonemes to Portuguese ones based on the IPA scheme
2) Copy the corresponding acoustic models in order to initialize the Portuguese models
Problem: Contexts are language specific; how can context-dependent models be applied to a new target language?
Solution: Adaptation of the multilingual contexts to the target language based on limited training data
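A schematic of the initialization step: each target-language phoneme borrows the multilingual model that shares its IPA symbol. All names and mappings below are illustrative placeholders, and the subsequent context adaptation (PDTS) is not shown:

```python
# Initialize Portuguese acoustic models by copying the multilingual model of the
# matching IPA class (illustrative placeholders only).
ml_models = {"m": "ML_model_m", "i": "ML_model_i", "u": "ML_model_u"}   # one model per IPA class
pt_to_ipa = {"M": "m", "I": "i", "U": "u"}                              # Portuguese phoneme -> IPA symbol

def initialize_target_models(target_to_ipa, multilingual_models):
    """Copy the multilingual model whose IPA class matches each target phoneme."""
    return {phone: multilingual_models[ipa]
            for phone, ipa in target_to_ipa.items()
            if ipa in multilingual_models}

pt_models = initialize_target_models(pt_to_ipa, ml_models)
print(pt_models)   # {'M': 'ML_model_m', 'I': 'ML_model_i', 'U': 'ML_model_u'}
```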

Language Adaptation Experiments
(Results chart; conditions compared: Tree, ML-Tree + Po-Tree, PDTS.)

Summary
- Multilingual database suitable for MLVCSR
  - Covers the most widespread languages
  - Language-dependent recognition in 10 languages
- Language-independent acoustic modeling
  - Global phoneme set that covers 10 languages
  - Data sharing through multilingual models
- Language-adaptive speech recognition
  - Limited amount of language-specific data
Create speech engines in new target languages using only limited data; save time and money

Selected Publications
- Tanja Schultz and Alex Waibel: Language Independent and Language Adaptive Acoustic Modeling. In: Speech Communication, to appear 2001.
- Alex Waibel, Petra Geutner, Laura Mayfield-Tomokiyo, Tanja Schultz, and Monika Woszczyna: Multilinguality in Speech and Spoken Language Systems. In: Proceedings of the IEEE, Special Issue on Spoken Language Processing, 88(8), pp. 1297-1313, August 2000.
- Tanja Schultz and Alex Waibel: Polyphone Decision Tree Specialization for Language Adaptation. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2000), Istanbul, Turkey, June 2000.
- Download from http://www.cs.cmu.edu/~tanja