
What I did on my Summer “Vacation”
Jeremy Morris
10/06/2006
Summer at AFRL - DAGSI
• AFRL
  • Air Force Research Labs
  • Wright-Patterson AFB, Dayton OH
• DAGSI Student/Faculty Research Fellowship program
  • Dayton Area Graduate Studies Institute
  • Effort to encourage collaboration between Ohio universities and AFRL
Summer at AFRL – SCREAM Lab
• SCREAM Lab
  • Speech and Communication Research, Engineering, Analysis and Modeling Lab
  • Interest in a wide variety of speech research issues for the military
    • Speech-to-speech translation, rapid development of speech recognition systems, etc.
Summer at AFRL – Why us?
• SCREAM Lab members were interested in collaborating with OSU
• SCREAM Lab is working on research using phonological features in speech recognition
  • Perceived overlap with the ASAT project
Review – Phonological Features
• For the ASAT Project, we have been using phonological feature detectors
• We train detectors on a particular phonological feature
  • e.g. manner or place for consonants; height, frontness, etc. for vowels
• We then combine these features together for ASR purposes
Phonological Features (cont.)
• SCREAM Lab very interested in phonological feature detectors
  • Need for quick development of new ASR systems for new languages
  • A full set of phonological feature detectors would allow reuse of acoustic data for training across new languages
• Multi-lingual detectors are clearly needed to get full coverage of all features
Phonological Features (cont.)
• Our phonological feature detectors
  • Monolingual (English only)
  • Trained using a set of multi-layer perceptron neural networks
  • Output a set of phonological feature class probabilities
• SCREAM Lab feature detectors
  • Monolingual and multilingual
  • Trained using Gaussian Mixture Models
  • Output a set of likelihoods
  • Based on work by Tanja Schultz (CMU)
Summer at AFRL - Proposal
• Besides acoustic models, new ASR systems for new languages have other needs
• An ASR system needs a lexicon mapping phones to words
  • Normally hand-constructed
  • Requires time and expertise
Summer at AFRL - Proposal
• Our proposal: look at methods of bootstrapping new lexicons from:
  • Acoustic data
  • Word-level transcripts
  • Phonological feature detector outputs
• How?
  • Start by looking at work on deriving Acoustic Sub-Word Units
Summer at AFRL - Proposal
• Acoustic Sub-Word Units (ASWUs)
  • Similar to phones in that they are smaller pieces of words
  • BUT – automatically derived from acoustics instead of manually defined
  • Used to derive both a sub-word unit set and a lexicon for that set simultaneously
  • Research in this area has mainly aimed at improving ASR performance
Summer at AFRL - Proposal
• Can we use these methods, along with phonological features as inputs, to induce new lexicons?
  • Using phonological features, the sub-word units may be mappable to standard IPA phone labels
Summer at AFRL - Proposal
• The proposed system is inspired by the ASWU approach of Singh et al. (2002)
  • Notable for not requiring word boundaries to be marked for training
• Start with a basic dictionary (including a starting phoneset size)
• Train a set of acoustic models on the training data with that dictionary
• Alter the basic dictionary in a manner that improves the pronunciations
• Repeat until a stopping criterion is reached
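A rough sketch of how that loop could be organized is below. This is not the actual system's code; every helper here (train_models, pick_candidate_phone, propose_alterations, force_align, rebuild_lexicon) is a hypothetical placeholder for the step named in the corresponding bullet above.

```python
# Sketch of the Singh-et-al.-style refinement loop described above.
# All helper functions are hypothetical placeholders, not the real system.

def induce_lexicon(word_list, feature_frames, transcripts, max_iters=20):
    # Step 1: basic dictionary -- each word starts as its letter sequence
    lexicon = {w: [list(w)] for w in word_list}

    for _ in range(max_iters):
        # Step 2: train acoustic models on feature-detector outputs
        models = train_models(lexicon, feature_frames, transcripts)

        # Step 3: choose a candidate phone label and an action (split/delete)
        phone, action = pick_candidate_phone(models, lexicon)
        if phone is None:  # stopping criterion reached
            break

        # Step 4: expand lexicon/HMM with alterations, force-align,
        # and keep the best-scoring pronunciations for the next pass
        candidates = propose_alterations(lexicon, phone, action)
        alignments = force_align(models, candidates, feature_frames, transcripts)
        lexicon = rebuild_lexicon(alignments)

    return lexicon
```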
Summer at AFRL - Proposal
• Start with a basic dictionary
  • Start with an assumption that the number of phones in a word is related to the number of letters in the orthography
• The basic dictionary maps each word to the sequence of letters in that word:
  ABLE → A B L E
  BANNED → B A N N E D
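A minimal sketch of that starting point, assuming nothing more than the letters-as-phones idea from this slide (the function and variable names are illustrative only):

```python
# Basic dictionary: each word is "pronounced" as its own letter sequence,
# so the starting phoneset is just the alphabet of the orthography.
def basic_dictionary(words):
    return {w.upper(): list(w.upper()) for w in words}

print(basic_dictionary(["able", "banned"]))
# {'ABLE': ['A', 'B', 'L', 'E'], 'BANNED': ['B', 'A', 'N', 'N', 'E', 'D']}
```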
Summer at AFRL - Proposal
• Train a set of acoustic models
  • Using the basic dictionary, map words in the transcript to these “pronunciations”
  • Train an HMM model using the output of the feature detectors as its input, and the above mapping as training labels
Summer at AFRL - Proposal
• Alter the basic dictionary
  • Using some metric, find a candidate “phone” to be modified
    • We’ve looked at a couple of metrics – more on this later
  • Once the phone is identified, see if the phone should be “split” or “deleted”
    • A “split” indicates that the given phone label actually represents two different sounds, and so should be replaced with two different phone labels
    • A “delete” indicates that for a particular word or words the model fits better if that phone label is removed from the pronunciation
Summer at AFRL - Proposal
• Split example (phone E split into E and E1):
  BE → B E
  DEVELOP → D E1 V E1 L O P
• Delete examples (phone E removed from a pronunciation):
  ABLE: A B L E → A B L
  ABANDONED: A B A N D O N E D → A B A N D O N D
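One plausible way to enumerate the candidate pronunciations for these two operations is sketched below. It assumes a split offers both the old and the new label at every occurrence of the phone, and a delete removes one occurrence at a time; the actual system's enumeration may differ.

```python
from itertools import product

def split_candidates(pron, phone, new_label):
    # Every pronunciation where each occurrence of `phone` either keeps its
    # old label or takes the new one (e.g. E -> E or E1).
    choices = [[p, new_label] if p == phone else [p] for p in pron]
    return [list(c) for c in product(*choices)]

def delete_candidates(pron, phone):
    # Every pronunciation with exactly one occurrence of `phone` removed.
    return [pron[:i] + pron[i + 1:] for i, p in enumerate(pron) if p == phone]

print(split_candidates(["D", "E", "V", "E", "L", "O", "P"], "E", "E1"))
# includes ['D', 'E1', 'V', 'E1', 'L', 'O', 'P'] among the alternatives
print(delete_candidates(["A", "B", "L", "E"], "E"))
# [['A', 'B', 'L']]
```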
Summer at AFRL - Proposal
• For splits, all possible alterations are added to a temporary lexicon
• For deletes, we alter the HMM to add a possible deletion arc for the phone
• After the lexicon or HMM is altered, the word transcript is force-aligned using the new possible pronunciations
  • The best pronunciations are pulled from this alignment and used to build a new lexicon
  • The steps are repeated using the new lexicon in place of the basic lexicon
Summer at AFRL - Proposal
• How do we determine the candidate “phone label” to alter?
  • Initially, modelled each phone with two Gaussians in the HMM
  • Compared the two Gaussians to each other using their KL divergence
    • Took the phone label with the largest KL divergence as the one to alter
    • The idea: each Gaussian describes a cluster – the further the cluster centers are from each other, the more probable it is that they describe two different phones
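For reference, the KL divergence between two Gaussians has a closed form; a sketch for the diagonal-covariance case is below. Whether the original system used exactly this form, or a symmetrized version, is an assumption on my part.

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    # KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ), closed form
    mu0, var0, mu1, var1 = map(np.asarray, (mu0, var0, mu1, var1))
    return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def symmetric_kl(mu0, var0, mu1, var1):
    # KL is asymmetric; a symmetrized sum is one common way to rank phone labels
    return kl_diag_gauss(mu0, var0, mu1, var1) + kl_diag_gauss(mu1, var1, mu0, var0)

# The phone whose two Gaussians give the largest (symmetric) KL divergence
# would be taken as the next split/delete candidate.
```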
Summer at AFRL - Proposal
• KL-divergence metric did not work well
  • System would pick candidates that a human would find unreasonable (such as “F” or “Q”)
  • System would split or delete these phones multiple times, continually returning to the same phone label
Summer at AFRL - Proposal
• Why did the KL divergence perform this way?
  • Suspicion: large variations between the two Gaussians in areas that do not matter for that phone pushed up the scores (e.g. vowel features for consonants)
  • Splitting these phones only allowed the coverage to spread wider, drawing the system back to those phones
Summer at AFRL - Proposal
• What next?
• Tried a Mahalanobis distance metric, also with poor results
• Returned to the Acoustic Sub-Word papers for inspiration
  • Instead of looking at cluster statistics, multiple papers use an average frame likelihood metric for each phone cluster to determine the candidate phone for altering
  • Have started moving my code to use this framework – preliminary passes show promise, but no results quite yet
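A rough sketch of the average-frame-likelihood idea follows, assuming a single diagonal-covariance Gaussian per phone cluster and a frame-to-phone assignment from a forced alignment. The names and the single-Gaussian assumption are mine, not taken from the papers or the actual code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def avg_frame_loglik(frames_by_phone, models):
    """Average per-frame log likelihood of each phone under its own model.

    frames_by_phone: dict phone -> (n_frames, dim) array of feature-detector outputs
    models:          dict phone -> (mean, diag_var) for that phone's Gaussian
    """
    scores = {}
    for phone, frames in frames_by_phone.items():
        mean, diag_var = models[phone]
        scores[phone] = multivariate_normal.logpdf(frames, mean, np.diag(diag_var)).mean()
    return scores

# The phone with the lowest average log likelihood fits its own data worst,
# making it a natural candidate for the next split or delete.
```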
Conclusion – It’s 75 miles to Dayton
• Advice for those thinking of doing work at WPAFB
  • Working in the SCREAM Lab was great
    • Hundreds of processors, tons of multi-lingual corpora
    • Friendly people, decent work environment (if a bit dark)
  • Many hoops to jump through, even just for a summer student
    • ID badges, computer usage training, etc.
  • Sometimes feels like you’re working at a corporation…
    • until the guys in uniform come around
  • The base is built like a campus crossed with a prison
    • Cinderblock is the building material of choice
  • Don’t forget your ID badge
    • It’s 75 miles from Columbus to Dayton