- Number of slides: 35
Can Advances in Speech Recognition Make Spoken Language as Convenient and as Accessible as Online Text?
Joseph Picone, Ph.D., Professor, Electrical Engineering, Mississippi State University
Patti Price, Ph.D., VP Business Development, BravoBrava
Outline
• Introduction and state of the art (Price)
• Research issues (Picone)
  – Evaluation metrics
  – Acoustic modeling
  – Language modeling
  – Practical issues
  – Technology demands
• Conclusion and future directions (Price)
Introduction: What Is Speech Recognition?
Goal: automatically extract the string of words spoken from the speech signal.
[Diagram: Speech Signal → Speech Recognition → Words (“How are you?”)]
• Speech recognition does NOT determine:
  – Who the talker is (speaker recognition, Heck and Reynolds)
  – Speech output (speech synthesis, Fruchterman and Ostendorf)
  – What the words mean (speech understanding)
Introduction: Speech in the Information Age
• Speech and text were revolutionary because of information access.
• New media and connectivity yield information overload.
• Can speech technology help?
[Table, ordered by time: source of information and how it is accessed]
• Source: speech. Access: listen, remember.
• Source: text. Access: read books.
• Source: film, video, multimedia, voice mail, radio, television, conferences, web, on-line resources. Access: computer (typing; careful spoken or written input; conversational language).
State of the Art (1997): Initial and Current Applications
• Command and control
  – Manufacturing
  – Consumer products: http://www.speech.be.philips.com/
• Database query
  – Resource management
  – Flight information
  – Stock quote
  – Nuance, American Airlines: 1-800-433-7300, touch 1
• Dictation
  – http://www.dragonsys.com
  – http://www-4.ibm.com/software/speech
State of the Art: How Do You Measure?
• What benchmarks?
• Have other systems used the same one?
• What was the training set?
• What was the test set?
• Were training and test independent?
• How large was the vocabulary and the sample size?
• What speakers?
• What style of speech?
• What kind of noise?
State of the Art: Factors that Affect Performance
[Figure: four dimensions of task difficulty, with progress marked at 1985, 1995, 2000, and 2005; each dimension is listed from easiest to hardest]
• Noise environment: quiet room, fixed high-quality mic; normal office, various microphones, telephone; vehicle noise, radio, cell phones; wherever speech occurs
• User population: speaker-dependent; speaker-independent and adaptive; native speakers and competent foreign speakers, regional accents; all speakers of the language, including foreign
• Speech style: careful reading; planned speech; natural human-machine dialog (user can adapt); all styles, including human-human (unaware)
• Complexity: application-specific speech and language, expert years to create an app-specific language model; some application-specific data and one engineer year; application-independent or adaptive
Evaluation Metrics: Evolution
[Figure: word error rate (0% to 40%) versus level of difficulty for these tasks: Command and Control, Letters and Numbers, Continuous Digits, Read Speech, Broadcast News, Conversational Speech]
• Spontaneous telephone speech is still a “grand challenge”.
• Telephone-quality speech is still central to the problem.
• Vision for speech technology continues to evolve.
• Broadcast news is a very dynamic domain.
Evaluation Metrics: Human Performance
[Figure: word error rate (0% to 20%) versus speech-to-noise ratio (10 dB, 16 dB, 22 dB, quiet) on Wall Street Journal with additive noise, comparing machines with human listeners (committee)]
• Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
• On some tasks, such as credit card number recognition, machine performance exceeds human performance because of limits on human memory retrieval capacity.
• The nature of the noise is as important as the SNR (e.g., cellular phones).
• A primary failure mode for humans is inattention.
• A second major failure mode is the lack of familiarity with the domain (i.e., business terms and corporation names).
Evaluation Metrics: Machine Performance
[Figure: word error rate (axis marks at 1%, 10%, and 100%) by evaluation year, 1988 to 2002; curves labeled Read Speech (1k, 5k, 20k), Spontaneous Speech, Conversational Speech, Broadcast Speech, Varied Microphones, Noisy, and (Foreign); annotation: 10X]
• Common evaluations fuel technology development.
• Tasks become progressively more ambitious and challenging.
• A Word Error Rate (WER) below 10% is considered acceptable.
• Performance in the field is typically 2x to 4x worse than performance on an evaluation.
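The evaluation slides report results as word error rate. As a rough illustration, WER is commonly computed as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below assumes whitespace tokenization and unit costs for every error type; the function name `word_error_rate` is hypothetical, and official scoring tools apply additional text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("are" -> "were") and one deletion ("today") over 4 reference words.
print(word_error_rate("how are you today", "how were you"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.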
Evaluation Metrics: Beyond WER: Named Entity
• Information extraction is the analysis of natural language to collect information about specified types of entities.
[Figure: F-measure, axis up to 100%]
• An example of named entity annotation: Mr.
Recognition Architectures: Why Is Speech Recognition So Difficult?
[Figure: scatter plot of Feature No. 1 versus Feature No. 2 for phones Ph_1, Ph_2, Ph_3]
• Comparison of “aa” in “lOck” vs. “iy” in “bEAt” for conversational speech (SWB)
• Our measurements of the signal are ambiguous.
• Region of overlap represents classification errors.
• Reduce overlap by introducing acoustic and linguistic context (e.g., context-dependent phones).
Recognition Architectures: A Communication Theoretic Approach
[Diagram: Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel; observables: Message, Words, Sounds, Features]
Bayesian formulation for speech recognition:
• P(W|A) = P(A|W) P(W) / P(A)
Objective: minimize the word error rate
Approach: maximize P(W|A) during training
Components:
• P(A|W): acoustic model (hidden Markov models, mixtures)
• P(W): language model (statistical, finite state networks, etc.)
The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
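The Bayesian formulation above can be illustrated with a toy decoder. Since P(A) is the same for every candidate word sequence, choosing the W that maximizes P(W|A) reduces to maximizing log P(A|W) + log P(W). This is only a sketch: the function name `decode`, the candidate sentences, and all scores are made-up values, not output from a real recognizer.

```python
import math


def decode(acoustic_log_likelihood, language_model_prob):
    """Pick the word sequence W that maximizes log P(A|W) + log P(W).

    P(A) is omitted because it does not depend on W.
    """
    best_words, best_score = None, -math.inf
    for words, log_p_a_given_w in acoustic_log_likelihood.items():
        p_w = language_model_prob.get(words, 0.0)
        if p_w <= 0.0:
            continue  # the language model rules this sequence out
        score = log_p_a_given_w + math.log(p_w)
        if score > best_score:
            best_words, best_score = words, score
    return best_words, best_score


# Hypothetical scores: the acoustics slightly prefer the second candidate,
# but the language model strongly prefers the first.
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
language = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}
print(decode(acoustic, language)[0])
```

Working in the log domain avoids numerical underflow when many small probabilities are multiplied.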
Recognition Architectures: Incorporating Multiple Knowledge Sources
[Diagram: Input Speech → Acoustic Front-end → Search → Recognized Utterance; Acoustic Models P(A|W) and Language Model P(W) feed the Search block]
• The signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
• Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure.
• The language model predicts the next set of words, and controls which models are hypothesized.
• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.
Acoustic Modeling: Feature Extraction
[Diagram blocks: Input Speech, Fourier Transform, Cepstral Analysis, Perceptual Weighting, Time Derivative; outputs: Energy + Mel-Spaced Cepstrum, Delta Energy + Delta Cepstrum, Delta-Delta Energy + Delta-Delta Cepstrum]
• Measure features 100 times per second.
• Incorporate knowledge of the nature of speech sounds in measurement of the features.
• Utilize rudimentary models of human perception.
• Use a 25 msec window for frequency domain analysis.
• Include absolute energy and 12 spectral measurements.
• Use time derivatives to model spectral change.
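The framing and time-derivative steps can be sketched in a few lines. The example below cuts a signal into 25 msec windows every 10 msec (100 frames per second), computes a log energy per frame, and takes a simple delta. It is a simplified illustration: the 8 kHz sample rate, the function names, and the two-point delta are assumptions, and the cepstral analysis itself is not shown. Practical front ends typically compute deltas by regression over several frames.

```python
import math


def frame_signal(samples, sample_rate, window_ms=25.0, step_ms=10.0):
    """Cut the signal into overlapping analysis windows."""
    window = int(sample_rate * window_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, step)]


def log_energy(frame):
    """Log of the frame energy; the small constant guards against log(0)."""
    return math.log(sum(s * s for s in frame) + 1e-10)


def deltas(values):
    """Two-point time derivative, reusing the edge values at the boundaries."""
    last = len(values) - 1
    return [(values[min(t + 1, last)] - values[max(t - 1, 0)]) / 2.0
            for t in range(len(values))]


# One second of a 440 Hz tone at an assumed 8 kHz sample rate.
rate = 8000
signal = [math.sin(2.0 * math.pi * 440.0 * n / rate) for n in range(rate)]
frames = frame_signal(signal, rate)
energies = [log_energy(f) for f in frames]
delta_energies = deltas(energies)              # delta energy
delta_delta_energies = deltas(delta_energies)  # delta-delta energy
print(len(frames), len(frames[0]))
```

The same `deltas` function would be applied to each cepstral coefficient track to build the delta and delta-delta features named in the diagram.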
Acoustic Modeling: Hidden Markov Models
• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures.
• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
• Sharing model parameters is a common strategy to reduce complexity.
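A left-to-right topology can be made concrete with a small forward-algorithm sketch. It scores a sequence of scalar features against a model in which each state can only repeat or advance to the next state. Several simplifications are assumed here: one 1-D Gaussian per state instead of a mixture, a single self-loop probability shared by all states, no skip transitions, and hypothetical function names.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def forward_likelihood(observations, means, variances, self_loop=0.6):
    """Likelihood of the observations under a left-to-right HMM.

    The model starts in state 0 and must finish in the last state.
    Each state either repeats (self_loop) or advances by one (1 - self_loop).
    """
    if not observations:
        raise ValueError("need at least one observation")
    n = len(means)
    alpha = [0.0] * n
    alpha[0] = gaussian_pdf(observations[0], means[0], variances[0])
    for x in observations[1:]:
        new_alpha = [0.0] * n
        for s in range(n):
            stay = alpha[s] * self_loop
            advance = alpha[s - 1] * (1.0 - self_loop) if s > 0 else 0.0
            new_alpha[s] = (stay + advance) * gaussian_pdf(x, means[s], variances[s])
        alpha = new_alpha
    return alpha[-1]


# Hypothetical 3-state model whose state means rise over time.
means, variances = [0.0, 5.0, 10.0], [1.0, 1.0, 1.0]
rising = [0.0, 0.1, 5.0, 5.1, 10.0]
print(forward_likelihood(rising, means, variances) >
      forward_likelihood(rising[::-1], means, variances))
```

For long utterances the same recursion is normally carried out with log probabilities or per-frame scaling to avoid underflow.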
Acoustic Modeling: Parameter Estimation
[Flowchart: Initialization → Single Gaussian Estimation → 2-Way Split → Mixture Distribution Reestimation → 4-Way Split → Reestimation → ...]
• Closed-loop data-driven modeling supervised only from a word-level transcription.
• The expectation/maximization (EM) algorithm is used to improve our parameter estimates.
• Computationally efficient training algorithms (Forward-Backward) have been crucial.
• Batch mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter-sharing, system complexity, and the use of additional linguistic knowledge.
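The split-and-reestimate loop in the flowchart can be sketched for a 1-D Gaussian mixture. Starting from a single Gaussian, each component is split into two with slightly shifted means, and EM passes then reestimate the parameters in batch mode. This is a toy under stated assumptions: the perturbation size (0.2 standard deviations), the variance floor, the data, and the function names are all illustrative choices, and the HMM state alignment that a real trainer obtains from Forward-Backward is ignored.

```python
import math


def split_mixture(components, perturbation=0.2):
    """Double the mixture: each (weight, mean, var) becomes two shifted copies."""
    doubled = []
    for weight, mean, var in components:
        offset = perturbation * math.sqrt(var)
        doubled.append((weight / 2.0, mean - offset, var))
        doubled.append((weight / 2.0, mean + offset, var))
    return doubled


def em_step(data, components, var_floor=1e-3):
    """One batch EM reestimation pass for a 1-D Gaussian mixture."""
    def pdf(x, mean, var):
        return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

    # E-step: posterior probability of each component for each data point.
    posteriors = []
    for x in data:
        joint = [w * pdf(x, m, v) for w, m, v in components]
        total = sum(joint) or 1e-300
        posteriors.append([j / total for j in joint])

    # M-step: update weight, mean, and variance from the accumulated posteriors.
    updated = []
    for k in range(len(components)):
        occupancy = sum(p[k] for p in posteriors)
        if occupancy == 0.0:
            updated.append(components[k])
            continue
        mean = sum(p[k] * x for p, x in zip(posteriors, data)) / occupancy
        var = sum(p[k] * (x - mean) ** 2 for p, x in zip(posteriors, data)) / occupancy
        updated.append((occupancy / len(data), mean, max(var, var_floor)))
    return updated


# Toy data with two clusters; start from a single Gaussian estimate.
data = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
mean = sum(data) / len(data)
var = sum((x - mean) ** 2 for x in data) / len(data)
mixture = split_mixture([(1.0, mean, var)])   # 2-way split
for _ in range(100):                          # reestimation passes
    mixture = em_step(data, mixture)
print(sorted(round(m, 2) for _, m, _ in mixture))
```

Calling `split_mixture` again on the result would give the 4-way split shown in the flowchart, followed by another round of reestimation.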
Language Modeling: Is A Lot Like Wheel of Fortune
Language Modeling: N-Grams: The Good, The Bad, and The Ugly
Unigrams (SWB):
• Most Common: “I”, “and”, “the”, “you”, “a”
• Rank-100: “she”, “an”, “going”
• Least Common: “Abraham”, “Alastair”, “Acura”
Bigrams (SWB):
• Most Common: “you know”, “yeah SENT!”, “!SENT um-hum”, “I think”
• Rank-100: “do it”, “that we”, “don’t think”
• Least Common: “raw fish”, “moisture content”, “Reagan Bush”
Trigrams (SWB):
• Most Common: “!SENT um-hum SENT!”, “a lot of”, “I don’t know”
• Rank-100: “it was a”, “you know that”
• Least Common: “you have parents”, “you seen Brooklyn”
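Rankings like these come from counting N-grams in a transcribed corpus. The sketch below counts N-grams with the sentence boundary markers used on the slide (!SENT at the start, SENT! at the end) and derives an unsmoothed bigram probability. The four-sentence corpus and the function names are invented for illustration; real language models add smoothing so that unseen N-grams do not get zero probability.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count N-grams, padding each sentence with boundary markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["!SENT"] + sentence.split() + ["SENT!"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def bigram_probability(sentences, previous, word):
    """Maximum-likelihood estimate of P(word | previous), with no smoothing."""
    bigrams = ngram_counts(sentences, 2)
    history_count = sum(c for (first, _), c in bigrams.items() if first == previous)
    if history_count == 0:
        return 0.0
    return bigrams[(previous, word)] / history_count


# Made-up toy corpus.
corpus = ["you know i think", "you know", "i think you know", "i know you"]
print(ngram_counts(corpus, 2).most_common(1))
print(bigram_probability(corpus, "you", "know"))
```

In the toy corpus "you" is followed by "know" three times and by the end marker once, so the estimate for "know" after "you" is 3/4.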
Language Modeling: Integration of Natural Language
• Natural language constraints can be easily incorporated.
• Lack of punctuation and search space size pose problems.
• Speech recognition typically produces a word-level time-aligned annotation.
• Time alignments for other levels of information are also available.
Implementation Issues: Search Is Resource Intensive
• Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
• Large speech databases are required (several hundred hours of speech).
• Tying, smoothing, and interpolation are required.
Implementation Issues: Dynamic Programming-Based Search
• Dynamic programming is used to find the most probable path through the network.
• Beam search is used to control resources.
• Search is time synchronous and left-to-right.
• Arbitrary amounts of silence must be permitted between each word.
• Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
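A time-synchronous Viterbi search with beam pruning can be sketched as follows. At each frame, every surviving hypothesis is extended, only the best-scoring path into each state is kept (the dynamic programming step), and hypotheses scoring more than `beam` below the frame's best are dropped. The three-state network, the discrete observation symbols, all probabilities, and the function name are hypothetical; a real decoder searches a far larger network built from acoustic models, a lexicon, and a language model.

```python
import math


def viterbi_beam(observations, states, log_start, log_trans, log_emit, beam=10.0):
    """Return (best_path, best_log_score) for a time-synchronous beam search."""
    # active maps each surviving state to (log score, path so far).
    active = {}
    for s in states:
        score = log_start.get(s, -math.inf) + log_emit[s].get(observations[0], -math.inf)
        if score > -math.inf:
            active[s] = (score, [s])
    for obs in observations[1:]:
        candidates = {}
        for prev, (prev_score, path) in active.items():
            for nxt, log_p in log_trans.get(prev, {}).items():
                score = prev_score + log_p + log_emit[nxt].get(obs, -math.inf)
                # Dynamic programming: keep only the best path into each state.
                if score > candidates.get(nxt, (-math.inf, None))[0]:
                    candidates[nxt] = (score, path + [nxt])
        if not candidates:
            return None, -math.inf
        # Beam pruning: drop hypotheses too far below this frame's best score.
        best = max(score for score, _ in candidates.values())
        active = {s: sp for s, sp in candidates.items() if sp[0] >= best - beam}
    if not active:
        return None, -math.inf
    final_state = max(active, key=lambda s: active[s][0])
    score, path = active[final_state]
    return path, score


def logs(probabilities):
    return {key: math.log(p) for key, p in probabilities.items()}


# Hypothetical network: silence can repeat, and words "y" and "n" return to silence.
states = ["sil", "y", "n"]
log_start = logs({"sil": 1.0})
log_trans = {"sil": logs({"sil": 0.5, "y": 0.25, "n": 0.25}),
             "y": logs({"y": 0.5, "sil": 0.5}),
             "n": logs({"n": 0.5, "sil": 0.5})}
log_emit = {"sil": logs({"_": 0.9, "a": 0.05, "b": 0.05}),
            "y": logs({"a": 0.8, "b": 0.1, "_": 0.1}),
            "n": logs({"b": 0.8, "a": 0.1, "_": 0.1})}
path, score = viterbi_beam(["_", "a", "a", "_"], states, log_start, log_trans, log_emit)
print(path)
```

A narrow beam saves time and memory but can prune the path that would have turned out best, so the beam width trades accuracy for resources.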
Implementation Issues: Cross-Word Decoding Is Expensive
• Cross-word decoding: since word boundaries don’t occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
• Cross-word decoding significantly increases memory requirements.
Implementation Issues: Decoding Example
Implementation Issues: Internet-Based Speech Recognition
Technology: Conversational Speech
• Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
• WER has decreased from 100% to 30% in six years.
[Figure labels: Laughter, Singing, Unintelligible, Spoonerism, Background Speech, No pauses, Restarts, Vocalized Noise, Coinage]
Technology: Audio Indexing of Broadcast News
Broadcast news offers some unique challenges:
• Lexicon: important information in infrequently occurring words
• Acoustic Modeling: variations in channel, particularly within the same segment (“in the studio” vs. “on location”)
• Language Model: must adapt (“Bush,” “Clinton,” “Bush,” “McCain,” “???”)
• Language: multilingual systems? language-independent acoustic modeling?
Technology: Real-Time Translation
• From President Clinton’s State of the Union address (January 27, 2000): “These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today’s fastest supercomputers.”
• Imagine a world where:
  – You book a travel reservation from your cellular phone while driving in your car without ever talking to a human (database query)
  – You converse with someone in a foreign country and neither speaks a common language (universal translator)
  – You place a call to your bank to inquire about your bank account and never have to remember a password (transparent telephony)
  – You can ask questions by voice and your Internet browser returns answers to your questions (intelligent query)
• Human Language Engineering: a sophisticated integration of many speech and language related technologies... a science for the next millennium.
Technology: Future Directions
[Timeline, 1960 to 2000: Analog Filter Banks, Dynamic Time-Warping, Hidden Markov Models]
What have we learned?
• Supervised training is a good machine learning technique
• Large databases are essential for the development of robust statistics
What are the challenges?
• Discrimination vs. representation
• Generalization vs. memorization
• Pronunciation modeling
• Human-centered language modeling
What are the algorithmic issues for the next decade?
• Better features by extracting articulatory information?
• Bayesian statistics? Bayesian networks?
• Decision trees? Information-theoretic measures?
• Nonlinear dynamics? Chaos?
Conclusion and Future Directions: Trends
• Speech as Access: What are the words?
• Speech as Source: What does it mean?
• Information as Partner: Here’s what you need.
We need new technology to help with information overload
• Speech information sources are everywhere
  – Voice mail messages
  – Professional talk
  – Lectures, broadcasts
• Speech sources of information will increase
  – As devices shrink
  – As mobility increases
  – New uses: annotation, documentation
Conclusion and Future Directions: Limitations on Applications
• Recognition performance, especially in error recovery
• Natural language understanding (speech differs from text)
  – Speech unfolds linearly in time
  – Speech is more indeterminate than text
  – Speech has different syntax and semantics
  – Prosody differs from punctuation
• Cost to develop applications (too few experts)
• Cost to integrate/interoperate with other technologies
• New capabilities
  – “When did he say Y and was he angry?”
  – Scanning, refocusing quickly (browsing)
  – Proactive information: match past pattern, find novel aspects
  – Gist, summarize, translate for different purposes
Conclusion and Future Directions: Applications on the Horizon
Beginnings of speech as source of information
• ISLIP: http://www.mediasite.net/info/frames.htm
• Virage: http://www.virage.com
Why speech technology in education and training
• Cliff Stoll, High Tech Heretic
  – Good schools need no computers
  – Bad schools won’t be improved by them
• Beulah Arnott: also true of indoor plumbing doesn’t belong in the classroom
• BravoBrava: Co-evolving technology and people can
  – Dramatically reduce the cost of delivery of content
  – Increase its timeliness, quality and appropriateness
  – Target needs of individual and/or group
  – Reading Pal demo
Reading Pal
[Screenshots: child reads; errors shown in red; child looks up the word “massive”; clicks ‘Listen’ to play back from “massive”; clicks ‘You’ to play it back]
Summary
Goal: Speech Better Than Text
Healthy loop between research and applications
• Research leads to applications, which lead to new research opportunities
We need collaboration
• Too much for one person, one site, one country
Humans will probably continue to be better than machines at many things
Can we learn to use technology and training to augment human-human and human-machine collaboration?
• We need to “micronize” education and training
It’s not a solved problem
• Further technology development is needed to enable the vision


