ASR Workshop, Paris, France, 18-20 Sept 2000



Number of slides: 23

Towards Superhuman Speech Recognition
Mukund Padmanabhan and Michael Picheny
Human Language Technologies Group, IBM Thomas J. Watson Research Center
Special thanks to: Stan Chen, Satya Dharanipragada, Geoff Zweig and members of the Telephony Speech Algorithms Group

Common UI Folklore
• “Except when interacting with video games, a user does not take very well to surprises” (Human-Computer Interaction; Dix, Finlay, Abowd and Beale)
• “Golden Rule #3: Make the interface consistent” (Elements of User Interface Design; Mandel)
• “Computer users usually seek predictable responses and are discouraged if they must engage in clarification dialogs frequently” (Designing the User Interface; Shneiderman)

Speech Recognition Progress

Human Performance (Lippmann, 1997)

Problem Categorization
Dictation (WSJ) | Broadcast News | DARPA Communicator | SWB | Voicemail | Meetings
Well formed | Varied, primarily well formed | Spontaneous
Computer | Audience | Computer | Person | People
Full BW | Mixed, primarily full BW | Telephone BW | Far-field
7% | 12% | 16% | 20-30% | 55% | 30%

Domain Dependence
Training data:   Transaction   Switchboard   Voicemail
YP               4.39          6.44          8.55
Digits           1.34          1.86          2.36
Switchboard      -             39            57
Voicemail        -             47            36

Observations
1. Spontaneous speech: largest effect on WER (Switchboard, Voicemail, Meetings, real-world speech)
2. Multi-environment speech sources (16K, 8K, far-field microphone, noisy, ...)
3. Multi-domain speech sources (dictation, travel, call center, small vocab, broadcast news)
4. Domain dependence of performance
Objective: Develop a speech recognition system that mimics human performance (independent of environment and domain, works as well for spontaneous as for carefully enunciated speech)
Focus areas
Improve spontaneous speech models
1. Articulatory modeling
2. Prosodic features
3. Segmental graphical models
4. Joint parameter estimation
5. Speaker separation for multi-speaker speech
6. Data collection for "meeting speech"
Multi-environment
1. Non-linear feature space transformation
2. Hidden observations
Multi-domain
1. Multistyle training
2. Domain-independent LM


• 30% improvement
• No initial decoding


A Language Model that Works Well on Many Domains
• Different (static) language models work best on different domains
• Use dynamic adaptation to make a generic LM act like a domain-specific LM
  – Generic LM: linear interpolation of a collection of domain-specific LMs (SWB, BN, digit/date grammar, etc.)
  – Adapt by dynamically adjusting the interpolation weights
• Want to be able to adapt quickly
  – At the word/sentence level, not at the document level
Example: “Um, yeah. Well, anyway, I’ll be arriving at four twenty two p.m. on flight fifty six. Say hi to mom. Oh, and don’t forget to buy IBM at one forty-four.”
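The generic LM on this slide is a weighted sum of domain-specific LMs. The following is only a minimal sketch of that linear interpolation, assuming toy unigram models; the domain names, probabilities, `FLOOR` value, and function names are hypothetical and are not taken from the talk.

```python
import math
from typing import Dict, List

LM = Dict[str, float]
FLOOR = 1e-6  # assumed probability for words missing from a toy LM

# Hypothetical toy unigram LMs, for illustration only.
TOY_LMS: Dict[str, LM] = {
    "conversational": {"um": 0.5, "yeah": 0.3, "flight": 0.1, "buy": 0.1},
    "travel": {"um": 0.1, "yeah": 0.1, "flight": 0.7, "buy": 0.1},
    "finance": {"um": 0.1, "yeah": 0.1, "flight": 0.1, "buy": 0.7},
}


def interpolated_prob(word: str, weights: Dict[str, float],
                      lms: Dict[str, LM] = TOY_LMS) -> float:
    """P(word) under the generic LM: sum over domains of weight * P_domain(word)."""
    return sum(weights[d] * lms[d].get(word, FLOOR) for d in lms)


def sentence_log_prob(words: List[str], weights: Dict[str, float],
                      lms: Dict[str, LM] = TOY_LMS) -> float:
    """Log-probability of a word sequence under the interpolated unigram LM."""
    return sum(math.log(interpolated_prob(w, weights, lms)) for w in words)


uniform = {d: 1.0 / len(TOY_LMS) for d in TOY_LMS}
travel_heavy = {"conversational": 0.1, "travel": 0.8, "finance": 0.1}
```

With these toy numbers, shifting weight toward the travel LM raises the score of travel-like words, which is the effect dynamic adaptation of the interpolation weights is meant to achieve.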

Adapting Language Model Interpolation Weights
• Simply re-estimate the weights to maximize the likelihood of the adaptation data (like dynamic deleted interpolation)
  – Can be quite slow, because a lot of evidence has to be accumulated
• Add a hidden variable to the model that tracks which domain LM is currently being used (Bayesian adaptation)
  – Rate of adaptation can be fast, can depend on context, and can be trained on domain-labelled data
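The two strategies on this slide can be contrasted with a small sketch. It is not the authors' implementation: the toy unigram LMs, the `stay` probability, and all function names are assumptions. `em_reestimate` uses standard EM updates for mixture weights (maximum likelihood on the adaptation data), and `track_domain` uses a simple forward filter over a hidden domain variable that keeps its value with probability `stay` and otherwise resets to the prior.

```python
from typing import Dict, List

LM = Dict[str, float]
FLOOR = 1e-6  # assumed probability for words missing from a toy LM

# Hypothetical toy unigram LMs, for illustration only.
TOY_LMS: Dict[str, LM] = {
    "conversational": {"um": 0.5, "yeah": 0.3, "flight": 0.1, "buy": 0.1},
    "travel": {"um": 0.1, "yeah": 0.1, "flight": 0.7, "buy": 0.1},
    "finance": {"um": 0.1, "yeah": 0.1, "flight": 0.1, "buy": 0.7},
}


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative scores so that they sum to one."""
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


def em_reestimate(words: List[str], weights: Dict[str, float],
                  lms: Dict[str, LM] = TOY_LMS,
                  iterations: int = 10) -> Dict[str, float]:
    """Maximum-likelihood interpolation weights via EM on adaptation data."""
    if not words:
        return dict(weights)
    for _ in range(iterations):
        counts = {d: 0.0 for d in lms}
        for w in words:
            # E-step: posterior over which domain LM generated this word.
            post = normalize({d: weights[d] * lms[d].get(w, FLOOR) for d in lms})
            for d in lms:
                counts[d] += post[d]
        # M-step: new weights are the averaged posteriors.
        weights = normalize(counts)
    return weights


def track_domain(words: List[str], prior: Dict[str, float],
                 lms: Dict[str, LM] = TOY_LMS,
                 stay: float = 0.8) -> List[Dict[str, float]]:
    """Word-by-word posterior over the hidden 'current domain' variable."""
    belief = dict(prior)
    history = []
    for w in words:
        # Predict: the domain persists with probability `stay`, else resets to the prior.
        predicted = {d: stay * belief[d] + (1.0 - stay) * prior[d] for d in lms}
        # Update: reweight by how well each domain LM explains the observed word.
        belief = normalize({d: predicted[d] * lms[d].get(w, FLOOR) for d in lms})
        history.append(belief)
    return history


uniform = {d: 1.0 / len(TOY_LMS) for d in TOY_LMS}
em_weights = em_reestimate(["flight", "flight", "um"], uniform)
beliefs = track_domain(["um", "flight", "buy"], uniform)
```

In this sketch the per-word beliefs from `track_domain` would serve as interpolation weights for the next word, so the mixture can shift after a single word, whereas `em_reestimate` produces one set of weights for the whole adaptation sample.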


Other Factors Driving Progress

What Types of Data Do We Need?
Condition: Total Amount
  Targets: • 5000 hours speech • 10 GB LM data
  Currently Available (U.S.): • 1000 hours speech • 1 GB LM data
Condition: Styles
  Targets: • Imperatives • Queries • Fluent conversation • Declamatories
  Currently Available (U.S.): • C&C tasks • ATIS/DC • SWB/BN/Meetings • WSJ/Voicemail/BN
Condition: Environments
  Targets: • High bandwidth/High SNR • Low SNR
  Currently Available (U.S.): • WSJ/BN • SWB/Voicemail • Meetings
Condition: Domains
  Targets: • Low perplexity • Medium perplexity • High perplexity
  Currently Available (U.S.): • Digits, spelling • DC/ATIS • SWB/VM/WSJ/BN

Some Concrete Suggestions
Target: 5000 hours of transcribed spontaneous speech
2000 hours/year
50000 hours/year (25)
5000 hours of speech: cost ~ $1M
Sources of new data:
• Supergirl, by David Odell
• Superman: The Motion Picture, by Mario Puzo
• Superman II, directed by Richard Donner
• Superman II, directed by Richard Lester
• Superman II
• Superman IV: The Quest for Peace, by Christopher Reeve
• Superman: The Man Of Steel, by Alex Ford & J Ellison
• Superman Lives, by Kevin Smith
• Superman Lives, by Dan Gilroy
Document: Script - Revised Screenplay Word Early Draft Script Shooting Script - Early Version Script Later Version Shooting Script - Unproduced Script synopsis Unproduced
Test data: mixture of current and new sources
• Switchboard, Voicemail, BN, DC, OGI
• SPEECON, Meetings

Conclusions
• Speech recognition performance is not adequate
• Human performance figures suggest that we still have enormous room for improvement
• Presented several new algorithms to attack the problem aggressively
• Suggested a training and test methodology to drive research
• Communal participation is critical to push ahead