9723a42e7cffa3293ea0b0c18cb3ef7c.ppt
- Number of slides: 74
Introduction to Speech Technologies
James A. Larson, VP, Larson Technical Services
jim@larson-tech.com
© 2013 by Larson Technical Services
Outline
• Grammar-based Speech Recognition
• Statistical Language Model-based Recognition
• Speech Synthesis
• Dialog Management
• Natural Language Processing
Recognition Technology | Source | Target | Typical Technique
Touchtone Recognition (DTMF) | Caller presses touchtone buttons | Digits | Tone recognition
Automatic Speech Recognition (ASR) | Spoken language | Words | Hidden Markov Model, Neural Net, Table Lookup
Speaker Identification | Spoken language | Names of registered callers | Table lookup
Voice Activity Detection | Caller speaks or does not speak | “On” or “Off” | Attention word
Classification | Spoken language | Categories | Statistical analysis
Language Identification | Spoken language | National language names | Table lookup
Touchtone Recognition
• Caller responds to voice menus by pressing touchtone buttons on the telephone keypad
• Advantages
 – Highly accurate
• Disadvantages
 – Lost in space (caller loses track of the current position in the menu tree)
 – Time-consuming menus where the user must convert a choice to a digit
• Uses
 – Enter digits
 – Privacy
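The "tone recognition" step can be sketched with the Goertzel algorithm, a common way to measure signal power at a handful of known frequencies. The sketch below is an illustration, not the method used by any particular IVR platform: the function names, the 8000 Hz sample rate, and the 50 ms tone length are assumptions chosen for the example. Only the eight DTMF frequencies and the keypad layout are standard.

```python
import math

# Standard DTMF row and column frequencies in Hz, and the keypad they index.
ROW_FREQS = (697, 770, 852, 941)
COL_FREQS = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, sample_rate):
    """Return the signal power near `freq` using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def make_tone(key, sample_rate=8000, duration=0.05):
    """Synthesize the two-tone signal for one key (helper for the example)."""
    for r, row in enumerate(KEYPAD):
        if key in row:
            f_row, f_col = ROW_FREQS[r], COL_FREQS[row.index(key)]
            break
    else:
        raise ValueError(f"not a DTMF key: {key!r}")
    n = int(sample_rate * duration)
    return [
        math.sin(2 * math.pi * f_row * t / sample_rate)
        + math.sin(2 * math.pi * f_col * t / sample_rate)
        for t in range(n)
    ]


def detect_key(samples, sample_rate=8000):
    """Pick the strongest row and column frequency and map the pair to a key."""
    row = max(range(4), key=lambda i: goertzel_power(samples, ROW_FREQS[i], sample_rate))
    col = max(range(4), key=lambda i: goertzel_power(samples, COL_FREQS[i], sample_rate))
    return KEYPAD[row][col]
```

For example, `detect_key(make_tone("5"))` recovers the key from its synthesized tone pair. A production detector would also check signal level, tone duration, and the ratio between the two tones before accepting a key.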
Speech Recognition (ASR, STT)
• Advantages
 – User does not convert choices to a digit
• Disadvantages
 – Occasional failure to recognize what the user said
 – Time-consuming dialogs
• Users may interrupt prompts by “barge-in”
How Speech Recognition Works
Audio Input → Feature Extraction → Phoneme Identification → Word Identification → Words and Phrases
• Feature Extraction: digital signal processing of the audio signal
How Speech Recognition Works: Acoustic Model
• Phoneme Identification transforms features to phonemes using an acoustic model
• The acoustic model
 – Describes the sounds in a language
 – Is different for each language
 – May be speaker dependent (speaker must train the model)
 – May be speaker independent (pretrained)
 – Is usually supplied by the ASR vendor
How Speech Recognition Works: Language Model
• Word Identification transforms phonemes to words using a language model
• The language model contains the words in a language and their pronunciations
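The phoneme-to-word step can be illustrated with a toy lexicon lookup. This is a deliberately simplified sketch: real recognizers score many competing hypotheses probabilistically, whereas this example does a deterministic greedy longest match. The lexicon entries, phoneme symbols, and function name are assumptions made for the example.

```python
# Hypothetical pronunciation lexicon: phoneme sequence -> word.
LEXICON = {
    ("dh", "ah"): "the",
    ("r", "eh", "d"): "red",
    ("k", "aa", "r"): "car",
    ("d", "ao", "g"): "dog",
}
MAX_LEN = max(len(pron) for pron in LEXICON)


def phonemes_to_words(phonemes):
    """Greedily match the longest lexicon entry at each position."""
    words, i = [], 0
    while i < len(phonemes):
        for length in range(min(MAX_LEN, len(phonemes) - i), 0, -1):
            word = LEXICON.get(tuple(phonemes[i:i + length]))
            if word is not None:
                words.append(word)
                i += length
                break
        else:
            # No entry of any length starts here.
            raise ValueError(f"no lexicon entry at position {i}")
    return words
```

For example, `phonemes_to_words("dh ah r eh d k aa r".split())` yields the words "the", "red", "car".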
Speech Recognition
• Grammar-based
 – Developer specifies words to be recognized
• Statistical Language Models
 – Developer records and tags phrases
Grammar-based Speech Recognition
Audio Input → Feature Extraction → Phoneme Identification → Word Identification → Words and Phrases
• A Grammar Compiler turns a Context-free Grammar (CFG) into the Language Model (grammar and lexicon) used for Word Identification
Where Are Grammars Used?
• Interactive Voice Response (IVR) systems
 – Automated telephone agents
• Each step may use a different grammar
 – A grammar defines only the words which the user may speak during a step
 – Application developers specify grammars for each step
• The same grammar may be reused in multiple applications
Grammar with 3 Rules [slide image: SRGS grammar, type = "application/srgs+xml"]
Grammar Exercise
• Extend the grammar to include the combination of “color,” “size,” and “product,” where product may be “T-shirt” or “vest”
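One way to picture how a grammar constrains what a caller may say is to model the rules as a small table and enumerate the phrases they generate. The sketch below is written in Python rather than SRGS, and everything in it is an assumption for illustration: the rule names, the color and size words, and the helper functions. Only the product words “T-shirt” and “vest” come from the exercise.

```python
from itertools import product

# Hypothetical rule table: each rule maps to a list of alternatives,
# and each alternative is a sequence of rule names or literal words.
GRAMMAR = {
    "order": [["color", "size", "product"]],
    "color": [["red"], ["blue"]],          # assumed values
    "size": [["small"], ["large"]],        # assumed values
    "product": [["T-shirt"], ["vest"]],    # from the exercise
}


def expand(symbol):
    """Return every word sequence that `symbol` can produce."""
    if symbol not in GRAMMAR:  # a literal word
        return [[symbol]]
    phrases = []
    for alternative in GRAMMAR[symbol]:
        parts = [expand(s) for s in alternative]
        for combo in product(*parts):
            phrases.append([word for part in combo for word in part])
    return phrases


def accepts(utterance, root="order"):
    """True if the utterance is one of the phrases the grammar generates."""
    return utterance.split() in expand(root)
```

With these rules, `accepts("red small vest")` is true while `accepts("green small vest")` is false, because “green” is not in the grammar. Full enumeration only works for non-recursive grammars like this one; a recognizer compiles the grammar into a search network instead.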
XML and ABNF Grammar Formats
• XML format
 – Verbose
 – Validated by XML tools
• ABNF format
 – Terse
 – Familiar to compiler experts
 – Not validated by XML tools
Summary: Grammar-based Speech Recognition
• A wide variety of applications use speech recognition technologies.
• Speech grammars constrain the words that a user may speak during a single step of an automated conversation.
• Trained application developers create a grammar for each step of an automated conversation.
Statistical Language Model-based Recognition Technologies
• Call routing
• Speaker identification
• Dictation
• Speaker emotion
• Voice pitch
• Age
• Gender
• Intoxication
• Stress
• Medical conditions (e.g., sleep apnea)
Also used for
• Optical Character Recognition (OCR)
• Machine vision
• Big data analysis
Example Verbal Phrases with Annotations
Phrase | Annotation
“I have a problem with my bill” | accounting
“Where is my order?” | shipping
“My gadget arrived broken” | customer service
“I need to return my gadget” | shipping
“My statement is wrong” | accounting
“I want a refund” |
• Annotate thousands of verbal phrases
Statistical Language Model-based Speech Recognition
Audio Input → Feature Extraction → Phoneme Identification → Category Classifier
• Does not use grammars
• Statistical routines build the language model, a Statistical Language Model (SLM), from verbal phrases annotated with categories
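The idea of learning categories from annotated phrases can be sketched with a small naive Bayes text classifier. This is one common statistical technique chosen for illustration; the deck does not say which statistical routines a given product uses. The sketch works on text that has already been transcribed, uses only five annotated example phrases (a real system would use thousands), and all function names are assumptions.

```python
import math
from collections import Counter, defaultdict

# Annotated training phrases (a toy set; real systems use thousands).
TRAINING = [
    ("i have a problem with my bill", "accounting"),
    ("where is my order", "shipping"),
    ("my gadget arrived broken", "customer service"),
    ("i need to return my gadget", "shipping"),
    ("my statement is wrong", "accounting"),
]


def train(examples):
    """Count words per category and phrases per category."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for phrase, category in examples:
        words = phrase.split()
        word_counts[category].update(words)
        class_counts[category] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab


def classify(phrase, model):
    """Return the category with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for category in class_counts:
        score = math.log(class_counts[category] / total)
        denom = sum(word_counts[category].values()) + len(vocab)
        for word in phrase.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[category][word] + 1) / denom)
        if score > best_score:
            best, best_score = category, score
    return best


MODEL = train(TRAINING)
```

With this toy model, `classify("my bill is wrong", MODEL)` returns `"accounting"` and `classify("where is my gadget", MODEL)` returns `"shipping"`, even though neither phrase appears verbatim in the training set.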
Grammars vs. Statistical Language Models (SLMs)
Statistical Language Models (SLMs) | Context-Free Grammars (CFGs)
Data-driven | Hand-crafted rules
High accuracy | Very high accuracy
Complex to assemble | Easy to assemble
Natural language | Finite phrases
Used for dictation | Used for Interactive Voice Response (IVR), command and control
Call Routing
• Application: “How may I help you?”
• Caller: “Where is my order?”
• A Classifier routes the call to Accounting, Sales, …, or Customer Support
Speaker Identification Technologies
• General techniques for identifying people
 – Something you know
 – Something you have
 – Something about you (your speech features)
• Three basic functions for speaker identification
 – Speaker registration
 – Speaker authentication
 – Speaker identification
Speaker Registration
• Each speaker says a phrase (“Good morning”), and the extracted speech features are stored in Speech Profiles
 – Wanda’s speech features
 – Fred’s speech features
 – Joe’s speech features
Speaker Authentication
• A caller says “Good morning”; the extracted features are compared with Wanda’s speech features in the Speech Profiles
• Used to supplement or replace passwords
Speaker Identification
• A caller says “Good morning”; the extracted features are compared with every entry in the Speech Profiles (Wanda’s, Fred’s, and Joe’s speech features)
• The best match is selected: Wanda’s speech features
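The difference between authentication (compare against one claimed profile) and identification (select the best of all profiles) can be sketched with feature vectors and cosine similarity. This is a toy illustration: the three-number profiles, the 0.95 threshold, and the function names are assumptions, and real systems use far richer speech features and statistical models.

```python
import math

# Hypothetical stored speech profiles (registration output).
PROFILES = {
    "Wanda": [0.9, 0.1, 0.3],
    "Fred": [0.2, 0.8, 0.5],
    "Joe": [0.4, 0.4, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def authenticate(claimed_name, features, threshold=0.95):
    """Authentication: does the sample match the ONE claimed profile?"""
    return cosine(PROFILES[claimed_name], features) >= threshold


def identify(features):
    """Identification: which of ALL registered profiles matches best?"""
    return max(PROFILES, key=lambda name: cosine(PROFILES[name], features))
```

For a sample such as `[0.85, 0.15, 0.3]`, `identify` selects `"Wanda"`, `authenticate("Wanda", ...)` succeeds, and `authenticate("Fred", ...)` fails. The threshold trades false accepts against false rejects, which is where the failure cases listed for this technology (colds, similar voices, different microphones) show up.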
Speaker Identification Technologies
• Advantages
 – Are unobtrusive
 – Are location independent
 – Require no special equipment
 – Replace passwords
• Disadvantages
 – Sometimes fail
  • Siblings with similar voice profiles
  • Teenage male voice “break”
  • Colds, sore throats, sore lips, etc.
  • Variety of microphones
  • Tape recordings
Statistical Language Model-based Recognition Technologies
• Widely available
 – Call routing
 – Speaker authentication
 – Dictation
• Actively being researched
 – Speaker emotion
 – Voice pitch
 – Age
 – Gender
 – Intoxication
 – Stress
 – Medical conditions (e.g., sleep apnea)
Speech Synthesis (Text-To-Speech, TTS)
Text → Structure Analysis → Text Normalization → Text-to-phoneme Conversion → Prosody Analysis → Waveform Production
• Structure Analysis uses Structure Rules
• Text Normalization uses the Abbreviation and Acronym Database
• Text-to-phoneme Conversion uses the Pronunciation Lexicon
• Prosody Analysis uses Prosody Rules
• Waveform Production uses the Phoneme-to-sound Database
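The text normalization stage can be sketched as a pass that expands abbreviations and spells out numbers before pronunciation lookup. This is a minimal sketch under stated assumptions: the abbreviation table is a hypothetical stand-in for the abbreviation and acronym database, only whole tokens are matched, and only integers below 100 are handled.

```python
# Hypothetical stand-in for the abbreviation and acronym database.
ABBREVIATIONS = {"Dr.": "Doctor", "VP": "vice president", "TTS": "text to speech"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand known abbreviations and small numbers, token by token."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token.isdecimal() and int(token) < 100:
            out.append(number_to_words(int(token)))
        else:
            out.append(token)  # left for later stages unchanged
    return " ".join(out)
```

For example, `normalize("Dr. Larson has 12 slides")` produces "Doctor Larson has twelve slides". Real normalizers also resolve ambiguous cases from context, such as dates, currency, and abbreviations with more than one reading.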
Concatenated vs. Parameter-based Speech Synthesis
• Concatenated
 – Isolate phonemes from recorded speech: “The dog barked” → dh eh d ao g b ah er k eh d
 – Concatenate phonemes to produce new speech: er eh d k ah er → “red car”
• Parameter-based
 – Generate speech for “red car” from voice parameters
Speech Synthesis Markup Language (SSML)
• Structure Analysis
 – Markup support: paragraph, sentence
 – Non-markup behavior: infer structure by automated text analysis
• Text Normalization
 – Markup support: say-as for dates, times, etc.; sub for aliasing
 – Non-markup behavior: automatically identify and convert constructs
• Text-to-Phoneme Conversion
 – Markup support: phoneme, say-as
 – Non-markup behavior: look up in pronunciation dictionary
• Prosody Analysis
 – Markup support: emphasis, break, prosody
 – Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax
• Waveform Production
 – Markup support: voice, audio (audio icons, branding, advertising)
Pronunciation Specification
• Directly within the text: replace “creek” by “krik”
• With the phoneme commands
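As an illustration of markup-based pronunciation control, the sketch below builds a small SSML fragment in Python with the standard library. The `phoneme` and `sub` elements and their `alphabet`, `ph`, and `alias` attributes are part of SSML; the sentence, the IPA string chosen for the “krik” pronunciation, the abbreviation “LTS”, and the function name are assumptions for the example. Namespace and language attributes that a complete SSML document carries are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def creek_ssml():
    """Build a small SSML fragment that overrides one pronunciation."""
    speak = ET.Element("speak", {"version": "1.0"})
    speak.text = "We crossed the "

    # <phoneme> gives the synthesizer an explicit phonetic transcription.
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": "krɪk"})
    phoneme.text = "creek"
    phoneme.tail = " near "

    # <sub> replaces the written text with an alias to be spoken instead.
    sub = ET.SubElement(speak, "sub", {"alias": "Larson Technical Services"})
    sub.text = "LTS"  # hypothetical abbreviation
    sub.tail = "."

    return ET.tostring(speak, encoding="unicode")
```

Building the fragment with an XML library rather than string concatenation keeps it well-formed, which matters because a synthesizer will reject malformed markup.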
Prerecorded Messages vs. Speech Synthesis
Prerecorded messages | Speech Synthesis (TTS)
Natural sounding | Artificial sounding
Easy to understand | May be difficult to understand
Static data | Computer-generated data
Tedious to record and tag | Easy to specify
When to Use Speech Synthesis?
• During application creation
 – Debug the dialog
 – Replace with prerecorded messages before deployment
• Rendering information from a dynamic database or news feed
Dialog Management
• Controlling the interchange of information between users and application
• Three dialog styles
 1. Application-directed conversational dialogs: the application asks questions to solicit answers and instructions from a user.
 2. Human-directed conversational dialogs: the user asks a question or speaks a command, and the computer responds.
 3. Mixed-initiative dialogs: the user and application take turns driving the conversation.
Three Dialog Styles
• Application-directed
 Application: What month?
 Caller: February
 Application: What day of the month?
 Caller: Twelve
 Application: What year?
 Caller: Nineteen ninety-seven
• Human-directed
 Caller: Set month to February
 Application: Month is February
 Caller: Set day to twelve
 Application: Day is twelve
 Caller: Set year to nineteen ninety-seven
 Application: Year is nineteen ninety-seven
• Mixed-initiative
 Application: What month?
 Caller: February twelve nineteen ninety-seven
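The application-directed and mixed-initiative styles can be sketched as the same slot-filling loop with different reply parsers. This is a minimal illustration, not how VoiceXML interpreters are implemented: the slot table, the function names, and the parsing heuristic (a month name followed by a one-word day and a multi-word year) are assumptions for the example.

```python
SLOTS = [
    ("month", "What month?"),
    ("day", "What day of the month?"),
    ("year", "What year?"),
]
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}


def directed_parse(slot, reply):
    """Application-directed: a reply only ever answers the slot just asked."""
    return {slot: reply}


def mixed_parse(slot, reply):
    """Mixed-initiative: a reply to the month prompt may carry the full date.

    Assumes the day is a single word, which is a simplification.
    """
    words = reply.split()
    if slot == "month" and len(words) >= 3 and words[0].lower() in MONTHS:
        return {"month": words[0], "day": words[1], "year": " ".join(words[2:])}
    return {slot: reply}


def run_dialog(caller_replies, parse):
    """Prompt for each slot that is still empty and record the exchange."""
    filled, transcript = {}, []
    replies = iter(caller_replies)
    for slot, prompt in SLOTS:
        if slot in filled:
            continue  # the caller already volunteered this value
        transcript.append(f"Application: {prompt}")
        reply = next(replies)
        transcript.append(f"Caller: {reply}")
        filled.update(parse(slot, reply))
    return filled, transcript
```

With `directed_parse`, three prompts are needed to fill the date. With `mixed_parse` and the reply "February twelve nineteen ninety-seven", the first prompt fills all three slots and the remaining prompts are skipped.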
VoiceXML 2.1
• XML format for specifying interactive voice dialogues between a human and a computer
 – DTMF input and prerecorded voice as output
 – Speech recognition and speech synthesis
 – Video output to user (non-standard)
• Designed for Interactive Voice Response (IVR) applications using the telephone
• Currently does not support external events, except
Example of VoiceXML 2.1 Fragment
<?xml version="1.0"?>
• Dialog Language (VoiceXML 2.1)
• Speech Synthesis Markup Language (SSML)
VoiceXML 2.1 Features
• Menus, forms, subdialogs
 –




























