Introduction to Speech Technologies James A Larson VP

Recognition Technology | Source | Target | Typical Technique
Touchtone Recognition (DTMF) | Caller presses touchtone buttons | Digits | Tone recognition
Automatic Speech Recognition (ASR) | Spoken language | Words | Hidden Markov Model, Neural Net, Table Lookup
Speaker Identification | Spoken language | Names of registered callers | Table lookup
Voice Activity Detection | Caller speaks or does not speak | “On” or “Off” | Attention word
Classification | Spoken language | Categories | Statistical analysis
Language Identification | Spoken language | National language names | Table lookup

© 2013 by Larson Technical Services

Touchtone Recognition
• Caller responds to voice menus by pressing touchtone buttons on the telephone keypad
• Advantages
  – Highly accurate
• Disadvantages
  – Lost in space
  – Time-consuming menus where user must convert choice to a digit
• Uses
  – Enter digits
  – Privacy
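
Tone recognition can be sketched with the Goertzel recurrence, which measures signal power at one frequency at a time. The row and column frequencies below are the standard DTMF values; the function names, the 8000 Hz sample rate, and the 50 ms tone length are assumptions chosen for illustration, and the synthesized tone stands in for real telephone audio.

```python
import math

# Standard DTMF keypad: each key is the sum of one row tone and one column tone (Hz).
ROW_FREQS = [697, 770, 852, 941]
COL_FREQS = [1209, 1336, 1477, 1633]
KEYPAD = ["123A", "456B", "789C", "*0#D"]


def synthesize_key(key, rate=8000, duration=0.05):
    """Generate the two-tone signal for one key (a stand-in for telephone audio)."""
    for r, row in enumerate(KEYPAD):
        if len(key) == 1 and key in row:
            f_row, f_col = ROW_FREQS[r], COL_FREQS[row.index(key)]
            break
    else:
        raise ValueError(f"not a DTMF key: {key!r}")
    n = int(rate * duration)
    return [0.5 * math.sin(2 * math.pi * f_row * t / rate)
            + 0.5 * math.sin(2 * math.pi * f_col * t / rate)
            for t in range(n)]


def goertzel_power(samples, freq, rate):
    """Signal power at a single frequency, computed with the Goertzel recurrence."""
    coeff = 2 * math.cos(2 * math.pi * freq / rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s_prev, s_prev2 = x + coeff * s_prev - s_prev2, s_prev
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def recognize_key(samples, rate=8000):
    """Pick the strongest row tone and the strongest column tone, then look up the key."""
    row = max(range(4), key=lambda i: goertzel_power(samples, ROW_FREQS[i], rate))
    col = max(range(4), key=lambda i: goertzel_power(samples, COL_FREQS[i], rate))
    return KEYPAD[row][col]
```

Because only eight known frequencies need to be checked, this kind of detector has very little to confuse, which is one way to read the "highly accurate" advantage. A production detector would also check tone duration and signal level, which this sketch omits.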

Speech Recognition (ASR, STT)
• Advantages
  – User does not convert choices to a digit
• Disadvantages
  – Occasional failure to recognize what user said
  – Time-consuming dialogs
• Users may interrupt prompts by “barge-in”

How Speech Recognition Works
Audio Input → Feature Extraction → Phoneme Identification → Word Identification → Words and Phrases
• Phoneme Identification transforms features to phonemes using the Acoustic Model
• Acoustic Model
  – Sounds in a language
  – Different for each language
  – May be speaker dependent (speaker must train model)
  – May be speaker independent (pretrained)
  – Usually supplied by ASR vendor

How Speech Recognition Works
Audio Input → Feature Extraction → Phoneme Identification → Word Identification → Words and Phrases
• Word Identification transforms phonemes to words using the Language Model
• Language Model
  – Words in a language and their pronunciations
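
The phoneme-to-word step can be illustrated with a lookup against a table of words and their pronunciations. This is a minimal sketch, not how a commercial recognizer works: the lexicon entries, the phoneme symbols, and the function name are hypothetical, and real systems score many competing hypotheses rather than requiring an exact match.

```python
# Hypothetical lexicon: word -> phoneme sequence (symbols are illustrative only).
LEXICON = {
    "red": ("r", "eh", "d"),
    "car": ("k", "aa", "r"),
    "card": ("k", "aa", "r", "d"),
}


def phonemes_to_words(phonemes):
    """Return one segmentation of the phoneme sequence into lexicon words, or None."""
    phonemes = tuple(phonemes)
    n = len(phonemes)
    # best[i] holds a word list that exactly covers phonemes[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for i in range(n):
        if best[i] is None:
            continue
        for word, pron in LEXICON.items():
            j = i + len(pron)
            if phonemes[i:j] == pron and best[j] is None:
                best[j] = best[i] + [word]
    return best[n]
```

A phoneme sequence that cannot be covered by lexicon entries returns None, which corresponds to the recognizer failing to identify what the user said.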

Grammar-based Speech Recognition
Audio Input → Feature Extraction → Phoneme Identification → Word Identification → Words and Phrases
• Context-free Grammar (CFG) → Grammar Compiler → Grammar
• Language Model: Grammar and Lexicon

Where Are Grammars Used?
• Interactive Voice Response (IVR) systems
  – Automated telephone agents
• Each step may use a different grammar
  – Grammar defines only the words which the user may speak during a step
  – Application developers specify grammars for each step
• The same grammar may be reused in multiple applications

Example Grammar
[Grammar listing: type = "application/srgs+xml", root = "single_digit", mode = "voice"; rule items: one, two, three, four, five, six, seven, eight, nine]
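
A single-digit grammar of this kind might look like the XML embedded in the sketch below. The attribute values and the nine words come from the slide; the `rule`, `one-of`, and `item` structure is an assumption based on the SRGS XML format, and the helper names are hypothetical. The Python code only reads the grammar and checks an utterance against it.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction: only the attributes and the nine words appear in the slide.
GRAMMAR_XML = """
<grammar type="application/srgs+xml" root="single_digit" mode="voice">
  <rule id="single_digit">
    <one-of>
      <item>one</item>
      <item>two</item>
      <item>three</item>
      <item>four</item>
      <item>five</item>
      <item>six</item>
      <item>seven</item>
      <item>eight</item>
      <item>nine</item>
    </one-of>
  </rule>
</grammar>
"""


def allowed_words(grammar_xml):
    """List the words offered by the grammar's root rule."""
    root = ET.fromstring(grammar_xml.strip())
    rule = root.find(f"rule[@id='{root.get('root')}']")
    return [item.text.strip() for item in rule.iter("item")]


def accepts(grammar_xml, utterance):
    """True if the utterance is one of the words the grammar allows."""
    return utterance.strip().lower() in allowed_words(grammar_xml)
```

Anything outside the listed words is rejected, which is how a grammar limits what the user may say during one step.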

Grammar with 3 Rules
[Grammar listing; rule items: small, medium, large; red, green, blue]
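
One way to picture a grammar with three rules is as two word lists plus a root rule that references both. In the sketch below only the six words come from the slide; the rule names `size`, `color`, and `order`, the `$name` reference convention, and the choice of root rule are assumptions for illustration.

```python
from itertools import product

# Hypothetical three-rule grammar; "$name" marks a reference to another rule.
RULES = {
    "size": [["small"], ["medium"], ["large"]],
    "color": [["red"], ["green"], ["blue"]],
    "order": [["$size", "$color"]],  # assumed root rule combining the other two
}


def expand(symbol):
    """Yield every word sequence that a rule reference (or a literal word) can produce."""
    if not symbol.startswith("$"):
        yield [symbol]
        return
    for alternative in RULES[symbol[1:]]:
        parts = [list(expand(s)) for s in alternative]
        for combo in product(*parts):
            yield [word for seq in combo for word in seq]


# Every phrase the assumed root rule accepts.
PHRASES = {" ".join(words) for words in expand("$order")}
```

Enumerating phrases like this only works for small, non-recursive grammars, which matches the "finite phrases" character of grammar-based recognition.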

Summary: Grammar-based Speech Recognition
• A large variety of applications use speech recognition technologies.
• Speech grammars constrain the words that a user may speak during a single step of an automated conversation.
• Trained application developers create a grammar for each step of an automated conversation.

Statistical Language Model-based Recognition Technologies
• Call Routing
• Speaker Identification
• Dictation
• Speaker emotion
• Voice pitch
• Age
• Gender
• Intoxication
• Stress
• Medical conditions (e.g., sleep apnea)
Also used for
• Optical Character Recognition (OCR)
• Machine vision
• Big data analysis

Example Verbal Phrases with Annotations
• “I have a problem with my bill”
• “Where is my order?”
• “My gadget arrived broken”
• “I need to return my gadget”
• “My statement is wrong”
• “I want a refund”
Annotations: accounting, shipping, customer service, shipping, accounting
Annotate thousands of verbal phrases

Statistical Language Model-based Speech Recognition
Audio Input → Feature Extraction → Phoneme Identification → Category Classifier
• Does not use grammars
• Language Model: Statistical Language Model (SLM)
• Verbal Phrases Annotated with categories → Statistical Routines → Statistical Language Model (SLM)
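
The idea of deriving a category classifier from annotated phrases can be sketched with a small naive Bayes model. Naive Bayes is a swapped-in stand-in for the unspecified "statistical routines", and it works on text rather than on phonemes. The phrase-to-category pairing below is an assumption (phrases and annotations taken in slide order), and the label on the last phrase is a guess, since the slide lists only five annotations.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed pairing of phrases and categories; the final label is a guess.
TRAINING = [
    ("I have a problem with my bill", "accounting"),
    ("Where is my order?", "shipping"),
    ("My gadget arrived broken", "customer service"),
    ("I need to return my gadget", "shipping"),
    ("My statement is wrong", "accounting"),
    ("I want a refund", "accounting"),
]


def tokenize(text):
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count documents per category and words per category."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in examples:
        class_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest smoothed log-probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    # Words never seen in training carry no evidence, so they are skipped.
    words = [w for w in tokenize(text) if w in vocabulary]
    best_category, best_score = None, -math.inf
    for category, n_docs in class_counts.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)
        for word in words:
            score += math.log((word_counts[category][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


MODEL = train(TRAINING)
```

With only six phrases the model is a toy; the point of annotating thousands of phrases is to give counts like these enough data to generalize.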

Grammars vs. Statistical Language Models (SLMs)

Statistical Language Models (SLMs) | Context-Free Grammars (CFGs)
Data-driven | Hand-crafted rules
High accuracy | Very high accuracy
Complex to assemble | Easy to assemble
Natural language | Finite phrases
Used for dictation | Used for Interactive Voice Response (IVR), command and control

Speaker Identification Technologies
• General techniques for identifying people
  – Something you know
  – Something you have
  – Something about you (your speech features)
• Three basic functions for speaker identification
  – Speaker registration
  – Speaker authentication
  – Speaker identification

Speaker Identification
• Speech Profiles
  – “Good morning”: Wanda’s speech features
  – “Good morning”: Fred’s speech features
  – “Good morning”: Joe’s speech features
• “Good morning” → Select Wanda’s speech features
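
Selecting a speaker from stored speech profiles can be sketched as a nearest-profile lookup. Everything numeric here is hypothetical: the three-element feature vectors, the cosine similarity measure, and the 0.9 threshold are assumptions for illustration, not values from a real voice biometric system.

```python
import math

# Hypothetical enrolled profiles: speaker name -> feature vector.
PROFILES = {
    "Wanda": [0.9, 0.1, 0.3],
    "Fred": [0.2, 0.8, 0.5],
    "Joe": [0.4, 0.4, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify(features, profiles=PROFILES, threshold=0.9):
    """Speaker identification: best-matching enrolled speaker, or None if none is close."""
    name, score = max(((n, cosine_similarity(features, p)) for n, p in profiles.items()),
                      key=lambda pair: pair[1])
    return name if score >= threshold else None


def authenticate(claimed_name, features, profiles=PROFILES, threshold=0.9):
    """Speaker authentication: does the voice match the profile of the claimed speaker?"""
    return cosine_similarity(features, profiles[claimed_name]) >= threshold
```

Identification compares the caller against every profile, while authentication compares against one claimed identity. The threshold is where failures such as similar-sounding voices or a changed voice would show up.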

Speaker Identification Technologies
• Advantages
  – Are unobtrusive
  – Are location independent
  – Require no special equipment
  – Replace passwords
• Disadvantages
  – Sometimes fail
    • Siblings with similar voice profiles
    • Teenage male voice “break”
    • Colds, sore throats, sore lips, etc.
    • Variety of microphones
    • Tape recordings

Statistical Language Model-based Recognition Technologies
• Widely available
  – Call routing
  – Speaker authentication
  – Dictation
• Actively being researched
  – Speaker emotion
  – Voice pitch
  – Age
  – Gender
  – Intoxication
  – Stress
  – Medical conditions (e.g., sleep apnea)

Speech Synthesis (Text-To-Speech, TTS)
Text → Structure Analysis → Text Normalization → Text-to-phoneme Conversion → Prosody Analysis → Waveform Production

Stage | Resource
Structure Analysis | Structure Rules
Text Normalization | Abbreviation and Acronym Database
Text-to-phoneme Conversion | Pronunciation Lexicon
Prosody Analysis | Prosody Rules
Waveform Production | Phoneme-to-sound Database
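
The first three stages of this pipeline can be sketched as plain functions that each consult one resource. The abbreviation table and the function names are hypothetical, the phoneme symbols are illustrative, and prosody analysis and waveform production are left out because they need audio data.

```python
import re

# Hypothetical resources standing in for the databases named in the pipeline.
ABBREVIATIONS = {"ivr": "interactive voice response", "tts": "text to speech"}
PRONUNCIATION_LEXICON = {
    "the": ["dh", "eh"],
    "dog": ["d", "ao", "g"],
    "barked": ["b", "ah", "er", "k", "eh", "d"],
}


def structure_analysis(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def normalize(sentence):
    """Lowercase, strip punctuation, and expand known abbreviations and acronyms."""
    expanded = []
    for word in re.findall(r"[a-z']+", sentence.lower()):
        expanded.extend(ABBREVIATIONS.get(word, word).split())
    return expanded


def to_phonemes(words):
    """Look each word up in the lexicon; unknown words are flagged, not guessed."""
    return [PRONUNCIATION_LEXICON.get(word, ["<unk:" + word + ">"]) for word in words]


def front_end(text):
    """Run structure analysis, normalization, and text-to-phoneme conversion in order."""
    return [to_phonemes(normalize(s)) for s in structure_analysis(text)]
```

Keeping each stage separate mirrors the slide: a missing word is fixed by editing the lexicon, and a missing abbreviation by editing its database.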

Concatenated vs. Parameter-based Speech Synthesis
• Concatenated: “The dog barked” → Isolate Phonemes → dh eh d ao g b ah er k eh d; Concatenate Phonemes (er ed d k ah er) → “red car”
• Parameter-based: Voice Parameters → Generate Speech → “red car”

Speech Synthesis ML

Stage | Markup support | Non-markup behavior
Structure Analysis | paragraph, sentence | Infer structure by automated text analysis
Text Normalization | say-as for dates, times, etc.; sub for aliasing | Automatically identify and convert constructs
Text-to-Phoneme Conversion | phoneme, say-as | Look up in pronunciation dictionary
Prosody Analysis | emphasis, break, prosody | Automatically generate prosody through analysis of document structure and sentence syntax
Waveform Production | voice, audio* |

*audio icons, branding, advertising

Pronunciation Specification
• Directly within the text: replace “creek” by “krik”
  – Avoid, so text can also be used for other purposes:
    • Display on a screen
    • Print
    • Copy and paste to another document
• With the phoneme commands: creek
• In the pronunciation lexicon: creek "krik"
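
The phoneme approach can be sketched by generating SSML in which the written word stays "creek" and only the pronunciation changes. The `phoneme` element with its `alphabet` and `ph` attributes is part of SSML; treating the slide's "krik" as an IPA string is an assumption, the function name is hypothetical, and the namespace and language attributes of `speak` are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def speak_with_phoneme(before, word, pronunciation, after, alphabet="ipa"):
    """Build an SSML string that keeps `word` as text but overrides how it is spoken."""
    speak = ET.Element("speak", {"version": "1.0"})
    speak.text = before
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": alphabet, "ph": pronunciation})
    phoneme.text = word
    phoneme.tail = after
    return ET.tostring(speak, encoding="unicode")


def display_text(ssml):
    """Recover the plain text for display, printing, or copy and paste."""
    return "".join(ET.fromstring(ssml).itertext())


SSML = speak_with_phoneme("Cross the ", "creek", "krik", " slowly.")
```

Because the markup carries the pronunciation, `display_text(SSML)` still yields the original spelling, which is the reason to avoid rewriting the text itself.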

Prerecorded Messages vs. Speech Synthesis

Prerecorded messages | Speech Synthesis (TTS)
Natural sounding | Artificial sounding
Easy to understand | May be difficult to understand
Static data | Computer-generated data
Tedious to record and tag | Easy to specify

When to Use Speech Synthesis?
• During application creation
  – Debug the dialog
  – Replace with prerecorded messages before deployment
• Rendering information from dynamic database or news feed

Dialog Management
• Controlling the interchange of information between users and application
• Three dialog styles
  1. Application-directed conversational dialogs
     • Application asks questions to solicit answers and instructions from a user.
  2. Human-directed conversational dialogs
     • User asks a question or speaks a command; the computer responds.
  3. Mixed-initiative dialogs
     • User and application take turns driving conversations.

Three Dialog Styles

Application-directed
  Application: What month?
  Caller: February
  Application: What day of the month?
  Caller: Twelve
  Application: What year?
  Caller: Nineteen ninety-seven

Human-directed
  Caller: Set month to February
  Application: Month is February
  Caller: Set day to twelve
  Application: Day is twelve
  Caller: Set year to nineteen ninety-seven
  Application: Year is nineteen ninety-seven

Mixed-initiative
  Application: What month?
  Caller: February twelve nineteen ninety-seven
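
The difference between application-directed and mixed-initiative dialogs can be sketched as a slot-filling loop. This is a simplified illustration: the function name is hypothetical, and each caller turn is assumed to arrive already recognized as a dictionary of slot values, so no speech recognition or number parsing is shown.

```python
SLOTS = ["month", "day", "year"]
PROMPTS = {"month": "What month?", "day": "What day of the month?", "year": "What year?"}


def run_dialog(caller_turns):
    """Ask for the first empty slot; accept any slots the caller supplies in a turn.

    caller_turns: list of dicts, each holding the recognized slot values of one turn.
    Returns the filled slots and the prompts the application spoke.
    """
    filled, prompts = {}, []
    turns = iter(caller_turns)
    while len(filled) < len(SLOTS):
        missing = next(s for s in SLOTS if s not in filled)
        prompts.append(PROMPTS[missing])
        try:
            answer = next(turns)
        except StopIteration:
            break  # caller stopped responding
        filled.update({k: v for k, v in answer.items() if k in SLOTS})
    return filled, prompts


# Application-directed: one slot per turn, three prompts.
directed = run_dialog([{"month": "February"}, {"day": 12}, {"year": 1997}])

# Mixed-initiative: the caller answers more than was asked, one prompt.
mixed = run_dialog([{"month": "February", "day": 12, "year": 1997}])
```

The same loop serves both styles; what changes is whether the recognizer and grammar allow the caller to supply more than the requested slot in one utterance.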

VoiceXML 2.1
• XML format for specifying interactive voice dialogues between a human and a computer
  – DTMF input and prerecorded voice as output
  – Speech recognition and speech synthesis
  – Video output to user (non-standard)
• Designed for Interactive Voice Response (IVR) applications using telephone
• Currently does not support external events, except and
• Requires a VoiceXML interpreter

Example of VoiceXML 2.1 Fragment
[Fragment listing with callouts: Dialog Language (VoiceXML 2.1), Speech Synthesis Markup Language (SSML)]
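
As a rough illustration of how the dialog language and SSML combine, the sketch below embeds a small VoiceXML document in Python and inspects it. The document is a hypothetical example, not the fragment from the slide: the form id, the field name, and the grammar file `months.grxml` are assumptions. The elements themselves (`form`, `field`, `prompt`, `grammar`, `filled`, `value`, plus SSML `emphasis` and `break`) are standard.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"
SSML_ELEMENTS = {"emphasis", "break", "prosody", "say-as"}

# Hypothetical document: dialog structure in VoiceXML, prompt styling in SSML.
VXML_DOCUMENT = """\
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="get_month">
    <field name="month">
      <prompt>
        What <emphasis>month</emphasis>? <break time="300ms"/>
      </prompt>
      <grammar src="months.grxml" type="application/srgs+xml"/>
      <filled>
        <prompt>Month is <value expr="month"/></prompt>
      </filled>
    </field>
  </form>
</vxml>
"""


def local_name(element):
    """Strip the namespace from an element tag such as '{ns}field'."""
    return element.tag.split("}")[-1]


def summarize(vxml_text):
    """Return the field names (dialog language) and the SSML elements used in prompts."""
    root = ET.fromstring(vxml_text)
    fields = [f.get("name") for f in root.iterfind(".//v:field", {"v": VXML_NS})]
    ssml = sorted({local_name(el) for el in root.iter() if local_name(el) in SSML_ELEMENTS})
    return fields, ssml
```

A VoiceXML interpreter would execute a document like this by playing the prompt, listening with the referenced grammar, and running the `filled` block once the field has a value.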

VoiceXML 2.1 Features
• Menus, forms, subdialogs