Introduction to Speech Technologies
James A. Larson, VP, Larson Technical Services
jim@larson-tech.com
© 2013 by Larson Technical Services

Outline
• Grammar-based Speech Recognition
• Statistical Language Model-based Recognition
• Speech Synthesis
• Dialog Management
• Natural Language Processing

Recognition Technologies

| Recognition Technology | Source | Target | Typical Technique |
|---|---|---|---|
| Touchtone Recognition (DTMF) | Caller presses touch-tone buttons | Digits | Tone recognition |
| Automatic Speech Recognition (ASR) | Spoken language | Words | Hidden Markov Model, Neural Net, Table Lookup |
| Speaker Identification | Spoken language | Names of registered callers | Table lookup |
| Voice Activity Detection | Caller speaks or does not speak | "On" or "Off" | Attention word |
| Classification | Spoken language | Categories | Statistical analysis |
| Language Identification | Spoken language | National language names | Table lookup |

Touchtone Recognition
• Caller responds to voice menus by pressing touchtone buttons on the telephone keypad
• Advantages
– Highly accurate
• Disadvantages
– Lost in space
– Time-consuming menus where user must convert choice to a digit
• Uses
– Enter digits
– Privacy
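
The slides name "tone recognition" as the typical technique for touchtone input but do not say how it is done. One common approach is the Goertzel algorithm, which measures signal power at each of the eight standard DTMF frequencies. The sketch below is an illustration under that assumption; the function names, the 8000 Hz sample rate, and the 50 ms tone length are choices made for this example only.

```python
import math

# Standard DTMF keypad: each key is one row tone plus one column tone (Hz).
ROW_FREQS = [697, 770, 852, 941]
COL_FREQS = [1209, 1336, 1477, 1633]
KEYPAD = ["123A", "456B", "789C", "*0#D"]


def goertzel_power(samples, freq, sample_rate):
    """Estimate signal power near `freq` with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_key(samples, sample_rate=8000):
    """Pick the strongest row tone and the strongest column tone."""
    row = max(range(4), key=lambda i: goertzel_power(samples, ROW_FREQS[i], sample_rate))
    col = max(range(4), key=lambda i: goertzel_power(samples, COL_FREQS[i], sample_rate))
    return KEYPAD[row][col]


def synthesize_key(key, sample_rate=8000, duration=0.05):
    """Generate a clean test tone for `key` (hypothetical helper for the demo)."""
    for row_index, row in enumerate(KEYPAD):
        if key in row:
            f_row = ROW_FREQS[row_index]
            f_col = COL_FREQS[row.index(key)]
            break
    else:
        raise ValueError(f"not a DTMF key: {key!r}")
    n = int(sample_rate * duration)
    return [
        math.sin(2 * math.pi * f_row * t / sample_rate)
        + math.sin(2 * math.pi * f_col * t / sample_rate)
        for t in range(n)
    ]


detected = detect_key(synthesize_key("5"))
```

A production detector would also apply energy thresholds and timing checks so that speech or noise is not mistaken for a key press; this sketch omits them.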

Speech Recognition (ASR, STT)
• Advantages
– User does not convert choices to a digit
• Disadvantages
– Occasional failure to recognize what user said
– Time-consuming dialogs
• Users may interrupt prompts by "barge-in"

How Speech Recognition Works
[Diagram: processing stages, from input to output]
1. Audio Input
2. Feature Extraction (digital signal processing)
3. Phoneme Identification
4. Word Identification
5. Words and Phrases

How Speech Recognition Works
[Diagram: Audio Input → Feature Extraction → Phoneme Identification → Word Identification → Words and Phrases]
• Phoneme Identification transforms features to phonemes using the Acoustic Model
• Acoustic Model
– Sounds in a language
– Different for each language
– May be speaker dependent (speaker must train model)
– May be speaker independent (pretrained)
– Usually supplied by ASR vendor

How Speech Recognition Works
[Diagram: Audio Input → Feature Extraction → Phoneme Identification → Word Identification → Words and Phrases]
• Word Identification transforms phonemes to words using the Language Model
• Language Model
– Words in a language and their pronunciations
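
As a rough illustration of the word identification stage, the sketch below segments a phoneme sequence into words by looking pronunciations up in a small lexicon. The lexicon entries and the function name are assumptions made for this example (the phoneme symbols are borrowed from the "The dog barked" example later in the deck). Real recognizers score many competing hypotheses statistically rather than doing an exact lookup.

```python
# Hypothetical pronunciation lexicon: phoneme sequence -> word.
LEXICON = {
    ("dh", "eh"): "the",
    ("d", "ao", "g"): "dog",
    ("b", "ah", "er", "k", "eh", "d"): "barked",
}


def phonemes_to_words(phonemes, lexicon=LEXICON):
    """Segment `phonemes` into lexicon words; return None if no segmentation exists."""
    phonemes = tuple(phonemes)
    longest = max(len(key) for key in lexicon)

    def search(start):
        if start == len(phonemes):
            return []
        # Try the longest candidate first and backtrack on failure.
        for end in range(min(len(phonemes), start + longest), start, -1):
            word = lexicon.get(phonemes[start:end])
            if word is not None:
                rest = search(end)
                if rest is not None:
                    return [word] + rest
        return None

    return search(0)


words = phonemes_to_words("dh eh d ao g b ah er k eh d".split())
```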

Speech Recognition
• Grammar-based
– Developer specifies words to be recognized
• Statistical Language Models
– Developer records and tags phrases

Grammar-based Speech Recognition
[Diagram: Audio Input → Feature Extraction → Phoneme Identification → Word Identification → Words and Phrases. A Context-free Grammar (CFG) is processed by the Grammar Compiler into the Language Model (grammar and lexicon) used for Word Identification.]

Where Are Grammars Used?
• Interactive Voice Response (IVR) systems
– Automated telephone agents
• Each step may use a different grammar
– Grammar defines only the words which the user may speak during a step
– Application developers specify grammars for each step
• The same grammar may be reused in multiple applications

Example Grammar
[SRGS XML grammar; the element markup did not survive extraction]
• Attributes: type = "application/srgs+xml", root = "single_digit", mode = "voice"
• Items: one, two, three, four, five, six, seven, eight, nine
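
The sketch below shows what a single-digit grammar of this kind might look like in SRGS XML and how an application could check an utterance against it. The XML is an assumed reconstruction: the `rule`, `one-of`, and `item` elements follow the SRGS specification, while the `xmlns` and `version` attributes and the helper function names are additions for this example, not text from the slide.

```python
import xml.etree.ElementTree as ET

SRGS_NS = {"g": "http://www.w3.org/2001/06/grammar"}

# Assumed reconstruction; only type, root, mode, and the nine words come from the slide.
GRAMMAR_XML = """\
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         type="application/srgs+xml" root="single_digit" mode="voice">
  <rule id="single_digit">
    <one-of>
      <item>one</item>
      <item>two</item>
      <item>three</item>
      <item>four</item>
      <item>five</item>
      <item>six</item>
      <item>seven</item>
      <item>eight</item>
      <item>nine</item>
    </one-of>
  </rule>
</grammar>
"""


def root_rule_words(xml_text):
    """Return the alternatives listed in the grammar's root rule."""
    grammar = ET.fromstring(xml_text)
    root_id = grammar.get("root")
    for rule in grammar.findall("g:rule", SRGS_NS):
        if rule.get("id") == root_id:
            return [item.text.strip() for item in rule.findall("g:one-of/g:item", SRGS_NS)]
    raise ValueError(f"root rule {root_id!r} not found")


def accepts(xml_text, utterance):
    """True if the utterance is one of the words the grammar allows."""
    return utterance.strip().lower() in root_rule_words(xml_text)


digits = root_rule_words(GRAMMAR_XML)
```

Anything outside the listed words, such as "ten", is rejected, which is the sense in which a grammar constrains what the caller may say at a given step.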

Example Grammar
[SRGS XML grammar; the element markup did not survive extraction]
• Surviving words, in order: one two three four five twenty six seven twenty eight nine

Grammar with 3 Rules
[SRGS XML grammar; the element markup did not survive extraction]
• Surviving words, in order: small medium large red green blue

Grammar Exercise
• Extend the grammar to include the combination of "color," "size," and "product," where product may be "T-shirt" or "vest"

XML and ABNF Grammar Formats
• XML format
– Verbose
– Validated by XML tools
– Example: [element markup lost in extraction] one two three four five six seven eight nine
• ABNF format
– Terse
– Familiar to compiler experts
– Not validated by XML tools
– Example: $single_digit = one | two | three | four | five | six | seven | eight | nine
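
To make the relationship between the two formats concrete, here is a minimal sketch that reads a flat ABNF rule like the one above and emits the equivalent XML rule. It handles only a single list of alternatives, not the full ABNF form of SRGS, and the function names are invented for this example.

```python
import xml.etree.ElementTree as ET


def parse_abnf_rule(line):
    """Split '$name = a | b | c' into ('name', ['a', 'b', 'c'])."""
    name, _, body = line.partition("=")
    name = name.strip().lstrip("$")
    alternatives = [alt.strip() for alt in body.strip().rstrip(";").split("|")]
    return name, [alt for alt in alternatives if alt]


def to_srgs_rule(name, alternatives):
    """Serialize the alternatives as an SRGS-style <rule> element."""
    rule = ET.Element("rule", {"id": name})
    one_of = ET.SubElement(rule, "one-of")
    for alt in alternatives:
        ET.SubElement(one_of, "item").text = alt
    return ET.tostring(rule, encoding="unicode")


rule_name, rule_words = parse_abnf_rule(
    "$single_digit = one | two | three | four| five | six | seven | eight | nine"
)
xml_rule = to_srgs_rule(rule_name, rule_words)
```

The one-line ABNF rule expands to an XML element with one `item` per word, which illustrates the terse versus verbose contrast on the slide.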

Summary: Grammar-based Speech Recognition
• A large variety of applications use speech recognition technologies.
• Speech grammars constrain the words that a user may speak during a single step of an automated conversation.
• Trained application developers create a grammar for each step of an automated conversation.

Answer: Grammar Exercise
[SRGS XML grammar; the element markup did not survive extraction]
• Surviving words, in order: small medium large red T-shirt green vest blue
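
One way to picture the extended grammar is as three rules combined in sequence. The sketch below enumerates every phrase such a grammar would accept. The rule names and the size, color, product word order are assumptions for this example, since the slide's markup is not available.

```python
from itertools import product as cartesian_product

# Words taken from the slides; rule names and ordering are assumed.
RULES = {
    "size": ["small", "medium", "large"],
    "color": ["red", "green", "blue"],
    "product": ["T-shirt", "vest"],
}
ROOT_SEQUENCE = ["size", "color", "product"]


def phrases(rules=RULES, sequence=ROOT_SEQUENCE):
    """List every phrase the combined grammar accepts."""
    choices = (rules[name] for name in sequence)
    return [" ".join(words) for words in cartesian_product(*choices)]


def accepts(utterance, rules=RULES, sequence=ROOT_SEQUENCE):
    """True if the utterance has one allowed word per rule, in order."""
    words = utterance.split()
    return len(words) == len(sequence) and all(
        word in rules[name] for word, name in zip(words, sequence)
    )


all_phrases = phrases()
```

Three sizes, three colors, and two products give 18 phrases, which shows how quickly the accepted language grows as rules are combined.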

Statistical Language Model-based Recognition Technologies
• Call Routing
• Speaker Identification
• Dictation
• Speaker emotion
• Voice pitch
• Age
• Gender
• Intoxication
• Stress
• Medical conditions (e.g., sleep apnea)
Also used for
• Optical Character Recognition (OCR)
• Machine vision
• Big data analysis

Example Verbal Phrases with Annotations

| Verbal phrase | Annotation |
|---|---|
| "I have a problem with my bill" | accounting |
| "Where is my order?" | shipping |
| "My gadget arrived broken" | customer service |
| "I need to return my gadget" | shipping |
| "My statement is wrong" | accounting |
| "I want a refund" | |

• Annotate thousands of verbal phrases

Statistical Language Model-based Speech Recognition
[Diagram: Audio Input → Feature Extraction → Phoneme Identification → Classifier → Category. Verbal phrases annotated with categories are processed by Statistical Routines into the Statistical Language Model (SLM), which serves as the Language Model for the Classifier.]
• Does not use grammars

Grammars vs. Statistical Language Models

| Statistical Language Models (SLMs) | Context-Free Grammars (CFGs) |
|---|---|
| Data-driven | Hand-crafted rules |
| High accuracy | Very high accuracy |
| Complex to assemble | Easy to assemble |
| Natural language | Finite phrases |
| Used for dictation | Used for Interactive Voice Response (IVR), command and control |

Call Routing
[Diagram: the application asks "How may I help you?"; the caller answers "Where is my order?"; the Classifier routes the call to one of Accounting, Sales, …, Customer Support]
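
The slides say only that the classifier uses statistical analysis of annotated phrases. As one possible illustration, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing on five of the annotated example phrases from the deck. The choice of naive Bayes, the class name, and the tokenizer are assumptions for this example; a deployed router would be trained on thousands of phrases.

```python
import math
import re
from collections import Counter, defaultdict

# Annotated phrases from the slides (only those with a visible annotation).
TRAINING = [
    ("I have a problem with my bill", "accounting"),
    ("Where is my order?", "shipping"),
    ("My gadget arrived broken", "customer service"),
    ("I need to return my gadget", "shipping"),
    ("My statement is wrong", "accounting"),
]


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class CallRouter:
    """Toy naive Bayes router; words never seen in training are ignored."""

    def __init__(self, examples):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        for phrase, category in examples:
            self.class_counts[category] += 1
            self.word_counts[category].update(tokenize(phrase))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}

    def route(self, utterance):
        tokens = [t for t in tokenize(utterance) if t in self.vocab]
        total = sum(self.class_counts.values())

        def score(category):
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            log_prob = math.log(self.class_counts[category] / total)
            for token in tokens:
                log_prob += math.log((counts[token] + 1) / denom)
            return log_prob

        return max(self.class_counts, key=score)


router = CallRouter(TRAINING)
destination = router.route("Where is my order?")
```

With so little training data the router is easily fooled by new wording, which is why the deck stresses annotating thousands of phrases.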

Speaker Identification Technologies
• General techniques for identifying people
– Something you know
– Something you have
– Something about you: your speech features
• Three basic functions for speaker identification
– Speaker registration
– Speaker authentication
– Speaker identification

Speaker Registration
[Diagram: Wanda, Fred, and Joe each say "Good Morning"; each speaker's speech features are stored in the Speech Profiles]

Speaker Authentication
[Diagram: a caller says "Good morning"; the caller's speech features are compared with Wanda's speech features stored in the Speech Profiles]
• Used to supplement or replace passwords

Speaker Identification
[Diagram: a caller says "Good morning"; the caller's speech features are compared with the stored speech features of Wanda, Fred, and Joe in the Speech Profiles, and Wanda's speech features are selected]
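
The three functions (registration, authentication, identification) can be sketched with toy feature vectors. Everything numeric here is hypothetical: the three-element vectors, the cosine-similarity comparison, and the 0.95 threshold are stand-ins chosen for this example, not the method described in the slides.

```python
import math

PROFILES = {}  # name -> stored speech feature vector


def register(name, features):
    """Speaker registration: store the features extracted from a sample."""
    PROFILES[name] = list(features)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def authenticate(claimed_name, features, threshold=0.95):
    """Speaker authentication: one-to-one comparison with the claimed profile."""
    return cosine_similarity(PROFILES[claimed_name], features) >= threshold


def identify(features, threshold=0.95):
    """Speaker identification: one-to-many search for the best-matching profile."""
    best = max(PROFILES, key=lambda name: cosine_similarity(PROFILES[name], features))
    if cosine_similarity(PROFILES[best], features) >= threshold:
        return best
    return None


# Hypothetical feature vectors for the three registered speakers.
register("Wanda", [0.9, 0.1, 0.4])
register("Fred", [0.2, 0.8, 0.5])
register("Joe", [0.5, 0.5, 0.9])

sample = [0.85, 0.15, 0.45]  # features from a new "Good morning"
speaker = identify(sample)
```

Authentication answers "is this Wanda?" against a single profile, while identification answers "who is this?" against all of them.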

Speaker Identification Technologies
• Advantages
– Are unobtrusive
– Are location independent
– Require no special equipment
– Replace passwords
• Disadvantages
– Sometimes fail
• Siblings with similar voice profiles
• Teenage male voice "break"
• Colds, sore throats, sore lips, etc.
• Variety of microphones
• Tape recordings

Statistical Language Model-based Recognition Technologies
• Widely available
– Call routing
– Speaker authentication
– Dictation
• Actively being researched
– Speaker emotion
– Voice pitch
– Age
– Gender
– Intoxication
– Stress
– Medical conditions (e.g., sleep apnea)

Speech Synthesis (Text-To-Speech, TTS)
[Diagram: text passes through five stages, each using its own resource]
1. Structure Analysis (Structure Rules)
2. Text Normalization (Abbreviation and Acronym Database)
3. Text-to-phoneme Conversion (Pronunciation Lexicon)
4. Prosody Analysis (Prosody Rules)
5. Waveform Production (Phoneme-to-sound Database)
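
As an illustration of the text normalization stage, the sketch below expands abbreviations from a small lookup table and spells out numbers below 100. The table contents and function names are assumptions for this example; a real abbreviation and acronym database is far larger and context sensitive (for instance, "St." can mean "street" or "saint").

```python
# Hypothetical abbreviation and acronym database.
ABBREVIATIONS = {
    "dr.": "doctor",
    "st.": "street",
    "tts": "text to speech",
}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text):
    """Replace abbreviations and small numbers with speakable words."""
    words = []
    for token in text.split():
        lowered = token.lower()
        if lowered in ABBREVIATIONS:
            words.append(ABBREVIATIONS[lowered])
        elif token.isdigit() and int(token) < 100:
            words.append(number_to_words(int(token)))
        else:
            words.append(token)
    return " ".join(words)


spoken = normalize("Dr. Smith lives at 42 Elm St.")
```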

Concatenated vs. Parameter-based Speech Synthesis
[Diagram: two ways to synthesize "red car"]
• Concatenated: isolate the phonemes of a recording such as "The dog barked" (dh eh d ao g b ah er k eh d), then concatenate phonemes (er ed d k ah er) to produce "red car"
• Parameter-based: generate speech for "red car" from voice parameters

Speech Synthesis Markup Language (SSML)
• Structure Analysis
– Markup support: paragraph, sentence
– Non-markup behavior: infer structure by automated text analysis
• Text Normalization
– Markup support: say-as for dates, times, etc.; sub for aliasing
– Non-markup behavior: automatically identify and convert constructs
• Text-to-Phoneme Conversion
– Markup support: phoneme, say-as
– Non-markup behavior: look up in pronunciation dictionary
• Prosody Analysis
– Markup support: emphasis, break, prosody
– Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax
• Waveform Production
– Markup support: voice, audio (audio icons, branding, advertising)

Pronunciation Specification
• Directly within the text: replace "creek" by "krik"
– Avoid, so the text can also be used for other purposes:
• Display on a screen
• Print
• Copy and paste to another document
• With the phoneme commands: creek [element markup lost in extraction]
• In the pronunciation lexicon: creek "krik"
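
The point about keeping the written text intact can be sketched with the SSML `phoneme` element, which carries the pronunciation in an attribute and leaves the visible word unchanged. The sentence, the `alphabet="ipa"` value, and the helper names below are assumptions for this example; only "creek" and "krik" come from the slide.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_prompt(before, word, pronunciation, after, alphabet="ipa"):
    """Wrap `word` in an SSML <phoneme> element inside a <speak> document."""
    speak = ET.Element("speak", {"version": "1.0", "xmlns": SSML_NS})
    speak.text = before
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": alphabet, "ph": pronunciation})
    phoneme.text = word
    phoneme.tail = after
    return speak


def display_text(speak):
    """Text a screen or printer would show: markup stripped, spelling preserved."""
    return "".join(speak.itertext())


prompt = build_prompt("We camped by the ", "creek", "krik", " last night.")
ssml = ET.tostring(prompt, encoding="unicode")
shown = display_text(prompt)
```

The synthesizer reads the `ph` attribute, while display, print, and copy-and-paste still see the ordinary spelling "creek".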

Prerecorded Messages vs. Speech Synthesis

| Prerecorded messages | Speech Synthesis (TTS) |
|---|---|
| Natural sounding | Artificial sounding |
| Easy to understand | May be difficult to understand |
| Static data | Computer-generated data |
| Tedious to record and tag | Easy to specify |

When to Use Speech Synthesis?
• During application creation
– Debug the dialog
– Replace with prerecorded messages before deployment
• Rendering information from dynamic database or news feed

Dialog Management
• Controlling the interchange of information between users and application
• Three dialog styles
1. Application-directed conversational dialogs
• Application asks questions to solicit answers and instructions from a user.
2. Human-directed conversational dialogs
• User asks a question or speaks a command; the computer responds.
3. Mixed-initiative dialogs
• User and application take turns driving conversations.

Three Dialog Styles
• Application-directed
– Application: What month?
– Caller: February
– Application: What day of the month?
– Caller: Twelve
– Application: What year?
– Caller: Nineteen ninety-seven
• Human-directed
– Caller: Set month to February
– Application: Month is February
– Caller: Set day to twelve
– Application: Day is twelve
– Caller: Set year to nineteen ninety-seven
– Application: Year is nineteen ninety-seven
• Mixed-initiative
– Application: What month?
– Caller: February twelve nineteen ninety-seven
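
The difference between application-directed and mixed-initiative dialogs can be sketched as a form with three slots. In the application-directed mode only the slot being asked for is filled; in the mixed-initiative mode every slot the caller mentions is filled. The function names are invented for this example, and callers are assumed to give the day and year as digits to keep the parsing short.

```python
import re

MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}
SLOT_ORDER = ["month", "day", "year"]
PROMPTS = {"month": "What month?", "day": "What day of the month?", "year": "What year?"}


def extract_slots(utterance):
    """Pull any month, day, or year out of one caller turn."""
    slots = {}
    for token in re.findall(r"[a-z]+|\d+", utterance.lower()):
        if token in MONTHS:
            slots["month"] = token
        elif token.isdigit() and len(token) == 4:
            slots["year"] = int(token)
        elif token.isdigit() and 1 <= int(token) <= 31:
            slots["day"] = int(token)
    return slots


def run_form(caller_turns, mixed_initiative):
    """Prompt for each empty slot in order; return the form and the prompts used."""
    form, prompts_played = {}, []
    turns = iter(caller_turns)
    while len(form) < len(SLOT_ORDER):
        slot = next(s for s in SLOT_ORDER if s not in form)
        prompts_played.append(PROMPTS[slot])
        reply = next(turns, None)
        if reply is None:  # caller stopped answering
            break
        heard = extract_slots(reply)
        if mixed_initiative:
            form.update(heard)          # accept everything the caller offered
        elif slot in heard:
            form[slot] = heard[slot]    # accept only the slot that was asked for
    return form, prompts_played


directed_form, directed_prompts = run_form(["February", "12", "1997"], mixed_initiative=False)
mixed_form, mixed_prompts = run_form(["February 12 1997"], mixed_initiative=True)
```

Both runs end with the same filled form, but the mixed-initiative run needs one prompt instead of three.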

VoiceXML 2.1
• XML format for specifying interactive voice dialogues between a human and a computer
– DTMF input and prerecorded voice as output
– Speech recognition and speech synthesis
– Video output to user (non-standard)
• Designed for Interactive Voice Response (IVR) applications using telephone
• Currently does not support external events, except [element name lost in extraction] and [element name lost in extraction]
• Requires a VoiceXML interpreter

Example of VoiceXML 2.1 Fragment
[VoiceXML listing; the element markup did not survive extraction]
• Languages labeled in the listing: Dialog Language (VoiceXML 2.1), Speech Synthesis Markup Language (SSML), Speech Recognition Grammar Specification (SRGS), Semantic Interpretation (SI)
• Prompt text: Which account savings or checking
• Grammar items: savings; checking; CD; certificate of deposit, tagged $ = "CD"

Example of VoiceXML 2.1 Fragment
• VoiceXML places the text recognized by the speech recognizer into the variable "account"
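
Since the fragment's markup is missing, the sketch below offers one plausible reconstruction and simulates how the recognized text ends up in the field variable. The element structure (`form`, `field`, `prompt`, `grammar`, `rule`, `one-of`, `item`, `tag`) follows the VoiceXML and SRGS specifications, but the exact attributes, the rule name, and the prompt punctuation are assumptions. The regular expression that reads the quoted value out of the tag is only a stand-in for a real semantic interpretation processor.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "v": "http://www.w3.org/2001/vxml",
    "g": "http://www.w3.org/2001/06/grammar",
}

# Assumed reconstruction of the slide's fragment.
VXML = """\
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <field name="account">
      <prompt>Which account, savings or checking?</prompt>
      <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
               type="application/srgs+xml" root="account_type" mode="voice">
        <rule id="account_type">
          <one-of>
            <item>savings</item>
            <item>checking</item>
            <item>CD</item>
            <item>certificate of deposit <tag>$ = "CD"</tag></item>
          </one-of>
        </rule>
      </grammar>
    </field>
  </form>
</vxml>
"""


def interpret(vxml_text, utterance):
    """Return {field name: value} for a matching utterance, else None."""
    root = ET.fromstring(vxml_text)
    field = root.find(".//v:field", NS)
    for item in field.iterfind(".//g:item", NS):
        spoken = (item.text or "").strip()
        if spoken.lower() != utterance.strip().lower():
            continue
        value = spoken
        tag = item.find("g:tag", NS)
        if tag is not None:
            match = re.search(r'"([^"]*)"', tag.text or "")
            if match:
                value = match.group(1)  # the tag overrides the spoken words
        return {field.get("name"): value}
    return None


result = interpret(VXML, "certificate of deposit")
```

Saying "certificate of deposit" and saying "CD" both yield the same value in "account", so the rest of the application can handle one canonical result.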

VoiceXML 2.1 Features
• Menus, forms, subdialogs
– <menu>, <form>, <subdialog>
• Inputs
– Speech recognition
– Recording
– Keypad
• Output
– Audio files