
  • Number of slides: 108

LINGUIST 180: Introduction to Computational Linguistics
Dan Jurafsky
Lecture 17: Text-to-Speech
LING 180 Autumn 2007

Outline
1) Arpabet
2) TTS Architectures
3) TTS Components
  • Text Analysis
    • Text Normalization
    • Homograph Disambiguation
    • Grapheme-to-Phoneme (Letter-to-Sound)
    • Intonation
  • Waveform Generation
    • Unit Selection
    • Diphones

Dave Barry on TTS
“And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us. (By "they", I mean computers; I doubt scientists will ever be able to talk to us.)”

ARPAbet
http://www.stanford.edu/class/linguist180/arpabet.html

ARPAbet Vowels

       b_d          ARPA
   1   bead         iy
   2   bid          ih
   3   bayed        ey
   4   bed          eh
   5   bad          ae
   6   bod(y)       aa
   7   bawd         ao
   8   Budd(hist)   uh
   9   bode         ow
  10   booed        uw
  11   bud          ah
  12   bird         er
  13   bide         ay
  14   bowed        aw
  15   Boyd         oy

Sounds from Ladefoged

Brief Historical Interlude
• Pictures and some text from Hartmut Traunmüller’s web site:
• http://www.ling.su.se/staff/hartmut/kemplne.htm
• Von Kempelen 1780 (b. Bratislava 1734, d. Vienna 1804)
• Leather resonator manipulated by the operator to try and copy vocal tract configuration during sonorants (vowels, glides, nasals)
• Bellows provided air stream, counterweight provided inhalation
• Vibrating reed produced periodic pressure wave

Von Kempelen:
• Small whistles controlled consonants
• Rubber mouth and nose; nose had to be covered with two fingers for non-nasals
• Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air
From Traunmüller’s web site

Modern TTS systems
• 1960’s: first full TTS: Umeda et al (1968)
• 1970’s: Joe Olive 1977, concatenation of linear-prediction diphones; Speak and Spell
• 1980’s: 1979 MITalk (Allen, Hunnicutt, Klatt)
• 1990’s-present: Diphone synthesis; Unit selection synthesis

2. Overview of TTS: Architectures of Modern Synthesis
• Articulatory Synthesis: Model movements of articulators and acoustics of vocal tract
• Formant Synthesis: Start with acoustics, create rules/filters to create each formant
• Concatenative Synthesis: Use databases of stored speech to assemble new utterances.
Text from Richard Sproat slides

Formant Synthesis
• Formant synthesizers were the most common commercial systems while computers were relatively underpowered.
• 1979 MITalk (Allen, Hunnicutt, Klatt)
• 1983 DECtalk system
  – The voice of Stephen Hawking

Concatenative Synthesis
• Used in all current commercial systems.
• Diphone Synthesis
  – Units are diphones; middle of one phone to middle of next.
  – Why? Middle of phone is steady state.
  – Record 1 speaker saying each diphone
• Unit Selection Synthesis
  – Larger units
  – Record 10 hours or more, so have multiple copies of each unit
  – Use search to find best sequence of units

TTS Demos (all are Unit-Selection)
• Festival: http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
• Cepstral: http://www.cepstral.com/cgi-bin/demos/general
• IBM: http://www306.ibm.com/software/pervasive/tech/demos/tts.shtml

Architecture
• The three types of TTS (Concatenative, Formant, Articulatory) only cover the segments+F0+duration to waveform part.
• A full system needs to go all the way from random text to sound.

Two steps
PG&E will file schedules on April 20.
• TEXT ANALYSIS: Text into intermediate representation
• WAVEFORM SYNTHESIS: From the intermediate representation into waveform


1. Text Normalization
Analysis of raw text into pronounceable words:
• Sentence Tokenization
• Text Normalization
  – Identify tokens in text
  – Chunk tokens into reasonably sized sections
  – Map tokens to words
  – Identify types for words

Rules for end-of-utterance detection
• A dot with one or two letters is an abbrev
• A dot with 3 cap letters is an abbrev.
• An abbrev followed by 2 spaces and a capital letter is an end-of-utterance
• Non-abbrevs followed by capitalized word are breaks
This fails for:
• Cog. Sci. Newsletter
• Lots of cases at end of line.
• Badly spaced/capitalized sentences
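
The four rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any real system: the function names `is_abbrev` and `find_breaks` are made up here, a "dot" is taken to mean a token ending in a period, and a "capitalized word" is judged by its first character only.

```python
import re

def is_abbrev(token):
    """Rules 1 and 2: a dotted token with 1-2 letters, or exactly 3 capitals."""
    if not token.endswith("."):
        return False
    stem = token[:-1]
    if not stem.isalpha():
        return False
    return len(stem) <= 2 or (len(stem) == 3 and stem.isupper())

def find_breaks(text):
    """Return the character offsets just after each dot judged to end an utterance."""
    breaks = []
    # A dotted token, the whitespace after it, and (as lookahead) the next character.
    for m in re.finditer(r"(\S+\.)(\s+)(?=(\S))", text):
        token, gap, nxt = m.group(1), m.group(2), m.group(3)
        if not nxt.isupper():
            continue  # every rule needs a following capital letter
        if is_abbrev(token):
            if gap == "  ":  # rule 3: abbrev + exactly 2 spaces + capital
                breaks.append(m.end(1))
        else:  # rule 4: non-abbrev followed by a capitalized word
            breaks.append(m.end(1))
    return breaks
```

Running `find_breaks` on "the Cog. Sci. Newsletter" reports two breaks, which is exactly the failure case listed on the slide: "Cog." and "Sci." are three letters but not all capitals, so the rules treat them as ordinary words.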

Determining if a word is end-of-utterance: a Decision Tree

Learning Decision Trees
• DTs are rarely built by hand
• Hand-building only possible for very simple features, domains
• Lots of algorithms for DT induction

Next Step: Identify Types of Tokens, and Convert Tokens to Words
Pronunciation of numbers often depends on type:
• 1776 date: seventeen seventy six.
• 1776 phone number: one seven seven six
• 1776 quantifier: one thousand seven hundred (and) seventy six
• 25 day: twenty-fifth
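
The readings of 1776 and 25 above can be reproduced with a small sketch. All function names here are illustrative, the cardinal reader only covers 0 to 9999, it omits the optional "and", and the year reader is deliberately naive (it would read 1905 as "nineteen five").

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}

def two_digits(n):
    """Read 0..99 as a cardinal."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def cardinal(n):
    """Read 0..9999 as a quantity."""
    if n < 100:
        return two_digits(n)
    words = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        words.append(ONES[thousands] + " thousand")
    if hundreds:
        words.append(ONES[hundreds] + " hundred")
    if rest:
        words.append(two_digits(rest))
    return " ".join(words)

def ordinal(n):
    """Turn the last word of the cardinal into its ordinal form."""
    words = cardinal(n).split()
    last = words[-1]
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return " ".join(words[:-1] + [last])

def as_year(token):
    """Read a four-digit token as two pairs of digits."""
    return two_digits(int(token[:2])) + " " + two_digits(int(token[2:]))

def as_digits(token):
    """Read a token one digit at a time."""
    return " ".join(ONES[int(d)] for d in token)
```

The point of the sketch is that the same string "1776" goes through a different function depending on the token type decided earlier.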

Classify token into 1 of 20 types
• EXPN: abbrev, contractions (adv, N.Y., mph, gov’t)
• LSEQ: letter sequence (CIA, D.C., CDs)
• ASWD: read as word, e.g. CAT, proper names
• MSPL: misspelling
• NUM: number (cardinal) (12, 45, 1/2, 0.6)
• NORD: number (ordinal) e.g. May 7, 3rd, Bill Gates II
• NTEL: telephone (or part) e.g. 212-555-4523
• NDIG: number as digits e.g. Room 101
• NIDE: identifier, e.g. 747, 386, I5, PC110
• NADDR: number as street address, e.g. 5000 Pennsylvania
• NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc
• SLNT: not spoken (KENT*REALTY)

More about the types
4 categories for alphabetic sequences:
• EXPN: expand to full word or word seq (fplc for fireplace, NY for New York)
• LSEQ: say as letter sequence (IBM)
• ASWD: say as standard word (either OOV or acronyms)
• MSPL: misspelling
5 main ways to read numbers:
• Cardinal (quantities)
• Ordinal (dates)
• String of digits (phone numbers)
• Pair of digits (years)
• Trailing unit: serial until last non-zero digit: 8765000 is “eight seven six five thousand” (some phone numbers, long addresses)
• But still exceptions: (947-3030, 830-7056)
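
The "trailing unit" reading is the least familiar of the five, so here is a minimal sketch of it. The function name and the small table of units are assumptions made for illustration; tokens whose run of trailing zeros has no unit name simply fall back to a digit string.

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
# Number of trailing zeros -> unit word (an illustrative, incomplete table).
UNITS = {2: "hundred", 3: "thousand", 6: "million"}

def trailing_unit(token):
    """Read digits serially up to the last non-zero digit, then name the unit."""
    stem = token.rstrip("0")
    zeros = len(token) - len(stem)
    if not stem or zeros not in UNITS:
        return " ".join(DIGITS[int(d)] for d in token)
    return " ".join(DIGITS[int(d)] for d in stem) + " " + UNITS[zeros]
```

The exceptions on the slide (947-3030, 830-7056) are not handled here; they show why a purely pattern-based reader is not enough.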

Finally: expanding NSW (non-standard word) Tokens
Type-specific heuristics
• ASWD expands to itself
• LSEQ expands to list of words, one for each letter
• NUM expands to string of words representing cardinal
• NYER expands to 2 pairs of NUM digits…
• NTEL: string of digits with silence for punctuation
• Abbreviation:
  – use abbrev lexicon if it’s one we’ve seen
  – Else use training set to know how to expand
  – Cute idea: if “eat in kit” occurs in text, “eat-in kitchen” will also occur somewhere.
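
A sketch of how type-specific expansion might be dispatched for three of the types above. The `expand` function, the `<sil>` pause marker, and the choice to represent each letter by its uppercase form (rather than a spelled-out letter name) are all assumptions made for illustration.

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
SILENCE = "<sil>"  # hypothetical marker for a short pause

def expand_lseq(token):
    """LSEQ: one word per letter; punctuation such as dots is dropped."""
    return [ch.upper() for ch in token if ch.isalpha()]

def expand_ntel(token):
    """NTEL: digits one at a time, with a pause wherever punctuation occurs."""
    words = []
    for ch in token:
        if ch.isdigit():
            words.append(DIGITS[int(ch)])
        elif words and words[-1] != SILENCE:
            words.append(SILENCE)
    return words

def expand(token, token_type):
    """Dispatch on the token type chosen by the classifier."""
    if token_type == "ASWD":
        return [token.lower()]  # expands to itself
    if token_type == "LSEQ":
        return expand_lseq(token)
    if token_type == "NTEL":
        return expand_ntel(token)
    raise ValueError("no expansion rule sketched for type " + token_type)
```

For example, `expand("212-555-4523", "NTEL")` yields the ten digit words with a `<sil>` after the third and sixth.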

2. Homograph disambiguation
19 most frequent homographs, from Liberman and Church

  use        319     survey     91
  increase   230     project    90
  close      215     separate   87
  record     195     present    80
  house      150     read       72
  contract   143     subject    68
  lead       131     rebel      48
  live       130     finance    46
  lives      105     estimate   46
  protest     94

Not a huge problem, but still important

POS Tagging for homograph disambiguation
Many homographs can be distinguished by POS

  use     y uw s      y uw z
  close   k l ow s    k l ow z
  house   h aw s      h aw z
  live    l ay v      l ih v

  REcord      reCORD
  INsult      inSULT
  OBject      obJECT
  OVERflow    overFLOW
  DIScount    disCOUNT
  CONtent     conTENT
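
Once a tagger has assigned a part of speech, disambiguation can be as simple as a table lookup. The sketch below uses the four segmental pairs from the slide; the coarse tag names (`NOUN`, `VERB`, `ADJ`) and the function name are assumptions, and a real system would get the tag from a POS tagger rather than from the caller.

```python
# (word, coarse POS tag) -> ARPAbet pronunciation
HOMOGRAPHS = {
    ("use", "NOUN"): "y uw s",
    ("use", "VERB"): "y uw z",
    ("close", "ADJ"): "k l ow s",
    ("close", "VERB"): "k l ow z",
    ("house", "NOUN"): "h aw s",
    ("house", "VERB"): "h aw z",
    ("live", "ADJ"): "l ay v",
    ("live", "VERB"): "l ih v",
}

def pronounce(word, pos, default=None):
    """Return the pronunciation for this word under this POS tag, if listed."""
    return HOMOGRAPHS.get((word.lower(), pos), default)
```

Words not in the table return `default`, which is where the ordinary dictionary or letter-to-sound component would take over.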

3. Letter-to-Sound: Getting from words to phones
Two methods:
• Dictionary-based
• Rule-based (Letter-to-sound=LTS)
Early systems, all LTS
MITalk was radical in having huge 10K word dictionary
Now systems use a combination

Pronunciation Dictionaries:
CMU dictionary: 127K words
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Some problems:
• Has errors
• Only American pronunciations
• No syllable boundaries
• Doesn’t tell us which pronunciation to use for which homographs
  – (no POS tags)
• Doesn’t distinguish case
  – The word US has 2 pronunciations
    § [AH1 S] and [Y UW1 EH1 S]

Pronunciation Dictionaries:
UNISYN dictionary: 110K words (Fitt 2002)
http://www.cstr.ed.ac.uk/projects/unisyn/
Benefits:
• Has syllabification, stress, some morphological boundaries
• Pronunciations can be read off in
  – General American
  – RP British
  – Australia
  – Etc
(Other dictionaries like CELEX not used because too small, British-only)

Dictionaries aren’t sufficient
Unknown words (= OOV = “out of vocabulary”)
• Increase with the (sqrt of) number of words in unseen text
• Black et al (1998): OALD on 1st section of Penn Treebank:
• Out of 39923 word tokens,
  – 1775 tokens were OOV: 4.6% (943 unique types):

    names          1360    76.6%
    unknown         351    19.8%
    typos/other      64     3.6%

So commercial systems have 4-part system:
• Big dictionary
• Names handled by special routines
• Acronyms handled by special routines (previous lecture)
• Machine learned g2p algorithm for other unknown words

Names
• Big problem area is names
• Names are common
  – 20% of tokens in typical newswire text will be names
  – 1987 Donnelly list (72 million households) contains about 1.5 million names
  – Personal names: McArthur, D’Angelo, Jiminez, Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang, Nguyen
  – Company/Brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte, Aaon, Idexx Labs, Bebe

Names
Methods:
• Can do morphology (Walters -> Walter, Lucasville)
• Can write stress-shifting rules (Jordan -> Jordanian)
• Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
• Liberman and Church: for 250K most common names, got 212K (85%) from these modified-dictionary methods, used LTS for rest.
• Can do automatic country detection (from letter trigrams) and then do country-specific rules
• Can train g2p system specifically on names
  – Or specifically on types of names (brand names, Russian names, etc)

Acronyms
• We saw above
• Use machine learning to detect acronyms
  – EXPN
  – ASWORD
  – LETTERS
• Use acronym dictionary, hand-written rules to augment

Letter-to-Sound Rules
Earliest algorithms: handwritten Chomsky+Halle-style rules
Festival version of such LTS rules:
  ( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
Example:
  ( # [ c h ] C = k )
  ( # [ c h ] = ch )
• # denotes beginning of word
• C means all consonants
• Rules apply in order
  – “christmas” pronounced with [k]
  – But word with ch followed by non-consonant pronounced [ch]
  – E.g., “choice”
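
To make "rules apply in order" concrete, here is a small Python interpreter for rules of this shape. It is a sketch and not Festival's own implementation: contexts are limited to a single symbol, `y` is counted as a consonant, and letters that no rule covers are passed through unchanged, which a real rule set would never allow.

```python
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

# (left context, letters to rewrite, right context, output phones), tried in order.
RULES = [
    ("#", "ch", "C", ["k"]),   # ( # [ c h ] C = k )
    ("#", "ch", "", ["ch"]),   # ( # [ c h ] = ch )
]

def matches_symbol(symbol, ch):
    """'#' matches a word boundary, 'C' any consonant, anything else itself."""
    if symbol == "#":
        return ch == "#"
    if symbol == "C":
        return ch in CONSONANTS
    return symbol == ch

def apply_rules(word):
    padded = "#" + word.lower() + "#"
    phones = []
    i = 1
    while i < len(padded) - 1:
        for left, items, right, out in RULES:
            j = i + len(items)
            if padded[i:j] != items:
                continue
            if left and not matches_symbol(left, padded[i - 1]):
                continue
            if right and not matches_symbol(right, padded[j]):
                continue
            phones.extend(out)  # first matching rule wins
            i = j
            break
        else:
            phones.append(padded[i])  # no rule matched: pass the letter through
            i += 1
    return phones
```

Because the more specific rule is listed first, "christmas" starts with `k`, while "choice" falls through to the second rule and starts with `ch`.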

Stress rules in hand-written LTS
English famously evil: one from Allen et al 1987
Where X must contain all prefixes:
• Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final syllable containing a short vowel and 0 or more consonants (e.g. difficult)
• Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final vowel (e.g. oregano)
• etc

Modern method: Learning LTS rules automatically
• Induce LTS from a dictionary of the language
• Black et al. 1998
• Applied to English, German, French
• Two steps:
  – alignment
  – (CART-based) rule-induction

Alignment

  Letters:  c   h   e   c   k   e   d
  Phones:   ch  _   eh  _   k   _   t

Black et al Method 1:
• First scatter epsilons in all possible ways to cause letters and phones to align
• Then collect stats for P(phone|letter) and select best to generate new stats
• This is iterated a number of times until it settles (5-6)
• This is the EM (expectation maximization) algorithm
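
The first step, scattering epsilons in all possible ways, can be sketched by choosing which letter slots receive a phone. This covers only the enumeration, not the EM re-estimation, and it assumes each letter maps to at most one phone and that a word never has more phones than letters; the function name is made up.

```python
from itertools import combinations

def all_alignments(letters, phones, eps="_"):
    """Every way to place the phones, in order, onto the letter positions."""
    n, k = len(letters), len(phones)
    result = []
    for slots in combinations(range(n), k):
        row = [eps] * n  # unfilled positions stay epsilon
        for slot, phone in zip(slots, phones):
            row[slot] = phone
        result.append(row)
    return result
```

For "checked" with phones ch eh k t there are 35 candidate alignments (7 choose 4), and the alignment shown on the slide is one of them. Counting letter/phone pairs over all candidates for every dictionary entry gives the initial statistics for P(phone|letter).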

Alignment
Black et al method 2

Hand specify which letters can be rendered as which phones
• C goes to k/ch/s/sh
• W goes to w/v/f, etc
An actual list:
Once mapping table is created, find all valid alignments, find p(letter|phone), score all alignments, take best

Alignment
Some alignments will turn out to be really bad. These are just the cases where pronunciation doesn’t match letters:

  Dept          d ih p aa r t m ah n t
  CMU           s iy eh m y uw
  Lieutenant    l eh f t eh n ax n t (British)

Also foreign words
These can just be removed from alignment training

Building CART trees
Build a CART tree for each letter in alphabet (26 plus accented) using context of +-3 letters

  # # # c h e c -> ch
  c h e c k e d -> _
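
The training instances for the trees can be produced from an aligned word with a sliding window. The sketch below assumes the alignment has already been done (one phone or epsilon per letter) and pads the word with `#`; the function name is illustrative.

```python
def letter_instances(word, phones, width=3, pad="#"):
    """One (context window, phone) training pair per letter of an aligned word."""
    if len(word) != len(phones):
        raise ValueError("word and phones must already be aligned one-to-one")
    padded = pad * width + word + pad * width
    instances = []
    for i, phone in enumerate(phones):
        window = padded[i:i + 2 * width + 1]  # the letter plus `width` on each side
        instances.append((" ".join(window), phone))
    return instances
```

For "checked" aligned as ch _ eh _ k _ t, the first instance is `("# # # c h e c", "ch")` and the fourth is `("c h e c k e d", "_")`, matching the two lines on the slide. The middle symbol of each window is the letter being predicted, so the instances can be grouped by that letter to train one tree per letter.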

Add more features
Even more: for French liaison, we need to know what the next word is, and whether it starts with a vowel
French six
• [s iy s] in j’en veux six
• [s iy z] in six enfants
• [s iy] in six filles

Prosody: from words+phones to boundaries, accent, F0, duration
• Prosodic phrasing
  – Need to break utterances into phrases
  – Punctuation is useful, not sufficient
• Accents:
  – Predictions of accents: which syllables should be accented
  – Realization of F0 contour: given accents/tones, generate F0 contour
• Duration:
  – Predicting duration of each phone

Defining Intonation
Ladd (1996) “Intonational phonology”
“The use of suprasegmental phonetic features to convey sentence-level pragmatic meanings”
• Suprasegmental = above and beyond the segment/phone
  – F0
  – Intensity (energy)
  – Duration
• Sentence-level pragmatic meanings: i.e. meanings that apply to phrases or utterances as a whole, not lexical stress, not lexical tone.

Three aspects of prosody
• Prominence: some syllables/words are more prominent than others
• Structure/boundaries: sentences have prosodic structure
  – Some words group naturally together
  – Others have a noticeable break or disjuncture between them
• Tune: the intonational melody of an utterance.
From Ladd (1996)

Prosodic Prominence: Pitch Accents
A: What types of foods are a good source of vitamins?
B1: Legumes are a good source of VITAMINS.
B2: LEGUMES are a good source of vitamins.
• Prominent syllables are:
  • Louder
  • Longer
  • Have higher F0 and/or sharper changes in F0 (higher F0 velocity)
Slide from Jennifer Venditti

Stress vs. accent (2)
The speaker decides to make the word vitamin more prominent by accenting it.
Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.

Which word receives an accent?
It depends on the context. For example, the ‘new’ information in the answer to a question is often accented, while the ‘old’ information usually is not.
Q1: What types of foods are a good source of vitamins?
A1: LEGUMES are a good source of vitamins.
Q2: Are legumes a source of vitamins?
A2: Legumes are a GOOD source of vitamins.
Q3: I’ve heard that legumes are healthy, but what are they a good source of?
A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti

Factors in accent prediction
Part of speech:
• Content words are usually accented
• Function words are rarely accented
  – of, for, in, on, that, the, a, an, no, to, and, but, or, will, may, would, can, her, is, their, its, our, there, is, am, are, was, were, etc

Complex Noun Phrase Structure
Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94.
• Proper Names, stress on right-most word
  – New York CITY; Paris, FRANCE
• Adjective-Noun combinations, stress on noun
  – Large HOUSE, red PEN, new NOTEBOOK
• Noun-Noun compounds: stress left noun
  – HOTdog (food) versus HOT DOG (overheated animal)
  – WHITE house (place) versus WHITE HOUSE (made of stucco)
• examples: MEDICAL Building, APPLE cake, cherry PIE.
• What about: Madison avenue, Park street ???
• Some Rules:
  – Furniture+Room -> RIGHT (e.g., kitchen TABLE)
  – Proper-name + Street -> LEFT (e.g. PARK street)

State of the art
• Hand-label large training sets
• Use CART, SVM, CRF, etc to predict accent
• Lots of rich features from context (parts of speech, syntactic structure, information structure, contrast, etc.)
• Classic lit: Hirschberg, Julia. 1993. Pitch Accent in context: predicting intonational prominence from text. Artificial Intelligence 63, 305-340

Levels of prominence
• Most phrases have more than one accent
• The last accent in a phrase is perceived as more prominent
  – Called the Nuclear Accent
• Emphatic accents like nuclear accent often used for semantic purposes, such as indicating that a word is contrastive, or the semantic focus.
  – The kind of thing you represent via ***s in IM, or capitalized letters
  – ‘I know SOMETHING interesting is sure to happen,’ she said to herself.
• Can also have words that are less prominent than usual
  – Reduced words, especially function words.
• Often use 4 classes of prominence:
  – emphatic accent, pitch accent, unaccented, reduced

Yes-No question
are legumes a good source of VITAMINS
Rise from the main accent to the end of the sentence.
Slide from Jennifer Venditti

‘Surprise-redundancy’ tune
[How many times do I have to tell you...] legumes are a good source of vitamins
Low beginning followed by a gradual rise to a high at the end.
Slide from Jennifer Venditti

‘Contradiction’ tune
“I’ve heard that linguini is a good source of vitamins.”
linguini isn’t a good source of vitamins [...how could you think that?]
Sharp fall at the beginning, flat and low, then rising at the end.
Slide from Jennifer Venditti

Duration
• Simplest: fixed size for all phones (100 ms)
• Next simplest: average duration for that phone (from training data). Samples from SWBD in ms:

    aa   118      b    68
    ax    59      d    68
    ay   138      dh   44
    eh    87      f    90
    ih    77      g    66

• Next Simplest: add in phrase-final and initial lengthening plus stress:
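
The first two duration models amount to a table lookup with a fixed fallback. The sketch below uses only the ten SWBD averages from the slide and the 100 ms default; the function name is an assumption, and the lengthening and stress adjustments of the third model are left out because the slide does not give their values.

```python
# Average phone durations in ms (the SWBD samples listed above).
MEAN_MS = {"aa": 118, "ax": 59, "ay": 138, "eh": 87, "ih": 77,
           "b": 68, "d": 68, "dh": 44, "f": 90, "g": 66}
FIXED_MS = 100  # the "simplest" model: same duration for every phone

def phone_durations(phones):
    """Per-phone average duration, falling back to the fixed size if unseen."""
    return [MEAN_MS.get(p, FIXED_MS) for p in phones]
```

For "bed" as b eh d this gives 68 + 87 + 68 = 223 ms, compared with 300 ms under the fixed-size model.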

Intermediate representation: using Festival
Do you really want to see all of it?

Waveform Synthesis
Given:
• String of phones
• Prosody
  – Desired F0 for entire utterance
  – Duration for each phone
  – Stress value for each phone, possibly accent value
Generate:
• Waveforms

Diphone TTS architecture
Training:
• Choose units (kinds of diphones)
• Record 1 speaker saying 1 example of each diphone
• Mark the boundaries of each diphone,
  – cut each diphone out and create a diphone database
Synthesizing an utterance:
• grab relevant sequence of diphones from database
• Concatenate the diphones, doing slight signal processing at boundaries
• use signal processing to change the prosody (F0, energy, duration) of selected sequence of diphones

Diphones
Mid-phone is more stable than edge:

Diphones
• mid-phone is more stable than edge
• Need O(phone^2) number of units
  – Some combinations don’t exist (hopefully)
  – ATT (Olive et al. 1998) system had 43 phones
    § 1849 possible diphones
    § Phonotactics ([h] only occurs before vowels), don’t need to keep diphones across silence
    § Only 1172 actual diphones
  – May include stress, consonant clusters
    § So could have more
  – Lots of phonetic knowledge in design
• Database relatively small (by today’s standards)
  – Around 8 megabytes for English (16 kHz, 16 bit)
Slide from Richard Sproat
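
Two quick checks on the numbers above, written as a sketch. The function names, the `pau` boundary label, and the reading of "8 megabytes" as 8,000,000 bytes of uncompressed mono audio are assumptions.

```python
def to_diphones(phones, boundary="pau"):
    """Turn a phone string into the diphone names needed to synthesize it."""
    padded = [boundary] + list(phones) + [boundary]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]

def seconds_of_audio(n_bytes, sample_rate, bits_per_sample):
    """How much uncompressed mono audio fits in n_bytes."""
    return n_bytes / (sample_rate * bits_per_sample / 8)
```

A word with four phones needs five diphones, one more than the number of phones, because of the two boundary units. The inventory size is 43 × 43 = 1849, as on the slide, and under the assumptions above 8 megabytes at 16 kHz and 16 bits holds 250 seconds, a little over four minutes of speech.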

Voice
• Speaker: called a voice talent
• Diphone database: called a voice

Prosodic Modification
• Modifying pitch and duration independently
• Changing sample rate modifies both:
  – Chipmunk speech
• Duration: duplicate/remove parts of the signal
• Pitch: resample to change pitch
Text from Alan Black

Speech as Short Term signals
Alan Black

Duration modification
Duplicate/remove short term signals
Slide from Richard Sproat

Duration modification
Duplicate/remove short term signals
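
Treating the signal as a list of short-term frames, duration change reduces to choosing which frames to repeat or skip. This sketch shows only that index mapping; it assumes the frames have already been cut out (for example one per pitch period) and leaves out the windowing and overlap-add needed to join them smoothly. The function name is made up.

```python
def change_duration(frames, factor):
    """Stretch (factor > 1) or shrink (factor < 1) by duplicating/removing frames."""
    if not frames:
        return []
    n_out = max(1, round(len(frames) * factor))
    last = len(frames) - 1
    # Output frame i is taken from the input frame at the same relative position.
    return [frames[min(last, int(i / factor))] for i in range(n_out)]
```

With frames a b c d, a factor of 2 gives a a b b c c d d and a factor of 0.5 gives a c. Each frame keeps its own length, so the pitch inside a frame is untouched; only the number of frames changes.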

Pitch Modification
Move short-term signals closer together/further apart
Slide from Richard Sproat

TD-PSOLA ™
• Time-Domain Pitch Synchronous Overlap and Add
• Patented by France Telecom (CNET)
• Very efficient
  – No FFT (or inverse FFT) required
• Can modify Hz up to two times or by half
Slide from Richard Sproat

TD-PSOLA ™
• Time-Domain Pitch Synchronous Overlap and Add
• Patented by France Telecom (CNET)
• Windowed
• Pitch-synchronous
• Overlap-and-add
• Very efficient
• Can modify Hz up to two times or by half

Unit Selection Synthesis
• Generalization of the diphone intuition
• Larger units
  – From diphones to sentences
• Many many copies of each unit
  – 10 hours of speech instead of 1500 diphones (a few minutes of speech)

Unit Selection Intuition
• Given a big database
• Find the unit in the database that is the best to synthesize some target segment
• What does “best” mean?
  – “Target cost”: Closest match to the target description, in terms of
    § Phonetic context
    § F0, stress, phrase position
  – “Join cost”: Best join with neighboring units
    § Matching formants + other spectral characteristics
    § Matching energy
    § Matching F0

Targets and Target Costs
• A measure of how well a particular unit in the database matches the internal representation produced by the prior stages
• Features, costs, and weights
• Examples:
  – /ih-t/ from stressed syllable, phrase internal, high F0, content word
  – /n-t/ from unstressed syllable, phrase final, low F0, content word
  – /dh-ax/ from unstressed syllable, phrase initial, high F0, from function word “the”
Slide from Paul Taylor

Target Costs
• Comprised of k subcosts
  – Stress
  – Phrase position
  – F0
  – Phone duration
  – Lexical identity
• Target cost for a unit:
Slide from Paul Taylor

Join (Concatenation) Cost
• Measure of smoothness of join
• Measured between two database units (target is irrelevant)
• Features, costs, and weights
• Comprised of k subcosts:
  – Spectral features
  – F0
  – Energy
• Join cost:
Slide from Paul Taylor

Total Costs
• Hunt and Black 1996
• We now have weights (per phone type) for features set between target and database units
• Find best path of units through database that minimize:
• Standard problem solvable with Viterbi search with beam width constraint for pruning
Slide from Paul Taylor
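
The search can be sketched as a plain Viterbi pass over the candidate units for each target, minimizing the sum of target costs plus the join costs between neighbours. This is an illustration only: the function name and the toy cost functions in the usage note are assumptions, the costs are passed in as callables rather than computed from weighted subcosts, and there is no beam pruning.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """candidates[i] lists the database units that could realize targets[i].

    Returns (best unit sequence, its total cost).
    """
    # best[i][k] = (cheapest cost of any path ending in candidates[i][k], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Pick the cheapest final unit, then follow backpointers to recover the path.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    path.reverse()
    return path, total
```

As a toy usage, let each unit be a `(name, f0)` pair, each target a desired F0, the target cost the absolute F0 difference, and the join cost twice the F0 jump between neighbours. With targets `[100, 120]` and candidates `[[("a1", 100), ("a2", 118)], [("b1", 120), ("b2", 90)]]`, the search picks `a2` then `b1` at total cost 22: `a1` matches its target perfectly, but the smoother join makes `a2` cheaper overall.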


Unit Selection Summary
Advantages
• Quality is far superior to diphones
• Natural prosody selection sounds better
Disadvantages:
• Quality can be very bad in places
  – HCI problem: mix of very good and very bad is quite annoying
• Synthesis is computationally expensive
• Can’t synthesize everything you want:
  – Diphone technique can move emphasis
  – Unit selection gives good (but possibly incorrect) result
Slide from Richard Sproat

Evaluation of TTS
Intelligibility Tests
• Diagnostic Rhyme Test (DRT)
  – Humans do listening identification choice between two words differing by a single phonetic feature
    § Voicing, nasality, sustenation, sibilation
  – 96 rhyming pairs
  – Veal/feel, meat/beat, vee/bee, zee/thee, etc
    § Subject hears “veal”, chooses either “veal” or “feel”
    § Subject also hears “feel”, chooses either “veal” or “feel”
  – % of right answers is intelligibility score.
Overall Quality Tests
• Have listeners rate speech on a scale from 1 (bad) to 5 (excellent)
Preference Tests (prefer A, prefer B)
Huang, Acero, Hon
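
Both scores are simple to compute once the listener responses are collected. A minimal sketch, with made-up function names and the assumption that each DRT trial is recorded as a (word played, word chosen) pair:

```python
def drt_score(trials):
    """Percentage of trials where the listener chose the word that was played."""
    correct = sum(1 for played, chosen in trials if played == chosen)
    return 100.0 * correct / len(trials)

def mean_rating(ratings):
    """Average of listener ratings on the 1 (bad) to 5 (excellent) scale."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)
```

Four trials with one confusion of "feel" for "veal" give an intelligibility score of 75.0.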

Recent stuff
Problems with Unit Selection Synthesis
• Can’t modify signal (mixing modified and unmodified sounds bad)
• But database often doesn’t have exactly what you want
Solution: HMM (Hidden Markov Model) Synthesis
• Won the last TTS bakeoff.
• Sounds less natural to researchers
• But naïve subjects preferred it
• Has the potential to improve on both diphone and unit selection.

HMM Synthesis
• Unit selection (Roger)
• HMM (Roger)
• Unit selection (Nina)
• HMM (Nina)

Summary
1) Arpabet
2) TTS Architectures
3) TTS Components
  • Text Analysis
    • Text Normalization
    • Homograph Disambiguation
    • Grapheme-to-Phoneme (Letter-to-Sound)
    • Intonation
  • Waveform Generation
    • Diphones
    • Unit Selection

Optional: Articulatory Phonetics

Speech Production Process
• Respiration:
  – We (normally) speak while breathing out. Respiration provides airflow. “Pulmonic egressive airstream”
• Phonation
  – Airstream sets vocal folds in motion. Vibration of vocal folds produces sounds. Sound is then modulated by:
• Articulation and Resonance
  – Shape of vocal tract, characterized by:
  – Oral tract
    § Teeth, soft palate (velum), hard palate
    § Tongue, lips, uvula
  – Nasal tract
Text adapted from Sharon Rose

Sagittal section of the vocal tract (Techmer 1880)
Labels: Nasal Cavity, Pharynx, Vocal Folds (within the Larynx), Trachea, Lungs
Text copyright J. J. Ohala, Sept 2001, from Sharon Rose slide

From Mark Liberman’s website, from Ultimate Visual Dictionary

From Mark Liberman’s Web Site, from Language Files (7th ed)

Larynx and Vocal Folds
The Larynx (voice box)
• A structure made of cartilage and muscle
• Located above the trachea (windpipe) and below the pharynx (throat)
• Contains the vocal folds
• (adjective for larynx: laryngeal)
Vocal Folds (older term: vocal cords)
• Two bands of muscle and tissue in the larynx
• Can be set in motion to produce sound (voicing)
Text from slides by Sharon Rose, UCSD LING 111 handout

The larynx, external structure, from front
Figure thanks to John Coleman

Vertical slice through larynx, as seen from back
Figure thanks to John Coleman

Voicing
• Air comes up from the lungs
• It forces its way through the vocal cords, pushing them open (2, 3, 4 in the figure)
• This causes air pressure in the glottis to fall, since:
  – when gas runs through a constricted passage, its velocity increases (Venturi tube effect)
  – this increase in velocity results in a drop in pressure (Bernoulli principle)
• Because of the drop in pressure, the vocal cords snap together again (6-10 in the figure)
• Single cycle: ~1/100 of a second
Figure & text from John Coleman’s web site
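The aerodynamic argument on this slide can be put into rough numbers. The following is a minimal sketch, assuming steady, incompressible flow; the air speed and the two cross-sectional areas are made-up illustrative values, not measurements from the slides, and the function names are hypothetical.

```python
AIR_DENSITY = 1.2  # kg/m^3, approximate density of air (assumed value)


def venturi_velocity(v_wide: float, area_wide: float, area_narrow: float) -> float:
    """Continuity: the same volume of air per second passes both sections,
    so the flow speeds up where the passage is narrower."""
    return v_wide * area_wide / area_narrow


def bernoulli_pressure_drop(v_wide: float, v_narrow: float,
                            rho: float = AIR_DENSITY) -> float:
    """Bernoulli: pressure falls by 0.5 * rho * (v_narrow^2 - v_wide^2), in Pa."""
    return 0.5 * rho * (v_narrow ** 2 - v_wide ** 2)


def cycles_per_second(period_s: float) -> float:
    """Vibration rate (Hz) of the vocal folds for a given cycle duration."""
    return 1.0 / period_s


if __name__ == "__main__":
    v_trachea = 1.0  # m/s, hypothetical speed below the glottis
    # Hypothetical areas: the glottis is ten times narrower than the trachea.
    v_glottis = venturi_velocity(v_trachea, area_wide=2.0e-4, area_narrow=0.2e-4)
    print("speed in glottis (m/s):", v_glottis)
    print("pressure drop (Pa):", bernoulli_pressure_drop(v_trachea, v_glottis))
    # A cycle of about 1/100 s, as on the slide, means about 100 cycles per second.
    print("cycles per second:", cycles_per_second(1 / 100))
```

The last line is the only figure taken from the slide: a cycle lasting about 1/100 of a second corresponds to a vibration rate of about 100 Hz.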

Voicelessness
• When the vocal cords are open, air passes through unobstructed
• Voiceless sounds: p/t/k/s/f/sh/th/ch
• If the air moves very quickly, the turbulence causes a different kind of phonation: whisper

Vocal folds open during breathing
From Mark Liberman’s web site, from Ultimate Visual Dictionary

Vocal Fold Vibration
UCLA Phonetics Lab Demo

Consonants and Vowels
• Consonants: phonetically, sounds with audible noise produced by a constriction
• Vowels: phonetically, sounds with no audible noise produced by a constriction
• (It’s more complicated than this, since we have to consider syllabic function, but this will do for now)
Text adapted from John Coleman

Place of Articulation
• Consonants are classified according to the location where the airflow is most constricted. This is called the place of articulation.
• Three major kinds of place of articulation:
  – Labial (with lips)
  – Coronal (using tip or blade of tongue)
  – Dorsal (using back of tongue)

Places of articulation
[Figure labels: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]
Figure thanks to Jennifer Venditti

Labial place
[Figure labels: bilabial, labiodental]
• Bilabial: p, b, m
• Labiodental: f, v
Figure thanks to Jennifer Venditti

Coronal place
[Figure labels: dental, alveolar, post-alveolar/palatal]
• Dental: th/dh
• Alveolar: t/d/s/z/l
• Post-alveolar/palatal: sh/zh/y
Figure thanks to Jennifer Venditti

Dorsal place
[Figure labels: velar, uvular, pharyngeal]
• Velar: k/g/ng
Figure thanks to Jennifer Venditti
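The place classes from the last few slides can be written down as a small lookup table. This is a minimal sketch that includes only the ARPAbet consonants the slides actually list; the names `PLACES`, `PLACE_OF` and `place_of` are hypothetical, and any phone the slides do not mention (vowels, for instance) simply returns `None`.

```python
from typing import Optional, Tuple

# place -> (major class, ARPAbet consonants listed on the slides)
PLACES = {
    "bilabial": ("labial", ["p", "b", "m"]),
    "labiodental": ("labial", ["f", "v"]),
    "dental": ("coronal", ["th", "dh"]),
    "alveolar": ("coronal", ["t", "d", "s", "z", "l"]),
    "post-alveolar/palatal": ("coronal", ["sh", "zh", "y"]),
    "velar": ("dorsal", ["k", "g", "ng"]),
}

# Invert the table: phone -> (major class, place)
PLACE_OF = {
    phone: (major, place)
    for place, (major, phones) in PLACES.items()
    for phone in phones
}


def place_of(phone: str) -> Optional[Tuple[str, str]]:
    """Return (major class, place) for an ARPAbet consonant, or None if unlisted."""
    return PLACE_OF.get(phone.lower())


if __name__ == "__main__":
    for phone in ["p", "th", "sh", "ng", "aa"]:
        print(phone, place_of(phone))
```

Inverting the table once keeps lookups simple, and keeping the major class next to each place mirrors the labial / coronal / dorsal grouping used on the slides.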

Manner of Articulation
Stop: complete closure of articulators, so no air escapes through the mouth
• Oral stop: the soft palate (velum) is raised, so no air escapes through the nose. Air pressure builds up behind the closure and explodes when released
  – p, t, k, b, d, g
• Nasal stop: oral closure, but the soft palate is lowered, so air escapes through the nose
  – m, n, ng

Oral vs. Nasal Sounds
Thanks to Jong-bok Kim for this figure!

More on manner of articulation of consonants
Fricatives
• Close approximation of two articulators, resulting in turbulent airflow between them, producing a hissing sound
  – f, v, s, z, th, dh
Approximant
• Not-quite-so-close approximation of two articulators, so no turbulence
  – y, r
Lateral approximant
• Obstruction of the airstream along the center of the oral tract, with an opening around the sides of the tongue
  – l
Text from Ladefoged, “A Course in Phonetics”

More on manner of articulation of consonants
Tap or flap
• Tongue makes a single tap against the alveolar ridge
  – dx in “butter”
Affricate
• Stop immediately followed by a fricative
  – ch, jh
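The manner classes from the three slides above can be collected the same way. The sketch below is illustrative only: it contains just the phones the slides list, the names `MANNERS`, `MANNER_OF` and `consonant_manners` are hypothetical, and the ARPAbet transcription of "butter" used in the demo is an assumed example pronunciation.

```python
from typing import List, Tuple

# manner -> ARPAbet consonants listed on the slides
MANNERS = {
    "oral stop": ["p", "t", "k", "b", "d", "g"],
    "nasal stop": ["m", "n", "ng"],
    "fricative": ["f", "v", "s", "z", "th", "dh"],
    "approximant": ["y", "r"],
    "lateral approximant": ["l"],
    "tap/flap": ["dx"],
    "affricate": ["ch", "jh"],
}

# Invert the table: phone -> manner
MANNER_OF = {phone: manner for manner, phones in MANNERS.items() for phone in phones}


def consonant_manners(phones: List[str]) -> List[Tuple[str, str]]:
    """Label each listed consonant in a pronunciation with its manner.
    Phones that are not in the table (e.g. vowels) are skipped."""
    return [(p, MANNER_OF[p]) for p in phones if p in MANNER_OF]


if __name__ == "__main__":
    butter = ["b", "ah", "dx", "er"]  # assumed ARPAbet transcription
    print(consonant_manners(butter))
```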

Articulatory parameters for English consonants (in ARPAbet)
MANNER OF ARTICULATION (rows) by PLACE OF ARTICULATION (columns)
          bilabial  labiodental  interdental  alveolar  palatal  velar  glottal
stop      p b                                 t d                k g    q
fric.               f v          th dh        s z       sh zh           h
affric.                                                 ch jh
nasal     m                                   n                  ng
approx    w                                   l/r       y
flap                                          dx
VOICING: in each space-separated pair, the voiceless consonant is listed first and the voiced one second (the original slide marked voicing by color)
From Jennifer Venditti

Tongue position for vowels

Vowels
[Figure panels: IY, AA, UW]
Fig. from Eric Keller

American English Vowel Space
[Figure: vowel chart with axes HIGH to LOW and FRONT to BACK, plotting the ARPAbet vowels iy, ih, ix, ey, eh, ae, ux, ax, ah, uw, uh, ow, ao, aa, ay, aw, oy; positions are not recoverable from this transcript]
Figure from Jennifer Venditti

More phonetic structure
Syllables
• Composed of vowels and consonants. Not well defined. Something like a “vowel nucleus with some of its surrounding consonants”.
Stress
• Some syllables have more energy than others
• Stressed syllables versus unstressed syllables
  – (an) ‘INsult vs. (to) in’SULT
  – (an) ‘OBject vs. (to) ob’JECT
• Unstressed vowels are generally transcribed as schwa: ax
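The noun/verb stress pairs on this slide are the kind of case a TTS front end has to resolve before it can pick a pronunciation. The following is a minimal sketch, assuming the part of speech is already known; the toy lexicon, its syllable splits and the function name `mark_stress` are all hypothetical and only cover the two words from the slide.

```python
from typing import Dict, List, Tuple

# Toy lexicon (hypothetical): (word, part of speech) -> (syllables, stressed index)
LEXICON: Dict[Tuple[str, str], Tuple[List[str], int]] = {
    ("insult", "noun"): (["in", "sult"], 0),
    ("insult", "verb"): (["in", "sult"], 1),
    ("object", "noun"): (["ob", "ject"], 0),
    ("object", "verb"): (["ob", "ject"], 1),
}


def mark_stress(word: str, pos: str) -> str:
    """Spell the word with its stressed syllable in capitals."""
    syllables, stressed = LEXICON[(word.lower(), pos)]
    return "".join(
        syl.upper() if i == stressed else syl
        for i, syl in enumerate(syllables)
    )


if __name__ == "__main__":
    for word in ["insult", "object"]:
        for pos in ["noun", "verb"]:
            print(pos, mark_stress(word, pos))
```

Keying the lexicon on both spelling and part of speech is the design choice that matters here: the spelling alone does not determine which syllable is stressed.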

Where to go for more info
• Ladefoged, Peter. 1993. A Course in Phonetics
• Mark Liberman’s site: http://www.ling.upenn.edu/courses/Spring_2001/ling001/phonetics.html
• John Coleman’s site: http://www.phon.ox.ac.uk/%7Ejcoleman/mst_mphil_phonetics_course_index.html