Summary of the means and standard deviations of the differences from the two experiments. All values are in msec.
Results & Conclusion The automated identification of the boundary (labeled auto) between /s/ and /h/ in the phrase Miss Henry produced by a female native speaker of English. The f and v represent the beginnings of /s/ and the vowel following /h/.
References
[1] Boersma, Paul. 2001. Praat, a system for doing phonetics by computer. Glot International 5(9/10). pp. 341-345.
[2] Yoon, Kyuchul. 2002. A production and perception experiment of Korean alveolar fricatives. Speech Sciences. 9(3). pp. 169-184.
[3] Yoon, Kyuchul. 2005. Durational correlates of prosodic categories: The case of two Korean voiceless coronal fricatives. Speech Sciences. 12(1). pp. 89-105.
2. The role of prosody in dialect synthesis and authentication Kyuchul Yoon School of English Language & Literature Yeungnam University Spring 2008 Joint Conference of KSPS & KASS
Goals 1. Synthesize Masan utterances from matching Seoul utterances by prosody cloning 2. Examine the role of prosody in the authentication of synthetic Masan utterances (Listening experiment)
Background
• Differences among dialects
  – Segmental differences
    • Fricative differences in the time domain (Lee, 2002)
      – Busan fricatives have shorter frication/aspiration intervals than Seoul fricatives
    • Fricative differences in the frequency domain (Kim et al., 2002)
      – The low cutoff frequency of Kyungsang fricatives was higher than that of Cholla fricatives (> 1,000 Hz)
  – Non-segmental or prosodic differences
    • Intonation or fundamental frequency (F0) contour difference
    • Intensity contour difference
    • Segment durational difference
    • Voice quality difference
Synthesis
• Simulating (by prosody cloning) Masan dialect from Seoul dialect
• The simulated Masan utterances will have
  – the speech segments of Seoul dialect
  – the prosody of Masan dialect
    • F0 contour
    • Intensity contour
    • Segmental duration
Evaluation
• Through a listening experiment
• Stimuli consist of
  – #1. Authentic, but synthetic, Masan utterance
  – #2. Seoul utterance with Masan segmental durations (D)
  – #3. Seoul utterance with Masan F0 contour (F)
  – #4. Seoul utterance with Masan intensity contour (I)
  – #5. Seoul utterance with Masan durations and F0 contour (D+F)
  – #6. Seoul utterance with Masan durations and intensity contour (D+I)
  – #7. Seoul utterance with Masan F0 contour and intensity contour (F+I)
  – #8. Seoul utterance with Masan durations, F0 contour and intensity contour (D+F+I)
• Sentences
  – (1) 동대구에 볼 일이 없습니다. (I have no business in Dongdaegu.)
  – (2) 바다에 보물섬이 없다 (There is no treasure island in the sea.)
Prosody transfer (PSOLA algorithm)
• Three aspects of the prosody
  – Fundamental frequency (F0) contour
  – Intensity contour
  – Segmental durations
• Pitch-Synchronous OverLap and Add (PSOLA) algorithm (Moulines & Charpentier, 1990)
  – Implemented in Praat (Boersma, 2005)
  – Use of a script for semi-automatic segment-by-segment manipulation (Yoon, 2007)
Prosody transfer (PSOLA algorithm)
• Procedures for full prosody transfer
  – Align segments between Masan and Seoul utterances
  – Make the segment durations of the two identical
  – Make the two F0 contours identical
  – Make the two intensity contours identical
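The duration step of the procedure can be illustrated with a minimal sketch. It assumes the two utterances have already been aligned segment by segment, and it only computes the time-scaling factor each Seoul segment would need; the actual resynthesis is done with PSOLA in Praat and is not reproduced here. The function name and the duration values are hypothetical.

```python
def duration_ratios(source_durs, target_durs):
    """Return per-segment time-scaling factors (target / source).

    A factor above 1 means the source segment must be stretched,
    a factor below 1 means it must be shrunk.
    """
    if len(source_durs) != len(target_durs):
        raise ValueError("utterances must have the same number of aligned segments")
    return [t / s for s, t in zip(source_durs, target_durs)]


# Hypothetical durations (ms) for five aligned segments; not measured data.
seoul_ms = [60.0, 110.0, 45.0, 120.0, 80.0]
masan_ms = [45.0, 132.0, 45.0, 90.0, 100.0]

ratios = duration_ratios(seoul_ms, masan_ms)
for index, ratio in enumerate(ratios, start=1):
    action = "stretch" if ratio > 1 else "shrink" if ratio < 1 else "keep"
    print(f"segment {index}: x{ratio:.2f} ({action})")
```

The unequal-length check reflects the requirement that segments be aligned one to one before any durations are equalized.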
Prosody transfer (PSOLA algorithm) Align segments between Masan and Seoul utterances. Make the segment durations of the two utterances identical. [Figure: the segments ㅂ ㅏ ㄹ ㅏ ㅁ of Seoul “…바람…” are stretched or shrunk to match the corresponding Masan segments]
Prosody transfer (PSOLA algorithm) Make the two F0 contours identical. [Figure: Masan F0 and Seoul F0 contours over the Masan segments ㅂ ㅏ ㄹ ㅏ ㅁ]
Prosody transfer (PSOLA algorithm) Make the two intensity contours identical. [Figure: Masan intensity and Seoul intensity contours over the Masan segments ㅂ ㅏ ㄹ ㅏ ㅁ]
Synthetic (simulated) Masan stimulus
Synthetic authentic Masan stimulus
Listening experiment
• 16 stimuli (8 + 8)
• Presented to 13 Masan/Changwon listeners
  – On a scale of 1 (worst) to 10 (best)
  – Used Praat ExperimentMFC object
  – Allowed repetition of stimulus: up to 10 times
Results & Conclusion Histogram of listener responses
Results & Conclusion [Figure: listener responses on the scale of 1 to 10; F0 contour transfer]
Results & Conclusion [Figure: Seoul utterances with Masan prosody; panel labels: Masan, F, D, FI, DFI, DI]
Results & Conclusion
• Main effects of
  – Segmental durations; F(1, 12)=11.53, p=0.005
  – F0 contour; F(1, 12)=141.12, p=0.00000005
• Regression analysis
Results & Conclusion
• Prosody cloning not sufficient for dialect simulation
  – (Sub)Segmental differences may be at work
  – Quality of synthetic stimuli
• F0 contour transfer (from Masan to Seoul)
  – Most influential on shifting perception from Seoul to Masan utterances
References
[1] Kyung-Hee Lee, “Comparison of acoustic characteristics between Seoul and Busan dialect on fricatives”, Speech Sciences, Vol. 9(3), pp. 223-235, 2002.
[2] Hyun-Gi Kim, Eun-Young Lee, and Ki-Hwan Hong, “Experimental phonetic study of Kyungsang and Cholla dialect using power spectrum and laryngeal fiberscope”, Speech Sciences, Vol. 9(2), pp. 25-47, 2002.
[3] Kyuchul Yoon, “Imposing native speakers’ prosody on non-native speakers’ utterances: The technique of cloning prosody”, Journal of the Modern British & American Language & Literature, Vol. 25(4), pp. 197-215, 2007.
[4] E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones”, Speech Communication, Vol. 9(5-6), pp. 453-467, 1990.
[5] P. Boersma, “Praat, a system for doing phonetics by computer”, Glot International, Vol. 5(9/10), pp. 341-345, 2005.
3. Synthesis & evaluation of prosodically exaggerated utterances: A preliminary study Kyuchul Yoon School of English Language & Literature Yeungnam University Spring 2008 Joint Conference of KSPS & KASS
Contents • Synthesis & evaluation of human utterances with exaggerated prosody • Synthesis of exaggerated prosody – Useful for presenting native utterances to students – The definition of prosody “exaggeration” – The algorithm • Evaluation of exaggerated prosody – Useful for evaluating learner utterances – The algorithm & an experiment
Teaching & evaluating prosody • Teaching language prosody – The need for “exaggeration” of native utterances – How to define “exaggeration” • Evaluating language prosody – Given the native version of an utterance, evaluate learner’s atypical prosody – How to measure the differences btw/ the native and learner utterances
Exaggerating native prosody
• Exaggeration of the F0 contour
  – One way would be to make the pitch peaks/valleys higher/lower
• Exaggeration of the intensity contour
  – One way would be to manipulate the intensity contour of the pitch peaks (or valleys)
• Exaggeration of the segmental durations
  – One way would be to manipulate the segmental durations of the pitch peaks (or valleys)
Exaggerating native prosody: F0. The fundamental frequency (F0) contour of an utterance, “Marianna!”.
Exaggerating native prosody: Intensity. The intensity contour of an utterance, “Marianna!”.
Exaggerating native prosody: Duration. The segmental durations of an utterance, “Marianna!”, before and after the exaggeration.
Algorithm: prosody exaggeration
• Definition of prosody exaggeration
  – F0 contour
    • Make pitch peaks/valleys higher/lower in Hz values
  – Intensity contour
    • Make pitch peaks higher in dB values
  – Segmental durations
    • Make pitch peaks longer in time values
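One possible reading of the F0 part of this definition is sketched below: every F0 value is pushed away from the utterance mean by a constant factor, so peaks become higher and valleys lower. This is an illustrative assumption, not the Praat script used in the study; the function name, the factor, and the F0 values are hypothetical.

```python
def exaggerate(values, factor):
    """Scale each value's deviation from the mean by `factor`.

    With factor > 1, values above the mean (peaks) move up and
    values below the mean (valleys) move down.
    """
    mean = sum(values) / len(values)
    return [mean + factor * (v - mean) for v in values]


# Hypothetical F0 samples in Hz; not measured data.
f0_hz = [180.0, 220.0, 200.0, 160.0, 240.0]
exaggerated_f0 = exaggerate(f0_hz, 1.5)

for before, after in zip(f0_hz, exaggerated_f0):
    print(f"{before:6.1f} Hz -> {after:6.1f} Hz")
```

The mean of the contour is unchanged by this operation, which keeps the overall pitch register while widening the pitch range.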
Algorithm: prosody exaggeration F0
Algorithm: prosody exaggeration Intensity
Algorithm: prosody exaggeration Durations
How Praat script works
How Praat script works F0 Intensity Durations
How Praat script works Original F0 Durations Intensity
Evaluating learner prosody
• Assumes the existence of the native version
• Evaluates the learner versions
• Evaluation of the F0 & intensity contours
  – Is preceded by duration manipulation:
    • The durations of the matching segments of the two utterances are made identical [3]
  – Is preceded by F0/intensity normalization & F0 smoothing
    • The mean difference is added/subtracted to/from learner utterance
  – Is followed by pitch/intensity point-to-point comparison
• Evaluation of segmental durations
  – Done without any duration manipulation. Segment-to-segment comparison
• Evaluation measure: Euclidean distance metric
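The normalization and point-to-point comparison steps can be sketched as follows, assuming the two contours already have the same number of points after duration manipulation. Smoothing is left out, and the function names and F0 values are hypothetical.

```python
def normalize_to_native(native, learner):
    """Shift the learner contour so that its mean equals the native mean."""
    shift = sum(native) / len(native) - sum(learner) / len(learner)
    return [v + shift for v in learner]


def pointwise_differences(native, learner):
    """Return native minus learner at each matching point."""
    if len(native) != len(learner):
        raise ValueError("contours must have the same number of points")
    return [n - l for n, l in zip(native, learner)]


# Hypothetical F0 contours in Hz; not measured data.
native_f0 = [200.0, 210.0, 190.0]
learner_f0 = [150.0, 170.0, 130.0]

normalized = normalize_to_native(native_f0, learner_f0)
diffs = pointwise_differences(native_f0, normalized)
print(normalized)  # [200.0, 220.0, 180.0]
print(diffs)       # [0.0, -10.0, 10.0]
```

After the shift, the remaining differences reflect the shape of the contour rather than the speakers' overall pitch or loudness level.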
Algorithm: prosody evaluation Before & after duration manipulation native learner before learner after
Algorithm: prosody evaluation F0 point-to-point comparison between native and learner [Figure panels: native, learner after]. Normalization & smoothing were performed in prior steps
Algorithm: prosody evaluation Intensity point-to-point comparison btw/ native and learner native learner after Normalization was performed in prior steps
Algorithm: prosody evaluation Duration segment-to-segment comparison between native and learner [Figure panels: native, learner before]. Euclidean distance metric for evaluation measure: $P = (p_1, p_2, p_3, \ldots, p_n)$ and $Q = (q_1, q_2, q_3, \ldots, q_n)$ in Euclidean $n$-space
A pilot experiment [Figure panels: native, learner after D/F/I cloning]. An ideal case: Three Euclidean distances (Ed) should be minimum. Ed1: F0 contour. Ed2: Intensity contour. Ed3: Segment durations
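For two points P and Q in Euclidean n-space, the standard Euclidean distance is the square root of the sum of squared coordinate differences. A minimal sketch of that computation is given below; the function name and the example vectors are hypothetical and do not come from the study's data.

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("P and Q must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical vectors: differences are 3, 4 and 0, so the distance is 5.
p = (1.0, 2.0, 3.0)
q = (4.0, 6.0, 3.0)
print(euclidean_distance(p, q))  # 5.0
```

A distance of zero means the two sequences are identical point for point, which is why smaller values indicate a closer match to the native utterance.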
Creation of Stimuli: F0 [Figure panels: native, learner after D cloning]. F0: -100 Hz to +100 Hz with a 10 Hz interval (21 stimuli). Evaluation of the stimuli against the F0 contour of the native utterance
Creation of Stimuli: Intensity [Figure panels: native, learner after D cloning]. Intensity: -25 dB to +25 dB with a 5 dB interval (11 stimuli). Evaluation of the stimuli against the intensity contour of the native utterance
Creation of Stimuli: Duration [Figure panels: native, learner]. Duration: 0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 2.50, 3.00 times the original (8 stimuli). Evaluation of the stimuli against the segment durations of the native utterance
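The three stimulus grids can be enumerated directly from the ranges stated on the slides, which also confirms the stimulus counts. The variable names below are hypothetical; only the ranges, steps, and factors come from the slides.

```python
# F0 shifts: -100 Hz to +100 Hz in 10 Hz steps
f0_shifts_hz = list(range(-100, 101, 10))

# Intensity shifts: -25 dB to +25 dB in 5 dB steps
intensity_shifts_db = list(range(-25, 26, 5))

# Duration factors: multiples of the original segment durations
duration_factors = [0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 2.50, 3.00]

print(len(f0_shifts_hz))         # 21
print(len(intensity_shifts_db))  # 11
print(len(duration_factors))     # 8
```

Each grid includes the unmodified case (a shift of 0 or a factor of 1.00).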
Results & Conclusion
• Prosody exaggeration
  – Can be a tool for teaching language prosody
  – Can be used to test measures for evaluating prosody
• Limitation of the current prosody evaluation
  – Native utterances should exist to yield measures
    • TTS systems with advanced prosody models could be helpful to process any learner utterances
  – “Weights” of the three separate measures (F0/intensity/duration) need to be determined
    • Experiments with human evaluators could provide the weights
References
[1] Boersma, Paul. 2001. Praat, a system for doing phonetics by computer. Glot International 5(9/10). pp. 341-345.
[2] Moulines, E. & F. Charpentier. 1990. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9. pp. 453-467.
[3] Yoon, K. 2007. Imposing native speakers' prosody on non-native speakers' utterances: The technique of cloning prosody. Journal of the Modern British & American Language & Literature 25(4). pp. 197-215.
4. Determining the weights of prosodic components in prosody evaluation
• Problem
  – Raw components vs. Abstracted concepts
  – F0, intensity, duration vs. Rhythm, tempo, etc.
• Determine the weights of prosodic components in prosody evaluation
  – Use raw units: F0, intensity, duration
  – Use cloning of prosody (problem of unequal number of segments)
  – Create an “other-things-being-equal” environment
  – Evaluation of
    • Each raw prosodic component
    • Overall prosodic fluency
  – Compare & Assess the weights of each component in prosody evaluation
Stimuli (4) Determining the weights of prosodic components in prosody evaluation
• Given (a) model native utterance(s) and its learner version
• Human evaluator evaluates the learner utterance in terms of its prosodic fluency = Overall Prosody Score (from the unmodified learner utterance)
• Manipulate the learner utterance to create an “other-things-being-equal” environment so that the learner utterance is the same as its native version except for
  – (1) Its F0 contour (learner utterance version 1)
  – (2) Its intensity contour (learner utterance version 2)
  – (3) Its segmental durations (learner utterance version 3)
• Evaluate the manipulated learner utterances
  – (1) F0 score (from learner version 1)
  – (2) Intensity score (from learner version 2)
  – (3) Duration score (from learner version 3)
• Hypothesis: Overall prosody score = α × (F0 score) + β × (Intensity score) + γ × (Duration score)
• Repeat the evaluation for other utterances from the same learner to solve the equation
• Verify the coefficients with unevaluated utterances from the same learner
• If the hypothesis holds, make the prosody evaluation process automatic
Stimuli “The dancing queen likes only the apple pies” Native (5061_02) Evaluate overall prosody with respect to the native version (Overall Prosody Score) Learner (1047_02)
Stimuli “The dancing queen likes only the apple pies” Native. Learner_DI: now has the native durations/intensity. Evaluate F0 contour (F0 Score). Learner_DF: now has the native durations/F0 contour. Evaluate intensity contour (Intensity Score)
Stimuli “The dancing queen likes only the apple pies” Native. Learner_FI: now has the native F0/intensity. Evaluate segmental durations (Duration Score). Overall prosody score = α × (F0 score) + β × (Intensity score) + γ × (Duration score)
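With three unknown coefficients, scores from three utterances of the same learner give three linear equations, which is the minimum needed to solve the hypothesized equation. The sketch below solves such a system with Cramer's rule. All scores are invented for illustration and are not experimental results; with more than three utterances a least-squares fit would be the natural replacement.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve_weights(component_scores, overall_scores):
    """Solve a 3x3 linear system for the three weights by Cramer's rule."""
    base = det3(component_scores)
    if base == 0:
        raise ValueError("scores do not determine a unique set of weights")
    weights = []
    for col in range(3):
        replaced = [row[:] for row in component_scores]
        for r in range(3):
            replaced[r][col] = overall_scores[r]
        weights.append(det3(replaced) / base)
    return weights


# Hypothetical scores: one row per utterance, columns = (F0, intensity, duration).
component_scores = [
    [8.0, 6.0, 7.0],
    [5.0, 9.0, 4.0],
    [7.0, 5.0, 9.0],
]
overall_scores = [7.3, 5.5, 7.2]  # hypothetical overall prosody scores

weights = solve_weights(component_scores, overall_scores)
print([round(w, 3) for w in weights])  # [0.5, 0.2, 0.3]
```

The recovered weights could then be checked against utterances that were not used to solve the system, as the slides propose.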
5. Difference database of prosodic features for automatic prosody evaluation
• Given (a) model native utterance(s) and its learner version, get difference values of
  – (1) F0 contour
  – (2) intensity contour
  – (3) segmental durations
  between the two utterances
• Use techniques & scripts used in
  – (3) Synthesis & evaluation of prosodically exaggerated utterances
• Store difference values of each prosodic feature for each learner utterance in a database
• Use the database to develop algorithms for automatic prosody scoring
• Pilot study: labeled sentences from KT_K-SEC corpus
Pilot data (5) Difference database of prosodic features for automatic prosody evaluation
Pilot data (5) Difference database of prosodic features for automatic prosody evaluation: Intensity difference
native       learner      numFrames  frameNo  time   nativedB  learnerdB  diffdB
5053_02.wav  1044_02.wav  482        1        0.035  31.86     42.42      -10.56
5053_02.wav  1044_02.wav  482        2        0.043  30.73     42.45      -11.72
5053_02.wav  1044_02.wav  482        3        0.051  29.33     41.94      -12.61
5053_02.wav  1044_02.wav  482        4        0.059  29.03     41.00      -11.97
5053_02.wav  1044_02.wav  482        5        0.067  29.11     40.97      -11.86
5053_02.wav  1044_02.wav  482        6        0.075  29.92     41.97      -12.05
5053_02.wav  1044_02.wav  482        7        0.083  30.27     42.67      -12.40
5053_02.wav  1044_02.wav  482        8        0.091  31.14     42.63      -11.49
5053_02.wav  1044_02.wav  482        9        0.099  30.27     44.10      -13.83
5053_02.wav  1044_02.wav  482        10       0.107  30.35     45.12      -14.77
5053_02.wav  1044_02.wav  482        11       0.115  30.73     43.90      -13.18
5053_02.wav  1044_02.wav  482        12       0.123  30.53     43.15      -12.62
5053_02.wav  1044_02.wav  482        13       0.131  32.44     42.67      -10.22
5053_02.wav  1044_02.wav  482        14       0.139  31.12     40.94      -9.82
5053_02.wav  1044_02.wav  482        15       0.147  30.97     38.88      -7.91
5053_02.wav  1044_02.wav  482        16       0.155  33.92     38.15      -4.24
5053_02.wav  1044_02.wav  482        17       0.163  33.78     37.45      -3.67
5053_02.wav  1044_02.wav  482        18       0.171  32.72     35.75      -3.03
Sum of squares of the diffdB values: 42114. Square root of the sum: 205
Pilot data (5) Difference database of prosodic features for automatic prosody evaluation: Duration difference
native            learner           numSegs  segNo  nativeSegID  learnerSegID  timeStart  nativeDur  ratio  normNativeDur  learnerDur  normDiffDur
5053_02.TextGrid  1044_02.TextGrid  33       1      SIL                        0          330        1.027  321            328         -7
5053_02.TextGrid  1044_02.TextGrid  33       2      dh           dh            0.330      22         1.027  22             16          5
5053_02.TextGrid  1044_02.TextGrid  33       3      ax           ax            0.353      60         1.027  59             86          -27
5053_02.TextGrid  1044_02.TextGrid  33       4      SIL                        0.413      104        1.027  101            67          34
5053_02.TextGrid  1044_02.TextGrid  33       5      dd           dd            0.517      19         1.027  19             14          5
5053_02.TextGrid  1044_02.TextGrid  33       6      ae           ae            0.536      151        1.027  147            126         21
5053_02.TextGrid  1044_02.TextGrid  33       7      nn           nn            0.686      57         1.027  55             92          -37
5053_02.TextGrid  1044_02.TextGrid  33       8      ss           ss            0.743      92         1.027  89             102         -13
5053_02.TextGrid  1044_02.TextGrid  33       9      ih           ih            0.835      67         1.027  66             111         -45
5053_02.TextGrid  1044_02.TextGrid  33       10     ng           ng            0.902      100        1.027  98             70          28
5053_02.TextGrid  1044_02.TextGrid  33       11     kk           kk            1.002      147
Sum of squares of the diffDur values: 59266. Square root of the sum: 243
Pilot data (5) Difference database of prosodic features for automatic prosody evaluation: F0 difference
native       learner      numFrames  frameNo  time   nativeF0       learnerF0  diffF0
5053_02.wav  1044_02.wav  388        1        0.024  --undefined--
5053_02.wav  1044_02.wav  388        2        0.034  --undefined--
5053_02.wav  1044_02.wav  388        3        0.044  --undefined--
5053_02.wav  1044_02.wav  388        4        0.054  --undefined--
5053_02.wav  1044_02.wav  388        35       0.364  220            198        22
5053_02.wav  1044_02.wav  388        36       0.374  213            197        16
5053_02.wav  1044_02.wav  388        37       0.384  207            197        11
5053_02.wav  1044_02.wav  388        38       0.394  203            196        7
5053_02.wav  1044_02.wav  388        39       0.404  200            195        5
5053_02.wav  1044_02.wav  388        40       0.414  198            194        4
5053_02.wav  1044_02.wav  388        41       0.424  197            194        4
…
Sum of squares of the diffF0 values: 236363. Square root of the sum: 486
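The square roots reported on the three pilot-data slides follow from the reported sums of squares, as a short check shows. The sums and rounded roots are copied from the slides; the dictionary layout and function name are hypothetical.

```python
import math

# (sum of squared differences, square root reported on the slide)
reported = {
    "intensity": (42114, 205),
    "duration": (59266, 243),
    "F0": (236363, 486),
}


def distance_from_sum_of_squares(sum_sq):
    """Euclidean distance given an already-accumulated sum of squares."""
    return math.sqrt(sum_sq)


distances = {name: distance_from_sum_of_squares(s) for name, (s, _) in reported.items()}
for name, (sum_sq, slide_value) in reported.items():
    print(f"{name}: sqrt({sum_sq}) = {distances[name]:.1f} (slide reports {slide_value})")
```

Because the three distances are in different units, they cannot be compared with one another directly, which is the motivation for determining per-feature weights.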
6. Transforming Korean alveolar lax fricatives into tense
• Goal
  – Test factors that distinguish /ㅅ/ from /ㅆ/
• Type of factors
  – Consonantal: noise durations, center of gravity
  – Vocalic: formant/bandwidth switching
  – Prosodic: clone F0/intensity/durations, switch source signals
Pilot data (6) Transforming Korean alveolar lax fricatives into tense 사자 vs. 싸자
Pilot data (6) Transforming Korean alveolar lax fricatives into tense 사자 사자 Prosody: Durations, F0, Intensity 싸자
Pilot data (6) Transforming Korean alveolar lax fricatives into tense 사자 사자 Prosody + Formants Bandwidths 싸자
Design (6) Transforming Korean alveolar lax fricatives into tense
• Things to do
  – Try the reverse: manipulate /ㅆ/ to simulate /ㅅ/
  – Try this with other lax/tense pairs of stops
    • 사 싸, 다 따, 바 빠, 가 까
  – Try switching the source signal
• Listening experiments
  – [1] Render /ssa/ from /sha/
    • (1) prosody
    • (2) formant/bandwidth
    • (3) source
    • (1)+(2): shift?, (1)+(3): shift?, (1)+(2)+undo(1): see effect of (2) only, (1)+(3)+undo(1): see effect of (3) only, (1)+(2)+(3)+undo(1): see the effects of (2) and (3) only
  – [2] Render /sha/ from /ssa/
    • (1) prosody
    • (2) formant/bandwidth
    • (3) source
    • (1)+(2): shift?, (1)+(3): shift?, (1)+(2)+undo(1): ?, (1)+(3)+undo(1): ?, (1)+(2)+(3)+undo(1): ?
  – [3] Statistical analyses of formants/bandwidths
    • Examine post-consonantal vowels in terms of their formants/bandwidths for any possible intra/inter-consonantal differences
    • Identify the portion of the vowels that contributes to the distinction of lax/tense consonants, e.g. ½, ¼ from the vowel onset
7. Gender transformation of utterances
• Examine male vs. female utterances in terms of prosodic & segmental differences
  – Identify factors that differ
  – Refer to Praat’s change gender… under Convert button
  – Verify with synthesizing
    • Prosody manipulation
      – F0/intensity/durations/source
    • Segment manipulation
      – Formant frequencies & bandwidths