
- Number of slides: 26
Introduction to Speech Corpora@Stanford
Neal Snider, snider@stanford.edu
For LIN 110, April 12th, 2005 (adapted from slides by Florian Jaeger)
Before we get to the real stuff…
- This presentation will be available online at: http://www.stanford.edu/dept/linguistics/corpora/material/ling110/
- Local support
- Where are our corpora?
- Setting up your account on AFS
Local support
- Where can you get help with your project?
  - Your TA
  - The Corpora@Stanford website (http://www.stanford.edu/dept/linguistics/corpora/)
  - The corpora@csli.stanford.edu email list (you have to subscribe first)
  - The corpus TA (snider@stanford.edu)
Where are our corpora? (1)
- AFS:
  - AFS is Stanford’s file sharing system
  - The linguistic corpora are stored at: /afs/ir/data/linguistic-data/
  - You need to register for AFS access
  - You need to set up your account
Where are our corpora? (2)
- Corpus Computer
  - The computer is the one closest to the printer in the linguistics department’s computer cluster (MJH, 1st floor)
- The corpora are stored on partition D:
- Mapping the drive via a network:
The real part
- Example project
- Overview of available corpora
- Where to find them
- What does the annotation look like?
- How to search speech corpora
Example projects (1)
- Differences in the realization of phonemes depending on their context
  - ‘Context’ can be segmental [1]
    - How does the realization of syllabic /m/ differ depending on the preceding onset?
    - Word-final vowel aspiration
  - ‘Context’ can be supra-segmental: [3]
    - How does the realization of syllabic /m/ differ at the beginning/end of conversations/utterances/sentences?
    - Reduction of complex clusters
Example projects (2)
- ‘Context’ could also include the register, style (formal vs. informal), genre (reading a fairy tale vs. reading an article), different dialects, etc. [2]
- Pitch contours related to specific meanings [1]
  - Steady-state pitch contours
Available corpora
- Handout in http://www.stanford.edu/dept/linguistics/corpora/material/X_speech_corpora/X_phonetic corpora.doc
- See also:
  - http://www.stanford.edu/dept/linguistics/corpora/
Switchboard – spontaneous American English (AE) speech
- Transcripts uploaded to AFS:
  - /afs/ir/data/linguistic-data/Switchboard/
- Sound files available on CD
- Available in several formats:
  - All in one file
  - Separate files for
    - Syllables
    - Words
    - Orthographic transcription
Example annotations (Switchboard)
- Some files in Switchboard
Switchboard – all in one file: Annotation key (1)
Key:
- SENTENCE: word1 word2 ... (2005_A_0041)
- WORD: word canonical? [lm-probs] [rates] [positions] [morebigrams] part-of-speech phone1 phone2 ...
- SYL: baseform transcribed syl_structure stress length [lm-probs] [rates] [positions]
- PHONE: baseform stress syl_part [lm-probs] [rates] [positions] tran1 tran2 ...
Switchboard – all in one file: Annotation key (2)
- [lm-probs] = trigram unigram trigram-unigram
- [rates] = seg_tr_syl seg_tr_phn lex_syl lex_phn enrate vrate nvrate mfrate enmmfrate
- [positions] = word_num_in_utterance word_num_in_turn
- [morebigrams] = bigram reverse-trigram center-trigram
- part-of-speech = syntactic part of speech (currently only done for the word "to")
- word.X = word number X in acoustically segmented `sentence'
- canonical? = can if canonical (pronlex) pronunciation, alt otherwise
- trigram = p(word | previous two words)
- unigram = p(word)
- trigram-unigram = difference between two probabilities
- seg_tr_syl = transcribed syllable rate between closest two pauses
- seg_tr_phn = transcribed phone rate between closest two pauses
- lex_syl = lexical syllabic rate (i.e., as determined from wd transcription)
- lex_phn = lexical phone rate (i.e., as determined from wd transcription)
Switchboard – all in one file: Annotation key (3)
- enrate = old enrate measure
- vrate = voicing rate
- nvrate = another voicing rate
- mrate = sub-part of mrate measure
- mfrate = sub-part of mrate measure
- enmmfrate = *this is what we call mrate* average of enrate, mfrate
- mmffrate = average of mrate, mfrate
- baseform = pronunciation as written in dictionary
- transcribed = transcribed syllable
- syl_structure = onset/nucleus/coda markings from dictionary
- stress = syllable stress marking from dictionary: P=primary, S=secondary, N=none
- length = syllable length
- tran.X = transcribed phone X corresponding to baseform phone
Arpabet
Example annotations (Switchboard – all in one file)
SENTENCE: like finding a proper nursing home (2005_A_0041)
WORD: like 1 can -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 l ay k
SYL: l_ay_k O_N_C P 0.258 -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26
PHONE: l P O -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 l
PHONE: ay P N -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 ay
PHONE: k P C -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 k
WORD: finding 2 alt -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 f ay n ih ng
SYL: f_ay_n O_N_C P 0.358 -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27
PHONE: f P O -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 f
PHONE: ay P N -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 ay
PHONE: n P C -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 n
SYL: d_ih_ng NULL_ih_ng O_N_C N 0.117 -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27
PHONE: d N O -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 NULL 1 27
PHONE: ih N N -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 ih
PHONE: ng N C -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 ng
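A WORD line like the ones in this example can be split into its fields with a few lines of Python. This is only a sketch, not tooling that ships with the corpus: the function name `parse_word_line` is hypothetical, and the layout (word, word number, canonical flag, a run of numeric fields, then phones) is assumed from the example itself. Only the three leading lm-probs and the two trailing position fields are named; the numeric fields in between are kept as an unlabeled list of rates. The example string is the first WORD line of the slide, written with unbroken decimals.

```python
def parse_word_line(line):
    """Split one 'WORD:' line into fields (layout assumed from the slide example)."""
    tokens = line.split()
    if not tokens or tokens[0] != "WORD:":
        raise ValueError("not a WORD line")
    word, number, canonical = tokens[1], int(tokens[2]), tokens[3]

    # Consume the run of numeric fields; the first non-numeric token starts the phones.
    numbers = []
    i = 4
    while i < len(tokens):
        try:
            numbers.append(float(tokens[i]))
        except ValueError:
            break
        i += 1

    return {
        "word": word,
        "number": number,
        "canonical": canonical == "can",       # 'can' vs. 'alt' in the key
        "trigram": numbers[0],
        "unigram": numbers[1],
        "trigram_minus_unigram": numbers[2],
        "rates": numbers[3:-2],                 # left unlabeled on purpose
        "positions": (int(numbers[-2]), int(numbers[-1])),
        "phones": tokens[i:],
    }


example = ("WORD: like 1 can -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 "
           "2.32 5.79 2.32 4.64 3.59 3.48 0 26 l ay k")
fields = parse_word_line(example)
print(fields["word"], fields["phones"], fields["positions"])
```

A quick sanity check on the parse is that the third lm-prob equals the first minus the second, as the annotation key says it should.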
Boston Radio Transcripts
- Includes read news etc. (i.e., nonspontaneous read speech)
- Transcripts uploaded to AFS at:
  - /afs/ir/data/linguistic-data/Boston-University-Radio
- Sound files available on CD
Example annotations (Boston Radio)
- Boston News Corpus
H# 0 >endsil DH 4 IH+1 S 19 >This HH 28 AA+1 L 42 AX 54 DCL D 61 EY 62 >holiday S 78 IY+1 Z 103 EN 110 … 4 5 9 9 5 33 12 4 58 1 16 11 89 7 20 10 9 3 14
Example annotations (Boston Radio)
- XWAVES/PRAAT readable:
signal st43/f3ast43p1
type 1
color 76
font -*-times-medium-r-*-*-17-*-*-*-*
separator ;
nfields 1
#
0.035000 76 H#
0.085000 76 DH
0.185000 76 IH+1
0.275000 76 S
0.325000 76 HH
0.415000 76 AA+1
0.535000 76 L
0.575000 76 AX
0.605000 76 DCL
0.615000 76 D
0.775000 76 EY
0.885000 76 S
…
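If you would rather process such a label file in a script than open it in a viewer, the format is simple enough to read by hand. The following is a rough sketch under assumptions taken from the slide only: header lines are "key value" pairs, a lone `#` ends the header, and every later line is "time color label". The function name `parse_label_file` is hypothetical, and the sample text is just the first few entries of the slide.

```python
def parse_label_file(text):
    """Return (header, entries); entries are (time, color, label) tuples."""
    header, entries = {}, []
    in_body = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not in_body:
            if line == "#":              # a lone '#' ends the header (assumed)
                in_body = True
            else:
                key, _, value = line.partition(" ")
                header[key] = value.strip()
            continue
        # Body line: time, color code, label (the label may itself contain '#').
        time_str, color, label = line.split(None, 2)
        entries.append((float(time_str), int(color), label))
    return header, entries


sample = """signal st43/f3ast43p1
type 1
color 76
separator ;
nfields 1
#
0.035000 76 H#
0.085000 76 DH
0.185000 76 IH+1
"""
header, entries = parse_label_file(sample)
for time, _color, label in entries:
    print(f"{time:.3f}\t{label}")
```

The sketch deliberately does not say whether a time stamp marks the start or the end of its segment; check the corpus documentation before computing durations from these values.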
CALLHOME Mandarin - Transcripts
- CALLHOME – Mandarin
  - Transcripts uploaded to AFS:
    - /afs/ir/data/linguistic-data/CALLHOMEMandarin-Transcripts/
  - Lexicon with pronunciation information available at:
    - /afs/ir/data/linguistic-data/CALLHOMEMandarin-Lexicon/
  - Sound files only available on CD/DVD, but I could put them on the corpus computer
TIMIT – dialect variation
- Telephone recording of 8 major dialects of American English
- (Orthographic) transcripts on AFS, sound files available on CD
- Comparable dialect corpora exist for the British Isles (IViE; stored on the corpus computer)
Example annotations (TIMIT)
- TIMIT
  - Word label (.wrd):
7470 11362 she
11362 16000 had
15420 17503 your
17503 23360 dark
23360 28360 suit
28360 30960 in
30960 36971 greasy
  - Phonetic label (.phn):
(Note: beginning and ending silence regions are marked with h#)
0 7470 h#
7470 9840 sh
9840 11362 iy
11362 12908 hv
12908 14760 ae
14760 15420 dcl
15420 16000 jh
16000 17503 axr
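The two numbers in each TIMIT label line are sample indices, so durations in seconds follow from the corpus's 16 kHz sampling rate. Here is a small sketch of how the .wrd and .phn lines above could be read and combined. The function names are hypothetical, and assigning a phone to a word only when its span lies entirely inside the word's span is one simple choice among several.

```python
SAMPLE_RATE_HZ = 16000  # TIMIT audio is sampled at 16 kHz


def parse_timit_labels(text):
    """Parse '.wrd' or '.phn' lines of the form '<start> <end> <label>'."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        segments.append((int(parts[0]), int(parts[1]), parts[2]))
    return segments


def duration_seconds(start, end, rate=SAMPLE_RATE_HZ):
    """Convert a span given in samples to seconds."""
    return (end - start) / rate


def phones_in_word(word, phones):
    """Labels of the phones whose span lies entirely inside the word's span."""
    w_start, w_end, _ = word
    return [label for (s, e, label) in phones if s >= w_start and e <= w_end]


wrd = """7470 11362 she
11362 16000 had"""
phn = """0 7470 h#
7470 9840 sh
9840 11362 iy
11362 12908 hv
12908 14760 ae
14760 15420 dcl
15420 16000 jh
16000 17503 axr"""

words = parse_timit_labels(wrd)
phones = parse_timit_labels(phn)
for w in words:
    print(w[2], duration_seconds(w[0], w[1]), phones_in_word(w, phones))
```

Note that the word spans on the slide overlap ("had" ends at 16000 but "your" starts at 15420), so a containment test can assign the same phone to two neighboring words.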
How to search transcribed corpora?
- Either load the files into your favorite text editor
- Or use a command from the ‘grep’ family (run in a UNIX shell)
  - This allows you to search many files at once for patterns that are described by regular expressions
  - For help, see our tutorial page at:
    - http://www.stanford.edu/dept/linguistics/corpora/cas-tutgrep.html
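For those who prefer a script to the shell, the same idea (one regular expression, many files) can be sketched in Python. This is an illustration, not a replacement for grep: the function name `grep_files` is hypothetical, and the demo writes two tiny stand-in files to a temporary directory rather than reading the real corpus.

```python
import re
import tempfile
from pathlib import Path


def grep_files(root, pattern, glob="*"):
    """Yield (path, line_number, line) for each line under root that matches pattern."""
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path, number, line.rstrip("\n")


# Demo on two stand-in files (made-up content, not corpus data).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("WORD: like 1 can\nSYL: l_ay_k\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("WORD: finding 2 alt\n", encoding="utf-8")
    hits = [(p.name, n, line) for p, n, line in grep_files(tmp, r"^WORD: ")]

for name, number, line in hits:
    print(f"{name}:{number}:{line}")
```

To run it on the corpus, pass the AFS directory as `root`, for example `grep_files("/afs/ir/data/linguistic-data/Switchboard/", r"^WORD: finding ")`.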
Demo search
egrep '^SYL: [a-z_]+ [a-z_]*ow.{1,3}m[a-z_]* '
Actual phonological pattern
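To see what the demo pattern accepts, it can be tried in Python's `re` module, which treats this expression the same way egrep does. The pattern anchors on a SYL line, skips the baseform field, and then requires the transcribed field to contain `ow` followed by `m` within one to three characters. The SYL lines below are hypothetical test strings shortened to their leading fields, not corpus data.

```python
import re

# Same pattern as the egrep demo, written without the stray spaces.
SYL_PATTERN = re.compile(r"^SYL: [a-z_]+ [a-z_]*ow.{1,3}m[a-z_]* ")

# Hypothetical SYL lines: baseform, transcribed, syl_structure, stress, length.
lines = [
    "SYL: hh_ow_m hh_ow_m O_N_C P 0.300",   # 'ow' then '_m': matches
    "SYL: r_ow_m r_ow_l_m O_N_C P 0.250",   # 'ow' then '_l_m': matches
    "SYL: l_ay_k l_ay_k O_N_C P 0.258",     # no 'ow': no match
    "SYL: hh_ow_m hh_ow O_N_C P 0.200",     # 'ow...m' only in the baseform: no match
]

matches = [line for line in lines if SYL_PATTERN.search(line)]
for line in matches:
    print(line)
```

Because `.` matches any character, the `{1,3}` window counts the `_` separators as well as phone symbols, so it allows at most one short intervening phone between `ow` and `m`.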