abee87f57c2836c475a88f05652303bd.ppt
- Number of slides: 127
Adding typology to lexicostatistics: a combined approach to language classification ASJP Consortium < Dik Bakker et al. mult. > Language Classification 1
Overview
Project: ASJP (Automated Similarity Judgment Program), started January 2007
Overall goal: automatic reconstruction of language relationships
Basis: distance matrix between individual languages, based on lexical elements
Method: lexicostatistics, i.e. mass comparison of basic lexical items
As in traditional lexicostatistics, but:
1. use of computational algorithms and tools
2. methodology from classification in biology
3. extended by all relevant data available
Caveat
ASJP goal: reconstruction of relationships between languages
NOT: to do better than experts in the classification of areas/groups
BUT:
1. Optimize lexicostatistics on the basis of expert knowledge of well-explored areas
2. Provide method and tools to assess and improve classifications for un(der)explored areas
Overview
Current collaborators: Dik Bakker, David Beck, Oleg Belyaev, Cecil H. Brown, Pamela Brown, Matthew Dryer, Dmitry Egorov, Pattie Epps, Anthony Grant, Eric W. Holman, Hagen Jung, Johann-Mattis List, Robert Mailhammer, André Müller, Uri Tadmor, Matthias Urban, Viveka Velupillai, Søren Wichmann, Kofi Yakpo
Overview: ASJP system
[Diagram, built up over several slides. Core pipeline: LEX, ASJP software (method), distance matrix, CLASSIF software. Sample distance matrix entries: DUTCH ENGLISH 53.3; DUTCH FRENCH 72.7; DUTCH MANDARIN 93.8; …]
[Further components added on later slides: existing expert classifications (ETHN, WALS, EXPRT) with STAT software for EVALUATION and CALIBRATION; GEO GRAPH with a MAP; HIST FACTS; TYPOL DATA]
Today …
[Diagram: LEX and TYPOL DATA, ASJP software, distance matrix, CLASSIF software]
List of basic lexical items
Lexical items
Word list Morris Swadesh (1955): 100 basic meanings
 1. I        21. dog      41. nose     61. die     81. smoke
 2. you      22. louse    42. mouth    62. kill    82. fire
 3. we       23. tree     43. tooth    63. swim    83. ash
 4. this     24. seed     44. tongue   64. fly     84. burn
 5. that     25. leaf     45. claw     65. walk    85. path
 6. who      26. root     46. foot     66. come    86. mountain
 7. what     27. bark     47. knee     67. lie     87. red
 8. not      28. skin     48. hand     68. sit     88. green
 9. all      29. flesh    49. belly    69. stand   89. yellow
10. many     30. blood    50. neck     70. give    90. white
11. one      31. bone     51. breasts  71. say     91. black
12. two      32. grease   52. heart    72. sun     92. night
13. big      33. egg      53. liver    73. moon    93. hot
14. long     34. horn     54. drink    74. star    94. cold
15. small    35. tail     55. eat      75. water   95. full
16. woman    36. feather  56. bite     76. rain    96. new
17. man      37. hair     57. see      77. stone   97. good
18. person   38. head     58. hear     78. sand    98. round
19. fish     39. ear      59. know     79. earth   99. dry
20. bird     40. eye      60. sleep    80. cloud  100. name
Lexical items
Swadesh list: assumptions
- Word in most languages
- Inherited rather than borrowed
- Relatively stable over time
- Easily accessible (fieldwork / grammars)
Languages transcribed to date:
- Over 3500 languages (incl. dialects; around 45% of the languages of the world)
[Figure: languages currently collected]
Lexical items: further reduction
Reduction of the full Swadesh list:
1. Not the complete list, only the most stable items
2. Not full IPA representation, but generalized coding
1. Not the complete list
- Most stable items = least formal variation in well-established genetic groups (Dryer's genera)
- Nichols (1995): lg pairs (word_k = word_k) / all pairs
What is the optimal number … ?
[Plots, built up over several slides: Ethnologue Classification (Goodman-Kruskal) and WALS Classification (Pearson), with items ordered by stability from + to -; the value 40 is marked on one slide]
40 Most Stable
[Slide repeats the Swadesh word list; the marking of the 40 most stable items is not preserved in the text]
Lexical items: transcription
2. NOT full IPA but ASJPcode:
- 7 vowels
- 34 consonants
- All other phonemes to 'closest sound' (automatic)
Abaza (Caucasian):
Meaning   IPA             ASJPcode
PERSON    ʕʷɨʧʼʲʷʕʷɨs     Xw3Cw"yXw3s
LEAF      bɣʲɨ            bxy3
SKIN      ʧʷazʲ           Cwazy
HORN      ʧʼʷɨʕʷa         Cw"3Xwa
NOSE      pɨnʦʼa          p3nc"a
TOOTH     pɨʦ             p3c
Loss of information?
Shown for representative groups:
- ASJPcode is as good as full IPA for separating language families
- It is more accurate than IPA for precise genetic classification (under our current method)
Comparing words and languages
Comparing words
Most successful measure to date: Levenshtein Distance
Levenshtein Distance (LD) = number of transformations (= changes & additions) to get from the shorter form to the longer form
Example: ALT to ASJP takes 3 transformations (L to S, T to J, add P), so LD = 3
1. Normalization: LDN = LD / Lmax (range 0.0 to 1.0)
2. Eliminate 'background noise': LDND = LDN / LDN(different pairs)
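The word-level measures can be sketched in a few lines of Python. This is a minimal illustration, not the ASJP software itself: it uses the standard Levenshtein distance (insertions, deletions and substitutions all cost 1), which gives the same count as the "shorter to longer" wording on the slide, and it assumes that Lmax is the length of the longer of the two words. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-symbol insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """LD normalized by the length of the longer word (assumed Lmax)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("ALT", "ASJP"), ldn("ALT", "ASJP"))
```

For the slide's ALT / ASJP example this gives LD = 3 and LDN = 3/4.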
Classifying languages
LNG SIL I YOU WE ONE TWO
CANTONESE yue Noh neihdeih Nhdeih yat yih
HAINAN_MINNAN nan va lu vaneN zy~a7 no*|no
HAKKA hak Nai Ni Naiteu yit ly~oN|Ni
MANDARIN cmn wo nimen women i el
SUZHOU_WU wuu No nE Sia*nj3 ji7 lia*
A_TONG aot aN niN sa ni
MIKIR mjw ne nEng netum isi hini
TARAON mhu ha* nu* niN kaiN
NAXI nbf N3 nv N3Ng3 d3 5i
CHIANGRAI_MIEN ium yia mei bua yet i
HMONG_DAW mww ku ko pe i o
SUYONG_HMONG mww ko ko pe i au
TAK_HMONG mww ku ko pe i o
…
[Diagram: Swadesh (3500), ASJP, distance matrix]
LG1       LG2               LDND
MANDARIN  MIDDLE_CHINESE    81.75
MANDARIN  OLD_CHINESE       94.30
MANDARIN  SUZHOU_WU         85.87
MANDARIN  DHAMMAI           97.48
MANDARIN  A_TONG            97.91
MANDARIN  KAYAH_LI_EASTERN  94.75
MANDARIN  MIKIR             99.05
MANDARIN  LEPCHA            97.24
MANDARIN  APATANI           92.24
MANDARIN  BENGNI            96.91
MANDARIN  BOKAR             95.28
…
3500 languages ~ 240.000 comp
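A language-level distance of this kind could be computed roughly as follows. The sketch assumes one reading of LDND: the mean LDN over word pairs with the same meaning, divided by the mean LDN over word pairs with different meanings. The helper names are hypothetical, and the three five-word lists are the MANDARIN, HMONG_DAW and SUYONG_HMONG rows of the word-list table above, far too short for real use. The result is a ratio; the slide's values appear to be on a percentage scale, which is an assumption here.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insert, delete, substitute; cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(words_a: dict, words_b: dict) -> float:
    """Mean same-meaning LDN divided by mean different-meaning LDN,
    over the meanings attested in both lists (assumed definition)."""
    shared = sorted(set(words_a) & set(words_b))
    same = [ldn(words_a[m], words_b[m]) for m in shared]
    diff = [ldn(words_a[m], words_b[n])
            for m in shared for n in shared if m != n]
    if not same or not diff or sum(diff) == 0:
        raise ValueError("need at least two shared meanings with distinct words")
    return (sum(same) / len(same)) / (sum(diff) / len(diff))


def distance_matrix(wordlists: dict) -> dict:
    """LDND for every unordered pair of languages."""
    return {(x, y): ldnd(wordlists[x], wordlists[y])
            for x, y in combinations(sorted(wordlists), 2)}


wordlists = {
    "MANDARIN":     {"I": "wo", "YOU": "nimen", "WE": "women", "ONE": "i", "TWO": "el"},
    "HMONG_DAW":    {"I": "ku", "YOU": "ko", "WE": "pe", "ONE": "i", "TWO": "o"},
    "SUYONG_HMONG": {"I": "ko", "YOU": "ko", "WE": "pe", "ONE": "i", "TWO": "au"},
}
matrix = distance_matrix(wordlists)
for pair, value in sorted(matrix.items(), key=lambda kv: kv[1]):
    print(pair, round(value, 3))
```

Even on this toy input the two Hmong varieties come out much closer to each other than either is to Mandarin.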
Processing problems …
Solution: parallel processing
[Diagram: Swadesh (3500), ASJP, distance matrix, MEGA 4 (DNA patterns; http://www.megasoftware.net/), Neighbour Joining]
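The slides hand the distance matrix to MEGA 4 for Neighbour Joining. To show what that step does, here is a minimal topology-only sketch of the textbook Neighbour Joining procedure. It is not MEGA's implementation, it omits branch lengths, and the four labels and their distances are hypothetical.

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Return a tree as nested tuples.

    labels: list of taxon names.
    dist:   dict mapping frozenset({x, y}) to the distance between x and y.
    """
    nodes = list(labels)
    d = dict(dist)                      # work on a copy
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: summed distance from node i to all other current nodes
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i)
             for i in nodes}
        # join the pair that minimizes Q(i, j) = (n - 2) d(i, j) - r(i) - r(j)
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (a, b)
        # distance from the new internal node to every remaining node
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new, k))] = 0.5 * (d[frozenset((a, k))]
                                                + d[frozenset((b, k))]
                                                - d[frozenset((a, b))])
        nodes = [k for k in nodes if k not in (a, b)] + [new]
    return tuple(nodes)


# Hypothetical distances: A and B are close, C and D are close.
toy = {
    frozenset(("A", "B")): 2, frozenset(("C", "D")): 2,
    frozenset(("A", "C")): 5, frozenset(("A", "D")): 5,
    frozenset(("B", "C")): 5, frozenset(("B", "D")): 5,
}
tree = neighbor_joining(["A", "B", "C", "D"], toy)
print(tree)
```

On this toy matrix the procedure groups A with B and C with D.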
SEE COMPLETE TREE-OF-THE-MONTH ON: email.eva.mpg.de/~wichmann/ASJPHomePage
Mayan (34 / 69 Ethn), LDND + MEGA 4
[Tree figure, built up over several slides: < all & only >; the groups cholan, tzeltalan and yucatecan are marked in the tree and set against the Ethnologue/experts grouping (yucatecan, cholan, tzeltalan)]
ASJP and genetic classification
- Method works at a global level
- Often also at the lowest levels
- Refinement necessary at intermediate level
Adding typological data
Trying to improve the fit …
Enrich lexical with typological data:
Haspelmath, M., M. Dryer, D. Gil & B. Comrie (eds) (2005). The World Atlas of Language Structures. Oxford: Oxford University Press.
WALS Online: http://wals.info/
Lexical plus typological data
[Diagram: Swadesh (3500) + WALS (2580; 140 features) = 'SWALSH', then ASJP, distance matrix, tree software; a later slide shows SWALSH (1250)]
Improving the fit
Enrich lexical with typological data:
- NOT 1:1 with ASJP languages
- WALS matrix very unevenly filled (16%); cf. Cysouw (2008), STUF 61.3
Determine most stable features
Feature Stability
Nichols (1995): metric for S(Ftr_k) in G_x: pairs (val_k = val_k) / all pairs
ASJP: metric for stability of Ftr_k:
- Problem: size differences between G
- 'Background noise': U = all pairs (val_k = val_k) / all pairs
- SF_k = ( pairs (val_k = val_k) / all pairs - U ) / (1 - U)
- Normalization makes SF_k values comparable
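A small sketch of how such a stability score might be computed. It rests on an assumed reading of the slide: the first ratio is the share of same-genus language pairs with an identical feature value, and U is the same share among pairs from different genera, so the score is (R - U) / (1 - U). Pooling pairs across genera, the function name and the toy data are all assumptions.

```python
from itertools import combinations


def feature_stability(values: dict, genus: dict) -> float:
    """values: language -> value of one feature; genus: language -> genus.

    Returns (R - U) / (1 - U), where R is the share of same-genus pairs
    that agree on the value and U the share of different-genus pairs
    that agree (the assumed 'background noise').
    """
    related = related_same = unrelated = unrelated_same = 0
    for a, b in combinations(sorted(values), 2):
        same = int(values[a] == values[b])
        if genus[a] == genus[b]:
            related += 1
            related_same += same
        else:
            unrelated += 1
            unrelated_same += same
    if related == 0 or unrelated == 0:
        raise ValueError("need both related and unrelated language pairs")
    r = related_same / related
    u = unrelated_same / unrelated
    if u == 1:
        raise ValueError("feature does not vary between genera")
    return (r - u) / (1 - u)


genus = {"a": "G1", "b": "G1", "c": "G1", "d": "G2", "e": "G2"}
# Perfectly stable toy feature: constant within each genus, different between them.
stable = feature_stability({"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y"}, genus)

genus2 = {"a": "G1", "b": "G1", "c": "G2", "d": "G2"}
# Toy feature that agrees only across genera, never within them.
unstable = feature_stability({"a": "X", "b": "Y", "c": "X", "d": "Y"}, genus2)
print(stable, unstable)
```

A score below zero means that related languages agree less often than unrelated ones, which is how the negative values in the list of least stable features can arise.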
Most stable WALS features
31. Sex-based and Non-sex-based Gender Systems: 0.81
118. Predicative Adjectives: 0.74
30. Number of Genders: 0.73
119. Nominal and Locational Predication
29. Syncretism in Verbal Person/Number Marking
(only one value, 0.71, is preserved for the last two features)
Least stable WALS features
128. Utterance Complement Clauses: 0.07
115. Negative Indefinite Pronouns/Predicate Negation: 0.07
59. Possessive Classification: 0.01
135. Red and Yellow: -0.07
58. Obligatory Possessive Inflection: -0.25
Correlation with Ethnologue
[Plot, built up over several slides: correlation with Ethnologue against WALS features ordered by stability (+ to -), with one curve per minimum number of features (Min ftrs 100, 80, 60, 40, 20); the values 20, 40, 60 and 85 are marked on successive slides]
[Plot: WALS compared with Swadesh 40]
Improving the fit
- Typological variables* do not perform better than lexical ones to establish genetic relationships
- What about a combination?
*WALS!
[Plot, built up over several slides: curves for "Only WALS", "Only Sw 40" and combinations at ratios 85:15, 70:30, 50:50 and 35:65; the value 0.91 is marked on the 50:50 slide]
Ftrs  100   80   60   40   20
Lgs    79  109  139  218  341
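The slides do not say how the lexical and typological data are combined at a given ratio. One simple possibility, shown here purely as an assumption, is a weighted average of two distance matrices that are on the same scale. The function name, the pair keys and the numbers are hypothetical.

```python
def combine_distances(lexical: dict, typological: dict, weight_lexical: float) -> dict:
    """Weighted average of two distance matrices keyed by language pair.

    weight_lexical is the share given to the lexical distance (0.5 for a
    50:50 mix); only pairs present in both matrices are kept.
    """
    if not 0.0 <= weight_lexical <= 1.0:
        raise ValueError("weight_lexical must lie between 0 and 1")
    shared = lexical.keys() & typological.keys()
    return {pair: weight_lexical * lexical[pair]
                  + (1.0 - weight_lexical) * typological[pair]
            for pair in shared}


# Hypothetical distances on a 0-1 scale.
lexical = {("A", "B"): 0.5, ("A", "C"): 1.0}
typological = {("A", "B"): 0.25, ("B", "C"): 0.75}
mixed = combine_distances(lexical, typological, 0.5)
print(mixed)
```

Only the pair covered by both sources survives, which mirrors the point that the WALS and ASJP language samples do not match 1:1.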
Improving the fit
- Typological variables do not perform better than lexical ones to establish genetic relationships
- A combined, balanced approach is superior, but at a much higher cost per language than just lexicostatistics: 84% of WALS to be filled in
- Continue extension/optimization of lexical method
Publications 2008-2009
1. Brown, Cecil H., Eric W. Holman, Søren Wichmann & Viveka Velupillai (2008). Automated classification of the world's languages: a description of the method and preliminary results. Sprachtypologie und Universalienforschung 61: 285-308.
2. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Advances in automated language classification. In A. Arppe, K. Sinnemäki & U. Nikanne (eds), Quantitative Investigations in Theoretical Linguistics. Helsinki: University of Helsinki, 40-43.
3. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Explorations in automated language classification. Folia Linguistica 42(2): 331-354.
4. Bakker, D., A. Müller, V. Velupillai, S. Wichmann, C. H. Brown, P. Brown, D. Egorov, R. Mailhammer, A. Grant & E. W. Holman (2009). Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13: 167-179.
?
ASJP
Overall goal:
- Method + tools for reconstruction of language relationships
Derived goals:
- Critical assessment and refinement of existing classifications
- Classify newly described and unclassified languages
- Search for (ir)regularities in family reconstructions
- Test hypotheses about families
- Experimentally find an optimal dating method
- Automatically detect borrowings