
Adding typology to lexicostatistics: a combined approach to language classification
ASJP Consortium <Dik Bakker et al. mult.>
Language Classification

Overview
Project: ASJP (Automated Similarity Judgment Program), started January 2007
Overall goal: automatic reconstruction of language relationships
Basis: distance matrix between individual languages, based on lexical elements
Method: lexicostatistics (mass comparison of basic lexical items)

As in traditional lexicostatistics, but:
1. use of computational algorithms and tools
2. methodology from classification in biology
3. extended by all relevant data available

Caveat
ASJP goal: reconstruction of relationships between languages
NOT: better than experts in classification of areas/groups
BUT:
1. Optimize lexicostatistics on the basis of expert knowledge on well-explored areas
2. Provide method and tools to assess and improve classifications for un(der)explored areas

Overview
Current collaborators: Dik Bakker, David Beck, Oleg Belyaev, Cecil H. Brown, Pamela Brown, Matthew Dryer, Dmitry Egorov, Pattie Epps, Anthony Grant, Eric W. Holman, Hagen Jung, Johann-Mattis List, Robert Mailhammer, André Müller, Uri Tadmor, Matthias Urban, Viveka Velupillai, Søren Wichmann, Kofi Yakpo

Overview: ASJP system
[Diagram, built up over several slides]
Core pipeline: LEX -> ASJP software (Method) -> distance matrix -> CLASSIF software
Sample distance matrix entries:
DUTCH ENGLISH 53.3
DUTCH FRENCH 72.7
DUTCH MANDARIN 93.8
…
Existing expert classifications (ETHN, WALS, EXPRT) enter through STAT software, for EVALUATION and for CALIBRATION of the method.
Further components added to the diagram: GEOGRAPH (with MAP), HIST FACTS, TYPOL DATA.

Today …
LEX + TYPOL DATA -> ASJP software -> distance matrix -> CLASSIF software

List of basic lexical items

Lexical items
Word list of Morris Swadesh (1955): 100 basic meanings

1. I        21. dog      41. nose     61. die    81. smoke
2. you      22. louse    42. mouth    62. kill   82. fire
3. we       23. tree     43. tooth    63. swim   83. ash
4. this     24. seed     44. tongue   64. fly    84. burn
5. that     25. leaf     45. claw     65. walk   85. path
6. who      26. root     46. foot     66. come   86. mountain
7. what     27. bark     47. knee     67. lie    87. red
8. not      28. skin     48. hand     68. sit    88. green
9. all      29. flesh    49. belly    69. stand  89. yellow
10. many    30. blood    50. neck     70. give   90. white
11. one     31. bone     51. breasts  71. say    91. black
12. two     32. grease   52. heart    72. sun    92. night
13. big     33. egg      53. liver    73. moon   93. hot
14. long    34. horn     54. drink    74. star   94. cold
15. small   35. tail     55. eat      75. water  95. full
16. woman   36. feather  56. bite     76. rain   96. new
17. man     37. hair     57. see      77. stone  97. good
18. person  38. head     58. hear     78. sand   98. round
19. fish    39. ear      59. know     79. earth  99. dry
20. bird    40. eye      60. sleep    80. cloud  100. name

Lexical items
Swadesh list: assumptions
- Word in most languages
- Inherited rather than borrowed
- Relatively stable over time
- Easily accessible (fieldwork / grammars)

Lexical items
Languages transcribed to date:
- Over 3500 languages (incl. dialects; around 45% of the languages of the world)

Languages currently collected
[Map]

Lexical items: further reduction
Reduction of the full Swadesh list:
1. Not the complete list, only the most stable items
2. Not full IPA representation, but generalized coding

1. Not the complete list
- Most stable items = least formal variation in well-established genetic groups (Dryer's genera)
- Nichols (1995): $\dfrac{\text{lg pairs}\ (\mathit{word}_k = \mathit{word}_k)}{\text{all pairs}}$
- What is the optimal number … ?

[Figure series, two panels: Ethnologue Classification (Goodman-Kruskal) and WALS Classification (Pearson); horizontal axis: Stability, from + to -; one slide marks the value 40]

40 Most Stable
[Slide: the 100-item list shown again, with the 40 most stable items marked]

Lexical items: transcription
2. NOT full IPA but ASJPcode:
- 7 vowels
- 34 consonants
- All other phonemes to 'closest sound' (automatic)

Abaza (Caucasian):
Meaning  IPA          ASJPcode
PERSON   ʕʷɨʧʼʲʷʕʷɨs  Xw3Cw"yXw3s
LEAF     bɣʲɨ         bxy3
SKIN     ʧʷazʲ        Cwazy
HORN     ʧʼʷɨʕʷa      Cw"3Xwa
NOSE     pɨnʦʼa       p3nc"a
TOOTH    pɨʦ          p3c

Loss of information?
Shown for representative groups:
- ASJPcode is as good as full IPA for separating language families
- It is more accurate than IPA for precise genetic classification (under our current method)

Comparing words and languages

Comparing words
Most successful measure to date: Levenshtein Distance
Levenshtein Distance (LD) = number of transformations (= changes & additions) to get from the shorter form to the longer form
Example: ALT -> ASJP takes three transformations (xxx), so LD = 3
1. Normalization: $\mathrm{LDN} = \mathrm{LD} / L_{\max}$, range 0.0 to 1.0
2. Eliminate 'background noise': $\mathrm{LDND} = \mathrm{LDN} / \mathrm{LDN}_{\text{different pairs}}$
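
The three measures above (LD, LDN, LDND) can be sketched in Python as follows. This is a minimal illustration and not the ASJP software itself. It assumes that every character of a transcription counts as one symbol, that the language-level LDN is the mean over word pairs with the same meaning, and that the 'background noise' term is the mean LDN over word pairs with different meanings. The function names are hypothetical. The sample forms are taken from the MANDARIN and HMONG_DAW rows of the word list table in this deck.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """LD normalized by the length of the longer form (range 0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(words_a: dict, words_b: dict) -> float:
    """Mean LDN of same-meaning pairs divided by mean LDN of different-meaning pairs.

    Assumption: both dictionaries map a meaning to a single transcribed form.
    """
    shared = sorted(set(words_a) & set(words_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared meanings")
    same = [ldn(words_a[m], words_b[m]) for m in shared]
    diff = [ldn(words_a[m], words_b[n]) for m in shared for n in shared if m != n]
    mean_diff = sum(diff) / len(diff)
    if mean_diff == 0:
        raise ValueError("different-meaning pairs are all identical")
    return (sum(same) / len(same)) / mean_diff


# Forms from the word list table (meanings I, YOU, WE, ONE, TWO).
mandarin = {"I": "wo", "YOU": "nimen", "WE": "women", "ONE": "i", "TWO": "el"}
hmong_daw = {"I": "ku", "YOU": "ko", "WE": "pe", "ONE": "i", "TWO": "o"}

if __name__ == "__main__":
    print(levenshtein("ALT", "ASJP"))           # the example from the slide
    print(100 * ldnd(mandarin, hmong_daw))      # scaled to percent, as in the distance tables
```

Dividing by the different-meaning average corrects for chance resemblance caused by overlapping sound inventories, which is why LDND can exceed 1.0 (100 on the percent scale) for unrelated languages.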

Classifying languages

LNG  SIL  I  YOU  WE  ONE  TWO
CANTONESE  yue  Noh  neihdeih  Nhdeih  yat  yih
HAINAN_MINNAN  nan  va  lu  vaneN  zy~a7  no*|no
HAKKA  hak  Nai  Ni  Naiteu  yit  ly~oN|Ni
MANDARIN  cmn  wo  nimen  women  i  el
SUZHOU_WU  wuu  No  nE  Sia*nj3  ji7  lia*
A_TONG  aot  aN  niN  sa  ni
MIKIR  mjw  ne  nEng  netum  isi  hini
TARAON  mhu  ha*  nu*  niN  kaiN
NAXI  nbf  N3  nv  N3Ng3  d3  5i
CHIANGRAI_MIEN  ium  yia  mei  bua  yet  i
HMONG_DAW  mww  ku  ko  pe  i  o
SUYONG_HMONG  mww  ko  ko  pe  i  au
TAK_HMONG  mww  ku  ko  pe  i  o
…

Swadesh (3500) -> ASJP -> distance matrix

LG1       LG2               LDND
MANDARIN  MIDDLE_CHINESE    81.75
MANDARIN  OLD_CHINESE       94.30
MANDARIN  SUZHOU_WU         85.87
MANDARIN  DHAMMAI           97.48
MANDARIN  A_TONG            97.91
MANDARIN  KAYAH_LI_EASTERN  94.75
MANDARIN  MIKIR             99.05
MANDARIN  LEPCHA            97.24
MANDARIN  APATANI           92.24
MANDARIN  BENGNI            96.91
MANDARIN  BOKAR             95.28
…
3500 languages ~ 240.000 comp

Processing problems …
Solution: parallel processing

Swadesh (3500) -> ASJP -> distance matrix -> MEGA4 (Neighbour Joining)
MEGA4: software for DNA patterns, http://www.megasoftware.net/

SEE COMPLETE TREE-OF-THE-MONTH ON: email.eva.mpg.de/~wichmann/ASJPHomePage

Mayan (34 / 69 Ethn), LDND + MEGA4
[Tree figure, built up over several slides. Labels shown: <all & only>; cholan; tzeltalan; yucatecan. Last slide: Ethnologue/experts: yucatecan, cholan, tzeltalan]

ASJP and genetic classification
- Method works at a global level
- Often also at the lowest levels
- Refinement necessary at the intermediate level

Adding typological data

Trying to improve the fit …
Enrich lexical with typological data:
Haspelmath, M., M. Dryer, D. Gil & B. Comrie (eds) (2005). The World Atlas of Language Structures. Oxford: Oxford University Press.
WALS Online: http://wals.info/

Lexical plus typological data
Swadesh (3500) + WALS (2580) <140 FEATURES> = 'SWALSH' -> ASJP -> distance matrix -> TREE SFTW

Improving the fit
Enrich lexical with typological data:
- NOT 1:1 with ASJP languages: SWALSH (1250)
- WALS matrix very UNevenly filled (16%), cf. Cysouw (2008), STUF 61.3
- Determine most stable features

Feature Stability
Nichols (1995): metric for $S(\mathit{Ftr}_k)$ in $G_x$:
$$\frac{\text{pairs}\ (\mathit{val}_k = \mathit{val}_k)}{\text{all pairs}}$$
ASJP: metric for stability of $\mathit{Ftr}_k$:
1. For $G_x$: the same ratio, but there are size differences between $G$
2. $R_k = \dfrac{\text{pairs}\ (\mathit{val}_k = \mathit{val}_k)}{\text{all pairs}}$
3. 'Background noise': $U = \dfrac{\text{all pairs}\ (\mathit{val}_k = \mathit{val}_k)}{\text{all pairs}}$, subtracted to give $R_k - U$
4. Normalization, so that $SF_k$ is comparable: $SF_k = \dfrac{R_k - U}{1 - U}$
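
A minimal Python sketch of the normalized stability metric follows. It is an illustration under assumptions, not the consortium's code: the first ratio is taken over language pairs from the same genus, the background ratio U over all language pairs, and pairs in which either language lacks a value are skipped. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def same_value_rate(pairs, values):
    """Share of language pairs whose feature values are identical."""
    compared = [(a, b) for a, b in pairs if a in values and b in values]
    if not compared:
        raise ValueError("no comparable pairs")
    agree = sum(values[a] == values[b] for a, b in compared)
    return agree / len(compared)


def stability(values, genus_of):
    """Return (R - U) / (1 - U) for one feature.

    values:   language -> value of the feature (languages without a value are absent)
    genus_of: language -> genus label
    """
    langs = sorted(values)
    all_pairs = list(combinations(langs, 2))
    # Assumption: 'related' means belonging to the same genus.
    related = [(a, b) for a, b in all_pairs if genus_of[a] == genus_of[b]]
    r = same_value_rate(related, values)
    u = same_value_rate(all_pairs, values)
    if u == 1.0:
        raise ValueError("feature has the same value everywhere; stability undefined")
    return (r - u) / (1.0 - u)


# Hypothetical toy data: two genera with two languages each.
genus_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
stable_feature = {"a1": "x", "a2": "x", "b1": "y", "b2": "y"}
unstable_feature = {"a1": "x", "a2": "y", "b1": "x", "b2": "y"}

if __name__ == "__main__":
    print(stability(stable_feature, genus_of))
    print(stability(unstable_feature, genus_of))
```

A feature whose value is shared less often inside genera than across the whole sample gets a negative score, which matches the negative values reported for the least stable WALS features.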

Most stable WALS features
31. Sex-based and Non-sex-based Gender Systems  0.81
118. Predicative Adjectives  0.74
30. Number of Genders  0.73
119. Nominal and Locational Predication
29. Syncretism in Verbal Person/Number Marking  0.71

Most unstable WALS features
128. Utterance Complement Clauses  0.07
115. Negative Indefinite Pronouns/Predicate Negation  0.07
59. Possessive Classification  0.01
135. Red and Yellow  -0.07
58. Obligatory Possessive Inflection  -0.25

Correlation with Ethnologue
[Figure series: curves labelled Min ftrs 100, 80, 60, 40, 20, added one per slide; horizontal axis: Stability, from + to -; later slides mark the values 20, 40, 60 and 85; the final slides are labelled WALS and WALS / Swadesh 40]

Improving the fit
Typological variables* do not perform better than lexical ones to establish genetic relationships
*WALS!
What about a combination?

[Figure series: horizontal axis from "Only WALS" to "Only Sw 40"]
Ftrs  100  80   60   40   20
Lgs   79   109  139  218  341
Mixing ratios marked on successive slides: 85:15, 70:30, 50:50 (with the value 0.91), 35:65
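
One way to read the mixing ratios is as weights in a linear combination of a lexical and a typological distance. The Python sketch below shows that reading only. The slides do not state the formula, so the linear blend, the typological distance (share of differing values among features attested in both languages) and all names are assumptions. Both input distances are assumed to lie on the same 0 to 1 scale.

```python
def typological_distance(feats_a: dict, feats_b: dict) -> float:
    """Share of differing values among the features attested in both languages."""
    shared = [f for f in feats_a if f in feats_b]
    if not shared:
        raise ValueError("no feature is attested in both languages")
    differing = sum(feats_a[f] != feats_b[f] for f in shared)
    return differing / len(shared)


def combined_distance(lexical: float, typological: float, lexical_weight: float) -> float:
    """Linear blend of two distances; lexical_weight=0.5 corresponds to a 50:50 mix."""
    if not 0.0 <= lexical_weight <= 1.0:
        raise ValueError("lexical_weight must lie between 0 and 1")
    return lexical_weight * lexical + (1.0 - lexical_weight) * typological


# Hypothetical feature values, keyed by WALS feature number.
lang_a = {31: "sex-based", 30: "two", 118: "verbal"}
lang_b = {31: "sex-based", 30: "three", 29: "syncretic"}

if __name__ == "__main__":
    typ = typological_distance(lang_a, lang_b)
    print(combined_distance(lexical=0.8, typological=typ, lexical_weight=0.5))
```

Restricting the comparison to features attested in both languages is one response to the sparsely filled WALS matrix. A minimum number of shared features, as in the "Ftrs" thresholds, could be enforced before calling `typological_distance`.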

Improving the fit
- Typological variables do not perform better than lexical ones to establish genetic relationships
- A combined, balanced approach is superior, but at a much higher cost per language than just lexicostatistics: 84% of WALS still to be filled in
- Continue extension/optimization of the lexical method

Publications 2008-2009
1. Brown, Cecil H., Eric W. Holman, Søren Wichmann & Viveka Velupillai (2008). Automated classification of the world's languages: a description of the method and preliminary results. Sprachtypologie und Universalienforschung 61: 285-308.
2. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Advances in automated language classification. In A. Arppe, K. Sinnemäki & U. Nikanne (eds), Quantitative Investigations in Theoretical Linguistics. Helsinki: University of Helsinki, 40-43.
3. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D. Bakker (2008). Explorations in automated language classification. Folia Linguistica 42-2, 331-354.
4. Bakker, D., A. Müller, V. Velupillai, S. Wichmann, C. H. Brown, P. Brown, D. Egorov, R. Mailhammer, A. Grant & E. W. Holman (2009). Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13, 167-179.

?

ASJP
Overall goal:
- Method + tools for reconstruction of language relationships
Derived goals:
- Critical assessment and refinement of existing classifications
- Classify newly described and unclassified languages
- Search for (ir)regularities in family reconstructions
- Test hypotheses about families
- Experimentally find an optimal dating method
- Automatically detect borrowings