- Number of slides: 33
STRUCTURE AND FREQUENCY OF LEXICAL SEMANTIC CLASSES
Paola Merlo, University of Geneva
Suzanne Stevenson, University of Toronto
What is the role of quantitative approaches?
• Can quantitative investigations be the subject matter of linguistic research, or are they only methodological tools?
• Investigating the relationship between richly structured representations and distributional properties of language
  - provides richer data
  - supports falsifiable and predictive reasoning within weaker theories
Case Study: Verb Classes
Manner of Motion
  TRANS  .23  The rider raced the horse past the barn
  INTR   .77  The horse raced past the barn
Change of State
  TRANS  .40  The cook melted the butter
  INTR   .60  The butter melted
Creation/Transformation
  TRANS  .62  The contractors built the house
  INTR   .38  The contractors built all summer
Quantitative Investigations
Observation: Different lexical semantic classes have different patterns of frequency distributions
• Q1: Are these distributional properties related to other underlying properties?
• Q2: Are the differences in distribution strong enough to support generalisation to new verbs and other verb classes?
• Q3: Does the relation between underlying properties and frequency hold typologically?
Frequency and thematic roles
• Different lexical semantic classes show different frequency distributions in the use of the transitive construction
• The difference in the frequency of the transitive use is related to different thematic assignments
English Verb Classes
Manner of Motion
  The rider (Causal Agent) raced the horse (Agent) past the barn
  The horse (Agent) raced past the barn
Change of State
  The cook (Causal Agent) melted the butter (Theme)
  The butter (Theme) melted
Creation/Transformation
  The contractors (Agent) built the house (Theme)
  The contractors (Agent) built all summer
Transitive Use
  Classes  Subject         Object  Example
  MoM      (Causal) Agent  Agent   The jockey raced the horse
  CoS      (Causal) Agent  Theme   The cook melted the butter
  C/T      Agent           Theme   The workers built the house
• Transitivity by causation: MoM, CoS
• Agentive object: MoM
Relationship between frequency and transitivity
• Transitivity by causation: MoM, CoS
  - Greater complexity, two events
• Agentive object: MoM (transitive unergative)
  - Infrequent in English: only MoM and SE
  - Infrequent typologically (* Italian, French, German, Portuguese, and Czech; Vietnamese only comitative)
  - Difficult to process (Stevenson and Merlo 97; Filip et al. CUNY 98)
Explains frequency of transitive use: MoM < CoS < C/T
Other frequency facts
Are there other properties specific to verb classes that we can expect to surface as statistical differences?
Animacy
  Classes  Subject of Transitive  Subject of Intransitive  Example
  MoM      (Causal) Agent         Agent                    The jockey raced the horse / The horse raced
  CoS      (Causal) Agent         Theme                    The cook melted the butter / The butter melted
  C/T      Agent                  Agent                    The workers built the house / The workers built
Themes are more likely to be inanimate
Animacy and thematic hierarchies
Thematic hierarchy: AGENT > THEME
Animacy hierarchy: 1,2 > 3,Proper > Human > Animate > Inanimate
Harmonic Alignment
  1,2/AG > 3,Proper/AG > Human/AG > Animate/AG > Inanimate/AG
  1,2/TH < 3,Proper/TH
Causative use
  Classes  Object of Transitive  Subject of Intransitive  Example
  MoM      Agent                 Agent                    The jockey raced the horse / The horse raced
  CoS      Theme                 Theme                    The cook melted the butter / The butter melted
  C/T      Theme                 Agent                    No causative alternation
• Transitivity by causation: MoM, CoS
  - Causer subject; same thematic role for the intransitive subject and the transitive object
• Expected frequency of overlap: MoM, C/T < CoS
Empirical Validation
How do we verify empirically that the distributional properties are as predicted by the verb class representation?
The properties we have hypothesized are abstract; how do we count them in a sufficiently large corpus?
  - by hand, sampling
  - automatically, by approximation with indicators
Data Collection – Materials
Verbs
  - Manner of motion (20): jump, march
  - Change of state (19): open, explode
  - Creation/Transformation/Performance (C/T) (20): played, painted
Verb Form
  - "-ed" form assumed to be representative
Corpora
  - 65 million words: tagged Brown + tagged WSJ corpus (LDC)
  - 29 million words: parsed WSJ (LDC corpus, Collins 97 parser)
Data Collection -- Method
TRANS
  - A verb token immediately followed by a potential object is counted as transitive, otherwise as intransitive.
  - Potential object = closest nominal group after the verb token.
  - (or also count passive or past participle frequency)
CAUS
  - Calculate the overlap of the multiset of subjects and the multiset of objects.
  - Take the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets.
ANIM
  - Ratio of occurrences of pronoun subjects to all subjects
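The three counts on the Method slide can be sketched as below. This is only an illustration, assuming the subjects and objects of a verb have already been extracted as lists of head words. The function names and the pronoun list are hypothetical, and reading "overlap" as a plain multiset intersection (minimum count per word) is an assumption, since the slide does not define it further.

```python
from collections import Counter

# Hypothetical pronoun list used as the animacy indicator (an assumption).
PRONOUNS = frozenset({"i", "we", "you", "he", "she", "they"})

def trans_ratio(n_transitive, n_intransitive):
    """TRANS: share of verb tokens counted as transitive."""
    total = n_transitive + n_intransitive
    return n_transitive / total if total else 0.0

def caus_ratio(subjects, objects):
    """CAUS: |overlap| / (|subjects| + |objects|).

    Assumption: overlap is the multiset intersection, i.e. each word
    counts as often as it appears in BOTH multisets (minimum count).
    """
    subj, obj = Counter(subjects), Counter(objects)
    overlap = subj & obj  # multiset intersection
    denom = sum(subj.values()) + sum(obj.values())
    return sum(overlap.values()) / denom if denom else 0.0

def anim_ratio(subjects):
    """ANIM: share of subjects that are pronouns."""
    if not subjects:
        return 0.0
    hits = sum(1 for s in subjects if s.lower() in PRONOUNS)
    return hits / len(subjects)
```

For example, `caus_ratio(["butter", "cook", "butter"], ["butter", "ice"])` has one shared occurrence of "butter" out of five subject and object tokens in total.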
Statistical Analysis of the Data
Mean relative frequencies
         TRANS  CAUS  ANIM
  MoM    .23    .00   .25
  CoS    .40    .12   .07
  C/T    .62    .04   .15
All statistically significant at p < .01
Conclusions
Answer to Q1
  - different lexical semantic classes have different frequency distributions of properties systematically related to the verb's thematic assignments
Generalising
How well do these distributional properties generalise
  - across verbs
  - across classes
  - across languages?
The Classification Problem
The Given
  - Statistics reflecting thematic information about a given set of verb classes
The Goal
  - Automatically classify unseen verbs
Experimental Setup
Materials
  - Vector template: [verb, TRANS, …, CAUS, ANIM, class]
  - Example: [open, .69, …, .16, .36, CoS]
Method
  - Learner: C5.0 (decision tree induction algorithm)
  - Training/Testing: 10-fold cross-validation repeated 50 times
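The resampling scheme named on this slide (10-fold cross-validation repeated 50 times) can be sketched as follows. This shows only how the verb vectors might be split into training and test sets; it does not reproduce C5.0 itself. The function name, the fixed seed, and the unstratified shuffling are assumptions made for illustration.

```python
import random

def repeated_kfold(n_items, k=10, repeats=50, seed=0):
    """Yield (train_idx, test_idx) index lists for repeated k-fold CV.

    Each repeat reshuffles the items and cuts them into k folds;
    every fold serves once as the held-out test set.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    for _ in range(repeats):
        order = list(range(n_items))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test

# 59 verbs (20 + 19 + 20) give 10 x 50 = 500 train/test splits.
splits = list(repeated_kfold(59))
```

A learner would be trained on the vectors indexed by `train` and scored on those indexed by `test`, with accuracy averaged over all splits.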
Results
• Overall results: accuracy 69.8% (baseline 33.9%, expert upper bound 86.5%)
  - 54% reduction in error rate on previously unseen verbs
  - (recent extensions range from 62% to 82% accuracy)
• Effectiveness of frequency distributions
  - All distributions are useful in classification
• Class-by-class accuracy
  - MoM verbs are most accurately classified
• Analysis of errors
  - TRANS sharpens the 3-way distinction
  - ANIM particularly helpful in discriminating CoS
Relation between frequencies and thematic assignments is confirmed
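The 54% figure follows from the accuracy and baseline reported on this slide if "reduction in error rate" is read as the relative drop in error against the baseline. That reading is an assumption, and the helper name below is hypothetical.

```python
def error_rate_reduction(accuracy, baseline):
    """Relative reduction in error rate; both arguments are percentages."""
    baseline_error = 100.0 - baseline   # error of the baseline classifier
    model_error = 100.0 - accuracy      # error of the trained classifier
    return (baseline_error - model_error) / baseline_error

# Figures from the slide: 69.8% accuracy against a 33.9% baseline.
reduction = error_rate_reduction(69.8, 33.9)
```

With these inputs the baseline error is 66.1 and the classifier error is 30.2, so the reduction is 35.9 / 66.1, or about 54%.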
Generalising to a new class
• New class: Psychological State Verbs
• New thematic roles: Experiencer, Stimulus
  - Example: The rich (Experiencer) love money (Stimulus)
  - Example: The rich (Experiencer) love too
• Properties: TRANS, CAUS, ANIM, and PROG (use of the progressive: stative/non-stative)
• Results: 74.6% accuracy (baseline 57%)
  - TRANS, CAUS, ANIM best features
Discussion
• Relationship between frequencies and thematic properties holds across classes
• Some specific frequency distributions carry across thematic roles
Discovery: We do not need to investigate new frequency distributions for each new class
Conjecture: Thematic roles are decomposed into more primitive features
Multi-lingual Generalisations
• Accurate investigation of relation between grammar and frequency requires
  - a well-founded theory of lexical representation
  - a distributional analysis of language
• Multi-linguality provides
  - abstract, general level of linguistic description
  - more data
• Greater coverage and accuracy are possible by looking at several languages
Multi-lingual Generalisations
• Extension of mono-lingual method to a new language (Italian)
  - Shows similarities in the relations between frequency distributions and thematic relations across languages
  - Extends coverage to new languages
• Extension to the use of multi-lingual data to classify verbs in a given language (Chinese and English data to classify English verbs)
  - Shows that surface differences across languages are related to a similar underlying representation
  - Improves accuracy in the classification of a given language
Exploiting similarities: Extension to Italian (Merlo, Stevenson, Tsang, and Allaria 2002; Allaria 2001)
Verbs: 20 CoS, 20 C/T, 19 Psych
Properties: TRANS, CAUS, ANIM, PROG
Corpus: PAROLE, 22 million words (CNR, Pisa)
Counts: relative frequencies, hand counts (exact)
Results
  Properties            Acc%
  TRANS CAUS ANIM PROG  85.4
  TRANS (CAUS) ANIM     86.4
• 79% reduction in error rate on unseen verbs
  - TRANS, ANIM best
• Relationship between frequencies and thematic properties holds across languages
Leveraging Cross-language Differences (Tsang, Stevenson and Merlo, 2002)
• What is abstract/underlying in one language might be explicit in another
• Revealing an underlying common/similar classification
  - e.g. causative forms in Chinese are morphologically marked
Data from several languages classify one language
  (diagram labels: Training, Testing; Chinese, English)
Materials and Method
English verb classes: MoM, CoS, C/T, 20 verbs in each class
English properties: TRANS, PASS, VBN, CAUS, ANIM
Chinese translations of the verbs (several)
Counts of new frequencies adapted to Chinese: relative frequencies of
  - POS tags (indication of subcategorization and stative/active)
  - passive particle
  - periphrastic causative particle
Materials and Method
English data from BNC (tagged and chunked)
Chinese data from Mandarin News (165 million characters)
Learning: C5.0 (decision tree induction)
Training/Testing: 10-fold cross-validation, 50 repeats
Results
• Best result in classification of English verbs: combination of Chinese and English frequencies
  - ANIM, TRANS, CKIP: 83.5% accuracy (English frequencies: 67.6%)
• Same or at least similar underlying abstract classification; otherwise the different views would make the classifier diverge
• Advantage of working at different levels of description
Conclusions
• Distributional properties are correlated with thematic properties for several verb classes, several thematic roles, several languages
Relevant for
  - Notion of verb class: point in a multi-dimensional space?
  - Representation and inventory of thematic roles
  - Language acquisition studies: what are the properties necessary for learning verb meaning? (Gillette et al. 99)
Thank you to our students
  Gianluca Allaria (Geneva)
  Eva Esteve Ferrer (Geneva)
  Eric Joanis (Toronto)
  Vivian Tsang (Toronto)
THANK YOU


