  • Number of slides: 61

Modelli simulativi nelle Scienze Cognitive (Simulation Models in the Cognitive Sciences)
The lexicon: linguistic models, WordNet, lexical acquisition
Massimo Poesio
2004/05 Modelli simulativi 1

PART I: LEXICON AND LEXICAL SEMANTICS
WORDNET

What’s in a lexicon
• A lexicon is a repository of lexical knowledge
• The simplest form of lexicon: a list of words
• But even for English, let alone languages with a more complex morphology such as Italian, it makes sense to split WORD FORMS from LEXICAL ENTRIES or LEXEMEs:
  LEXEME: BANK, POS: N
  WORD: BANKS, LEXEME: BANK, SYN: [NUM: PLUR]
• And lexical knowledge also includes information about the MEANING of words
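
The split between word forms and lexemes can be pictured as two small tables, one keyed by surface form and one by lexeme. The sketch below is only illustrative: the class names `Lexeme` and `WordForm`, the field names, the `lookup` helper and the singular entry for "bank" are assumptions added for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lexeme:
    name: str  # e.g. "BANK"
    pos: str   # part of speech, e.g. "N"

@dataclass(frozen=True)
class WordForm:
    form: str         # surface string, e.g. "banks"
    lexeme: str       # name of the lexeme this form realizes
    syn: tuple = ()   # syntactic features as (key, value) pairs

LEXEMES = {"BANK": Lexeme("BANK", "N")}

WORD_FORMS = {
    "banks": WordForm("banks", "BANK", (("NUM", "PLUR"),)),
    # Hypothetical extra entry, not on the slide:
    "bank": WordForm("bank", "BANK", (("NUM", "SING"),)),
}

def lookup(form):
    """Map a surface form to its lexeme and its syntactic features."""
    word_form = WORD_FORMS[form.lower()]
    return LEXEMES[word_form.lexeme], dict(word_form.syn)
```

Keeping part of speech on the lexeme and number on the word form means that information shared by "bank" and "banks" is stored only once.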

Meaning ….
• Characterizing the meaning of words is not easy
• Most of the methods considered in these lectures characterize the meaning of a word by stating its relations with other words
• This method, however, doesn’t say much about what the word ACTUALLY means (e.g., what you can do with a car)

An example of a lexical entry: VICINO (from it.wiktionary.org)
vicino, noun m (vicina f, vicini pl m, vicine pl f)
  1. One who lives next door. (“I miei vicini vengono da Frosinone”: “My neighbours come from Frosinone”)
vicino, adjective m (vicina f, vicini pl m, vicine pl f)
  (“La più vicina stella a neutroni è RX J185635-3754”: “The nearest neutron star is RX J185635-3754”)
vicino, adverb (invariable)
  (“Itunes visto da vicino”: “Itunes seen up close”)

Lexical resources for computers: MACHINE READABLE DICTIONARIES
• A traditional DICTIONARY is a database containing information about
  • the PRONUNCIATION of a certain word
  • its possible PARTS of SPEECH
  • its possible SENSES (or MEANINGS)
• In recent years, most dictionaries have appeared in Machine Readable form (MRD)
  • English: Oxford English Dictionary, Collins, Longman Dictionary of Contemporary English (LDOCE)
  • Italian: Garzanti, Zanichelli, Paravia, it.wiktionary.org

An example LEXICAL ENTRY from a machine-readable dictionary: STOCK, from the LDOCE
0100 a supply (of something) for use: a good stock of food
0200 goods for sale: Some of the stock is being taken without being paid for
0300 the thick part of a tree trunk
0400 (a) a piece of wood used as a support or handle, as for a gun or tool (b) the piece which goes across the top of an ANCHOR^1 (1) from side to side
0500 (a) a plant from which CUTTINGs are grown (b) a stem onto which another plant is GRAFTed
0600 a group of animals used for breeding
0700 farm animals usu. cattle; LIVESTOCK
0800 a family line, esp. of the stated character
0900 money lent to a government at a fixed rate of interest
1000 the money (CAPITAL) owned by a company, divided into SHAREs
1100 a type of garden flower with a sweet smell
1200 a liquid made from the juices of meat, bones, etc., used in cooking
…

Homonymy
• Word-strings like STOCK are used to express apparently unrelated senses / meanings, even in contexts in which their part of speech has been determined
• Other well-known examples: BANK, LIME, RIGHT, SET, SCALE
• Italian: CALCIO, OBBIETTIVO
• An example of the problems homonymy may cause for IR systems: search for 'West Bank' with Google

CALCIO, from “Il grande dizionario Garzanti”
calcio 1 [càl-cio] n. m.
  1. a blow given with the foot or the paw; a kick: dare, assestare, ricevere un ~ (to give, land, receive a kick)
  2. (sport) a game played between two teams of eleven players each …
  3. in football, a kick given to the ball: ~ di punizione (free kick) …, ~ di rigore (penalty kick) …, ~ d’angolo (corner kick) …, ~ piazzato (place kick)
calcio 2 the lower part of the stock of a rifle … derived from Lat. calx calcis …
calcio 3 the chemical element whose symbol is Ca; an alkaline earth metal …

Homonymy in an MRD for Italian (ItalWordNet)
obbiettivo, noun
  [1] - goal of a military operation (obbiettivo [1], obiettivo [1])
  [2] - target in artillery fire (obbiettivo [2], obiettivo [2])
  [4] - system of lenses for projecting the real image of an object (obbiettivo [4], obiettivo [4])

Homonymy and machine translation

Meaning in MRDs, 2: SYNONYMY
• Two words are SYNONYMS if they have the same meaning at least in some contexts
  • E.g., PRICE and FARE; CHEAP and INEXPENSIVE; LAPTOP and NOTEBOOK; HOME and HOUSE
  • I’m looking for a CHEAP FLIGHT / INEXPENSIVE FLIGHT
• From Roget’s thesaurus: OBLITERATION, erasure, cancellation, deletion
• But few words are truly synonymous in ALL contexts:
  • I wanna go HOME / ?? I wanna go HOUSE
  • The flight was CANCELLED / ?? OBLITERATED / ??? DELETED
• Knowing about synonyms may help in IR:
  • NOTEBOOK (get LAPTOPs as well)
  • CHEAP PRICE (get INEXPENSIVE FARE)

Synonymy in Italian
scorza, noun
  [1] - (corteccia [1], scorza [1])
  [2] - outer part, covering of fruits (buccia [1], scorza [2])
  [4] - (scorza [4]) "sotto la sua scorza scortese si nasconde un animo nobile" ("beneath his rude exterior hides a noble soul")

Problems and limitations of MRDs
• Identifying distinct senses is always difficult: sense distinctions are often subjective
• Definitions are often circular
• Very limited characterization of the meaning of words

Homonymy vs polysemy
0100 a supply (of something) for use: a good stock of food
0200 goods for sale: Some of the stock is being taken without being paid for
0300 the thick part of a tree trunk
0400 (a) a piece of wood used as a support or handle, as for a gun or tool (b) the piece which goes across the top of an ANCHOR^1 (1) from side to side
0500 (a) a plant from which CUTTINGs are grown (b) a stem onto which another plant is GRAFTed
0600 a group of animals used for breeding
0700 farm animals usu. cattle; LIVESTOCK
0800 a family line, esp. of the stated character
0900 money lent to a government at a fixed rate of interest
1000 the money (CAPITAL) owned by a company, divided into SHAREs
1100 a type of garden flower with a sweet smell
1200 a liquid made from the juices of meat, bones, etc., used in cooking
…

POLYSEMY vs HOMONYMY
• In cases like BANK, it’s fairly easy to identify two distinct senses (the etymology is also different). But in other cases, the distinctions are more questionable
  • E.g., senses 0100 and 0200 of stock are clearly related, like 0600 and 0700, or 0900 and 1000
• In some cases, syntactic tests may help. E.g., KEEP (Hirst, 1987):
  • Ross KEPT staring at Nadia’s decolletage
  • Nadia KEPT calm and made a cutting remark
  • Ross wrote of his embarrassment in the diary that he KEPT.
• POLYSEMOUS WORDS: meanings are related to each other
  • Cf. human’s foot vs. mountain’s foot
• In general, the distinction between HOMONYMY and POLYSEMY is not always easy (especially with VERBS)

Other aspects of lexical meaning not captured by MRDs
• Other semantic relations: HYPONYMY, ANTONYMY
• A lot of other information typically considered part of ENCYCLOPEDIAs:
  • Trees grow bark and twigs
  • Adult trees are much taller than human beings

Hyponymy and Hypernymy
• HYPONYMY is the relation between a subclass and a superclass:
  • CAR and VEHICLE
  • DOG and ANIMAL
  • BUNGALOW and HOUSE
• Generally speaking, a hyponymy relation holds between X and Y whenever it is possible to substitute Y for X: That is a X -> That is a Y
  • E.g., That is a CAR -> That is a VEHICLE
• HYPERNYMY is the opposite relation
• Knowledge about TAXONOMIES is useful to classify web pages
  • E.g., Semantic Web
  • Automatically (e.g., Udo Kruschwitz’s system)
• This information is not generally contained in MRDs
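
The substitution test can be read as a lookup in a taxonomy: X is a hyponym of Y if Y is reachable from X by following superclass links. The sketch below is a toy illustration; the dictionary `HYPERNYM_OF` and the function names are assumptions, and only the three pairs from the slide are encoded.

```python
# Toy taxonomy: each word points to its direct hypernym (superclass).
HYPERNYM_OF = {
    "car": "vehicle",
    "dog": "animal",
    "bungalow": "house",
}

def hypernym_chain(word):
    """Return the hypernyms of `word`, nearest first."""
    chain = []
    seen = {word}
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        if word in seen:  # guard against accidental cycles in the data
            break
        seen.add(word)
        chain.append(word)
    return chain

def is_hyponym_of(x, y):
    """True if 'That is a X' licenses 'That is a Y' in this taxonomy."""
    return y in hypernym_chain(x)
```

The relation is directional: `is_hyponym_of("car", "vehicle")` holds, while the reverse does not.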

The organization of the lexicon
[Figure: three linked layers]
• WORD-FORMS: “eat”, “eats”, “ate”, “eaten”
• LEXEMES: EAT-LEX-1
• SENSES: eat0600, eat0700

The organization of the lexicon
[Figure: three linked layers]
• WORD-STRINGS: “stock”
• LEXEMES: STOCK-LEX-1, STOCK-LEX-2, STOCK-LEX-3
• SENSES: stock0100, stock0200, stock0600, stock0700, stock0900, stock1000

Synonymy
[Figure: three linked layers]
• WORD-STRINGS: “cheap”, “inexpensive”
• LEXEMES: CHEAP-LEX-1, CHEAP-LEX-2, …, INEXP-LEX-3
• SENSES: cheap0100, …, cheapXXXX, inexp0900, inexpYYYY

A more advanced lexical resource: WordNet
• A lexical database created at Princeton
• Freely available for research from the Princeton site: http://www.cogsci.princeton.edu/~wn/
• Information about a variety of SEMANTIC RELATIONS
• Three sub-databases (supported by psychological research as early as (Fillenbaum and Jones, 1965))
  • NOUNs
  • VERBS
  • ADJECTIVES and ADVERBS
• Each database is organized around SYNSETS

The noun database
• About 90,000 forms, 116,000 senses
• Relations:
  hypernym     breakfast -> meal
  hyponym      meal -> lunch
  has-member   faculty -> professor
  member-of    copilot -> crew
  has-part     table -> leg
  part-of      course -> meal
  antonym      leader -> follower

Synsets
• Senses (or `lexicalized concepts’) are represented in WordNet by the set of words that can be used in AT LEAST ONE CONTEXT to express that sense / lexicalized concept: the SYNSET
• E.g., {chump, fish, fool, gull, mark, patsy, fall guy, sucker, shlemiel, soft touch, mug} (gloss: person who is gullible and easy to take advantage of)
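
A synset can be modelled as a set of words paired with a gloss; two words count as synonyms if some synset contains both. This is a minimal sketch with a single synset taken from the example above. The names `SYNSETS`, `senses_of` and `are_synonyms` are hypothetical and do not reflect WordNet's actual file format or API.

```python
# One synset from the example, stored as (set of words, gloss).
SYNSETS = [
    (
        frozenset({"chump", "fish", "fool", "gull", "mark", "patsy",
                   "fall guy", "sucker", "shlemiel", "soft touch", "mug"}),
        "person who is gullible and easy to take advantage of",
    ),
]

def senses_of(word):
    """Return the glosses of every synset that contains `word`."""
    return [gloss for words, gloss in SYNSETS if word in words]

def are_synonyms(w1, w2):
    """True if at least one synset contains both words."""
    return any(w1 in words and w2 in words for words, _ in SYNSETS)
```

With more synsets in the list, a word such as "fish" would return one gloss per sense, which is how a single word-string comes to have several meanings in this representation.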

Hypernyms
2 senses of robin
Sense 1
robin, redbreast, robin redbreast, Old World robin, Erithacus rubecola -- (small Old World songbird with a reddish breast)
  => thrush -- (songbirds characteristically having brownish upper plumage with a spotted breast)
    => oscine, oscine bird -- (passerine bird having specialized vocal apparatus)
      => passerine, passeriform bird -- (perching birds mostly small and living near the ground with feet having 4 toes arranged to allow for gripping the perch; most are songbirds; hatchlings are helpless)
        => bird -- (warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings)
          => vertebrate, craniate -- (animals having a bony or cartilaginous skeleton with a segmented spinal column and a large brain enclosed in a skull or cranium)
            => chordate -- (any animal of the phylum Chordata having a notochord or spinal column)
              => animal, animate being, beast, brute, creature, fauna -- (a living organism characterized by voluntary movement)
                => organism, being -- (a living thing that has (or can develop) the ability to act or function independently)
                  => living thing, animate thing -- (a living (or once living) entity)
                    => object, physical object --
                      => entity, physical thing --

Meronymy
wn beak -holon
Holonyms of noun beak
1 of 3 senses of beak
Sense 2
beak, bill, neb, nib
  PART OF: bird

The verb database
• About 10,000 forms, 20,000 senses
• Relations between verb meanings:
  Hypernym   fly -> travel
  Troponym   walk -> stroll
  Entails    snore -> sleep
  Antonym    increase -> decrease

Relations between verbal meanings
• V1 ENTAILS V2 when “Someone V1” (logically) entails “Someone V2”, e.g., snore entails sleep
• TROPONYMY when “To do V1” is “To do V2 in some manner”, e.g., limp is a troponym of walk

The adjective and adverb database
• About 20,000 adjective forms, 30,000 senses
• 4,000 adverbs, 5,600 senses
• Relations:
  Antonym (adjective)   heavy <-> light
  Antonym (adverb)      quickly <-> slowly

How to use
• Online: http://cogsci.princeton.edu/cgi-bin/webwn
• Command line:
  • Get synonyms: wn -synsn bank
  • Get hypernyms: wn -hypen robin (also for adjectives and verbs)
  • Get antonyms: wn -antsa right

ItalWordNet (a local production)
• EuroWordNet: created by a European consortium
• ItalWordNet: created by ITC
  http://www.ilc.cnr.it/iwndb_php/

Other machine-readable lexical resources
• Machine readable dictionaries: LDOCE
• Roget’s Thesaurus
• The biggest encyclopedia: CYC
• Italian: http://multiwordnet.itc.it/ (IRST)

Readings
• WordNet online manuals
• C. Fellbaum (ed), WordNet: An Electronic Lexical Database, The MIT Press

PART II: VECTOR-BASED MODELS OF THE LEXICON AND LEXICAL ACQUISITION

VECTOR-BASED LEXICAL MODELS
• Both in Linguistics and in Psychology researchers have developed theories of the lexicon in which concepts are characterized in terms of FEATURES
  • E.g., Smith and Medin, 1981; Sartori and Job, 1988
• This type of approach leads to a ‘geometrical’ view of lexical entries as points, or VECTORS, in FEATURE SPACE
• This type of model can account for which words ‘mean the same’
• A particularly simple version of this theory is the one in which the ‘features’ are simply other words
• Vector-space models have been shown to correlate well with the results of psychological experiments, particularly about SEMANTIC PRIMING

VECTOR-BASED MODELS AND LEXICAL ACQUISITION
• Vector-based models (both the feature-based and the word-based variety) are also interesting because they can serve as the basis for models of lexical acquisition
• These models are interesting
  • From a psychological point of view, to explain how concepts are stored in memory
  • In neural science, they are being used to investigate SEMANTIC CATEGORY DEFICITS (e.g., Caramazza, Tyler et al, Vigliocco et al)
  • From a linguistic point of view, because they can address the problems encountered by lexicographers when trying to specify word senses
  • From a practical point of view: most MRDs these days contain at least some information derived by computational means

Feature-based lexical semantics
• Very old idea in Linguistics: the meaning of a word can be specified in terms of the values of certain `features’ (`DECOMPOSITIONAL SEMANTICS’)
  dog   : ANIMATE=+, EAT=MEAT,  SOCIAL=+
  horse : ANIMATE=+, EAT=GRASS, SOCIAL=+
  cat   : ANIMATE=+, EAT=MEAT,  SOCIAL=-
• E.g., Katz and Fodor, 1968
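
The three feature bundles above can be compared directly by counting the features on which two words agree. The following sketch assumes a plain dictionary encoding; the names `FEATURES` and `shared_features` are hypothetical, and counting agreements is just one simple way to compare entries, not a method attributed to the cited work.

```python
# Feature values copied from the dog / horse / cat example.
FEATURES = {
    "dog":   {"ANIMATE": "+", "EAT": "MEAT",  "SOCIAL": "+"},
    "horse": {"ANIMATE": "+", "EAT": "GRASS", "SOCIAL": "+"},
    "cat":   {"ANIMATE": "+", "EAT": "MEAT",  "SOCIAL": "-"},
}

def shared_features(w1, w2):
    """Return the set of features on which the two words have the same value."""
    f1, f2 = FEATURES[w1], FEATURES[w2]
    return {name for name, value in f1.items() if f2.get(name) == value}
```

On this tiny feature set, dog agrees with horse on ANIMATE and SOCIAL and with cat on ANIMATE and EAT, while horse and cat share only ANIMATE.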

PSYCHOLOGY: THE FUSS MODEL (Vinson and Vigliocco, 2002, 2003)

Vector-based lexical semantics
[Figure labels: CAT, DOG, HORSE]

WORD-BASED VECTOR-SPACE LEXICAL MODELS, I

WORD-BASED VECTOR SPACE MODELS, II

WORD-BASED VECTOR-SPACE MODELS, III

Measures of semantic similarity (for two n-dimensional vectors x and y)
• Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Cosine: $\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
• Manhattan metric: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
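
The three measures named on this slide have standard textbook definitions, sketched below for plain Python lists. The function names are assumptions, and returning 0.0 for a zero vector in `cosine` is a convention chosen for this example.

```python
import math

def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between x and y (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_x * norm_y)
```

Euclidean and Manhattan are distances (smaller means more similar), while cosine is a similarity (larger means more similar) and ignores vector length, which matters when frequent words have much larger counts than rare ones.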

DIMENSIONALITY REDUCTION

Concept clustering (aka: automatic taxonomy discovery)
[Figure: words grouped into clusters]
• Time: Year, Day, Month
• Vehicle: Van, Airplane, Car
• Feeling: Joy, Love, Fear

Some psychological evidence for vector-space representations
• Burgess and Lund (1996, 1997): the clusters found with HAL correlate well with those observed using semantic priming experiments.
• Landauer, Foltz, and Laham (1997): scores overlap with those of humans on standard vocabulary and topic tests; mimic human scores on category judgments; etc.
• Evidence about `prototype theory’ (Rosch et al, 1976)
  • Posner and Keele, 1968: subjects were presented with patterns of dots that had been obtained by variations from a single pattern (`prototype’). Later, they recalled prototypes better than samples they had actually seen
  • Rosch et al, 1976: `basic level’ categories (apple, orange, potato, carrot) have higher `cue validity’ than elements higher in the hierarchy (fruit, vegetable) or lower (red delicious, cox)

General characterization of vector-based semantics (from Charniak)
• Vectors as models of concepts
• The CLUSTERING approach to lexical semantics:
  1. Define properties one cares about, and give values to each property (generally, numerical)
  2. Create a vector of length n for each item to be classified
  3. Viewing the n-dimensional vector as a point in n-space, cluster points that are near one another
• What changes between models:
  1. The properties used in the vector
  2. The distance metric used to decide if two points are `close’
  3. The algorithm used to cluster
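
The three clustering steps can be sketched with a very simple algorithm: put two items in the same cluster whenever they are within a fixed distance of each other (single-link clustering with a threshold). The choice of algorithm, the Euclidean metric, the function names and the two-dimensional example vectors are all assumptions made for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster(vectors, threshold, distance=euclidean):
    """Single-link clustering: items within `threshold` of any member join that cluster."""
    clusters = []
    for name, vec in vectors.items():
        # Every existing cluster that has a member close to the new item.
        near = [c for c in clusters
                if any(distance(vec, vectors[m]) <= threshold for m in c)]
        merged = {name}
        for c in near:
            merged |= c          # the new item links these clusters together
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# Hypothetical 2-dimensional vectors, invented for this example.
VECTORS = {
    "car": (1.0, 0.0),
    "van": (0.9, 0.1),
    "joy": (0.0, 1.0),
    "love": (0.1, 0.9),
}
```

Calling `cluster(VECTORS, 0.5)` groups car with van and joy with love. Swapping in a different `distance` function or a different grouping rule changes exactly the second and third items in the list of what varies between models.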

Using words as features in a vector-based semantics
• The old decompositional semantics approach requires
  i. Specifying the features
  ii. Characterizing the value of these features for each lexeme
• Simpler approach: use as features the WORDS that occur in the proximity of that word / lexical entry
  • Intuition: “You can tell a word’s meaning from the company it keeps”
• More specifically, you can use as `values’ of these features
  • The FREQUENCIES with which these words occur near the words whose meaning we are defining
  • Or perhaps the PROBABILITIES that these words occur next to each other
• Alternative: use the DOCUMENTS in which these words occur (e.g., LSA)

Using neighboring words to specify the meaning of words
Take, e.g., the following corpus:
  1. John ate a banana.
  2. John ate an apple.
  3. John drove a lorry.
We can extract the following co-occurrence matrix (counting words that occur in the same sentence, ignoring “a” and “an”):

          john  ate  drove  banana  apple  lorry
  john      0    2     1      1       1      1
  ate       2    0     0      1       1      0
  drove     1    0     0      0       0      1
  banana    1    1     0      0       0      0
  apple     1    1     0      0       0      0
  lorry     1    0     1      0       0      0
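
A co-occurrence matrix for this three-sentence corpus can be computed mechanically. The sketch below assumes that two words co-occur when they appear in the same sentence and that the articles "a" and "an" are not counted; both choices, and the names `CORPUS`, `VOCAB` and `cooccurrence`, are assumptions made for the example.

```python
from itertools import permutations

CORPUS = ["John ate a banana.", "John ate an apple.", "John drove a lorry."]
VOCAB = ["john", "ate", "drove", "banana", "apple", "lorry"]

def cooccurrence(corpus, vocab):
    """Count, for each vocabulary word, how often every other word shares a sentence with it."""
    counts = {w: {v: 0 for v in vocab} for w in vocab}
    for sentence in corpus:
        words = [w.strip(".").lower() for w in sentence.split()]
        words = [w for w in words if w in counts]  # drops "a" and "an"
        # Every ordered pair of distinct positions, so the matrix is symmetric.
        for w, v in permutations(words, 2):
            counts[w][v] += 1
    return counts

def row(counts, word, vocab=VOCAB):
    """Return the vector for `word` as a list ordered like `vocab`."""
    return [counts[word][v] for v in vocab]
```

Under these assumptions the vectors for "banana" and "apple" come out identical, which is the sense in which the two words have the same company.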

Acquiring lexical vectors from a corpus (Schuetze, 1991; Burgess and Lund, 1997)
• To construct vectors C(w) for each word w:
  1. Scan a text
  2. Whenever a word w is encountered, increment all cells of C(w) corresponding to the words v that occur in the vicinity of w, typically within a window of fixed size
• Differences among methods:
  • Size of window
  • Weighted or not
  • Whether every word in the vocabulary counts as a dimension (including function words such as the or and) or whether instead only some specially chosen words are used (typically, the m most common content words in the corpus; or perhaps modifiers only). The words chosen as dimensions are often called CONTEXT WORDS
  • Whether dimensionality reduction methods are applied
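
The scanning procedure can be sketched as a single pass over a token list. This is an unweighted version with a symmetric window; the function name `window_vectors`, its parameters and the toy token sequence are assumptions, not the implementation used in the cited work.

```python
from collections import defaultdict

def window_vectors(tokens, context_words, window=2):
    """Build C(w): for each token w, count context words within `window` positions of it."""
    context = set(context_words)
    vectors = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Skip the word itself; count only the chosen context words.
            if j != i and tokens[j] in context:
                vectors[w][tokens[j]] += 1
    return vectors

# Toy token sequence, invented for this example.
TOKENS = "john ate a banana john ate an apple".split()
CONTEXT = ["john", "ate", "banana", "apple"]
```

With `window=1`, the vector for "ate" records "john" twice, because "john" immediately precedes it on both occurrences. Widening the window, restricting `context_words`, or weighting counts by distance are the variations listed above.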

The HAL model (Burgess and Lund, 1995, 1997)
• A 160-million-word corpus of articles extracted from all newsgroups containing English dialogue
• Context words: the 70,000 most frequently occurring symbols within the corpus
• Window size: 10 words to the left and the right of the word
• Measure of similarity: cosine

Latent Semantic Analysis (LSA) (Landauer et al, 1997)
• Goal: extract relations of expected contextual usage from passages
• Three steps:
  1. Build a word / document co-occurrence matrix
  2. `Weight’ each cell
  3. Perform a DIMENSIONALITY REDUCTION
• Argued to correlate well with humans on a number of tests

LSA: the method, 1

LSA: Singular Value Decomposition
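
In standard notation, singular value decomposition and the reduced-rank reconstruction used for dimensionality reduction can be written as follows. The symbols are assumptions chosen for this note: X stands for the weighted word / document matrix and k for the number of dimensions kept.

```latex
X = U \Sigma V^{\top}, \qquad
\hat{X}_k = U_k \Sigma_k V_k^{\top}
```

Here U and V have orthonormal columns, Σ is diagonal with the singular values in decreasing order, and the subscript k means keeping only the k largest singular values together with the matching columns of U and V.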

LSA: Reconstructed matrix

Topic correlations in `raw’ and `reconstructed’ data

SEXTANT (Grefenstette, 1992)
It was concluded that the carcinoembryonic antigens represent cellular constituents which are repressed during the course of differentiation of the normal digestive system epithelium and reappear in the corresponding malignant cells by a process of derepressive dedifferentiation

  antigen      carcinoembryonic-ADJ
  antigen      repress-DOBJ
  antigen      represent-SUBJ
  constituent  cellular-ADJ
  constituent  represent-DOBJ
  course       repress-IOBJ
  …

SEXTANT: Similarity measure
DOG
  dog  pet-DOBJ
  dog  eat-SUBJ
  dog  shaggy-ADJ
  dog  brown-ADJ
  dog  leash-NN
CAT
  cat  pet-DOBJ
  cat  hairy-ADJ
  cat  leash-NN
Jaccard: $J(A, B) = |A \cap B| / |A \cup B|$, where A and B are the attribute sets of the two words
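
The Jaccard coefficient for the DOG and CAT attribute lists above can be computed with ordinary set operations. This sketch uses the plain, unweighted coefficient (size of the intersection over size of the union); whether SEXTANT applies additional weighting is not shown here, and the names `DOG`, `CAT` and `jaccard` are assumptions.

```python
# Attribute sets copied from the DOG and CAT lists.
DOG = {"pet-DOBJ", "eat-SUBJ", "shaggy-ADJ", "brown-ADJ", "leash-NN"}
CAT = {"pet-DOBJ", "hairy-ADJ", "leash-NN"}

def jaccard(a, b):
    """Unweighted Jaccard coefficient: shared attributes over all attributes."""
    union = a | b
    if not union:
        return 0.0  # convention for two empty sets
    return len(a & b) / len(union)
```

The two words share pet-DOBJ and leash-NN out of six distinct attributes, so `jaccard(DOG, CAT)` is 2/6.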

Some caveats
• Two senses of `similarity’
  • Schuetze: two words are similar if one can replace the other
  • Brown et al: two words are similar if they occur in similar contexts
• What notion of `meaning’ is learned here?
  • “One might consider LSA’s maximal knowledge of the world to be analogous to a well-read nun’s knowledge of sex, a level of knowledge often deemed a sufficient basis for advising the young” (Landauer et al, 1997)
• Can one do semantics with these representations?
  • Our own experience: using HAL-style vectors for resolving bridging references
  • Very limited success
  • Applying dimensionality reduction didn’t seem to help

Applications of these techniques: Information Retrieval

        cosmonaut  astronaut  moon  car  truck
  d1        1          0        1    1     0
  d2        0          1        1    0     0
  d3        1          0        0    0     0
  d4        0          0        0    1     1
  d5        0          0        0    1     0
  d6        0          0        0    0     1
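
A small term / document matrix of this kind shows why exact term matching is limiting in retrieval. In the sketch below the counts are an assumed reading of the five-term, six-document matrix, and the names `TERM_DOC`, `matching_docs` and `cosine` are hypothetical.

```python
import math

DOCS = ["d1", "d2", "d3", "d4", "d5", "d6"]

# Assumed counts: one row per term, one column per document d1..d6.
TERM_DOC = {
    "cosmonaut": [1, 0, 1, 0, 0, 0],
    "astronaut": [0, 1, 0, 0, 0, 0],
    "moon":      [1, 1, 0, 0, 0, 0],
    "car":       [1, 0, 0, 1, 1, 0],
    "truck":     [0, 0, 0, 1, 0, 1],
}

def matching_docs(term):
    """Documents retrieved by exact term matching."""
    return [d for d, count in zip(DOCS, TERM_DOC[term]) if count > 0]

def cosine(x, y):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0
```

With these counts a query for "cosmonaut" never retrieves d2, and the raw vectors for "cosmonaut" and "astronaut" have cosine 0 even though both co-occur with "moon". That gap is the usual motivation for comparing terms in a reduced space rather than on raw counts.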

Readings
• Jurafsky and Martin, chapter 17.3
• Also useful:
  • Manning and Schuetze, chapter 8
  • Charniak, chapters 9-10
• Some papers:
  • HAL: see the Higher Dimensional Space page
  • LSA: various papers on the Colorado site
  • Good reference: Landauer, Foltz, and Laham. (1997). Introduction to Latent Semantic Analysis. Discourse Processes.