330d8bf07ebf2551c7e44d0799617df5.ppt
- Number of slides: 31
Latent Semantic Analysis, Probabilistic Topic Models & Associative Memory
The Psychological Problem
- How do we learn semantic structure?
  - Covariation between words and the contexts they appear in (e.g., LSA)
- How do we represent semantic structure?
  - Semantic spaces (e.g., LSA)
  - Probabilistic topics
Latent Semantic Analysis (Landauer & Dumais, 1997)
[Figure: word-document counts → SVD → high-dimensional space containing the words STREAM, RIVER, BANK, MONEY]
- Each word is a single point in semantic space
- Similarity is measured by the cosine of the angle between word vectors
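The pipeline on this slide (counts, SVD, cosine) can be illustrated with a minimal sketch. The count matrix below is hypothetical, and keeping k = 2 dimensions is an arbitrary choice for the toy data; real LSA applications use far larger matrices, usually a weighting step before the SVD, and a few hundred dimensions.

```python
import numpy as np

# Hypothetical word-by-document count matrix (rows: words, columns: documents).
words = ["stream", "river", "bank", "money"]
counts = np.array([
    [2, 3, 0, 0],   # stream
    [3, 2, 0, 0],   # river
    [1, 2, 2, 3],   # bank
    [0, 0, 3, 2],   # money
], dtype=float)

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
word_vectors = U[:, :k] * s[:k]   # one k-dimensional point per word

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity(w1, w2):
    """Cosine similarity of two words in the reduced space."""
    return cosine(word_vectors[words.index(w1)], word_vectors[words.index(w2)])

print(similarity("stream", "river"), similarity("stream", "money"))
```

With these toy counts, STREAM and RIVER end up close together, while BANK lies between the river words and MONEY.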
Critical Assumptions of Semantic Spaces (e.g., LSA)
- Psychological distance should obey three axioms:
  - Minimality
  - Symmetry
  - Triangle inequality
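For reference, the three axioms are usually written as follows for a distance function d over items a, b, c. The symbols are notation chosen here, not taken from the slide.

```latex
\begin{align*}
\text{Minimality:} \quad & d(a, b) \ge d(a, a) = 0 \\
\text{Symmetry:} \quad & d(a, b) = d(b, a) \\
\text{Triangle inequality:} \quad & d(a, c) \le d(a, b) + d(b, c)
\end{align*}
```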
For conceptual relations, violations of the distance axioms are often found
- Similarities can often be asymmetric:
  - “North Korea” is more similar to “China” than vice versa
  - “Pomegranate” is more similar to “Apple” than vice versa
- Violations of the triangle inequality
  [Figure: triangle with sides AB, BC, AC]
  Euclidean distance: AC ≤ AB + BC
The triangle inequality in semantic spaces might not always hold
[Figure: word vectors w1 = THEATER, w2 = PLAY, w3 = SOCCER]
- Euclidean distance: AC ≤ AB + BC
- Cosine similarity: $\cos(w_1, w_3) \ge \cos(w_1, w_2)\cos(w_2, w_3) - \sin(w_1, w_2)\sin(w_2, w_3)$
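The cosine bound constrains any vector space model. For example, if THEATER-PLAY and PLAY-SOCCER both had cosine 0.9, the bound would force THEATER-SOCCER to be at least 0.9 × 0.9 − 0.19 = 0.62, however unrelated the two words are. The sketch below checks the inequality numerically on random vectors; the dimensionality and sample size are arbitrary choices.

```python
import numpy as np

def angle(u, v):
    """Angle in [0, pi] between two nonzero vectors."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def cosine_bound_holds(w1, w2, w3, tol=1e-9):
    """Check cos(w1,w3) >= cos(w1,w2)cos(w2,w3) - sin(w1,w2)sin(w2,w3)."""
    a12, a23, a13 = angle(w1, w2), angle(w2, w3), angle(w1, w3)
    lower = np.cos(a12) * np.cos(a23) - np.sin(a12) * np.sin(a23)
    return bool(np.cos(a13) >= lower - tol)

# Test the bound on 1000 random triples of 5-dimensional vectors.
rng = np.random.default_rng(0)
all_hold = all(
    cosine_bound_holds(*rng.normal(size=(3, 5)))
    for _ in range(1000)
)
print(all_hold)
```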
Nearest neighbor problem (Tversky & Hutchinson, 1986)
• In similarity data, “Fruit” is the nearest neighbor of 18 out of 20 fruit words
• In a 2-D solution, “Fruit” can be the nearest neighbor of at most 5 items
• High-dimensional solutions might solve this, but these are less appealing
Probabilistic Topic Models
- A probabilistic version of LSA: no spatial constraints
- Originated in statistics and machine learning (e.g., Hofmann, 2001; Blei, Ng, & Jordan, 2003)
- Extract topics from large collections of text
- Topics are interpretable, unlike the arbitrary dimensions of LSA
Model is Generative
[Figure: a topic model generates the DATA (a corpus of text: word counts for each document)]
- Find parameters that “reconstruct” the data
Probabilistic Topic Models
- Each document is a probability distribution over topics (distribution over topics = gist)
- Each topic is a probability distribution over words
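Taken together, the two bullets imply that the probability of a word in a document is a mixture over topics. The notation is an assumption, not from the slide: w is a word, d a document, z the topic assigned to the word slot, and T the number of topics.

```latex
P(w \mid d) = \sum_{j=1}^{T} P(w \mid z = j)\, P(z = j \mid d)
```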
Document generation as a probabilistic process
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the topic
[Figure: MIXTURE → TOPIC → WORD, repeated for each word slot]
Example
[Figure: two topics (mixture components) generate two documents; the arrows carry the mixture weights .8, .2, .3, .7. The number after each word is the topic it was sampled from.]
TOPIC 1: money, loan, bank
TOPIC 2: river, stream, bank
DOCUMENT 1: money 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 money 1 stream 2 bank 1 money 1 bank 1 loan 1 river 2 stream 2 bank 1 money 1 river 2 bank 1 money 1 bank 1 loan 1 bank 1 money 1 stream 2
DOCUMENT 2: river 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 loan 1 bank 2 river 2 bank 1 stream 2 river 2 loan 1 bank 2 stream 2 bank 2 money 1 loan 1 river 2 stream 2 bank 2 money 1 river 2 stream 2 loan 1 bank 2 river 2 bank 2 money 1 bank 1 stream 2 river 2 bank 2 stream 2 bank 2 money 1
Bayesian approach: use priors
- Mixture weights ~ Dirichlet(α)
- Mixture components ~ Dirichlet(β)
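The generative process behind this example can be sketched as follows. The equal word probabilities within each topic and the value alpha = 1.0 are assumptions for illustration; the slide only shows which words belong to which topic. For simplicity the topics are held fixed rather than drawn from their own Dirichlet prior.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["money", "loan", "bank", "river", "stream"]
# Assumed P(word | topic): equal weight on each of a topic's three words.
topics = np.array([
    [1/3, 1/3, 1/3, 0.0, 0.0],   # topic 1: money, loan, bank
    [0.0, 0.0, 1/3, 1/3, 1/3],   # topic 2: bank, river, stream
])

def generate_document(n_words, alpha=1.0):
    """Sample one document: returns (words, topic assignments, mixture weights)."""
    n_topics = topics.shape[0]
    # Step 1: choose the document's mixture of topics from a Dirichlet prior.
    theta = rng.dirichlet(alpha * np.ones(n_topics))
    # Step 2: for every word slot, sample a topic from the mixture.
    z = rng.choice(n_topics, size=n_words, p=theta)
    # Step 3: sample a word from the chosen topic.
    words = [vocab[rng.choice(len(vocab), p=topics[k])] for k in z]
    return words, z + 1, theta   # topics numbered from 1, as on the slide

words, assignments, theta = generate_document(16)
print(" ".join(f"{w} {k}" for w, k in zip(words, assignments)))
```

Note that "bank" can be produced by either topic, which is what makes the inverse problem non-trivial.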
Inverting (“fitting”) the model
[Figure: the same two documents, but the topics (mixture components), the mixture weights, and the topic assignment of every word are now unknown, shown as “?”]
DOCUMENT 1: money? bank? loan? river? stream? bank? money? river? bank? money? bank? loan? money? stream? bank? money? bank? loan? river? stream? bank? money? river? bank? money? bank? loan? bank? money? stream?
DOCUMENT 2: river? stream? bank? money? loan? river? stream? loan? bank? river? bank? stream? river? loan? bank? stream? bank? money? loan? river? stream? bank? money? river? stream? loan? bank? river? bank? money? bank? stream? river? bank? stream? bank? money?
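The slide does not say which procedure is used to invert the model. One widely used option for this kind of model is collapsed Gibbs sampling, sketched below under assumed symmetric Dirichlet priors. The function name `gibbs_lda`, the hyperparameter values, and the toy documents are all hypothetical.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=1.0, beta=0.1, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for a topic model.

    docs: list of lists of word ids. Returns (theta, phi, z): estimated
    mixture weights per document, word distributions per topic, and the
    final topic assignment of every word token.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))    # topic counts per document
    nkw = np.zeros((n_topics, vocab_size))   # word counts per topic
    nk = np.zeros(n_topics)                  # total tokens per topic
    # Start from random topic assignments.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # P(z = k | all other assignments), up to a constant.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1

    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi, z

# Toy usage: word ids index into vocab.
vocab = ["money", "loan", "bank", "river", "stream"]
docs = [[0, 2, 1, 0, 2, 0, 2, 1], [3, 4, 2, 3, 4, 2, 3, 4]]
theta, phi, z = gibbs_lda(docs, n_topics=2, vocab_size=len(vocab), n_iter=100)
```

Topic labels are arbitrary, so "topic 0" in one run may correspond to "topic 1" in another.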
Application to corpus data
- TASA corpus: text from first grade to college
- Representative sample of text
- 26,000+ word types (stop words removed)
- 37,000+ documents
- 6,000+ word tokens
Example: topics from an educational corpus (TASA)
• 37K docs, 26K words
• 1700 topics, e.g. (one topic per line, most probable words first):
  PRINTING PAPER PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
  PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
  TEAM GAME BASKETBALL PLAYERS PLAYER PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
  JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
  HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
  STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW
Polysemy
[Same six example topics as on the previous slide.] The same word can appear in more than one topic: CHARACTERS (printing and theater topics), COURT (sports and law topics), EVIDENCE (law and science topics), TEST (science and study topics), TRY (sports and study topics)
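Words that are candidates for polysemy can be found by looking for words shared between topics. The sketch below uses a few words per topic, abridged from the TASA topic lists on these slides; the topic names are labels invented here for readability.

```python
from collections import defaultdict

# A few top words per topic, abridged from the slides; names are hypothetical.
topics = {
    "printing": ["PRINTING", "PAPER", "INK", "PRESS", "CHARACTERS"],
    "theater": ["PLAYS", "STAGE", "ACTORS", "DRAMA", "CHARACTERS"],
    "sports": ["TEAM", "GAME", "BASKETBALL", "COURT", "TRY"],
    "law": ["JUDGE", "TRIAL", "COURT", "JURY", "EVIDENCE"],
    "science": ["HYPOTHESIS", "EXPERIMENT", "TEST", "EVIDENCE", "DATA"],
    "study": ["STUDY", "TEST", "HOMEWORK", "TRY", "TEACHER"],
}

def shared_words(topics):
    """Map each word that appears in more than one topic to those topics."""
    where = defaultdict(list)
    for name, words in topics.items():
        for word in words:
            where[word].append(name)
    return {w: names for w, names in where.items() if len(names) > 1}

for word, names in shared_words(topics).items():
    print(word, "->", ", ".join(names))
```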
Three documents with the word “play” (numbers and colors indicate topic assignments)
No Problem of Triangle Inequality
[Figure: the words SOCCER, FIELD, MAGNETIC linked to TOPIC 1 and TOPIC 2]
- Topic structure easily explains violations of the triangle inequality
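A toy calculation shows why no geometric constraint arises. It assumes that FIELD belongs to both topics, that association strength is read as the conditional probability of one word given another computed through the topics, and that all probabilities are the hypothetical values below.

```python
import numpy as np

vocab = ["SOCCER", "FIELD", "MAGNETIC"]
# Hypothetical P(word | topic); rows are topics, columns follow vocab.
phi = np.array([
    [0.5, 0.5, 0.0],   # topic 1: SOCCER, FIELD
    [0.0, 0.5, 0.5],   # topic 2: FIELD, MAGNETIC
])
prior = np.array([0.5, 0.5])   # assumed uniform P(topic)

def p_next(w2, w1):
    """P(w2 | w1) = sum over topics z of P(w2 | z) P(z | w1)."""
    i, j = vocab.index(w1), vocab.index(w2)
    post = phi[:, i] * prior
    post = post / post.sum()   # P(z | w1) by Bayes' rule
    return float(phi[:, j] @ post)

print(p_next("FIELD", "SOCCER"))     # SOCCER -> FIELD: strong
print(p_next("MAGNETIC", "FIELD"))   # FIELD -> MAGNETIC: moderate
print(p_next("MAGNETIC", "SOCCER"))  # SOCCER -> MAGNETIC: none
```

SOCCER is related to FIELD and FIELD to MAGNETIC, yet SOCCER and MAGNETIC remain unrelated, which a spatial model cannot represent without placing them close together. The same quantity is also asymmetric: P(FIELD | SOCCER) differs from P(SOCCER | FIELD).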
Applications
Enron email data
- 500,000 emails
- 5,000 authors
- 1999-2002
Enron topics (one topic per line)
  TEXANS WIN FOOTBALL FANTASY SPORTSLINE PLAY TEAM GAME SPORTS GAMES
  GOD LIFE MAN PEOPLE CHRIST FAITH LORD JESUS SPIRITUAL VISIT
  ENVIRONMENTAL AIR MTBE EMISSIONS CLEAN EPA PENDING SAFETY WATER GASOLINE
  FERC MARKET ISO COMMISSION ORDER FILING COMMENTS PRICE CALIFORNIA FILED
  POWER CALIFORNIA ELECTRICITY UTILITIES PRICES MARKET PRICE UTILITY CUSTOMERS ELECTRIC
  STATE PLAN CALIFORNIA DAVIS RATE BANKRUPTCY SOCAL POWER BONDS MOU
TIMELINE: May 22, 2000, start of California energy crisis
Applying Model to Psychological Data
Network of Word Associations
[Figure: network linking BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER]
(Association norms by Doug Nelson et al., 1998)
Explaining structure with topics
[Figure: the words BAT, BASEBALL, BALL, GAME, PLAY, STAGE, THEATER grouped under topic 1 and topic 2]
Modeling Word Association
- Word association is modeled as prediction
- Given that a single word is observed, which other words are likely to occur next?
- Under a single-topic assumption: [equation not recovered; its terms are labeled “Response” and “Cue”]
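The equation on this slide did not survive extraction. A form consistent with its labels and with the single-topic assumption (cue and response are generated by the same topic) would be the following, where w_1 is the cue, w_2 the response, z the shared topic, and T the number of topics. This is a reconstruction under those assumptions, not the slide's verified content.

```latex
P(w_2 \mid w_1) = \sum_{j=1}^{T} P(w_2 \mid z = j)\, P(z = j \mid w_1)
```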
Observed associates for the cue “play”
Model predictions
[Figure; annotation: RANK 9]
Median Rank
[Figure: median rank of first associate]
Recall: example study list
STUDY: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
FALSE RECALL: “Sleep” 61%
Recall as a reconstructive process
- Reconstruct the study list based on the stored “gist”
- The gist can be represented by a distribution over topics
- Under a single-topic assumption: [equation not recovered; its terms are labeled “Retrieved word” and “Study list”]
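A minimal sketch of gist-based reconstruction follows. It assumes the retrieval probability of a word is the sum over topics z of P(word | z) P(z | study list), with the whole list generated by a single topic. The two topics, their probabilities, and the shortened study list are hypothetical.

```python
import numpy as np

vocab = ["sleep", "bed", "rest", "dream", "money", "bank"]
# Hypothetical P(word | topic); rows are topics, columns follow vocab.
phi = np.array([
    [0.300, 0.250, 0.200, 0.200, 0.025, 0.025],   # a "sleep" topic
    [0.025, 0.025, 0.025, 0.025, 0.450, 0.450],   # a "finance" topic
])
prior = np.array([0.5, 0.5])   # assumed uniform P(topic)

def retrieval_probs(study_list):
    """P(word | study list), with the gist stored as P(topic | study list)."""
    ids = [vocab.index(w) for w in study_list]
    # Gist: posterior over topics given all studied words (log space for stability).
    log_post = np.log(prior) + np.log(phi[:, ids]).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Reconstruction: mix the topics' word distributions by the gist.
    return dict(zip(vocab, post @ phi))

study = ["bed", "rest", "dream"]
probs = retrieval_probs(study)
# Non-studied words ranked by retrieval probability (the "extra list").
extra = sorted((w for w in vocab if w not in study), key=probs.get, reverse=True)
print(extra[0])
```

Because "sleep" is highly probable under the topic that best explains the study list, it tops the list of intrusions even though it was never studied.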
Predictions for the “Sleep” list
[Figure panels: STUDY LIST; EXTRA LIST (top 8)]