
d2625dcb33e646a6a291142f4e1ee091.ppt
- Number of slides: 45
Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models
Michael Paul and Roxana Girju
Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model
Outline • Overview of topic models – PLSI and LDA – Some slides borrowed from CS 410 – ChengXiang Zhai • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model
Probabilistic Topic Models • Idea: each document is some mix of topics • Each word in the document belongs to a topic
Document as a Sample of Mixed Topics
Topic 1: government 0.3, response 0.2, ...
Topic 2: city 0.2, new 0.1, orleans 0.05, ...
…
Topic k: donate 0.1, relief 0.05, help 0.02, ...
Background B: is 0.05, the 0.04, a 0.03, ...
[ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] …[ Over seventy countries pledged monetary donations or other assistance]. …
• Applications of topic models: – Summarize themes/aspects – Facilitate navigation/browsing – Retrieve documents – Segment documents – Many others
• How can we discover these topic word distributions?
Probabilistic Latent Semantic Indexing [Hofmann, 1999] • Each token in a document is associated with 2 variables: • a word w (observable) • a topic z (hidden) • P(w, z|d) = P(z|d) P(w|z)
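A small numeric sketch of the factorization P(w, z|d) = P(z|d) P(w|z) may help here. The topic names and all probabilities below are invented for illustration and are not taken from the slides.

```python
# Toy PLSI factorization for a single document d.
# All numbers are hypothetical; they only illustrate the arithmetic.
p_z_given_d = {"politics": 0.7, "weather": 0.3}  # P(z | d)
p_w_given_z = {                                   # P(w | z)
    "politics": {"government": 0.6, "storm": 0.1, "response": 0.3},
    "weather": {"government": 0.1, "storm": 0.7, "response": 0.2},
}

def joint(w, z):
    """P(w, z | d) = P(z | d) * P(w | z)."""
    return p_z_given_d[z] * p_w_given_z[z][w]

def word_prob(w):
    """P(w | d), obtained by summing the joint over the hidden topic z."""
    return sum(joint(w, z) for z in p_z_given_d)

def topic_posterior(w):
    """P(z | w, d) by Bayes' rule: which topic most likely produced w?"""
    total = word_prob(w)
    return {z: joint(w, z) / total for z in p_z_given_d}
```

For example, `word_prob("storm")` sums 0.7 × 0.1 and 0.3 × 0.7, and `topic_posterior("storm")` shows how the hidden topic z can be inferred from an observed word.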
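PLSA as a Mixture Model
Document d is generated from k topics plus a background model; entries marked "?" are unknown:
Topic 1 ($\theta_1$): warning 0.3?, system 0.2?, ...
Topic 2 ($\theta_2$): aid 0.1?, donation 0.05?, support 0.02?, ...
…
Topic k ($\theta_k$): statistics 0.2?, loss 0.1?, dead 0.05?, ...
Background B ($\theta_B$): is 0.05, the 0.04, a 0.03, ...
"Generating" word w in doc d in the collection: with probability $\lambda_B$ the word comes from the background; otherwise topic j is chosen with probability $\pi_{d,j}$:
$p_d(w) = \lambda_B \, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j)$
Parameters: $\lambda_B$ = noise level (manually set); the $\pi$'s and $\theta$'s are estimated with Maximum Likelihood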
How to Estimate Multiple Topics? (Expectation Maximization)
• Known background p(w|B): the 0.2, a 0.1, we 0.01, to 0.02, …
• Unknown topic model p(w|θ1) = ? ("Text mining"): text = ?, mining = ?, association = ?, word = ?, …
• Unknown topic model p(w|θ2) = ? ("information retrieval"): information = ?, retrieval = ?, query = ?, document = ?, …
• Observed Doc(s)
• E-Step: Predict topic labels using Bayes Rule
• M-Step: Max. Likelihood Estimator based on "fractional counts"
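The E-step and M-step named on this slide can be sketched for the background-plus-topics mixture as follows. This is a minimal illustration, not the slides' implementation: the function name `plsa_em`, the random initialization, the fixed iteration count, and the toy documents are all assumptions.

```python
import random
from collections import Counter

def normalize(xs):
    s = sum(xs)
    return [x / s for x in xs]

def plsa_em(docs, k, p_bg, lam_bg=0.5, iters=50, seed=0):
    """EM for a mixture of k topics plus a fixed, known background model.

    docs: list of token lists; p_bg: dict word -> p(w | B);
    lam_bg: background weight (manually set, as on the slide).
    """
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for d in docs for w in d})
    # Random positive initialization (an arbitrary choice for this sketch).
    pi = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in docs]
    theta = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
             for _ in range(k)]
    for _ in range(iters):
        pi_new = [[0.0] * k for _ in docs]
        theta_new = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        for d, cnt in enumerate(counts):
            for w, c in cnt.items():
                # E-step: Bayes' rule for the hidden label of word w in doc d.
                terms = [pi[d][j] * theta[j][w] for j in range(k)]
                mass = sum(terms)
                p_b = lam_bg * p_bg[w] / (lam_bg * p_bg[w] + (1 - lam_bg) * mass)
                for j in range(k):
                    # Fractional count of w credited to topic j.
                    frac = c * (1 - p_b) * terms[j] / mass
                    pi_new[d][j] += frac
                    theta_new[j][w] += frac
        # M-step: maximum-likelihood re-estimates from the fractional counts.
        pi = [normalize(row) for row in pi_new]
        theta = [dict(zip(vocab, normalize([t[w] for w in vocab])))
                 for t in theta_new]
    return pi, theta

# Toy usage: two tiny documents, background estimated from corpus frequencies.
docs = [["text", "mining", "text", "the"], ["query", "retrieval", "the", "query"]]
flat = [w for d in docs for w in d]
p_bg = {w: n / len(flat) for w, n in Counter(flat).items()}
pi, theta = plsa_em(docs, k=2, p_bg=p_bg)
```

Which topic ends up describing which document depends on the random initialization, so the sketch makes no claim about the specific values returned.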
PLSI - Problems • Each document is represented as a dummy variable d • Number of parameters grows linearly with corpus size • Overfitting • Not fully generative • Not clear how to model previously unseen documents
Latent Dirichlet Allocation [Blei et al, 2003] • Per-document topic mixtures and word multinomials come from Dirichlet priors • Exact solution is intractable – Inference is more complicated • Variational methods • Monte Carlo
Dirichlet Distribution • Conjugate prior of multinomial distribution
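Conjugacy means that a Dirichlet prior combined with multinomial counts gives a Dirichlet posterior whose parameters are simply the prior parameters plus the counts. A minimal sketch, with hypothetical helper names and made-up numbers:

```python
def dirichlet_posterior(alpha, counts):
    """Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + counts)."""
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Expected multinomial parameters under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Hypothetical example: symmetric prior over 3 topics, then 4 observed assignments.
prior = [1.0, 1.0, 1.0]
observed = [3, 0, 1]
posterior = dirichlet_posterior(prior, observed)
```

This closed-form update is what makes sampling-based inference for LDA-style models convenient: topic and word counts can be folded directly into the priors.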
Latent Dirichlet Allocation
Cross-Collection LDA (ccLDA) • LDA extension for modeling multiple text collections • Each topic has a word distribution that is shared among all collections, as well as word distributions that are unique to each collection • Automatically discovers differences between collections and organizes them by topic
Example • Topic of weather and the outdoors in travel forums
Topic: weather time day going rain summer month high days thanks
UK: wind waterproof ending rolling walkers rochdale layers snow footwear ankle
India: leh monsoon road manali ladakh trekking trek season rains monsoons
Singapore: hot humidity heat degree equator sweat bring rain umbrella
ccLDA
• Graphical representation: [plate diagram with variables α, θ, z, x, w, c, φ, β, σ, δ, ψ, γ0, γ1 and plates N, D, T, C, TC]
• The generative process: [shown as an image on the slide]
• Inference can be done with Gibbs sampling
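Since the generative steps did not survive on this slide, the sketch below is one plausible reading of the variables in the graphical model: a shared word distribution φ per topic, a collection-specific distribution σ per topic and collection, and a switch x drawn with probability ψ. The function name, the hyperparameter values, the use of a single α for all collections, and the convention that x = 0 selects the shared distribution are assumptions, not details confirmed by the slides.

```python
import random

def sample_dirichlet(rng, alphas):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def sample_index(rng, probs):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_cclda_corpus(vocab, n_topics, n_collections, docs_per_collection,
                          doc_len, alpha=0.5, beta=0.1, delta=0.1,
                          gamma0=1.0, gamma1=1.0, seed=0):
    rng = random.Random(seed)
    V = len(vocab)
    # phi[z]: word distribution of topic z shared by all collections.
    phi = [sample_dirichlet(rng, [beta] * V) for _ in range(n_topics)]
    # sigma[z][c]: word distribution of topic z specific to collection c.
    sigma = [[sample_dirichlet(rng, [delta] * V) for _ in range(n_collections)]
             for _ in range(n_topics)]
    # psi[z][c]: assumed here to be the probability that x = 0 (shared words).
    psi = [[rng.betavariate(gamma0, gamma1) for _ in range(n_collections)]
           for _ in range(n_topics)]
    corpus = []
    for c in range(n_collections):
        for _ in range(docs_per_collection):
            # Simplification: one alpha for every collection.
            theta = sample_dirichlet(rng, [alpha] * n_topics)
            tokens = []
            for _ in range(doc_len):
                z = sample_index(rng, theta)             # topic of this token
                x = 0 if rng.random() < psi[z][c] else 1  # shared or specific?
                dist = phi[z] if x == 0 else sigma[z][c]
                tokens.append((vocab[sample_index(rng, dist)], z, x))
            corpus.append((c, tokens))
    return corpus

# Toy usage with a tiny vocabulary.
vocab = ["weather", "rain", "snow", "monsoon", "humidity", "umbrella"]
corpus = generate_cclda_corpus(vocab, n_topics=2, n_collections=3,
                               docs_per_collection=2, doc_len=8)
```

Each generated token records its topic z and switch x, which are exactly the hidden variables a Gibbs sampler would have to infer from the words alone.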
Previous Work • Comparative mixture model (ccMix) – ChengXiang Zhai, Atulya Velivelli, Bei Yu. A cross-collection mixture model for comparative text mining. Proceedings of ACM KDD 2004. • Improvements in ccLDA: – Does not rely on user-defined parameters – Distributions have Dirichlet/Beta priors – Document-topic distributions have collection-dependent priors – P(x) depends on the topic and collection
ccMix (2004)
Common: cd drive rw combo dvd
Dell: apoint blah hook tug 2499
Apple: airport burn 4x read schools
IBM: t20 ultrabay tells device number
ccLDA (2009)
Common: drive cd dvd hard rw
Dell: battery laptop bay inspiron media
Apple: itunes burn imovie burning minutes
IBM: 2000 ultrabay hot device swappable
Cross-Cultural Analysis • Documents from or about 3 countries: United Kingdom, India, Singapore • 3,266 forum discussions collected from lonelyplanet.com, representing the perspective of tourists • 7,388 English-language blogs collected through blogcatalog.com, representing the perspective of locals
Cross-Cultural Analysis • Topic of religion from the blogs
Topic: god jesus lord life faith holy man christ church love
UK: church god john todd bentley christ luke bible christian sermon
India: krishna religion religious spiritual guru lord sri shri baba hindu
Singapore: god sin john spirit things lamb exodus suffering cross lives
Cross-Cultural Analysis • Topic of entertainment from the blogs • Compare against ccMix
ccLDA
Topic: music song new songs like album dance comments rock guitar
UK: music band album dance festival sound bands remix tracks amp
India: movie film movies songs films director best bollywood indian awards
Singapore: band music american japanese mark world video sound idol week
ccMix
Topic: comment posted like music just blog time labels post love
UK: music album band songs new review track bands pop
India: kerala india tiger rajasthan birds water park city temple sanctuary
Singapore: kids baby cool desktop miss fun wallpaper love dont little
Cross-Cultural Analysis • Topic of travel from the blogs • Compare against LDA (on each collection individually)
ccLDA
Topic: travel hotels city best place visit holiday trip world
UK: holidays hotels spain london great surf breaks train ski
India: india delhi indian mumbai bangalore tour air dubai city mahindra
Singapore: singapore hong kong spa hotel beach chinese pictures restaurant bangkok
LDA (on each collection individually)
Topic: travel city hotel park holiday hotels place beach road visit
UK: travel holiday hotel city london park hotel place holidays hall
India: travel city beach place hotel temple road park hotels tourism
Singapore: travel hotel city park place beach trip hotels spa visit
Cross-Cultural Analysis • Topic of food from both datasets • Compare the view of tourists and locals
Perspective of Locals
Topic: food add chicken recipe cooking taste rice recipes sugar soup
UK: food wine restaurant coffee cheese soup eat chef english drink
India: recipes powder indian salt tsp rice masala oil coriander coffee
Singapore: cup oil comments fried add restaurant rice tea seafood fish
Perspective of Tourists
Topic: food eat restaurants tea cheap meal eating cafe drink
UK: haggis chips respectability decent veggie pudding photoblog sausages sandwiches
India: cooking spices sick flour tomato batter ate cook olive recipe
Singapore: hawker satay stalls noodles roti stall seafood malay rochester noodle
Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA – Scientific research/literature analysis – Media analysis and bias detection • Model evaluation • An alternative cross-collection model
Research Analysis • 16,186 abstracts from computational linguistics and linguistics journals • Interdisciplinary research topic discovery • Topic evolution over time
Research Analysis • Topic of communication
Topic: speech spoken interaction human discourse paper understanding task context communication goal users
Comp Ling: dialogue user systems information utterances dialogues utterance agent plan recognition agents research multi
Linguistics: social communication verbal women speakers speaker relationship interaction ways means behavior face men
Research Analysis • Topic of parsing/grammars across two time intervals
Topic: parser grammar tree parsers grammars free context syntactic parse structure
Old (< 2000): number result corresponding networks known binding lr introduce consider recognition transformational ambiguous networks
New (>= 2000): dependency probabilistic stochastic treebank pcfg constraint lexicalized ccg projective robustness hpsg modeling treebanks
Media Analysis • 623 news articles from msnbc.com and foxnews.com from August 2008 • Discover editorial differences within topics
Topic: percent economy prices market
MSNBC: stocks account trades tools spending consumers sales investors trading company
FOX News: oil drilling poverty offshore coverage insurance growing uninsured census congress
Topic: car vehicle cars fuel drive
MSNBC: diesel says autos camaro tax credit smaller mileage hybrid chevrolet
FOX News: mazda gallardo chrysler minivan horsepower lamborghini mph sports lp traffic
Model Evaluation Greater likelihood of held-out data than alternative models
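The slide does not say how held-out likelihood was computed. One common way to report it is per-token perplexity, where a lower value corresponds to a higher held-out likelihood. The sketch below assumes the model's probability for each held-out token is already available; the function name is hypothetical.

```python
import math

def held_out_perplexity(token_probs):
    """Perplexity = exp(-average log-likelihood per held-out token).

    token_probs: the model's probability p(w | model) for each held-out token.
    """
    log_lik = sum(math.log(p) for p in token_probs)
    return math.exp(-log_lik / len(token_probs))

# A model that spreads probability uniformly over 4 words has perplexity 4.
uniform_ppl = held_out_perplexity([0.25] * 8)
```

Comparing this number across models trained on the same data and scored on the same held-out tokens gives a like-for-like ranking.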
Model Evaluation Document classification – new vs old Compare to NB and SVM (linear kernel)
Alternative Model • Similar to hierarchical Pachinko Allocation [Mimno et al, 2007] • Model as 2-level hierarchy
Alternative Model • Single, global set of “super-topics” • One set of “sub-topics” for each collection • Choose super-topic T from P(T|d) • Choose sub-topic t from P(t|T, c) • Choose hierarchy level l from P(l|t, T) • If l = 0, choose word from P(w|T); else if l = 1, choose word from P(w|t)
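The four sampling steps on this slide can be written out directly for a single token. The table layout (plain dictionaries keyed by super-topic, sub-topic, and collection), the function names, and the toy probabilities are assumptions made for illustration.

```python
import random

def draw(rng, dist):
    """Sample a key from a dict that maps outcomes to probabilities."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[key] for key in keys], k=1)[0]

def generate_token(rng, c, p_T_given_d, p_t_given_Tc, p_l_given_tT,
                   p_w_given_T, p_w_given_t):
    T = draw(rng, p_T_given_d)           # super-topic from P(T | d)
    t = draw(rng, p_t_given_Tc[(T, c)])  # sub-topic from P(t | T, c)
    l = draw(rng, p_l_given_tT[(t, T)])  # hierarchy level from P(l | t, T)
    if l == 0:
        w = draw(rng, p_w_given_T[T])    # word from the super-topic
    else:
        w = draw(rng, p_w_given_t[t])    # word from the sub-topic
    return T, t, l, w

# Hypothetical toy tables for one document in collection "UK".
p_T_given_d = {"religion": 1.0}
p_t_given_Tc = {("religion", "UK"): {"UK 1": 0.9, "UK 2": 0.1}}
p_l_given_tT = {("UK 1", "religion"): {0: 0.5, 1: 0.5},
                ("UK 2", "religion"): {0: 0.5, 1: 0.5}}
p_w_given_T = {"religion": {"god": 0.6, "faith": 0.4}}
p_w_given_t = {"UK 1": {"church": 0.7, "sermon": 0.3},
               "UK 2": {"bible": 1.0}}

rng = random.Random(0)
token = generate_token(rng, "UK", p_T_given_d, p_t_given_Tc, p_l_given_tT,
                       p_w_given_T, p_w_given_t)
```

Repeating `generate_token` for every position in a document yields a bag of words in which general vocabulary comes from the super-topic and collection-specific vocabulary comes from the sub-topics.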
Alternative Model • This is just a generalization of ccLDA! • ccLDA = special case, constrained such that for each super-topic T=j there is exactly one sub-topic such that P(t=j|T=j)=1 and P(t=i|T=j)=0 for all i ≠ j
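To see why the constraint recovers ccLDA, it helps to write the word probability implied by the generative steps of the alternative model and then apply the constraint. The derivation below is a sketch that uses only the distributions named on the slides; reading the result as ccLDA's shared and collection-specific word distributions is an interpretation.

```latex
% Alternative model: marginal probability of word w in document d of collection c
P(w \mid d, c) = \sum_{T} P(T \mid d) \sum_{t} P(t \mid T, c)
    \bigl[ P(l=0 \mid t, T)\, P(w \mid T) + P(l=1 \mid t, T)\, P(w \mid t) \bigr]

% Constraint: P(t=j \mid T=j, c) = 1 and P(t=i \mid T=j, c) = 0 for i \neq j,
% so the inner sum keeps only the term t = j:
P(w \mid d, c) = \sum_{j} P(T=j \mid d)
    \bigl[ P(l=0 \mid t=j, T=j)\, P(w \mid T=j) + P(l=1 \mid t=j, T=j)\, P(w \mid t=j) \bigr]
```

Under the constraint, each topic j has one word distribution shared by all collections, P(w|T=j), one distribution per collection, P(w|t=j), and a switch between the two, which matches the description of ccLDA given earlier in the deck.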
Alternative Model • Topic of religion in the blogs
Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574
UK 1 (0.970483): church 0.030402, john 0.017007, todd 0.016154, jesus 0.015552, bentley 0.014348, luke 0.012693, religion 0.012592, christ 0.012091, cross 0.011388, neville 0.009482
Alternative Model • Topic of religion in the blogs
Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574
India 1 (0.984414): religion 0.021439, krishna 0.019062, spiritual 0.014765, hindu 0.012343, lord 0.01216, religious 0.012114, guru 0.011108, mother 0.01088, shri 0.010194, sri 0.009646
Alternative Model • Topic of religion in the blogs
Super-Topic: god 0.046994, lord 0.015877, jesus 0.012076, life 0.01143, faith 0.010692, church 0.010185, holy 0.009189, man 0.00882, world 0.00869, people 0.007574
SG 1: god 0.032249, christ 0.018867, cross 0.015467, sin 0.012505, grace 0.012395, jesus 0.011957, john 0.011628, lamb 0.009982, mahendra 0.009489, good 0.009434
SG 2: daily 0.020028, free 0.016023, fast 0.014822, silent 0.014221, wait 0.012418, going 0.011818, sign 0.009414, friday 0.009214, health 0.008413, star 0.008413
[Sub-topic weights for SG 1 and SG 2 survive only as scattered digits: "49", "517", "0.8" and "0.1", "025", "34"]
ccLDA • Topic of religion from the blogs
Topic: god jesus lord life faith holy man christ church love
UK: church god john todd bentley christ luke bible christian sermon
India: krishna religion religious spiritual guru lord sri shri baba hindu
Singapore: god sin john spirit things lamb exodus suffering cross lives
Alternative Model • Topic of politics in the blogs
Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413
UK 1: labour 0.049547, british 0.041125, workers 0.029925, european 0.026252, bbc 0.024908, david 0.017203, crisis 0.016934, immigration 0.014694, left 0.014336, trade 0.011648
UK 2: war 0.023458, world 0.01909, wales 0.019002, welsh 0.017823, brown 0.014503, britain 0.013498, gordon 0.012188, london 0.011445, politics 0.010004, anti 0.009916
[Sub-topic weights for UK 1 and UK 2 survive only as scattered digits: "108", "9", "0.2" and "0.6", "992", "27"]
Alternative Model • Topic of politics in the blogs
Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413
India 1 (0.987059): pakistan 0.052105, india 0.038041, kashmir 0.037222, state 0.023186, muslims 0.017312, muslim 0.016634, political 0.010647, taliban 0.010647, jammu 0.009461, kashmiri 0.00932
Alternative Model • Topic of politics in the blogs
Super-Topic: people 0.021148, government 0.016807, world 0.010694, obama 0.009229, political 0.00902, media 0.008975, politics 0.008669, country 0.008534, state 0.007906, rights 0.007413
SG 1 (0.970675): singapore 0.04263, world 0.027554, singaporeans 0.014817, people 0.013387, earth 0.012478, malaysia 0.011698, global 0.010398, say 0.010398, myanmar 0.009488, workers 0.008838
ccLDA • Topic of politics from the blogs
Topic: people government war world state political human rights said country
UK: news politics london media post obama war labour world bbc
India: pakistan india kashmir indian pakistani muslims state muslim brigade taliban
Singapore: singapore comments singaporeans labels chinese ago news world joo posted
Questions?