

  • Number of slides: 137

Text Mining and Topic Modeling
Padhraic Smyth
Department of Computer Science
University of California, Irvine

Progress Report
• New deadline
  • In class, Thursday February 18th (not Tuesday)
• Outline
  • 3 to 5 pages maximum
• Suggested content
  • Brief restatement of the problem you are addressing (no need to repeat everything in your original proposal), e.g., ½ a page
  • Summary of progress so far
    • Background papers read
    • Data preprocessing, exploratory data analysis
    • Algorithms/software reviewed, implemented, tested
    • Initial results (if any)
  • Challenges and difficulties encountered
  • Brief outline of plans between now and end of quarter
• Use diagrams, figures, tables, where possible
• Write clearly: check what you write

Data Mining Lectures: Text Mining and Topic Models, © Padhraic Smyth, UC Irvine

Road Map
• Topics covered
  • Exploratory data analysis and visualization
  • Regression
  • Classification
  • Text classification
• Yet to come…
  • Unsupervised learning with text (topic models)
  • Social networks
  • Recommender systems (including Netflix)
  • Mining of Web data

Text Mining
• Document classification
• Information extraction
  • Named-entity extraction: recognize names of people, places, genes, etc.
• Document summarization
  • Google News, Newsblaster (http://www1.cs.columbia.edu/nlp/newsblaster/)
• Document clustering
• Topic modeling
  • Representing documents as mixtures of topics
• And many more…

Named-Entity Extraction
• Often a combination of
  • Knowledge-based approach (rules, parsers)
  • Machine learning (e.g., hidden Markov model)
  • Dictionary
• Non-trivial, since entity names can be confused with real names
  • E.g., gene name ABS and abbreviation ABS
• Also can look for co-references
  • E.g., "IBM today… Later, the company announced…"
• Useful as a preprocessing step for data mining, e.g., use entity names to train a classifier to predict the category of an article

Example: GATE/ANNIE extractor
• GATE: free software infrastructure for text analysis (University of Sheffield, UK)
• ANNIE: widely used entity recognizer, part of GATE
http://www.gate.ac.uk/annie/

Information Extraction
From Seymore, McCallum, Rosenfeld, "Learning Hidden Markov Model Structure for Information Extraction", AAAI 1999


Topic Models
• Background on graphical models
• Unsupervised learning from text documents
  • Motivation
  • Topic model and learning algorithm
  • Results
• Extensions
  • Topics over time, author-topic models, etc.

Pennsylvania Gazette
1728-1800
80,000 articles
25 million words
www.accessible.com

Enron email data
250,000 emails
28,000 authors
1999-2002


Other Examples of Large Corpora
• CiteSeer digital collection
  • 700,000 papers, 700,000 authors, 1986-2005
• MEDLINE collection
  • 16 million abstracts in medicine/biology
• US Patent collection
• and many more…

Unsupervised Learning from Text
• Large collections of unlabeled documents…
  • Web
  • Digital libraries
  • Email archives, etc.
• Often wish to organize/summarize/index/tag these documents automatically
• We will look at probabilistic techniques for clustering and topic extraction from sets of documents

Problems of Interest
• What topics do these documents "span"?
• Which documents are about a particular topic?
• How have topics changed over time?
• What does author X write about?
• Who is likely to write about topic Y?
• Who wrote this specific document?
• and so on…

Review Slides on Graphical Models

Multinomial Models for Documents
• Example: 50,000 possible words in our vocabulary
• Simple memoryless model
  • a 50,000-sided die
  • a non-uniform die: each side/word has its own probability
  • to generate N words we toss the die N times
• This is a simple probability model:
  • p(document | φ) = Π p(word_i | φ)
  • to "learn" the model we just count frequencies
  • p(word_i) = number of occurrences of word i / total number of words
• Typically interested in conditional multinomials, e.g.,
  • p(words | spam) versus p(words | non-spam)
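The counting recipe above can be sketched directly; the toy documents and function names are illustrative, not from the lecture:

```python
from collections import Counter
import math

def fit_multinomial(docs):
    """Maximum-likelihood multinomial: p(word) = count(word) / total count."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_prob(doc, phi):
    """log p(doc | phi) = sum_i log p(word_i | phi); assumes every word is in phi."""
    return sum(math.log(phi[w]) for w in doc)

docs = [["money", "bank", "loan"], ["bank", "money", "bank"]]
phi = fit_multinomial(docs)  # "bank" occurs 3 times out of 6 tokens -> 0.5
```

In practice one works with log-probabilities, as in log_prob, to avoid numerical underflow on long documents.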

Real Examples of Word Multinomials

A Graphical Model for Multinomials
p(doc | φ) = Π p(w_i | φ)
φ = "parameter vector" = set of probabilities, one per word
[Diagram: node φ, labeled p(w | φ), with arrows to word nodes w_1, w_2, …, w_n]

Another view…
p(doc | φ) = Π p(w_i | φ)
This is "plate notation": items inside the plate are conditionally independent given the variable outside the plate. There are n conditionally independent replicates represented by the plate.
[Diagram: φ → w_i, plate i = 1:n]

Being Bayesian…
α is a prior on our multinomial parameters, e.g., a simple Dirichlet smoothing prior with symmetric parameter α, to avoid estimates of probabilities that are 0.
[Diagram: α → φ → w_i, plate i = 1:n]

Being Bayesian…
Learning: infer p(φ | words, α) ∝ p(words | φ) p(φ | α)
[Diagram: α → φ → w_i, plate i = 1:n]
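With a symmetric Dirichlet(α) prior, the posterior-mean estimate simply adds α to every word count, so no probability is ever exactly 0. A minimal sketch (vocabulary, document, and α value are illustrative):

```python
from collections import Counter

def smoothed_multinomial(docs, vocab, alpha=0.1):
    """Posterior-mean estimate under a symmetric Dirichlet(alpha) prior:
    p(word) = (count(word) + alpha) / (total + alpha * |vocab|)."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    denom = total + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

vocab = ["bank", "money", "river"]
phi = smoothed_multinomial([["bank", "money"]], vocab, alpha=1.0)
# the unseen word "river" still gets nonzero probability
```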

Multiple Documents
p(corpus | φ) = Π p(doc | φ)
[Diagram: α → φ → w_i, plates i = 1:n, d = 1:D]

Different Document Types
p(w | φ) is a multinomial over words
[Diagram: α → φ → w_i, plate i = 1:n]

Different Document Types
p(w | φ) is a multinomial over words
[Diagram: α → φ → w_i, plates i = 1:n, d = 1:D]

Different Document Types
p(w | φ, z_d) is a multinomial over words; z_d is the "label" for each doc
[Diagram: α → φ → w_i ← z_d, plates i = 1:n, d = 1:D]

Different Document Types
p(w | φ, z_d) is a multinomial over words; z_d is the "label" for each doc
Different multinomials, depending on the value of z_d (discrete); φ now represents |z| different multinomials
[Diagram: α → φ → w_i ← z_d, plates i = 1:n, d = 1:D]

Unknown Document Types
Now the values of z for each document are unknown - hopeless?
[Diagram: α → φ → w_i ← z_d ← β, plates i = 1:n, d = 1:D]

Unknown Document Types
Now the values of z for each document are unknown - hopeless?
Not hopeless :) We can learn about both z and θ, e.g., with the EM algorithm. This gives probabilistic clustering.
p(w | z = k, φ) is the kth multinomial over words.
[Diagram: α → φ → w_i ← z_d ← β, plates i = 1:n, d = 1:D]
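The EM idea on this slide can be sketched for a mixture of multinomials; the toy count matrix, the add-one smoothing, and all names below are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, seed=0):
    """EM for a mixture of multinomials (probabilistic document clustering).
    X: (D, W) matrix of word counts; K: number of clusters.
    Returns mixing weights pi, word distributions phi (K, W),
    and responsibilities r (D, K)."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    pi = np.full(K, 1.0 / K)
    phi = rng.dirichlet(np.ones(W), size=K)        # K multinomials over words
    for _ in range(n_iter):
        # E-step: r[d, k] proportional to pi[k] * prod_w phi[k, w]^X[d, w]
        log_r = np.log(pi) + X @ np.log(phi).T     # (D, K), up to a constant
        log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and phi from expected counts
        pi = r.sum(axis=0) / D
        counts = r.T @ X + 1.0                     # add-one smoothing
        phi = counts / counts.sum(axis=1, keepdims=True)
    return pi, phi, r

# Two well-separated "document types" over a 2-word vocabulary
X = np.array([[10, 0], [9, 1], [0, 10], [1, 9]])
pi, phi, r = em_multinomial_mixture(X, K=2)
```

On this toy data the responsibilities r assign the first two documents to one cluster and the last two to the other, which is the probabilistic clustering the slide describes.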

Topic Model
z_i is a "label" for each word
θ: p(z_i | θ_d) = distribution over topics, specific to each document
φ: p(w | φ, z_i = k) = multinomial over words = a "topic"
[Diagram: α → θ_d → z_i → w_i ← φ ← β, plates i = 1:n, d = 1:D]

Example of generating words
Two topics: topic 1 puts its probability on MONEY, BANK, LOAN; topic 2 on RIVER, STREAM, BANK.
Each document has its own topic mixture θ (here: pure topic 1; a 0.6/0.4 mixture; pure topic 2). Each word token is generated by first drawing a topic from θ, then drawing a word from that topic.
[Figure: three example documents with each word token labeled by the topic (1 or 2) that generated it]
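The generative process in this figure can be written out directly; the word probabilities below are illustrative stand-ins for the two topics, not the slide's exact values:

```python
import random

# Two toy topics: topic 1 about finance, topic 2 about rivers
topics = {
    1: {"money": 0.4, "bank": 0.4, "loan": 0.2},
    2: {"river": 0.4, "bank": 0.3, "stream": 0.3},
}

def generate_doc(theta, n_words, rng):
    """Draw each word by sampling a topic z ~ theta, then a word from topics[z]."""
    doc = []
    for _ in range(n_words):
        z = rng.choices(list(theta), weights=list(theta.values()))[0]
        words, probs = zip(*topics[z].items())
        doc.append((z, rng.choices(words, weights=probs)[0]))
    return doc

rng = random.Random(0)
doc = generate_doc({1: 0.6, 2: 0.4}, 8, rng)  # the 0.6/0.4 document from the figure
```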

Learning
Now run the process in reverse: only the words of the documents are observed; the topics, the mixtures θ, and the per-word topic assignments are all unknown ("?") and must be inferred from the data.
[Figure: the same three documents with every topic label and mixture replaced by a question mark]

Key Features of Topic Models
• Model allows a document to be composed of multiple topics
  • More powerful than 1 doc -> 1 cluster
• Completely unsupervised
  • Topics learned directly from data
  • Leverages strong dependencies at word level
• Learning algorithm
  • Gibbs sampling is the method of choice
• Scalable
  • Linear in number of word tokens
  • Can be run on millions of documents

Document generation as a probabilistic process
• Each topic is a distribution over words (parameters φ_j)
• Each document is a mixture of topics (parameters θ_d)
• Each word is chosen from a single topic

Learning the Model
• Three sets of latent variables we can learn
  • topic-word distributions φ
  • document-topic distributions θ
  • topic assignments for each word z
• Options:
  • EM algorithm to find point estimates of φ and θ
    • e.g., Chien and Wu, IEEE Trans ASLP, 2008
  • Gibbs sampling
    • Find p(φ | data), p(θ | data), p(z | data)
    • Can be slow to converge
  • Collapsed Gibbs sampling
    • Most widely used method
[See also Asuncion, Welling, Smyth, Teh, UAI 2009 for additional discussion]

Gibbs Sampling
• Say we have 3 parameters x, y, z, and some data
• Bayesian learning: we want to compute p(x, y, z | data)
• But frequently it is impossible to compute this exactly
• However, often we can compute conditionals for individual variables, e.g., p(x | y, z, data)
• Not clear how this is useful yet, since it assumes y and z are known (i.e., we condition on them)

Gibbs Sampling 2
• Example of Gibbs sampling:
  • Initialize with x', y', z' (e.g., randomly)
  • Iterate:
    • Sample new x' ~ P(x | y', z', data)
    • Sample new y' ~ P(y | x', z', data)
    • Sample new z' ~ P(z | x', y', data)
  • Continue for some large number of iterations
  • Each iteration consists of a sweep through the hidden variables or parameters (here, x, y, and z)
• Gibbs = a Markov chain Monte Carlo (MCMC) method
• In the limit, the samples x', y', z' will be samples from the true joint distribution P(x, y, z | data)
• This gives us an empirical estimate of P(x, y, z | data)
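These steps can be sketched on a toy target whose conditionals are known exactly: a standard bivariate normal with correlation ρ, where x | y ~ N(ρy, 1 - ρ²). The parameter values and burn-in length are illustrative choices, not from the lecture:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampling from a standard bivariate normal with correlation rho.
    Each full conditional is univariate: x | y ~ N(rho*y, 1 - rho^2),
    and symmetrically for y | x."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    sd = math.sqrt(1 - rho ** 2)
    samples = []
    for i in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)   # sample x from p(x | y)
        y = rng.gauss(rho * x, sd)   # sample y from p(y | x)
        if i >= burn_in:             # discard early ("burn-in") iterations
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
```

The empirical correlation of the retained samples approaches ρ, illustrating the claim that in the limit the draws come from the true joint distribution.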

Example of Gibbs Sampling in 2d
From online MCMC tutorial notes by Frank Dellaert, Georgia Tech

Computation
• Convergence
  • In the limit, samples x', y', z' are from P(x, y, z | data)
  • How many iterations are needed?
    • Cannot be computed ahead of time
    • Early iterations are discarded ("burn-in")
    • Typically monitor some quantities of interest to check convergence
  • Convergence in Gibbs/MCMC is a tricky issue!
• Complexity per iteration
  • Linear in number of hidden variables and parameters
  • Times the complexity of generating each sample

Gibbs Sampling for the Topic Model
• Recall: 3 sets of latent variables we can learn
  • topic-word distributions φ
  • document-topic distributions θ
  • topic assignments for each word z
• Gibbs sampling algorithm
  • Initialize all the z's randomly to a topic: z_1, …, z_N
  • Iterate:
    • For i = 1, …, N: sample z_i ~ p(z_i | all other z's, data)
  • Continue for a fixed number of iterations or until convergence
• Note that this is collapsed Gibbs sampling
  • Sample from p(z_1, …, z_N | data), "collapsing" over φ and θ

Topic Model
[Diagram: θ_d → z_i → w_i, plates i = 1:n, d = 1:D]

Sampling each Topic-Word Assignment

Probability that word i (in document d, with word type w) is assigned to topic t:

p(z_i = t | all other z's, words) ∝ (n_dt + α) × (n_wt + β) / (n_t + W β)

where n_dt = count of topic t assigned to doc d, n_wt = count of word w assigned to topic t, and n_t = total count of topic t, with all counts excluding the current assignment of z_i.
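A minimal collapsed Gibbs sampler can be built around this update, maintaining exactly the two count matrices that appear in it. The toy corpus, hyperparameter values, and all names below are illustrative sketches, not the lecture's implementation:

```python
import numpy as np

def collapsed_gibbs_lda(docs, W, T, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for the topic model.
    docs: list of documents, each a list of word ids in [0, W).
    ndt[d, t] = topic counts per doc, nwt[w, t] = word counts per topic."""
    rng = np.random.default_rng(seed)
    ndt = np.zeros((len(docs), T))
    nwt = np.zeros((W, T))
    nt = np.zeros(T)
    z = []                                        # topic assignment per token
    for d, doc in enumerate(docs):                # random initialization
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                       # remove current assignment
                ndt[d, t] -= 1; nwt[w, t] -= 1; nt[t] -= 1
                # p(z_i = t | rest) proportional to (ndt+alpha)*(nwt+beta)/(nt+W*beta)
                p = (ndt[d] + alpha) * (nwt[w] + beta) / (nt + W * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t                       # add back new assignment
                ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1
    phi = (nwt + beta) / (nwt.sum(axis=0) + W * beta)      # topic-word estimates
    theta = (ndt + alpha) / (ndt.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta, z

# Toy corpus: words 0-1 form one vocabulary group, words 2-3 another
docs = [[0, 1, 0, 1, 0, 1, 1, 0], [1, 0, 0, 1, 1, 0],
        [2, 3, 2, 3, 2, 3, 3, 2], [3, 2, 2, 3, 3, 2]]
phi, theta, z = collapsed_gibbs_lda(docs, W=4, T=2)
```

On this toy corpus the sampler separates the two word groups into the two topics, recovering the kind of structure shown in the artificial-documents example later in the lecture.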

Convergence Example
(from Newman et al, JMLR, 2009)

Complexity
• Time
  • O(N T) per iteration, where N is the number of "tokens" and T the number of topics, e.g., T = 1000, N = 100 million
  • For fast sampling, see "Fast-LDA", Porteous et al, ACM SIGKDD, 2008; also Yao, Mimno, McCallum, ACM SIGKDD, 2009
  • For distributed algorithms, see Newman et al., Journal of Machine Learning Research, 2009
• Space
  • O(D T + T W + N), where D is the number of documents and W is the number of unique words (size of vocabulary)
  • Can reduce this by using sparse matrices
    • Store non-zero counts for doc-topic and topic-word
    • Only apply smoothing at prediction time

16 Artificial Documents
Can we recover the original topics and topic mixtures from this data?
[Figure: word-count matrix for the 16 documents]

Starting the Gibbs Sampling
• Assign word tokens randomly to topics (● = topic 1; ● = topic 2)

After 1 iteration

After 4 iterations

After 32 iterations

Software for Topic Modeling
• Mark Steyvers' public-domain MATLAB toolbox for topic modeling:
psiexp.ss.uci.edu/research/programs_data/toolbox.htm

History of topic models
• origins in statistics:
  • latent class models in social science
  • admixture models in statistical genetics
• applications in computer science
  • Hofmann, SIGIR, 1999
  • Blei, Ng, Jordan, JMLR 2003 (known as "LDA")
  • Griffiths and Steyvers, PNAS, 2004
• more recent work
  • author-topic models: Steyvers et al, Rosen-Zvi et al, 2004
  • hierarchical topics: McCallum et al, 2006
  • correlated topic models: Blei and Lafferty, 2005
  • Dirichlet process models: Teh, Jordan, et al
  • large-scale web applications: Buntine et al, 2004, 2005
  • undirected models: Welling et al, 2004

Topic = probability distribution over words
Important point: these distributions are learned in a completely automated, "unsupervised" fashion from the data

Four example topics from NIPS

Topics from New York Times
• Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
• Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
• Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
• Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP

Comparing Topics and Other Approaches
• Clustering documents
  • Computationally simpler…
  • But a less accurate and less flexible model
• LSI/LSA/SVD
  • Linear projection of V-dim word vectors into lower dimensions
  • Less interpretable
  • Not generalizable
    • E.g., to authors or other side-information
  • Not as accurate
    • E.g., precision-recall: Hofmann, Blei et al, Buntine, etc.
• Probabilistic models such as topic models
  • "next-generation" text modeling, after LSI
  • provide a modular, extensible framework

Clusters v. Topics
Hidden Markov Models in Molecular Biology: New Algorithms and Applications
Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure
Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied online or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.

Clusters v. Topics: One Cluster
(Same paper abstract as on the previous slide.)
[cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov

Clusters v. Topics: One Cluster vs. Multiple Topics
(Same paper abstract as on the previous slide.)
One cluster:
[cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
Multiple topics:
[topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling
[topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequences genes

Examples of Topics learned from Proceedings of the National Academy of Sciences
Griffiths and Steyvers, PNAS, 2004
• FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
• HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
• MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
• STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
• NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
• TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

Examples of PNAS topics
• CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
• ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
• MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
• PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
• MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS PARAMETERS
• STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
• MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN


What can Topic Models be used for?
• Queries
  • Who writes on this topic?
    • e.g., finding experts or reviewers in a particular area
  • What topics does this person do research on?
• Comparing groups of authors or documents
• Discovering trends over time
• Detecting unusual papers and authors
• Interactive browsing of a digital library via topics
• Parsing documents (and parts of documents) by topic
• and more…

What is this paper about?
Empirical Bayes screening for multi-item associations
Bill DuMouchel and Daryl Pregibon, ACM SIGKDD 2001
Most likely topics according to the model are:
1. data, mining, discovery, association, attribute, …
2. set, subset, maximal, minimal, complete, …
3. measurements, correlation, statistical, variation, …
4. Bayesian, model, prior, data, mixture, …

3 of 300 example topics (TASA)

Automated Tagging of Words
(numbers and colors show topic assignments)

Four example topics from CiteSeer (T=300)

More CiteSeer Topics

Temporal patterns in topics: hot and cold topics
• CiteSeer papers from 1986-2002, about 200k papers
• For each year, calculate the fraction of words assigned to each topic
  • -> a time-series for each topic
• Hot topics become more prevalent
• Cold topics become less prevalent
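Given per-token topic assignments z (e.g., from a fitted topic model), the per-year topic fractions are just normalized counts; the toy data and names below are illustrative:

```python
from collections import Counter, defaultdict

def topic_trends(tokens):
    """tokens: iterable of (year, topic) pairs, one per word token.
    Returns {year: {topic: fraction of that year's tokens}}."""
    by_year = defaultdict(Counter)
    for year, topic in tokens:
        by_year[year][topic] += 1
    return {
        year: {t: c / sum(counts.values()) for t, c in counts.items()}
        for year, counts in sorted(by_year.items())
    }

tokens = [(1990, "nn"), (1990, "nn"), (1990, "svm"),
          (2000, "svm"), (2000, "svm"), (2000, "nn")]
trends = topic_trends(tokens)
# "nn" falls from 2/3 of the 1990 tokens to 1/3 in 2000: a "cold" topic
```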

[Several slides of time-series plots of hot and cold CiteSeer topics follow]

Four example topics from NIPS (T=100)

NIPS: support vector topic

NIPS: neural network topic

NIPS Topics over Time (topic mass shown as vertical height, plotted against time). Figure courtesy of Xuerui Wang and Andrew McCallum, UMass Amherst.

Pennsylvania Gazette, 1728-1800, 80,000 articles (courtesy of David Newman & Sharon Block, UC Irvine)

Pennsylvania Gazette. Data courtesy of David Newman (CS Dept) and Sharon Block (History Dept).

Topic trends from New York Times (330,000 articles, 2000-2002)
• Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
• Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
• Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING

Enron email data: 250,000 emails, 28,000 authors, 1999-2002

Enron email: business topics

Enron: non-work topics…

Enron: public-interest topics…

Comparing Predictive Power
• Train models on part of a new document and predict the remaining words

Using Topic Models for Document Search

References on Topic Models

Overviews:
• Steyvers, M. & Griffiths, T. (2006). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds.), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum.
• D. Blei and J. Lafferty. Topic models. In A. Srivastava and M. Sahami (eds.), Text Mining: Theory and Applications. Taylor and Francis, 2009.

More specific:
• Latent Dirichlet allocation. D. Blei, A. Y. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, 2003.
• Finding scientific topics. T. Griffiths and M. Steyvers (2004). Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.
• Probabilistic author-topic models for information discovery. M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2004.

ADDITIONAL SLIDES


PubMed-Query Topics

PubMed-Query Topics

PubMed-Query: Topics by Country

PubMed-Query: Topics by Country

Examples of Topics from New York Times
• Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
• Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
• Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
• Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP


Collocation Topic Model
• For each document, choose a mixture of topics
• For every word slot, sample a topic
• If x=0, sample a word from the topic
• If x=1, sample a word from the distribution based on the previous word
(Graphical model: topic mixture -> topic -> word, with a switch variable X per word slot)

Collocation Topic Model
• Example: "DOW JONES RISES"
• JONES is more likely explained as a word following DOW (x=1) than as a word sampled from a topic; RISES is sampled from a topic (x=0)
• Result: DOW_JONES is recognized as a collocation
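A toy sketch of the word-level choice in the collocation model: with some probability the next word follows the previous word (x=1), otherwise it is drawn from a sampled topic (x=0). The dictionaries, the sampling helper, and the `seed` parameter are illustrative assumptions, not the slides' actual model code:

```python
import random

def generate_doc(theta, topic_word, bigram, switch_prob, length, seed=0):
    """Toy generative sketch of the collocation topic model.
    theta: topic mixture, a list of probabilities over topics.
    topic_word: {topic: {word: prob}} word distributions per topic.
    bigram: {prev_word: {word: prob}} collocation distributions.
    switch_prob: {prev_word: p(x=1)}, chance the next word follows the
        previous word rather than being drawn from a topic."""
    rng = random.Random(seed)

    def sample(dist):
        # Draw one key from a {item: prob} distribution.
        r, acc = rng.random(), 0.0
        for item, p in dist.items():
            acc += p
            if r < acc:
                return item
        return item  # guard against floating-point rounding

    words, prev = [], None
    for _ in range(length):
        if prev is not None and rng.random() < switch_prob.get(prev, 0.0):
            word = sample(bigram[prev])             # x = 1: collocation
        else:
            z = sample(dict(enumerate(theta)))      # sample a topic
            word = sample(topic_word[z])            # x = 0: topic word
        words.append(word)
        prev = word
    return words

doc = generate_doc(theta=[1.0],
                   topic_word={0: {"dow": 0.5, "rises": 0.5}},
                   bigram={"dow": {"jones": 1.0}},
                   switch_prob={"dow": 1.0},
                   length=20)
```

With a bigram table containing only DOW -> JONES, every occurrence of "dow" is followed by "jones", mirroring how DOW_JONES ends up recognized as a collocation.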

Using Topic Models for Information Retrieval

Stability of Topics
• Topic indices are arbitrary across runs of the model (e.g., topic #1 is not the same topic across runs)
• However:
  • The majority of topics are stable over processing time
  • The majority of topics can be aligned across runs
  • Topics appear to represent genuine structure in the data

Comparing NIPS topics from the same Markov chain: KL distance between re-ordered topics at t2=2000 and topics at t1=1000 (best KL = 0.54, worst KL = 4.78)

Comparing NIPS topics from two different Markov chains: KL distance between re-ordered topics from chain 2 and topics from chain 1 (best KL = 1.03)
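One way to perform the re-ordering behind these KL-distance plots is a greedy nearest-neighbor matching of topic-word distributions. This is a sketch under stated assumptions (symmetrized KL, greedy matching, distributions smoothed so no entry is exactly zero), not the code used for the figures:

```python
import math

def sym_kl(p, q):
    """Symmetrized KL distance between two smoothed word distributions."""
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return kl(p, q) + kl(q, p)

def align_topics(topics_a, topics_b):
    """Greedily match each topic from run A to its nearest unused topic
    from run B. Returns a list of (index_a, index_b, distance) triples."""
    used, matches = set(), []
    for i, p in enumerate(topics_a):
        dists = {j: sym_kl(p, q)
                 for j, q in enumerate(topics_b) if j not in used}
        j = min(dists, key=dists.get)  # nearest remaining topic in run B
        used.add(j)
        matches.append((i, j, dists[j]))
    return matches

# Two runs whose topics are identical but permuted: alignment recovers
# the permutation, with near-zero distance for each matched pair.
a = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
b = [[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]]
m = align_topics(a, b)
```

Sorting the matched pairs by distance gives the "best KL" and "worst KL" values reported on the slides.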

Outline
• Background on statistical text modeling
• Unsupervised learning from text documents
  • Motivation
  • Topic model and learning algorithm
  • Results
• Extensions
  • Author-topic models
• Applications
  • Demo of topic browser
  • Future directions

Approach
• The author-topic model
  • an extension of the topic model, linking authors and topics: authors -> topics -> words
  • learned from data: completely unsupervised, no labels
  • a generative model
• Different questions or queries can be answered by appropriate probability calculus
  • e.g., p(author | words in document)
  • e.g., p(topic | author)

Graphical Model (x: author, z: topic)

Graphical Model (x: author, z: topic, w: word)

Graphical Model (x: author, z: topic, w: word; plate over n words)

Graphical Model (a: authors; x: author, z: topic, w: word; plates over n words and D documents)

Graphical Model (a: authors; author-topic distributions; topic-word distributions; x: author, z: topic, w: word; plates over n words and D documents)

Generative Process
• Assume authors A1 and A2 collaborate and produce a paper
  • A1 has multinomial topic distribution θ1
  • A2 has multinomial topic distribution θ2
• For each word in the paper:
  1. Sample an author x (uniformly) from {A1, A2}
  2. Sample a topic z from θx
  3. Sample a word w from topic z's multinomial word distribution
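A minimal sketch of the three-step sampling loop above. The dictionaries, author names, and the `seed` parameter are illustrative assumptions, not from the slides:

```python
import random

def generate_paper(authors, author_topic, topic_word, n_words, seed=1):
    """Sketch of the author-topic generative process for one paper.
    authors: list of coauthor ids.
    author_topic: {author: {topic: prob}}, each author's theta.
    topic_word: {topic: {word: prob}} word distributions."""
    rng = random.Random(seed)

    def sample(dist):
        # Draw one key from a {item: prob} distribution.
        r, acc = rng.random(), 0.0
        for k, p in dist.items():
            acc += p
            if r < acc:
                return k
        return k  # guard against floating-point rounding

    words = []
    for _ in range(n_words):
        x = rng.choice(authors)               # 1. author, uniform over coauthors
        z = sample(author_topic[x])           # 2. topic from that author's theta
        words.append(sample(topic_word[z]))   # 3. word from topic z
    return words

# Toy example: one "neural networks" author and one "kernel methods" author.
at = {"A1": {0: 1.0}, "A2": {1: 1.0}}
tw = {0: {"neural": 0.5, "network": 0.5}, 1: {"kernel": 0.5, "svm": 0.5}}
paper = generate_paper(["A1", "A2"], at, tw, 30)
```

Running the loop many times yields a paper whose words mix both authors' topics, which is exactly the structure the learning algorithm later inverts.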

Graphical Model (a: authors; author-topic distributions; topic-word distributions; x: author, z: topic, w: word; plates over n words and D documents)

Learning
• Observed: W = observed words, A = sets of known authors
• Unknown:
  • x, z: hidden variables
  • θ, φ: unknown parameters
• Interested in:
  • p(x, z | W, A)
  • p(θ, φ | W, A)
  • But exact learning is not tractable

Step 1: Gibbs sampling of x and z, averaging over the unknown parameters

Step 2: estimates of θ and φ, conditioned on particular samples of x and z

Gibbs Sampling
• Need the full conditional distributions for the hidden variables
• The probability of assigning the current word i to topic j and author k, given everything else, combines two counts: the number of times word w is assigned to topic j, and the number of times topic j is assigned to author k
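The full conditional combines the two counts named above, each smoothed by a Dirichlet hyperparameter. A sketch, assuming symmetric priors α (author-topic) and β (topic-word) and counts that exclude the current token; the `counts` container and its field names are illustrative:

```python
from types import SimpleNamespace

def conditional(word, authors, counts, alpha, beta, V, T):
    """Unnormalized p(z_i = j, x_i = k | everything else) for each
    (author k, topic j) pair, for a vocabulary of V words and T topics.
    counts.wt[w][j]: times word w is assigned to topic j
    counts.t[j]:     total tokens assigned to topic j
    counts.at[k][j]: times topic j is assigned to author k
    counts.a[k]:     total tokens assigned to author k"""
    probs = {}
    for k in authors:
        for j in range(T):
            p_w = (counts.wt[word][j] + beta) / (counts.t[j] + V * beta)
            p_t = (counts.at[k][j] + alpha) / (counts.a[k] + T * alpha)
            probs[(k, j)] = p_w * p_t
    return probs

# Tiny example: vocabulary of 3 words, 2 topics, one candidate author.
counts = SimpleNamespace(
    wt={0: [2, 0], 1: [1, 1], 2: [0, 3]},  # word -> per-topic counts
    t=[3, 4],                              # per-topic totals
    at={"a": [3, 4]},                      # author -> per-topic counts
    a={"a": 7},                            # author totals
)
probs = conditional(0, ["a"], counts, alpha=0.1, beta=0.01, V=3, T=2)
```

Normalizing `probs` and sampling one (author, topic) pair from it is exactly one Gibbs update for one word token.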

Authors and Topics (CiteSeer Data)

Some likely topics per author (CiteSeer)
• Author = Andrew McCallum, UMass:
  - Topic 1: classification, training, generalization, decision, data, …
  - Topic 2: learning, machine, examples, reinforcement, inductive, …
  - Topic 3: retrieval, text, document, information, content, …
• Author = Hector Garcia-Molina, Stanford:
  - Topic 1: query, index, data, join, processing, aggregate, …
  - Topic 2: transaction, concurrency, copy, permission, distributed, …
  - Topic 3: source, separation, paper, heterogeneous, merging, …
• Author = Paul Cohen, USC/ISI:
  - Topic 1: agent, multi, coordination, autonomous, intelligent, …
  - Topic 2: planning, action, goal, world, execution, situation, …
  - Topic 3: human, interaction, people, cognitive, social, natural, …

Finding unusual papers for an author
• Perplexity = exp[entropy(words | model)] = a measure of how surprised the model is by the data
• Can calculate the perplexity of unseen documents, conditioned on the model for a particular author
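The perplexity definition above can be computed directly from per-word probabilities. The `word_prob` lookup is an illustrative stand-in for p(word | author's model):

```python
import math

def perplexity(words, word_prob):
    """Perplexity = exp[entropy(words | model)]: the exponential of the
    average negative log probability the model assigns to each word."""
    nll = -sum(math.log(word_prob[w]) for w in words) / len(words)
    return math.exp(nll)

# A model that is uniform over a 4-word vocabulary is "4 ways surprised"
# by any text drawn from that vocabulary: perplexity is approximately 4.
uniform = {w: 0.25 for w in "abcd"}
print(perplexity(list("abca"), uniform))  # approximately 4.0
```

Ranking an author's papers by this score surfaces the unusual ones: high perplexity means the paper's words are poorly predicted by that author's topic mixture, as in the outlier papers on the next slides.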

Papers and Perplexities: M_Jordan
• Factorial Hidden Markov Models: 687
• Learning from Incomplete Data: 702
• MEDIAN PERPLEXITY: 2567
• Defining and Handling Transient Fields in Pjama: 14555
• An Orthogonally Persistent JAVA: 16021

Papers and Perplexities: T_Mitchell
• Explanation-based Learning for Mobile Robot Perception: 1093
• Learning to Extract Symbolic Knowledge from the Web: 1196
• MEDIAN PERPLEXITY: 2837
• Text Classification from Labeled and Unlabeled Documents using EM: 3802
• A Method for Estimating Occupational Radiation Dose…: 8814

Who wrote what?
• Test of the model: (1) artificially combine abstracts from different authors; (2) check whether each word is assigned to the correct original author
• In the abstracts below, the digit after each content word is the author the model assigned it to (1 = Scholkopf_B, 2 = Darwiche_A):

Abstract written by (1) Scholkopf_B: A method1 is described which like the kernel1 trick1 in support1 vector1 machines1 SVMs1 lets us generalize distance1 based2 algorithms to operate in feature1 spaces usually nonlinearly related to the input1 space This is done by identifying a class of kernels1 which can be represented as norm1 based2 distances1 in Hilbert spaces It turns1 out that common kernel1 algorithms such as SVMs1 and kernel1 PCA1 are actually really distance1 based2 algorithms and can be run2 with that class of kernels1 too As well as providing1 a useful new insight1 into how these algorithms work the present2 work can form the basis1 for conceiving new algorithms

Abstract written by (2) Darwiche_A: This paper presents2 a comprehensive approach for model2 based2 diagnosis2 which includes proposals for characterizing and computing2 preferred2 diagnoses2 assuming that the system2 description2 is augmented with a system2 structure2 a directed2 graph2 explicating the interconnections between system2 components2 Specifically we first introduce the notion of a consequence2 which is a syntactically2 unconstrained propositional2 sentence2 that characterizes all consistency2 based2 diagnoses2 and show2 that standard2 characterizations of diagnoses2 such as minimal conflicts1 correspond to syntactic2 variations1 on a consequence2 Second we propose a new syntactic2 variation on the consequence2 known as negation2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm2 for computing consequences in NNF given a structured system2 description We show that if the system2 structure2 does not contain cycles2 then there is always a linear size2 consequence2 in NNF which can be computed in linear time2 For arbitrary1 system2 structures2 we show a precise connection between the complexity2 of computing2 consequences and the topology of the underlying system2 structure2 Finally we present2 an algorithm2 that enumerates2 the preferred2 diagnoses2 characterized by a consequence2 The algorithm2 is shown1 to take linear time2 in the size2 of the consequence2 if the preference criterion1 satisfies some general conditions

The Author-Topic Browser
• Querying on author Pazzani_M
• Querying on a topic relevant to that author
• Querying on a document written by the author

Comparing Predictive Power
• Train models on part of a new document and predict the remaining words

Outline
• Background on statistical text modeling
• Unsupervised learning from text documents
  • Motivation
  • Topic model and learning algorithm
  • Results
• Extensions
  • Author-topic models
• Applications
  • Demo of topic browser
  • Future directions

Online Demonstration of Topic Browser for UCI and UCSD Faculty
