Machine Learning How to put things together


Machine Learning: How to put things together?
A case study of model design, inference, learning, and evaluation in text analysis
Eric Xing
Lecture 20, August 16, 2010
© Eric Xing @ CMU, 2006-2010

Need computers to help us…
(image from images.google.cn)
- Humans cannot afford to deal with (e.g., search, browse, or measure similarity over) a huge number of text documents.
- We need computers to help out…

NLP and Data Mining
We want:
- Semantic-based search
- Infer topics and categorize documents
- Multimedia inference
- Automatic translation
- Predict how topics evolve
- …

How to get started?
- Here are some important elements to consider before you start:
  - Task: Embedding? Classification? Clustering? Topic extraction? …
  - Data representation: Input and output (e.g., continuous, binary, counts, …)
  - Model: BN? MRF? Regression? SVM?
  - Inference: Exact inference? MCMC? Variational?
  - Learning: MLE? MCLE? Max margin?
  - Evaluation: Visualization? Human interpretability? Perplexity? Predictive accuracy?
- It is better to consider one element at a time!

Tasks:
- Say we want to have a mapping …, so that we can:
  - Compare similarity
  - Classify contents
  - Cluster, group, or categorize
  - Distill semantics and perspectives
  - …

Modeling document collections
- A document collection is a dataset where each data point is itself a collection of simpler data.
  - Text documents are collections of words.
  - Segmented images are collections of regions.
  - User histories are collections of purchased items.
- Many modern problems ask questions of such data.
  - Is this text document relevant to my query?
  - Which category is this image in?
  - What movies would I probably like?
  - Create a caption for this image.

Representation:
- Data: Bag of Words Representation

  Example document:
  As for the Arabian and Palestinean voices that are against the current negotiations and the so-called peace process, they are not against peace per se, but rather for their well-founded predictions that Israel would NOT give an inch of the West bank (and most probably the same for Golan Heights) back to the Arabs. An 18 months of "negotiations" in Madrid, and Washington proved these predictions. Now many will jump on me saying why are you blaming israelis for no-result negotiations. I would say why would the Arabs stall the negotiations, what do they have to loose ?

  Words counted in the figure: Arabian, negotiations, against, peace, Israel, Arabs, blaming

- Each document is a vector in the word space.
- Ignore the order of words in a document. Only the counts matter!
- A high-dimensional and sparse representation:
  - Not efficient for text processing tasks, e.g., search, document classification, or similarity measures
  - Not effective for browsing
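
A minimal sketch of the bag-of-words mapping described above. The tokenization rule (lowercase runs of letters), the function name `bag_of_words`, and the four-word vocabulary are assumptions made for illustration; they are not taken from the lecture.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to a count vector over a fixed vocabulary.

    Word order is discarded; only the counts are kept.
    """
    # Assumed tokenization: lowercase runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocabulary]

vocabulary = ["arabs", "israel", "negotiations", "peace"]
doc = "Why would the Arabs stall the negotiations? Negotiations need peace."
vector = bag_of_words(doc, vocabulary)
print(vector)  # [1, 0, 2, 1]
```

With a realistic vocabulary of tens of thousands of words, most entries of such a vector are zero, which is the sparsity the slide refers to.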

How to Model Semantics?
- Q: What is it about?
- A: Mainly MT, with syntax, some learning

Example document: A Hierarchical Phrase-Based Model for Statistical Machine Translation
We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.

Topics (each a unigram distribution over the vocabulary):
- MT: Source, Target, SMT, Alignment, Score, BLEU
- Syntax: Parse, Tree, Noun, Phrase, Grammar, CFG
- Learning: likelihood, EM, Hidden, Parameters, Estimation, argMax

Mixing proportion: MT 0.6, Syntax 0.3, Learning 0.1

Topic Models

Why is this Useful?
Example document: A Hierarchical Phrase-Based Model for Statistical Machine Translation
- Q: What is it about?
  - A: Mainly MT, with syntax, some learning (mixing proportion: MT 0.6, Syntax 0.3, Learning 0.1)
- Q: Give me similar documents?
  - Structured way of browsing the collection
- Other tasks
  - Dimensionality reduction: TF-IDF vs. topic mixing proportion
  - Classification, clustering, and more…

Topic Models: The Big Picture
[Figure: an unstructured collection, shown as points in the word simplex (w1, w2, …, wn), is mapped by topic discovery and dimensionality reduction to a structured topic network, shown as points in the topic simplex (T1, T2, …, Tk).]

Topic Models
Generating a document
[Figure: plate diagram with nodes Prior, θ, z, w, β and plates K, Nd, N.]
Which prior to use?
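
The generative story in the diagram can be sketched as follows, assuming a Dirichlet prior (one of the two choices the lecture goes on to discuss). The function names, the toy topic-word matrix `beta`, and the hyperparameter values are illustrative assumptions.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, num_words, rng):
    """Generate one document: theta ~ prior, z_n ~ theta, w_n ~ beta[z_n]."""
    theta = sample_dirichlet(alpha, rng)
    topics = list(range(len(beta)))
    vocab = list(range(len(beta[0])))
    assignments, words = [], []
    for _ in range(num_words):
        z = rng.choices(topics, weights=theta)[0]   # pick a topic
        w = rng.choices(vocab, weights=beta[z])[0]  # pick a word from that topic
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

rng = random.Random(0)
# Hypothetical topics over a 4-word vocabulary (K = 2 rows, each sums to 1).
beta = [[0.7, 0.2, 0.1, 0.0],
        [0.0, 0.1, 0.2, 0.7]]
theta, z, w = generate_document([0.5, 0.5], beta, num_words=20, rng=rng)
print(theta, w)
```

Swapping `sample_dirichlet` for a different sampler of θ is all it takes to change the prior, which is why the choice of prior can be discussed separately from the rest of the model.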

Choices of Priors
- Dirichlet (LDA) (Blei et al. 2003)
  - Conjugate prior means efficient inference
  - Can only capture variations in each topic's intensity independently
- Logistic Normal (CTM = LoNTAM) (Blei & Lafferty 2005, Ahmed & Xing 2006)
  - Captures the intuition that some topics are highly correlated and can rise up in intensity together
  - Not a conjugate prior, which implies hard inference

Generative Semantics of LoNTAM
Generating a document
[Figure: plate diagram with nodes μ, Σ, γ, z, w, β and plates K, Nd, N. Equations not recovered.]
Problem:
- Log partition function
- Normalization constant
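
The slide's equations were not recovered. As a hedged reconstruction, the standard logistic-normal topic model is usually written as below; the symbol C(γ) for the log partition function is an assumption, and the original LoNTAM parameterization may differ in detail (for example by fixing one component of γ to zero).

```latex
\begin{aligned}
\gamma &\sim \mathcal{N}(\mu, \Sigma) \\
\theta_k &= \exp\{\gamma_k - C(\gamma)\}, \qquad
C(\gamma) = \log \sum_{j=1}^{K} \exp\{\gamma_j\} \\
z_n \mid \theta &\sim \mathrm{Mult}(\theta), \qquad
w_n \mid z_n, \beta \sim \mathrm{Mult}(\beta_{z_n})
\end{aligned}
```

The term C(γ) is what breaks conjugacy: the expectation of a log-sum-exp under a Gaussian has no closed form, so it has to be approximated.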

Using the Model
- Inference
  - Given a document D
    - Posterior: P(Θ | μ, Σ, β, D)
    - Evaluation: P(D | μ, Σ, β)
- Learning
  - Given a collection of documents {D_i}
    - Parameter estimation

Inference
[Figure: plate diagram with nodes μ, Σ, γ, z, w, β and plates K, Nd, N.]
The posterior is intractable, so we need approximate inference.

Variational Inference
[Figure: the true model (μ, Σ, γ, z, w, β) next to its variational approximation with free parameters μ*, Σ*, φ*.]
- Approximate the integral
- Approximate the posterior using variational parameters μ*, Σ*, φ*_{1:n}
- Solve an optimization problem

Variational Inference With no Tears
[Figure: plate diagram with nodes μ, Σ, γ, β, z, w.]
- Iterate until convergence:
  - Pretend you know E[z_{1:n}]
    - P(γ | E[z_{1:n}], μ, Σ)
  - Now you know E[γ]
    - P(z_{1:n} | E[γ], w_{1:n}, β_{1:k})
- More formally: a message passing scheme, generalized mean field (GMF), equivalent to the previous method (Xing et al. 2003)

LoNTAM Variational Inference
[Figure: plate diagrams with nodes μ, Σ, γ, β, z, w.]
- Fully factored distribution (equation not recovered)
- Two clusters: γ and z_{1:n}
- Fixed point equations (not recovered)

Variational γ
[Figure: plate diagram with nodes μ, Σ, γ, β, z, w, and the update for γ given q_z. Equations not recovered.]
Now what is …?

Approximation Quality

Variational z
[Figure: plate diagram with nodes μ, Σ, γ, β, z, w, and the update for z given q_γ, β, and w. Equations not recovered.]

Variational Inference: Recap
- Run one document at a time
- Message passing scheme
  - GMF message from z to γ (equation not recovered)
  - GMF message from γ to z (equation not recovered)
- Iterate until convergence
- Posterior over γ is a multivariate normal (MVN) with full covariance
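
The alternating scheme in this recap can be sketched in code. Because the LoNTAM update equations were not recovered, the sketch below swaps in the simpler conjugate case, LDA with a symmetric Dirichlet prior, where both mean-field updates have a well-known closed form. It is not the LoNTAM algorithm; the function names, the toy `beta`, and the hand-rolled `digamma` are assumptions made to keep the example self-contained.

```python
import math

def digamma(x):
    """Digamma function via recurrence plus an asymptotic series (x > 0)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def mean_field_lda(words, alpha, beta, tol=1e-6, max_iter=1000):
    """Coordinate ascent for one document under LDA (stand-in for LoNTAM).

    Returns (gamma, phi, sweeps): Dirichlet parameters of q(theta),
    per-word topic responsibilities, and the number of sweeps used.
    """
    num_topics = len(beta)
    gamma = [alpha + len(words) / num_topics] * num_topics
    phi = []
    sweeps = 0
    for sweeps in range(1, max_iter + 1):
        # Update q(z), pretending E[log theta] is known. The shared term
        # digamma(sum(gamma)) cancels when each row is normalized.
        log_weight = [digamma(g) for g in gamma]
        phi = []
        for w in words:
            scores = [beta[k][w] * math.exp(log_weight[k]) for k in range(num_topics)]
            total = sum(scores)
            phi.append([s / total for s in scores])
        # Update q(theta), pretending E[z] is known.
        new_gamma = [alpha + sum(row[k] for row in phi) for k in range(num_topics)]
        change = max(abs(a - b) for a, b in zip(new_gamma, gamma))
        gamma = new_gamma
        if change < tol:  # same 1e-6 threshold the lecture quotes as default
            break
    return gamma, phi, sweeps

# Hypothetical 2-topic model over a 4-word vocabulary.
beta = [[0.6, 0.3, 0.05, 0.05],
        [0.05, 0.05, 0.3, 0.6]]
gamma, phi, sweeps = mean_field_lda([0, 0, 1, 0, 3], alpha=0.1, beta=beta)
print(gamma, sweeps)
```

In LoNTAM the q(θ) step is replaced by a Gaussian update for γ, which is where the approximation of the log partition function enters.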

Now you've got an algorithm, don't forget to compare with related work
[Figure: the variational approximations of both methods, with free parameters μ*, Σ*, φ over nodes γ, z, w, β.]
- Ahmed & Xing
  - Σ* is a full matrix
  - Log partition function: multivariate quadratic approximation
  - Closed-form solution for μ*, Σ*
- Blei & Lafferty
  - Σ* is assumed to be diagonal
  - Log partition function: tangent approximation
  - Numerical optimization to fit μ*, Diag(Σ*)

Tangent Approximation

Evaluation
- A common (but not so right) practice
  - Try models on real data for some empirical task, say classification or topic extraction; two reactions:
    - Hmm! The results "make sense to me", so the model is good!
      - Objective?
    - Gee! The results are terrible! The model is bad!
      - Where does the error come from?

Evaluation: testing inference
- Simulated data
  - We know the ground truth for Θ
  - This is a crucial step because it separates performance loss caused by an inadequate model from loss caused by inaccurate inference
- Vary model dimensions
  - K = number of topics
  - M = vocabulary size
  - Nd = number of words per document
- Test
  - Inference
    - Accuracy of the recovered Θ
    - Number of iterations to converge (1e-6, default setting)
  - Parameter estimation
    - Goal: (equation not recovered)
    - Standard VEM + deterministic annealing

Test on Synthetic Text
[Figure: plate diagram with nodes μ, Σ, θ, β, z, w.]

Comparison: accuracy and speed
L2 error in topic vector estimation and number of iterations
- Varying number of topics
- Varying vocabulary size
- Varying number of words per document
[Plots not recovered.]
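
A small sketch of the accuracy measure named on this slide, assuming the L2 error is the Euclidean distance between the true and recovered topic vectors, averaged over simulated documents. The function names and the toy numbers are illustrative, not results from the lecture.

```python
import math

def l2_error(theta_true, theta_est):
    """Euclidean distance between a true and a recovered topic vector."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(theta_true, theta_est)))

def mean_l2_error(true_thetas, est_thetas):
    """Average L2 error over a set of simulated documents."""
    errors = [l2_error(t, e) for t, e in zip(true_thetas, est_thetas)]
    return sum(errors) / len(errors)

# Made-up ground truth and estimates for two simulated documents.
true_thetas = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
est_thetas = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
print(mean_l2_error(true_thetas, est_thetas))
```

Because the ground truth is only available for simulated data, this measure isolates inference error from modeling error, as the previous slides argue.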

Parameter Estimation
[Figure: plate diagram with nodes μ, Σ, θ, β, z.]
- Goal: (equation not recovered)
- Standard VEM
  - Fit μ*, Σ* for each document
  - Get the model parameters using their expected sufficient statistics
- Problems
  - Having full covariance for the posterior traps our model in local maxima
  - Solution: deterministic annealing

Deterministic Annealing: Big Picture

Deterministic Annealing
- EM
- DA-EM


Deterministic Annealing
- EM
- DA-EM
Life is not always that good.

Deterministic Annealing
- EM
- DA-EM
- VEM
- DA-VEM
- For exponential families, this requires a two-line change to standard (V)EM
- Read more on that (Noah Smith & Jason Eisner, ACL 2004, COLING-ACL 2006)
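
To make the "two-line change" concrete, here is a minimal sketch on a toy problem rather than on the topic model: EM for the means of a one-dimensional Gaussian mixture with a fixed, shared variance. It assumes the common tempered-posterior form of DA-EM, in which the E-step posterior is raised to a power β ≤ 1 and renormalized while β is annealed up to 1. The slide's own equations were not recovered, so the schedule `betas`, the function name, and the data are all assumptions.

```python
import math

def da_em_means(xs, means, var=1.0, betas=(0.3, 0.6, 1.0), iters=50):
    """DA-EM for mixture weights and means of a 1-D Gaussian mixture.

    With betas=(1.0,) this reduces to standard EM.
    """
    k = len(means)
    means = list(means)
    weights = [1.0 / k] * k
    for beta in betas:  # change 1: outer loop over the annealing schedule
        for _ in range(iters):
            # E-step. The Gaussian normalizer is shared by all components
            # (same variance), so it cancels and is omitted.
            resp = []
            for x in xs:
                joint = [weights[j] * math.exp(-(x - means[j]) ** 2 / (2 * var))
                         for j in range(k)]
                tempered = [p ** beta for p in joint]  # change 2: temper the posterior
                total = sum(tempered)
                resp.append([t / total for t in tempered])
            # M-step: unchanged from standard EM.
            for j in range(k):
                mass = sum(r[j] for r in resp)
                weights[j] = mass / len(xs)
                means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / mass
    return weights, means

xs = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
weights, means = da_em_means(xs, means=[-0.5, 0.5])
print(weights, means)
```

At small β the responsibilities are nearly uniform, which smooths the objective; as β approaches 1 the procedure becomes ordinary EM started from the annealed solution.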

Result on NIPS collection
- NIPS proceedings from 1988-2003
  - 14036 words
  - 2484 docs
- 80% for training and 20% for testing
- Fit both models with 10, 20, 30, 40 topics
- Compare perplexity on held-out data
  - The perplexity of a language model with respect to text x is the reciprocal of the geometric average of the probabilities of the predictions in text x. So, if text x has k words, then the perplexity of the language model with respect to that text is Pr(x)^(-1/k).
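
The definition above translates directly into code. This sketch assumes Pr(x) is given as a list of per-word predictive probabilities whose product is Pr(x), and it works in log space to avoid underflow on long texts; the function name is illustrative.

```python
import math

def perplexity(word_probs):
    """Perplexity Pr(x)^(-1/k) from the k per-word predictive probabilities."""
    k = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log Pr(x)
    return math.exp(-log_prob / k)

# A model that assigns probability 1/4 to every word is as uncertain as a
# uniform choice among 4 words, so its perplexity is (up to rounding) 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity on held-out documents means the model predicts unseen text better.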

Classification Result on PNAS collection
- PNAS abstracts from 1997-2002
  - 2500 documents
  - Average of 170 words per document
- Fitted a 40-topic model using both approaches
- Use the low-dimensional representation to predict the abstract category
  - Use an SVM classifier
  - 85% for training and 15% for testing
Classification accuracy (figure not recovered):
- Notable difference
- Examine the low-dimensional representations below

Are we done?
- What was our task?
  - Embedding (lower-dimensional representation): yes, Doc → θ
  - Distillation of semantics: kind of, we've learned "topics" β
  - Classification: is it good?
  - Clustering: is it reasonable?
  - Other predictive tasks?

Some shocking results on LDA
[Figure panels: Classification, Retrieval, Annotation.]
- LDA actually performs very poorly on several "objectively" evaluable predictive tasks

Why?
- LDA is neither designed nor trained for such tasks. For classification, for example, there is no guarantee that the estimated topic vector θ is good at discriminating between documents.

Supervised Topic Model (sLDA)
- LDA ignores documents' side information (e.g., categories or rating scores), which leads to suboptimal topic representations for supervised tasks
- Supervised topic models handle such problems, e.g., sLDA (Blei & McAuliffe, 2007) and DiscLDA (Simon et al., 2008)

Generative procedure (sLDA) (Blei & McAuliffe, 2007):
- For each document d:
  - Sample a topic proportion
  - For each word:
    - Sample a topic
    - Sample a word
  - Sample the response: continuous (regression) or discrete (classification)

Joint distribution: (equation not recovered)
Variational inference: (equation not recovered)
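
The distributions in the generative procedure were lost in extraction. For the regression case, the sLDA model of Blei & McAuliffe (2007) is commonly written as follows; the symbols α, β, η, σ² and the empirical topic frequency z̄_d are the usual notation and are an assumption about what the slide showed.

```latex
\begin{aligned}
\theta_d \mid \alpha &\sim \mathrm{Dir}(\alpha) \\
z_{dn} \mid \theta_d &\sim \mathrm{Mult}(\theta_d) \\
w_{dn} \mid z_{dn}, \beta &\sim \mathrm{Mult}(\beta_{z_{dn}}) \\
y_d \mid z_{d,1:N}, \eta, \sigma^2 &\sim \mathcal{N}\!\left(\eta^{\top} \bar{z}_d,\ \sigma^2\right),
\qquad \bar{z}_d = \frac{1}{N} \sum_{n=1}^{N} z_{dn}
\end{aligned}
```

The response depends on the topic assignments actually used in the document, so fitting the model pushes the topics toward ones that are predictive of y.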

MedLDA: a max-margin approach
- Big picture of supervised topic models
  - sLDA: optimizes the joint likelihood for regression and classification
  - DiscLDA: optimizes the conditional likelihood for classification ONLY
  - MedLDA: based on max-margin learning for both regression and classification

MedLDA Regression Model
Generative procedure (Bayesian sLDA):
- Sample a parameter
- For each document d:
  - Sample a topic proportion
  - For each word:
    - Sample a topic
    - Sample a word
  - Sample the response
Def: (optimization problem not recovered; the slide labels its two parts "predictive accuracy" and "model fitting")

Experiments
- Goals:
  - To qualitatively and quantitatively evaluate how the max-margin estimates of MedLDA affect its topic discovery procedure
- Data sets:
  - 20 Newsgroups
    - Documents from 20 categories
    - ~20,000 documents in total, about 1,000 in each group
    - Remove stop words as listed in …
  - Movie Review
    - 5006 documents, and 1.6M words
    - Dictionary: 5000 terms selected by tf-idf
    - Preprocessing to make the response approximately normal (Blei & McAuliffe, 2007)

Document Modeling
- Data set: 20 Newsgroups
- 110 topics + 2D embedding with t-SNE (van der Maaten & Hinton, 2008)
[Figure: 2D embeddings for MedLDA and LDA.]

Classification
- Data set: 20 Newsgroups
  - Binary classification: "alt.atheism" and "talk.religion.misc" (Simon et al., 2008)
  - Multiclass classification: all the 20 categories
- Models: DiscLDA, sLDA (binary ONLY! Classification sLDA (Wang et al., 2009)), LDA+SVM (baseline), MedLDA+SVM
- Measure: relative improvement ratio

Regression
- Data set: Movie Review (Blei & McAuliffe, 2007)
- Models: MedLDA (partial), MedLDA (full), sLDA, LDA+SVR
- Measure: predictive R² and per-word log-likelihood
Sharp decrease in SVs (support vectors)
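
A short sketch of the predictive R² measure, assuming the usual definition: one minus the ratio of the residual sum of squares to the total sum of squares of the held-out responses. The function name and the toy numbers are illustrative.

```python
def predictive_r2(y_true, y_pred):
    """Predictive R^2 = 1 - sum (y - y_hat)^2 / sum (y - y_bar)^2."""
    mean_y = sum(y_true) / len(y_true)
    residual = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    total = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - residual / total

y_true = [1.0, 2.0, 3.0, 4.0]
print(predictive_r2(y_true, y_true))     # perfect predictions give 1.0
print(predictive_r2(y_true, [2.5] * 4))  # always predicting the mean gives 0.0
```

A value of 0 therefore means the model does no better than the constant mean predictor, and negative values mean it does worse.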

Time Efficiency
- Binary classification (results not recovered)
- Multiclass:
  - MedLDA is comparable with LDA+SVM
- Regression:
  - MedLDA is comparable with sLDA

Finally, think about a general framework
- MedLDA can be generalized to arbitrary topic models:
  - Unsupervised or supervised
  - Generative or undirected random fields (e.g., Harmoniums)
- MED Topic Model (MedTM): (formulation not recovered; the symbols it defines are also missing from the list below)
  - Hidden random variables in the underlying topic model, e.g., those in LDA
  - Parameters in the predictive model, e.g., those in sLDA
  - Parameters of the topic model, e.g., those in LDA
  - A variational upper bound of the negative log-likelihood
  - A convex function over slack variables

Summary
- A 6-dimensional space of working with graphical models
  - Task: Embedding? Classification? Clustering? Topic extraction? …
  - Data representation: Input and output (e.g., continuous, binary, counts, …)
  - Model: BN? MRF? Regression? SVM?
  - Inference: Exact inference? MCMC? Variational?
  - Learning: MLE? MCLE? Max margin?
  - Evaluation: Visualization? Human interpretability? Perplexity? Predictive accuracy?
- It is better to consider one element at a time!