Random Forests for Language Modeling. Peng Xu and Frederick Jelinek, CLSP, The Johns Hopkins University. IPAM, January 24, 2006.
What Is a Language Model? A probability distribution over word sequences, based on conditional probability distributions: the probability of a word given its history (the past words).
What Is a Language Model For? Speech recognition, via the source-channel model: W* = argmax_W P(W | A) = argmax_W P(A | W) P(W), where A is the acoustic signal and W a word sequence; P(W) is the language model.
n-gram Language Models. A simple yet powerful solution to language modeling: keep only the (n-1) most recent words of the history, giving the n-gram model P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-n+1} ... w_{i-1}). Maximum Likelihood (ML) estimate: P_ML(w_i | w_{i-n+1} ... w_{i-1}) = C(w_{i-n+1} ... w_i) / C(w_{i-n+1} ... w_{i-1}). Sparseness problem: training and test mismatch; most n-grams are never seen, hence the need for smoothing.
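A minimal sketch of the ML count-ratio estimate in plain Python (illustrative names, not the authors' code); it also makes the sparseness problem concrete, since any unseen n-gram gets probability zero:

```python
from collections import defaultdict

def ml_ngram_model(corpus, n=3):
    """Maximum-likelihood n-gram estimates: P(w | history) is the count of
    (history, w) divided by the count of the history. Unseen n-grams get
    probability 0, which is exactly why smoothing is needed."""
    ngram_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for i in range(n - 1, len(corpus)):
        history = tuple(corpus[i - n + 1:i])
        ngram_counts[(history, corpus[i])] += 1
        history_counts[history] += 1

    def prob(word, history):
        history = tuple(history)
        if history_counts[history] == 0:
            return 0.0
        return ngram_counts[(history, word)] / history_counts[history]

    return prob

# Toy usage: trigram probability P("a" | "b", "c").
prob = ml_ngram_model("a b a b c a b c b".split(), n=3)
print(prob("a", ("b", "c")))   # 0.5: "b c" is followed by "a" once and "b" once
```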
Sparseness Problem. Example: UPenn Treebank portion of WSJ; 1 million words of training data, 82 thousand words of test data, 10-thousand-word open vocabulary.

n-gram order   3      4      5      6
% unseen       54.5   75.4   83.1   86.0

Sparseness makes language modeling a difficult regression problem: an n-gram model needs at least |V|^n words to cover all n-grams.
More Data. The "more data" solution to data sparseness. The web has "everything", but web data is noisy. The web does NOT have everything: language models built from web data still suffer from data sparseness; [Zhu & Rosenfeld, 2001] found that in 24 random web news sentences, 46 out of 453 trigrams were not covered by AltaVista. In-domain training data is not always easy to get.
Dealing With Sparseness in n-grams. Smoothing: take some probability mass from seen n-grams and redistribute it among unseen n-grams. Interpolated Kneser-Ney smoothing consistently gives the best performance [Chen & Goodman, 1998].
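For reference, the interpolated Kneser-Ney estimate in its standard bigram form (Chen & Goodman's formulation with absolute discount D; the slide itself shows no formula):

```latex
P_{KN}(w_i \mid w_{i-1})
  = \frac{\max\bigl(C(w_{i-1} w_i) - D,\ 0\bigr)}{C(w_{i-1})}
  + \lambda(w_{i-1})\, P_{KN}(w_i),
\qquad
\lambda(w_{i-1}) = \frac{D\,\bigl|\{w : C(w_{i-1} w) > 0\}\bigr|}{C(w_{i-1})}
```

The lower-order continuation probability P_KN(w_i) is proportional to the number of distinct words that precede w_i in the training data; higher-order models apply the same interpolation recursively.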
Our Approach. Extend the appealing idea of the n-gram history to history clustering via decision trees, and overcome the problems in decision tree construction by using Random Forests!
Decision Tree Language Models. Decision trees perform an equivalence classification of histories: each leaf is specified by the answers to a series of questions (posed to the history) that lead from the root to that leaf, and each leaf corresponds to a subset of the histories. Thus the histories are partitioned (i.e., classified).
Decision Tree Language Models: An Example. Training data (trigrams): aba, aca, bcb, bbb, ada.
Root node: histories {ab, ac, bc, bb, ad}, counts a: 3, b: 2.
Question "Is the first word in {a}?" leads to the left leaf: histories {ab, ac, ad}, counts a: 3, b: 0; the new test event 'adb' falls here.
Question "Is the first word in {b}?" leads to the right leaf: histories {bc, bb}, counts a: 0, b: 2; the new test event 'bdb' falls here.
New test event 'cba': its history 'cb' answers neither question, so the tree is stuck!
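A small sketch mirroring the slide's example (illustrative code, not the authors' implementation): the single question partitions the training histories into two leaves, and the unseen history 'cb' answers neither question, so the tree is stuck:

```python
from collections import Counter

# Training trigrams from the slide: history = first two characters, predicted word = third.
training = ["aba", "aca", "bcb", "bbb", "ada"]

S, S_complement = {"a"}, {"b"}   # the split of the first history position

def leaf_of(history):
    """Route a two-word history to a leaf using the slide's question."""
    if history[0] in S:
        return "left"
    if history[0] in S_complement:
        return "right"
    return None   # neither answer applies: the decision tree is stuck

leaf_counts = {"left": Counter(), "right": Counter()}
for h1, h2, w in training:
    leaf_counts[leaf_of(h1 + h2)][w] += 1

print(leaf_counts["left"])    # Counter({'a': 3})  -- histories ab, ac, ad
print(leaf_counts["right"])   # Counter({'b': 2})  -- histories bc, bb
print(leaf_of("ad"), leaf_of("bd"), leaf_of("cb"))  # left right None ('cba' is stuck)
```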
Decision Tree Language Models: An Example (continued). For trigrams (w_-2, w_-1, w_0), the questions are about history positions: "Is w_-i in S?" and "Is w_-i in S^c?"; there are two history positions for a trigram. Each pair (S, S^c) defines a possible split of a node, and therefore of the training data; S and S^c are complements with respect to the training data, so a node gets less data than its ancestors. (S, S^c) are obtained by an exchange algorithm, sketched below.
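A greedy exchange-algorithm sketch under stated assumptions: random initialization, single-element moves, and the training-data log-likelihood of the split as the objective, as on the slides. The function names and stopping rule are illustrative, not the authors' code:

```python
import math
import random
from collections import defaultdict

def split_log_likelihood(events, position, S):
    """Training-data log-likelihood of splitting `events` (a list of
    (history, word) pairs) by the question 'is history[position] in S?'."""
    counts = [defaultdict(int), defaultdict(int)]   # word counts per child
    totals = [0, 0]
    for hist, w in events:
        side = 0 if hist[position] in S else 1
        counts[side][w] += 1
        totals[side] += 1
    ll = 0.0
    for side in (0, 1):
        for w, c in counts[side].items():
            ll += c * math.log(c / totals[side])    # ML leaf probabilities
    return ll

def exchange_algorithm(events, position, seed=0, max_sweeps=50):
    """Greedy exchange: start from a random partition of the values seen at
    `position`; move one value at a time between S and its complement while
    the training-data log-likelihood of the split improves."""
    rng = random.Random(seed)
    values = sorted({hist[position] for hist, _ in events})
    S = {v for v in values if rng.random() < 0.5} or {values[0]}
    best = split_log_likelihood(events, position, S)
    for _ in range(max_sweeps):
        improved = False
        for v in values:
            candidate = S ^ {v}                     # toggle membership of v
            if not candidate or candidate == set(values):
                continue                            # keep both children non-empty
            ll = split_log_likelihood(events, position, candidate)
            if ll > best:
                S, best, improved = candidate, ll, True
        if not improved:
            break
    return S, best

# Toy usage with the example trigrams: events are ((w-2, w-1), w0) pairs.
events = [(("a", "b"), "a"), (("a", "c"), "a"), (("b", "c"), "b"),
          (("b", "b"), "b"), (("a", "d"), "a")]
print(exchange_algorithm(events, position=0))   # ({'a'}, 0.0): a perfect split of this toy data
```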
Construction of Decision Trees. Data driven: decision trees are constructed on the basis of training data. The construction requires: (1) the set of possible questions, (2) a criterion evaluating the desirability of questions, and (3) a construction stopping rule or post-pruning rule.
Construction of Decision Trees: Our Approach. Grow a decision tree to maximum depth using the training data: use training-data likelihood to evaluate questions and perform no smoothing during growing. Then prune the fully grown decision tree to maximize heldout-data likelihood, incorporating KN smoothing during pruning.
Smoothing Decision Trees. Smooth using ideas similar to interpolated Kneser-Ney smoothing. Note: histories in one node are not all smoothed in the same way, and only the leaves are used as equivalence classes.
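A plausible reconstruction of the leaf-level smoothing, by analogy with interpolated KN (the slide's equation did not survive extraction; Φ(w_-2, w_-1) denotes the leaf reached by the history, D an absolute discount, λ a per-leaf interpolation weight, and the lower-order model is the KN bigram):

```latex
P_{DT}\bigl(w_0 \mid \Phi(w_{-2}, w_{-1})\bigr)
  = \frac{\max\bigl(C(\Phi(w_{-2}, w_{-1}),\, w_0) - D,\ 0\bigr)}{C(\Phi(w_{-2}, w_{-1}))}
  + \lambda\bigl(\Phi(w_{-2}, w_{-1})\bigr)\, P_{KN}(w_0 \mid w_{-1})
```

Under a form like this, histories sharing a leaf share the discounted count ratio but differ in the lower-order term P_KN(w_0 | w_-1), which is why they are not all smoothed in the same way.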
Problems with Decision Trees. Training-data fragmentation: as the tree is developed, questions are selected on the basis of less and less data. Lack of optimality: the exchange algorithm is greedy, and so is the tree-growing algorithm. Overtraining and undertraining: deep trees fit the training data well but will not generalize well to new test data; shallow trees are not sufficiently refined.
Amelioration: Random Forests. Breiman applied the idea of random forests to relatively small problems [Breiman 2001]: using different random samples of the data and randomly chosen subsets of questions, construct K decision trees; apply a test datum x to all the decision trees, producing classes y_1, y_2, ..., y_K; accept the plurality decision y* = argmax_y sum_{i=1..K} 1(y_i = y).
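A minimal plurality-vote sketch (the classify(x) interface is assumed for illustration):

```python
from collections import Counter

def forest_classify(trees, x):
    """Plurality vote over K fitted decision trees; each tree is assumed
    to expose a classify(x) method returning a class label."""
    votes = Counter(tree.classify(x) for tree in trees)
    return votes.most_common(1)[0][0]
```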
Example of a Random Forest (figure). Three decision trees T1, T2, T3 each classify an example x; a plurality of the trees output class a, so x is classified as a by this random forest.
Random Forests for Language Modeling. Two kinds of randomness: (1) random selection of the history positions to ask about (alternatives: position 1, position 2, or the better of the two); (2) random initialization of the exchange algorithm. 100 decision trees are grown; the i-th tree estimates P_{DT(i)}(w_0 | w_-2, w_-1), and the final estimate is the average over all trees.
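For language modeling the aggregation is an unweighted average of the per-tree smoothed probabilities; a sketch, assuming each tree exposes a prob(word, history) method:

```python
def rf_probability(trees, word, history):
    """Random-forest LM estimate: the average of the smoothed probabilities
    assigned to `word` given `history` by the individual decision-tree LMs."""
    return sum(tree.prob(word, history) for tree in trees) / len(trees)
```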
Experiments. Evaluation by perplexity (PPL). Data: UPenn Treebank part of WSJ, about 1 million words for training and heldout (90%/10%) and 82 thousand words for test.
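Perplexity here is the standard exponentiated negative average log-probability; a short helper for concreteness:

```python
import math

def perplexity(log_probs):
    """PPL = exp(-(1/N) * sum of natural-log word probabilities),
    with one log-probability per test-set word."""
    return math.exp(-sum(log_probs) / len(log_probs))
```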
Experiments: trigram. Baseline: KN-trigram; no randomization: DT-trigram; 100 random DTs: RF-trigram.

Model        Heldout PPL   Gain     Test PPL   Gain
KN-trigram   160.1         -        145.0      -
DT-trigram   158.6         0.9%     163.3      -12.6%
RF-trigram   126.8         20.8%    129.7      10.5%
Experiments: Aggregating. Considerable improvement already with 10 trees!
Experiments: Analysis. A test event is "seen" for the KN-trigram if the trigram occurs in the training data, and "seen" for the DT-trigram if its history's leaf together with the predicted word occurs in the training data. Analyze test-data events by the number of times they are seen among the 100 DTs.
Experiments: Stability. The PPL results of different realizations vary, but the differences are small.
Experiments: Aggregation vs. Interpolation. Aggregation: the uniform average of the tree estimates. Weighted average (interpolation): sum_i lambda_i * P_{DT(i)}(w_0 | w_-2, w_-1), with the weights estimated so as to maximize heldout-data log-likelihood.
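The slides do not say how the weights are estimated beyond maximizing heldout log-likelihood; EM for mixture weights is the standard choice and is sketched here under that assumption:

```python
def em_interpolation_weights(prob_matrix, iters=20):
    """EM re-estimation of interpolation weights for a linear mixture of LMs.
    prob_matrix[n][i] is the probability model i assigns to heldout event n;
    the returned weights (summing to 1) maximize heldout log-likelihood."""
    n_models = len(prob_matrix[0])
    lam = [1.0 / n_models] * n_models
    for _ in range(iters):
        counts = [0.0] * n_models
        for probs in prob_matrix:
            mix = sum(l * p for l, p in zip(lam, probs))
            for i, (l, p) in enumerate(zip(lam, probs)):
                counts[i] += l * p / mix   # posterior responsibility of model i
        total = sum(counts)
        lam = [c / total for c in counts]
    return lam

# Toy usage: two models, three heldout events.
print(em_interpolation_weights([[0.2, 0.1], [0.05, 0.3], [0.4, 0.4]]))
```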
Experiments: Aggregation vs. Interpolation. Optimal interpolation gains almost nothing!
Experiments: Higher-Order n-gram Models. Baseline: KN n-gram; 100 random DTs: RF n-gram. Test PPL:

n    3       4       5       6
KN   145.0   140.0   138.8   138.6
RF   129.7   126.4   126.0   126.3
Applying Random Forests to Other Models: the SLM. Structured Language Model (SLM) [Chelba & Jelinek, 2000]; approximation: use tree triples. Test PPL for the SLM: KN 137.9, RF 122.8.
Speech Recognition Experiments (I). Word Error Rate (WER) by N-best rescoring. Training: WSJ text, 20 or 40 million words. Test: WSJ DARPA '93 HUB1 data, 213 utterances, 3,446 words. N-best rescoring baseline WER: 13.7%; the N-best lists were generated by a trigram baseline using Katz backoff smoothing and trained on 40 million words; the oracle error rate is around 6%.
Speech Recognition Experiments (I). Baseline: KN smoothing; 100 random DTs for the RF 3-gram; 100 random DTs for the PREDICTOR in the SLM; the approximation above is used in the SLM.

WER       3-gram (20M)   3-gram (40M)   SLM (20M)
KN        14.0%          13.0%          12.8%
RF        12.9%          12.4%          11.9%
p-value   <0.001         <0.05          <0.001
Speech Recognition Experiments (II). Word Error Rate by lattice rescoring. IBM 2004 Conversational Telephony System for Rich Transcription (1st place in the RT-04 evaluation). Data: Fisher, 22 million words; WEB, 525 million words, collected using frequent Fisher n-grams as queries; other data: Switchboard, Broadcast News, etc. Lattice language model: 4-gram with interpolated Kneser-Ney smoothing, pruned to 3.2 million unique n-grams, WER 14.4%. Test set: DEV04, 37,834 words.
Speech Recognition Experiments (II). Baseline: KN 4-gram; 110 random DTs for the EB-RF 4-gram; data sampled without replacement; the Fisher and WEB models are interpolated.

WER       Fisher 4-gram   WEB 4-gram   Fisher+WEB 4-gram
KN        14.1%           15.2%        13.7%
RF        13.5%           15.0%        13.1%
p-value   <0.001          -            <0.001
Practical Limitations of the RF Approach. Memory: decision tree construction uses much more memory; there is little performance gain when the training data is really large; and because we have 100 trees, the final model becomes too large to fit into memory. Effective language model compression or pruning remains an open question.
Conclusions: Random Forests. A new RF language modeling approach; a more general LM (RF generalizes DT, which generalizes the n-gram); randomized history clustering; good generalization: better n-gram coverage, less bias toward the training data; an extension of Breiman's random forests to the data sparseness problem.
Conclusions: Random Forests. Improvements in perplexity and/or word error rate over interpolated Kneser-Ney smoothing for different models: n-gram (up to n = 6), class-based trigram, and the Structured Language Model. Significant improvements in the best-performing large-vocabulary conversational telephony speech recognition system.


