
- Number of slides: 183
Online Learning for Real-World Problems Koby Crammer University of Pennsylvania 1
Thanks • Ofer Dekel • Joseph Keshet • Shai Shalev-Shwartz • Yoram Singer • Axel Bernal • Steve Caroll • Mark Dredze • Kuzman Ganchev • Ryan McDonald • Artemis Hatzigeorgiu • Fernando Pereira • Fei Sha • Partha Pratim Talukdar
Tutorial Context SVMs Real-World Data Online Learning Tutorial Optimization Theory Multiclass, Structured
Online Learning Tyrannosaurus rex 4
Online Learning Triceratops 5
Online Learning Velociraptor Tyrannosaurus rex
Formal Setting – Binary Classification • Instances – Images, Sentences • Labels – Parse tree, Names • Prediction rule – Linear prediction rules • Loss – No. of mistakes
Predictions • Discrete Predictions: – Hard to optimize • Continuous predictions : – Label – Confidence 8
Loss Functions • Natural Loss: – Zero-One loss: • Real-valued-predictions loss: – Hinge loss: – Exponential loss (Boosting) – Log loss (Max Entropy, Boosting) 9
Loss Functions: Hinge Loss vs. Zero-One Loss (plot)
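A minimal sketch of the two losses plotted above, in Python (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def zero_one_loss(w, x, y):
    # y is +1/-1; count a mistake whenever the sign of the score disagrees with y
    return 0.0 if y * np.dot(w, x) > 0 else 1.0

def hinge_loss(w, x, y):
    # penalize any example whose signed margin y * (w . x) falls below 1
    return max(0.0, 1.0 - y * np.dot(w, x))
```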
Online Framework • Initialize classifier • Algorithm works in rounds • On each round the online algorithm: – Receives an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Updates the prediction rule • Goal: – Suffer small cumulative loss
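A sketch of the round-by-round protocol described above (illustrative Python; the stream, update rule, and dimension arguments are assumptions for illustration):

```python
import numpy as np

def online_learn(stream, update, d):
    # stream yields (x, y) pairs one at a time, y in {+1, -1}; update is any rule w <- update(w, x, y)
    w = np.zeros(d)                             # initialize the classifier
    mistakes = 0
    for x, y in stream:                         # rounds t = 1, 2, ...
        y_hat = 1 if np.dot(w, x) > 0 else -1   # output a prediction
        if y_hat != y:                          # receive the feedback label, compute loss
            mistakes += 1
        w = update(w, x, y)                     # update the prediction rule
    return w, mistakes                          # goal: small cumulative loss
```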
Linear Classifiers • Any features • W.l.o.g. • Binary classifiers of the form (notation abuse)
Linear Classifiers (cont'd) • Prediction: • Confidence in prediction:
Margin • Margin of an example with respect to the classifier: • Note: • The set is separable iff there exists such that
Geometrical Interpretation 15
Geometrical Interpretation 16
Geometrical Interpretation 17
Geometrical Interpretation Margin <<0 Margin >>0 Margin <0
Hinge Loss 19
Separable Set 20
Inseparable Sets 21
Degree of Freedom - I The same geometrical hyperplane can be represented by many parameter vectors 22
Degree of Freedom - II Problem difficulty does not change if we shrink or expand the input space
Why Online Learning? • Fast • Memory efficient - process one example at a time • Simple to implement • Formal guarantees – Mistake bounds • Online to Batch conversions • No statistical assumptions • Adaptive • Not as good as well-designed batch algorithms
Update Rules • Online algorithms are based on an update rule which defines from (and possibly other information) • Linear classifiers: find from based on the input • Some update rules: – Perceptron (Rosenblatt) – ALMA (Gentile) – ROMMA (Li & Long) – NORMA (Kivinen et al.) – MIRA (Crammer & Singer) – EG (Littlestone and Warmuth) – Bregman Based (Warmuth)
Three Update Rules • The Perceptron Algorithm : – Agmon 1954; Rosenblatt 1952 -1962, Block 1962, Novikoff 1962, Minsky & Papert 1969, Freund & Schapire 1999, Blum & Dunagan 2002 • Hildreth’s Algorithm : – Hildreth 1957 – Censor & Zenios 1997 – Herbster 2002 • Loss Scaled : – Crammer & Singer 2001, – Crammer & Singer 2002 26
The Perceptron Algorithm • If No-Mistake – Do nothing • If Mistake – Update • Margin after update:
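A sketch of the mistake-driven update just described (illustrative Python, assuming numpy vectors and labels in {+1, -1}):

```python
import numpy as np

def perceptron_update(w, x, y):
    # mistake-driven: if the current prediction is wrong (or the margin is zero), add y * x
    if y * np.dot(w, x) <= 0:
        return w + y * x          # the margin on (x, y) increases by ||x||^2 after this step
    return w                      # otherwise do nothing
```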
Geometrical Interpretation 28
Relative Loss Bound • For any competitor prediction function • We bound the loss suffered by the algorithm with the loss suffered by the competitor (callouts: cumulative loss suffered by the algorithm; sequence of prediction functions; cumulative loss of competitor)
Relative Loss Bound • For any competitor prediction function • We bound the loss suffered by the algorithm with the loss suffered by the competitor (callouts: inequality; possibly large gap; regret / extra loss; competitiveness ratio)
Relative Loss Bound • For any competitor prediction function • We bound the loss suffered by the algorithm with the loss suffered by the competitor (callouts: grows with T; constant)
Relative Loss Bound • For any competitor prediction function • We bound the loss suffered by the algorithm with the loss suffered by the best prediction function in hindsight for the data sequence
Remarks • If the input is inseparable, then the problem of finding a separating hyperplane which attains fewer than M errors is NP-hard (Open Hemisphere) • Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-factor approximation for the Open Hemisphere problem. • Bound on the number of mistakes the Perceptron makes in terms of the hinge loss of any competitor
Definitions • Any competitor • The parameter vector can be chosen using the input data • The parameterized hinge loss of the competitor on the data • True hinge loss • 1-norm and 2-norm of hinge loss
Geometrical Assumption • All examples are bounded in a ball of radius R 35
Perceptron’s Mistake Bound • Bounds: • If the sample is separable then 36
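In the standard notation, with every example bounded by radius R and a unit-norm separator achieving margin γ on all rounds, the classical separable-case bound can be written as follows (a reconstruction in standard notation, not necessarily the slide's exact formula):

```latex
\text{If } \|x_t\| \le R \text{ and } y_t (u \cdot x_t) \ge \gamma > 0 \text{ for all } t, \text{ with } \|u\| = 1,
\quad\text{then the number of mistakes satisfies}\quad
M \;\le\; \frac{R^2}{\gamma^2}.
```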
[FS 99, SS 05] Proof - Intuition • Two views: – The angle between and decreases with – The following sum is fixed; as we make more mistakes, our solution is better
[C 04] Proof • Define the potential: • Bound its cumulative sum from above and below
Proof • Bound from above: (telescopic sum; zero vector; non-negative)
Proof • Bound from below: – No error on t-th round – Error on t-th round
Proof • We bound each term:
Proof • Bound from below: – No error on t-th round – Error on t-th round • Cumulative bound:
Proof • Putting both bounds together : • We use first degree of freedom (and scale) : • Bound : 43
Proof • General Bound: • Choose: • Simple Bound: (objective of SVM)
Proof • Better bound : optimize the value of 45
Remarks • Bound does not depend on dimension of the feature vector • The bound holds for all sequences. It is not tight for most real world data • But, there exists a setting for which it is tight 46
Three Bounds 47
Separable Case • Assume there exists such that for all examples. Then all bounds are equivalent • Perceptron makes a finite number of mistakes until convergence (not necessarily to )
Separable Case – Other Quantities • Use 1st (parameterization) degree of freedom • Scale the such that • Define • The bound becomes
Separable Case - Illustration 50
Separable Case – Illustration Finding a separating hyperplane is more difficult The Perceptron will make more mistakes
Inseparable Case • Difficult problem implies a large value of • In this case the Perceptron will make a large number of mistakes 52
Perceptron Algorithm • Pros: – Extremely easy to implement – Relative loss bounds for separable and inseparable cases. Minimal assumptions (not iid) – Easy to convert to a well-performing batch algorithm (under iid assumptions) • Cons: – Quantities in bound are not compatible: no. of mistakes vs. hinge-loss – Margin of examples is ignored by update – Same update for separable case and inseparable case
Passive-Aggressive Approach • The basis for a well-known algorithm in convex optimization due to Hildreth (1957) – Asymptotic analysis – Does not work in the inseparable case • Three versions: – PA: separable case – PA-I, PA-II: inseparable case • Beyond classification – Regression, one class, structured learning • Relative loss bounds
Motivation • Perceptron: No guarantees of margin after the update • PA: Enforce a minimal non-zero margin after the update • In particular: – If the margin is large enough (1), then do nothing – If the margin is less than one unit, update such that the margin after the update is enforced to be one unit
Input Space 56
Input Space vs. Version Space • Input Space: – Points are input data – One constraint is induced by a weight vector – Primal space – Half space = all input examples that are classified correctly by a given predictor (weight vector) • Version Space: – Points are weight vectors – One constraint is induced by input data – Dual space – Half space = all predictors (weight vectors) that classify correctly a given input example
Weight Vector (Version) Space The algorithm forces to reside in this region 58
Passive Step Nothing to do. already resides on the desired side.
Aggressive Step The algorithm projects onto the desired half-space
Aggressive Update Step • Set to be the solution of the following optimization problem : • The Lagrangian: • Solve for the dual: 61
Aggressive Update Step • Optimize for : • Set the derivative to zero • Substitute back into the Lagrangian: • Dual optimization problem 62
Aggressive Update Step • Dual Problem: • Solve it: • What about the constraint? 63
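In the notation of the PA papers, the optimization and its closed-form solution can be written as follows (a standard reconstruction, not necessarily the slide's exact formulas):

```latex
w_{t+1} = \arg\min_{w} \tfrac{1}{2}\|w - w_t\|^2
\quad \text{s.t.}\quad \ell\big(w; (x_t, y_t)\big) = 0,
\qquad\text{with solution}\qquad
w_{t+1} = w_t + \tau_t \, y_t x_t,
\quad
\tau_t = \frac{\ell\big(w_t; (x_t, y_t)\big)}{\|x_t\|^2}.
```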
Alternative Derivation • Additional constraint (linear update) • Force the constraint to hold as equality • Solve:
Passive-Aggressive Update 65
Perceptron vs. PA • Common Update: • Perceptron • Passive-Aggressive 66
Perceptron vs. PA (figure): Error / No-Error, Small Margin / No-Error, Large Margin
Perceptron vs. PA 68
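A side-by-side sketch of the two updates (illustrative Python, not from the slides): the Perceptron adds y*x only on mistakes, while PA scales the step so the margin reaches one unit.

```python
import numpy as np

def pa_update(w, x, y):
    # Passive: margin already >= 1, leave w unchanged.
    # Aggressive: project w onto the half-space of vectors with margin >= 1 on (x, y).
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w
    tau = loss / np.dot(x, x)   # step size; the Perceptron would instead use tau = 1, and only on mistakes
    return w + tau * y * x
```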
Three Decision Problems: Classification, Regression, Uniclass
The Passive-Aggressive Algorithm • Each example defines a set of consistent hypotheses: • The new vector is set to be the projection of onto (Classification, Regression, Uniclass)
Loss Bound • Assume there exists such that • Assume: • Then: • Note:
Proof Sketch • Define: • Upper bound: • Lower bound: 72 Lipschitz Condition
Proof Sketch (cont'd) • Combining upper and lower bounds
Unrealizable Case There is no weight vector that satisfies all the constraints
Unrealizable Case 75
Unrealizable Case 76
Loss Bound for PA-I • Mistake bound: • Optimal value, set 77
Loss Bound for PA-II • Loss bound: • Similar proof technique to that of PA • Bound can be improved similarly to the Perceptron
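For reference, the step sizes of PA, PA-I, and PA-II as given in the PA paper (standard forms; the slide's own formulas may differ in notation):

```latex
\ell_t = \max\{0,\; 1 - y_t (w_t \cdot x_t)\},
\qquad
\tau_t^{\mathrm{PA}} = \frac{\ell_t}{\|x_t\|^2},
\qquad
\tau_t^{\mathrm{PA\text{-}I}} = \min\!\Big\{C,\; \frac{\ell_t}{\|x_t\|^2}\Big\},
\qquad
\tau_t^{\mathrm{PA\text{-}II}} = \frac{\ell_t}{\|x_t\|^2 + \tfrac{1}{2C}}.
```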
Four Algorithms: Perceptron, PA, PA-I, PA-II
Four Algorithms: Perceptron, PA, PA-I, PA-II
Next • Real-world problems – Examples – Commonalities – Extension of algorithms for the complex setting – Applications 81
Binary Classification If it's not one class, it must be the other class
Multi Class Single Label Elimination of a single class is not enough
Ordinal Ranking / Regression Structure Over Possible Labels Viewer’s Rating Machine Prediction Order relation over labels 84
Hierarchical Classification [DKS’ 04] Phonetic transcription of DECEMBER Gross error d ix CH eh m bcl b er Small errors d AE s eh m bcl b er d ix s eh NASAL bcl b er 85
[DKS’ 04] Phonetic Hierarchy PHONEMES Sononorants Structure Over Possible Labels Silences Nasals Obstruents Liquids l y w r Vowels Front oyow uhuw 86 n m ng Affricates Plosives Center aa ao er aw ay Back iy ih ey eh ae b g d k p t jh ch Fricatives f v sh s th dh zh z
Multi-Class Multi-Label • Document: The higher minimum wage signed into law… will be welcome relief for millions of workers…. The 90 cent-an-hour increase…. • Relevant topics: REGULATION / POLICY, CORPORATE / INDUSTRIAL, LABOUR, GOVERNMENT / SOCIAL, MARKETS, ECONOMICS • Full topic ranking: ECONOMICS, CORPORATE / INDUSTRIAL, REGULATION / POLICY, MARKETS, LABOUR, GOVERNMENT / SOCIAL, LEGAL / JUDICIAL, REGULATION / POLICY, SHARE LISTINGS, PERFORMANCE, ACCOUNTS / EARNINGS, COMMENT / FORECASTS, SHARE CAPITAL, BONDS / DEBT ISSUES, LOANS / CREDITS, STRATEGY / PLANS, INSOLVENCY / LIQUIDITY
Multi-Class Multi-Label • Document: The higher minimum wage signed into law… will be welcome relief for millions of workers…. The 90 cent-an-hour increase…. • Relevant topics: REGULATION / POLICY, MARKETS, CORPORATE / INDUSTRIAL, LABOUR, GOVERNMENT / SOCIAL, ECONOMICS • Full topic ranking: ECONOMICS, INSOLVENCY / LIQUIDITY, CORPORATE / INDUSTRIAL, REGULATION / POLICY, COMMENT / FORECASTS, LABOUR, LOANS / CREDITS, LEGAL / JUDICIAL, REGULATION / POLICY, SHARE LISTINGS, GOVERNMENT / SOCIAL, PERFORMANCE, ACCOUNTS / EARNINGS, MARKETS, SHARE CAPITAL, BONDS / DEBT ISSUES, STRATEGY / PLANS • Non-trivial Evaluation Measures: Recall, Precision, Any Error?, No. Errors?
Noun Phrase Chunking Estimated volume was a light 2.4 million ounces. Simultaneous Labeling
Named Entity Extraction Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. (Interactive Decisions)
[McDonald 06] Sentence Compression • The Reverse Engineer Tool is available now and is priced on a site-licensing basis, ranging from $8,000 for a single user to $90,000 for a multiuser project site. Complex Input – Output Relation • Essentially, design recovery tools read existing code and translate it into the language in which CASE is conversant -- definitions and structured diagrams.
Dependency Parsing John hit the ball with the bat Non-trivial Output 92
[Shalev-Shwartz, Keshet, Singer 2004] Aligning Polyphonic Music Two ways for representing music Symbolic representation: Acoustic representation: 93
[Shalev-Shwartz, Keshet, Singer 2004] Symbolic Representation: – pitch – start-time
[Shalev-Shwartz, Keshet, Singer 2004] Acoustic Representation: acoustic signal → Feature Extraction (e.g. Spectral Analysis) → acoustic representation
[Shalev-Shwartz, Keshet, Singer 2004] The Alignment Problem Setting: pitch, time, actual start-time
Challenges • Elimination is not enough • Structure over possible labels • Non-trivial loss functions • Complex input – output relation • Non-trivial output 97
Challenges • Interactive Decisions (Fake News Show) • A wide range of sequence features • Computing an answer is relatively costly
Analysis as Labeling Model • Label gives role for corresponding input • "Many to one relation” • Still, not general enough 99
Examples Estimated volume was a light 2.4 million ounces. B I O B I I O Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. B-PER I-PER O B-ORG O B-PER I-PER O O O
Outline of Solution • A quantitative evaluation of all predictions – Loss Function (application dependent) • Models class – set of all labeling functions – Generalize linear classification – Representation • Learning algorithm – Extension of Perceptron – Extension of Passive-Aggressive 101
Loss Functions • Hamming Distance (number of wrong decisions): Estimated volume was a light 2.4 million ounces. B I O B I I O vs. B O O O B I I O O • Levenshtein Distance (edit distance) – Speech • Number of words with incorrect parent – Dependency parsing
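A sketch of the Hamming loss between two tag sequences (illustrative Python; the example tag strings are hypothetical):

```python
def hamming_loss(true_tags, predicted_tags):
    # number of positions at which the predicted tag sequence disagrees with the truth
    return sum(t != p for t, p in zip(true_tags, predicted_tags))

# e.g. two labelings of the same sentence, tag by tag
hamming_loss(list("BIOBIIOO"), list("BOOOBIIO"))   # counts the wrong decisions
```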
Outline of Solution • A quantitative evaluation of all predictions – Loss Function (application dependent) • Models class – set of all labeling functions – Generalize linear classification – Representation • Learning algorithm – Extension of Perceptron – Extension of Passive-Aggressive 103
Multiclass Representation I • k Prototypes 104
Multiclass Representation I • k Prototypes • New instance 105
Multiclass Representation I • k Prototypes • New instance • Compute scores: class 1: -1.08, class 2: 1.66, class 3: 0.37, class 4: -2.09
Multiclass Representation I • k Prototypes • New instance • Compute scores: class 1: -1.08, class 2: 1.66, class 3: 0.37, class 4: -2.09 • Prediction: the class achieving the highest score
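A sketch of prototype-based prediction (Representation I) in Python, assuming the k prototypes are stacked as the rows of a matrix:

```python
import numpy as np

def predict_multiclass(W, x):
    # W has one prototype (weight vector) per class, stacked as rows
    scores = W @ x                  # per-class confidences, e.g. [-1.08, 1.66, 0.37, -2.09]
    return int(np.argmax(scores))   # predict the class achieving the highest score
```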
Multiclass Representation II • Map all input and labels into a joint vector space: F(Estimated volume was a light 2.4 million ounces., B I O B I I O) = (0 1 1 0 …) • Score labels by projecting the corresponding feature vector
Multiclass Representation II • Predict label with highest score (Inference) • Naïve search is expensive if the set of possible labels is large: Estimated volume was a light 2.4 million ounces. B I O B I I I I O – No. of labelings = 3^(no. of words)
Multiclass Representation II • Features based on local domains: F(Estimated volume was a light 2.4 million ounces., B I O B I I O) = (0 1 1 0 …) • Efficient Viterbi decoding for sequences
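A sketch of Viterbi decoding over precomputed local scores (illustrative Python; the emission and transition score matrices stand in for the projections of w onto the local-domain features):

```python
import numpy as np

def viterbi(emission, transition):
    # emission[t, k]: score of tag k at position t;  transition[j, k]: score of moving from tag j to tag k
    # returns the highest-scoring tag sequence; runs in O(length * K^2)
    T, K = emission.shape
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        for k in range(K):
            cand = score[t - 1] + transition[:, k] + emission[t, k]
            back[t, k] = int(np.argmax(cand))
            score[t, k] = cand[back[t, k]]
    tags = [int(np.argmax(score[T - 1]))]          # best tag at the last position
    for t in range(T - 1, 0, -1):                  # follow back-pointers to the start
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]
```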
Multiclass Representation II After Shalev Correct Labeling Almost Correct Labeling 111
Multiclass Representation II After Shalev Correct Labeling Incorrect Labeling Worst Labeling 112 Almost Correct Labeling
Multiclass Representation II After Shalev Correct Labeling Almost Correct Labeling Worst Labeling 113
Multiclass Representation II After Shalev Correct Labeling Almost Correct Labeling Worst Labeling 114
Two Representations • Weight-vector per class (Representation I) – Intuitive – Improved algorithms • Single weight-vector (Representation II) – Generalizes representation I F(x, 4)= 0 0 0 x 0 – Allows complex interactions between input and output 115
Why linear models? • Combine the best of generative and classification models: – Trade off labeling decisions at different positions – Allow overlapping features • Modular – factored scoring – loss function – From features to kernels 116
Outline of Solution • A quantitative evaluation of all predictions – Loss Function (application dependent) • Models class – set of all labeling functions – Generalize linear classification – Representation • Learning algorithm – Extension of Perceptron – Extension of Passive-Aggressive 117
Multiclass Multilabel Perceptron Single Round • Get a new instance • Predict ranking • Get feedback • Compute loss • If the loss is positive, update weight-vectors
Multiclass Multilabel Perceptron Update (1) • Construct the error-set • Form any set of parameters that satisfies: § If then §
Multiclass Multilabel Perceptron Update (1) 120
Multiclass Multilabel Perceptron Update (1) 1 2 3 4 5
Multiclass Multilabel Perceptron Update (1) 1 2 3 4 5
Multiclass Multilabel Perceptron Update (1) 1 2 3 4 5 0 0
Multiclass Multilabel Perceptron Update (1) 1 5 2 0 0 0 3 4 a 0 1-a
Multiclass Multilabel Perceptron Update (2) • Set for • Update: 125
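A single-label simplification of the uniform update sketched on these slides (the error set and the equal shares follow the slides; the concrete Python is illustrative):

```python
import numpy as np

def uniform_multiclass_update(W, x, y):
    # W: one prototype per class (rows); y: index of the correct class
    scores = W @ x
    # error set: wrong classes scored at least as high as the correct class
    error_set = [r for r in range(W.shape[0]) if r != y and scores[r] >= scores[y]]
    if error_set:
        W = W.copy()
        W[y] += x                          # promote the correct class
        for r in error_set:                # demote each erring class by an equal share
            W[r] -= x / len(error_set)
    return W
```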
Multiclass Multilabel Perceptron Update (2) 1 5 2 0 0 0 3 4 a 0 1-a
Multiclass Multilabel Perceptron Update (2) 1 5 2 0 0 0 3 4 a 0 1-a
Uniform Update 1 5 2 0 0 0 3 4 1/2 0 1/2
Max Update • Sparse • Performance is worse than Uniform's 1 5 2 0 0 0 3 4 0 0 1
Update Results: Before / Uniform Update / Max Update
Margin for Multi Class • Binary: • Multi Class: Prediction Margin Error 131
Margin for Multi Class • Binary: • Multi Class: 132
Margin for Multi Class • Multi Class: But not all mistakes are equal? How do you know? Because the loss function is not constant!
Margin for Multi Class • Multi Class: So, use it ! 134
Linear Structure Models After Shalev Correct Labeling Almost Correct Labeling 135
Linear Structure Models After Shalev Correct Labeling Incorrect Labeling Worst Labeling 136 Almost Correct Labeling
Linear Structure Models After Shalev Correct Labeling Incorrect Labeling Worst Labeling 137 Almost Correct Labeling
PA Multi Class Update 138
PA Multi Class Update • Project the current weight vector such that the instance ranking is consistent with the loss function • Set to be the solution of the following optimization problem:
PA Multi Class Update • Problem – intersection of constraints may be empty • Solutions – Does not occur in practice – Add a slack variable – Remove constraints 140
Add a Slack Variable • Add a slack variable: Generalized Hinge Loss • Rewrite the optimization: 141
Add a Slack Variable Estimated volume was a light 2.4 million ounces. B I B O I I I O • We would like to solve: • May have an exponential number of constraints, thus intractable – If the loss can be factored then there is a polynomial equivalent set of constraints (Taskar et al. 2003) – Remove constraints (the other solution for the empty-set problem)
PA Multi Class Update • Remove constraints: • How to choose the single competing labeling? – The labeling that attains the highest score! – … which is the predicted label according to the current model
PA Multiclass Online Algorithm • Initialize • For each round: – Receive an input instance – Output a prediction – Receive a feedback label – Compute loss – Update the prediction rule
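A sketch of one PA multiclass round using the single highest-scoring competing labeling, in the MIRA-style variant where the required margin equals the loss (the features, decode, and loss arguments are placeholders for application-specific feature maps, inference, and loss functions):

```python
import numpy as np

def pa_structured_round(w, x, y_true, features, decode, loss):
    # features(x, y): joint feature vector F(x, y); decode(w, x): highest-scoring labeling (e.g. Viterbi)
    y_hat = decode(w, x)                                   # inference: the single competing labeling
    delta = features(x, y_true) - features(x, y_hat)
    margin = np.dot(w, delta)
    cost = loss(y_true, y_hat)                             # e.g. Hamming distance
    hinge = max(0.0, cost - margin)                        # generalized hinge loss
    if hinge > 0.0 and np.dot(delta, delta) > 0.0:
        tau = hinge / np.dot(delta, delta)                 # smallest step satisfying the constraint
        w = w + tau * delta
    return w
```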
Advantages • Process one training instance at a time • Very simple • Predictable runtime, small memory • Adaptable to different loss functions • Requires: – Inference procedure – Loss function – Features
Batch Setting • Often we are given two sets: – Training set used to find a good model – Test set for evaluation • Enumerate over the training set – Possibly more than once – Fix the weight vector • Evaluate on the test set • Formal guarantees if training set and test set are i.i.d. from a fixed (unknown) distribution
Two Improvements • Averaging – Instead of using the final weight-vector, use a combination of all weight-vectors obtained during training • Top-k update – Instead of using only the labeling with the highest score, use the k labelings with the highest score
Averaging (MIRA) • Initialize • For each round: – Receive an input instance – Output a prediction – Receive a feedback label – Compute loss – Update the prediction rule – Update the average rule
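A sketch of the averaging improvement: keep a running sum of the weight vectors after every round and return their mean instead of the final vector (illustrative Python):

```python
import numpy as np

def train_averaged(stream, update, d):
    # any online update (Perceptron, PA, MIRA, ...) can be plugged in; only the returned vector changes
    w = np.zeros(d)
    w_sum = np.zeros(d)
    rounds = 0
    for x, y in stream:
        w = update(w, x, y)
        w_sum += w                  # accumulate the weight vector seen after this round
        rounds += 1
    return w_sum / max(rounds, 1)   # the averaged rule used at test time
```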
Top-k Update • Recall : – Inference : – Update 149
Top-K Update • Recall : – Inference – Top-k Inference 150
Top-K Update • Top-k Inference: • Update: 151
Previous Approaches • Focus on sequences • Similar constructions for other structures • Mainly batch algorithms 152
Previous Approaches • Generative models: probabilistic generators of sequence-structure pairs – Hidden Markov Models (HMMs) – Probabilistic CFGs • Sequential classification: decompose structure assignment into a sequence of structural decisions • Conditional models: probabilistic model of labels given input – Conditional Random Fields [LMP 2001]
Previous Approaches • Re-ranking: Combine generative and discriminative models – Full parsing [Collins 2000] • Max-Margin Markov Networks [TGK 2003] – Use all competing labels – Elegant factorization – Closely related to PA with all the constraints
HMMs (graphical model: hidden states y1, y2, y3 with emissions x1, x2, x3 over values a, b, c)
HMM 156
HMMs • Solves a more general problem than required. Models the joint probability. • Hard to model overlapping features, yet applications need richer input representation. – E.g. word identity, capitalization, ends in "-tion", word in word list, word font, white-space ratio, begins with number, ends with "?" • Relaxing conditional independence of features on labels leads to intractability
Conditional Random Fields (John Lafferty, Andrew McCallum, Fernando Pereira 2001) • Define distributions over labels • Maximize the log-likelihood of data • Dynamic programming for expectations (forward-backward algorithm) • Standard convex optimization (L-BFGS)
Local Classifiers (Dan Roth), MEMM • Train local classifiers. – E.g. Estimated volume was a light 2.4 million ounces. B ? O B I I O • Combine results at test time • Cheap to train • Cannot model long-distance interactions well – MEMMs, local classifiers – Combine classifiers at test time – Problem: Cannot trade off decisions at different locations (label-bias problem)
Re-Ranking (Michael Collins 2000) • Use a generative model to reduce the exponential number of labelings to a polynomial number of candidates – Local features • Use the Perceptron algorithm to re-rank the list – Global features • Great results!
Empirical Evaluation • Category Ranking / Multiclass multilabel • Noun Phrase Chunking • Named Entities • Dependency Parsing • Gene Finding • Phoneme Alignment • Non-projective Parsing
Experimental Setup • Algorithms: – Rocchio, normalized prototypes – Perceptron, one per topic – Multiclass Perceptron – IsErr, ErrorSetSize • Features – About 100 terms/words per category • Online to Batch: – Cycle once through training set – Apply resulting weight-vector to the test set
Data Sets – Reuters-21578: Training Examples 8,631; Test Examples 2,158; Topics 90; Topics/Example 1.24; No. Features 3,468
Data Sets – Reuters-21578 vs. Reuters 2000: Training Examples 8,631 / 521,439; Test Examples 2,158 / 287,944; Topics 90 / 102; Topics/Example 1.24 / 3.20; No. Features 3,468 / 9,325
Training Online Results (plots): average cumulative IsErr vs. round number on Reuters-21578 and Reuters 2000, for Perceptron, MP IsErr, and MP ErrSetSize
Training Online Results (plots): average cumulative Avg. Precision vs. round number on Reuters-21578 and Reuters 2000, for Perceptron, MP IsErr, and MP ErrSetSize
Test Results (plots): IsErr and ErrorSetSize on R21578 and R2000, for Rocchio, Perceptron, MP IsErr, and MP ErrSetSize
Sequence Text Analysis • Features: – Meaningful word features – POS features – Unigram and bi-gram NER/NP features: F(Estimated volume was a light 2.4 million ounces., B I O B I I O) = (0 1 1 0 …) • Inference: – Dynamic programming – Linear in length, quadratic in number of classes
Noun Phrase Chunking (McDonald, Crammer, Pereira) Estimated volume was a light 2.4 million ounces. Avg. Perceptron / CRF / MIRA: 0.942 / 0.941 / 0.943
Noun Phrase Chunking (McDonald, Crammer, Pereira): plot of performance on test data vs. training time in CPU minutes
Named Entity Extraction (McDonald, Crammer, Pereira) Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes. Avg. Perceptron / CRF / MIRA: 0.830 / 0.823 / 0.831
Named Entity Extraction (McDonald, Crammer, Pereira): plot of performance on test data vs. training time in CPU minutes
Dependency parsing • Features : – Anything over words – single edges • Inference : – Search over possible trees in cubic time (Eisner, Satta) 173
Dependency Parsing (McDonald, Crammer, Pereira) English: Y&M 2003 90.3, N&S 2004 87.3, Avg. Perc. 90.6, MIRA 90.9; Czech: Avg. Perc. 82.9, MIRA 83.2
Gene Finding (Bernal, Crammer, Hatzigeorgiu, Pereira) • Semi-Markov models (features include segment-length information) • Decoding quadratic in length of sequence • sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
Dependency Parsing (McDonald & Pereira) • Complications: – Higher-order features and multiple parents per word – Non-projective trees • Approximate inference • Czech: 1-proj 83.0, 2-proj 84.2, 1-non-proj 84.1, 2-non-proj 85.2
Phoneme Alignment (Keshet, Shalev-Shwartz, Singer, Chazan) • Input: acoustic signal + true phonemes • Output: segmentation of the signal • % correct (t<10 / t<20 / t<30 / t<40): Discriminative 79.2 / 92.1 / 96.2 / 98.1; Brugnara et al. 75.3 / 88.9 / 94.4 / 97.1
Summary • Online training for complex decisions • Simple to implement, fast to train • Modular : – Loss function – Feature Engineering – Inference procedure • Works well in practice • Theoretically analyzed 178
Uncovered Topics • Kernels • Multiplicative updates • Bregman divergences and projections • Theory of online-to-batch • Matrix representations and updates 179
Partial Bibliography • Prediction, learning, and games. Nicolò Cesa-Bianchi and Gábor Lugosi • Y. Censor & S. A. Zenios, “Parallel Optimization”, Oxford UP, 1997 • Y. Freund & R. Schapire, “Large margin classification using the Perceptron algorithm”, MLJ, 1999. • M. Herbster, “Learning additive models online with fast evaluating kernels”, COLT 2001 • J. Kivinen, A. Smola, and R. C. Williamson, “Online learning with kernels”, IEEE Trans. on SP, 2004 • H. H. Bauschke & J. M. Borwein, “On Projection Algorithms for Solving Convex Feasibility Problems”, SIAM Review, 1996 180
Applications • Online Passive-Aggressive Algorithms, CDSS'03 + CDKSS'05 • Online Ranking by Projecting, CS'05 • Large Margin Hierarchical Classification, DKS'04 • Online Learning of Approximate Dependency Parsing Algorithms. R. McDonald and F. Pereira, European Association for Computational Linguistics, 2006 • Discriminative Sentence Compression with Soft Syntactic Constraints. R. McDonald, European Association for Computational Linguistics, 2006 • Non-Projective Dependency Parsing using Spanning Tree Algorithms. R. McDonald, F. Pereira, K. Ribarov and J. Hajic, HLT-EMNLP, 2005 • Flexible Text Segmentation with Structured Multilabel Classification. R. McDonald, K. Crammer and F. Pereira, HLT-EMNLP, 2005
Applications • Online and Batch Learning of Pseudo-metrics, SSN'04 • Learning to Align Polyphonic Music, SKS'04 • The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees, DSS'04 • First-Order Probabilistic Models for Coreference Resolution. Aron Culotta, Michael Wick, Robert Hall and Andrew McCallum, NAACL/HLT, 2007 • Structured Models for Fine-to-Coarse Sentiment Analysis. R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar, Association for Computational Linguistics, 2007 • Multilingual Dependency Parsing with a Two-Stage Discriminative Parser. R. McDonald, K. Lerman and F. Pereira, Conference on Natural Language Learning, 2006 • Discriminative Kernel-Based Phoneme Sequence Recognition. Joseph Keshet, Shai Shalev-Shwartz, Samy Bengio, Yoram Singer and Dan Chazan, International Conference on Spoken Language Processing, 2006