
Status of TMVA, the Toolkit for MultiVariate Analysis
Eckhard von Toerne (University of Bonn)
For the TMVA core developer team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. v. T., H. Voss
ACAT 2011, Eckhard von Toerne, 06 Sept 2011, slide 1/21

Outline
• Overview
• New developments
• Recent physics results that use TMVA
– Web site: http://tmva.sourceforge.net/
– See also: "TMVA - Toolkit for Multivariate Data Analysis", A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. v. Toerne, H. Voss et al., arXiv:physics/0703039v5 [physics.data-an]

What is TMVA
• Supervised learning
• Classification and regression tasks
• Easy to train, evaluate and compare various MVA methods
• Various preprocessing methods (decorrelation, PCA, Gaussianization, ...)
• Integrated in ROOT
[Figure: example MVA output distribution]

TMVA workflow
• Training:
– Classification: learn the features of the different event classes from a sample with known signal/background composition
– Regression: learn the functional dependence between input variables and targets
• Testing:
– Evaluate the performance of the trained classifier/regressor on an independent test sample
– Compare different methods
• Application:
– Apply the classifier/regressor to real data

Classification/Regression
Classification of signal/background: how to find the best decision boundary?
[Figure: decision boundaries H0, H1, H2 in the (x1, x2) plane]
Regression: how to determine the correct model?

How to choose a method?
• You have a training sample with only a few events? The number of "parameters" must be limited: use a linear classifier or FDA, a small BDT, a small MLP
• Variables are uncorrelated (or have only linear correlations): likelihood
• I just want something simple: use Cuts, LD, Fisher
• Methods for complex problems: use BDT, MLP, SVM

List of acronyms:
BDT = boosted decision tree, see manual page 103
ANN = artificial neural network
MLP = multi-layer perceptron, a specific form of ANN, also the name of our flagship ANN, manual p. 92
FDA = functional discriminant analysis, see manual p. 87
LD = linear discriminant, manual p. 85
SVM = support vector machine, manual p. 98; currently available only for classification
Cuts = as in "cut selection", manual p. 56
Fisher = Ronald A. Fisher, classifier similar to LD, manual p. 83

Artificial Neural Networks
Feed-forward multilayer perceptron: models arbitrary nonlinear functions as a nonlinear combination of simple "neuron activation functions".
• Advantages: very flexible, no assumption about the function necessary
• Disadvantages: "black box", needs tuning, seed dependent
[Figure: network with Nvar discriminating input variables in one input layer, k hidden layers, and one output layer with 2 output classes (signal and background); the "activation" function is applied at each neuron]
[Table: method ratings for performance (no/linear correlations, nonlinear correlations, weak input variables, curse of dimensionality), speed (training, response), robustness (overtraining), transparency, and regression (1-D, multi-D)]
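The forward pass of such a feed-forward network can be sketched in a few lines of plain C++. This is a minimal illustration of the "nonlinear combination of simple activation functions" idea, not TMVA's MLP implementation; the weight layout (bias stored as the last entry of each weight vector) is a convention chosen here for brevity.

```cpp
#include <vector>
#include <cmath>

// Forward pass of a minimal feed-forward perceptron with one hidden
// layer and tanh activation functions (a sketch, not TMVA's MLP).
// w_hidden[j] holds the weights of hidden neuron j for each input,
// with the bias as the last entry; w_out does the same for the output.
double mlp_response(const std::vector<double>& x,
                    const std::vector<std::vector<double>>& w_hidden,
                    const std::vector<double>& w_out) {
    std::vector<double> hidden;
    for (const auto& wj : w_hidden) {
        double s = wj.back();                       // bias term
        for (size_t i = 0; i < x.size(); ++i) s += wj[i] * x[i];
        hidden.push_back(std::tanh(s));             // neuron activation
    }
    double y = w_out.back();                        // output-layer bias
    for (size_t j = 0; j < hidden.size(); ++j) y += w_out[j] * hidden[j];
    return 1.0 / (1.0 + std::exp(-y));              // squash to (0,1), classifier-style
}
```

With all weights zero the response is exactly 0.5, i.e. maximally undecided; training would adjust the weights to push signal toward 1 and background toward 0.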

Boosted Decision Trees
• Grow a forest of decision trees and determine the event class/target by majority vote
• Weights of misclassified events are increased in the next iteration
• Advantages: ignores weak variables, works out of the box
• Disadvantages: vulnerable to overtraining
[Table: method ratings for performance (no/linear correlations, nonlinear correlations, weak input variables, curse of dimensionality), speed (training, response), robustness (overtraining), transparency, and regression (1-D, multi-D)]
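The reweighting idea on this slide can be illustrated with a toy AdaBoost on 1-D data, using single-cut "stumps" as the weak learners. This is a sketch of the general boosting technique, not TMVA's BDT code; labels are +1 (signal) and -1 (background), and the stump search simply tries a cut at every data point.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>

// A boosted "tree" reduced to its essence: a threshold cut with an
// orientation and a vote weight alpha.
struct Stump { double cut; int sign; double alpha; };

std::vector<Stump> ada_train(const std::vector<double>& x,
                             const std::vector<int>& y, int nTrees) {
    std::vector<double> w(x.size(), 1.0 / x.size());   // uniform start weights
    std::vector<Stump> forest;
    for (int t = 0; t < nTrees; ++t) {
        // pick the stump (cut at a data point, either orientation)
        // with the smallest weighted misclassification rate
        Stump best{0.0, 1, 0.0}; double bestErr = 1e9;
        for (double cut : x) for (int sign : {+1, -1}) {
            double err = 0.0;
            for (size_t i = 0; i < x.size(); ++i)
                if ((x[i] > cut ? sign : -sign) != y[i]) err += w[i];
            if (err < bestErr) { bestErr = err; best = {cut, sign, 0.0}; }
        }
        bestErr = std::max(bestErr, 1e-12);            // avoid log(0)
        best.alpha = 0.5 * std::log((1.0 - bestErr) / bestErr);
        forest.push_back(best);
        // boosting step: misclassified events gain weight, then renormalize
        double norm = 0.0;
        for (size_t i = 0; i < x.size(); ++i) {
            int pred = (x[i] > best.cut ? best.sign : -best.sign);
            w[i] *= std::exp(-best.alpha * pred * y[i]);
            norm += w[i];
        }
        for (auto& wi : w) wi /= norm;
    }
    return forest;
}

// Weighted majority vote of the forest.
int ada_classify(const std::vector<Stump>& forest, double xi) {
    double s = 0.0;
    for (const auto& st : forest)
        s += st.alpha * (xi > st.cut ? st.sign : -st.sign);
    return s >= 0.0 ? +1 : -1;
}
```

On well-separated toy data a handful of stumps already reproduces the labels; the overtraining risk noted above appears when the forest is deep enough to memorize statistical fluctuations instead.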

No Single Best Classifier...
[Table: classifiers (Cuts, Likelihood, PDERS / k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit, SVM) rated against the criteria: performance (no/linear correlations, nonlinear correlations), training speed, response speed, robustness (overtraining, weak input variables, curse of dimensionality), and transparency]
The properties of the function discriminant (FDA) depend on the chosen function.

Neyman-Pearson Lemma
Neyman-Pearson: the likelihood ratio used as "selection criterion" y(x) gives for each selection efficiency the best possible background rejection, i.e. it maximizes the area under the "Receiver Operating Characteristic" (ROC) curve.
[Figure: ROC curve, background rejection (1 - eff_background) vs. signal efficiency e_signal; the limit in the ROC curve is given by the likelihood ratio; better classification lies above the diagonal of random guessing; moving along the curve trades type-1 error against type-2 error]
Varying y(x) > "cut" moves the working point (efficiency and purity) along the ROC curve. How to choose the "cut"? One needs to know the prior probabilities (S, B abundances):
• Measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(e·p)
• Discovery of a signal: maximum of S/√B
• Precision measurement: high purity (p)
• Trigger selection: high efficiency (e)
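The statement can be checked numerically in a simple 1-D setup: a Gaussian signal pdf against a flat background on [-5, 5]. Cutting on the likelihood ratio y(x) = P_S(x)/P_B(x) and varying the cut traces out the ROC curve. The pdfs and grid integration below are illustrative choices, not anything from TMVA.

```cpp
#include <cmath>
#include <utility>

const double PI = 3.14159265358979323846;

// Signal: standard Gaussian; background: uniform on [-5, 5].
double ps(double x) { return std::exp(-0.5 * x * x) / std::sqrt(2.0 * PI); }
double pb(double /*x*/) { return 0.1; }

// One working point on the ROC curve for a given cut on the
// likelihood ratio: {signal efficiency, background rejection},
// obtained by integrating both pdfs on a grid.
std::pair<double, double> roc_point(double cut) {
    double effS = 0.0, effB = 0.0, dx = 0.001;
    for (double x = -5.0; x < 5.0; x += dx) {
        if (ps(x) / pb(x) > cut) { effS += ps(x) * dx; effB += pb(x) * dx; }
    }
    return {effS, 1.0 - effB};
}
```

Loosening the cut raises the signal efficiency at the price of background rejection, exactly the working-point motion along the ROC curve described above; for cut = 1 the accepted region is |x| < 1.66, giving roughly 90% signal efficiency and 67% background rejection.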

Performance with toy data
• 3-dimensional distribution
• Signal: sum of Gaussians
• Background: flat
• Theoretical limit calculated using the Neyman-Pearson lemma
• Neural net (MLP) with two hidden layers and backpropagation training; the Bayesian option has little influence on high-statistics training
• The TMVA ANN converges towards the theoretical limit for sufficiently large Ntrain (~100k)

Recent developments
• Current version: TMVA 4.1.2 in ROOT release 5.30
• Unit-test framework for daily software and method performance validation (C. Rosemann, E. v. T.)
• Multiclass classification for MLP, BDTG, FDA
• BDT automatic parameter optimization for building the tree architecture
• New method to treat data with distinct sub-populations (Method Category)
• Optional Bayesian treatment of ANN weights in MLP with backpropagation (Jiahang Zhong)
• Extended PDEFoam functionality (A. Voigt)
• Variable transformations on a user-defined subset of variables

Unit test (C. Rosemann, E. v. T.)
• Automated framework to verify functionality and performance (ours based on B. Eckel's description)
• A slimmed version runs every night on various OS

  ********************************
  * TMVA - U N I T test : Summary *
  ********************************
  Test 0   : Event [107/107] ............. OK
  Test 1   : VariableInfo [31/31] ........ OK
  Test 2   : DataSetInfo [20/20] ......... OK
  Test 3   : DataSet [15/15] ............. OK
  Test 4   : Factory [16/16] ............. OK
  Test 7   : LDN_selVar_Gauss [4/4] ...... OK
  ...
  Test 107 : BoostedPDEFoam [4/4] ........ OK
  Test 108 : BoostedDTPDEFoam [4/4] ...... OK
  Total number of failures: 0
  ********************************

And now: switching from statistics to physics ...acknowledging the hard work of our users

Review of recent results
• BDT trained on individual mH samples with 10 variables. Expect 1.47 signal events at mH = 160 GeV, compared to 1.27 events with the cut-based analysis (on 4 variables) and the same background. Phys. Lett. B 699: 25-47, 2011.

Review of recent results
Super-Kamiokande Coll., "Kinematic reconstruction of atmospheric neutrino events in a large water Cherenkov detector with proton identification", Phys. Rev. D 79: 112010, 2009.
• 7 input variables
• MLP with one hidden layer
[Figure: signal and background MLP output distributions]

Review of recent results
• CDF Coll., "First Observation of Electroweak Single Top Quark Production", Phys. Rev. Lett. 103: 092002, 2009.
• BDT analysis with ~20 input variables
• Lepton + missing ET + jets
• Results for the s+t channel

Review of recent results using TMVA
• CDF+D0 combined Higgs working group, arXiv:1107.5518 [hep-ex]. (SVM)
• CMS Coll., H->WW search, Phys. Lett. B 699: 25-47, 2011. (BDT)
• IceCube Coll., arXiv:1101.1692 [astro-ph]. (MLP)
• D0 Coll., top pairs, Phys. Rev. D 84: 012008, 2011. (BDT)
• IceCube Coll., Phys. Rev. D 83: 012001, 2011. (BDT)
• IceCube Coll., Phys. Rev. D 82: 112003, 2010. (BDT)
• D0 Coll., Higgs search, Phys. Rev. Lett. 105: 251801, 2010. (BDT)
• CDF Coll., single top, Phys. Rev. D 82: 112005, 2010. (BDT)
• D0 Coll., single top, Phys. Lett. B 690: 5-14, 2010. (BDT)
• D0 Coll., top pairs, Phys. Rev. D 82: 032002, 2010. (Likelihood)
• CDF Coll., single top observation, Phys. Rev. Lett. 103: 092002, 2009. (BDT)
• Super-Kamiokande Coll., Phys. Rev. D 79: 112010, 2009. (MLP)
• BaBar Coll., Phys. Rev. D 79: 051101, 2009. (BDT)
• + other papers + several ATLAS papers with TMVA about to come out...

Thank you for using TMVA!

Summary
• TMVA: a versatile package for classification and regression tasks
• Integrated into ROOT
• Easy to train classifiers/regression methods
• A multitude of physics results based on TMVA are coming out
• Thank you for your attention!

Credits
• TMVA is open-source software
• Use and redistribution of the source are permitted according to the terms of the BSD license
• Several similar data-mining efforts exist, with rising importance in most fields of science and industry

Contributors to TMVA: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland, and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Christoph Rosemann (DESY), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland), Jiahang Zhong (Academia Sinica, Taipei).

Spare Slides

A complete TMVA training/testing session

  void TMVAnalysis()
  {
     // Create factory
     TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );
     TMVA::Factory* factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

     // Add variables / targets
     TFile* input = TFile::Open( "tmva_example.root" );
     factory->AddVariable( "var1+var2", 'F' );
     factory->AddVariable( "var1-var2", 'F' );
     //factory->AddTarget( "tarval", 'F' );

     // Initialize trees
     factory->AddSignalTree    ( (TTree*)input->Get("TreeS"), 1.0 );
     factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );
     //factory->AddRegressionTree( (TTree*)input->Get("regTree"), 1.0 );
     factory->PrepareTrainingAndTestTree( "",
        "nTrain_Signal=200:nTrain_Background=200:nTest_Signal=200:nTest_Background=200:NormMode=None" );

     // Book MVA methods
     factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
        "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
     factory->BookMethod( TMVA::Types::kMLP, "MLP",
        "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

     // Train, test and evaluate
     factory->TrainAllMethods();   // factory->TrainAllMethodsForRegression();
     factory->TestAllMethods();
     factory->EvaluateAllMethods();

     outputFile->Close();
     delete factory;
  }

What is a multi-variate analysis?
• "Combine" all input variables into one output variable
• Supervised learning means learning by example: the program extracts patterns from training data
[Figure: input variables feeding a classifier that produces one output]

Metaclassifiers - Category Classifier and Boosting
• The category classifier is custom-made for HEP: use different classifiers for different phase-space regions and combine them into a single output
• TMVA supports boosting for all classifiers: use a collection of "weak learners" to improve their performance (boosted Fisher, boosted neural nets with few neurons each, ...)
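The category-classifier idea reduces to a dispatch on a region variable. The sketch below is a toy, not TMVA's Method Category: the split on a hypothetical |eta| variable at 1.5 (e.g. barrel vs. endcap) and the classifier functions are made-up placeholders.

```cpp
#include <functional>
#include <cmath>

// Dispatch each event to a region-specific classifier and return that
// classifier's value as the single combined response. eta and the 1.5
// boundary are hypothetical stand-ins for a real phase-space split.
double category_response(double eta, double x,
                         const std::function<double(double)>& barrel,
                         const std::function<double(double)>& endcap) {
    return std::fabs(eta) < 1.5 ? barrel(x) : endcap(x);
}
```

Each region can then use the method best suited to its local correlations, which is the point of the metaclassifier.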