Machine Learning in Performance Management
Irina Rish
IBM T. J. Watson Research Center
January 24, 2001
Outline
- Introduction
- Machine learning applications in Performance Management
- Bayesian learning tools: extending ABLE
- Advancing theory
- Summary and future directions
Learning problems: examples
- End-user transaction recognition: streams of Remote Procedure Calls (RPCs) grouped into transactions over time (Transaction 1, Transaction 2, ...): BUY? SELL? OPEN_DB? SEARCH?
- System event mining: events from hosts over time
Common tasks: pattern discovery, classification, diagnosis and prediction
Approach: Bayesian learning
Learn (probabilistic) dependency models: Bayesian networks (example network with nodes S, C, B, X, D and factors P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)).
- Pattern classification: P(class|data) = ?
- Prediction: P(symptom|cause) = ?
- Diagnosis: P(cause|symptom) = ?
Numerous important applications: medicine, stock market, bio-informatics, eCommerce, military, ...
Outline
- Introduction
- Machine-learning applications in Performance Management
  - Transaction Recognition
  - In progress: Event Mining; Probe Placement; etc.
- Bayesian learning tools: extending ABLE
- Advancing theory
- Summary and future directions
End-User Transaction Recognition: why is it important?
A client workstation issues End-User Transactions (EUTs) such as OpenDB, Search, SendMail; each EUT travels over a session (connection) as a stream of Remote Procedure Calls (RPCs) to the server (Web, DB, Lotus Notes).
Examples: Lotus Notes, Web/eBusiness (on-line stores, travel agencies, trading): database transactions, buy/sell, search, email, etc.
- Realistic workload models (for testing performance)
- Resource management (anticipating requests)
- Quantifying end-user perception of performance (response times)
Why is it hard? Why learn from data?
Example: EUTs and RPCs in Lotus Notes. The EUT MoveMsgToFolder generates RPCs 1-5 and the EUT FindMailByKey generates RPCs 6-8:
1. OPEN_COLLECTION
2. UPDATE_COLLECTION
3. DB_REPLINFO_GET
4. GET_MOD_NOTES
5. READ_ENTRIES
6. OPEN_COLLECTION
7. FIND_BY_KEY
8. READ_ENTRIES
- Many RPC and EUT types (92 RPCs and 37 EUTs)
- Large (unlimited) data sets (10,000+ transaction instances)
- Manual classification of a data subset took about a month
- Non-deterministic and unknown EUT-to-RPC mapping: "noise" sources such as client/server states
- No client-side instrumentation, hence unknown EUT boundaries
Our approach: Classification + Segmentation
- Problem 1 (classification): label segmented data. Given RPC sequences already split into transactions, assign a transaction label to each segment (similar to text classification).
- Problem 2 (EUT recognition): both segment and label. Given an unsegmented RPC stream, recover segment boundaries and transaction labels simultaneously (similar to speech understanding, image segmentation).
How to represent transactions? "Feature vectors"
- RPC occurrences (binary: did RPC type i appear in the transaction?)
- RPC counts (how many times did RPC type i appear?)
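A minimal sketch of the two encodings, using a hypothetical subset of RPC names for illustration:

```python
from collections import Counter

RPC_TYPES = ["OPEN_COLLECTION", "UPDATE_COLLECTION", "FIND_BY_KEY", "READ_ENTRIES"]  # illustrative subset

def occurrence_features(rpc_sequence):
    """Binary vector: 1 if the RPC type appears in the transaction, else 0."""
    present = set(rpc_sequence)
    return [1 if rpc in present else 0 for rpc in RPC_TYPES]

def count_features(rpc_sequence):
    """Integer vector: number of times each RPC type appears."""
    counts = Counter(rpc_sequence)
    return [counts.get(rpc, 0) for rpc in RPC_TYPES]

tx = ["OPEN_COLLECTION", "FIND_BY_KEY", "READ_ENTRIES", "READ_ENTRIES"]
print(occurrence_features(tx))  # [1, 0, 1, 1]
print(count_features(tx))       # [1, 0, 1, 2]
```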
Classification scheme
- Training phase: training data (RPCs labeled with EUTs) goes through feature extraction and learning, producing a classifier.
- Operation phase: "test" data (unlabeled RPCs) goes through feature extraction and the learned classifier, which outputs EUT labels.
Our classifier: naïve Bayes (NB)
Simplifying ("naïve") assumption: feature independence given the class, i.e., P(f_1, ..., f_n | class) = P(f_1 | class) · ... · P(f_n | class).
1. Training: estimate the parameters P(class) and P(f_i | class) from labeled data (e.g., ML-estimates).
2. Classification: given an (unlabeled) instance f = (f_1, ..., f_n), choose the most likely class, i.e., the class maximizing P(class) · P(f_1 | class) · ... · P(f_n | class) (Bayesian decision rule).
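A minimal sketch of one way to implement this classifier over RPC count features; the Laplace smoothing and the multinomial-style scoring are my choices, not details taken from the slides:

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Naive Bayes over RPC count features; a sketch, not the ABLE implementation."""

    def fit(self, X, y):
        # X: list of count vectors, y: list of EUT labels
        self.classes = sorted(set(y))
        n_features = len(X[0])
        class_count = defaultdict(int)
        feature_count = {c: [0] * n_features for c in self.classes}
        for counts, label in zip(X, y):
            class_count[label] += 1
            for i, c in enumerate(counts):
                feature_count[label][i] += c
        self.log_prior = {c: math.log(class_count[c] / len(y)) for c in self.classes}
        self.log_lik = {}
        for c in self.classes:
            total = sum(feature_count[c]) + n_features  # Laplace smoothing
            self.log_lik[c] = [math.log((f + 1) / total) for f in feature_count[c]]
        return self

    def predict(self, counts):
        # Bayesian decision rule: argmax_c log P(c) + sum_i counts_i * log P(f_i | c)
        def score(c):
            return self.log_prior[c] + sum(n * lp for n, lp in zip(counts, self.log_lik[c]))
        return max(self.classes, key=score)
```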
Classification results on Lotus CoC data
(Plot: accuracy vs. training set size for NB with Bernoulli, multinomial or geometric feature models, NB with the shifted geometric model, and a baseline classifier that always selects the most frequent transaction.)
- Significant improvement over the baseline classifier (75%)
- NB is simple, efficient, and comparable to state-of-the-art classifiers: SVM 85-87%, Decision Tree 90-92%
- The best-fit distribution (shifted geometric) is not necessarily the best classifier!
Transaction recognition: segmentation + classification
Dynamic programming (Viterbi search): the (recursive) DP equation combines the best score for an RPC prefix with the naive Bayes score of the last candidate segment, so segmentation and classification are performed jointly.
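One way to instantiate this Viterbi-style search, reusing the NaiveBayes fields and the count_features encoding from the sketches above; the maximum segment length and the additive log-score form are assumptions of this sketch, not details from the slide:

```python
def segment_and_label(rpc_stream, nb, featurize, max_seg_len=20):
    """Viterbi-style DP: best[t] = best log-score of a segmentation of rpc_stream[:t]."""
    n = len(rpc_stream)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [None] * (n + 1)  # (segment start, label) that achieves best[t]
    for t in range(1, n + 1):
        for s in range(max(0, t - max_seg_len), t):
            counts = featurize(rpc_stream[s:t])
            for c in nb.classes:
                score = best[s] + nb.log_prior[c] + sum(
                    k * lp for k, lp in zip(counts, nb.log_lik[c]))
                if score > best[t]:
                    best[t], back[t] = score, (s, c)
    # Trace back the segmentation and labels
    segments, t = [], n
    while t > 0:
        s, c = back[t]
        segments.append((s, t, c))
        t = s
    return list(reversed(segments))
```

With `nb = NaiveBayes().fit(...)` and `featurize = count_features`, the function returns (start, end, label) triples covering the whole RPC stream.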
Transaction recognition results
(Plot: accuracy vs. training set size.)

Model        Classification   Segmentation
Bernoulli    best             second best
Multinomial  best             third best
Geometric    best             fourth best
ShiftGeom    worst            best

- Good EUT recognition accuracy: 64% (a harder problem than classification!)
- Reversed order of results: the best classifier is not necessarily the best recognizer! Needs further research.
EUT recognition: summary
- A novel approach: learning EUTs from RPCs
- Patent, conference paper (AAAI-2000), prototype system
- Successful results on Lotus Notes data (Lotus CoC):
  - Classification: naive Bayes (up to 87% accuracy)
  - EUT recognition: Viterbi + Bayes (up to 64% accuracy)
- Work in progress:
  - Better feature selection (RPC subsequences?)
  - Selecting the "best classifier" for the segmentation task
  - Learning more sophisticated classifiers (Bayesian networks)
  - Information-theoretic approach to segmentation (MDL)
Outline
- Introduction
- Machine-learning applications in Performance Management
  - Transaction Recognition
  - In progress: Event Mining; Probing Strategy; etc.
- Bayesian learning tools: extending ABLE
- Advancing theory
- Summary and future directions
Event Mining: analyzing system event sequences
What is it? Why is it important?
- Learning system behavior patterns (events from hosts over time) for better performance management
Why is it hard?
- Large, complex systems (networks) with many dependencies; prior models not always available
- Many events/hosts; data sets are huge and constantly growing
Example: USAA data
- 858 hosts, 136 event types
- 67,184 data points (13 days, by second)
- Event examples:
  - High-severity events: 'Cisco_Link_Down', 'chassisMinorAlarm_On', etc.
  - Low-severity events: 'tcpConnectClose', 'duplicate_ip', etc.
1. Learning event dependency models
(Diagram: unknown dependencies among Event 1, Event 2, ..., Event M, Event N.)
Current approach:
- Learn dynamic probabilistic graphical models (temporal, or dynamic, Bayes nets)
- Predict:
  - time to failure
  - event co-occurrence
  - existence of hidden nodes ("root causes")
- Recognize the sequence of high-level system states: an unsupervised version of the EUT recognition problem
Important issue: incremental learning from data streams
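As a much-simplified stand-in for the dynamic Bayes net models mentioned above (fully observed, first-order only; an illustration rather than the actual approach), one can estimate pairwise temporal dependencies between event types directly from a binary event matrix:

```python
import numpy as np

def transition_model(event_matrix):
    """
    event_matrix: binary array of shape (T, M); entry [t, j] = 1 if event type j
    occurred in time bin t. Returns an M x M matrix of estimated probabilities
    P(event j occurs at t+1 | event i occurred at t).
    """
    prev, nxt = event_matrix[:-1], event_matrix[1:]
    occurs_i = prev.sum(axis=0)          # how often each event type i fired
    co = prev.T @ nxt                    # co[i, j] = #(i at t and j at t+1)
    with np.errstate(divide="ignore", invalid="ignore"):
        p = np.where(occurs_i[:, None] > 0, co / occurs_i[:, None], 0.0)
    return p
```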
2. Clustering hosts by their history
(Diagram: clusters such as "problematic" hosts vs. "silent" hosts.)
- Group hosts with similar event sequences: what is an appropriate similarity ("distance") metric?
- One example: the distance between "compressed" sequences, i.e., per-host event distribution models (one concrete instantiation is sketched below).
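One concrete (and assumed, not taken from the slide) choice of metric: compress each host's history into a distribution over event types and compare distributions with the Jensen-Shannon distance:

```python
import math
from collections import Counter

def event_distribution(events, event_types):
    """Compress a host's event history into a distribution over event types."""
    counts = Counter(events)
    total = sum(counts.values()) or 1
    return [counts.get(e, 0) / total for e in event_types]

def js_distance(p, q):
    """Jensen-Shannon distance between two event-type distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

Hosts whose pairwise distances are small can then be grouped by any standard clustering algorithm.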
Probing strategy (EPP)
- Objectives: find the probe frequency F that minimizes
  1. E(T_probe - T_start), i.e., failure-detection delay, or
  2. E(total "failure" time - total "estimated" failure time), which gives an accurate performance estimate
- Constraint on the additional load induced by probes: L(F) < MaxLoad
(Plot: response time over time, showing availability violations and probe instants.)
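A toy calculation of the trade-off, assuming probe load grows linearly with frequency and failures start uniformly within a probe interval (both are modeling assumptions of this sketch, not part of the slide):

```python
def choose_probe_frequency(max_load, load_per_probe):
    """
    Pick the highest probe frequency F (probes/sec) whose induced load stays
    at or below MaxLoad, assuming L(F) = load_per_probe * F. For periodic
    probing, the expected failure-detection delay is then ~ 1 / (2 F).
    """
    f = max_load / load_per_probe
    expected_delay = 1.0 / (2.0 * f)
    return f, expected_delay

f, delay = choose_probe_frequency(max_load=0.05, load_per_probe=0.001)
print(f"F = {f:.1f} probes/sec, expected detection delay ~ {delay:.3f} sec")
```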
Outline
- Introduction
- Machine-learning applications in Performance Management
- Bayesian learning tools: extending ABLE
- Advancing theory
- Summary and future directions
ABLE: Agent Building and Learning Environment
What is ABLE? What is my contribution?
- A Java toolbox for building reasoning and learning agents
- Provides: visual environment, Boolean and fuzzy rules, neural networks, genetic search
- My contributions:
  - naïve Bayes classifier (batch and incremental)
  - discretization
- Future releases: general Bayesian learning and inference tools
- Available at:
  - alphaWorks: www.alphaWorks.ibm.com/tech
  - Project page: w3.rchland.ibm.com/projects/ABLE
How does it work?
Who is using Naïve Bayes tools? Impact on other IBM projects
- Video Character Recognition (w/ C. Dorai):
  - Naïve Bayes: 84% accuracy
  - Better than SVM on some pairs of characters (average SVM = 87%)
  - Current work: combining Naïve Bayes with SVMs
- Environmental data analysis (w/ Yuan-Chi Chang):
  - Learning mortality rates using data on air pollutants
  - Naïve Bayes is currently being evaluated
- Performance management:
  - Event mining: in progress
  - EUT recognition: successful results
Outline
- Introduction
- Machine-learning in Performance Management
- Bayesian learning tools: extending ABLE
- Advancing theory
  - analysis of the naïve Bayes classifier
  - inference in Bayesian networks
- Summary and future directions
Why does Naïve Bayes do well? And when?
Class-conditional feature independence is an unrealistic assumption! But why/when does it work?
Intuition: wrong probability estimates do not necessarily mean wrong classification. The Bayes-optimal decision uses the true posterior P(class|f); naïve Bayes uses its own estimate of it.
When do the independence assumptions not hurt classification?
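Written out in standard notation (my reconstruction of the comparison), the two quantities are

\[
P^{*}(c \mid f_1,\dots,f_n) \;\propto\; P(c)\,P(f_1,\dots,f_n \mid c)
\qquad\text{vs.}\qquad
P_{NB}(c \mid f_1,\dots,f_n) \;\propto\; P(c)\prod_{i=1}^{n} P(f_i \mid c).
\]

Classification is unaffected as long as both expressions are maximized by the same class, even when the NB estimate itself is far from the true posterior.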
Case 1: functional dependencies
Lemma 1: Naïve Bayes is optimal when the features are functionally dependent given the class.
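A proof sketch under one natural reading of the lemma, namely that each feature value is a deterministic function g_i of the class (this reading, and the notation, are assumptions of this reconstruction):

\[
P(f_i = a_i \mid c) \in \{0,1\}
\;\Longrightarrow\;
\prod_{i=1}^{n} P(f_i = a_i \mid c)
= P(f_1 = a_1,\dots,f_n = a_n \mid c)
= \mathbf{1}\big[\,a_i = g_i(c)\ \text{for all } i\,\big].
\]

Both the NB score and the true posterior then reduce to P(c) times the same indicator, so they select the same (most probable consistent) class.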
Case 2: "almost-functional" (low-entropy) distributions
Lemma 2: Naïve Bayes is a "good approximation" for "almost-functional" dependencies.
Formally: if P(f_i = a_i) >= 1 - δ for i = 1, ..., n, or P(f = a) >= 1 - δ, then the naïve Bayes estimate stays close to the Bayes-optimal one, with the bound improving as δ decreases.
Related practical examples:
- RPC occurrences in EUTs are often almost-deterministic (and NB does well)
- Successful "local inference" in almost-deterministic Bayesian networks (Turbo coding, "mini-buckets"; see Dechter & Rish 2000)
Experimental results support the theory
Random problem generator: uniform P(class); random P(f|class):
1. A randomly selected entry of P(f|class) is assigned probability mass 1 - δ
2. The remaining entries are filled by uniform random sampling and normalization
Findings:
1. Less "noise" (smaller δ) means NB is closer to optimal
2. Feature dependence does NOT correlate with NB error
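A small simulation in the spirit of this generator (my own minimal two-class, two-feature version; the parameters and the exact way the mass 1 - δ is placed are illustrative): it measures how far NB's error is from the Bayes-optimal error as δ shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_conditional(k, delta):
    """Random P(f1, f2 | class): one cell gets mass ~(1 - delta), the rest share delta."""
    p = rng.random((k, k))
    p /= p.sum()
    p *= delta
    i, j = rng.integers(k), rng.integers(k)
    p[i, j] += 1 - delta
    return p

def average_excess_nb_error(delta, k=4, trials=200):
    gap = 0.0
    for _ in range(trials):
        joint = np.stack([random_conditional(k, delta) for _ in range(2)])  # [class, f1, f2]
        marg1 = joint.sum(axis=2)                      # P(f1 | class)
        marg2 = joint.sum(axis=1)                      # P(f2 | class)
        nb = marg1[:, :, None] * marg2[:, None, :]     # NB score per (class, f1, f2)
        bayes_pick = joint.argmax(axis=0)              # Bayes-optimal decision per (f1, f2)
        nb_pick = nb.argmax(axis=0)
        # Uniform class prior: P(error) = 0.5 * sum over f of P(f | class not picked)
        bayes_err = 0.5 * sum(joint[1 - bayes_pick[a, b], a, b]
                              for a in range(k) for b in range(k))
        nb_err = 0.5 * sum(joint[1 - nb_pick[a, b], a, b]
                           for a in range(k) for b in range(k))
        gap += nb_err - bayes_err
    return gap / trials

for d in (0.5, 0.2, 0.05):
    print(f"delta={d}: average excess NB error = {average_excess_nb_error(d):.4f}")
```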
Outline
- Introduction
- Machine-learning in Performance Management
  - Transaction Recognition
  - Event Mining
- Bayesian learning tools: extending ABLE
- Advancing theory
  - analysis of the naïve Bayes classifier
  - inference in Bayesian networks
- Summary and future directions
From Naïve Bayes to Bayesian Networks
- Naïve Bayes model: independent features given the class
- Bayesian network (BN) model: any joint probability distribution
Example network: Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), Dyspnoea (D), with
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C, S) P(D|C, B)
Conditional probability table (CPD) for P(D | C, B):

  C  B  D=0  D=1
  0  0  0.1  0.9
  0  1  0.7  0.3
  1  0  0.8  0.2
  1  1  0.9  0.1

Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
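A brute-force computation of this query by enumerating the joint distribution; only the P(D | C, B) table comes from the slide, while the remaining CPT numbers are made up for illustration:

```python
from itertools import product

# Illustrative CPTs for the Smoking/Cancer/Bronchitis/X-ray/Dyspnoea network.
P_S = {1: 0.3, 0: 0.7}                   # P(Smoking = s); made up
P_C_given_S = {1: 0.1, 0: 0.01}          # P(Cancer = 1 | Smoking = s); made up
P_B_given_S = {1: 0.4, 0: 0.1}           # P(Bronchitis = 1 | Smoking = s); made up
P_X_given_CS = {(1, 1): 0.9, (1, 0): 0.9, (0, 1): 0.2, (0, 0): 0.1}  # P(X=1 | C, S); made up
P_D_given_CB = {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}  # P(D=1 | C, B); from the slide

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    p = P_S[s]
    p *= P_C_given_S[s] if c else 1 - P_C_given_S[s]
    p *= P_B_given_S[s] if b else 1 - P_B_given_S[s]
    p *= P_X_given_CS[(c, s)] if x else 1 - P_X_given_CS[(c, s)]
    p *= P_D_given_CB[(c, b)] if d else 1 - P_D_given_CB[(c, b)]
    return p

# Query from the slide: P(Cancer = yes | Smoking = no, Dyspnoea = yes)
num = sum(joint(0, 1, b, x, 1) for b, x in product([0, 1], repeat=2))
den = sum(joint(0, c, b, x, 1) for c, b, x in product([0, 1], repeat=3))
print(num / den)
```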
Example: Printer Troubleshooting (Microsoft Windows 95) [Heckerman, 95]
(Bayesian network whose nodes include Application Output OK, Print Spooling On, Spool Process OK, Local Disk Space Adequate, Spooled Data OK, GDI Data Input OK, Uncorrupted Driver, Correct Driver, Network Up, Correct Printer Path, Net Cable Connected, GDI Data Output OK, Correct Printer Selected, Print Data OK, Correct Driver Settings, Net/Local Printing, Net Path OK, Printer On and Online, PC to Printer Transport OK, Printer Data OK, Local Path OK, Correct Local Port, Local Cable Connected, Paper Loaded, Printer Memory Adequate, and Print Output OK.)
How to use Bayesian networks?
- Diagnosis: P(cause|symptom) = ?
- Prediction: P(symptom|cause) = ?
- Classification: P(class|data) = ?
- Decision-making: maximum expected utility (MEU), given a utility function
These are NP-complete inference problems, which motivates approximate algorithms.
Applications: medicine, stock market, bio-informatics, eCommerce, performance management, etc.
Local approximation scheme: "mini-buckets" (paper submitted to JACM)
- Idea: reduce the complexity of inference by ignoring some dependencies
- Successfully used for approximating the Most Probable Explanation (MPE): very efficient on real-life (medical, decoding) and synthetic problems
- (Plot: approximation accuracy vs. noise; less "noise" means higher accuracy, similarly to naïve Bayes!)
- General theory needed: independence assumptions and "almost-deterministic" distributions
- Potential impact: efficient inference in complex performance-management models (e.g., event mining, system dependence models)
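The core inequality behind the mini-bucket idea, in notation of my own choosing (not copied from the slide or the submitted paper): when eliminating a variable X whose bucket holds functions f_1, ..., f_K, partition them into mini-buckets Q_1, ..., Q_r and maximize each group separately. Since the max of a product never exceeds the product of the maxes, the result upper-bounds the exact MPE value while each maximization stays small:

\[
\max_{X} \prod_{k=1}^{K} f_k
\;\le\;
\prod_{j=1}^{r} \Big( \max_{X} \prod_{f \in Q_j} f \Big).
\]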
Summary
- Performance management:
  - End-user transaction recognition (Lotus CoC): novel method, patent, paper; applied to Lotus Notes
  - In progress: event mining (USAA), probing strategies (EPP)
- Machine-learning tools:
  - Extending ABLE with a Bayesian classifier (alphaWorks)
  - Applying the classifier to other IBM projects: video character recognition, environmental data analysis
- Theory and algorithms:
  - analysis of Naïve Bayes accuracy (Research Report)
  - approximate Bayesian inference (submitted paper)
  - patent on meta-learning
Future directions
Research interest: automated learning and inference (practical problems, generic tools, theory).
Practical problems (performance management):
- Transaction recognition: better feature selection, segmentation
- Event mining: Bayes net models, clustering
- Web log analysis: segmentation / classification / clustering
- Modeling system dependencies: Bayes nets
- "Technology transfer": a generic approach to "event streams" (EUTs, system events, web page accesses)
Generic tools (ML library / ABLE):
- Bayesian learning: general Bayes nets, temporal BNs, incremental learning
- Bayesian inference: exact inference, approximations
- Other tools: SVMs, decision trees; combined tools, meta-learning tools
Theory (analysis of algorithms):
- Naïve Bayes accuracy: other distribution types
- Accuracy of local inference approximations
- Comparing model selection criteria (e.g., Bayes net learning)
- Relative analysis and combination of classifiers (Bayes / max-margin / DT)
- Incremental learning
Collaborations
- Transaction recognition: J. Hellerstein, T. Jayram (Watson)
- Event Mining: J. Hellerstein, R. Vilalta, S. Ma, C. Perng (Watson)
- ABLE: J. Bigus, R. Vilalta (Watson)
- Video Character Recognition: C. Dorai (Watson)
- MDL approach to segmentation: B. Dom (Almaden)
- Approximate inference in Bayes nets: R. Dechter (UCI)
- Meta-learning: R. Vilalta (Watson)
- Environmental data analysis: Y. Chang (Watson)
Machine learning discussion group
- Weekly seminars: 11:30-2:30 (w/ lunch) in 1S-F40
- Active group members: Mark Brodie, Vittorio Castelli, Joe Hellerstein, Daniel Oblinger, Jayram Thathachar, Irina Rish (more people joined recently)
- Agenda:
  - discussions of recent ML papers and book chapters ("Pattern Classification" by Duda, Hart, and Stork, 2000)
  - brain-storming sessions about particular ML topics
- Recent discussions: accuracy of Bayesian classifiers (naïve Bayes)
- Web site: http://reswat4.research.ibm.com/projects/mlreadinggroup.nsf/main/toppage