Scalable Benchmarks and Kernels for Data Mining and Analytics

Vipin Kumar
University of Minnesota
kumar@cs.umn.edu
www.cs.umn.edu/~kumar

§ Joint work with Alok Choudhary and Gokhan Memik (Northwestern) and Michael Steinbach (University of Minnesota)
§ Research funded by NSF

Need for High Performance Data Mining
§ Today’s digital society has seen enormous data growth in both commercial and scientific databases
§ Data Mining is becoming a commonly used tool to extract information from large and complex datasets
§ Advances in computing capabilities and technological innovation needed to harvest the available wealth of data
[Figure: example data sources: Biomedical Data, Homeland Security, Internet, Geo-spatial data, Sensor Networks, Computational Simulations]

Data Mining for Climate Data
NASA ESE questions:
• How is the global Earth system changing?
• What are the primary forcings?
• How does the Earth system respond to natural & human-induced changes?
• What are the consequences of changes in the Earth system?
• How well can we predict future changes?
Data: global snapshots of values for a number of variables on land surfaces or water

NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS
NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years….
http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html

Disturbance Viewer (Detection of Ecosystem Disturbances): This interactive module displays the locations on the earth surface where significant disturbance events have been detected.

High Resolution EOS Data:
• EOS satellites provide high resolution measurements
  • Finer spatial grids
    • 1 km grid produces 694,315,008 data points
    • Going from 0.5° data to 1 km data results in a 2500-fold increase in the data size
  • More frequent measurements
  • Multiple instruments
• High resolution data allows us to answer more detailed questions:
  • Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
  • Finding relationships between leaf area index (LAI) and topography of a river drainage basin
  • Finding relationships between fire frequency and elevation as well as topographic position
• Leads to substantially higher computational and memory requirements

Data Mining for Cyber Security
• Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to sophisticated cyber attacks
• Traditional Intrusion Detection Systems (IDS) have well-known limitations
  • Too many false alarms
  • Unable to detect sophisticated and novel attacks
  • Unable to detect insider abuse / policy abuse
• Data Mining is well suited to address these challenges
• Large Scale Data Analysis is needed for
  • Correlation of suspicious events across network sites
    • Helps detect sophisticated attacks not identifiable by single site analyses
  • Analysis of long term data (months/years)
    • Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)

MINDS: Minnesota Intrusion Detection System
• Incorporated into Interrogator architecture at ARL Center for Intrusion Monitoring and Protection (CIMP)
  • Helps analyze data from multiple sensors at DoD sites around the country
  • Routinely detects Insider Abuse / Policy Violations / Worms / Scans

Data Mining for Biomedical Informatics
§ Recent technological advances are helping to generate large amounts of both medical and genomic data
  • High-throughput experiments/techniques
    - Gene and protein sequences
    - Gene-expression data
    - Biological networks and phylogenetic profiles
  • Electronic Medical Records
    - IBM-Mayo clinic partnership has created a DB of 5 million patients
    - NIH Roadmap
§ Data mining offers a potential solution for analysis of large-scale data
  • Automated analysis of patients' history for customized treatment
  • Design of drugs/chemicals
  • Prediction of the functions of anonymous genes
[Figure: Protein Interaction Network]

Role of Benchmarks in Architecture Design
§ Benchmarks guide the development of new processor architectures in addition to measuring the relative performance of different systems
  • SPEC: General purpose architecture (“Advances in the microprocessor industry would not have been possible without the SPEC benchmarks” - David Patterson)
  • TPC: Database Systems
  • SPLASH: Parallel machine architectures
  • MediaBench: Media and Communication Processors
  • NetBench: Network/Embedded processors

Do We Need Benchmarks Specific to Data Mining?
§ Performance metrics of several benchmarks gathered from VTune
  • Cache miss ratios, bus usage, page faults, etc.
§ Benchmark applications were grouped using Kohonen clustering to spot trends
[Figure: cluster number (0 to 11) assigned to each application, grouped by benchmark suite]
  • SPEC INT: gcc, bzip2, gzip, mcf, twolf, vortex, vpr, parser
  • SPEC FP: apsi, art, equake, lucas, mesa, mgrid, swim, wupwise
  • MediaBench: rawcaudio, epic, encode, cjpeg, mpeg2, pegwit, gs, toast
  • TPC-H: Q17, Q3, Q4, Q6
  • MineBench: apriori, bayesian, birch, eclat, hop, scalparc, kMeans, fuzzy, rsearch, semphy, snp, genenet, svm-rfe
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]

Recently funded NSF project: Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries
PIs: A. Choudhary and Gokhan Memik (NW), V. Kumar and M. Steinbach (UM)
Goal: Establish a comprehensive benchmarking suite for data mining applications.
§ Motivate the development of new processor architectures and system designs for data mining
§ Motivate the implementation of more sophisticated data mining algorithms that can work within the constraints imposed by current architecture designs
§ Improve the productivity of scientists and engineers using data mining applications in a wide variety of domains

Data Mining Tasks …
• Clustering
• Predictive Modeling
• Association Rules
• Anomaly Detection
[Figure: a data table surrounded by the four tasks]

Key Data Mining Algorithms
§ Clustering
  • K-means, EM, SOM
  • Single link / Group Average hierarchical clustering
  • DBSCAN, SNN
§ Classification
  • Bayes
  • SVM
  • Decision trees, Rule based systems
§ Association Rule Mining
  • Apriori, FP-Growth
§ Anomaly Detection
  • Statistical methods
  • Distance-based
  • Clustering-based
§ Preprocessing
  • SVD, PCA

Major Data Mining Kernels
1. Counting
   1. Given a set of data records, count the occurrences of the different categories to build a contingency table
   2. Count the occurrences of a set of items in a set of transactions
2. Pairwise computations
   1. Given a set of data records, perform pairwise distance/similarity computations
3. Linear algebra operations
   • SVD, PCA
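The pairwise computation kernel can be illustrated with a minimal sketch. It assumes numeric records and Euclidean distance; the function name `pairwise_distances` and the sample records are hypothetical and do not come from the benchmark suite.

```python
import math

def pairwise_distances(records):
    """Return the full n x n matrix of Euclidean distances between records."""
    n = len(records)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distance is symmetric, so compute each pair once and mirror it.
        for j in range(i + 1, n):
            d = math.dist(records[i], records[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist

# Hypothetical two-dimensional records, used only for illustration.
records = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = pairwise_distances(records)
```

The doubly nested loop performs n(n-1)/2 distance evaluations, which is why this kernel tends to dominate run time as the number of records grows.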

General Characteristics of Data Mining Algorithms
§ Dense/Sparse data
§ Hash table / Hash tree
§ Linked Lists
§ Iterative nature
§ Data often too large to fit in main memory
  • Spatial locality is critical

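Constructing a Decision Tree
[Figure: two candidate splits of the training records]
• Split on Employed
  • Yes: Worthy: 4, Not Worthy: 3
  • No: Worthy: 0, Not Worthy: 3
• Split on Education
  • Graduate: Worthy: 2, Not Worthy: [value missing]
  • High School/Undergrad: Worthy: 2, Not Worthy: 4

Key Computation: class counts per attribute value

                   Worthy   Not Worthy
  Employed = Yes     4          3
  Employed = No      0          3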
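The key computation for the Employed attribute can be sketched as a single counting pass. The records below are hypothetical, chosen only so that they reproduce the counts shown for Employed; the function name `contingency_table` is likewise an assumption.

```python
from collections import Counter

# Hypothetical (Employed, class) records that reproduce the slide's counts.
records = (
    [("Yes", "Worthy")] * 4
    + [("Yes", "Not Worthy")] * 3
    + [("No", "Not Worthy")] * 3
)

def contingency_table(records):
    """Count each (attribute value, class label) pair in one pass over the data."""
    return Counter(records)

table = contingency_table(records)
for value in ("Yes", "No"):
    # A Counter returns 0 for pairs that never occur, e.g. ("No", "Worthy").
    print(value, table[(value, "Worthy")], table[(value, "Not Worthy")])
```

A tree builder would compute one such table per candidate attribute at every node and use the counts to score the split.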

Constructing a Decision Tree
[Figure: training records partitioned into Employed = Yes and Employed = No]

Constructing a Decision Tree in Parallel
Partitioning of data only
[Figure: n records with m categorical attributes, divided among the processors]
• Global reduction per node is required
• Large number of classification tree nodes gives high communication cost
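The data partitioning scheme can be simulated sequentially as follows. This is only a sketch under stated assumptions: the processors are modeled as list slices, the global reduction is modeled as a sum of per-processor counters, and the records and function names are hypothetical.

```python
from collections import Counter

def local_counts(partition):
    """Each processor builds a contingency table from its own records only."""
    return Counter(partition)

def global_reduce(tables):
    """Sum the local tables; stands in for one global reduction per tree node."""
    total = Counter()
    for table in tables:
        total.update(table)
    return total

# Hypothetical (attribute value, class) records.
records = (
    [("Yes", "Worthy")] * 4
    + [("Yes", "Not Worthy")] * 3
    + [("No", "Not Worthy")] * 3
)

num_procs = 3
# Round-robin split of the records across the simulated processors.
partitions = [records[p::num_procs] for p in range(num_procs)]
global_table = global_reduce(local_counts(part) for part in partitions)
```

Because this reduction is repeated for every tree node, the communication cost grows with the number of nodes, which is the drawback noted above.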

Constructing a Decision Tree in Parallel
Partitioning of classification tree nodes
[Figure: classification tree with 10,000 training records at the root; children with 7,000 and 3,000 records; grandchildren with 2,000, 5,000, 2,000, and 1,000 records]
• Natural concurrency
• Load imbalance
  • The amount of work associated with each node varies
• Limited concurrency on the upper portion of the tree
• Child nodes use the same data as used by the parent node
  • Loss of locality
  • High data movement cost

Speedup Comparison of the Three Parallel Algorithms
§ Data set used in SLIQ paper (Ref: Mehta, Agrawal and Rissanen, 1996)
§ IBM SP2 with 128 processors
§ Dynamic load balancing inspired by parallel sparse Cholesky factorization and parallel tree search
[Figure: speedup plots for 0.8 million examples and 1.6 million examples, each comparing the hybrid, data partitioning, and tree partitioning algorithms]

Speedup of the Hybrid Algorithm with Different Size Data Sets

Hash Table Access
• Some efficient decision tree algorithms require random access to large data structures.
• Example: SPRINT (Ref: Shafer, Agrawal, Mehta, 1996)
• Storing the entire hash table on one processor makes the algorithm unscalable
[Figure: processors P0, P1, and P2 accessing a hash table with Left/Right entries]

ScalParC (Ref: Joshi, Karypis, Kumar, 1998)
§ ScalParC is a scalable parallel decision tree construction algorithm
  • Scales to large number of processors
  • Scales to large training sets
§ ScalParC is memory efficient
  • The hash table is distributed among the processors
§ ScalParC performs minimum amount of communication
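One way to picture a hash table distributed among processors is sketched below. This is not the published ScalParC implementation: it assumes a simple block distribution of record ids, and the names `owner_of` and `route_updates` are hypothetical.

```python
def owner_of(record_id, n_records, n_procs):
    """Assumed block distribution: processor p owns a contiguous range of record ids."""
    block = -(-n_records // n_procs)  # ceiling division
    return record_id // block

def route_updates(updates, n_records, n_procs):
    """Group (record_id, child) updates by the processor that owns each table entry."""
    outbox = [[] for _ in range(n_procs)]
    for record_id, child in updates:
        outbox[owner_of(record_id, n_records, n_procs)].append((record_id, child))
    return outbox

# Hypothetical updates: each record is assigned to the Left or Right child of a split.
updates = [(0, "Left"), (7, "Right"), (3, "Left"), (8, "Right")]
outbox = route_updates(updates, n_records=9, n_procs=3)
# outbox[p] holds the updates destined for processor p.
```

With the table split this way, no single processor stores or serves the whole table; each processor sends its updates to the owners and receives only the entries it is responsible for.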

This ScalParC Design is Inspired by ...
§ Communication structure of parallel sparse matrix-vector algorithms
[Figure: hash table entries distributed across processors P0, P1, and P2]

Parallel Runtime (Ref: Joshi, Karypis, Kumar, 1998)
128 Processor Cray T3D

Computing Association Patterns
1. Market-basket transactions
2. Find item combinations (itemsets) that occur frequently in data
3. Generate association rules

Counting Candidates
§ Frequent itemsets are found by counting candidates
§ Simple way:
  • Search for each candidate in each transaction
[Figure: N transactions matched against M candidate itemsets, with a count for each candidate]
§ Naïve approach requires O(NM) comparisons
§ Reduce the number of comparisons (NM) by using hash tables to store the candidate itemsets
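The two counting strategies can be compared in a minimal sketch. It assumes all candidates have the same size k and are stored as sorted tuples, and it uses a Python dict as a stand-in for the hash table or hash tree; the transactions, candidates, and function names are hypothetical.

```python
from itertools import combinations

def count_naive(transactions, candidates):
    """Compare every candidate with every transaction: O(NM) subset tests."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        items = set(t)
        for c in candidates:
            if items.issuperset(c):
                counts[c] += 1
    return counts

def count_hashed(transactions, candidates, k):
    """Enumerate the k-subsets of each transaction and look them up by hash."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

# Hypothetical market-basket data; candidates are sorted 2-itemsets.
transactions = [("A", "B", "C"), ("A", "C", "E"), ("B", "C", "D"), ("B", "D")]
candidates = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]

naive_counts = count_naive(transactions, candidates)
hashed_counts = count_hashed(transactions, candidates, k=2)
```

The hashed version never scans the candidate list; its cost depends on the number of k-subsets per transaction rather than on M, which pays off when M is large and transactions are short.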

Parallel Association Rules: Scaleup Results (100K, 0.25%) (Ref: Han, Karypis, and Kumar, 2000)
• DD (Agrawal & Shafer, 1996)
• IDD (Han, Karypis, Kumar, 2000): efficient implementation of collective communication
• HD (Han, Karypis, Kumar, 2000): dynamic restructuring of computation

Candidates for MineBench

Analysis of Benchmark Algorithms
§ Explore the bottlenecks associated with current general purpose sequential and parallel machines
§ Explore how different architectural features impact the performance of data mining algorithms

Preliminary Evaluation of Some Sample Data Sets
§ Example small (S), medium (M), and large (L) data sets
§ Execution time for some algorithms in the MineBench suite
Reference: [Liu Y., Pisharath J., Liao W., Memik G., Choudhary A., Dubey P., 2004]

Designing Efficient Kernels for Data Mining
§ Understanding the bottlenecks in executing DM algorithms on current architectures will help design new, more efficient algorithms
§ Focus will be on designing the frequently used kernels that dominate the execution time of most DM algorithms
§ Both sequential and parallel versions will be developed
[Figure: Frequency of Kernel Operations in Representative Applications]
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]

Conclusions
§ Data mining applications are becoming increasingly important
§ The current systems design approach is not adequate for DM applications
§ MineBench: a new benchmark suite which encompasses many algorithms found in data mining
§ Initial findings:
  • Data mining applications are unique in terms of performance characteristics
  • There is much room for optimization with regard to data mining workloads

Bibliography
§ Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Addison-Wesley, April 2005
§ Introduction to Parallel Computing (Second Edition), Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003
§ Data Mining for Scientific and Engineering Applications, edited by R. Grossman, C. Kamath, W. P. Kegelmeyer, V. Kumar, and R. Namburu, Kluwer Academic Publishers, 2001
• J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon, "Emerging Scientific Applications in Data Mining", Communications of the ACM, Volume 45, Number 8, pp. 54-58, August 2002
• C. Potter, P. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, V. Genovese, "Major Disturbance Events in Terrestrial Ecosystems Detected using Global Satellite Data Sets", Global Change Biology 9 (7), 1005-1021, 2003
• Vipin Kumar, "Parallel and Distributed Computing for Cyber Security". An article based on the keynote talk by the author at the 17th International Conference on Parallel and Distributed Computing Systems (PDCS-2004). DS Online Journal, Volume 6, Number 10, October 2005
• Ying Liu, Jayaprakash Pisharath, Wei-keng Liao, Gokhan Memik, Alok Choudhary, and Pradeep Dubey. Performance Evaluation and Characterization of Scalable Data Mining Algorithms. In Proceedings of the 16th International Conference on Parallel and Distributed Computing and Systems (PDCS), November 2004
• Joseph Zambreno, Berkin Ozisikyilmaz, Jayaprakash Pisharath, Gokhan Memik, and Alok Choudhary. Performance Characterization of Data Mining Applications using MineBench. In Proceedings of the 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-9), February 2006
• Jayaprakash Pisharath, Joseph Zambreno, Berkin Ozisikyilmaz, and Alok Choudhary. Accelerating Data Mining Workloads: Current Approaches and Future Challenges in System Architecture Design. In Proceedings of the 9th International Workshop on High Performance and Distributed Mining (HPDM), April 2006