
  • Number of slides: 38

University of Arkansas Data Mining with Teradata™ Warehouse Miner
Jim Kashner, CTO Data Mining

The Empirical Method and Decision Support …
  • all of the information in this presentation is “jim’s opinions numbers 8 through 224” (for today) …
  • a framework for making decisions in the presence of uncertainty
  • seeks to shed light on the validity or plausibility of notions, suppositions, propositions, and hypotheses
  • is iterative and circular
    – don’t ever finish
    – just stop at some point
[Diagram labels: Notion, Supposition, Proposition, Refined Hypothesis, Data, Analysis, Interpretation]
11/30/2004 Copyright 2004 Teradata, a division of NCR

Teradata Warehouse Miner Technology Enablers for the Data Mining Process
  • the various releases of Teradata Warehouse Miner are intended to serve as very powerful technology enablers for the Data Mining Process
  • but, Tools Don’t Build Models, Thoughtful People Do
    – When a good tool between the ears drives the data mining process, good models are built
    – When too much is asked of analytical software, the risk of spurious and invalid models rises proportionately
  • but thoughtful people who build models can also be helped by having a proven and generic process to follow
    – The formal Teradata Data Mining Method is one of several good processes used to conduct successful data mining projects
      • its foundation is the “tried and true” empirical method
      • it’s not a prescription, just a set of carefully constructed suggestions

Teradata Data Mining Method
[Diagram labels: Project Management, Business Issues, Architecture and Technology Preparation, Data Preparation, Analytical Modeling, Knowledge Delivery and Deployment, Knowledge Transfer]
  • data mining is a very iterative process
    – the linear process depicted above serves as a guide, and identifies the chunky bits of the process

Data Mining with Teradata Warehouse Miner
Teradata’s Data Mining Method – Our Process
[Diagram: a Highly Iterative Process, with Project Management and Knowledge Transfer running alongside every stage]
  • Architecture and Technology Preparation
  • Business Question Identification and Qualification
  • Data Preparation and Pre-Processing (TWM – Stats & ADS, “Data Profiling”)
    – Data Exploration
    – Data Transformation
  • Model Construction and Evaluation (TWM – Analytics, Analytic Modeling)
    – Multivariate Statistics
    – Machine Learning Algorithms
  • Model Deployment and Maintenance (TWM – Deployment, Model Deployment)
    – Scoring & Evaluation
    – Lifecycle Maintenance

Data Mining and the Empirical Method
  • data mining is not automated discovery of hidden patterns in your data
  • data mining is thoughtful and technology enabled discovery of hidden patterns in your data
  • welcome to the empirical method

Teradata as an Analytic Engine
  • Teradata is especially well-suited to perform complex aggregations and evaluations of sets according to conditional logic
    – native Teradata functions
    – expressed as SQL
    – where indexes cannot reasonably be expected to exist for any particular aggregation, set evaluation, or conditional logic
  • analytical modeling algorithms require an engine that can perform complex aggregations and evaluations of sets according to conditional logic
  • the very good fit of Teradata as an analytic engine is rather obvious after considering what analytical modeling algorithms actually do under the hood

Said another way...
Given: The following notation is used in virtually all statistical, artificial intelligence, and machine learning algorithms to denote the equations used to represent and calculate data mining models:
  • Σ f(x), which means sum
  • Π f(x), which means multiply
  • ∈ and ∉, which mean is, and is not, an element of (set theory)
Question: What do they all have in common?
Answer: All of these are what Teradata does better than any other engine on this planet.
Note: f(x) are other supported functions, mathematical and other, either as native Teradata functions, or those that can be expressed very efficiently in SQL with Teradata extensions.
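
As a rough illustration of how the notation on this slide maps onto SQL, the sketch below runs the three operations against an in-memory SQLite database. SQLite is only a stand-in engine here, not Teradata, and the table `obs`, its columns, and the helper functions `py_ln` and `py_exp` are assumptions made up for the example. The product is computed with the usual identity Π x = exp(Σ ln x), which holds only for positive values.

```python
import math
import sqlite3

# Stand-in engine: SQLite in memory (NOT Teradata). Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")

# Register log/exp helpers explicitly so the sketch does not depend on
# which math functions the local SQLite build happens to include.
conn.create_function("py_ln", 1, math.log)
conn.create_function("py_exp", 1, math.exp)

conn.execute("CREATE TABLE obs (id INTEGER, x REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0)],
)

# Sigma: a sum over a set is a SUM aggregate.
(sigma,) = conn.execute("SELECT SUM(x) FROM obs").fetchone()

# Pi: a product of positive values is exp(sum(ln(x))).
(pi,) = conn.execute("SELECT py_exp(SUM(py_ln(x))) FROM obs").fetchone()

# Element-of and not-element-of: IN and NOT IN predicates.
(n_in,) = conn.execute("SELECT COUNT(*) FROM obs WHERE id IN (1, 2)").fetchone()
(n_out,) = conn.execute("SELECT COUNT(*) FROM obs WHERE id NOT IN (1, 2)").fetchone()

print(sigma, pi, n_in, n_out)
```

Each statement is a single aggregate over the whole table with no index involved, which is the kind of workload the previous slide describes.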

Teradata Warehouse Miner is an ongoing experiment
  • TeraMiner™ Stats – June, 1999
  • Teradata Warehouse Miner – Stats, Analytics, & Deployment – July, 2001
  • Teradata Warehouse Miner – Stats, Analytics, Deployment, & ADS (Analytical Data Set generation) – June, 2004
  • additional functionality is continually added in subsequent releases
    – to each of these components of Teradata Warehouse Miner
  • because of our success with this “experimental approach”, we continue to ask: “Why not?”
    – Teradata continues to amaze us by what it can do
    – our Teradata Warehouse Miner Software Engineering Team is quite amazing too

What is Teradata Warehouse Miner?
  • TWM includes a set of .NET Interfaces and a User Interface
    – generates and executes Teradata-specific SQL
      • ANSI SQL when possible
    – instantiated by User Interface
    – easily integrated into other applications (partners, custom)
    – all analysis parameters, model definition, and analysis results stored in metadata
    – select results or explain, or persist results in table, temporary table or view
  • TWM includes several types of .NET Interfaces
    – Registry independent application extensions or plug-ins
    – Teradata Warehouse Miner Descriptive Statistics DLL
    – Teradata Warehouse Miner ADS DLL
    – Teradata Warehouse Miner Data Reorganization DLL
    – Teradata Warehouse Miner Analytic Algorithm & Scoring DLLs (4)
    – Teradata Warehouse Miner Matrix DLL
    – Teradata Warehouse Miner Statistical Test DLL
  • TWM includes a GUI for the desktop
    – User interface to .NET Objects
    – Queries Teradata Dictionary to aid in parameterizing functions
      • directly using HELP syntax
      • optionally, MDS DIM (Metadata Services Database Information Model)
    – Interactive display of results – SQL, Data, Graphs, Reports

Teradata Warehouse Miner High Level Architecture
[Architecture diagram labels]
  • Client Platform: Windows NT, 2000, XP, .NET 2003 Server
    – User Interface Services: User Interface, Visualizations
    – Business Services: Manager, Projects, Metadata Access, Analyses
    – Data Services: Algorithms (COM), Algorithms (.NET), Data Access, Teradata ODBC, Teradata Metadata Services
  • Teradata Platform: Teradata RDBMS Version 2 Release 4.1 or later
Teradata Warehouse Miner
  • Windows Interface
    – build, maintain, and execute projects
    – explore and manipulate results
      • tabular and graphical
    – parameterize .NET APIs
  • .NET APIs & ADO
    – .NET Interfaces (APIs)
      • documented for developers
    – ActiveX Data Objects
      • DLL interface “plug-ins”
    – write all API parameters and all XML results in TWM metadata
      • stored in binary data type
    – generate & submit SQL
    – receive query results from Teradata and present them in user interface
    – read model definition and results stored in TWM metadata to display XML reports and graphs
    – read model definition in TWM metadata to score and evaluate

Teradata Warehouse Miner Data Description Functions
  • Univariate Statistics: Count; Minimum, Maximum; Modes; Mean; Standard Deviation; Standard Error; Variance; Coefficient of Variation; Skewness; Kurtosis; Uncorrected Sum of Squares; Corrected Sum of Squares
  • Quantiles and Ranks: Top 10/Bottom 10 Percentiles; Deciles; Quartiles; Tertiles; Top 5/Bottom 5 Ranked Values with Counts
  • Scatter Plot Analysis: 2-D and 3-D Plots of Continuous Variables
  • Correlation Analysis: Quickly view pair-wise correlations among ‘n’ variables
  • Values Analysis (basic data quality analysis): Data Types; Counts; # NULL Values; # Positive Values; # Negative Values; # Zeros; # Blanks; # Unique Values
  • Frequency Analyses: Frequency of Discrete Variables; N-Way Cross-Tabulation; Pair-wise Cross-Tabs
  • Histogram Analyses: Histograms of Continuous Variables; Options for Even Width, User Defined Widths/Boundaries, and Quantile Adaptive Binning; Overlay columns; Statistics within bins
  • Overlap Analysis: Index/Key Column Consistency
  • Data Explorer: performs basic statistical analysis on a set of tables and selected columns within any Teradata database
    – makes intelligent decisions about which functions to perform
    – most criteria for “intelligent” decisions can be modified by the user
    – Values Analysis: every column in the set of input tables
    – Univariate Statistical Analysis: every column of numeric or date type
    – Frequency Analysis: every column that has less than or equal to a number of unique values
    – Histogram Analysis: every numeric or date type column that has more than a number of unique values
  • Data Visualizations: 2D & 3D Histograms; 2D & 3D Frequency Bar Charts; Values Bar Charts & Circular Graphs; Box and Whisker Plots; Scatter Plots; Integrated Data Explorer Graphics
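
To make the Values Analysis item more concrete, here is a minimal sketch of how such counts could be expressed as one SQL statement built from a table and column name. It is not the SQL that Teradata Warehouse Miner generates. The helper `values_analysis_sql`, the table `acct`, and the column `balance` are all hypothetical, and SQLite stands in for the database so the example can run anywhere.

```python
import sqlite3

def values_analysis_sql(table, column):
    """Build one SELECT that profiles a numeric column (hypothetical helper)."""
    return f"""
        SELECT COUNT(*)                                             AS n_rows,
               SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END)    AS n_null,
               SUM(CASE WHEN {column} > 0 THEN 1 ELSE 0 END)        AS n_positive,
               SUM(CASE WHEN {column} < 0 THEN 1 ELSE 0 END)        AS n_negative,
               SUM(CASE WHEN {column} = 0 THEN 1 ELSE 0 END)        AS n_zero,
               COUNT(DISTINCT {column})                             AS n_unique
        FROM {table}
    """

# Stand-in engine and made-up data; NULL is represented by None.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct (balance REAL)")
conn.executemany(
    "INSERT INTO acct VALUES (?)",
    [(10.0,), (-5.0,), (0.0,), (None,), (10.0,), (2.5,)],
)

profile = conn.execute(values_analysis_sql("acct", "balance")).fetchone()
print(profile)
```

All six counts come back from a single pass over the table, so the data never has to leave the database to be profiled.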

Teradata Warehouse Miner Data Derivation and Transformation Functions
  • Variable Creation
    – Aggregations: Count, Average, Sum, etc.
    – Windowed Aggregates/OLAP: Rank, Quantiles, Moving Sums, etc.
    – Arithmetic operators/functions: +, -, *, /, MOD, **; ABS, EXP, LN, LOG, SQRT, etc.
    – Trigonometric & Hyperbolic functions: COS, SIN, TAN, ACOS, etc.; COSH, SINH, TANH, ACOSH, etc.
    – CASE expressions and NULL operators: valued and searched types; NULLIF, COALESCE
    – Comparison operators: =, >, <, <>, <=, >=
    – Logical predicates: TRUE, FALSE, NULL
    – Calendar functions: day_of_week, day_of_calendar, quarter_of_year, etc. (SysCalendar, etc.)
    – String functions: LOWER, UPPER, TRIM, ||, etc.
    – Data Type conversion
    – SQL predicates: BETWEEN…AND…, IN (expression list), etc.
  • Variable Dimensioning
    – Simple Dimensions: Specific values; Range of values
    – Combined Dimensions
    – Hierarchical Dimensions
  • Variable Transformation
    – Bin Coding
    – Design Coding
    – Recoding
    – Rescaling
    – Derive: Hook to Variable Creation
    – Statistical Transformations: Z-Score; Sigmoid
    – NULL Value Replacement: Literal value; Mean value; Median value; Mode; Imputed values
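
Three of the transformations named on this slide (NULL replacement with the mean, Z-Score, and Sigmoid) can be sketched in a few lines. This is plain Python for illustration only, not the product's implementation: the function names are made up, and the Z-score here uses the population standard deviation, which is an assumption since the slide does not say which form the product uses.

```python
import math
from statistics import mean, pstdev

def replace_nulls_with_mean(values):
    """Replace None with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    m = mean(present)
    return [m if v is None else v for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation (assumed)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def sigmoid(values):
    """Squash each value into the open interval (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

raw = [2.0, None, 4.0, 6.0]
filled = replace_nulls_with_mean(raw)   # the missing value becomes the mean, 4.0
z = z_score(filled)                     # mean 0, unit population variance
s = sigmoid(z)                          # a z-score of 0 maps to 0.5

print(filled, z, s)
```

The order matters: missing values are filled first so that the mean and standard deviation used for rescaling are computed over a complete column.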

Teradata Warehouse Miner Data Reorganization, Build ADS, Matrix Functions
  • Data Reorganization
    – Random Sample and Stratified Random
    – Partitioning
    – Denormalize/Pivoting
    – Joining: Inner; Left Outer; Right Outer; Full Outer
  • Build ADS
    – Create Final ADS
    – Create Metadata for Refresh
  • Matrix Functions
    – Correlation
    – Covariance
    – SSCP
    – Corrected SSCP
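
The four matrix functions listed on this slide are closely related, and a small worked sketch may help show how. The SSCP (sums of squares and cross products) matrix needs only sums of products of column pairs; the corrected SSCP subtracts the mean terms, the covariance divides by n - 1 (the sample form, assumed here), and the correlation rescales by the diagonal. The function names and the tiny data set are illustrative assumptions, not product code.

```python
import math

def sscp(rows):
    """Sums of squares and cross products: entry (i, j) is sum of x_i * x_j over rows."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def corrected_sscp(rows):
    """SSCP about the means: SSCP[i][j] - n * mean_i * mean_j."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(p)]
    raw = sscp(rows)
    return [[raw[i][j] - n * means[i] * means[j] for j in range(p)] for i in range(p)]

def covariance(rows):
    """Sample covariance: corrected SSCP divided by n - 1."""
    n = len(rows)
    return [[c / (n - 1) for c in row] for row in corrected_sscp(rows)]

def correlation(rows):
    """Pearson correlation: covariance rescaled by the two standard deviations."""
    cov = covariance(rows)
    p = len(cov)
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
            for i in range(p)]

# Two perfectly linearly related columns (second = 2 * first).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(sscp(data), corrected_sscp(data), covariance(data), correlation(data))
```

Because every entry of the SSCP matrix is a plain sum, the whole chain can start from aggregates that a database computes in one pass.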

Teradata Warehouse Miner Analytical Techniques, Scoring, Visualizations (1)
  • Analytic Algorithms (Multivariate Statistical Techniques): Linear Regression; Logistic Regression; Factor Analysis
  • Linear Regression
    – model statistics
    – variable coefficients, standard errors, confidence intervals, etc.
    – incremental R²
    – step-wise variable selection options: forward & forward only; backward & backward only
  • Factor Analysis
    – Principal Component Analysis
    – Principal Axis Factors
    – Maximum Likelihood Factors
    – Orthogonal & Oblique Rotations
  • Logistic Regression
    – Logit Model Coefficients, Odds Ratios and Statistics
    – Model Success Analysis and Lift Tables
    – step-wise variable selection options: forward & forward only; backward & backward only
  • Model Scoring
    – SQL-based model scoring
    – all scoring SQL is provided
    – SQL-based model evaluation
  • Supporting Visualizations: Scatter Plot; Lift Chart; Regression Plots; Factor Pattern; Scree Plot
  • Multivariate Diagnostics
    – Extensive Collinearity Diagnostics
    – Automated Identification of Constants
    – Row level diagnostics, and much more…
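
One way to picture “SQL-based model scoring” is a fitted logistic regression turned into a SELECT statement, so that every row is scored inside the database. The sketch below is a guess at the general idea, not the scoring SQL the product provides. The helper `logistic_scoring_sql`, the table `customers`, its columns, and the coefficients are all invented, and SQLite (with an `EXP` function registered from Python) stands in for Teradata.

```python
import math
import sqlite3

def logistic_scoring_sql(table, key, intercept, coefficients):
    """Build a SELECT that applies p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    terms = " + ".join(f"({coef!r} * {col})" for col, coef in coefficients.items())
    logit = f"{intercept!r} + {terms}"
    return (
        f"SELECT {key}, 1.0 / (1.0 + EXP(-({logit}))) AS probability "
        f"FROM {table} ORDER BY {key}"
    )

# Stand-in engine; EXP is registered explicitly in case the SQLite build lacks it.
conn = sqlite3.connect(":memory:")
conn.create_function("EXP", 1, math.exp)
conn.execute("CREATE TABLE customers (cust_id INTEGER, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 0.0, 0.0), (2, 2.0, 0.5)],
)

# Hypothetical fitted model: intercept 0.0, coefficients 1.0 and -2.0.
sql = logistic_scoring_sql("customers", "cust_id", 0.0, {"x1": 1.0, "x2": -2.0})
scores = conn.execute(sql).fetchall()
print(sql)
print(scores)
```

Since the model is just an expression over columns, scoring needs no data movement: the same statement could feed an INSERT into a scored data set.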

Teradata Warehouse Miner Analytical Techniques, Scoring, Visualizations (2)
  • Analytic Algorithms (AI and Machine Learning Techniques): Decision Trees; Clustering; Affinity and Sequence Analyses
  • Decision Tree/Rule Induction
    – gini / regression (i.e., CART)
    – Entropy (i.e., C4.5 / C5.0)
    – CHAID
    – pruning: gini algorithm pruning; gain ratio algorithm pruning; manual pruning
  • Clustering
    – K-Means
    – Nearest Neighbor Linkage
    – Expectation Maximization: Gaussian Mixture Model; Poisson Mixture Model
    – variable importance report
  • Affinity and Sequence Analyses
    – Feature Rich Implementations
    – Support; Confidence; Lift; z-Score
  • Model Scoring
    – SQL-based model scoring
    – all scoring SQL is provided
  • Model Evaluation
    – truth table (confusion matrix)
    – model statistics & indices
    – SQL-based model evaluation
  • Supporting Visualizations: Graphical Tree Browser; Interactive Pruning; Text Rules; Distributions; Lift Charts; Cluster Sizes / Distance / Measures; Association Color Map
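
The affinity measures named on this slide have standard textbook definitions, sketched below for a single rule “left item implies right item”. The function name and the four baskets are made up for illustration, and the z-Score measure is left out because the slide does not define it.

```python
def affinity_measures(baskets, lhs, rhs):
    """Support, confidence, and lift for the rule lhs -> rhs (textbook definitions)."""
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs in b)
    n_rhs = sum(1 for b in baskets if rhs in b)
    n_both = sum(1 for b in baskets if lhs in b and rhs in b)

    support = n_both / n               # share of baskets containing both items
    confidence = n_both / n_lhs        # share of lhs baskets that also contain rhs
    lift = confidence / (n_rhs / n)    # confidence relative to the base rate of rhs
    return support, confidence, lift

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

support, confidence, lift = affinity_measures(baskets, "bread", "milk")
print(support, confidence, lift)
```

In this toy data the lift is below 1, meaning milk is slightly less likely in baskets with bread than in baskets overall. All three measures reduce to counts, which is why they can be computed with grouped aggregates in SQL.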

Teradata Warehouse Miner Statistical Tests
  • Binomial Tests: Binomial; Sign
  • Normality/Equality Tests: Kolmogorov-Smirnov; Lilliefors Test; Shapiro-Wilk; D’Agostino & Pearson Omnibus; Smirnov
  • Rank Tests: Mann-Whitney (Kruskal-Wallis); Wilcoxon; Friedman
  • Contingency Table Tests: Chi-square; Median
  • Parametric Tests: F (Two Way) Unequal Sample Size; F (N-Way) Equal Sample Size; T

Why Did We Build Teradata Warehouse Miner?
Integrated Data Mining Environment
  • Other Technologies (Modelers Build Models, Business Deploys Models)
    – Inefficient Environment - Elapsed and Execution Times
    – Continual Data Movement
    – Data Redundancy
    – Metadata Inconsistencies
    – “Many Versions of The Truth”
  • Teradata and TWM (Modelers Build Models, Business Deploys Models)
    – Efficiently Architected Environment - MPP Performance and Scalability
    – No Data Movement
    – No Data Redundancy
    – Shared Metadata
    – “One Version of The Truth”

Why are Integrated Analytics Important?
Efficiency, Performance & Scalability
  • Mine data in an integrated environment
    – Huge data volumes – leverages the parallelism of Teradata
    – Minimize data redundancy
    – Eliminate proprietary data structures
    – Simplify data & system management
    – Better results using larger amounts of detailed data
    – Eliminate potential errors during data movement & external sampling
    – Integrated model building and scoring
    – Reduced overall modeling time
    – Many resulting elapsed and execution time improvements have been astronomical!
[Diagram labels: Source Data, Analytic Data Set, Analytic Metadata, Modelers Build Models, Business Deploys Models]

The Teradata Warehouse Miner Goal
Enable Entire Data Mining Process In Teradata Data Warehouse
[Diagram labels: Source Data, Data Pre-Processing, Analytic Data Set, Analytical Modeling, Analytic Metadata, Model Deployment, Scored Data Set]
  • data starts and ends in the database
  • open to accommodate 3rd party partner tools

Teradata Warehouse Miner Projects and Analytic Modules
  • Teradata Warehouse Miner Projects contain one or more tasks
  • each task is called an Analytic Module
    – eight categories of analytic modules
      • ADS (Analytical Data Set generation): Variable Creation; Variable Transformation; Build ADS
      • Analytics (Analytic Algorithms)
      • Descriptive Statistics
      • Matrix Functions (correlation, …)
      • Miscellaneous: free form SQL, …
      • Reorganization (Structure of Data)
      • Scoring (and Model Evaluation)
      • Statistical Tests
  • Analytic Modules are the fundamental building blocks used to conduct data analysis in Teradata Warehouse Miner

Teradata Warehouse Miner Elements in the Primary Window
[Screenshot callouts]
  • Main Menus
  • Main Toolbar
  • Project Icon
  • Analytic Module Icon
  • Open, Save, and Save All Icons
  • ODBC Connection Icon
  • Connection Properties Icon
  • Run and Stop Icons
  • Data Source Status
  • Project Area
  • Analysis Set-up and Results Viewing Area (hmmm… I wonder what else might fill this large gray area some day...)
  • Runtime Message Area

Teradata Warehouse Miner The 7 Steps to Results
  • there are 7 basic steps in the use of Teradata Warehouse Miner*
    – connect to an ODBC data source with appropriate permissions
    – create a new (or open an existing) Project
    – add at least one Analytic Module to the Project
    – set input and analytic options
      • select table(s) and column(s) to be analyzed
      • set Analytic Module parameters**
      • set other Analytic Module options as necessary**
    – set output and results options
    – execute the Analytic Module (using the run icon)
      • optionally, save the Project(s) and Analyses
    – examine, interpret, and use results of interest**
  • that’s it
* use these steps after you or a system administrator has set up an ODBC Data Source (DSN) on your PC. The DSN must point to source, result, and metadata Teradata databases for which you have appropriate permissions
** setting Analytic Module options, and interpreting and using results appropriately, requires expertise specific to the Analytic Module chosen

Using Teradata Warehouse Miner The 7 Steps to Results: An Example

Teradata Warehouse Miner Step 1 - connect to an ODBC data source

Teradata Warehouse Miner Step 2 - create a new Project

Teradata Warehouse Miner Step 3 - add an Analytic Module to the Project

Teradata Warehouse Miner Step 4 – set input and analytic options (select table and columns to be analyzed)

Teradata Warehouse Miner Step 4 – set input and analytic options (set Analytic Module parameters)

Teradata Warehouse Miner Step 4 – set input and analytic options (set other Analytic Module options as necessary)

Teradata Warehouse Miner Step 5 – set output and results options
**Note: This screen-shot is from a Scoring Module for the analytic algorithm module used in this example

Teradata Warehouse Miner Step 6 - execute the Analytic Module

Teradata Warehouse Miner Step 6 - execute the Analytic Module (optionally, save the Project(s) and Analyses)

Teradata Warehouse Miner Step 7 - examine, interpret, and use results (1)

Teradata Warehouse Miner Step 7 - examine, interpret, and use results (2)

Tips for Navigating the Teradata Warehouse Miner Interface
  • on-line help and user’s guide
    – very extensive and thorough
    – tutorials for each function
    – describes many of the analytical techniques in detail
    – many reference formulae are provided
    – use these liberally
  • menus and toolbar
  • runtime message area
  • setting program options and preferences
    – global
    – run-time
  • setting up Project Directories for files on PC client
    – optionally, for local HTML reports and associated graphics

Teradata Warehouse Miner Demo
TWM, an enabling technology to assist in addressing qualified business questions that are well suited to the processes of decision support and data mining (data exploration – data transformation – exploratory modeling – model building and validation – scoring and evaluation – lifecycle maintenance – …)

University of Arkansas Data Mining with Teradata™ Warehouse Miner
Questions and Discussion