Elective-I: Data Mining Techniques and Applications


  • Number of slides: 65

Elective-I: Data Mining Techniques and Applications
Examination Scheme:
- In-semester assessment: 30
- End-semester assessment: 70
Text Books:
- Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber
- Introduction to Data Mining with Case Studies, by G. K. Gupta
Reference Books:
- Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti
- Reinforcement and Systemic Machine Learning for Decision Making, by Parag Kulkarni

Unit 1: Introduction, Knowledge of Data, Data Preprocessing
- Data mining described
- Need of data mining
- Kinds of patterns and technologies
- Issues in mining
- KDD vs. data mining
- Machine learning concepts
- OLAP
- Knowledge representation
- Data preprocessing: cleaning, integration, reduction, transformation, and discretization
- Application with a mining aspect (weather prediction)

Some Definitions
Data: Data are any facts, numbers, or text that can be processed by a computer.
- Operational or transactional data, such as sales, cost, inventory, payroll, and accounting
- Nonoperational data, such as industry sales, forecast data, and macroeconomic data
- Metadata: data about the data itself, such as a logical database design or data dictionary definitions
Information: The patterns, associations, or relationships among all this data can provide information.

Definitions Continued…
Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in terms of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
Data Warehouses: Data warehousing is defined as a process of centralized data management and retrieval.

Why the Need for Data Mining?
The explosive growth of data: from terabytes to petabytes
- Data collection and data availability
▪ Automated data collection tools, database systems, the Web, a computerized society
- Major sources of abundant data
▪ Business: Web, e-commerce, transactions, stocks, …
▪ Science: remote sensing, bioinformatics, scientific simulation, …
▪ Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets

What Is Data Mining?
Data mining is the process of sorting through large amounts of data and picking out relevant information.
In other words, data mining (knowledge discovery from data) is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
Other names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What Is Data Mining?
Searching through large amounts of data for correlations, sequences, and trends. Current "driving applications" are in sales (targeted marketing, inventory) and finance (stock picking).

Data Warehouse example (figure)

Data Rich, Information Poor (figure)

Data Mining process (figure)

Knowledge Discovery from Data
The KDD process includes:
- Data cleaning (to remove noise and inconsistent data)
- Data integration (where multiple data sources may be combined)
- Data selection (where data relevant to the analysis task are retrieved from the database)
- Data transformation (where data are transformed or consolidated into forms appropriate for mining, by performing summary or aggregation operations)

KDD Continued…
- Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
- Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
- Knowledge presentation (where visualization and knowledge-representation techniques are used to present the mined knowledge to the user)
Data mining is the core of the knowledge discovery process.

Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process. (Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation)

Knowledge Process
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: to combine multiple sources
3. Data selection: to retrieve relevant data for analysis
4. Data transformation: to transform data into an appropriate form for data mining
5. Data mining
6. Evaluation
7. Knowledge presentation
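The steps above can be sketched as a minimal pipeline. This is only an illustration: the function bodies, the toy records, and the "high-spending customer" pattern are made-up stand-ins, not a standard API.

```python
# Minimal sketch of the KDD pipeline (all functions and data are toy stand-ins)

def clean(records):
    # Data cleaning: drop records containing missing (None) values
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    # Data integration: combine multiple sources into one collection
    combined = []
    for s in sources:
        combined.extend(s)
    return combined

def select(records, fields):
    # Data selection: keep only the task-relevant attributes
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    # Data transformation: aggregate purchase amounts per customer
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals

def mine(totals, threshold):
    # Data mining: extract a simple pattern (customers spending above a threshold)
    return {c for c, t in totals.items() if t > threshold}

src_a = [{"customer": "ann", "amount": 120, "store": "A"},
         {"customer": "bob", "amount": None, "store": "A"}]
src_b = [{"customer": "ann", "amount": 80, "store": "B"},
         {"customer": "cal", "amount": 30, "store": "B"}]

data = select(clean(integrate(src_a, src_b)), ["customer", "amount"])
patterns = mine(transform(data), threshold=100)
print(patterns)  # {'ann'}
```

The evaluation and presentation steps are omitted; in practice each stage is far more involved, but the dataflow is the same.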

Knowledge Process
Steps 1 to 4 are different forms of data preprocessing. Although data mining is only one step in the entire process, it is an essential one, since it uncovers hidden patterns for evaluation.

Knowledge Process
Based on this view, the architecture of a typical data mining system may have the following major components:
- Database, data warehouse, World Wide Web, or other information repository
- Database or data warehouse server
- Data mining engine
- Pattern evaluation module
- User interface

Data Mining: On What Kind of Data?
Relational databases

Data Mining: On What Kind of Data?
Data warehouses

Data Mining: On What Kind of Data?
Transactional databases
Advanced data and information systems:
▪ Object-oriented databases
▪ Temporal, sequence, and time-series databases
▪ Spatial databases
▪ Text and multimedia databases
▪ … and the WWW

Data Mining: Confluence of Multiple Disciplines
(Figure: data mining draws on database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines.)

What Kind of Patterns Can We Mine?
In general, data mining tasks can be classified into two categories: descriptive and predictive.
- Descriptive mining tasks characterize the general properties of the data in the database.
- Predictive mining tasks perform inference on the current data in order to make predictions.

Functionalities:
- Class description: characterization and discrimination
- Mining frequent patterns, associations, and correlations
- Classification and prediction
- Cluster analysis
- Outlier analysis
- Evolution analysis

Concept/Class Description: Characterization and Discrimination
Data characterization: A data mining system should be able to produce a description summarizing the characteristics of customers.
Example: the characteristics of customers who spend more than $1000 a year at a store called AllElectronics. The result can be a general profile, such as age, employment status, or credit rating.

Characterization and Discrimination Continued…
Data discrimination: a comparison of the general features of target-class data objects with the general features of objects from one or a set of contrasting classes. The user can specify the target and contrasting classes.
Example: the user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by about 30% in the same period.

Mining Frequent Patterns, Associations, and Correlations
Frequent patterns: as the name suggests, patterns that occur frequently in data.
Association analysis: from a marketing perspective, determining which items are frequently purchased together within the same transaction.
Example, mined from the AllElectronics transactional database:
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
X represents a customer. Confidence = 50% means that if a customer buys a computer, there is a 50% chance that he/she will buy software as well. Support = 1% means that 1% of all the transactions under analysis showed that computer and software were purchased together.
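Support and confidence for a rule like this can be computed directly from a transaction list. A minimal sketch, using a made-up five-transaction dataset (the numbers here are illustrative and differ from the 1%/50% in the slide):

```python
# Compute support and confidence for the rule {computer} => {software}
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset, transactions):
    # Fraction of all transactions that contain every item in the itemset
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    # Conditional probability of rhs given lhs: support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"computer", "software"}, transactions))       # 0.4
print(confidence({"computer"}, {"software"}, transactions))  # 0.666...
```

Here 2 of 5 transactions contain both items (support 40%), and 2 of the 3 computer-buying transactions also contain software (confidence about 67%).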

Mining Frequent Patterns, Associations, and Correlations
Another example, a multidimensional rule:
age(X, "20…29") ∧ income(X, "20K…29K") ⇒ buys(X, "CD player") [support = 2%, confidence = 60%]
For customers between 20 and 29 years of age with an income of $20,000 to $29,000, there is a 60% chance they will purchase a CD player, and 2% of all the transactions under analysis showed that customers in this age group with that income range bought a CD player.

Classification and Prediction
Classification is the process of finding a model that describes and distinguishes data classes or concepts; this model is then used to predict the class of objects whose class label is unknown.
A classification model can be represented in various forms, such as:
- IF-THEN rules
- A decision tree
- A neural network
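The IF-THEN-rule representation can be sketched very directly. The rules, attribute names, and class labels below are invented for illustration; a real rule set would be learned from data:

```python
# Minimal IF-THEN rule classifier (rules and thresholds are illustrative only)
def classify(customer):
    # Each rule has the form: IF condition THEN class; the first matching rule wins
    if customer["age"] < 25 and customer["student"]:
        return "buys_computer"
    if customer["income"] > 50000:
        return "buys_computer"
    return "does_not_buy"  # default class when no rule fires

print(classify({"age": 22, "student": True, "income": 10000}))   # buys_computer
print(classify({"age": 40, "student": False, "income": 30000}))  # does_not_buy
```

A decision tree encodes the same kind of logic as nested conditions; a neural network replaces the explicit rules with learned weights.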

Classification Model (figure)

Cluster Analysis
Clustering analyzes data objects without consulting a known class label.
Example: cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing.
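One common clustering technique is k-means, sketched below on a made-up set of 2-D customer locations (the points, k=2, and the fixed iteration count are all assumptions for the example, not part of the slide):

```python
# Minimal k-means sketch on toy 2-D customer locations
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)  # pick k distinct points as initial centres
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: each centre moves to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Because the two groups are well separated, the algorithm splits the six points into two clusters of three regardless of which initial centres are sampled.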

Cluster Analysis
The figure shows a 2-D plot of customers with respect to customer locations in a city.

Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
Example: finding fraudulent usage of credit cards. Outlier analysis may uncover fraudulent credit card usage by detecting purchases of extremely large amounts for a given account number, in comparison with the regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase, or the purchase frequency.

Technologies Used…
Data mining includes many techniques from the domains below:
- Statistics
- Machine learning
- Database systems and data warehouses
- Information retrieval
- Visualization
- High-performance computing

Technologies Continued…
Statistics: the study of the collection, analysis, interpretation, and presentation of data.
- Statistical research develops tools for prediction and forecasting using data.
- Statistical methods can also be used to verify data mining results.

Continued…
Information retrieval: the science of searching for documents, or for information within documents.

Continued…
Database systems and data warehouses: this research focuses on the creation, maintenance, and use of databases for organizations and end users.

Continued…
Machine learning: investigates how computers can learn, or improve their performance, based on data.

KDD vs. Data Mining
KDD (Knowledge Discovery in Databases) is a field of computer science which includes the tools and theories to help humans extract useful and previously unknown information (i.e., knowledge) from large collections of digitized data. KDD consists of several steps, and data mining is one of them.

Continued…
This process deals with the mapping of low-level data into other forms that are more compact, abstract, and useful. This is achieved by creating short reports, modelling the process of generating data, and developing predictive models that can predict future cases.
Data mining is the application of a specific algorithm in order to extract patterns from data.

What Is the Difference Between KDD and Data Mining?
Although the two terms KDD and data mining are heavily used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while data mining is a step inside the KDD process which deals with identifying patterns in data. In other words, data mining is only the application of a specific algorithm based on the overall goal of the KDD process.

Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization
- Summary

Why Data Preprocessing?
Data in the real world is dirty:
- Incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate data
▪ e.g., occupation = ""
- Noisy: containing errors or outliers
▪ e.g., Salary = "-10"
- Inconsistent: containing discrepancies in codes or names
▪ e.g., Age = "42" but Birthday = "03/07/1997"
▪ e.g., was rating "1, 2, 3", now rating "A, B, C"
▪ e.g., discrepancy between duplicate records

Why Is Data Preprocessing Important?
No quality data, no quality mining results! Quality decisions must be based on quality data.
▪ e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data preparation, cleaning, and transformation comprise the majority of the work in a data mining application (around 90%).

Multi-Dimensional Measure of Data Quality
A well-accepted multi-dimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Accessibility

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies
- Data integration: integration of multiple databases or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation in volume, but produces the same or similar analytical results
- Data discretization (for numerical data)

Data Preprocessing…
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization
- Summary

Data Cleaning…
Importance: "Data cleaning is the number one problem in data warehousing."
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration

Missing Data
Data is not always available; e.g., many tuples have no recorded values for several attributes, such as customer income in sales data.
Missing data may be due to:
- Equipment malfunction
- Data being inconsistent with other recorded data and thus deleted
- Data not entered due to misunderstanding
- Certain data not being considered important at the time of entry
- No registered history or changes of the data
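One standard way to handle missing values is to fill them in with the attribute mean. A minimal sketch, using `None` to mark a missing value and a made-up income list:

```python
# Mean imputation: replace missing (None) values with the mean of the known values
def fill_missing_with_mean(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)  # assumes at least one known value
    return [mean if v is None else v for v in values]

incomes = [30000, None, 50000, None, 40000]
print(fill_missing_with_mean(incomes))  # [30000, 40000.0, 50000, 40000.0, 40000]
```

Other options include ignoring the tuple, filling in a global constant, or using the class-conditional mean or a predicted value.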

Noisy Data…
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
- Faulty data collection instruments
- Data entry problems
- Data transmission problems, etc.
Other data problems which require data cleaning: duplicate records, incomplete data, inconsistent data.

How to Handle Noisy Data?
- Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and check them by a human (e.g., deal with possible outliers)

Binning Methods for Data Smoothing…
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 25, 25
- Bin 3: 26, 26, 26, 34
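Both smoothing schemes are easy to implement. A minimal sketch over the price bins from the example above (rounding the bin means to whole dollars is an assumption of this sketch):

```python
# Smooth the example price bins by bin means and by bin boundaries
bins = [[4, 8, 9, 15], [21, 24, 25], [26, 28, 29, 34]]

def smooth_by_means(bins):
    # Replace every value in a bin by that bin's (rounded) mean
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by whichever bin boundary (min or max) is nearer
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 25, 25], [26, 26, 26, 34]]
```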

Outlier Removal…
Outliers are data points inconsistent with the majority of the data: noisy, widely deviating points, e.g., an age of 200.
Removal methods:
- Clustering
- Curve fitting
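A simple statistical alternative to clustering or curve fitting is to flag values that lie far from the mean. The 2-standard-deviation threshold below is an assumed convention, not a rule from the slides:

```python
# Flag values more than z_thresh standard deviations away from the mean
def outliers(values, z_thresh=2.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > z_thresh * std]

ages = [25, 30, 28, 27, 26, 29, 200]
print(outliers(ages))  # [200]
```

This catches the deck's age-200 example while leaving the plausible ages alone.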

Data Preprocessing…
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization

Data Integration…
Data integration combines data from multiple sources.
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ, e.g., different scales, metric vs. British units
- Removing duplicates and redundant data

Data Transformation…
- Smoothing: remove noise from the data
- Normalization: scale values to fall within a small, specified range (e.g., -1.0 to 1.0, or 0.0 to 1.0)
- Attribute/feature construction: new attributes constructed from the given ones
- Aggregation: summarization
- Generalization: concept hierarchy climbing
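Normalization to a 0.0-to-1.0 range is usually done with min-max scaling. A minimal sketch on a made-up attribute:

```python
# Min-max normalization: linearly rescale values into [new_min, new_max]
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print(min_max([20, 40, 60, 100]))  # [0.0, 0.25, 0.5, 1.0]
```

Note that min-max scaling is sensitive to outliers, since a single extreme value stretches the range for all the others; z-score normalization is a common alternative.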

Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization
- Summary
CS 583, Bing Liu, UIC

Data Reduction Strategies
Data is often too big to work with. Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data reduction strategies:
- Dimensionality reduction: remove unimportant attributes
- Aggregation and clustering
- Sampling
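The sampling strategy can be sketched as simple random sampling without replacement; the 10% fraction and fixed seed below are assumptions for the example:

```python
# Simple random sampling without replacement: one data-reduction strategy
import random

def sample(records, fraction, seed=42):
    random.seed(seed)  # fixed seed so the reduction is reproducible
    k = int(len(records) * fraction)
    return random.sample(records, k)

data = list(range(1000))
reduced = sample(data, 0.1)
print(len(reduced))  # 100
```

Analysis on the reduced set then approximates the results on the full data, at a tenth of the cost.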

Dimensionality Reduction
Feature selection (i.e., attribute subset selection): select a minimum set of attributes (features) that is sufficient for the data mining task.

Clustering…
Partition the data set into clusters.

Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization

Discretization
Three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers
Discretization: divide the range of a continuous attribute into intervals, because some data mining algorithms only accept categorical attributes.
Some techniques:
- Binning methods: equal-width, equal-frequency
- Entropy-based methods
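The two binning techniques behave very differently on skewed data. A minimal sketch, using a made-up attribute with one extreme value:

```python
# Equal-width vs. equal-frequency discretization of a continuous attribute
def equal_width_bins(values, k):
    # Assign each value a bin index 0..k-1 over k equal-width intervals
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_freq_bins(values, k):
    # Rank the values, then cut the ranking into k bins of (roughly) equal size
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

vals = [1, 2, 3, 4, 5, 6, 7, 100]
print(equal_width_bins(vals, 4))  # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_freq_bins(vals, 4))   # [0, 0, 1, 1, 2, 2, 3, 3]
```

The outlier 100 crowds almost everything into one equal-width bin, while equal-frequency binning still spreads the values evenly.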

Discretization and Concept Hierarchy
Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
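Climbing the age hierarchy from the example is a one-line mapping once the cutoffs are chosen; the 40/60 cutoffs below are assumed for illustration, since the slides do not specify them:

```python
# Replace a numeric age with a higher-level concept label (cutoffs are assumed)
def age_concept(age):
    if age < 40:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in [25, 45, 70]])  # ['young', 'middle-aged', 'senior']
```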

Summary of Data Preprocessing
Data preparation is a big issue for data mining. Data preparation includes:
- Data cleaning and data integration
- Data reduction and feature selection
- Discretization
Many methods have been proposed, but this is still an active area of research.