- Number of slides: 23
DATA, TEXT, AND WEB MINING Chapter 7: Pages 304-309, 311; Sections 7.3, 7.5, 7.6
Data mining: A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify new knowledge from large databases. It recognizes the untapped value of data in large databases. You may unexpectedly strike it rich in understanding relationships among data.
Example Task: Find the best route to cover the territory
Challenge of finding relationships in large databases
Connect points of equal elevation to make a contour map. The dark vertical line shows the best route to cross the territory without falling off a cliff.
Once relationships are discovered, they can be used for prediction
Uses of Data Mining-1
• Classification: Identify the attribute of interest (e.g., you want to classify who is likely to pay late). Examine all other attribute values of the customer from the data warehouse and locate the one that is most strongly related to the attribute of interest (e.g., monthly income level).
• Mining algorithm: The most common algorithm used for classification is the decision tree. The Gini index helps determine where to place the split between two classes (e.g., at what income level) and is used in developing decision trees (see the example on page 316).
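The role of the Gini index in choosing a split point can be illustrated with a small sketch. It assumes the usual Gini impurity, 1 minus the sum of squared class proportions, and tries every midpoint between neighboring income values. The income figures, labels, and function names are hypothetical and are not the textbook's page 316 example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the rule 'value <= threshold'
    that gives the lowest weighted Gini impurity over the two sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold fits between two equal values
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: monthly income and whether the customer paid late.
incomes = [1500, 1800, 2200, 2600, 3100, 3500, 4200, 5000]
paid = ["late", "late", "on_time", "late",
        "on_time", "on_time", "on_time", "on_time"]

threshold, score = best_split(incomes, paid)
print(threshold, score)  # 2850.0 0.1875
```

A decision tree would apply the same search recursively to each side of the split, and across all candidate attributes rather than income alone.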
Which product class is the best seller? Conclusion: Clay products with a price below $25!
Uses of Data Mining-2
• Segmentation: Partitioning a database into groups in which the members of each group share similar characteristics.
• Mining algorithm: Clustering. The objective is to sort cases into groups so that similarities are strong among members of the same cluster and weak between members of different clusters.
E.g., companies with over 100 employees may share characteristics (e.g., revenue size) that differ from those of companies with fewer than 100 employees. This knowledge can help with developing different policies for dealing with different types of companies.
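As a rough illustration of clustering, here is a minimal k-means sketch. K-means is only one of many clustering algorithms and is chosen here as an assumption, since the slides do not name one. The company figures are hypothetical, and the features are left unscaled for brevity, so the employee count dominates the distance.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on a list of (x, y) tuples.
    The first k points serve as initial centroids (deterministic)."""
    centroids = list(points[:k])
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    (sum(m[0] for m in members) / len(members),
                     sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep it
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids

# Hypothetical companies: (number of employees, revenue in $ millions).
companies = [(20, 1.5), (450, 60.0), (35, 2.0),
             (50, 3.5), (300, 40.0), (600, 85.0)]

labels, centers = kmeans(companies, k=2)
print(labels)  # [0, 1, 0, 0, 1, 1]
```

The small companies end up in one cluster and the large ones in the other, without the 100-employee boundary ever being given to the algorithm.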
Uses of Data Mining-3
• Association: A category of data mining algorithm that establishes relationships among items that occur together in a given record.
E.g., you may discover from the data that senior students take certain elective courses together in the final semester. This can be helpful for scheduling courses.
People who buy a suit may also buy a dress shirt.
People who buy swimwear may also buy fins, goggles, a cap, etc.
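One common way to quantify such co-occurrence is with support (the share of records containing both items) and confidence (the share of records containing item A that also contain item B). The sketch below computes both for item pairs only. The baskets, the `min_support` cut-off, and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.3):
    """Return (antecedent, consequent, support, confidence) tuples
    for every item pair whose support reaches min_support."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical purchase records.
baskets = [
    ["suit", "dress shirt", "tie"],
    ["suit", "dress shirt"],
    ["swimwear", "goggles", "cap"],
    ["swimwear", "fins", "goggles"],
    ["suit", "tie"],
]

rules = {(a, b): (s, c) for a, b, s, c in pair_rules(baskets)}
support, confidence = rules[("suit", "dress shirt")]
print(support, round(confidence, 2))  # 0.4 0.67
```

Here two of the three suit buyers also bought a dress shirt, so the rule "suit implies dress shirt" has a confidence of about 0.67.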
Uses of Data Mining-4
• Sequence discovery: The identification of associations over time, that is, discovering the order in which events occur. The algorithm can examine data and predict which event is most likely to occur next. It is widely used in studying how visitors navigate a Web site, which helps to improve the chances of making a sale.
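A very simple stand-in for sequence discovery is to count which page follows which across recorded visits and predict the most frequent follower. This first-order counting approach is an assumption chosen for brevity and is not a full sequential-pattern algorithm. The session data and function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_transitions(sessions):
    """Count how often each page is immediately followed by each other page."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            transitions[current][following] += 1
    return transitions

def predict_next(transitions, page):
    """Return the page that most often follows `page`, or None if unseen."""
    followers = transitions.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical click paths through a Web site.
sessions = [
    ["home", "catalog", "product", "cart", "checkout"],
    ["home", "search", "product", "cart"],
    ["home", "catalog", "product", "reviews"],
    ["search", "product", "cart", "checkout"],
]

transitions = build_transitions(sessions)
print(predict_next(transitions, "product"))  # cart
```

A site could use such a prediction to decide which link or offer to show most prominently on the product page.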
Uses of Data Mining-5
• Regression: A statistical technique that is used to map data to a prediction value.
• Forecasting: Estimates future values based on patterns within large sets of data.
E.g., gasoline prices this month may predict next month's sales of SUVs.
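To make the regression idea concrete, the following sketch fits a straight line by ordinary least squares and uses it to forecast. The price and sales figures are invented and deliberately lie exactly on a line so the result is easy to check. Real data would be noisy, and a single predictor is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: gasoline price this month ($ per gallon)
# and SUV sales the following month (thousands of units).
gas_price = [2.0, 2.5, 3.0, 3.5, 4.0]
suv_sales = [50.0, 45.0, 40.0, 35.0, 30.0]

slope, intercept = fit_line(gas_price, suv_sales)

# Forecast next month's SUV sales for a current price of $3.20.
forecast = slope * 3.2 + intercept
print(round(slope, 2), round(intercept, 2), round(forecast, 2))  # -10.0 70.0 38.0
```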
Data Mining Concepts and Applications
Data mining applications
– Marketing
– Banking
– Retailing and sales
– Manufacturing and production
– Brokerage and securities trading
– Insurance
– Computer hardware and software
– Government and defense
– Airlines
– Health care
– Broadcasting
– Police
– Homeland security
Text Mining
Application of data mining to text files, typically free-form text material. It discovers new knowledge that is not obvious.
Examples:
• Examine all news services, cluster similar topics, and create a new summary for each topic.
• Find the "hidden" content of documents, including additional useful relationships, e.g., lies, deceptions, scams.
Text mining is not the same as a search engine on the Web.
Text Mining – how is it done?
It entails generating meaningful numerical indices (factors) from the unstructured text and then processing these indices with various data mining algorithms.
Example:
• Extract each word from the document being text mined.
• Eliminate commonly used words (the, and, other, etc.).
• Combine synonyms and phrases.
• Calculate weights for each term:
– tf factor (term frequency): the actual number of times a word appears in a document
– idf factor (inverse document frequency): how rare the word is across multiple documents; a word that appears in most documents receives a low weight
A high tf value for a given term indicates that the document's topic is probably related to the meaning of that term!
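The weighting steps above can be sketched as follows. This is a minimal version that assumes the common formulation weight = tf × ln(N / df), where N is the number of documents and df is the number of documents containing the term. The stop-word list and sample documents are hypothetical, and the synonym-combining step is omitted.

```python
import math
from collections import Counter

# Illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "and", "other", "a", "of", "to", "is", "in", "for"}

def tokenize(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    words = (w.strip(".,!?;:\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in tf.items()})
    return weights

# Hypothetical help-desk notes.
docs = [
    "The warranty claim is for a broken screen and a broken hinge.",
    "The warranty covers the battery.",
    "Battery life and screen brightness.",
]

weights = tf_idf(docs)
top_term = max(weights[0], key=weights[0].get)
print(top_term)  # broken
```

"broken" scores highest in the first note because it appears twice there and in no other note, while "warranty" and "screen" are discounted for appearing in two of the three notes.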
Text Mining - applications
– Automatic detection of e-mail spam or phishing through analysis of the document content
– Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message
– Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses
Web Mining The discovery and analysis of interesting and useful information from the Web
Web content mining
The extraction of useful information from Web pages.
E.g., search with the help of keywords in the meta tags of a Web page.
You can analyze the document content of the first 10 links that Google returns for a search.
You can generate a summary of the contents automatically in a new document!
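As a small illustration of using meta-tag keywords, the sketch below pulls the `keywords` meta tag out of a page with Python's standard `html.parser` module. The sample page is made up, and the class and function names are hypothetical. Fetching pages from the Web is left out.

```python
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated values of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip())

def extract_keywords(html):
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords

# Hypothetical page source.
page = ('<html><head>'
        '<meta name="Keywords" content="Data Mining, text mining , web mining">'
        '</head><body>Course notes</body></html>')

keywords = extract_keywords(page)
print(keywords)  # ['data mining', 'text mining', 'web mining']
```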
Web structure mining
The development of useful information from the links included in Web documents.
If a Web site's pages predominantly link to each other, you may consider the site 'independent'.
If a collection of Web sites is heavily linked to each other, it points to a Web community or clan that shares common interests.
Example application: Web structure mining can lead to a better understanding of extremist groups.
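One way to put a number on how 'independent' a site is would be the share of its outgoing links that stay on the same host. The sketch below computes that ratio from a list of (source, target) link pairs. The URLs are placeholders, and the ratio itself is an illustrative measure rather than one taken from the slides.

```python
from collections import defaultdict
from urllib.parse import urlparse

def internal_link_ratio(links):
    """links: iterable of (source_url, target_url) pairs.
    Returns {host: share of that host's outgoing links that stay on the host}."""
    totals = defaultdict(int)
    internal = defaultdict(int)
    for source, target in links:
        source_host = urlparse(source).netloc
        target_host = urlparse(target).netloc
        totals[source_host] += 1
        if source_host == target_host:
            internal[source_host] += 1
    return {host: internal[host] / totals[host] for host in totals}

# Hypothetical crawl result.
links = [
    ("http://a.example/1", "http://a.example/2"),
    ("http://a.example/2", "http://a.example/1"),
    ("http://a.example/1", "http://b.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://b.example/x"),
]

ratios = internal_link_ratio(links)
print(round(ratios["a.example"], 2))  # 0.67
```

A ratio near 1 suggests a self-contained site, while hosts with low ratios that point heavily at one another, such as `b.example` and `c.example` here, hint at a linked community.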
Uses for Web mining
– Determine the lifetime value of clients
– Design cross-marketing strategies across products
– Evaluate promotional campaigns
– Target electronic ads and coupons at user groups
– Predict user behavior
– Present dynamic information to users
Data Mining Project Processes
Steps for Data Mining
• Problem definition: Decide the measure to study and the suitable mining algorithm (see Exercise 11).
• Data preparation: Design the cube and populate it with relevant data from the data warehouse.
• Training: Run the mining algorithm on a subset of the data warehouse data so that the system learns to find segments, associations, etc. among the data.
• Validation: Run the 'learnt' model from the previous step on the remaining subset of data and try to 'predict'. Since you have historical data, you can verify whether the 'learnt' model is any good.
• Deploy: Use the model to predict in the real environment, where you do not know the actual results.
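The training and validation steps can be sketched with a toy model. Here the 'mining algorithm' is a deliberately simple income cut-off, taken as the midpoint between the average incomes of late and on-time payers, which is an assumption for illustration. The records are hypothetical. Real projects usually split the data at random; the fixed slice below is used only to keep the example reproducible.

```python
def train_threshold(rows):
    """Learn a cut-off: the midpoint between the mean income of late payers
    and the mean income of on-time payers."""
    late = [income for income, label in rows if label == "late"]
    on_time = [income for income, label in rows if label == "on_time"]
    return (sum(late) / len(late) + sum(on_time) / len(on_time)) / 2

def predict(threshold, income):
    return "late" if income <= threshold else "on_time"

def accuracy(threshold, rows):
    """Share of rows whose known outcome matches the prediction."""
    hits = sum(predict(threshold, income) == label for income, label in rows)
    return hits / len(rows)

# Hypothetical historical records: (monthly income, payment behavior).
history = [
    (1500, "late"), (1800, "late"), (2200, "late"), (2600, "late"),
    (3100, "on_time"), (3500, "on_time"), (4200, "on_time"),
    (5000, "on_time"), (2000, "late"), (3900, "on_time"),
]

# Hold out every third record for validation; train on the rest.
validation = history[::3]
training = [row for i, row in enumerate(history) if i % 3 != 0]

model = train_threshold(training)        # Training step
score = accuracy(model, validation)      # Validation step
print(round(model, 2), score)  # 2933.33 1.0
```

Deployment would then apply `predict` to new customers whose payment behavior is not yet known.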