Data Mining
157A, Fall Semester 2006
Brent Turner

Presentation Contents:
1. What Is Data Mining
2. The DM Process
3. Ideas
4. Advantages and Problems in DM
5. Example 1 – web searches
6. Example 2 – buying habits
7. Example 3 – basketball stats
8. References

1 What is DM

2 The DM process
1. Data gathering
2. Data cleansing: eliminate errors and/or bogus data
3. Feature extraction: obtain only the interesting attributes of the data
4. Pattern extraction and discovery
5. Visualization of the data
6. Evaluation of results

3 Data Mining Ideas
Search the dataspace for a new “golden” relationship.
• Brute force: with only 40 data items there are already 2^40 = 1,099,511,627,776 (about 1.1 trillion) possible item combinations (subsets) to look at.
• Smarter search: infer or guess relationships based on other known data (association rules; causality; frequent item sets).
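
To make the brute-force figure concrete, the short Python sketch below counts the candidate combinations for 40 items. It uses only the standard library; the variable names are illustrative and do not come from the slides. The pair count is included only for comparison with the full subset count.

```python
from math import comb

n_items = 40  # number of distinct data items, as on the slide

# Each item is either in or out of a candidate set, giving 2^n subsets.
all_subsets = 2 ** n_items

# For comparison: the number of unordered pairs of items only.
pairs = comb(n_items, 2)

print(f"subsets of {n_items} items: {all_subsets:,}")
print(f"pairs of {n_items} items:   {pairs:,}")
```

The gap between the two numbers is why a smarter search restricts attention to small or frequent item sets instead of enumerating every subset.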

4 Advantages of Data Mining
• Provides new knowledge from existing data
  • Public databases
  • Government sources
  • Company databases
• Old data can be used to develop new knowledge
• New knowledge can be used to improve services or products
• Improvements lead to:
  • Bigger profits
  • More efficient service

Some problems to consider in DM
• Privacy – data dealing with personal information (e.g. medical history) may need to be kept private from employers, insurance companies, etc.
• Legality – can DM be used to screen out high-risk persons or to help prosecute a crime?
• Ethics – should we create software that can be used in unethical ways? What should be done with the new knowledge?

5 Example 1 – Web Search
a. PageRank, for discovering the most “important” pages on the Web, as used in Google.
b. Hubs and authorities, a more detailed evaluation of the importance of Web pages using a variant of the eigenvector calculation used for PageRank.
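
The eigenvector calculation mentioned above is commonly approximated by power iteration. The following is a minimal Python sketch under stated assumptions, not Google's actual implementation: the four-page `toy_web` graph is hypothetical, the function name `pagerank` is illustrative, and the damping factor of 0.85 is a commonly quoted value rather than something given in the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every
    link target is assumed to appear as a key in links as well.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: page -> pages it links to.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph, page C collects links from three other pages and ends up with the largest score, while page D, which nothing links to, keeps only the baseline share.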

6 Example 2 – Buying habits
Historical data might show that customers who purchase the Gladiator DVD and the Patriot DVD also purchase the Braveheart DVD. The data might indicate that the first two DVDs are purchased together by only 5% of all customers, but that 70% of those customers then also purchase Braveheart.

Example 2 – Buying habits
Support = 5% of customers bought both Gladiator and Patriot
Confidence = 70% of those customers will also buy Braveheart
Conclusion: Use real-time web advertising to get more sales.
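
The two measures can be computed directly from purchase records. The Python sketch below uses a made-up list of 200 customer baskets, constructed so that the numbers match the slide. The function names and the data are assumptions for illustration only. Note that it follows the slide in reporting the support of the Gladiator and Patriot pair; many textbooks instead define the support of a rule as the fraction of baskets containing all of its items.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that
    also contain consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / antecedent_support


# Hypothetical data: 200 customers, 10 of whom bought both
# Gladiator and Patriot; 7 of those 10 also bought Braveheart.
baskets = (
    [{"Gladiator", "Patriot", "Braveheart"}] * 7
    + [{"Gladiator", "Patriot"}] * 3
    + [{"Braveheart"}] * 40
    + [{"Gladiator"}] * 50
    + [{"Other"}] * 100
)

pair = {"Gladiator", "Patriot"}
print(f"support    = {support(baskets, pair):.0%}")
print(f"confidence = {confidence(baskets, pair, {'Braveheart'}):.0%}")
```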

7 Example 3 – basketball stats
In one application, IBM's Advanced Scout was developed to identify different strategies employed by basketball players in the NBA.

Pippen
Discoveries include the observation that Scottie Pippen's favorite move on the left block is a right-handed hook to the middle.

Harper
And when guard Ron Harper penetrates the lane, he shoots the ball 83% of the time.

Jordan
Also, it was noticed that 17% of Michael Jordan's offence comes on isolation plays, during which he tends to take two or three dribbles before pulling up for a jumper.

8 References
1) “Data Mining”, Oo, Aung, 2005; at www.cs.sjsu.edu/faculty/lee/cs157, accessed 11-29-2006.
2) “Data Mining Lecture Notes”, Ullman, Jeffrey D.; at infolab.stanford.edu/~ullman/mining, accessed 11-29-2006.
3) “DATA MINING Desktop Survival Guide”, Williams, Graham; at www.togaware.com/datamining/survivor, accessed 11-29-2006.
4) Pinker, Steven; at pinker.wjh.harvard.edu, accessed 11-27-2006.
5) Photographs at www.nba.com, accessed 11-29-2006.