a055bc3f06a4a894d0ab0b0aee146ec4.ppt
- Number of slides: 55
DATA MINING APPLICATIONS
Margaret H. Dunham, Southern Methodist University, Dallas, Texas 75275, mhd@engr.smu.edu
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841.
Some slides used by permission from Dr. Eamonn Keogh, University of California Riverside, eamonn@cs.ucr.edu
7/10/07 - SEDE'07
The 2000 ozone hole over the Antarctic, seen by EPTOMS
http://jwocky.gsfc.nasa.gov/multi.html#hole
OBJECTIVE
Explore some of the applications of data mining techniques.
Data Mining Applications Outline
- Introduction – Data Mining Overview
  - Classification (Prediction, Forecasting)
  - Clustering
  - Association Rules (Link Analysis)
- Applications
  - Fraud Detection & Illegal Activities
  - Facial Recognition
  - Cheating & Plagiarism
  - Bioinformatics
- Conclusions
Data Mining Overview
- Finding hidden information in a database
- Fit data to a model
- You must know what you are looking for
- You must know how to look for it
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
- Description: Classification (Profiling)
- Behavior: Clustering (Similarity)
- Associations: Link Analysis
Classification Applications
- Teachers classify students’ grades as A, B, C, D, or F.
- Letter Recognition
- Handwriting Recognition
- Phishing: http://computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=cybercrime_hacking&articleId=9002996&taxonomyId=82
- Pluto: http://www.npr.org/templates/story.php?storyId=5705254
Classification Example
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
[Figure panels: Katydids, Grasshoppers. (c) Eamonn Keogh, eamonn@cs.ucr.edu]
[Figure: plot of Antenna Length (vertical axis, 1 to 10) against Abdomen Length (horizontal axis, 1 to 10) for Grasshoppers and Katydids. (c) Eamonn Keogh, eamonn@cs.ucr.edu]
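One way to make the insect classification task concrete is a k-nearest-neighbor vote over the two measured features. The sketch below is only an illustration: the measurements are hypothetical (they are not read from the plot), and k-nearest-neighbor is one possible classifier chosen here, not necessarily the one used in the original slides.

```python
import math
from collections import Counter

# Hypothetical (abdomen_length, antenna_length) pairs; NOT taken from the slide's plot.
TRAINING = [
    ((2.0, 2.0), "Grasshopper"),
    ((3.0, 3.0), "Grasshopper"),
    ((4.0, 2.0), "Grasshopper"),
    ((5.0, 4.0), "Grasshopper"),
    ((6.0, 3.0), "Grasshopper"),
    ((3.0, 7.0), "Katydid"),
    ((4.0, 8.0), "Katydid"),
    ((5.0, 9.0), "Katydid"),
    ((6.0, 7.0), "Katydid"),
    ((7.0, 9.0), "Katydid"),
]


def classify(point, training=TRAINING, k=3):
    """Label `point` by majority vote among its k nearest training examples."""
    # Sort the annotated examples by Euclidean distance to the unlabeled point.
    nearest = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


short_antenna_insect = classify((4.0, 3.0))
long_antenna_insect = classify((5.0, 8.0))
```

With this made-up data, insects with short antennae fall among the grasshoppers and those with long antennae among the katydids; real data would need feature scaling and a held-out test set.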
Clustering Applications
- Targeted Marketing
- Determining Gene Functionality
- Identifying Species
- Clustering vs. Classification
  - No prior knowledge
    - Number of clusters
    - Meaning of clusters
  - Unsupervised learning
http://149.170.199.144/multivar/ca.htm
What is Similarity?
(c) Eamonn Keogh, eamonn@cs.ucr.edu
Association Rules Applications
- People who buy diapers also buy beer
- If gene A is highly expressed in this disease then gene B is also expressed
- Relationships between people
- www.amazon.com
- Book Stores
- Department Stores
- Advertising
- Product Placement
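A rule such as "people who buy diapers also buy beer" is usually judged by its support and confidence. The sketch below computes both on a handful of hypothetical market-basket transactions; the data and the function names are illustrative assumptions, not part of the original slides.

```python
# Hypothetical market-basket transactions, invented for illustration only.
TRANSACTIONS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


rule_support = support(TRANSACTIONS, {"diapers", "beer"})
rule_confidence = confidence(TRANSACTIONS, {"diapers"}, {"beer"})
```

Here two of the five baskets contain both items, and two of the three diaper baskets also contain beer, so the rule diapers => beer has support 2/5 and confidence 2/3 on this toy data.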
Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
Data Mining Applications Outline, next section: Applications, Fraud Detection & Illegal Activities
Fraud Detection
- Identify fraudulent behavior
- Used extensively in the financial, law enforcement, health care, and other sectors
- http://www.aaai.org/AITopics/html/fraud.html
- SPSS: http://www.spss.com/predictiveclaims/fraud_detection.htm
- Neural Technologies: http://www.neuralt.com/fraud_management.html
Law Enforcement
- Identify suspect behavior and relationships
- i2 Inc.
  - Investigative analytic/visualization software
  - http://www.i2inc.com
- Social Network Analysis: analyze patterns of relationships
  - Relationships: personal, religious, operational, etc.
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.
Data Mining Applications Outline, next section: Applications, Facial Recognition
How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facialrecognition1.htm
Facial Recognition
- Based upon features in the face
- Convert face to a feature vector
- Less invasive than other biometric techniques
- http://www.face-rec.org
- http://computer.howstuffworks.com/facialrecognition.htm
- SIMS: http://www.casinoincidentreporting.com/Products.aspx
(c) Eamonn Keogh, eamonn@cs.ucr.edu
Data Mining Applications Outline, next section: Applications, Cheating & Plagiarism
Cheating on Multiple Choice Tests
- Similarity between tests is based on the number of common wrong answers (George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol 27, no 7, 2000, pp 909-923).
- The number of common correct answers is often ignored.
- H-H Index (D. N. Harpp, J. J. Hogan, and J. S. Jennings, 1996, “Crime in the Classroom – Part II, an update,” Journal of Chemical Education, vol 73, no 4, pp 349-351):
  H-H = (Number of exact answers in common) / (Number of different answers)
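The H-H ratio can be sketched in a few lines. This is a minimal illustration under stated assumptions: "exact answers in common" is read here as identical wrong answers (consistent with the note that common correct answers are often ignored), "different answers" as questions where the two students chose different options, and the answer key and answer sheets are hypothetical.

```python
def hh_index(answers_a, answers_b, key):
    """Ratio of identical wrong answers to differing answers for two students.

    Assumption: "exact answers in common" means both students picked the
    same incorrect option; this reading is an interpretation of the slide.
    """
    if not (len(answers_a) == len(answers_b) == len(key)):
        raise ValueError("answer sheets and key must have the same length")
    same_wrong = sum(
        1 for a, b, k in zip(answers_a, answers_b, key) if a == b and a != k
    )
    different = sum(1 for a, b in zip(answers_a, answers_b) if a != b)
    if different == 0:
        # Identical sheets: the ratio is unbounded if any shared errors exist.
        return float("inf") if same_wrong else 0.0
    return same_wrong / different


# Hypothetical ten-question exam; none of this is real test data.
KEY = list("ABCDABCDAB")
STUDENT_1 = list("ABDDABCAAC")
STUDENT_2 = list("ABDDABCABC")  # shares three wrong answers with STUDENT_1
STUDENT_3 = list("ABCDBBCDAA")  # makes unrelated mistakes

suspicious_pair = hh_index(STUDENT_1, STUDENT_2, KEY)
ordinary_pair = hh_index(STUDENT_1, STUDENT_3, KEY)
```

A pair that shares many wrong answers while differing on few questions gets a high ratio; what threshold counts as suspicious is a policy question that this sketch does not address.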
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
No/Little Cheating (same source)
Rampant Cheating (same source)
Data Mining Applications Outline, next section: Applications, Bioinformatics
DNA
- Basic building blocks of organisms
- Located in nucleus of cells
- Composed of 4 nucleotides
- Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
www.bioalgorithms.info; chapter 6; Gene Prediction
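The example sequence above can be reproduced with a short sketch. It assumes the DNA string shown is the coding strand, so that transcription amounts to replacing T with U, and it uses the standard genetic code; the function names are illustrative.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to RNA: same sequence with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read the RNA three bases at a time until a stop codon ('*')."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
protein = translate(rna)
```

Reading the RNA as the codons CCU GAG CCA ACU AUU GAU GAA gives the one-letter amino acid string P-E-P-T-I-D-E, matching the slide.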
miRNA
- Short (20-25 nt) sequence of noncoding RNA
- Known since 1993 but significance not widely appreciated until 2001
- Impact / prevent translation of mRNA
- Generally reduce protein levels without impacting mRNA levels (animal cells)
- Functions
  - Causes some cancers
  - Guide embryo development
  - Regulate cell differentiation
  - Associated with HIV
  - …
Questions
- If each cell in an organism contains the same DNA:
  - How does each cell behave differently?
  - Why do cells behave differently during childhood/?
  - What causes some cells to act differently, such as during disease?
- DNA contains many genes, but only a few are being transcribed. Why?
- One answer: miRNA
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
Human Genome
- Scientists originally thought there would be about 100,000 genes
- Appear to be about 20,000
- WHY?
- Almost identical to that of chimps. What makes the difference?
- Visualization from UCR: dnaQT.mov
- Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
RNAi – Nobel Prize in Medicine 2006
- siRNA may be artificially added to a cell!
- Double-stranded RNA
- Short Interfering RNA (~20-25 nt)
- RNA-Induced Silencing Complex
- Binds to mRNA
- Cuts RNA
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3
Computer Science & Bioinformatics
- Algorithms
- Data Structures
- Improving efficiency
- Data Mining
- Biologists don’t usually understand or even appreciate what Computer Science can do
- Issues:
  - Scalability
  - Fuzzy
- We will look at:
  - Microarray Clustering
  - TCGR
Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
Microarray Data Analysis
- Each probe location is associated with a gene
- Measure the amount of mRNA
- Color indicates degree of gene expression
- Compare different samples (normal/disease)
- Track same sample over time
- Questions
  - Which genes are related to this disease?
  - Which genes behave in a similar manner?
  - What is the function of a gene?
- Clustering
  - Hierarchical
  - K-means
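To illustrate how K-means can group genes that behave in a similar manner, here is a minimal pure-Python sketch. The gene names and expression values are hypothetical, and seeding the centroids with the first k profiles is a simplification made to keep the example deterministic; practical analyses normally use a library implementation with better initialization.

```python
import math

# Hypothetical expression profiles: one row per gene, one value per sample.
PROFILES = {
    "geneA": [1.0, 1.2, 5.0, 5.1],
    "geneB": [5.0, 4.8, 1.0, 1.1],
    "geneC": [1.1, 0.9, 5.2, 4.9],
    "geneD": [4.9, 5.1, 0.9, 1.0],
}


def kmeans(points, k, iterations=20):
    """Plain K-means: assign each point to its nearest centroid, then re-average."""
    # Simplification: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


genes = list(PROFILES)
points = [PROFILES[g] for g in genes]
assignment, centroids = kmeans(points, k=2)
clusters = {g: a for g, a in zip(genes, assignment)}
```

On this toy data geneA and geneC end up in one cluster and geneB and geneD in the other, reflecting their mirrored expression patterns across the four samples.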
Microarray Data - Clustering
"Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004
miRNA Research Issues
- Predict / find miRNA in genomic sequence
- Predict miRNA targets
- Identify miRNA functions
Temporal CGR (TCGR)
- 2D array
- Each row represents counts for a particular window in the sequence
  - First row: first window
  - Last row: last window
  - We start successive windows at the next character location
- Each column represents the counts for the associated pattern in that window
  - Initially we have assumed the order of patterns is alphabetic
- Size of TCGR depends on sequence length and subpattern length
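The row and column layout described above can be sketched as follows. This is an illustration under assumptions, not the author's implementation: the alphabet is taken to be A, C, G, U, only full-length windows are counted (so the last row is the last complete window), and the function name and parameters are hypothetical.

```python
from itertools import product


def tcgr(sequence, window, pattern_len, alphabet="ACGU"):
    """Return (patterns, rows): rows[i][j] counts patterns[j] in the i-th window."""
    # Columns: every pattern of the given length, in alphabetic order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Successive windows start at the next character location.
    for start in range(len(sequence) - window + 1):
        text = sequence[start:start + window]
        counts = [0] * len(patterns)
        for i in range(len(text) - pattern_len + 1):
            sub = text[i:i + pattern_len]
            if sub in column:  # skip characters outside the assumed alphabet
                counts[column[sub]] += 1
        rows.append(counts)
    return patterns, rows


patterns_1, rows_1 = tcgr("AACGU", window=3, pattern_len=1)
patterns_2, rows_2 = tcgr("AACGU", window=3, pattern_len=2)
```

For a sequence of length n, a window of length w, and patterns of length k over four letters, this layout yields n - w + 1 rows and 4^k columns, which is why the size of the array depends on both the sequence length and the subpattern length.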
TCGR Example (cont’d)
TCGRs for sub-patterns of length 1, 2, and 3
TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure panels: C Elegans, Homo Sapiens, Musculus, All Mature. Labeled patterns: ACG, CGC, GCG, UCG]
TCGRs for Xue Training Data
[Figure panels: POSITIVE, NEGATIVE]
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
TCGRs for Xue Test Data
[Figure panels: POSITIVE, NEGATIVE]
Data Mining Applications Outline, next section: Conclusions
Conclusions
- Not magic
- Doesn’t work for all applications
  - Stock Market Prediction
- Issues
  - Privacy
  - Data
- Here are some infamous examples of failed data mining applications
Dallas Morning News, October 7, 2005
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
BIG BROTHER?
- Total Information Awareness
  - http://infowar.net/tia/www.darpa.mil/iao/index.htm
  - http://www.govtech.net/magazine/story.php?id=45918
  - http://en.wikipedia.org/wiki/Information_Awareness_Office
- Terror Watch List
  - http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  - http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  - http://blogs.abcnews.com/theblotter/2007/06/fbi_terror_watc.html
  - http://www.thedenverchannel.com/news/9559707/detail.html
- CAPPS
  - http://www.theregister.co.uk/2004/04/26/airport_security_failures/
  - http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
  - http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
  - http://en.wikipedia.org/wiki/CAPPS