
C20.0046: Database Management Systems
Lecture #26
M. P. Johnson, Stern School of Business, NYU
Spring 2005
Agenda
- Last time:
  - OLAP & data warehouses
- Data mining
- Web search
- Etc.
Goals after today:
1. Be aware of what problems DM solves
2. Know what the algorithms are
3. Understand supervised vs. unsupervised learning
4. Understand how to find frequent sets, and why
New topic: Data Mining
- Situation: data rich but knowledge poor
  - Terabytes and terabytes of data
  - Searching for needles in haystacks
- We can collect lots of data:
  - credit card transactions
  - bar codes
  - club cards
  - webserver logs
  - call centers
  - 311 calls in NYC
  - cameras
Lots of data
- We can store this data: DBMSs, data warehousing
- We can query it:
  - SQL: find the things such that...
  - DW/OLAP queries: find the totals for each of...
- We can get answers to specific questions, but what does it all mean?
But how do we learn from this data?
- What kinds of people will buy my widgets?
- Whom should I send my direct-mail literature to?
- How can I segment my market?
  - http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html
- Whom should we approve for a gold card?
- Who gets approved for car insurance?
  - And what should we charge?
Knowledge Discovery
- Goal: extract interesting/actionable knowledge from large collections of data
  - Finding rules
  - Finding patterns
  - Classifying instances
- KD ~= DM ~= business analytics
- Business intelligence: semi-intelligent reporting/OLAP
Data mining lies at the intersection of several disciplines:
- DBMS: query processing, data warehousing, OLAP
- Statistics: mathematical modeling, regression
- Visualization: 3D representation
- AI/machine learning: neural networks, decision trees, etc.
  - Except the algorithms must really scale
ML/DM: two basic kinds
1. Supervised learning
   - Classification
   - Regression
2. Unsupervised learning
   - Clustering
   - Or: find something interesting
Supervised learning
- Situation: you have many particular instances
  - Transactions
  - Customers
  - Credit applications
- You have various fields describing them
- But there is one other property you're interested in:
  - E.g., credit-worthiness
- Goal: infer this dependent property from the other data
Supervised learning
- Supervised learning starts with training data
  - Many instances, including the dependent property
- Use the training data to build a model
  - Many different algorithms
- Given the model, you can now determine the dependent property for new instances
  - And ideally, the answers are "correct"
- Categorical property: "classification"
- Numerical property: "regression"
k-Nearest Neighbor
- Very simple algorithm
  - Sometimes works
- Map the training data points into a space
- Given any two points, we can compute the distance between them
  - E.g., Euclidean distance
- Given an unlabeled point, label it this way:
  - Find the k nearest points
  - Let them vote on the label (see the sketch below)
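Below is a minimal Python sketch of this k-NN labeling step, assuming numeric feature vectors and Euclidean distance; the function name, the value k=3, and the toy data are illustrative choices, not from the lecture.

    import math
    from collections import Counter

    def knn_classify(training, query, k=3):
        """Label `query` by majority vote among its k nearest training points.
        `training` is a list of (feature_vector, label) pairs; distance is Euclidean."""
        # distance from the query point to every training point
        dists = [(math.dist(x, query), label) for x, label in training]
        # take the k closest points and let them vote on the label
        k_nearest = sorted(dists)[:k]
        votes = Counter(label for _, label in k_nearest)
        return votes.most_common(1)[0][0]

    # toy example: two clumps of labeled points
    data = [((1, 1), "yes"), ((1, 2), "yes"), ((2, 1), "yes"),
            ((8, 8), "no"), ((8, 9), "no"), ((9, 8), "no")]
    print(knn_classify(data, (2, 2)))   # "yes"
    print(knn_classify(data, (7, 9)))   # "no"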
Neural Networks (skip?)
- "Hill climbing" based on connections between neurons in the brain
- A simple NN:
  1. input layer with 3 nodes
  2. hidden layer with 2 nodes
  3. output layer with 1 node
- Each node points to each node in the next level
- Each node has some activation level and something like a critical mass
- Draw picture
  - What kind of graph is this?
Neural Networks (skip?)
- Values passed into the input nodes represent the problem instance
- Given the weighted sum of its inputs, a neuron sends out a pulse only if the sum is greater than its threshold
- Values output by the hidden nodes are sent to the output node
- If the weighted sum going into the output node is high enough, it outputs 1, otherwise 0
NN applications (skip?)
- Plausible application: we have data about potential customers
  - party registration
  - married or not
  - gender
  - income level
- Or: we have credit applicant information
  - employment
  - income
  - home ownership
  - bankruptcy
  - Should we give him a credit card?
How NNs work (skip?)
- Hope: plug in a customer, and out comes whether we should market toward him
- How does it get the right answer?
  - Initially, all the weights are random!
- But: we assume we have data for lots of people whom we know to be either interested in our products or not (let's say)
  - We have data for both kinds
- So: when we plug in one of these customers, we know what the right answer is supposed to be
How NNs work (skip?)
- We can use the backpropagation algorithm (a simplified single-unit sketch follows):
  - For each known problem instance, plug it in and look at the answer
  - If the answer is wrong, change the edge weights one way; otherwise, change them the opposite way (details omitted)
  - Repeat...
- The more iterations we do, the more the NN learns our known data
- With enough confidence, we can apply the NN to unknown customer data to learn whether to market toward them
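As a rough illustration only: the following Python sketch trains a single threshold unit with a perceptron-style update, which captures the "weighted sum vs. threshold" firing rule and the "nudge the weights when the answer is wrong" idea from these slides. It is not full backpropagation, and the toy credit data, learning rate, and epoch count are made-up values.

    import random

    def fires(weights, inputs, threshold=0.0):
        # a unit sends out a pulse (1) only if its weighted input sum exceeds its threshold
        return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

    def train(examples, n_features, rate=0.1, epochs=100):
        """examples: list of (inputs, label) with label 0 or 1.  Weights start random;
        a wrong answer moves the weights toward producing the correct output."""
        weights = [random.uniform(-1, 1) for _ in range(n_features)]
        for _ in range(epochs):
            for inputs, label in examples:
                error = label - fires(weights, inputs)      # 0 if right, +1 or -1 if wrong
                weights = [w + rate * error * x for w, x in zip(weights, inputs)]
        return weights

    # toy data: (employed, bankruptcy) -> approve?  (illustrative only)
    data = [((1, 0), 1), ((1, 1), 0), ((0, 0), 0), ((0, 1), 0)]
    w = train(data, n_features=2)
    print([fires(w, x) for x, _ in data])   # typically [1, 0, 0, 0] once trained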
LBS example (skip?)
- Investments: the goal is to maximize the return on investments
  - Buy/sell the right securities at the right time
- Lots of time-series data for different properties of different stocks
  - return
  - market signals
- Pick the right ones, react
- Solution: create an NN for each stock
  - Retrain weekly
Decision Trees
- Another use of (rooted) trees
  - Trees, but not BSTs
- Each node ~ one attribute
  - Its children ~ possible values of that attribute
- E.g.: each node ~ some field on a credit application
  - Each path from root to leaf is one rule
  - If these fields have these values, then make this decision
Decision Trees
- Details:
  - For a binary property: two out-edges, but there may be more
  - For a continuous property (income): divide the values into discrete ranges
  - A property may appear more than once
- Example (see the sketch below): top node: history of bankruptcy?
  - If yes, REJECT
  - If no, then: employed?
    - If no, ... (maybe look for a high monthly housing payment)
    - If yes, ...
- Particular algorithms: ID3, CART, etc.
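A small hand-coded Python version of the example tree above; the field names, the $2,000 cutoff, and the fallback decisions are illustrative assumptions. In practice the tree itself would be learned from training data by an algorithm such as ID3 or CART rather than written by hand.

    def credit_decision(app):
        """Walk one small decision tree over a credit application (a dict of fields).
        Each root-to-leaf path is one rule: if these fields have these values, decide this."""
        if app["bankruptcy"]:                      # top node: history of bankruptcy?
            return "REJECT"
        if not app["employed"]:                    # next attribute: employed?
            # a continuous property (housing payment) split into discrete ranges
            if app["monthly_housing_payment"] > 2000:
                return "REJECT"
            return "APPROVE"
        return "APPROVE"

    print(credit_decision({"bankruptcy": False, "employed": True,
                           "monthly_housing_payment": 1200}))   # APPROVE
    print(credit_decision({"bankruptcy": True, "employed": False,
                           "monthly_housing_payment": 2500}))   # REJECT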
Naïve Bayes Classifier
- Bayes' Theorem: Pr(B|A) = Pr(B, A)/Pr(A) = Pr(A|B)*Pr(B)/Pr(A)
- Used in many spam filters
- W means: "the msg has the words W1, W2, ..., Wn"
- S means: "it's spam"
- Goal: given a new msg with certain words, is it spam?
  - Pr(S|W) = Pr(S, W)/Pr(W) = Pr(W|S)*Pr(S)/Pr(W)
  - Is Pr(S|W) > 50%?
Naïve Bayes Classifier
- This is supervised learning, so we first have a training phase
- Training phase:
  - Look at lots of spam messages and my non-spam messages
  - For each word Wi, compute Pr(Wi)
  - For each word Wi, compute Pr(Wi|S)
  - Compute Pr(S)
- That's it! Now we wait for email to arrive...
Naïve Bayes Classifier
- When a new msg with words W = W1...Wn arrives, we compute:
  - Pr(S|W) = Pr(W|S)*Pr(S)/Pr(W)
- What are Pr(W) and Pr(W|S)? Assuming the words are independent (obviously false), we have:
  - Pr(W) = Pr(W1)*Pr(W2)*...*Pr(Wn)
  - Pr(W|S) = Pr(W1|S)*Pr(W2|S)*...*Pr(Wn|S)
- Each number here we have precomputed!
  - Except for new words...
- To decide the spam status of the message, then, we just do the math! (sketch below)
- Very simple, but works surprisingly well in practice
  - Really simple: can be written in a page of Perl code
  - See also Paul Graham: http://www.paulgraham.com/spam.html
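A minimal Python sketch of the training and scoring steps just described, on a tiny hand-labeled corpus; the add-one smoothing for unseen words and all names and messages here are illustrative choices, not part of the lecture.

    from collections import Counter

    def train(messages):
        """messages: list of (text, is_spam).  Count word occurrences per class."""
        spam_words, ham_words, n_spam, n_ham = Counter(), Counter(), 0, 0
        for text, is_spam in messages:
            words = set(text.lower().split())
            if is_spam:
                n_spam += 1
                spam_words.update(words)
            else:
                n_ham += 1
                ham_words.update(words)
        return spam_words, ham_words, n_spam, n_ham

    def pr_spam(text, spam_words, ham_words, n_spam, n_ham):
        """Pr(S|W) via Bayes' theorem with the naive independence assumption."""
        p_spam = n_spam / (n_spam + n_ham)                 # Pr(S)
        p_ham = n_ham / (n_spam + n_ham)                   # Pr(not S)
        for w in set(text.lower().split()):
            p_spam *= (spam_words[w] + 1) / (n_spam + 2)   # Pr(w | S), add-one smoothed
            p_ham *= (ham_words[w] + 1) / (n_ham + 2)      # Pr(w | not S)
        return p_spam / (p_spam + p_ham)                   # normalizing makes Pr(W) cancel

    corpus = [("cheap pills win money", True), ("win a free prize now", True),
              ("lunch meeting at noon", False), ("project report attached", False)]
    model = train(corpus)
    print(pr_spam("win free money", *model) > 0.5)   # True: looks like spam
    print(pr_spam("meeting report", *model) > 0.5)   # False: looks like regular mail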
Naïve Bayes Classifier (example spam message)
From: "mrs bulama shettima" <aminabulama@pnetmail.co.za>
Date: September 11, 2004 4:32:14 AM EDT
To: aminabulama@pnetmail.co.za
Subject: SALAM.
My name is Hajia Amina Shettima Mohammed Bulama the eldest wife of Shettima Mohammed Bulama who was the est. while managing director of the Bank of the North Nig, Plc. I am contacting you in a benevolent spirit; utmost confidence and trust to enable us provide a solution to a money transfer of $12 M that is presentlt the last resort of my family. My husband has just been retired from office and has also been placed under survelience of which he can not travel out of the country for now due to allegations levelled against him while in office as the Managing director of the bank of the north Nig. Plc of which he is facing litigations right now. You may be quite surprised at my sudden contact to you but do not despair, I got your contact from a business site on the internet and following the information I gathered about you, I was convinced that you could be of assistance to me. So, decided to contact you at once due to the urgency required for us to immediately transfer the said funds out of the country. During the time my husband was with the bank he made several loans to various oil firms and inreturn he was given various sums as gratifications. The prominent amongst the deals was monies that emanated from funds set aside for the importation of various ballot equipmemts for the just concluded april general elections. If you are conversant with world news, you would understand better. During this period my husband was able to make some good money for himself and kept in a private security finance accounts here as persopnal effects. Out of the money my husband made, he left the sum of US$12 M (Twelve Mllion United States Dollars) which was kept in a Private security firm here in Nigeria and he kept it in the vaults of this firm as personal effects. The reason is because no names were used to lodge in the consignment containing the funds. Instead, he used PERSONAL IDENTIFICATION NUMBERS (PIN)and declared the contents as Bearer Bonds and Treasury Bills. Also the firm issued him with a certificate of deposit of the consignment. Note that I have these information in my custody. Right now, my husband has asked me to negotiate with a foreigner who would assist us lay claim to the consignment, as the eldest wife of my husband, I believe that I owe the entire family an obligation to ensure that the US$12 M is successfully transferred abroad for investment purposes. With the present situation, I cannot do it all by myself. It is based on this that I am making this contact with you. I have done a thorough homework and fine-tuned the best way to create you as the beneficiary to the consignment containing the funds and effect the transfer of the consignment accordingly. It is rest assured that the modalities I have resolved to finalize the entire project guarantees our safety and the successful transfer of the funds. So, you will be absolutely right when you say that this project is risk free and viable. If you are capable and willing to assist, contact me at once via this email for more details. Believe me, there is no one else we can trust again. All my husband's friends have deserted us after exploiting us on the pretence of trying to help my husband. As it is said, it is at the time of problems that you know your true friends. So long as you keep everything to yourself, we would definitely have no problems.
For your assistance, I am ready to give you as much as 25% of the total funds after confirmation in your possession and invest a reasonable percentage into any viable business you may suggest. Please, I need your assistance to make this happen and please do not undermine it because it will also be a source of up liftment to you also. You have absolutely nothing to loose in assisting us instead, you have so much to gain. I will appreciate if you can contact me at my private email: mrsaminabulama@gawab.com once. Awaiting your urgent and positive response. Thanks and Allah be with you. Hajia (Mrs) Amina Shettima Bulama.
Judging the results
- It's (relatively) easy to create a model that does very well on the training data
- What matters: it should do well on future, unseen data
- Common problem: overfitting
  - The model is too close to the training data
  - It needs to be pruned
- Common approach: cross-validation (sketch below)
  - Divide the training data into 10 parts
  - Train on 9/10, test on 1/10
  - Do this all 10 ways
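A minimal sketch of the 10-fold procedure in Python; `build_model` and `score` stand in for whatever training and evaluation functions are being judged and are not from the lecture.

    def cross_validate(data, build_model, score, k=10):
        """Estimate out-of-sample performance by k-fold cross-validation.
        build_model(examples) returns a model; score(model, examples) returns a number."""
        folds = [data[i::k] for i in range(k)]              # split the data into k parts
        results = []
        for i in range(k):
            held_out = folds[i]                             # test on 1/k of the data
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            model = build_model(training)                   # train on the other (k-1)/k
            results.append(score(model, held_out))
        return sum(results) / k                             # average over all k ways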
Association applications
- Spam filtering
- Network intrusion detection
- Trading strategies
- Political marketing:
  - Mail-order tulip bulbs
  - Conservative party voters
  - NYTimes last year...
New topic: Clustering
- Unsupervised learning
  - Divide items up into clusters of the same type
  - What are the different types of customers we have?
  - What types of webpages are there?
- Each item becomes a point in a space
  - One dimension for every property
  - Possible plotting for webpages: a dimension for each word
  - Value in dimension d is (# occurrences of d) / (# of all word occurrences)
- k-means
k-means
- Simple clustering algorithm
- We want to partition the data points into sets of "similar" instances
- Plot the points in n-space, then partition them into clumps in space
- Like an unsupervised analog of k-NN
  - But iterative
k-means
- Algorithm (sketch below):
  - Choose initial means (centers)
  - Do {
      assign each point to the closest mean;
      recompute the means of the clumps
    } until the change in the means < e
- Visualization:
  - http://www.delft-cluster.nl/textminer/theory/kmeans.html
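A minimal Python sketch of exactly this loop for 2-D points, assuming Euclidean distance; the random initialization and the tolerance value are illustrative choices.

    import math, random

    def kmeans(points, k, tol=1e-6):
        """Partition 2-D points into k clumps by the assign/recompute loop above."""
        means = random.sample(points, k)                    # choose initial means (centers)
        while True:
            # assign each point to the closest mean
            clumps = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: math.dist(p, means[j]))
                clumps[i].append(p)
            # recompute the mean of each clump (keep the old mean if a clump is empty)
            new_means = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                         if c else m
                         for c, m in zip(clumps, means)]
            # stop once the change in the means is below the tolerance
            if max(math.dist(m, n) for m, n in zip(means, new_means)) < tol:
                return new_means, clumps
            means = new_means

    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    centers, clusters = kmeans(pts, k=2)
    print(centers)   # roughly (1.33, 1.33) and (8.33, 8.33), in some order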
n Amazon’s “Page You Made” q n mpj 9@col How does this work? M. P. Johnson, DBMS, Stern/NYU, Spring 2005 30
New topic: Frequent Sets
- Find sets of items that go together
- Intuition: market-basket data
  - Suppose you're Wal-Mart, with 460 TB of data
    - http://developers.slashdot.org/article.pl?sid=04/11/14/2057228
  - You might not know the customer's history (but maybe: club cards!)
  - One thing you do know: groupings of items
    - Each set of items rung up in an individual purchase
- Famous (claimed) example: beer and diapers are positively correlated
  - Intuition: babies at home, not at bars
  - So: put chips in between
Finding Frequent Sets
- Situation: one table
  - Baskets(id, item)
  - One row for each purchase of each item
  - Alternative interpretation: words on webpages
- Goal: given a support s, find the sets of items that appear together in >= s baskets
- For pairs, the obvious attempt:

    SELECT B1.item, B2.item, COUNT(B1.id)
    FROM Baskets B1, Baskets B2
    WHERE B1.id = B2.id AND B1.item < B2.item
    GROUP BY B1.item, B2.item
    HAVING COUNT(B1.id) >= s;

- What's wrong?
A Priori Property
- a priori ~ prior
  - Prior to investigation
  - Deductive/from reason/non-empirical
- As opposed to a posteriori ~ post
  - After investigation/experience
- Logical observation: if support(X) >= s, then for every individual item x in X, support(x) >= s
  - So any item y with support < s can be ignored
A Priori Property
- E.g.: you sell 10,000 items, and the store records 1,000 baskets, with an average of 20 items each
  - 20,000 rows in Baskets
  - # pairs of items from 1,000 baskets = C(20,2) * 1,000 = 190,000 pairs!
- One idea:
  1. Find the support-s items
  2. Run the old query on just them
A Priori Property
- Suppose we're looking for support-10,000 pairs
  - Each item in the pair must itself have support >= 10,000
- Counting argument: say 20,000,000 individual items are purchased in total
  - Then there can be at most 20,000,000/10,000 = 2,000 popular items
  - We've eliminated 4/5 of the item types!
    - Actually, probably many more
  - This will also lower the average basket size
- Suppose we're now left with 500,000 baskets with an average basket size of 10
  - # pairs = 500,000 * C(10,2) = 22,500,000
A Priori Algorithm
- But: a frequent itemset may have more than 2 items
- Idea: build frequent sets (not just pairs) iteratively
- Algorithm (sketch below):
  - First, find all frequent items
    - These are the size-1 frequent sets
  - Next, for each size-k frequent set:
    - For each frequent item:
      - Check whether the union is frequent
      - If so, add it to the collection of size-(k+1) frequent sets
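A minimal in-memory Python sketch of this level-wise idea, growing each size-k frequent set by one frequent item and re-checking support, as the slide describes; real Apriori implementations generate and prune candidates more carefully and scan the data far fewer times. The baskets and support value are illustrative.

    def frequent_sets(baskets, s):
        """baskets: list of sets of items; s: minimum support (number of baskets)."""
        def support(itemset):
            return sum(itemset <= b for b in baskets)       # in how many baskets?

        # size-1 frequent sets: individual items with support >= s
        items = {x for b in baskets for x in b}
        frequent_items = {x for x in items if support({x}) >= s}
        level = {frozenset([x]) for x in frequent_items}

        found = list(level)
        while level:
            next_level = set()
            for fs in level:                                # each size-k frequent set
                for x in frequent_items - fs:               # each frequent item not in it
                    candidate = fs | {x}
                    if support(candidate) >= s:             # is the union frequent?
                        next_level.add(candidate)
            found.extend(next_level)
            level = next_level
        return found

    baskets = [{"pen", "ink", "milk", "juice"}, {"pen", "ink", "milk"},
               {"pen", "milk"}, {"pen", "ink", "juice", "water"}]
    print(frequent_sets(baskets, s=3))   # {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}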
Mining for rules
- Frequent sets tell us which things go together
- But they don't tell us about causality
- Sometimes we want to say: a causes b
  - Or at least: the presence of a makes b likely
- E.g.: razor purchase -> future razorblade purchases
  - But not vice versa
  - So: make razors cheap
Association rules n Here’s some data – what implies what? Xact Custid Date Item Qty 111 201 5/1/99 Pen 2 111 201 5/1/99 Ink 1 111 201 5/1/99 Milk 3 111 201 5/1/99 Juice 6 112 105 6/3/99 Pen 1 112 105 6/3/99 Ink 1 112 105 6/3/99 Milk 1 113 106 5/10/99 Pen 1 113 106 5/10/99 Milk 1 114 201 6/1/99 Pen 2 114 201 6/1/99 Ink 2 114 201 6/1/99 Juice 4 114 201 6/1/99 Water 1 M. P. Johnson, DBMS, Stern/NYU, Spring 2005 38
Association rules
- Candidates:
  - Pen -> ink
  - Ink -> pen
  - Water -> juice
  - Etc.
- How do we pick?
- Two main measures
Judging association rules
- Support
  - Support(X -> Y) = support(X union Y) = Pr(X union Y)
  - Does the rule matter?
- Confidence (both measures are computed in the sketch below)
  - Conf(X -> Y) = support(X -> Y) / support(X) = Pr(Y|X)
  - Should we believe the rule?
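A small Python sketch of these two measures over basket data, using item sets like those in the transaction table above; the function names are illustrative.

    def support(itemset, baskets):
        """Fraction of baskets containing every item in `itemset` (Pr of the union)."""
        return sum(itemset <= b for b in baskets) / len(baskets)

    def confidence(lhs, rhs, baskets):
        """Conf(lhs -> rhs) = support(lhs union rhs) / support(lhs) = Pr(rhs | lhs)."""
        return support(lhs | rhs, baskets) / support(lhs, baskets)

    # the four transactions from the table, reduced to their item sets
    baskets = [{"Pen", "Ink", "Milk", "Juice"}, {"Pen", "Ink", "Milk"},
               {"Pen", "Milk"}, {"Pen", "Ink", "Juice", "Water"}]
    print(confidence({"Pen"}, {"Ink"}, baskets))      # 0.75: 3 of the 4 pen baskets have ink
    print(confidence({"Ink"}, {"Pen"}, baskets))      # 1.0: every ink basket also has a pen
    print(confidence({"Water"}, {"Juice"}, baskets))  # 1.0, but support({Water, Juice}) is only 0.25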
Association rules
- What are the supports and confidences (using the transaction table above)?
  - Pen -> ink
  - Ink -> pen
  - Water -> juice
- Which matter?
Discovering association rules
- Association rules only matter if their support and confidence are both high enough
  - The user specifies the minimum allowed for each
- First, support:
  - High support(X -> Y) means high support(X u Y)
  - I.e., X u Y is a frequent set
- So, to find a good association rule:
  1. Generate a frequent set Z
  2. Divide Z into subsets X, Y in all possible ways, checking whether conf(X -> Y) is high enough (sketch below)
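A minimal Python sketch of step 2: split a frequent set Z into lhs -> rhs in every possible way and keep the splits whose confidence is high enough; the 0.9 threshold is an illustrative choice.

    from itertools import combinations

    def rules_from_frequent_set(z, baskets, min_conf):
        """Try every way of dividing the frequent set z into lhs -> (z - lhs)."""
        def support(itemset):
            return sum(itemset <= b for b in baskets)
        rules = []
        for r in range(1, len(z)):
            for lhs in map(set, combinations(z, r)):
                if support(z) / support(lhs) >= min_conf:   # conf(lhs -> z - lhs)
                    rules.append((lhs, z - lhs))
        return rules

    baskets = [{"Pen", "Ink", "Milk", "Juice"}, {"Pen", "Ink", "Milk"},
               {"Pen", "Milk"}, {"Pen", "Ink", "Juice", "Water"}]
    print(rules_from_frequent_set({"Pen", "Ink"}, baskets, min_conf=0.9))
    # only [({'Ink'}, {'Pen'})]: Ink -> Pen has confidence 3/3, Pen -> Ink only 3/4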
Association rules
- Suppose {pen, ink} is frequent (using the same transaction table as before)
- Divide it into subsets both ways:
  - Pen -> ink
  - Ink -> pen
- Which do we choose?
Other kinds of baskets
- Here, the basket was a single transaction (purchase)
- But we could use other baskets:
  - All purchases from each customer
  - All purchases on the first day of each month
  - Etc.
Frequent set/association applications
- Store/website/catalog layout
- "Page You Made"
- Direct marketing
- Fraud detection
Mining vs. warehousing
- Warehousing: let the user search and group by interesting properties
  - Give me the sales of A4s by year and dealer, for these colors
  - The user tries to learn from the results which properties are important/interesting
  - What's driving sales?
- Mining: tell the user what the interesting properties are
  - How can I increase sales?
Social/political concerns
- Privacy
- TIA (Total Information Awareness)
- Sensitive data
  - Allow mining but not queries
- Opt-in/opt-out
- "Don't be evil."
For more info
- See Dhar & Stein, Seven Methods for Transforming Corporate Data into Business Intelligence (1997)
  - Drawn on above
  - A few years old, but very accessible
- http://www.kdnuggets.com/
- Data mining courses offered here...
Future
- RAID
- Web search
- Project 5 due
  - Print by 10:30 and turn it in on time (at 11)
  - If not, email...
- Final exam: next Thursday, 5/5, 10-11:50 am
  - Info is up