Data Mining Algorithms for Recommendation Systems. Zhenglu Yang, University of Tokyo. Some slides are from online materials.
Applications (example screenshots, including corporate intranets)
System Inputs o Interaction data (users × items) n Explicit feedback – ratings, comments n Implicit feedback – purchases, browsing o User/item individual data n User side: o Structured attribute information o Personal description o Social network n Item side: o Structured attribute information o Textual description/content information o Taxonomy of item (category) 5
Interaction between Users and Items Observed preferences (Purchases, Ratings, page views, bookmarks, etc) 6
Profiles of Users and Items. User Profile: (1) Attribute: Nationality, Sex, Age, Hobby, etc. (2) Text: Personal description. (3) Link: Social network. Item Profile: (1) Attribute: Price, Weight, Color, Brand, etc. (2) Text: Product description. (3) Link: Taxonomy of item (category). 7
All Information about Users and Items. User Profile: (1) Attribute: Nationality, Sex, Age, Hobby, etc. (2) Text: Personal description. (3) Link: Social network. Item Profile: (1) Attribute: Price, Weight, Color, Brand, etc. (2) Text: Product description. (3) Link: Taxonomy of item (category). Observed preferences (purchases, ratings, page views, bookmarks, etc). 8
KDD and Data Mining. Data mining is a multi-disciplinary field, drawing on Artificial Intelligence, Machine Learning, Databases, Statistics, and Natural Language Processing. 9
Recommendation Approaches o Collaborative filtering n Using interaction data (user-item matrix) n Process: Identify similar users, extrapolate from their ratings o Content based strategies n Using profiles of users/items (features) n Process: Generate rules/classifiers that are used to classify new items o Hybrid approaches 10
A Brief Introduction o Collaborative filtering n Nearest neighbor based n Model based 11
Recommendation Approaches o Collaborative filtering n Nearest neighbor based o User based o Item based n Model based 12
User-based Collaborative Filtering o Idea: people who agreed in the past are likely to agree again o To predict a user’s opinion for an item, use the opinion of similar users o Similarity between users is decided by looking at their overlap in opinions for other items
User-based CF (Ratings). A users × items matrix of ratings on a scale from 1 (bad) to 10 (good); the task is to predict the missing rating ('?') for the active user from the ratings of similar users. 14
Similarity between Users (two rows of the rating matrix are compared) o Only consider items both users have rated o Common similarity measures: n Cosine similarity n Pearson correlation 15
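A minimal Python sketch of the two similarity measures listed above, computed only over the items both users have rated; the rating dictionaries are illustrative, not the matrix from the slide:

```python
import math

def corated(u, v):
    """Paired ratings for the items both users have rated."""
    items = set(u) & set(v)
    return [(u[i], v[i]) for i in items]

def cosine_sim(u, v):
    pairs = corated(u, v)
    num = sum(a * b for a, b in pairs)
    den = (math.sqrt(sum(a * a for a, _ in pairs)) *
           math.sqrt(sum(b * b for _, b in pairs)))
    return num / den if den else 0.0

def pearson_sim(u, v):
    pairs = corated(u, v)
    if len(pairs) < 2:
        return 0.0
    mu_u = sum(a for a, _ in pairs) / len(pairs)
    mu_v = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mu_u) * (b - mu_v) for a, b in pairs)
    den = (math.sqrt(sum((a - mu_u) ** 2 for a, _ in pairs)) *
           math.sqrt(sum((b - mu_v) ** 2 for _, b in pairs)))
    return num / den if den else 0.0

# Illustrative ratings (item -> rating on a 1-10 scale).
alice = {"i1": 8, "i2": 1, "i4": 2, "i5": 7}
bob   = {"i1": 9, "i2": 2, "i3": 8, "i5": 9}
print(cosine_sim(alice, bob), pearson_sim(alice, bob))
```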
Recommendation Approaches o Collaborative filtering n Nearest neighbor based o User based o Item based n Model based o Content based strategies o Hybrid approaches 16
Item-based Collaborative Filtering o Idea: a user is likely to have the same opinion for similar items o Similarity between items is decided by looking at how other users have rated them 17
Example: Item-based CF (user × item ratings; blank cells are unrated)
        Item 1  Item 2  Item 3  Item 4  Item 5
User 1:   8       1       ?       2       7
User 2:   2       2       5       7       5
User 3:   (three ratings: 5, 4, 7)
User 4:   7       1       7       3       8
User 5:   1       7       4       6       5
User 6:   (three ratings: 8, 3, 7)
Similarity between Items (two columns of the rating matrix, e.g. Item 3 and Item 4, are compared) o Only consider users who have rated both items o Common similarity measures: n Cosine similarity n Pearson correlation
Recommendation Approaches o Collaborative filtering n Nearest neighbor based n Model based o Matrix factorization (i.e., SVD) o Content based strategies o Hybrid approaches 20
Singular Value Decomposition (SVD) o A general mathematical tool applied to many problems o Given any m×n matrix R, find matrices U, Σ, and V such that R = UΣV^T, where U is m×r and orthonormal, Σ is r×r and diagonal, and V is n×r and orthonormal o Keep only the k largest singular values (remove the smallest) to obtain a rank-k approximation R_k with k << r
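A small numpy sketch of the rank-k truncation just described; the rating matrix and the choice of k are illustrative, and a real recommender would first deal with missing ratings (e.g., by imputation or a dedicated matrix-factorization model) before applying plain SVD:

```python
import numpy as np

# Illustrative dense rating matrix (rows = users, columns = items).
R = np.array([[8., 1., 6., 2., 7.],
              [2., 2., 5., 7., 5.],
              [7., 1., 7., 3., 8.],
              [1., 7., 4., 6., 5.]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)   # R = U @ diag(s) @ Vt

k = 2                                              # keep the k largest singular values
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation of R

print(np.round(R_k, 2))                            # reconstructed (smoothed) ratings
```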
Problems with Collaborative Filtering o Cold Start: There needs to be enough other users already in the system to find a match. o Sparsity: If there are many items to be recommended, even if there are many users, the user/ratings matrix is sparse, and it is hard to find users that have rated the same items. o First Rater: Cannot recommend an item that has not been previously rated. n New items n Esoteric items o Popularity Bias: Cannot recommend items to someone with unique tastes. n Tends to recommend popular items. 22
Recommendation Approaches o Collaborative filtering o Content based strategies o Hybrid approaches 23
Profiles of Users and Items. User Profile: (1) Attribute: Nationality, Sex, Age, Hobby, etc. (2) Text: Personal description. (3) Link: Social network. Item Profile: (1) Attribute: Price, Weight, Color, Brand, etc. (2) Text: Product description. (3) Link: Taxonomy of item (category). 24
Advantages of Content-Based Approach o No need for data on other users. n No cold-start or sparsity problems. o Able to recommend to users with unique tastes. o Able to recommend new and unpopular items n No first-rater problem. o Can provide explanations of recommended items by listing content-features that caused an item to be recommended. 25
Recommendation Approaches o Collaborative filtering o Content based strategies n Association Rule Mining n Text similarity based n Clustering n Classification o Hybrid approaches 26
Traditional Data Mining Techniques o Association Rule Mining (ARM) o Sequential Pattern Mining (SPM) 27
ARM Example: Market Basket Data o Items frequently purchased together: beer and diapers o Uses: n Recommendation n Placement n Sales n Coupons o Objective: increase sales and reduce costs 28
Association Rule Mining o Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. Example of association rules (from market-basket transactions): {Beer} → {Diaper}, {Coke} → {Eggs}. Implication means co-occurrence, not causality! 29
Some Definitions An itemset is supported by a transaction if it is included in the transaction Market-Basket transactions
Some Definitions If the support of an itemset exceeds user specified min_support (threshold), this itemset is called a frequent itemset (pattern). Market-Basket transactions min_support=50%
Outline o Association Rule Mining n Apriori n FP-growth o Sequential Pattern Mining 32
Apriori Algorithm o Proposed by Agrawal et al. [VLDB'94] o First algorithm for Association Rule mining o Candidate generation-and-test o Introduced the anti-monotone property 33
Apriori Algorithm Market-Basket transactions 34
A Naive Algorithm: enumerate every candidate itemset and count its support against the database (Supmin = 2). 35
Apriori Algorithm o Anti-monotone property: if an itemset is not frequent, then none of its supersets is frequent (Supmin = 2 in the running example) 36
Apriori Algorithm (min_support = 2)
Transactions: 10: {A, C, D}; 20: {B, C, E}; 30: {A, B, C, E}; 40: {B, E}
1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3 -> L1: {A}:2, {B}:3, {C}:3, {E}:3
C2 candidates: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2 -> L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3 candidate: {B,C,E}
3rd scan, C3: {B,C,E}:2 -> L3: {B,C,E}:2
37
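A compact Python sketch that reproduces the level-wise trace above: candidates are generated by joining frequent (k-1)-itemsets, pruned when they have an infrequent subset (the anti-monotone property), and counted with one database scan per level. It mirrors the example rather than an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {k: v for k, v in counts.items() if v >= min_support}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: unions of two frequent (k-1)-itemsets of size k,
        # kept only if every (k-1)-subset is frequent (anti-monotone pruning).
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(frozenset(s) in frequent
                                           for s in combinations(union, k - 1)):
                    candidates.add(union)
        # One scan of the database to count candidate supports.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
for itemset, sup in sorted(apriori(db, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)
```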
Drawbacks of Apriori o Multiple scans of the transaction database n Multiple database scans are costly o Huge number of candidates n To find the frequent itemset i1 i2 … i100: o # of scans: 100 o # of candidates: 2^100 − 1 ≈ 1.27×10^30 38
Outline o Association Rule Mining n Apriori n FP-growth o Sequential Pattern Mining 39
FP-Growth o Proposed by Han et al. [SIGMOD'00] o Uses the Apriori pruning principle o Scans the DB only twice n Once to find the frequent 1-itemsets (single-item patterns) n Once to construct the FP-tree (a prefix tree / trie), the data structure of FP-growth 40
FP-Growth (FP-tree construction, min_support = 3). Transactions and their ordered frequent items: 100: {f,a,c,d,g,i,m,p} → {f,c,a,m,p}; 200: {a,b,c,f,l,m,o} → {f,c,a,b,m}; 300: {b,f,h,j,o,w} → {f,b}; 400: {b,c,k,s,p} → {c,b,p}; 500: {a,f,c,e,l,p,m,n} → {f,c,a,m,p}. Header table (item: support): f:4, c:4, a:3, b:3, m:3, p:3. After inserting the first transaction, the tree contains the single path f:1 → c:1 → a:1 → m:1 → p:1 under the root {}. 41
FP-Growth (after inserting all five transactions). The complete FP-tree: the root {} has two children, f:4 and c:1. Under f:4: c:3 → a:3, where a:3 branches to m:2 → p:2 and to b:1 → m:1; f:4 also has a child b:1. Under c:1: b:1 → p:1. Header table as before (f:4, c:4, a:3, b:3, m:3, p:3). 42
FP-Growth. The header table stores each item with its support and the head of its node-link list; node-links chain together all occurrences of the item in the FP-tree. The tree is the same as on the previous slide. 43
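A sketch of the two-scan FP-tree construction shown above: the first scan counts item supports, the second inserts each transaction's frequent items in descending-support order and maintains node-links. Ties in support are broken alphabetically here, so the exact tree shape can differ from the slide (which orders f before c); the recursive mining over conditional pattern bases is not included:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item -> FPNode

def build_fptree(transactions, min_support):
    # First scan: count item supports and keep the frequent ones.
    support = defaultdict(int)
    for t in transactions:
        for item in set(t):
            support[item] += 1
    frequent = {i: s for i, s in support.items() if s >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)          # item -> node-links (list of FPNodes)

    # Second scan: insert ordered frequent items of each transaction.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))   # descending support, ties alphabetical
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, frequent

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header, freq = build_fptree(db, min_support=3)
print(freq)                                          # item supports (dict order may vary)
print([(n.item, n.count) for n in header['p']])      # the two p nodes and their counts
```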
FP-Growth (mining with conditional pattern bases). For each item, collect its conditional pattern base, i.e., the prefix paths leading to it: p: {fcam:2, cb:1}; m: {fca:2, fcab:1}; b: {fca:1, f:1, c:1}; a: {fc:3}; and so on. Each conditional pattern base is turned into a conditional FP-tree and mined recursively; for m, for example, the frequent itemsets found are fm, cm, am, fcm, fam, cam, fcam. 44
Outline o Association Rule Mining n Apriori n FP-growth o Sequential Pattern Mining n GSP n SPADE, SPAM n PrefixSpan 45
Applications o Customer shopping sequences, e.g., a customer buys a notebook computer, then memory and a CD-ROM within 3 days o Query logs, e.g., a query Q1 is followed by a re-query Q2 80% of the time 46
Some Definitions. If the support of a sequence exceeds a user-specified min_support, the sequence is called a sequential pattern (illustrated on a small database of customer sequences).
Outline o Association Rule Mining n Apriori n FP-growth o Sequential Pattern Mining n GSP n SPADE, SPAM n PrefixSpan 48
GSP (Generalized Sequential Pattern Mining) o Proposed by Srikant et al. [EDBT'96] o Uses the Apriori pruning principle o Outline of the method n Initially, every item in the DB is a candidate of length 1 n For each level (i.e., sequences of length k): o Scan the database to collect the support count for each candidate sequence o Generate candidate length-(k+1) sequences from the length-k frequent sequences, Apriori-style n Repeat until no frequent sequence or no candidate can be found 49
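GSP's per-level scan reduces to testing whether a candidate sequence is contained in each data sequence: every element of the candidate must be a subset of some later itemset of the data sequence. A minimal sketch of that containment test and the resulting support count, on a toy database (not the one from the slides):

```python
def contains(data_seq, candidate):
    """True if candidate (a list of itemsets) is a subsequence of data_seq."""
    pos = 0
    for element in candidate:
        while pos < len(data_seq) and not element <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1                          # the next element must match a later itemset
    return True

def support(db, candidate):
    return sum(1 for seq in db if contains(seq, candidate))

# Toy sequence database: each sequence is a list of itemsets (sets).
db = [
    [{'a'}, {'b', 'c'}, {'d'}],
    [{'a', 'd'}, {'c'}, {'b'}],
    [{'b'}, {'a'}, {'c'}, {'d'}],
]
print(support(db, [{'a'}, {'c'}]))        # 'a' followed by 'c'        -> 3
print(support(db, [{'a'}, {'b', 'c'}]))   # 'a' followed by {'b','c'}  -> 1
```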
Finding Length-1 Sequential Patterns o Initial candidates: all single-item sequences, e.g., <a>, <b>, <c>, … o Scan the database once to count the support of each candidate
Generating Length-2 Candidates o Join the length-1 patterns: candidates of the form <x y> (x then y) and <(x y)> (x and y in the same itemset)
The GSP Mining Process. Level-wise mining: at each scan, candidates that cannot pass the support threshold, or that do not appear in the DB at all, are pruned; for example, the 4th scan checks 8 candidates and yields 6 length-4 sequential patterns.
Outline o Association Rule Mining n Apriori n FP-growth o Sequential Pattern Mining n GSP n SPADE, SPAM n PrefixSpan 53
SPADE Algorithm o Proposed by Zaki [MLJ'01] o Vertical id-list data representation o Support counting via temporal joins o Candidates to be tested are kept in local candidate lists 54
SPADE Algorithm. Example database: 10: <b d b c b d a b a d>, 20: <d c a a b c b d a b>, 30: <c a d c a>. In the vertical id-list representation each item is stored as a list of (SID, position) pairs, e.g., item a occurs in sequence 10 at positions 7 and 9, in sequence 20 at positions 3, 4, and 9, and in sequence 30 at positions 2 and 5. 55
Temporal Joins (S-Step). To count the support of <a b>, join the id-lists of a and b: for each sequence, keep the positions of b that occur after some position of a. In the example, <a b> occurs in sequences 10 and 20, so Sup(<a b>) = 2; likewise Sup(<b a>) = 2 (min_support = 2). 56
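A small sketch of the S-step temporal join on (SID, position) id-lists, using the three sequences from the previous slide; it reproduces Sup(<a b>) = Sup(<b a>) = 2. Only single-item prefixes are handled, which is all this example needs:

```python
from collections import defaultdict

def idlist(db, item):
    """Vertical id-list: (sid, position) pairs for one item."""
    return [(sid, pos) for sid, seq in db.items()
            for pos, x in enumerate(seq, start=1) if x == item]

def s_step(prefix_idlist, item_idlist):
    """Temporal join: keep item occurrences strictly after a prefix occurrence."""
    first_prefix = defaultdict(lambda: float('inf'))
    for sid, pos in prefix_idlist:
        first_prefix[sid] = min(first_prefix[sid], pos)
    return [(sid, pos) for sid, pos in item_idlist if pos > first_prefix[sid]]

def support(idl):
    return len({sid for sid, _ in idl})

db = {10: list("bdbcbdabad"), 20: list("dcaabcbdab"), 30: list("cadca")}
a, b = idlist(db, 'a'), idlist(db, 'b')
print(support(s_step(a, b)))   # Sup(<a b>) = 2
print(support(s_step(b, a)))   # Sup(<b a>) = 2
```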
Outline o Association Rule Mining n Apriori n FP-growth o Sequential Pattern Mining n GSP n SPADE, SPAM n PrefixSpan 57
SPAM Algorithm o Proposed by Ayres et al. [KDD'02] o Key idea based on SPADE o Uses a bitmap data representation o Much faster than SPADE, but more space-consuming 58
SPAM Algorithm. Each data sequence, e.g., <a c (b c) d (a b c) a d>, is encoded as one bitmap per item, with one bit for each itemset position of the sequence (SID and position index the bits). 59
SPAM Temporal Joins (sequence-extension step). To extend prefix {a} with {b}, transform the bitmap of {a} so that all positions strictly after its first set bit are 1, then AND it with the bitmap of {b}; a sequence supports <a b> if the result has any bit set. In the example, Sup(<a b>) = 2 (min_support = 2). 60
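A sketch of the bitmap S-step using Python integers as per-sequence bitmaps (bit i is set when the item occurs in the i-th itemset); the database is the same toy one used for SPADE above, so the supports match:

```python
def bitmap(seq, item):
    """One integer bitmap per sequence: bit i set if item is in itemset i."""
    bits = 0
    for i, itemset in enumerate(seq):
        if item in itemset:
            bits |= 1 << i
    return bits

def s_transform(prefix_bits):
    """Set every bit strictly after the first set bit of the prefix bitmap."""
    if prefix_bits == 0:
        return 0
    first = prefix_bits & -prefix_bits                 # lowest set bit
    return ~((first << 1) - 1) & ((1 << 64) - 1)       # higher positions, 64-bit window

def s_step_support(db, prefix_item, item):
    count = 0
    for seq in db:
        joined = s_transform(bitmap(seq, prefix_item)) & bitmap(seq, item)
        count += joined != 0
    return count

# Same toy sequences as in the SPADE example, one item per itemset.
db = [[{c} for c in "bdbcbdabad"],
      [{c} for c in "dcaabcbdab"],
      [{c} for c in "cadca"]]
print(s_step_support(db, 'a', 'b'))   # Sup(<a b>) = 2
print(s_step_support(db, 'b', 'a'))   # Sup(<b a>) = 2
```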
Outline o Association Rule Mining n Apriori n FP-growth o Sequential Pattern Mining n GSP n SPADE, SPAM n PrefixSpan 61
PrefixSpan (Projection-based Approach) o Proposed by Pei et al. [ICDE'01] o Based on pattern growth o Prefix/postfix (projection) representation o Basic idea: use frequent items to recursively project the sequence database into smaller projected databases and grow patterns in each projected database. 62
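A simplified PrefixSpan sketch for sequences of single items (no multi-item elements): for each frequent item, record the pattern, project every sequence onto the suffix after the item's first occurrence, and recurse on the projected database. The toy database is the one used in the SPADE example:

```python
from collections import Counter

def prefixspan(db, min_support, prefix=()):
    """Simplified PrefixSpan for sequences of single items."""
    patterns = {}
    # Support of each item = number of (projected) sequences containing it.
    counts = Counter(item for seq in db for item in set(seq))
    for item, sup in counts.items():
        if sup < min_support:
            continue
        new_prefix = prefix + (item,)
        patterns[new_prefix] = sup
        # Project: keep the suffix after the first occurrence of item.
        projected = []
        for seq in db:
            if item in seq:
                suffix = seq[seq.index(item) + 1:]
                if suffix:
                    projected.append(suffix)
        patterns.update(prefixspan(projected, min_support, new_prefix))
    return patterns

db = [list("bdbcbdabad"), list("dcaabcbdab"), list("cadca")]
for pat, sup in sorted(prefixspan(db, min_support=2).items()):
    print(''.join(pat), sup)
```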
Example. A small sequence database (SDB) is projected, frequent item by frequent item, into successively smaller projected databases, and patterns are grown within each projection.
Recommendation Approaches o Collaborative filtering o Content based strategies n Association Rule Mining n Text similarity based n Clustering n Classification o Hybrid approaches 64
Text Similarity based Techniques o Vector Space Model (VSM) n TF-IDF o Semantic resource based n WordNet n Wiki n Web 65
All Information about Users and Items. User Profile: (1) Attribute: Nationality, Sex, Age, Hobby, etc. (2) Text: Personal description. (3) Link: Social network. Item Profile: (1) Attribute: Price, Weight, Color, Brand, etc. (2) Text: Product description. (3) Link: Associated relations between items (i.e., co-purchased by the same user). Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc). 66
All Information about Users and Items. User Profile: (1) Attribute: Nationality, Sex, Age, Hobby, etc. (2) Text: Personal description, e.g., "I like car, movie, music …". (3) Link: Social network. Item Profile: (1) Attribute: Price, Weight, Color, Brand, etc. (2) Text: Product description, e.g., "This car is nicely equipped with auto air conditioning …". (3) Link: Associated relations between items (i.e., co-purchased by the same user). Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc). 67
Profile Representation: Vector Space Model. User profile: structured data (attributes such as book, car, TV, …) plus free text ("I like car, movie, music…"). Item profile: structured data (attributes such as name, color, price, …) plus free text ("This car is nicely equipped with auto air conditioning…"). Both profiles are represented as term vectors (book, car, TV, bike, …) and compared with cosine similarity cos(θ), optionally with per-term weights (weighted cosine similarity). 68
TF*IDF Weighting o Term frequency of a term t in a document d: tf_{t,d} = n_{t,d}, the number of times t appears in d o Inverse document frequency of a term t: idf_t = log( N / df_t ), where df_t is the number of documents in which t appears and N is the total number of documents o TF*IDF weight of t in d: w_{t,d} = tf_{t,d} × idf_t 69
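A short sketch of these formulas on a toy corpus, using raw counts for tf and log(N/df) for idf as defined above; production systems often use smoothed or length-normalized variants:

```python
import math
from collections import Counter

docs = ["this car is nicely equipped with auto air conditioning",
        "i like car movie music",
        "a nicely illustrated book about music"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)                        # raw counts n_{t,d}
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

weights = [tfidf(doc) for doc in tokenized]
print(weights[0]["car"], weights[0]["conditioning"])   # common term vs. rarer term
```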
Profile Representation n Unstructured data • e.g., a text description or review of the restaurant, or news articles p No attribute names with well-defined values p Natural language complexity: n Same word with different meanings n Different words with the same meaning n Need to impose structure on free text before it can be used in a recommendation algorithm 70
All Information about Users and Items. User Profile: (1) Attribute: Nationality, Sex, Age, Hobby, etc. (2) Text: Personal description, e.g., "I like automobile, movie, music …". (3) Link: Social network. Item Profile: (1) Attribute: Price, Weight, Color, Brand, etc. (2) Text: Product description, e.g., "This car is nicely equipped with auto air conditioning …". (3) Link: Associated relations between items (i.e., co-purchased by the same user). Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc). 71
Text Similarity based Techniques o Vector Space Model (VSM) n TF-IDF o Semantic resource based n WordNet n Wiki n Web 72
Knowledge-based Similarity p Knowledge source: WordNet p Intuition: two words are similar if they are close to each other in the taxonomy p Measure approaches: n Shortest-path based [Rada, SMC'89][Wu, ACL'94][Leacock'98] n Information-content based [Resnik, IJCAI'95][Jiang, ROCLING'97][Lin, ICML'98] 73
Knowledge-based word semantic similarity o Leacock & Chodorow (1998): sim(c1, c2) = -log( path_length(c1, c2) / (2 D) ), where D is the depth of the taxonomy o Wu & Palmer (1994): sim(c1, c2) = 2 depth(LCS(c1, c2)) / (depth(c1) + depth(c2)), where LCS is the least common subsumer o Lesk (1986): n Finds the overlap between the dictionary entries (glosses) of the two words 74
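A small sketch of WordNet-based similarity using NLTK's WordNet interface; this assumes nltk is installed and the WordNet corpus has been downloaded, and the word choices are illustrative:

```python
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

car = wn.synsets('car', pos=wn.NOUN)[0]            # most common noun sense of "car"
automobile = wn.synsets('automobile', pos=wn.NOUN)[0]
dog = wn.synsets('dog', pos=wn.NOUN)[0]

# Path-based and Wu & Palmer similarities from the WordNet taxonomy.
print(car.path_similarity(automobile), car.wup_similarity(automobile))  # ~1.0 (same synset)
print(car.path_similarity(dog), car.wup_similarity(dog))                # much lower
```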
Text Similarity based Techniques o Vector Space Model (VSM) n TF-IDF o Semantic resource based n WordNet n Wiki n Web 75
Explicit Semantic Analysis (ESA) o Proposed by Gabrilovich [IJCAI'07] o Maps a text to a vector of concepts in Wikipedia o Calculates the ESA score with a common vector-based measure (i.e., cosine) 76
ESA Process. This figure is from Gabrilovich, IJCAI'07. 77
ESA Example p Text 1: The dog caught the red ball. p Text 2: A labrador played in the park. p Wikipedia concept weights: Glossary of cue sports terms (T1: 2711, T2: 108), American Football Strategy (T1: 402, T2: 171), Baseball (T1: 487, T2: 107), Boston Red Sox (T1: 528, T2: 74). p Similarity score: 14.38%. (This slide is from Rada Mihalcea.) 78
Text Similarity based Techniques o Vector Space Model (VSM) n TF-IDF o Semantic resource based n WordNet n Wiki n Web 79
Corpus-based Similarity o Corpus data n The Web (via a search engine) o Intuition: n Two words are similar if they frequently occur in the same pages n PMI-IR [Turney, ECML'01] 80
PMI-IR o Pointwise Mutual Information (Church & Hanks, 1989): PMI(w1, w2) = log2( p(w1, w2) / (p(w1) p(w2)) ) o PMI-IR (Turney, 2001) estimates the probabilities from search-engine hit counts: p(w) ≈ hits(w) / N and p(w1, w2) ≈ hits(w1 AND w2) / N, where N is the number of Web pages 81
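A tiny sketch of PMI-IR computed from hit counts; the counts below are made up for illustration, whereas in Turney's setting they would come from search-engine queries:

```python
import math

def pmi_ir(hits_w1, hits_w2, hits_both, n_pages):
    """PMI estimated from page hit counts: log2( p(w1,w2) / (p(w1) p(w2)) )."""
    p1, p2 = hits_w1 / n_pages, hits_w2 / n_pages
    p12 = hits_both / n_pages
    return math.log2(p12 / (p1 * p2))

# Hypothetical hit counts for ("car", "automobile") over 10^10 pages.
print(pmi_ir(hits_w1=2_000_000_000, hits_w2=150_000_000,
             hits_both=90_000_000, n_pages=10_000_000_000))
```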
Recommendation Approaches o Collaborative filtering o Content based strategies n Association Rule Mining n Text similarity based n Clustering n Classification o Hybrid approaches 82
All Information about Users and Items. User Profile: (1) Attribute: Nationality, Sex, Age, Hobby, etc. (2) Text: Personal description. (3) Link: Social network. Item Profile: (1) Attribute: Price, Weight, Color, Brand, etc. (2) Text: Product description. (3) Link: Associated relations between items (i.e., co-purchased by the same user). Observed preferences (purchases, ratings, page views, play lists, bookmarks, etc). Clustering is applied over this data. 83
Clustering o K-means o Hierarchical Clustering 84
K-means o Introduced by MacQueen (1967) o Works when we know k, the number of clusters we want to find o Idea: n Randomly pick k points as the "centroids" of the k clusters n Loop: o For each point, put the point in the cluster whose centroid it is closest to o Recompute the cluster centroids o Repeat the loop until there is no change in clusters between two consecutive iterations o Iteratively improves the objective function: the sum of the squared distances from each point to the centroid of its cluster 85
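A minimal numpy sketch of the loop described above (random initial centroids, assign, recompute, stop when the centroids no longer move); the 2-D points are illustrative and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Assign each point to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids (no empty-cluster handling in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```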
K-means Example (K = 2): randomly pick seeds, assign points to clusters, compute centroids, reassign clusters, compute centroids, reassign clusters; converged! 86
Clustering o K-means o Hierarchical Clustering 87
Hierarchical Clustering o Two types: n Agglomerative (bottom up) n Divisive (top down) o Agglomerative: two groups are merged if the distance between them is less than a threshold o Divisive: one group is split into two if the inter-group distance is more than a threshold o Can be expressed by an excellent graphical representation called a dendrogram 88
Hierarchical Clustering dendrogram 89
Hierarchical Agglomerative Clustering o Put every point in a cluster by itself. For i = 1 to N-1 do { let C1 and C2 be the most mergeable pair of clusters; create C1,2 as the parent of C1 and C2 } o Example (1-dimensional objects for simplicity): n Numerical objects: 1, 2, 5, 6, 7 o Agglomerative clustering: n Find the two closest objects and merge: => {1, 2}, so we now have {1.5, 5, 6, 7} n => {1, 2}, {5, 6}, so {1.5, 5.5, 7} n => {1, 2}, {{5, 6}, 7} 90
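A small sketch that reproduces the 1-dimensional trace above with a greedy centroid-linkage agglomeration: at each step the two clusters with the closest centroids are merged and replaced by their centroid:

```python
def agglomerative_1d(points):
    """Greedy centroid-linkage agglomeration on 1-D points; prints each merge."""
    clusters = [([p], p) for p in points]            # (members, centroid)
    while len(clusters) > 1:
        # Find the pair of clusters with the closest centroids.
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: abs(clusters[ij[0]][1] - clusters[ij[1]][1]))
        members = clusters[i][0] + clusters[j][0]
        merged = (members, sum(members) / len(members))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        print("merged:", merged[0],
              "centroids now:", [round(c[1], 2) for c in clusters])

agglomerative_1d([1, 2, 5, 6, 7])
# Merges {1, 2} first, then {5, 6}, then {{5, 6}, 7}, then everything.
```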
Recommendation Approaches o Collaborative filtering o Content based strategies n Association Rule Mining n Text similarity based n Clustering n Classification o Hybrid approaches 91
Illustrating Classification Task
Classification o k-Nearest Neighbor (kNN) o Decision Tree o Naïve Bayesian o Artificial Neural Network o Support Vector Machine o Ensemble methods 93
k-Nearest Neighbor Classification (kNN) o kNN does not build a model from the training data o Approach: n To classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d n Count the number n of training instances in P that belong to class cj n Estimate Pr(cj|d) as n/k (majority vote) o No training is needed; classification time is linear in the training-set size for each test case o k is usually chosen empirically via a validation set or cross-validation, by trying a range of k values o The distance function is crucial, but depends on the application 94
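A minimal kNN sketch implementing the approach above (Euclidean distance, majority vote among the k nearest training instances); the Book/Car/Clothes classes echo the example on the next slides, but the feature vectors are made up:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label); majority vote among the k nearest."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D training set with three classes (illustrative data).
train = [((1.0, 1.0), "Book"), ((1.2, 0.9), "Book"),
         ((5.0, 5.0), "Car"),  ((5.1, 4.8), "Car"), ((4.9, 5.2), "Car"),
         ((9.0, 1.0), "Clothes"), ((8.8, 1.2), "Clothes")]

print(knn_classify(train, (4.8, 5.1), k=1))   # nearest neighbor -> "Car"
print(knn_classify(train, (4.8, 5.1), k=3))   # 3-NN majority    -> "Car"
```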
Example: k = 1 (1-NN). The classes are Car, Book, and Clothes; the query point's single nearest neighbor is a Book, so it is classified as Book. 95
Example: k = 3 (3-NN). Among the query point's three nearest neighbors the majority class is Car, so it is classified as Car. 96
Discussion n Advantages: o Nonparametric architecture o Simple o Powerful o Requires no training time n Disadvantages: o Memory intensive o Classification/estimation is slow o Sensitive to the choice of k 97
Classification o k-Nearest Neighbor (kNN) o Decision Tree o Naïve Bayesian o Artificial Neural Network o Support Vector Machine o Ensemble methods 98
Example of a Decision Tree o Judge the cheat possibility: Yes/No o Training data with two categorical attributes (Refund, Marital Status), one continuous attribute (Taxable Income), and the class label (Cheat)
Example of a Decision Tree o Judge the cheat possibility: Yes/No o Model: decision tree learned from the training data, with splitting attributes: Refund (Yes -> NO; No -> MarSt), MarSt (Single, Divorced -> TaxInc; Married -> NO), TaxInc (< 80K -> NO; > 80K -> YES)
Another Example of a Decision Tree o Judge the cheat possibility: Yes/No o A second tree that fits the same data: MarSt (Married -> NO; Single, Divorced -> Refund), Refund (Yes -> NO; No -> TaxInc), TaxInc (< 80K -> NO; > 80K -> YES) o There could be more than one tree that fits the same data!
Decision Tree - Construction o Creating Decision Trees n Manual - Based on expert knowledge n Automated - Based on training data (DM) o Two main issues: n Issue #1: Which attribute to take for a split? n Issue #2: When to stop splitting?
Classification o k-Nearest Neighbor (kNN) o Decision Tree n CART n C4.5 o Naïve Bayesian o Artificial Neural Network o Support Vector Machine o Ensemble methods 103
The CART Algorithm o Classification And Regression Trees o Developed by Breiman et al. in the early 80's n Introduced tree-based modeling into the statistical mainstream n A rigorous approach involving cross-validation to select the optimal tree 104
Key Idea: Recursive Partitioning o Take all of your data. o Consider all possible values of all variables. o Select the variable/value (X = t1) that produces the greatest "separation" in the target. o (X = t1) is called a "split". o If X < t1, send the data point to the "left"; otherwise, send it to the "right". o Now repeat the same process on these two "nodes". o You get a "tree". o Note: CART only uses binary splits. 105
Key Idea o Let Φ(s|t) be a measure of the "goodness" of a candidate split s at node t: Φ(s|t) = 2 P_L P_R Q(s|t), where P_L and P_R are the proportions of records sent to the left and right child nodes, and Q(s|t) = Σ_j | P(j|t_L) − P(j|t_R) | sums, over the classes j, the differences in class proportions between the two children o The optimal split maximizes Φ(s|t) over all possible splits at node t. 106
Key Idea o Φ(s|t) is large when both of its main components, 2 P_L P_R and Q(s|t), are large: 1. 2 P_L P_R is maximal when the child nodes are of equal size (same support), e.g., 0.5 × 0.5 = 0.25 versus 0.9 × 0.1 = 0.09 2. Q(s|t) = Σ_j | P(j|t_L) − P(j|t_R) | is maximal when, for each class, the child nodes are completely uniform (pure) n The theoretical maximum value of Q(s|t) is k, where k is the number of classes of the target variable 107
CART Example Training Set of Records for Classifying Credit Risk 108
CART Example: Candidate Splits. p CART is restricted to binary splits. Candidate splits for t = root node, with the right child taking the complement of the left-child condition:
1: Savings = low vs. Savings ∈ {medium, high}
2: Savings = medium vs. Savings ∈ {low, high}
3: Savings = high vs. Savings ∈ {low, medium}
4: Assets = low vs. Assets ∈ {medium, high}
5: Assets = medium vs. Assets ∈ {low, high}
6: Assets = high vs. Assets ∈ {low, medium}
7: Income <= $25,000 vs. Income > $25,000
8: Income <= $50,000 vs. Income > $50,000
9: Income <= $75,000 vs. Income > $75,000
109
CART Primer o Split 1: Savings = low (left if true, right if false) n Left child: records 2, 5, 7 n Right child: records 1, 3, 4, 6, 8 o P_L = 3/8 = 0.375, P_R = 5/8 = 0.625, so 2 P_L P_R = 30/64 = 0.46875 o P(j = Bad | t): n P(Bad | t_R) = 1/5 = 0.2 n P(Bad | t_L) = 2/3 = 0.67 o P(j = Good | t): n P(Good | t_R) = 4/5 = 0.8 n P(Good | t_L) = 1/3 = 0.33 o Q(s|t) = |2/3 − 1/5| + |1/3 − 4/5| = 14/15 ≈ 0.934 o Φ(s|t) = 0.46875 × 0.934 ≈ 0.438 110
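A small sketch of the Φ(s|t) computation, applied to split 1 as worked out above (left child: 2 Bad and 1 Good; right child: 1 Bad and 4 Good):

```python
def phi(left_labels, right_labels):
    """CART split goodness Phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|."""
    n_l, n_r = len(left_labels), len(right_labels)
    p_l, p_r = n_l / (n_l + n_r), n_r / (n_l + n_r)
    classes = set(left_labels) | set(right_labels)
    q = sum(abs(left_labels.count(j) / n_l - right_labels.count(j) / n_r)
            for j in classes)
    return 2 * p_l * p_r * q

# Split 1 from the slides: left = records with Savings = low (2 Bad, 1 Good),
# right = the remaining records (1 Bad, 4 Good).
left  = ["Bad", "Bad", "Good"]
right = ["Bad", "Good", "Good", "Good", "Good"]
print(round(phi(left, right), 4))   # ~0.4375, matching the 0.4378 in the slide table
```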
CART Example. For each candidate split, the components of Φ(s|t) are computed (P_L, P_R, the class proportions in each child node, 2 P_L P_R, and Q(s|t)). The largest value, Φ(s|t) = 0.6248, is obtained by split 4 (Assets = low), so it is selected for the root node; the runners-up are splits 2 and 8 with Φ = 0.5625 and splits 1 and 7 with Φ = 0.4378. 111
CART Example. CART decision tree after the initial split: the root node (all records) is split on Assets = Low vs. Assets ∈ {Medium, High}; Assets = Low leads to a Bad Risk leaf (records 2, 7), and Assets ∈ {Medium, High} leads to Decision Node A (records 1, 3, 4, 5, 6, 8). 112
CART Example. Values of the components of Φ(s|t) for each candidate split on Decision Node A: the largest value, Φ(s|t) = 0.4444, is shared by split 3 (Savings = high) and split 7 (Income <= $25,000); the tree shown next uses split 3. 113
CART Example. CART decision tree after the Decision Node A split: node A (records 1, 3, 4, 5, 6, 8) is split on Savings; Savings = High leads to Decision Node B (records 3, 6), and Savings ∈ {Low, Medium} leads to a Good Risk leaf (records 1, 4, 5, 8). The root split (Assets = Low -> Bad Risk, records 2, 7) is unchanged. 114
CART Example. The fully grown CART decision tree: Assets = Low -> Bad Risk (records 2, 7); Assets ∈ {Medium, High} -> Decision Node A; at node A, Savings ∈ {Low, Medium} -> Good Risk (records 1, 4, 5, 8) and Savings = High -> Decision Node B (records 3, 6); at node B, Assets = High -> Good Risk (record 6) and Assets = Medium -> Bad Risk (record 3). 115
Classification o k-Nearest Neighbor (kNN) o Decision Tree n CART n C4.5 o Naïve Bayesian o Artificial Neural Network o Support Vector Machine o Ensemble methods 116
The C4.5 Algorithm o Proposed by Quinlan in 1993 o An internal node represents a test on an attribute o A branch represents an outcome of the test, e.g., Color = red o A leaf node represents a class label or class label distribution o At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible o A new case is classified by following a matching path to a leaf node 117
The C4.5 Algorithm o Differences between CART and C4.5: n Unlike CART, C4.5 is not restricted to binary splits: it produces a separate branch for each value of a categorical attribute n C4.5's method for measuring node homogeneity differs from CART's 118
The C4.5 Algorithm - Measure o A candidate split S partitions the training data set T into subsets T1, T2, ..., Tk o C4.5 uses the concept of entropy reduction to select the optimal split: entropy_reduction(S) = H(T) − H_S(T), where the entropy is H(X) = −Σ_j p_j log2(p_j), with p_j the proportion of records in class j, and H_S(T) = Σ_i (|T_i| / |T|) H(T_i) is the weighted sum of the entropies of the individual subsets T1, T2, ..., Tk o C4.5 chooses the optimal split as the one with the greatest entropy reduction 119
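A short sketch of the entropy and entropy-reduction formulas above; the parent labels and the three-way partition are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_j p_j log2 p_j over the class proportions p_j."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_reduction(parent_labels, subsets):
    """Entropy reduction (information gain) of a split partitioning parent_labels."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Illustrative split: 8 records (5 Good, 3 Bad) partitioned by a 3-valued attribute.
parent = ["Good"] * 5 + ["Bad"] * 3
subsets = [["Good", "Good", "Bad"], ["Good", "Good", "Good"], ["Bad", "Bad"]]
print(round(entropy(parent), 4), round(entropy_reduction(parent, subsets), 4))
```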
Classification o k-Nearest Neighbor (kNN) o Decision Tree o Naïve Bayesian o Artificial Neural Network o Support Vector Machine o Ensemble methods 120
Bayes Rule o Recommender-system question n Li is the class for item i (i.e., that the user likes item i) n A is the set of features associated with item i o Estimate p(Li|A) o Bayes' rule: p(Li|A) = p(A|Li) p(Li) / p(A) o We can always restate a conditional probability in terms of: n The reverse condition p(A|Li) n Two prior probabilities, p(Li) and p(A) o Often the reverse condition is easier to estimate n We can count how often a feature appears in items the user liked n (Frequentist assumption) 121
Naive Bayes o Independence (naïve Bayes assumption): the features a1, a2, ..., ak are independent given the class o Joint probability: p(a1, ..., ak) = Π_j p(aj) under full independence o Conditional probability: p(a1, ..., ak | C) = Π_j p(aj | C) o Bayes' rule: p(C | a1, ..., ak) ∝ p(C) Π_j p(aj | C) 122
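A compact sketch of naïve Bayes classification with these formulas on a small made-up dataset (categorical features, add-one smoothing); it is illustrative, not the example from the following slides:

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_dict, class). Returns priors and per-class feature counts."""
    class_counts = Counter(c for _, c in examples)
    feature_counts = defaultdict(Counter)            # (class, feature) -> Counter of values
    for features, c in examples:
        for f, v in features.items():
            feature_counts[(c, f)][v] += 1
    return class_counts, feature_counts, len(examples)

def classify_nb(features, class_counts, feature_counts, n):
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                                # prior p(C)
        for f, v in features.items():                 # multiply p(a_j | C)
            # add-one smoothing; both features below take 2 values, hence +2.
            score *= (feature_counts[(c, f)][v] + 1) / (cc + 2)
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

# Illustrative training data: does the user like an item with these features?
data = [({"genre": "action", "price": "low"},  "like"),
        ({"genre": "action", "price": "high"}, "like"),
        ({"genre": "drama",  "price": "low"},  "like"),
        ({"genre": "drama",  "price": "high"}, "dislike"),
        ({"genre": "drama",  "price": "high"}, "dislike")]
model = train_nb(data)
print(classify_nb({"genre": "action", "price": "high"}, *model))   # -> ('like', ...)
```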
An Example Compute all probabilities required for classification 123
An Example o For class C = t, compute p(C = t) Π_j p(a_j | C = t) o For class C = f, compute p(C = f) Π_j p(a_j | C = f) o C = t gives the larger value, so t is the final class. 124
Naïve Bayesian Classifier o Advantages: n Easy to implement n Very efficient n Good results obtained in many applications o Disadvantages n Assumption of class-conditional independence, therefore loss of accuracy when the assumption is seriously violated (highly correlated data sets) 125
Classification o k-Nearest Neighbor (kNN) o Decision Tree o Naïve Bayesian o Artificial Neural Network o Support Vector Machine o Ensemble methods 126
References for Machine Learning o T. Mitchell, Machine Learning, McGraw-Hill, 1997 o C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 o T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Springer, 2001 o V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998 o Y. Kodratoff, R. S. Michalski, Machine Learning: An Artificial Intelligence Approach, Volume III, Morgan Kaufmann, 1990 127
Recommendation Approaches o Collaborative filtering n Nearest neighbor based n Model based o Content based strategies n Association Rule Mining n Text similarity based n Clustering n Classification o Hybrid approaches 128
The Netflix Prize. The slides in this part are from Yehuda Koren.
Netflix o Movie rentals by DVD (mail) and online (streaming) o 100k movies, 10 million customers o Ships 1.9 million disks to customers each day n 50 warehouses in the US n A complex logistics problem o Employees: 2000 n But relatively few in engineering/software n And only a few people working on recommender systems o Moving towards online delivery of content o Significant interaction of customers with the Web site 130
The $1 Million Question 131
Million Dollars Awarded Sept 21st, 2009 132
Lessons Learned o Scale is important n e.g., stochastic gradient descent on sparse matrices o Latent factor models work well on this problem n Previously they had not been explored for recommender systems o Understanding your data is important, e.g., time effects o Combining models works surprisingly well n But the final 10% improvement can probably be achieved by judiciously combining about 10 models rather than 1000's n This is likely what Netflix will do in practice 134
Useful References o Y. Koren, Collaborative filtering with temporal dynamics, ACM SIGKDD Conference 2009 o Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, 2009 o Y. Koren, Factor in the neighbors: scalable and accurate collaborative filtering, ACM Transactions on Knowledge Discovery in Data, 2010 135
Thank you! 136