I L M I K N I N Jerry Scripps N G
Overview n n n What is link mining? Motivation Preliminaries n n definitions metrics network types Link mining techniques
What is Link Mining? Graph Theory Statistics Link Mining Data Mining Machine Learning Social Network Analysis Database
What is Link Mining? Examples: n Discovering communities within collaboration networks n Finding authoritative web pages on a given topic n Selecting the most influential people in a social network
Link Mining – Motivation Emerging Data Sets n n World wide web Social networking Collaboration databases etc.
Link Mining – Motivation Direct Applications n n n What is the community around msu. edu? What are the authoritative pages? Who has the most influence? Who is the likely member of terrorist cell? Is this a news story about crime, politics or business?
Link Mining – Motivation Indirect Applications n n Convert ordinary data sets into networks Integrate link mining techniques into other techniques
Preliminaries n n n Definitions Metrics Network Types
Definitions Community Node (vertex, point, object) Link (edge, arc)
Metrics Node n n Degree Closeness Betweenness Clustering coefficient Node Pair n n n Graph distance Min-cut Common neighbors Jaccard’s coef Adamic/adar Pref. attachment Katz Hitting time Rooted page. Rank sim. Rank Bibliographic metrics Network n n n Characteristic path length Clustering coefficient Min-cut
Network Types Watts & Strogatz Regular Small World Random
Networks – Scale-free n n n Barabasi & Bonabeau Degree follows a power law ~ 1/kn Can be found in a wide variety of real-world networks
Network recap Network Type Clustering coefficient Random Low Characteristic Power Law path length Low No Regular High No Small world High Low ? Scale-free ? ? Yes
Techniques n n n Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion
Link-Based Classification ? Include features from linked objects: n building a single model on all features n Fusion of link and attribute models
Link-Based Classification Chakrabarti, et al. n n Copying data from neighboring web pages actually reduced accuracy Using the label from neighboring page improved accuracy 111011 ? 101011 B 010010 A 011110 A
Link-Based Classification Lu & Getoor n Define vectors for attributes and links n n Attribute data OA(X) Link data LD(X) constructed using n n n mode (single feature – class of plurality) count (feature for each class – count for neighbors) binary (feature for each class – 0/1 if exists) 111011 ? 101011 B 010010 A 011110 A OA (attr) LD (link) 111011 2 110 A … … Model 1 Model 2
Link-Based Classification Lu & Getoor n Define probabilities for both n Attribute n n Link Class estimation:
Link-Based Classification Summary n n n Using class of neighbors improves accuracy Using separate models for attribute and link data further improves accuracy Other considerations: n n improvements are possible by using community information knowledge of network type could also benefit classifier
Techniques n n n Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion
Link Prediction
Link Prediction Liben-Nowell and Kleinberg Tested node-pair metrics: n Graph distance n Common neighbors n Jaccards coefficient n Adamic/adar n Preferential attachment n Katz n Hitting time n Rooted Page. Rank n Sim. Rank Neighborhood Ensemble of paths
Link Prediction - results
Link Prediction – summary n n There is room for growth – best predictor has accuracy of only around 9% Predicting collaborations is difficult Finding communities could help if most collaborations are intra-community New problem could be to predict the direction of the link
Techniques n n n Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion
Ranking
Ranking – Markov Chain Based n n n Random-surfer analogy Problem with cycles Page. Rank uses random vector
Ranking – summary n n Other methods such as HITS and SALSA also based on Markov chain Ranking has been applied in other areas: text summarization n anomaly detection n
Techniques n n n Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion
Influence
Maximizing influence model-based n n Problem – finding the k best nodes to activate to maximize the number of nodes activated Models: n n independent cascade – when activated a node has a one-time change to activate neighbors with prob. pij linear threshold – node becomes activated when the percent of its neighbors crosses a threshold
Maximizing influence model-based n n Models: independent cascade & linear threshold A function f: S→S*, can be created using either model Functions use monte-carlo, hill-climbing solution Submodular functions, where S T are proven in another work to be NP -C but by using a hill-climbing solution can get to within 1 -1/e of optimum.
Maximizing influence – cost/benefit n Assumptions: n n n If customer purchases profit is: n n n product x sells for $100 a discount of 10% can be offered to various prospective customers 90 if discount is offered 100 if discount is not offered Expected lift in profit (ELP) from offering discount is: n 90*P(buy|discount) - 100*P(buy|no discount)
Maximizing influence – cost/benefit n n Goal is to find M that maximizes global ELP Three approximations used: n n n single pass greedy hill-climbing n n n Xi is the decision of customer i to buy Y is vector of product attributes M is vector of marketing decision f is a function to set the ith element of M r 0 and r 1 are revenue gained c is the cost of marketing
Comparison of approaches Size of starting set uses attributes probabilities n n Cost/benefit variable - based on max. lift yes extracted from data set Model-based fixed no assigned to links An extension would be to spread influence to the most number of communities Improvements can be made in speed
Techniques n n n Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion
Communities
Gibson, Kleinberg and Raghavan Query Search Engine Root Set Base Set: add forward and back links Use HITS to find top 10 hubs and authorities
Reddy and Kitsuregawa n n n Bipartite graph Given an initial set of nodes T build I from the nodes pointed to from T Repeat: n n use relax_cocite to expand T and I prune T and I using dense bipartite graph function DBPG(T, I, α, β) T u v w I
Flake, Lawrence and Giles n n n n Uses Min-cut Start with seed set Add linked nodes Find nodes from outgoing links Create virtual source node Add virtual sink linking it to all nodes Find the min-cut of the virtual source and sink
Neville, Adler and Jensen A n n 0110 Distance based on links and attributes If link exists score is number of B common attributes zero otherwise 1100 score(A, B)=2, score(A, C)=1, score(B, C)=0 Used with 3 partitioning algorithms: n Karger’s Min-Cut n Major. Clust n Spectral partitioning by Shi & Malik C 1101
Communities - summary n n There are many options for building communities around a small group of nodes Possible future directions finding communities in networks having different link types n impact of network type on community finding techniques n
Techniques n n n Link-Based Classification Link Prediction Ranking Influential Nodes Community Finding Link Completion
Link Completion
Goldenberg, Kubica and Komarek Problem: given a network and n-1 members of a community find the nth n n n n random counting popular NB NN c. Graph Bayes. Net EBS and LR
Conclusions n n Link mining is a young, dynamic field of study with problem areas that continue to emerge and morph as techniques continue to evolve Opportunities for improvements exist in n n using community knowledge using network knowledge We are the living links in a life force that moves and plays around and through us, binding the deepest soils with the farthest stars. Alan Chadwick
Ranking n n n Based on Markov Chain Rank is sum of node weights from incoming links Breaks down when cycles exist 3 2 4 5 9 6 14 15 9
Ranking - continued n General approach n n Page. Rank HITS approach n n ap = authority score for p Bp = backlinks of p ap = authority score for p hp = hub score for p Bp = backlinks of p Normalize between iterations