  • Number of slides: 48

Link Mining
Jerry Scripps

Overview
  • What is link mining?
  • Motivation
  • Preliminaries
      • definitions
      • metrics
      • network types
  • Link mining techniques

What is Link Mining?
[Diagram: Link Mining shown alongside related fields: Graph Theory, Statistics, Data Mining, Machine Learning, Social Network Analysis, Database]

What is Link Mining?
Examples:
  • Discovering communities within collaboration networks
  • Finding authoritative web pages on a given topic
  • Selecting the most influential people in a social network

Link Mining – Motivation
Emerging Data Sets
  • World wide web
  • Social networking
  • Collaboration databases
  • etc.

Link Mining – Motivation
Direct Applications
  • What is the community around msu.edu?
  • What are the authoritative pages?
  • Who has the most influence?
  • Who is a likely member of a terrorist cell?
  • Is this a news story about crime, politics or business?

Link Mining – Motivation
Indirect Applications
  • Convert ordinary data sets into networks
  • Integrate link mining techniques into other techniques

Preliminaries
  • Definitions
  • Metrics
  • Network Types

Definitions
  • Community
  • Node (vertex, point, object)
  • Link (edge, arc)

Metrics
Node
  • Degree
  • Closeness
  • Betweenness
  • Clustering coefficient
Node Pair
  • Graph distance
  • Min-cut
  • Common neighbors
  • Jaccard's coefficient
  • Adamic/Adar
  • Preferential attachment
  • Katz
  • Hitting time
  • Rooted PageRank
  • SimRank
  • Bibliographic metrics
Network
  • Characteristic path length
  • Clustering coefficient
  • Min-cut
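
Four of the node-pair metrics listed above can be sketched in a few lines. This is a minimal illustration, not code from the presentation: the toy graph and all function names are hypothetical, and the definitions used are the common textbook ones (shared-neighbor count, intersection over union, sum of 1/log(degree) over shared neighbors, product of degrees).

```python
import math

# Hypothetical undirected toy graph: node -> set of neighbors.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}

def common_neighbors(g, x, y):
    """Number of neighbors shared by x and y."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Shared neighbors divided by all neighbors of either node."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    """Shared neighbors weighted by 1 / log(degree): rare neighbors count more.
    A neighbor shared by two distinct nodes has degree >= 2, so log(degree) > 0."""
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

def preferential_attachment(g, x, y):
    """Product of the two degrees."""
    return len(g[x]) * len(g[y])
```

Each function takes a pair of nodes, so scoring every unlinked pair and sorting by score gives a simple ranking of candidate links.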

Network Types
Watts & Strogatz: Regular, Small World, Random

Networks – Scale-free
  • Barabasi & Bonabeau
  • Degree distribution follows a power law, ~ 1/k^n
  • Can be found in a wide variety of real-world networks

Network recap

Network type   Clustering coefficient   Characteristic path length   Power law
Random         Low                      Low                          No
Regular        High                     High                         No
Small world    High                     Low                          ?
Scale-free     ?                        ?                            Yes
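
The two network-level metrics in the recap can be computed directly on a small regular network. The sketch below is illustrative only: the ring-lattice construction, the function names, and the choice of n = 12 nodes with k = 4 links each are assumptions, not taken from the slides.

```python
from collections import deque

def ring_lattice(n, k):
    """Regular ring: each node links to its k nearest neighbors (k even)."""
    g = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            g[i].add(j)
            g[j].add(i)
    return g

def clustering_coefficient(g):
    """Average over nodes of (links among neighbors) / (possible links among neighbors)."""
    total = 0.0
    for nbrs in g.values():
        d = len(nbrs)
        if d < 2:
            continue  # a node with fewer than two neighbors contributes 0
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in g[u])
        total += links / (d * (d - 1) / 2)
    return total / len(g)

def characteristic_path_length(g):
    """Mean shortest-path length over all node pairs (graph assumed connected)."""
    total, pairs = 0, 0
    for src in g:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in g[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

lattice = ring_lattice(12, 4)
```

Rewiring a few lattice links at random and recomputing both values is one way to observe the small-world effect described by Watts & Strogatz.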

Techniques
  • Link-Based Classification
  • Link Prediction
  • Ranking
  • Influential Nodes
  • Community Finding
  • Link Completion

Link-Based Classification
Include features from linked objects:
  • Building a single model on all features
  • Fusion of link and attribute models

Link-Based Classification
Chakrabarti, et al.
  • Copying data from neighboring web pages actually reduced accuracy
  • Using the label from neighboring pages improved accuracy
[Figure: a page with attributes 111011 and unknown class (?), shown with neighbors 101011 (class B), 010010 (class A) and 011110 (class A)]

Link-Based Classification
Lu & Getoor
  • Define vectors for attributes and links
      • Attribute data OA(X)
      • Link data LD(X), constructed using:
          • mode (single feature: class of plurality)
          • count (feature for each class: count for neighbors)
          • binary (feature for each class: 0/1 if exists)
[Figure: for the example node, the attribute vector OA (111011) feeds Model 1 and the link vector LD, built from the neighbors' classes, feeds Model 2]
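
The three link-feature constructions can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the tie-breaking rule for the mode (first class in the given order wins) are hypothetical, and the example neighbor classes B, A, A are read off the example figure.

```python
from collections import Counter

def link_features(neighbor_classes, classes):
    """Build mode, count and binary link features from the neighbors' class labels."""
    counts = Counter(neighbor_classes)
    count_vec = [counts[c] for c in classes]            # one count per class
    binary_vec = [1 if n > 0 else 0 for n in count_vec]  # 1 if any neighbor has the class
    # Plurality class; ties go to the earlier entry in `classes` (an assumption).
    mode = max(classes, key=lambda c: counts[c]) if neighbor_classes else None
    return {"mode": mode, "count": count_vec, "binary": binary_vec}

# Neighbors of the unlabeled example node have classes B, A and A.
features = link_features(["B", "A", "A"], classes=["A", "B"])
```

The count form keeps the most information and the mode form the least, which is one reason to compare all three on a given data set.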

Link-Based Classification
Lu & Getoor
  • Define probabilities for both
      • Attribute
      • Link
  • Class estimation:
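
One way to write the class estimation, assuming the attribute model and the link model each yield a class probability and the two are fused by multiplication (the product form is an assumption here, not recovered from the slide):

```latex
\hat{C}(X) = \arg\max_{c} \; P\bigl(c \mid OA(X)\bigr) \, P\bigl(c \mid LD(X)\bigr)
```

Here $OA(X)$ and $LD(X)$ are the attribute and link vectors defined earlier, and $c$ ranges over the class labels.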

Link-Based Classification
Summary
  • Using class of neighbors improves accuracy
  • Using separate models for attribute and link data further improves accuracy
  • Other considerations:
      • improvements are possible by using community information
      • knowledge of network type could also benefit classifier

Link Prediction

Link Prediction
Liben-Nowell and Kleinberg
Tested node-pair metrics:
  • Graph distance
  • Neighborhood:
      • Common neighbors
      • Jaccard's coefficient
      • Adamic/Adar
      • Preferential attachment
  • Ensemble of paths:
      • Katz
      • Hitting time
      • Rooted PageRank
      • SimRank
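
As an example of a path-ensemble metric, a truncated Katz score can be sketched as below. The sketch assumes the usual definition (sum over path lengths l of beta^l times the number of length-l paths between the pair, counted through powers of the adjacency matrix) and cuts the sum off at `max_len`; the function name, the default `beta`, and the cutoff are all hypothetical choices.

```python
def katz_scores(adj, beta=0.05, max_len=4):
    """Truncated Katz: sum over l = 1..max_len of beta**l * (adj**l)[i][j]."""
    n = len(adj)
    power = [row[:] for row in adj]            # adj ** 1
    scores = [[0.0] * n for _ in range(n)]
    for length in range(1, max_len + 1):
        for i in range(n):
            for j in range(n):
                scores[i][j] += (beta ** length) * power[i][j]
        # Advance to the next matrix power: power = power @ adj
        power = [[sum(power[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
    return scores

# Hypothetical three-node path graph 0 - 1 - 2.
path_graph = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
scores = katz_scores(path_graph, beta=0.05, max_len=2)
```

Small values of `beta` make short paths dominate, so the score behaves much like a common-neighbors count; larger values give longer paths more weight.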

Link Prediction - results

Link Prediction – summary
  • There is room for growth: the best predictor has accuracy of only around 9%
  • Predicting collaborations is difficult
  • Finding communities could help if most collaborations are intra-community
  • A new problem could be to predict the direction of the link

Ranking

Ranking – Markov Chain Based
  • Random-surfer analogy
  • Problem with cycles
  • PageRank uses random vector
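
The random-surfer idea can be sketched as a short power iteration. This is an illustration, not the presentation's code: it assumes a uniform random-jump vector, a damping factor of 0.85, and that every page appears as a key of `links`; the function and parameter names are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it points to (every page must be a key)."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random jump: with probability 1 - damping the surfer restarts anywhere.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new_rank[w] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank

# Hypothetical three-page web: a and b both point to c, c points back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

The random jump is what keeps rank from being trapped in cycles: without it, a closed loop of pages would absorb rank and never release it.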

Ranking – summary
  • Other methods such as HITS and SALSA are also based on Markov chains
  • Ranking has been applied in other areas:
      • text summarization
      • anomaly detection

Influence

Maximizing influence – model-based
  • Problem: finding the k best nodes to activate to maximize the number of nodes activated
  • Models:
      • independent cascade: when activated, a node has a one-time chance to activate each neighbor j with probability pij
      • linear threshold: a node becomes activated when the fraction of its active neighbors crosses a threshold

Maximizing influence – model-based
  • Models: independent cascade and linear threshold
  • A function f: S→S* can be created using either model
  • Functions use a Monte Carlo, hill-climbing solution
  • Submodular functions, i.e. f(S ∪ {v}) - f(S) ≥ f(T ∪ {v}) - f(T) where S ⊆ T, are proven in another work to be NP-complete to maximize, but a hill-climbing solution can get to within 1 - 1/e of the optimum
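
A minimal sketch of the Monte Carlo hill-climbing idea under the independent cascade model follows. All names, the number of simulation runs, and the fixed random seed are assumptions made for illustration; the graph format (`graph[u][v]` holding the activation probability for the link from u to v) is likewise hypothetical.

```python
import random

def simulate_cascade(graph, seeds, rng):
    """One independent-cascade run; returns the set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, {}).items():
                # A newly activated node gets exactly one chance per neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def expected_spread(graph, seeds, runs, rng):
    """Monte Carlo estimate of the number of nodes activated from `seeds`."""
    return sum(len(simulate_cascade(graph, seeds, rng)) for _ in range(runs)) / runs

def greedy_seeds(graph, k, runs=200, seed=0):
    """Hill climbing: repeatedly add the node with the largest estimated spread."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen = []
    for _ in range(k):
        best = max(sorted(nodes - set(chosen)),
                   key=lambda v: expected_spread(graph, chosen + [v], runs, rng))
        chosen.append(best)
    return chosen
```

For example, on `{"a": {"b": 1.0, "c": 1.0}, "b": {"d": 1.0}, "e": {"f": 1.0}}` with k = 2, the first pick is the node that reaches the largest component and the second is the best node outside it. With probabilities below 1 the estimates are noisy, so `runs` trades accuracy for speed.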

Maximizing influence – cost/benefit
  • Assumptions:
      • product x sells for $100
      • a discount of 10% can be offered to various prospective customers
  • If customer purchases, profit is:
      • 90 if discount is offered
      • 100 if discount is not offered
  • Expected lift in profit (ELP) from offering discount is:
      • 90*P(buy|discount) - 100*P(buy|no discount)
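
The ELP formula can be checked with a small worked example. The two purchase probabilities below are made-up values for illustration, and the function name is hypothetical; only the 90 and 100 profit figures come from the slide.

```python
def expected_lift_in_profit(p_buy_discount, p_buy_no_discount,
                            profit_discount=90.0, profit_full=100.0):
    """ELP = 90 * P(buy | discount) - 100 * P(buy | no discount)."""
    return profit_discount * p_buy_discount - profit_full * p_buy_no_discount

# Hypothetical customer: the discount raises the chance of buying from 0.4 to 0.5.
lift = expected_lift_in_profit(0.5, 0.4)

# Hypothetical customer who would almost certainly buy anyway.
loyal_lift = expected_lift_in_profit(0.95, 0.9)
```

In the first case the discount pays off (90 * 0.5 = 45 against 100 * 0.4 = 40); in the second it lowers expected profit, since the small gain in purchase probability does not cover the lost margin.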

Maximizing influence – cost/benefit
  • Goal is to find M that maximizes global ELP
  • Three approximations used:
      • single pass
      • greedy
      • hill-climbing
  • Notation:
      • Xi is the decision of customer i to buy
      • Y is the vector of product attributes
      • M is the vector of marketing decisions
      • f is a function to set the ith element of M
      • r0 and r1 are the revenue gained
      • c is the cost of marketing

Comparison of approaches

Approach       Size of starting set           Uses attributes   Probabilities
Cost/benefit   variable, based on max. lift   yes               extracted from data set
Model-based    fixed                          no                assigned to links

  • An extension would be to spread influence to the largest number of communities
  • Improvements can be made in speed

Communities

Gibson, Kleinberg and Raghavan
Query → Search Engine → Root Set → Base Set (add forward and back links) → Use HITS to find top 10 hubs and authorities

Reddy and Kitsuregawa
  • Bipartite graph
  • Given an initial set of nodes T, build I from the nodes pointed to from T
  • Repeat:
      • use relax_cocite to expand T and I
      • prune T and I using dense bipartite graph function DBPG(T, I, α, β)
[Figure: bipartite graph with node sets T and I; example nodes u, v, w]

Flake, Lawrence and Giles
  • Uses min-cut
  • Start with seed set
  • Add linked nodes
  • Find nodes from outgoing links
  • Create virtual source node
  • Add virtual sink linking it to all nodes
  • Find the min-cut of the virtual source and sink
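
The last three steps above can be sketched with a small max-flow routine. This is a rough illustration under several assumptions that are not stated on the slide: links are treated as undirected with capacity equal to the number of seeds, the source-to-seed links have infinite capacity, every node links to the sink with capacity 1, and the community is taken to be the source side of the minimum cut. The Edmonds-Karp max-flow method and all names are choices made here for illustration.

```python
from collections import defaultdict, deque

def min_cut_community(edges, seeds, edge_capacity=None):
    """Return the nodes on the source side of a min cut between virtual source and sink."""
    k = edge_capacity if edge_capacity is not None else len(seeds)
    cap = defaultdict(lambda: defaultdict(int))   # residual capacities
    nodes = set(seeds)
    for u, v in edges:
        cap[u][v] += k
        cap[v][u] += k
        nodes.update((u, v))
    source, sink = "_source", "_sink"             # hypothetical reserved names
    for s in seeds:
        cap[source][s] = float("inf")
    for v in nodes:
        cap[v][sink] += 1

    def augmenting_path():
        """Breadth-first search for a path with spare capacity (Edmonds-Karp)."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return None

    while True:
        parent = augmenting_path()
        if parent is None:
            break
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:              # find the tightest link on the path
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:              # push flow along the path
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

    # Nodes still reachable from the source form the source side of the cut.
    reachable, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > 0 and v not in reachable:
                reachable.add(v)
                queue.append(v)
    return reachable - {source}
```

On two triangles joined by a single link, with two seeds in one triangle, the cut falls on the joining link and the seeds' own triangle is returned.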

Neville, Adler and Jensen
  • Distance based on links and attributes
  • If a link exists, the score is the number of common attributes, zero otherwise
  • Example nodes: A = 0110, B = 1100, C = 1101
      • score(A, B)=2, score(A, C)=1, score(B, C)=0
  • Used with 3 partitioning algorithms:
      • Karger's Min-Cut
      • MajorClust
      • Spectral partitioning by Shi & Malik
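
The example scores can be reproduced with the sketch below. It rests on two assumptions that are inferred from the numbers rather than stated on the slide: an attribute counts as common when both nodes hold the same value in that position, and the example graph has links A-B and A-C but no link B-C. The function name is hypothetical.

```python
def link_attribute_score(attrs, links, x, y):
    """Number of matching attribute positions if x and y are linked, else 0."""
    if frozenset((x, y)) not in links:
        return 0
    return sum(1 for a, b in zip(attrs[x], attrs[y]) if a == b)

attrs = {"A": "0110", "B": "1100", "C": "1101"}
# Assumed links: A-B and A-C (B and C are not linked).
links = {frozenset(("A", "B")), frozenset(("A", "C"))}

score_ab = link_attribute_score(attrs, links, "A", "B")
score_ac = link_attribute_score(attrs, links, "A", "C")
score_bc = link_attribute_score(attrs, links, "B", "C")
```

Note that B and C agree on three of four attributes yet score zero, which shows how the link requirement dominates the measure.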

Communities - summary
  • There are many options for building communities around a small group of nodes
  • Possible future directions:
      • finding communities in networks having different link types
      • impact of network type on community finding techniques

Link Completion

Goldenberg, Kubica and Komarek
Problem: given a network and n-1 members of a community, find the nth
  • random
  • counting
  • popular
  • NB
  • NN
  • cGraph
  • BayesNet
  • EBS and LR

Conclusions
  • Link mining is a young, dynamic field of study with problem areas that continue to emerge and morph as techniques continue to evolve
  • Opportunities for improvements exist in:
      • using community knowledge
      • using network knowledge

"We are the living links in a life force that moves and plays around and through us, binding the deepest soils with the farthest stars." (Alan Chadwick)

Ranking
  • Based on Markov Chain
  • Rank is sum of node weights from incoming links
  • Breaks down when cycles exist
[Figure: example graph with node ranks 3, 2, 4, 5, 9, 6, 14, 15, 9]

Ranking - continued
  • General approach
      • ap = authority score for p
      • Bp = backlinks of p
  • PageRank
  • HITS approach
      • ap = authority score for p
      • hp = hub score for p
      • Bp = backlinks of p
      • Normalize between iterations
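
A minimal sketch of the HITS iteration, using the symbols defined above: the authority score of p is the sum of the hub scores of its backlinks Bp, the hub score of p is the sum of the authority scores of the pages it links to, and both vectors are normalized between iterations. The function name, the fixed iteration count, and the choice of Euclidean normalization are assumptions.

```python
import math

def hits(links, iterations=20):
    """links: page -> list of pages it points to. Returns (authority, hub) scores."""
    nodes = set(links) | {q for targets in links.values() for q in targets}
    backlinks = {p: [] for p in nodes}            # B_p: pages that point to p
    for p, targets in links.items():
        for q in targets:
            backlinks[q].append(p)
    auth = {p: 1.0 for p in nodes}
    hub = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in backlinks[p]) for p in nodes}
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in nodes}
        # Normalize both vectors so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Hypothetical example: h1 points to a and b, h2 points to a only.
authority, hub_score = hits({"h1": ["a", "b"], "h2": ["a"]})
```

In this example a collects links from both hubs and ends with the higher authority score, while h1 points to more authorities and ends with the higher hub score.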