Enhanced topic distillation using text, markup tags, and hyperlinks
Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
www.cse.iitb.ac.in/~soumen

Topic distillation
§ Given a query or some example URLs
§ Collect a relevant subgraph (community) of the Web
§ Bipartite reinforcement between hubs and authorities
§ Prototypes:
  • HITS and Clever
  • Bharat and Henzinger
[Figure: keyword query, search engine, root set, expanded set]
IIT Bombay
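The bipartite reinforcement between hubs and authorities can be illustrated with a minimal sketch of a HITS-style iteration. This is not the authors' code: the function names, the edge-list input, the fixed iteration count, and the L2 normalization are all assumptions made for illustration.

```python
import math


def _normalize(scores):
    """Scale scores to unit L2 norm (an assumed normalization choice)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Run a basic hub/authority iteration on (source, target) link pairs."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of hub scores of pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # A page's hub score is the sum of authority scores of pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: two hubs, two candidate authorities.
hub, auth = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
```

In this toy graph `a1` is cited by both hubs and ends up with the higher authority score, while `h1` links to both authorities and ends up with the higher hub score.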

Challenges and limitations
§ Web authoring style in flux since 1996
  • Complex pages generated from templates
  • File or page boundary less meaningful
  • “Clique attacks”: rampant multi-host ‘nepotism’ via rings, ads, banner exchanges
§ Models are too simplistic
  • Hub and authority symmetry is illusory
  • Coarse-grain hub model ‘leaks’ authority
  • Ad-hoc linear segmentation not content-aware
§ Deteriorating results of topic distillation

Clique attacks!
[Figure: annotated page. Callouts: “Irrelevant links form pseudo-community”; “Relevant regions that lead to inclusion of page in base set”]

Benign drift and generalization
[Figure: annotated page. Callouts: “This section specializes on ‘Shakespeare’”; “Remaining sections generalize and/or drift”]

A new fine-grained model
HTML fragment: <html>…<body>… <table …> <tr><td><a href="http://art.qaz.com">art</a></td></tr> <tr><td><a href="http://ski.qaz.com">ski</a></td></tr> …
[Figure: Document Object Model (DOM) tree of the fragment. html has children head and body; the table's tr/td cells hold the a elements pointing to art.qaz.com and ski.qaz.com; li leaves point to Toncheese.co.uk and www.fromages.com. Callouts: “Frontier of differentiation”; “Relevant subtree”; “Irrelevant subtree”]

Generative model for hub text
§ Global hub text distribution Φ0 relevant to given query
§ Authors use internal DOM nodes to specialize Φ0 into ΦI
§ At a certain frontier in the DOM tree, local distribution directly generates text in ‘hot’ and ‘cold’ subtrees
[Figure: global term distribution Φ0; progressive ‘distortion’ down to ΦI; model frontier; other pages]

A balanced cost measure
Reference distribution Φ0
Cumulative distortion cost = KL(Φ0; Φu) + … + KL(Φu; Φv) (for exponential distribution)
Data encoding cost is roughly …
Goal: find minimum cost frontier
[Figure: DOM nodes u and v, and data Dv]
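As a rough illustration of the cumulative distortion cost, the sketch below sums KL divergences along a chain of distributions from the reference down to a frontier node. It assumes discrete term distributions stored as dictionaries and measures cost in bits; the slide mentions exponential distributions and does not give the exact form, so the function names and the toy numbers are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p; q) in bits; q must be non-zero wherever p is non-zero."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def cumulative_distortion(path):
    """Sum KL over consecutive distributions on a root-to-frontier path."""
    return sum(kl_divergence(a, b) for a, b in zip(path, path[1:]))


# Hypothetical two-term vocabulary: reference, then models at nodes u and v.
phi_0 = {"cheese": 0.5, "ski": 0.5}
phi_u = {"cheese": 0.75, "ski": 0.25}
phi_v = {"cheese": 0.9, "ski": 0.1}
cost = cumulative_distortion([phi_0, phi_u, phi_v])
```

The further the model at the frontier drifts from the reference, the larger this term becomes, which is what balances it against the cost of encoding the data.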

Marking ‘hot’ subtrees
§ Hard to solve exactly (knapsack)
§ (1+ε) dynamic programming solution
§ Too slow for 10 million DOM nodes
§ Greedy expansion approach: at each node v, compare the cost of
  • Directly encoding Dv w.r.t. model Φv at v
  • First distorting Φv to Φw for each child w of v, then encoding all Dw w.r.t. respective Φw
§ If latter is smaller expand v, else prune
§ Mark relevant subtrees as “must-prune”
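The greedy expansion rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the `Node` class, the add-one smoothed multinomial used as each node's local model, the costs measured in bits, and the omission of “must-prune” marking are all choices made here for brevity.

```python
import math
from collections import Counter


class Node:
    """Hypothetical DOM node; data holds term counts of its whole subtree."""

    def __init__(self, name, terms=(), children=()):
        self.name = name
        self.children = list(children)
        self.data = Counter(terms)
        for child in self.children:
            self.data.update(child.data)


def local_model(data, vocab, alpha=1.0):
    """Smoothed term distribution estimated from a subtree's counts."""
    total = sum(data.values()) + alpha * len(vocab)
    return {t: (data[t] + alpha) / total for t in vocab}


def encoding_cost(data, model):
    """Bits needed to encode the term counts under the given model."""
    return -sum(n * math.log2(model[t]) for t, n in data.items())


def kl(p, q):
    """Distortion cost KL(p; q) in bits."""
    return sum(pt * math.log2(pt / q[t]) for t, pt in p.items() if pt > 0)


def greedy_frontier(v, model_v, vocab):
    """Return frontier nodes: expand v only if its children encode cheaper."""
    if not v.children:
        return [v]
    prune_cost = encoding_cost(v.data, model_v)
    child_models = [local_model(w.data, vocab) for w in v.children]
    expand_cost = sum(kl(model_v, m) + encoding_cost(w.data, m)
                      for w, m in zip(v.children, child_models))
    if expand_cost < prune_cost:
        frontier = []
        for w, m in zip(v.children, child_models):
            frontier.extend(greedy_frontier(w, m, vocab))
        return frontier
    return [v]


# Hypothetical mixed hub: one child is all "cheese", the other all "ski".
vocab = ["cheese", "ski"]
root = Node("root", children=[Node("hot", ["cheese"] * 50),
                              Node("cold", ["ski"] * 50)])
frontier = greedy_frontier(root, {"cheese": 0.5, "ski": 0.5}, vocab)
```

When the children have sharply different term distributions, paying the distortion cost is cheaper than encoding everything with the parent's model, so the node is expanded; when the children look like the parent, the node is pruned and becomes a frontier node itself.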

Exploiting co-citation in our model
Initial values of leaf hub scores = target auth scores
Must-prune nodes are marked
Frontier microhubs accumulate scores
Aggregate hub scores are copied back to leaves
‘Known’ authorities; have reason to believe these could be good too
Non-linear transform, unlike HITS
[Figure: DOM tree with leaf hub scores (0.06, 0.05, 0.13, 0.12, 0.10, 0.20, 0.01) and numbered callouts 1 to 4]
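The accumulate-and-copy-back step can be sketched in a few lines. The sketch assumes that “accumulate” means summing leaf scores under each frontier microhub; the function name, the dictionary layout, and the example scores are hypothetical.

```python
def smooth_hub_scores(frontier, leaf_scores):
    """Give every leaf the aggregate score of its frontier microhub.

    frontier: dict mapping a microhub name to the leaf ids beneath it.
    leaf_scores: dict mapping a leaf id to its initial hub score.
    """
    smoothed = {}
    for microhub, leaves in frontier.items():
        # The frontier microhub accumulates the scores of its leaves ...
        aggregate = sum(leaf_scores[leaf] for leaf in leaves)
        # ... and the aggregate is copied back to each of those leaves.
        for leaf in leaves:
            smoothed[leaf] = aggregate
    return smoothed


# Hypothetical segmentation: two leaves share a microhub, one stands alone.
scores = smooth_hub_scores(
    {"m1": ["l1", "l2"], "m2": ["l3"]},
    {"l1": 0.10, "l2": 0.20, "l3": 0.01},
)
```

Leaves that share a microhub with a high-scoring leaf are lifted to the same aggregate, while a leaf in a separate microhub keeps its low score. Because the grouping depends on the segmentation, the overall update is not a fixed linear map as in HITS.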

Complete algorithm
§ Collect root set and base set
§ Pre-segment using text and mark relevant micro-hubs to be pruned
§ Set authority scores to 1 for root set pages only
§ Iterate
  • Transfer from authority to hub leaves
  • Re-segment hub DOM trees using link + text
  • Smooth and redistribute hub scores
  • Transfer from hub leaves to authority roots
§ Report top authority and ‘hot’ microhubs
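The iteration above can be sketched in heavily simplified form. The sketch keeps the microhub segmentation fixed, so the re-segmentation step is omitted entirely; it also assumes that smoothing means summing scores within a microhub and that authority scores are renormalized to sum to 1. The function name and the input layout are hypothetical.

```python
def distill(microhubs, root_set, iterations=20):
    """Iterate authority and microhub scores over a fixed segmentation.

    microhubs: dict mapping a microhub name to the target URLs it links to.
    root_set: set of URLs whose authority score starts at 1.
    """
    targets = {t for links in microhubs.values() for t in links}
    auth = {t: (1.0 if t in root_set else 0.0) for t in targets}
    hub = {name: 0.0 for name in microhubs}
    for _ in range(iterations):
        # Authority -> hub leaves, then smoothing: each microhub
        # accumulates the authority scores of the pages its leaves cite.
        hub = {name: sum(auth[t] for t in links)
               for name, links in microhubs.items()}
        # Hub leaves -> authorities: every leaf carries its microhub's score.
        new_auth = {t: 0.0 for t in targets}
        for name, links in microhubs.items():
            for t in links:
                new_auth[t] += hub[name]
        total = sum(new_auth.values())
        if total == 0:
            break
        auth = {t: s / total for t, s in new_auth.items()}
    return auth, hub


# Hypothetical input: "b" is co-cited with root-set page "a"; "c" is not.
auth, hub = distill({"m1": ["a", "b"], "m2": ["c"]}, root_set={"a"})
```

Page `b` gains authority because it shares a microhub with a known authority, while `c`, cited only from an unrelated microhub, stays at zero.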

Experimental setup
§ Large data sets
  • 28 queries from Clever, >20 topics from Dmoz
  • Collect 2000…10000 pages per query/topic
  • Several million DOM nodes and fine links
§ Find top authorities using various algorithms
§ For ad-hoc query, measure cosine similarity of authorities with root-set centroid in vector space
§ For Dmoz, use an automatic classifier…
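The cosine-similarity measurement for ad-hoc queries can be sketched as below. The term weighting is not specified on the slide, so the sketch assumes raw term-frequency vectors stored as dictionaries; the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def centroid(vectors):
    """Average a list of sparse term vectors (dicts of term -> weight)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical root set for the query "cycling" and two candidate authorities.
root_centroid = centroid([{"bike": 2, "race": 1}, {"bike": 1, "helmet": 1}])
on_topic = cosine({"bike": 3, "race": 1}, root_centroid)
drifted = cosine({"software": 2, "bike": 1}, root_centroid)
```

An authority whose text resembles the root set scores close to 1, while a drifted page scores lower; averaging this over the top authorities gives a single figure per algorithm.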

Avoiding topic drift via micro-hubs
[Figure: two panels. Query “cycling”: no danger of topic drift. Query “affirmative action”: topic drift from software sites]

Results for the Clever benchmark
§ Take top 40 authorities
§ Find average cosine similarity to root set centroid
§ Similarity: HITS < DOM+Text < DOM
§ DOM alone cannot prune well enough: most top authorities from root set
§ HITS drifts often

Dmoz experiments and results
§ 223 topics from http://dmoz.org
§ Sample root set URLs from a class c
§ Top authorities not in root set submitted to Rainbow classifier
§ Σd Pr(c|d) is the expected number of relevant documents
§ DOM+Text best
[Figure: Dmoz sample split into train and test sets for the Rainbow classifier; example class Music; root set, expanded set, top authorities]
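The evaluation measure can be illustrated with a tiny sketch: given the classifier's posterior Pr(c|d) for each top authority d, the expected number of relevant documents is their sum. The function name and the probabilities below are made up for illustration and do not come from the Rainbow classifier.

```python
def expected_relevant(posteriors):
    """Sum Pr(c|d) over documents d to get the expected relevant count."""
    for p in posteriors:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each posterior must lie in [0, 1]")
    return sum(posteriors)


# Hypothetical posteriors for three top authorities of one class c.
expected = expected_relevant([0.9, 0.8, 0.1])
```

Summing posteriors avoids committing to a hard relevant/irrelevant decision for each document.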

Anecdotes
§ “amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
§ New algorithm reduces drift
§ Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
§ Mixed hubs in top 50 for 13/28 queries

Conclusion and ongoing work
§ Hypertext shows complex idioms, missed by coarse-grained graph model
§ Enhanced fine-grained distillation
  • Identifies content-bearing ‘hot’ micro-hubs
  • Disaggregates hub scores
  • Reduces topic drift via mixed hubs and pseudo-communities
§ Application: topic-based focused crawling
§ Need probabilistic combination of evidence from text and links