- Number of slides: 17
Enhanced topic distillation using text, markup tags, and hyperlinks
Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
www.cse.iitb.ac.in/~soumen

Topic distillation
- Given a query or some example URLs
- Collect a relevant subgraph (community) of the Web
- Bipartite reinforcement between hubs and authorities
- Prototypes:
  - HITS and Clever
  - Bharat and Henzinger
[Figure: keyword query → search engine → root set → expanded set]
IIT Bombay
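The bipartite reinforcement between hubs and authorities can be illustrated with a minimal sketch of plain HITS-style iteration. The function name `hits`, the edge-list input, and the toy page names are assumptions made for illustration; this is the coarse page-level baseline, not the fine-grained model of this talk.

```python
from math import sqrt

def hits(edges, iterations=50):
    """Plain HITS-style iteration on (hub_page, authority_page) links."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical toy graph: three hubs, two authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
```

In this toy graph `a1` is cited by all three hubs and ends up with the larger authority score, while `h1`, which cites both authorities, ends up as the strongest hub.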

Challenges and limitations
- Web authoring style in flux since 1996
  - Complex pages generated from templates
  - File or page boundary less meaningful
  - “Clique attacks”: rampant multi-host ‘nepotism’ via rings, ads, banner exchanges
- Models are too simplistic
  - Hub and authority symmetry is illusory
  - Coarse-grain hub model ‘leaks’ authority
  - Ad-hoc linear segmentation not content-aware
- Deteriorating results of topic distillation

Clique attacks!
[Figure: hub page in which irrelevant links form a pseudo-community, next to the relevant regions that led to inclusion of the page in the base set]

Benign drift and generalization
[Figure: hub page in which one section specializes on ‘Shakespeare’ while the remaining sections generalize and/or drift]

A new fine-grained model
[Figure: DOM tree of a hub page with internal nodes html, table, ul, tr, td, li and leaf text such as “art”, “ski”, “Fromages.com”, “French cheese…”, “Teddington…”, “Buy online…”]

Generative model for hub text
- Global hub text distribution Θ_0 relevant to given query
- Authors use internal DOM nodes to specialize Θ_0 into Θ_I
- At a certain frontier in the DOM tree, local distribution directly generates text in ‘hot’ and ‘cold’ subtrees
[Figure: DOM tree with a model frontier; labels: global term distribution Θ_0, progressive ‘distortion’, Θ_I, other pages]

A balanced cost measure
- Reference distribution Θ_0
- Cumulative distortion cost = KL(Θ_0; Θ_u) + … + KL(Θ_u; Θ_v) (for exponential distribution)
- Data encoding cost for D_v is roughly …
- Goal: find minimum cost frontier
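The two cost terms could be written out as follows. This is a sketch under stated assumptions: Θ denotes the term distributions along the path from the reference distribution to a frontier node v, D_v is the text under v, and the data term uses the usual Shannon code length (negative log-likelihood), which is a standard choice rather than a formula taken from the slide.

```latex
\begin{aligned}
\text{distortion}(v) &= \mathrm{KL}(\Theta_0;\Theta_u) + \dots + \mathrm{KL}(\Theta_u;\Theta_v) \\
\text{encoding}(v)   &\approx -\sum_{d \in D_v} \log \Pr(d \mid \Theta_v)
\end{aligned}
```

Under this reading, a frontier placed too high pays a large encoding cost because one model must cover heterogeneous text, while a frontier placed too low pays for many distortion steps; the minimum cost frontier balances the two.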

Marking ‘hot’ subtrees
- Hard to solve exactly (knapsack)
- (1+ε) dynamic programming solution
- Too slow for 10 million DOM nodes
- Greedy expansion approach: at each node v, compare the cost of
  - Directly encoding D_v w.r.t. model Θ_v at v
  - First distorting Θ_v to Θ_w for each child w of v, then encoding all D_w w.r.t. respective Θ_w
- If latter is smaller expand v, else prune
- Mark relevant subtrees as “must-prune”
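The greedy expansion rule above can be sketched as a short recursion. Everything concrete here is an assumption made for illustration: the `Node` class, the Laplace-smoothed multinomial used as Θ, the KL term in bits as the distortion cost, and the simplification that only leaves carry text. It shows the shape of the expand-or-prune comparison, not the cost model used in the actual system.

```python
from collections import Counter
from math import log2

class Node:
    """Hypothetical DOM node; in this sketch only leaves carry terms."""
    def __init__(self, name, terms=(), children=(), must_prune=False):
        self.name = name
        self.children = list(children)
        self.must_prune = must_prune
        # D_v: bag of all terms in the subtree rooted at this node.
        self.data = Counter(terms)
        for child in self.children:
            self.data.update(child.data)

def model(data, vocab, alpha=1.0):
    """Assumed model: Laplace-smoothed multinomial over a fixed vocabulary."""
    total = sum(data.values()) + alpha * len(vocab)
    return {t: (data[t] + alpha) / total for t in vocab}

def kl(p, q):
    """KL divergence in bits, used here as the distortion cost."""
    return sum(p[t] * log2(p[t] / q[t]) for t in p)

def encode_cost(data, theta):
    """Shannon code length (bits) of the data under the model theta."""
    return -sum(n * log2(theta[t]) for t, n in data.items())

def frontier(v, vocab):
    """Greedy rule: expand v only if distorting to the children is cheaper."""
    if v.must_prune or not v.children:
        return [v]
    theta_v = model(v.data, vocab)
    keep = encode_cost(v.data, theta_v)
    expand = 0.0
    for w in v.children:
        theta_w = model(w.data, vocab)
        expand += kl(theta_v, theta_w) + encode_cost(w.data, theta_w)
    if expand < keep:
        return [u for w in v.children for u in frontier(w, vocab)]
    return [v]

# Hypothetical mixed hub: one list about cheese, one about skiing.
cheese = Node("ul-cheese", children=[
    Node("a1", ["cheese", "french", "cheese"]),
    Node("a2", ["cheese", "buy", "french"])])
ski = Node("ul-ski", children=[
    Node("a3", ["ski", "snow", "ski"]),
    Node("a4", ["ski", "resort", "snow"])])
root = Node("html", children=[cheese, ski])
vocab = set(root.data)
print([n.name for n in frontier(root, vocab)])  # -> ['ul-cheese', 'ul-ski']
```

The root is expanded because its two subtrees use disjoint vocabularies, while each list is kept whole because splitting it into single links does not pay for itself.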

Exploiting co-citation in our model
- Initial values of leaf hub scores = target auth scores
- Frontier microhubs accumulate scores
- Aggregate hub scores are copied back to leaves
- Must-prune nodes are marked
- Non-linear transform, unlike HITS
[Figure: hub DOM tree with numbered steps 1 to 4 and example leaf scores; labels: ‘known’ authorities, “have reason to believe these could be good too”]

Complete algorithm
- Collect root set and base set
- Pre-segment using text and mark relevant micro-hubs to be pruned
- Assign authority score 1 to root-set pages only
- Iterate
  - Transfer from authority to hub leaves
  - Re-segment hub DOM trees using link + text
  - Smooth and redistribute hub scores
  - Transfer from hub leaves to authority roots
- Report top authority and ‘hot’ microhubs
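The score-transfer part of the loop above might look like the following simplified sketch. It assumes the segmentation is already given and stays fixed (the re-segmentation step is omitted), represents each micro-hub as a plain list of target pages, and uses sum-based aggregation and L1 normalization; the function name `distill` and the page names are hypothetical.

```python
def distill(microhubs, root_set, iterations=20):
    """microhubs: dict mapping a micro-hub name to the pages it links to."""
    targets = {t for links in microhubs.values() for t in links}
    # Only root-set pages start with authority score 1.
    auth = {t: (1.0 if t in root_set else 0.0) for t in targets}
    hub = {}
    for _ in range(iterations):
        # Authority -> hub leaves, then aggregate the leaves per micro-hub.
        hub = {m: sum(auth[t] for t in links)
               for m, links in microhubs.items()}
        # Aggregate score is copied back to each leaf and passed to its target.
        new_auth = {t: 0.0 for t in targets}
        for m, links in microhubs.items():
            for t in links:
                new_auth[t] += hub[m]
        norm = sum(new_auth.values()) or 1.0
        auth = {t: s / norm for t, s in new_auth.items()}
    return auth, hub

# Hypothetical page split into a relevant and a nepotistic micro-hub.
microhubs = {
    "pageA/parks": ["park1", "park2", "park3"],
    "pageA/ads": ["shop1", "shop2"],
    "pageB/parks": ["park1", "park2"],
}
auth, hub = distill(microhubs, root_set={"park1"})
```

Because the two micro-hubs of `pageA` are scored separately, the ad links never receive authority in this sketch, whereas a page-level hub score would have been shared across all five links.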

Experimental setup
- Large data sets
  - 28 queries from Clever, >20 topics from Dmoz
  - Collect 2000…10000 pages per query/topic
  - Several million DOM nodes and fine links
- Find top authorities using various algorithms
- For ad-hoc query, measure cosine similarity of authorities with root-set centroid in vector space
- For Dmoz, use an automatic classifier…
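The cosine-similarity measure for ad-hoc queries can be sketched as below. The term weighting is an assumption: raw term frequencies over whitespace tokens are used here for brevity, and the helper names and sample texts are hypothetical.

```python
from collections import Counter
from math import sqrt

def tf_vector(text):
    """Assumed representation: raw term frequencies of lowercase tokens."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Average of the root-set document vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {t: c / len(vectors) for t, c in total.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_similarity(authority_texts, root_texts):
    """Average cosine similarity of the authorities to the root-set centroid."""
    center = centroid([tf_vector(t) for t in root_texts])
    sims = [cosine(tf_vector(t), center) for t in authority_texts]
    return sum(sims) / len(sims)

root_texts = ["mountain bike cycling", "road cycling race"]
on_topic = mean_similarity(["cycling race bike"], root_texts)
drifted = mean_similarity(["download free software"], root_texts)
```

An authority list that has drifted away from the query topic shares few terms with the root set and therefore scores lower on this measure.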

Avoiding topic drift via micro-hubs
- Query: cycling. No danger of topic drift
- Query: affirmative action. Topic drift from software sites

Results for the Clever benchmark
- Take top 40 authorities
- Find average cosine similarity to root-set centroid
- Similarity: HITS < DOM+Text < DOM
- DOM alone cannot prune well enough: most top authorities come from the root set
- HITS drifts often

Dmoz experiments and results
- 223 topics from http://dmoz.org
- Sample root set URLs from a class c
- Top authorities not in root set submitted to Rainbow classifier
- Σ_d Pr(c | d) is the expected number of relevant documents
- DOM+Text best
[Figure: evaluation pipeline; labels: Dmoz, Music, sample, train, test, Rainbow classifier, root set, expanded set, top authority]

Anecdotes
- “amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
- New algorithm reduces drift
- Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
- Mixed hubs in top 50 for 13/28 queries

Conclusion and ongoing work
- Hypertext shows complex idioms, missed by coarse-grained graph model
- Enhanced fine-grained distillation
  - Identifies content-bearing ‘hot’ micro-hubs
  - Disaggregates hub scores
  - Reduces topic drift via mixed hubs and pseudo-communities
- Application: topic-based focused crawling
- Need probabilistic combination of evidence from text and links


