1d9864ebb017a25d9fc2ea56d5513939.ppt
- Number of slides: 16
Web Spam Detection with Anti-Trust Rank
Vijay Krishnan, Rashmi Raj
Computer Science Department, Stanford University
The World Wide Web
• Huge
• Distributed content creation, linking (no coordination)
• Structured databases, unstructured text, semistructured data
• Content includes truth, lies, obsolete information, contradictions, …
PageRank
• Intuition: “a page is important if important pages link to it.”
• In high-falutin’ terms: importance = the principal eigenvector of the stochastic matrix of the Web. (A few fixups needed.)
PageRank
• Web graph encoded by matrix M
  – N × N matrix (N = number of web pages)
  – M_ij = 1/|O(j)| iff there is a link from j to i
  – M_ij = 0 otherwise
  – O(j) = set of pages that node j links to
• Define matrix A as follows
  – A_ij = βM_ij + (1 − β)/N, where 0 < β < 1
  – 1 − β is the “tax” discussed in prior lecture
• Page rank r is the principal eigenvector of A
  – Ar = r
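The eigenvector equation Ar = r can be approximated by repeated multiplication (power iteration). The following is a minimal sketch, not the authors' implementation. The function name `pagerank`, the default β = 0.85, the iteration count, and the three-page example graph are all illustrative assumptions. It also assumes every page has at least one out-link, so the "fixups" for dead ends are not handled.

```python
def pagerank(out_links, beta=0.85, iters=100):
    """Approximate r with A r = r, where A_ij = beta * M_ij + (1 - beta) / N.

    out_links maps each page j to the list O(j) of pages it links to.
    Assumption: every page has at least one out-link and every link target
    is itself a key of out_links, so M is column-stochastic.
    """
    pages = sorted(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iters):
        # The (1 - beta) / N term: because sum(r) == 1, every page gets this much.
        nxt = {p: (1.0 - beta) / n for p in pages}
        for j in pages:
            share = beta * r[j] / len(out_links[j])  # beta * M_ij * r_j
            for i in out_links[j]:
                nxt[i] += share
        r = nxt
    return r


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
```

In this toy graph, page c collects links from both a and b, so it ends up with the largest rank; the ranks sum to 1.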
Many Random Walkers Model
• Imagine a large number M of independent, identical random walkers (M ≫ N)
• At any point in time, let M(p) be the number of random walkers at page p
• The page rank of p is the fraction of random walkers that are expected to be at page p, i.e., E[M(p)]/M
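The walker model can be checked empirically with a small Monte Carlo simulation. This is a sketch under stated assumptions: the function name `simulate_walkers`, the walker and step counts, the fixed seed, and the example graph are hypothetical. Each walker follows a random out-link with probability β and otherwise jumps to a uniformly random page (the "tax"). Jumping on dead ends is an extra assumption that is not in the slides.

```python
import random


def simulate_walkers(out_links, num_walkers=5000, steps=30, beta=0.85, seed=0):
    """Estimate E[M(p)] / M by simulating independent random walkers."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pages = sorted(out_links)
    counts = {p: 0 for p in pages}
    for _ in range(num_walkers):
        page = rng.choice(pages)  # each walker starts at a random page
        for _ in range(steps):
            if out_links[page] and rng.random() < beta:
                page = rng.choice(out_links[page])  # follow a random out-link
            else:
                page = rng.choice(pages)  # teleport ("tax") or dead end
        counts[page] += 1  # record where the walker ended up
    return {p: c / num_walkers for p, c in counts.items()}


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fractions = simulate_walkers(web)
```

With enough walkers and steps, the fractions approach the page ranks obtained from the eigenvector formulation.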
Economic Considerations
• Search has become the default gateway to the web
• Very high premium to appear on the first page of search results
  – e.g., e-commerce sites
  – advertising-driven sites
What is Web Spam?
• Spamming = any deliberate action solely in order to boost a web page’s position in search engine results, incommensurate with the page’s real value
• Spam = web pages that are the result of spamming
• This is a very broad definition
  – SEO industry might disagree!
  – SEO = search engine optimization
• Approximately 10-15% of web pages are spam
Types of Spamming Techniques
• Term spamming
  – Manipulating the text of web pages in order to appear relevant to queries
• Link spamming
  – Creating link structures that boost page rank or hubs and authorities scores
Link Spam
• Three kinds of web pages from a spammer’s point of view
  – Inaccessible pages
  – Accessible pages
    • e.g., web log comments pages
    • spammer can post links to his pages
  – Own pages
    • Completely controlled by spammer
    • May span multiple domain names
Link Spam Detection
• Open research area
• One approach: TrustRank
Trust Rank
• Basic principle: approximate isolation
  – It is rare for a “good” page to point to a “bad” (spam) page
• Sample a set of “seed pages” from the web
• Set trust of each trusted page to 1
• Propagate trust through links
• Each page gets a trust value between 0 and 1
• Use a threshold value and mark all pages below the trust threshold as spam
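The propagation step can be sketched as a biased PageRank whose teleport mass goes only to the trusted seeds. The slides do not specify the propagation rule, so this choice is an assumption. The names `trust_rank`, `trusted_seeds`, and `threshold`, the parameter values, and the five-page example graph are hypothetical too. The seed mass is split evenly across the seeds rather than set to 1 per page, which keeps every value between 0 and 1.

```python
def trust_rank(out_links, trusted_seeds, beta=0.85, iters=50):
    """Propagate trust forward along links from a set of trusted seed pages.

    Assumption: every link target is itself a key of out_links.
    """
    pages = sorted(out_links)
    # Teleport distribution: uniform over the trusted seeds, zero elsewhere.
    seed_mass = {p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
                 for p in pages}
    trust = dict(seed_mass)
    for _ in range(iters):
        nxt = {p: (1.0 - beta) * seed_mass[p] for p in pages}
        for j in pages:
            for i in out_links[j]:
                # Page j splits its damped trust evenly among its out-links.
                nxt[i] += beta * trust[j] / len(out_links[j])
        trust = nxt
    return trust


# Hypothetical web: good pages g1..g3 never link to spam pages s1, s2,
# although spam page s1 links to good page g1.
web = {
    "g1": ["g2", "g3"], "g2": ["g1"], "g3": ["g1"],
    "s1": ["s2", "g1"], "s2": ["s1"],
}
trust = trust_rank(web, trusted_seeds={"g1"})
threshold = 0.01  # illustrative cut-off, not a value from the slides
flagged = sorted(p for p, t in trust.items() if t < threshold)
```

Because no trusted page links to s1 or s2 in this example, they receive no trust and fall below the threshold.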
Anti-Trust Approach
• Broadly based on the same “approximate isolation principle”
• This principle also implies that the pages pointing to spam pages are very likely to be spam pages themselves
• Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages
• A page can be classified as a spam page if it has an Anti-Trust Rank value greater than a chosen threshold value
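Reverse propagation can be sketched by transposing the link graph and running a seed-biased iteration on it, so that scores flow from a spam page to the pages that link to it. This is an illustrative sketch, not the authors' code. The biased-PageRank update rule, the names `reverse_graph` and `anti_trust_rank`, the threshold, and the example graph are assumptions.

```python
def reverse_graph(out_links):
    """Return in_links, mapping each page to the pages that link to it."""
    in_links = {p: [] for p in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)
    return in_links


def anti_trust_rank(out_links, spam_seeds, beta=0.85, iters=50):
    """Propagate anti-trust backwards along incoming links from spam seeds.

    Assumption: every link target is itself a key of out_links.
    """
    in_links = reverse_graph(out_links)
    pages = sorted(in_links)
    seed_mass = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0)
                 for p in pages}
    score = dict(seed_mass)
    for _ in range(iters):
        nxt = {p: (1.0 - beta) * seed_mass[p] for p in pages}
        for j in pages:
            for i in in_links[j]:
                # Page j passes damped anti-trust to each page i linking to it.
                nxt[i] += beta * score[j] / len(in_links[j])
        score = nxt
    return score


# Hypothetical web: s3 links to the known spam page s1; good pages do not.
web = {
    "g1": ["g2", "g3"], "g2": ["g1"], "g3": ["g1"],
    "s1": ["s2", "g1"], "s2": ["s1"], "s3": ["s1"],
}
anti_trust = anti_trust_rank(web, spam_seeds={"s1"})
threshold = 0.01  # illustrative cut-off, not a value from the slides
flagged = sorted(p for p, a in anti_trust.items() if a > threshold)
```

Here s2 and s3 both point to the seed s1 and so pick up anti-trust. The good pages stay at zero even though s1 links to g1, because scores travel only against the link direction.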
Seed Set Selection
• Seed spam set chosen from pages with high page rank
• Nearly 100% of URLs containing certain terms like {viagra, gambling, hardporn} as substrings are spam. Use these for evaluation.
• Some seed pages were also chosen by an oracle (human expert)
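Both selection heuristics are simple enough to sketch. The helper names `looks_like_spam_url` and `pick_spam_seeds`, the example URLs, the page-rank values, and the oracle labels below are hypothetical. Only the three substrings come from the slide.

```python
SPAM_TERMS = ("viagra", "gambling", "hardporn")  # terms listed on the slide


def looks_like_spam_url(url):
    """Label a URL as spam for evaluation if it contains a listed term."""
    lowered = url.lower()
    return any(term in lowered for term in SPAM_TERMS)


def pick_spam_seeds(page_rank, oracle_spam, k):
    """Take the k highest-page-rank pages that the oracle marked as spam."""
    ranked = sorted(page_rank, key=page_rank.get, reverse=True)
    return [p for p in ranked if p in oracle_spam][:k]


# Hypothetical inputs for illustration only.
urls = ["http://cheap-viagra.example.com/", "http://example.org/news"]
url_labels = [looks_like_spam_url(u) for u in urls]

page_rank = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
seeds = pick_spam_seeds(page_rank, oracle_spam={"b", "c", "d"}, k=2)
```

Preferring high-page-rank spam pages as seeds means each seed has many pages linking to it, so more pages receive anti-trust.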
Results
• Overall percentage of “spam” pages = 0.28%
• Average page rank of “spam” / average page rank = 2.6
• % of “spam” pages in:
  – top 1000 Anti-Trust Rank pages = 25.3%
  – bottom 1000 Trust Rank pages = 0.68%
• Ratio of average page ranks of spam pages returned by ATR vs. TR is roughly 6
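A percentage such as "% of spam pages in the top 1000 pages" can be computed along the following lines. This is a sketch with made-up scores and labels. The function name `spam_percentage_in_top_k` is hypothetical, and the numbers below are not the reported results.

```python
def spam_percentage_in_top_k(scores, spam_labels, k):
    """Percentage of the k highest-scoring pages that are labelled spam.

    Assumes k >= 1 and that scores is non-empty.
    """
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return 100.0 * sum(1 for p in top if p in spam_labels) / len(top)


# Toy anti-trust scores and evaluation labels, for illustration only.
scores = {"s1": 0.23, "s2": 0.12, "s3": 0.10, "g1": 0.02, "g2": 0.0}
spam_labels = {"s1", "s2", "s3"}
pct = spam_percentage_in_top_k(scores, spam_labels, k=4)
```

For the bottom-k Trust Rank figure, the same function could be applied to negated trust scores.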
Results
References
• The PageRank citation ranking: Bringing order to the web. L. Page, S. Brin, R. Motwani and T. Winograd. Technical Report, Stanford University, 1998.
• Combating Web Spam with TrustRank. Zoltan Gyongyi, Hector Garcia-Molina and Jan Pedersen. In VLDB 2004.
• Topic-sensitive PageRank. Taher Haveliwala. In WWW 2002.
• The WebGraph dataset. Online at: http://webgraph-data.dsi.unimi.it/