Web Noises Detection and Elimination
http://net.pku.edu.cn/~wbia
黄连恩, hle@net.pku.edu.cn
School of Information Engineering, Peking University
10/08/2013

What are Web Noises?

[Figure: example Web page with its regions labeled Navigation guide, Main topic, and Advertisement]

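Call them Noises
- Although this information is useful to people browsing the Web, it often has a negative effect on automated Web information processing, such as Web page clustering, classification, information retrieval, and information extraction.
- It hampers automated information gathering and Web data mining.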

Non-Relevant Data on the Web
- A fundamental problem on the Web: many pages contain a lot of non-relevant data.
  - "Non-relevant" means not directly related to the main topic or functionality of the page.
- Local (intra-page) noise
  - Irrelevant items within a Web page.
  - E.g., banner ads, navigational guides.

Duplicate data on the Web
- Another problem on the Web: there is a lot of duplicate or near-duplicate data.
  - Mirrors, copied news, etc.
- Global noise
  - Redundant objects
  - Larger than an individual page
  - E.g., mirror sites, duplicated Web pages

Template detection “Template Detection via Data Mining and its Applications”

DOM Tree and Template

DOM trees
HTML source:
    <BODY bgcolor=WHITE>
      <TABLE width=800 height=200> … </TABLE>
      <IMG src="image.gif" width=800>
      <TABLE bgcolor=RED> … </TABLE>
    </BODY>
Corresponding DOM tree:
    root
      BODY (bc=white)
        TABLE (width=800, height=200)
        IMG (width=800)
        TABLE (bc=red)

Templates

Volume and Evolution of Page Templates
- "Our results show that 40–50% of the content on the web is template content."
- "Over the last eight years, the fraction of template content has doubled, and the growth shows no sign of abating."
- "Text, links, and total HTML bytes within templates are all growing as a fraction of total content at a rate of between 6 and 8% per year."
Source: D. Gibson, K. Punera, and A. Tomkins, "The volume and evolution of web page templates," in Special interest tracks and posters of the 14th international conference on World Wide Web. Chiba, Japan: ACM Press, 2005.

Templates Detection
- Semantic definition: a template is a master HTML shell page that is used as a basis for composing new pages.
  - Content of new pages is plugged into the template shell.
  - All pages share a common look and feel.
- Usually controlled by a central authority
  - Not necessarily confined to a single site
- May include a variety of data
  - Navigational bars
  - Advertisements
  - Company info and policies

[Figure: example page divided into a search pagelet, a navigation pagelet, a services pagelet, a company info pagelet, and an ad pagelet]

Pagelets
- Semantic definition: a pagelet is a maximal region of a page that has a single topic or functionality.
  - Not too large: it has only one topic / functionality.
  - Not too small: any larger region that contains it has other topics / functionalities.

Pagelets: Syntactic Definition
- A pagelet is a node in the HTML parse tree of a page satisfying the following:
  - Its HTML tag is one of the following: <TABLE>, <OL>, <UL>, <AREA>, <P>, <DL>, …
  - None of its children contains at least k hyperlinks (k = 3).
  - None of its ancestors is a pagelet.

Templates: Syntactic Definition
- A template is a collection T = (p1, …, pk) of pagelets satisfying:
  - Similarity: p1, …, pk are identical or almost identical.
  - Connectivity: every two pages owning pagelets in T are reachable from each other (undirectedly) through other pages owning pagelets in T.
- Template Recognition Problem: given a set of pages S, find all the templates in S.
[Figure: five pagelets p1, …, p5 on linked pages]

Template Recognition in Large Sets
1. Calculate shingle(p) for each pagelet p in S.
2. Cluster the pagelets in S according to shingle.
3. Discard clusters of size 1.
4. For each remaining cluster C:
   - Construct the graph G_C of pages that own pagelets in C.
   - Find the undirected connected components of G_C.
   - Output components of size > 1.
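The flow on this slide can be sketched in a few lines of Python. This is a minimal in-memory sketch under stated assumptions, not the authors' implementation: `fingerprint` is a hypothetical stand-in for shingle(p) that hashes whitespace-normalized text (so it only clusters identical pagelets, not almost identical ones), and the `pagelets` and `links` inputs are assumed data structures.

```python
from collections import defaultdict
import hashlib


def fingerprint(text):
    """Hypothetical stand-in for shingle(p): hash of normalized pagelet text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_templates(pagelets, links):
    """pagelets: dict page_id -> list of pagelet texts (assumed input format).
    links: iterable of (src_page, dst_page) hyperlinks.
    Returns a list of sets of page ids, one set per detected template."""
    # Cluster pagelets by fingerprint, remembering which pages own them.
    clusters = defaultdict(set)
    for page, texts in pagelets.items():
        for text in texts:
            clusters[fingerprint(text)].add(page)

    templates = []
    for owners in clusters.values():
        # Discard clusters that live on a single page.
        if len(owners) < 2:
            continue
        # Undirected graph restricted to pages that own pagelets in this cluster.
        adjacency = defaultdict(set)
        for src, dst in links:
            if src in owners and dst in owners and src != dst:
                adjacency[src].add(dst)
                adjacency[dst].add(src)
        # Connected components by depth-first search.
        seen = set()
        for start in sorted(owners):
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:
                node = stack.pop()
                if node in seen:
                    continue
                seen.add(node)
                component.add(node)
                stack.extend(adjacency[node] - seen)
            # Output only components of size > 1.
            if len(component) > 1:
                templates.append(component)
    return templates
```

For a real crawl the clustering and component steps would run out of core; the sketch only shows how the steps fit together.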

Cleaning via feature weighting

Cleaning via feature weighting
"Eliminating noisy information in Web pages for data mining"
- In a given Web site:
  - Noisy blocks share common contents or presentation styles.
  - Meaningful (or main) blocks are diverse in contents and presentation style.
- Weighting features makes cleaning automatic (nothing is eliminated).

Build Site Style Tree (SST)
[Figure: site style tree; label: common]

SST
- Style node S = (ELEMENTs, n)
  - ELEMENTs: a sequence of element nodes
  - n: number of pages that have this style
- Element node E = (Tag, Attr, STYLEs)
  - Tag: tag name, e.g., TABLE, IMG
  - Attr: display attributes of Tag, e.g., bgcolor=RED
  - STYLEs: style nodes below E

Quantify the importance
- Inner node
- Leaf node

Weighting policy
- Inner node importance (1):
  $NodeImp(E) = -\sum_{i=1}^{l} p_i \log_m p_i$
  - l = |E.STYLEs|
  - m = E.parent.n, the number of pages containing E
  - p_i: percentage of tag nodes (in E.parent.n) using the i-th presentation style
- For an inner node, NodeImp(E) measures the diversity of presentation styles.

Example (m = 100):
$NodeImp(Body) = -1 \log_{100} 1 = 0$
$NodeImp(Table) = -(0.35 \log_{100} 0.35 + 2 \cdot 0.25 \log_{100} 0.25 + 0.15 \log_{100} 0.15) = 0.29 > 0$
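The two values in this example can be checked numerically. The snippet below is a small sketch of the inner-node importance as an entropy with logarithm base m; the function name and the handling of m = 1 are assumptions made for illustration.

```python
import math


def inner_node_importance(style_fractions, m):
    """Entropy of the style distribution, with logarithm base m.

    style_fractions: fraction of pages using each child style (sums to 1).
    m: number of pages containing the element node.
    """
    if m <= 1:
        # Assumption: log base 1 is undefined, so a node seen on a single
        # page is treated as fully important here.
        return 1.0
    return sum(-p * math.log(p, m) for p in style_fractions if p > 0)


body = inner_node_importance([1.0], 100)
table = inner_node_importance([0.35, 0.25, 0.25, 0.15], 100)
print(round(body, 2), round(table, 2))
```

A node whose children always use the same style gets importance 0, while a node with several competing styles gets a positive value.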

Weighting policy
- Features (terms) of a leaf node
- Importance of a leaf node's features (3):
  $H_E(a_i) = -\sum_{j=1}^{m} p_{ij} \log_m p_{ij}$
  - m = E.parent.n, the number of pages containing E
  - p_{ij}: probability that a_i appears in E of page j
  - H_E(a_i): information entropy of a_i
  - The higher H_E(a_i), the less important a_i.

Weighting policy
- Leaf node importance (2):
  $NodeImp(E) = \frac{1}{N} \sum_{i=1}^{N} (1 - H_E(a_i))$
  - N: number of features in E
  - a_i: a feature of content in E
  - (1 - H_E(a_i)): information contained in a_i
- For a leaf node, NodeImp(E) measures the content diversity of E.

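Example: leaf node importance
[Figure: SST fragment with nodes root, Ep, IMG, TABLE, and leaf element node E (n = 3)]
- Features of E per page: t1: PCMag, samsung; t2: PCMag, epson; t3: PCMag, canon
- m = 3
- N = |{PCMag, samsung, epson, canon}| = 4
- $H_E(PCMag) = -3 \cdot (\frac{1}{3} \log_3 \frac{1}{3}) = 1$
- $H_E(samsung) = H_E(epson) = H_E(canon) = -(0 + 0 + 1 \log_3 1) = 0$
- $NodeImp(E) = ((1 - 1) + 3 \cdot (1 - 0)) / 4 = 0.75$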
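The same calculation can be reproduced in code. This is a minimal sketch that assumes p_ij is the share of a feature's occurrences that fall on page j, which matches the numbers in the example; the function names are illustrative.

```python
import math
from collections import Counter


def feature_entropy(counts_per_page):
    """Entropy of one feature across the m pages, with logarithm base m."""
    m = len(counts_per_page)
    total = sum(counts_per_page)
    if m <= 1 or total == 0:
        return 0.0  # assumption: no spread across pages means zero entropy
    return sum(-(c / total) * math.log(c / total, m)
               for c in counts_per_page if c > 0)


def leaf_node_importance(pages):
    """pages: one list of features per page that contains the leaf node."""
    counters = [Counter(page) for page in pages]
    features = sorted(set().union(*pages))
    if not features:
        return 0.0  # assumption: an empty leaf carries no information
    entropies = [feature_entropy([c[a] for c in counters]) for a in features]
    return sum(1 - h for h in entropies) / len(features)


pages = [["PCMag", "samsung"], ["PCMag", "epson"], ["PCMag", "canon"]]
importance = leaf_node_importance(pages)
print(round(importance, 2))
```

"PCMag" appears on every page, so it contributes nothing, while the three page-specific terms each contribute fully.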

Transitive weighting policy: composite importance
[Figure: SST annotated with importance values 0, 0.29, 0, and 0.75]

Page noise
- Noisy element node: for an element node E in the SST, if E and all of its descendants have composite importance less than a specified threshold t, then E is noisy.
  - Maximal noisy element node
- Meaningful element node: if an element node E in the SST does not contain any noisy descendant, then E is meaningful.
  - Maximal meaningful element node

Web page cleaning via block elimination
- We can use the SST (site style tree) to identify and eliminate noisy content blocks in a page.
  - Build the SST from sample pages crawled from a site.
  - Compute an importance value for each block, and use a specified threshold t to decide whether the block is noisy.
  - Given a new page, match its blocks against the noisy and non-noisy blocks in the tree.

Noise Detection and Elimination
[Figure: DOM tree of an example page with node types root, Body, Table, Tr, Img, A, Text, and P]

After simplification
[Figure: simplified tree retaining root, Body, Table, Tr, Tr, Img, Text, Table]

Summary of the technique
- Evaluate commonality and diversity of content and styles
  - DOM trees
  - SST
- Information-entropy-based evaluation
  - Node importance
  - Composite importance
- Noise detection and automatic matching

Near duplicate detection

- "Syntactic clustering of the web contents," WWW6, 1997
- "Identifying and Filtering Near-Duplicate Documents," CPM, 2000

Document Representation
- How to represent a document?
  - Represent the document content by a feature set, in preparation for computing resemblance or similarity.
  - For document D, extract its feature set as S(D).

Defining similarity of documents
- How to express the concept "roughly the same" precisely?
- Quantitative definition: resemblance
  - The resemblance of two documents A and B is a number between 0 and 1.

Defining similarity of documents (cont'd)
- Resemblance: $r(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$ (the Jaccard coefficient)
  - Symmetric, reflexive, not transitive, not a metric
  - Note r(A, A) = 1
  - But r(A, B) = 1 does not mean A and B are identical!
  - Forgives any number of occurrences and any permutations of the terms.
- Resemblance distance: $d(A, B) = 1 - r(A, B)$

Feature Selection
- Assume we have converted the page into a sequence of tokens.
  - Eliminate punctuation and HTML markup, convert to lower case, etc.
- How to do feature selection, S(D) = ?
  - Document level
  - Character/word level
  - Shingle level

Shingling
- A contiguous subsequence contained in D is called a shingle.
- Given a document D, we define its w-shingling S(D, w) as the set of all unique shingles of size w contained in D.
  - D = (a, rose, is, a, rose, is, a, rose)
  - S(D, 4) = {(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is)}
  - "a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is
- Why shingling? Compare S(D, 4) vs. S(D, 1).
- What is a good value for w?

Shingling & Jaccard Coefficient
- Doc1 = "to be or not to be, that is a question!"
- Doc2 = "to be a question or not"
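The resemblance of the two example documents can be worked out with a short script. This is a sketch under simple assumptions: tokens are lower-cased alphanumeric runs, shingles are word tuples, and two empty shingle sets are treated as identical.

```python
import re


def tokenize(text):
    """Lower-case the text and drop punctuation (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens, w):
    """Set of all unique contiguous w-token subsequences."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w):
    """Jaccard coefficient of the w-shinglings of two texts."""
    sa, sb = shingles(tokenize(a), w), shingles(tokenize(b), w)
    union = sa | sb
    # Assumption: two documents with no shingles at all count as identical.
    return len(sa & sb) / len(union) if union else 1.0


doc1 = "to be or not to be, that is a question!"
doc2 = "to be a question or not"
print(resemblance(doc1, doc2, 1), resemblance(doc1, doc2, 2))
```

With w = 1 the documents share 6 of 8 distinct words; with w = 2 they share only 3 of 10 distinct word pairs, which shows how larger shingles make the measure sensitive to word order.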

Sketches
- The set of all shingles is large.
  - Bigger than the original document.
- Can we create a document sketch by sampling only a few shingles?
- Requirement
  - Sketch resemblance should be a good estimate of document resemblance.

Choosing a sketch
- Random sampling
  - M(A) = set of s shingles from A, chosen uniformly at random; similarly M(B).
- Does it work?
  - E.g., suppose we have identical documents A and B, each with n shingles.
  - For s = 1: $E[|M(A) \cap M(B)|] = 1/n$
  - But r(A, B) = 1.
  - So the sketch overlap is an underestimate.
  - Verify that this is true for any value of s.

Choosing a sketch
- Improvements:
  - Random sampling + compare "special" item
  - Random permutations + compare "smallest" shingle
- Random permutation
  - Let $\Omega$ be a set (e.g., 1..N).
  - Pick a permutation $\pi: \Omega \to \Omega$ uniformly at random.
  - Example: $\pi$ = {3, 7, 1, 4, 6, 2, 5}, A = {2, 3, 6}, $MIN(\pi(A))$ = ?

Estimating Jaccard Coefficient
- Theorem: if permutations $\pi$ are picked uniformly at random from the n! possible permutations, then
  $\Pr[MIN(\pi(A)) = MIN(\pi(B))] = \frac{|A \cap B|}{|A \cup B|} = r(A, B)$

Choosing a sketch
- Create a "sketch vector" (e.g., of size 200) for each document.
- For doc d, sketch_d[i] is computed as follows:
  - Let f map all shingles in the universe to $0..2^m$.
  - Let $\pi_i$ be a specific random permutation on $0..2^m$.
  - Pick $MIN \pi_i(f(s))$ over all shingles s in d.
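A sketch vector of this kind can be prototyped as follows. The code is an illustrative approximation, not the method from the cited papers: it assumes shingles are tuples of strings, uses BLAKE2b as the fingerprint function f, works in the range 0..2^61 - 2 instead of 0..2^m, and replaces truly random permutations with random linear maps x -> (a*x + b) mod p, which are permutations of 0..p-1 but only approximately min-wise independent.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime; (a*x + b) % PRIME permutes 0..PRIME-1


def fingerprint(shingle):
    """f: map a shingle (tuple of strings) to an integer below PRIME."""
    data = " ".join(shingle).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % PRIME


def make_permutations(count, seed=0):
    """Pick `count` random linear permutations, fixed by the seed."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(count)]


def sketch(shingle_set, permutations):
    """sketch[i] = minimum permuted fingerprint over a non-empty shingle set."""
    prints = [fingerprint(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in prints) for a, b in permutations]


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree."""
    equal = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return equal / len(sketch_a)


perms = make_permutations(200)
doc_a = {("t%d" % i,) for i in range(0, 150)}
doc_b = {("t%d" % i,) for i in range(50, 200)}
estimate = estimate_resemblance(sketch(doc_a, perms), sketch(doc_b, perms))
```

Both documents must be sketched with the same list of permutations; here the true resemblance is 100/200 = 0.5, and the estimate fluctuates around that value.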

Computing Sketch[i] for Doc1
[Figure: Document 1 on the number line 0..2^64: start with 64-bit shingles, permute on the number line with π_i, pick the min value]

Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: minimum permuted values of Document 1 and Document 2 on the number line 0..2^64. Are these equal? Test for 200 random permutations: π_1, π_2, …, π_200]

Finding all near-duplicates
- A naïve implementation makes O(N^2) sketch comparisons.
  - Suppose N = 100 million.
- How can you do it faster?

Short Features
- Divide every sketch into k groups of s elements each.
- With high probability, two documents share more than a certain number of features if and only if their resemblance is very high.

High Band Pass Filter
- Divide every sketch into k groups of s elements each.
- P(k, s, r): the probability that two sketches have r or more equal groups.
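The filter's behavior can be explored numerically. The following sketch assumes that each sketch element matches independently with probability equal to the resemblance ρ, so a whole group of s elements matches with probability ρ^s and the number of matching groups is binomial; the function name and the parameter values in the demo loop are illustrative assumptions.

```python
from math import comb


def prob_at_least_r_groups(rho, k, s, r):
    """P(k, s, r) as a function of resemblance rho, under independence."""
    group_match = rho ** s  # all s elements of one group agree
    return sum(comb(k, i) * group_match ** i * (1 - group_match) ** (k - i)
               for i in range(r, k + 1))


# Illustrative parameter choice (an assumption, not taken from the slides).
for rho in (0.5, 0.8, 0.9, 0.95):
    print(rho, prob_at_least_r_groups(rho, k=6, s=14, r=2))
```

Plotting this function against ρ gives the steep S-shaped curve that motivates the name: pairs with low resemblance almost never pass, while pairs with very high resemblance almost always do.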

Finding all near-duplicates
- A naïve implementation makes O(N^2) sketch comparisons.
  - Suppose N = 100 million.
  - => Reduce to O(N log N) by filtering.
- Divide-Compute-Merge (DCM)
  - Divide the data into batches that fit in memory.
  - Operate on each individual batch and write out partial results in sorted order.
  - Merge the partial results.
  - A generalization of external sorting.

DCM Steps
1. Input, one sketch per document: doc1: s11, s12, …, s1k; doc2: s21, s22, …, s2k; …
2. Invert: <s11, doc1>, <s12, doc1>, …, <s1k, doc1>, <s21, doc2>, …
3. Sort on shingle_fp: <t1, doc1>, <t1, docX>, …, <t2, doc1>, <t2, docY>, …
4. Invert and pair: <doc1, docX, 1>, <doc1, docY, 1>, …
5. Sort on <docid1, docid2>: <doc1, docX, 1>, <doc1, docX, 1>, …, <doc1, docY, 1>, …, <doc1, docZ, 1>, …
6. Merge: <doc1, docX, 2>, <doc1, docY, 10>, …

Finding all near-duplicates
1. Calculate a sketch for each document.
2. For each document, write out the pairs <shingle_fp, docId>.
3. Sort by shingle_fp (DCM).
4. In a sequential scan, generate triplets of the form <docId1, docId2, 1> for pairs of docs that share a shingle (DCM).
5. Sort on <docId1, docId2> (DCM).
6. Merge the triplets with common docids to generate triplets of the form <docId1, docId2, count> (DCM).
7. Output document pairs whose resemblance exceeds the threshold.
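The seven steps can be illustrated with a small in-memory version. This is only a sketch: it assumes the sketches are already computed (step 1) and fit in memory, so the sorts stand in for the external DCM passes, and it assumes the resemblance is estimated as shared fingerprints divided by the size of the union of the two sketches.

```python
from collections import Counter
from itertools import combinations, groupby
from operator import itemgetter


def near_duplicate_pairs(sketches, threshold):
    """sketches: dict doc_id -> iterable of shingle fingerprints.
    Returns sorted (doc1, doc2, resemblance) triples above the threshold."""
    sets = {doc: set(fps) for doc, fps in sketches.items()}
    # Steps 2-3: write out <shingle_fp, doc_id> pairs and sort by fingerprint.
    pairs = sorted((fp, doc) for doc, fps in sets.items() for fp in fps)
    # Steps 4-6: each run with the same fingerprint yields <doc1, doc2, 1>
    # triplets; the Counter merges them into <doc1, doc2, count>.
    counts = Counter()
    for _, run in groupby(pairs, key=itemgetter(0)):
        docs = [doc for _, doc in run]
        for d1, d2 in combinations(docs, 2):
            counts[(d1, d2)] += 1
    # Step 7: keep pairs whose estimated resemblance exceeds the threshold.
    result = []
    for (d1, d2), shared in sorted(counts.items()):
        union = len(sets[d1]) + len(sets[d2]) - shared
        resemblance = shared / union
        if resemblance > threshold:
            result.append((d1, d2, resemblance))
    return result


example = {"d1": [1, 2, 3, 4], "d2": [1, 2, 3, 5], "d3": [7, 8, 9, 10]}
print(near_duplicate_pairs(example, 0.5))
```

Only documents that share at least one fingerprint are ever paired, which is what avoids the O(N^2) comparison of all sketches.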

Summary of this lecture
- Web noises
- Template detection
  - Semantic and syntactic definitions
- Weighting
  - SST
  - Information entropy of features
- Near-duplicate detection
  - Jaccard similarity
  - Shingling
  - Sketch
  - Filtering

References
[1] Z. Bar-Yossef and S. Rajagopalan, "Template detection via data mining and its applications," in Proceedings of the 11th international conference on World Wide Web. Honolulu, Hawaii, USA: ACM Press, 2002.
[2] L. Yi, B. Liu, and X. Li, "Eliminating noisy information in Web pages for data mining," in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. Washington, D.C.: ACM Press, 2003.
[3] D. Gibson, K. Punera, and A. Tomkins, "The volume and evolution of web page templates," in Special interest tracks and posters of the 14th international conference on World Wide Web. Chiba, Japan: ACM Press, 2005.
[4] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the Web," in Selected papers from the sixth international conference on World Wide Web. Santa Clara, California, United States: Elsevier Science Publishers Ltd., 1997.
[5] N. Shivakumar and H. Garcia-Molina, "Finding near-replicas of documents on the web," presented at Proceedings of Workshop on Web Databases (WebDB'98), Mar. 1998.

Related Resources
- Html-tidy code: http://code.google.com/p/html-tidy/
- Shingle code: http://research.microsoft.com/research/downloads/Details/4e0d0535-ff4c-4259-99fa-ab34f3f57d67/Details.aspx?0sr=d

Thank You! Q&A

Reading Material
[1] IIR Chapter 19.6
[2] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, pp. 513-523, 1988.

DOM Tree
- The W3C Document Object Model allows programs and scripts to dynamically access and update the content, structure, and style of documents.
- The document can be further processed, and the results of that processing can be incorporated back into the presented page.

Information Entropy
- In information theory, entropy is a measure of the uncertainty associated with a random variable.
- The term by itself in this context usually refers to the Shannon entropy, which quantifies, in the sense of an expected value, the information contained in a message, usually in units such as bits.

Estimating algorithm
1. Generate a set of m random permutations $\pi$
2. for each $\pi$ do
3.   compute $MIN(\pi(A))$ and $MIN(\pi(B))$
4.   check if $MIN(\pi(A)) = MIN(\pi(B))$
5. end for
6. if equality was observed in k cases, estimate $r(A, B) = k / m$

Some other approaches
- For a set W of shingles, let $MIN_s(W)$ = the set of the s smallest shingles in W.
  - Assume documents have at least s shingles.
- Define
  - $M(A) = MIN_s(S(A))$
  - $M(AB) = MIN_s(M(A) \cup M(B))$
  - $r'(A, B) = |M(AB) \cap M(A) \cap M(B)| / s$
- By increasing the sample size s, we can make it very unlikely that r'(A, B) is significantly different from r(A, B).
  - 100-200 shingles is sufficient in practice.
- Compute a fingerprint f for each shingle (e.g., Rabin fingerprint).
  - 40 bits is usually enough to keep estimates reasonably accurate.
  - The fingerprint also eliminates the need for a random permutation.
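The estimator defined on this slide is short enough to write out directly. The snippet is a sketch that assumes the inputs are sets of integer shingle fingerprints, each with at least s elements, and that the fingerprints behave like uniformly random values; the function names are illustrative.

```python
def min_s(values, s):
    """MIN_s(W): the set of the s smallest values in W."""
    return set(sorted(values)[:s])


def estimated_resemblance(fps_a, fps_b, s):
    """r'(A, B) computed from the s smallest fingerprints of each document."""
    m_a, m_b = min_s(fps_a, s), min_s(fps_b, s)
    m_ab = min_s(m_a | m_b, s)
    return len(m_ab & m_a & m_b) / s
```

Only the s smallest fingerprints of each document need to be stored, so the comparison cost is independent of document length. With structured inputs such as consecutive integers the estimate can be far from the true resemblance, which is why the fingerprints are assumed to look random.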

DCM algorithm
1. for each random permutation $\pi$ do
2.   create a file $f_\pi$
3.   for each document d do
4.     write out $\langle s = MIN(\pi(d)), d \rangle$ to $f_\pi$
5.   end for
6.   sort $f_\pi$ using key s -- this results in contiguous blocks with fixed s containing all associated d
7.   create a file $g_\pi$
8.   for each pair $(d_1, d_2)$ within a run of $f_\pi$ having a given s do
9.     write out a document-pair record $\langle d_1, d_2 \rangle$ to $g_\pi$
10.   end for
11.   sort $g_\pi$ on key $\langle d_1, d_2 \rangle$
12. end for
13. merge $g_\pi$ for all $\pi$ in $\langle d_1, d_2 \rangle$ order, counting the number of entries

Some optimizations
- "Invert and Pair" is the most expensive step.
- We can speed it up by eliminating very common shingles.
  - Common headers, footers, etc.
  - Do it as a preprocessing step.
- Also, eliminate exact duplicates up front.
- Probabilistic counting [5]

Detecting duplicate pages

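Homework
1. Shingling and SimHash are currently the two mainstream algorithms for detecting similar Web pages. This lecture introduced Shingling. Please look up material on SimHash and write a brief survey:
   - List several of the main papers.
   - Give a brief description of the algorithm.
2. Read the paper "Achieving both high precision and high recall in near-duplicate detection" and answer the question:
   - What are the shortcomings of Shingling and SimHash?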