Schema-Guided Wrapper Maintenance for Web-Data Extraction
Xiaofeng Meng, Dongdong Hu (Renmin University of China, Beijing, China)
Chen Li (University of California, Irvine, CA, USA)

Wrappers for Web Sources
- Extract information from Web pages
- Used in many Web-based applications
[Figure: wrappers connecting sources to application programs. Labels: HTML documents, XML, RDBMS, wrapper, application programs (e.g., data integration).]

Problem
- The Web is very dynamic: contents, page structures
- Original wrappers can stop working: they rely on Web page structures
- Re-generating wrappers is not easy: heavy workload for system developers
[Figure: the original wrapper applied to changed documents extracts nothing, incomplete results, or incorrect results.]

Example
The original wrapper fails due to the structure change.

Problems
- Wrapper verification: is a wrapper operating correctly?
  - Several studies have been conducted on the verification problem
  - E.g., computing the similarity between a wrapper's expected and observed output, "regression test"
- Wrapper maintenance: how to automatically modify a wrapper when the pages have changed?
  - Focus of this work

Outline
- Motivation
- System Overview
- Schema-Guided Wrapper Maintenance
- Experiments
- Related Work and Conclusion

The SG-WRAM System
[Figure: system architecture. Labels: Documents, Changed Documents, Wrapper Generator, Wrapper, Schema, Rule Re-induction, Wrapper Executor, XML Repository, Rule, Data Feature Discovery, Wrapper Maintainer, Block Configuration, Data Item Recovery.]

User-Defined Schema
User provides schema for the target data
    <!ELEMENT VideoList (Video+)>
    <!ELEMENT …

Schema-Guided Wrapper Generation
- Using a GUI toolkit, users can map data items in HTML pages to elements in the DTD
[Figure: HTML page and DTD tree.]

Schema-Guided Wrapper Generation
- Internally, the system computes the mappings from the corresponding HTML tree to the DTD tree
- It then generates the extraction rule
[Figure: HTML tree and DTD tree.]

Expressing Extraction Rule in XQuery
- Each rule is an FLWR XQuery expression
Example:
    FOR $video IN $videoList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0]/td[1]
    RETURN {
      LET $name = $video/span[0]/b[0]/a[0]/text()[0]
      RETURN $name
    }
Callouts: paths to the data items; value of the data item.

Annotations
- Describe the semantic meaning of a data item
- Indicate the location of the data item
- Specified by the user using the GUI
- Recorded in the function "contains(pathToAnnotation, annotationValue)" in XPath
    /body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null, "directed by")]

Data values in HTML page                | Annotations
May Morning - Ugo Liberatore            | directed by
Jane Birkin; John Steiner; Rosella Falk | Featuring
15.38-23.26                             | DVD
14.98-18.99                             | VHS

Outline
- Motivation
- System Overview
- Wrapper Maintenance (four steps):
  - Data-Feature Discovery
  - Item Recovery
  - Block Configuration
  - Rule Re-induction
- Experiments
- Related Work and Conclusion

Intuition of the approach
- The page structure could change
- Observation: many "features" of data items are more static, e.g.:
  - Hyperlink
  - Annotation
  - Pattern
- These features can help us find the new places of the old data items

Step 1: Data-Feature Discovery
- Compute features of the data items in the original page

ID | DTD Element | L (hyperlink) | A (annotation) | P (data pattern)
1  | Name        | True          | NULL           | [A-Z][a-z]{0,}
2  | Director    | False         | Directed by    | [A-Z][a-z]{0,}
3  | Actors      | False         | Featuring      | [A-Z][a-z]{0,}(.)*
4  | VHSPrice    | False         | VHS            | [$][0-9]{0,}[0-9](.)[0-9]{2}
5  | DVDPrice    | False         | DVD            | [$][0-9]{0,}[0-9](.)[0-9]{2}

Data-Pattern Feature
- A syntactic feature
- Represented as a regular expression
  - E.g., $15.38 is described by [$][0-9]{0,}[0-9](.)[0-9]{2}
  - Can be extracted using existing technologies, e.g., [Brin 98], [GHQR 98], [LM 00]
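The data-pattern check can be illustrated with a minimal Python sketch. It uses the price pattern exactly as written on the slide; the function name `matches_pattern` and the choice of a whole-string match are assumptions made for illustration, not part of SG-WRAM.

```python
import re

# Price pattern as written on the slide. Note that "(.)" is an unescaped dot,
# so it accepts any single character, not only a literal ".".
PRICE_PATTERN = re.compile(r"[$][0-9]{0,}[0-9](.)[0-9]{2}")


def matches_pattern(text, pattern=PRICE_PATTERN):
    """Return True when the whole (stripped) string conforms to the pattern.

    Hypothetical helper: whole-string matching is an assumption of this sketch.
    """
    return pattern.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for candidate in ["$15.38", "$5.99", "May Morning", "15.38"]:
        print(candidate, matches_pattern(candidate))
```

Because the slide's pattern leaves the dot unescaped, a string such as "$15x38" also matches; a stricter variant would escape it as `\.`.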

Annotations and Hyperlinks
- Get annotation and hyperlink information from the original page
  - Checking the XQuery-based extraction rule
  - Hyperlink: step of ".../a/..." in the path
  - Annotation: function of "contains()"
    { LET $name = $video/span[0]/b[0]/a[0]/text()[0] RETURN $name }
    { LET $actors = $video/text()[contains(/preceding-sibling::b[0], "Featuring")] RETURN $actors }
Callouts: hyperlink indication (the a[0] step); path from data item to annotation (/preceding-sibling::b[0]); annotation value ("Featuring").
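Reading the two features off an extraction path can be sketched as follows. This is a simplified illustration, assuming paths written in the slide's notation with at most one `contains(path, "value")` predicate; the regular expressions and the name `features_from_path` are hypothetical and not taken from the SG-WRAM implementation.

```python
import re

# A predicate of the form [contains(<path to annotation>, "<annotation value>")].
CONTAINS = re.compile(r'\[contains\(\s*([^,]*?)\s*,\s*"([^"]*)"\s*\)\]')
# A location step that is exactly an <a> element, optionally indexed: /a or /a[0].
A_STEP = re.compile(r"/a(\[\d+\])?(?=/|$)")


def features_from_path(path):
    """Return the hyperlink and annotation features encoded in one path."""
    found = CONTAINS.search(path)
    annotation = found.group(2) if found else None
    # Remove the predicate so its inner path is not mistaken for a location step.
    location = CONTAINS.sub("", path)
    hyperlink = A_STEP.search(location) is not None
    return {"hyperlink": hyperlink, "annotation": annotation}


if __name__ == "__main__":
    print(features_from_path("$video/span[0]/b[0]/a[0]/text()[0]"))
    print(features_from_path(
        '$video/text()[contains(/preceding-sibling::b[0], "Featuring")]'))
```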

Step 2: Data-Item Recovery
- Traverse the new HTML tree following the depth-first traversal order
- Use the old features to identify potential data items using 3 matching conditions:
  - Hyperlink
  - Annotation
  - Data pattern

Example
[Figure: flowchart of item recovery on a sample page. Labels: check hyperlink; check data pattern ([A-Z][a-z]{0,}); find annotation; find value starting from annotation; check data pattern ([$][0-9]{0,}[0-9](.)[0-9]{2}); recognize a data item.]

Results of Data Item Recovery
- A mapping list including all the recognized data items
- Each mapping contains
  - Value of the data item
  - Path to it in the HTML tree
  - Path of the corresponding DTD element
A sample mapping:
    M1' (D: "May", HP: …/table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name)
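The recovery step can be sketched as a depth-first walk that emits (value, HTML path, schema path) mappings. Everything here is a simplified assumption for illustration: the `Node` and `Feature` classes, prefix matching of the data pattern, and the rule that an annotation labels only the next text value in traversal order are not taken from the SG-WRAM implementation.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical HTML tree node: a tag, its own text, and child nodes."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


@dataclass
class Feature:
    """Features of one DTD element as discovered on the original page."""
    element: str               # path of the DTD element
    hyperlink: bool            # L: the value sits inside an <a> element
    annotation: Optional[str]  # A: label text expected before the value
    pattern: str               # P: regular expression for the value


def recover_items(root, features):
    """Depth-first traversal returning (value, html_path, schema_path) tuples."""
    mappings = []
    known = {f.annotation for f in features if f.annotation}
    state = {"annotation": None}  # most recent annotation seen in DFS order

    def visit(node, path, in_link):
        in_link = in_link or node.tag == "a"
        text = node.text.strip()
        if text in known:
            state["annotation"] = text
        elif text:
            for f in features:
                if (f.hyperlink == in_link
                        and f.annotation == state["annotation"]
                        and re.match(f.pattern, text)):  # prefix match (assumption)
                    mappings.append((text, path, f.element))
                    break
            state["annotation"] = None  # an annotation labels only the next value
        seen = {}
        for child in node.children:
            index = seen.get(child.tag, 0)  # 0-based index among same-tag siblings
            seen[child.tag] = index + 1
            visit(child, f"{path}/{child.tag}[{index}]", in_link)

    visit(root, "/" + root.tag, False)
    return mappings


FEATURES = [
    Feature("VideoList/Video/Name", True, None, r"[A-Z][a-z]{0,}"),
    Feature("VideoList/Video/Director", False, "Directed by", r"[A-Z][a-z]{0,}"),
    Feature("VideoList/Video/VHSPrice", False, "VHS", r"[$][0-9]{0,}[0-9](.)[0-9]{2}"),
]

PAGE = Node("td", children=[
    Node("span", children=[Node("b", children=[Node("a", "May Morning")])]),
    Node("b", "Directed by"),
    Node("span", "Ugo Liberatore"),
    Node("b", "VHS"),
    Node("span", "$14.98"),
])

if __name__ == "__main__":
    for mapping in recover_items(PAGE, FEATURES):
        print(mapping)
```

Loose patterns such as `[A-Z][a-z]{0,}` match many strings, which is why the hyperlink and annotation conditions are checked together with the pattern in this sketch.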

Step 3: Block Configuration
- Observation: data items are located in semantic blocks
  - Blocks conform to the user-defined schema
[Figure: data items are grouped in semantic blocks. Block labels: Partial-Match, Full-Match, Over-Match.]

Computing "Full Match" Blocks
- Identify the level in a top-down manner
- Check the level by recursively considering the matches between candidate blocks and the schema
[Figure: "full match" blocks.]

Results of Block Configuration
- A set of blocks that can fully match with the DTD
- Each of them is represented as a list of mappings

Examples:
No. | Element  | Path
1   | Title    | …table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
2   | Director | …table[1]/tr[0]/td[1]/span[1]/text()[contains(/preceding-sibling::b[0], "Directed by")]
3   | Actors   | …table[1]/tr[0]/td[1]/span[2]/text()[contains(/preceding-sibling::b[0], "Featuring")]
4   | Title    | …table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
5   | Director | …table[2]/tr[0]/td[1]/span[1]/text()[contains(/preceding-sibling::b[0], "Directed by")]
6   | Actors   | …table[2]/tr[0]/td[1]/span[2]/text()[contains(/preceding-sibling::b[0], "Featuring")]

Step 4: Rule Re-Induction
- Semantic blocks contain mappings from data items in HTML to DTD elements
- Induce a new extraction rule by calling the induction algorithm in the wrapper generator
- Refine the rule so that it covers all other semantic blocks
  - Generalization is necessary

Outline
- Motivation
- System Overview
- Wrapper Maintenance (four steps):
  - Data-Feature Discovery
  - Item Recovery
  - Block Configuration
  - Rule Re-induction
- Experiments
- Related Work and Conclusion

Web Sources
- From October 2002 to May 2003
- Collected Web page changes
  - From 16 data-intensive sites
  - Using site search engine or from the same URL
- All the pages have complex table structures
- Observed changes
  - Data items (add, delete, modify)
  - Table structure to non-table structure
  - Complex table structure rearrangement
Sites listed: 1Bookstreet Book, Allbooks4less Book, Amazon Book (search), Amazon Magazine, Barnesandnoble Book, CIA Factbook, CNN Currency, Excite Currency, Hotels Hotel, Yahoo Shopping Video, Yahoo Quotes, Yahoo People Email

Experiment Procedures
[Figure: experiment workflow. Labels: Original Web Docs; New Web Docs; step 1: Wrapper Generator, Original Wrappers, Wrapper Repository; step 2: Check Extraction Results, Changed Pages; step 3: Wrapper Maintainer, Repaired Wrappers.]

Experiment Metrics
- Recall (R)
  - Proportion of correctly extracted data items among all the data items that should be extracted
- Precision (P)
  - Proportion of correctly extracted data items among all the data items that have been extracted
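The two metrics can be computed as in the short Python sketch below. Treating data items as set members and reporting percentages are assumptions of this illustration, as is returning `None` when a ratio is undefined (for example, precision when nothing was extracted).

```python
def recall_precision(extracted, expected):
    """Return (recall, precision) in percent for two collections of data items.

    A value is None when its denominator is zero, e.g. precision is undefined
    if the wrapper extracted nothing.
    """
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    recall = 100.0 * correct / len(expected) if expected else None
    precision = 100.0 * correct / len(extracted) if extracted else None
    return recall, precision


if __name__ == "__main__":
    expected = {"May Morning", "Ugo Liberatore", "$14.98", "$15.38"}
    extracted = {"May Morning", "Ugo Liberatore", "Featuring"}
    print(recall_precision(extracted, expected))
```

In this example two of the four expected items are recovered (recall 50%), and two of the three extracted items are correct (precision about 66.7%).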

Original wrappers after changes

Name                 | # of changed pages | Item Number | Avg Recall | Avg Precision
1Bookstreet Book     | 12                 | 6           | 82.54      | 100
Allbooks4less Book   | 15                 | 4           | 0          | -
Amazon Book (search) | 15                 | 6           | 40.49      | 100
Amazon Magazine      | 15                 | 5           | 20.01      | 100
Barnesandnoble Book  | 15                 | 5           | 0          | 100
CIA Factbook         | 5                  | 10          | 0          | 100
CNN Currency         | 15                 | 6           | 50.00      | 100
Excite Currency      | 18                 | 11          | 42.86      | 100
Hotels Hotel         | 15                 | 4           | 0          | -
Yahoo Shopping Video | 15                 | 6           | 0          | -
Yahoo Quotes         | 10                 | 6           | 0          | -
Yahoo People Email   | 10                 | 3           | 0          | -

New wrappers (after item recovery)

Web site             | Avg Recall | Avg Precision
1Bookstreet Book     | 98.67      | 71.26
Allbooks4less Book   | 75         | 32.69
Amazon Book (search) | 83.05      | 36.3
Amazon Magazine      | 100        | 60.15
Barnesandnoble       | 78.72      | 43.13
CIA Factbook         | 100        |
CNN Currency         | 100        |
Excite Currency      | 100        |
Hotels Hotel         | 50         | 35.61
Yahoo Shopping       | 100        | 51.49
Yahoo Quotes         | 100        |
Yahoo People         | 100        | 53.54

Rows with one value show a single figure for the recall/precision pair in the source.

New Wrappers (final)

Web site             | Avg Recall | Avg Precision
1Bookstreet Book     | 100        |
Allbooks4less Book   | 75         | 51.34
Amazon Book (search) | 83.05      | 90.74
Amazon Magazine      | 100        |
Barnesandnoble       | 78.72      | 100
CIA Factbook         | 100        |
CNN Currency         | 100        |
Excite Currency      | 100        |
Hotels Hotel         | 50         | 41.87
Yahoo Shopping       | 100        | 92.86
Yahoo Quotes         | 100        |
Yahoo People         | 100        |

Rows with one value show a single figure for the recall/precision pair in the source.

Related Work on Wrapper Maintenance
- [Kushmerick 99]
  - Using simple numeric features of the extracted strings
- [Lerman K., Minton S. 00]
  - Using the starting and ending strings as the description of the data fields
- [Chidlovskii B. 01]
  - Syntactic features of data items to be extracted, and semantic features: URL, time strings, entities…

Comparisons
- These approaches rely heavily on the syntactic features of the data items, and often cannot precisely recognize the data items.

Before the change:
Title            | Our Price | List Price
Data on Web      | $23.00    | $29.00
Java Programming | $49.00    | $59.00

After the change:
Title            | List Price | Our Price
Data on Web      | $29.00     | $23.00
Java Programming | $59.00     | $49.00

Conclusion
- SG-WRAM: a wrapper-maintenance system
- Intuition: use features that are more stable
  - Pattern
  - Hyperlink
  - Annotation
- Four steps of the approach:
  - Data-Feature Discovery
  - Item Recovery
  - Block Configuration
  - Rule Re-induction
- Experiments showed that it is effective

Thank you!

Backup Slides

Induce Extraction Rule from Mappings
- Automatically generalize the paths to the data items in the HTML tree to cover all the possible occurrences
- Generate an extraction rule conforming to the schema
For concise expressions, we let
    CommonPath = /body[0]/div[0]/table[2]/tr[0]/td[0]/table[1]/tr[0]/td[2]/table[0]/tr[0]/td[0]

Element  | Path
Name     | CommonPath/table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
Director | CommonPath/table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
Actors   | CommonPath/table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
Name     | CommonPath/table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
Director | CommonPath/table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
Actors   | CommonPath/table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]

Callout on each mapping: /table
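One way to picture the generalization is to compare the paths of the same element across blocks and drop the positional index wherever they disagree, so that table[1] and table[2] both become table. The sketch below is an assumption-laden illustration, not the induction algorithm of the wrapper generator: it requires paths with the same number of steps and without predicates that contain "/".

```python
import re

# Splits a step such as "table[1]" into its name ("table") and index ("[1]").
STEP = re.compile(r"^(.*?)(\[\d+\])?$")


def generalize(paths):
    """Merge same-length paths, dropping the index of any step that differs."""
    split = [p.strip("/").split("/") for p in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("paths must have the same number of steps")
    merged = []
    for column in zip(*split):
        if len(set(column)) == 1:
            merged.append(column[0])  # identical step in every path: keep as is
            continue
        names = {STEP.match(step).group(1) for step in column}
        if len(names) != 1:
            raise ValueError("steps differ in element name: %r" % (column,))
        merged.append(names.pop())    # same element, different index: drop index
    return "/" + "/".join(merged)


if __name__ == "__main__":
    print(generalize([
        "/td[0]/table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]",
        "/td[0]/table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]",
    ]))
```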

Example
- Each mapping has:
  - Value of the data item
  - Path to it in the new HTML tree
  - Path of the corresponding DTD element
    M1' (D: "Lucky Day", HP: …/table/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name)
    M2' (D: "Penelope Buitenhuis", HP: …/table/tr[0]/td[1]/span/text()[contains(/preceding-sibling::b[0], "Directed by")], SP: VideoList/Video/Director)
    M3' (D: "Amanda Donohoe, Tony Lo Bianco, Andrew Gillies", HP: …/table/tr[0]/td[1]/span/text()[contains(/preceding-sibling::b[0], "Featuring")], SP: VideoList/Video/Actors)

Results of Data Item Recovery
- A mapping list including all the recognized data items (including possible noise)
- Each mapping contains
  - Value of the data item
  - Path to it in the HTML tree
  - Path of the corresponding DTD element
A sample mapping:
    M1' (D: "May", HP: …/table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name)
[Figure: recognized items on the page, with noise marked.]

Removing Noisy Blocks
- Observation
  - Noisy sub-trees are commonly located at a different level from the correct blocks
  - Sub-trees of correct blocks are always near each other
- At each level of the HTML tree
  - A weight of each sub-tree is computed
  - Weight = number of recognized items (may contain noise)
  - Condition for excluding a sub-tree
    - Low weight
    - Not located near the group of higher-weighted sub-trees
[Figure: the page after removing noisy blocks.]
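The exclusion condition can be sketched for one level of the tree as follows. The slide does not say how "low" and "near" are measured, so the thresholds `low` and `near`, and the reading that a sub-tree is excluded only when both conditions hold, are assumptions of this illustration.

```python
def keep_subtrees(weights, low=1, near=1):
    """Return the indices of sibling sub-trees that are kept at one tree level.

    weights[i] is the number of recognized items in the i-th sibling sub-tree.
    A sub-tree is excluded when its weight is at most `low` AND no sub-tree
    with a weight above `low` lies within `near` positions of it.
    """
    high = [i for i, w in enumerate(weights) if w > low]
    kept = []
    for i, w in enumerate(weights):
        is_low = w <= low
        is_far = all(abs(i - j) > near for j in high)
        if not (is_low and is_far):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Sub-trees 2..4 hold most of the recognized items; 0 and 6 are isolated.
    print(keep_subtrees([1, 1, 3, 3, 2, 1, 1]))
```

With these thresholds the isolated low-weight sub-trees at both ends are dropped, while low-weight sub-trees adjacent to the heavy group are kept.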

Computing Blocks
- Identify the level in a top-down manner
- Check the level by recursively considering the matches between candidate blocks and the schema
- Exclude noisy sub-trees by counting the recognized data items in each sub-tree

Match a Semantic Block to a Schema
- A match between a semantic block A and the schema can be one of the following three cases:
  - Over match: there is at least one item i in the schema that occurs at least twice in block A.
  - Full match: block A contains all items of the schema and satisfies the constraint of each item in the schema, such as '+', '*', '?', etc.
  - Partial match: block A contains a proper subset of the items of the schema.
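The three cases can be sketched as a small classifier. The schema is modeled here as a hypothetical mapping from item name to a DTD-style cardinality ("1", "?", "+", "*"). One deliberate deviation is labeled in the code: the slide calls any repeated item an over match, whereas this sketch counts a repeat only when the item's cardinality forbids it, so that '+' and '*' stay meaningful.

```python
from collections import Counter


def classify_block(items, schema):
    """Classify a block as "over", "full", or "partial" against a flat schema.

    items:  names of the schema items recognized in the block (with repeats).
    schema: dict mapping item name to cardinality: "1", "?", "+", or "*".
    """
    counts = Counter(items)
    # Assumption: a repeat is an over match only for single-valued items.
    for name, cardinality in schema.items():
        if counts[name] > 1 and cardinality in ("1", "?"):
            return "over"
    # Items marked "?" or "*" may be absent; "1" and "+" must occur.
    required = [n for n, c in schema.items() if c in ("1", "+")]
    if all(counts[n] >= 1 for n in required):
        return "full"
    return "partial"


if __name__ == "__main__":
    video = {"Name": "1", "Director": "1", "Actors": "?"}
    print(classify_block(["Name", "Director", "Actors"], video))
    print(classify_block(["Name"], video))
    print(classify_block(["Name", "Name", "Director"], video))
```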

Experiment Procedures
- Generated wrappers for each set of original pages using SG-WRAP
- Applied the initial wrapper to the newly collected Web pages
- Checked how many data items corresponding to the elements in the DTD can still be correctly extracted, to find changed pages
- For each set of changed pages, induced a repaired wrapper using our SG-WRAM system
- Applied the repaired wrapper to the changed pages

Statistics of Data Features
- The data items' features of annotation, hyperlink, and pattern are still preserved in most of the changed pages.
- The case of "Pattern only" increases the difficulty of maintenance: the item recovery step may bring additional noise.

Maintenance

Web site             | R%(IR) | P%(IR) | R%(EX) | P%(EX)
1Bookstreet Book     | 98.67  | 71.26  | 100    |
Allbooks4less Book   | 75     | 32.69  | 75     | 51.34
Amazon Book (search) | 83.05  | 36.3   | 83.05  | 90.74
Amazon Magazine      | 100    | 60.15  | 100    |
Barnesandnoble       | 78.72  | 43.13  | 78.72  | 100
CIA Factbook         | 100    |        | 100    |
CNN Currency         | 100    |        | 100    |
Excite Currency      | 100    |        | 100    |
Hotels Hotel         | 50     | 35.61  | 50     | 41.87
Yahoo Shopping       | 100    | 51.49  | 100    | 92.86
Yahoo Quotes         | 100    |        | 100    |
Yahoo People         | 100    | 53.54  | 100    |

A recall/precision pair with one value shows a single figure in the source.