- Number of slides: 27
Data Extraction and Integration from Imprecise Web Sources Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli Studi Roma Tre (Creative Commons License, see last slide)
Data-intensive websites
Data-intensive websites target Website Template 1 Template 2 Database Template 3
Flint goal Last Min Max StockQuote … Volume 52 high Open
System architecture Web Search Flint Data Extraction [WIDM 08] Data Integration The Web
Novel contribution Data Extraction • Unsupervised • Automatic • Scalable • No knowledge available RoadRunner [VLDB 01] ExAlg [SIGMOD 03] TurboWrapper [VLDB 07] Data Integration • Unsupervised • Automatic • Scalable • Uncertain Data • No labels available • No corpus available WebTables [VLDB 08] Cimple [VLDB 07] MetaQuerier [CIDR 05] PayGo [CIDR 07]
Data Extraction
Data Extraction AAPL, GOOG, MSFT, INTC, … 128.09, 439.54, 34.89, 112.37, … 127.81, 439.25, 32.13, 111.01, … 132.43, 443.82, 33.67, 114.32, … 0.50%, -0.38%, 1.23%, 3.92%, -1.65%, … Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, … …
Data Extraction HTML fragments taken from two pages belonging to the same website:
XPath                           Page 1       Page 2
/html/body/table/tr[1]/td[2]    1,132,228    1,735,857
/html/body/table/tr[2]/td[2]    $20.66       $414.58
/html/body/table/tr[3]/td[2]    $11.70       $247.30
/html/body/table/tr[4]/td[2]    $20.72       $414.06
/html/body/table/tr[5]/td[2]    $0.02        99,494,200   ?
/html/body/table/tr[6]/td[2]    4,732,600    null
Extraction error!
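The misalignment on this slide can be reproduced with a minimal sketch. It assumes two toy pages built from the slide's values, where the second page lacks one row; the labels F1 to F6 and the helper names `make_page` and `extract` are hypothetical, and the standard-library `xml.etree.ElementTree` stands in for whatever engine the system actually uses.

```python
import xml.etree.ElementTree as ET

def make_page(rows):
    """Build a toy page: one <tr> per (label, value) pair."""
    body = "".join(
        f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in rows
    )
    return ET.fromstring(f"<html><body><table>{body}</table></body></html>")

def extract(page, path):
    """Return the text of the first node matching path, or None."""
    node = page.find(path)
    return None if node is None else node.text

# Labels F1..F6 are hypothetical; the values come from the slide.
page_a = make_page([
    ("F1", "1,132,228"), ("F2", "$20.66"), ("F3", "$11.70"),
    ("F4", "$20.72"), ("F5", "$0.02"), ("F6", "4,732,600"),
])
# The second page has no F5 row, so every later row shifts up by one.
page_b = make_page([
    ("F1", "1,735,857"), ("F2", "$414.58"), ("F3", "$247.30"),
    ("F4", "$414.06"), ("F6", "99,494,200"),
])

# Purely positional rules, relative to the <html> root element.
columns = {
    i: (
        extract(page_a, f"body/table/tr[{i}]/td[2]"),
        extract(page_b, f"body/table/tr[{i}]/td[2]"),
    )
    for i in range(1, 7)
}

for i, pair in columns.items():
    print(i, pair)
```

Under these assumptions, rule 5 pairs a dollar amount with a share count and rule 6 returns nothing for the second page, which is the kind of error the slide flags.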
Data Integration 10 33 16 4 25 10 AA GO MS (max) (min) (stock)
Data Integration t=0.5 10 33 16 4 25 10 AA GO MS (max) (min) (stock)
Data Integration t=0.5 10 33 16 4 25 10 AA GO MS (max) (min) (stock) 10 33 16 (max) 1.0 4 25 10 (min) 1.0 AA GO MS (stock) 1.0
Data Integration t=0.5 10 33 16 (max) 4 25 10 (min) t=0.5 AA GO MS (stock)
Data Integration t=0.5 10 33 16 (max) 4 25 10 t=0.5 4 25 10 AA GO MS (min) 0.6 AA GO MS (stock) 6 26 12 (price) 1.0 4 25 10 (min) 1.0 AA GO MS (stock)
Data Integration t=0.5 10 33 16 (max) 4 25 10 t=0.5 6 26 12 AA GO MS ? (min) (price) AA GO MS (stock) 1.0 4 25 10 (min) 1.0 AA GO MS (stock)
Data Integration t=0.5 10 33 16 (max) 4 25 10 6 26 12 (min) (price) 4 25 10 (min) AA GO MS (stock) 1.0 AA GO MS (stock)
Data Integration 10 33 16 (max) t=0.7 t=0.5 4 25 10 (min) 6 26 12 (price) t=0.5 AA GO MS (stock) 1.0 AA GO MS (stock)
Data Integration 10 33 16 (max) t=0.7 t=0.5 4 25 10 (min) 6 26 12 (price) t=0.5 AA GO MS (stock)
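The slides above pair extracted columns when a score clears a threshold t. The sketch below illustrates that idea only: the `similarity` function (mean relative closeness of aligned, non-negative values) and the `match` helper are hypothetical stand-ins, not the measure used by the authors, and the columns are the toy values shown on the slides.

```python
def similarity(a, b):
    """Hypothetical instance-based similarity for non-negative numeric columns.

    Averages 1 - |x - y| / max(|x|, |y|) over aligned pairs, skipping nulls.
    """
    scores = []
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # a null carries no evidence either way
        denom = max(abs(x), abs(y))
        scores.append(1.0 if denom == 0 else 1.0 - abs(x - y) / denom)
    return sum(scores) / len(scores) if scores else 0.0

def match(attributes, candidate, t):
    """Names of known attributes whose similarity with candidate reaches t."""
    return [
        name for name, values in attributes.items()
        if similarity(values, candidate) >= t
    ]

# Columns from one source, aligned on the stocks AA, GO, MS.
attributes = {"max": [10, 33, 16], "min": [4, 25, 10]}
# A column from a second source carrying the same "min" values.
candidate = [4, 25, 10]

print(match(attributes, candidate, t=0.5))
print(match(attributes, candidate, t=0.7))
```

With this stand-in measure, the looser threshold admits both attributes while the stricter one keeps only the exact match, so the choice of t directly controls how many ambiguous pairings survive.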
Wrapper Refinement 10 33 16 t=0.7 t=0.5 ? (max) 4 25 10 (min) 0.3 (weak) 10 null 10 (min/max) 0.3 (weak) 6 26 12 ? (price) 0.0 t=0.5 AA GO MS (stock) 0.0
Wrapper Refinement matching value nearby template tokens
//td[contains(text(), 'Open')]/../td[2]
//td[contains(text(), 'Open')]/../tr[5]/td[1]
//td[contains(text(), 'Open')]/../tr[5]/td[2]
//td[contains(text(), 'High')]/../td[2]
…
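Rules anchored on a nearby template token can be sketched as follows. The pages, their values, and the helpers `make_page`, `by_position`, and `near_label` are hypothetical. Because the standard-library `xml.etree.ElementTree` does not implement `contains()`, the sketch swaps in its exact-text predicate `[.='High']` (available since Python 3.7) in place of the `contains(text(), …)` test shown on the slide.

```python
import xml.etree.ElementTree as ET

def make_page(rows):
    """Build a toy page: one <tr> per (label, value) pair."""
    body = "".join(
        f"<tr><td>{label}</td><td>{value}</td></tr>" for label, value in rows
    )
    return ET.fromstring(f"<html><body><table>{body}</table></body></html>")

def by_position(page, row):
    """Positional rule: second cell of the row-th table row."""
    node = page.find(f"body/table/tr[{row}]/td[2]")
    return None if node is None else node.text

def near_label(page, label):
    """Anchored rule: second cell of the row whose label cell equals label."""
    node = page.find(f".//td[.='{label}']/../td[2]")
    return None if node is None else node.text

# Hypothetical pages: the first one has an extra row between Open and High.
page_a = make_page([("Open", "10.0"), ("Note", "x"), ("High", "12.0")])
page_b = make_page([("Open", "30.0"), ("High", "33.0")])

positional = [by_position(p, 2) for p in (page_a, page_b)]
anchored = [near_label(p, "High") for p in (page_a, page_b)]

print(positional)
print(anchored)
```

In this toy setting the positional rule mixes two different fields across the pages, while the rule anchored on the label returns the same field from both, which is the motivation for rewriting rules relative to template tokens.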
Wrapper Refinement t=0.7 t=0.5 10 33 16 4 25 10 (max) 4 25 10 (min) 1.0 10 33 16 4 25 10 (max) (min) 10 null 10 (min/max) t=0.7 6 26 12 (price) t=0.5 AA GO MS (stock) //td[contains(text(), 'Max')]/../td[2] //td[contains(text(), 'Min')]/../td[2]
Wrapper Refinement 10 33 16 (max) t=0.7 t=0.5 10 33 16 (max) 10 null 10 (min/max) 4 25 10 (min) 4 25 10 6 26 12 (min) (price) t=0.5 AA GO MS (stock)
Experimental Results (100 websites for each domain)
Soccer domain (45,714 pages)    Videogame domain (49,262 pages)    Finance domain (57,623 pages)
Attribute        |m|            Attribute         |m|              Attribute         |m|
• Name            90            • Title            86              • Stock Symbol     84
• Birth Date      61            • Publisher        59              • Price Change     73
• Height          54            • Developer        45              • % Change         73
• Nationality     48            • Genre            28              • Volume           52
• Club            43            • ESRB rating      40              • Day Low          43
• Position        43            • Release Date      9              • Day High         41
• Weight          34            • Platform          9              • Last Price       29
• League          14            • # Players         6              • Open Price       24
Demo • Found Websites • Integrated Data
The end! http://flint.dia.uniroma3.it
License • This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.