
CRAWLING THE HIDDEN WEB
Authors: S. Raghavan & H. Garcia-Molina
Presenter: Nga Chung

OUTLINE
  • Introduction
  • Challenges
  • Approach
  • Experimental Results
  • Contributions
  • Pros and Cons
  • Related Work

INTRODUCTION
Hidden Web: content stored in databases that can only be retrieved through a user query, e.g. medical research databases, flight schedules, product listings, news archives, and social media (blog posts, comments).
So why should we care?
  • The scale of the web (55–60 billion pages) does not include the deep web or pages behind security walls [2]
  • A 2001 estimate put the Hidden Web at 500 times the size of the publicly indexed web
  • Mike Bergman: "The long-term impact of Deep Web search had more to do with transforming business than with satisfying the whims of Web surfers." [5]

CHALLENGES
From a search engine's perspective:
  • Locate the hidden databases
  • Identify which databases to search for a given user query
From a crawler's perspective:
  • Interact with a search form; search can be form-based, facet/guided navigation, or free-text, all of which are intended for human users [3]
  • Know what keywords to put into the form fields
  • Filter the search results returned from search queries
  • Define metrics to measure the crawler's performance

HIDDEN WEB EXPOSER (HiWE) ARCHITECTURE
The original slide is a block diagram; its main components are:
  • Crawl Manager, fed by a URL List
  • Parser, which extracts forms and links from fetched pages
  • Form Analyzer, which builds the internal form representation
  • Task-specific database holding the Label Value Set (LVS) table, maintained by an LVS Manager and populated from external data sources
  • Form Processor, which performs form submission against the WWW
  • Response Analyzer, which inspects each response page and feeds the result back to the crawler
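A minimal sketch of how these components might interact in one crawl loop. All class and method names below are hypothetical stand-ins (the slide only names the components, not their interfaces), and fetch/store are left to the caller:

```python
# Hypothetical HiWE-style crawl loop; component names mirror the diagram,
# but the interfaces are invented for illustration.
from collections import deque

class Parser:
    def parse(self, page):
        """Extract (forms, links) from a fetched page (stubbed)."""
        return [], []

class FormAnalyzer:
    def analyze(self, form):
        """Build the internal form representation F = (elements, S, M)."""
        return form

class FormProcessor:
    def assignments(self, rep):
        """Yield ranked value assignments drawn from the LVS table (stubbed)."""
        return []

    def submit(self, rep, assignment):
        """Fill in the form with the assignment and submit it to the WWW."""
        return ""

class ResponseAnalyzer:
    def has_results(self, response):
        """True if the response page contains search results, not an error."""
        return False

def crawl(seed_urls, fetch, store):
    parser, analyzer = Parser(), FormAnalyzer()
    processor, responder = FormProcessor(), ResponseAnalyzer()
    urls = deque(seed_urls)                    # URL List
    while urls:                                # Crawl Manager loop
        page = fetch(urls.popleft())
        forms, links = parser.parse(page)      # Parser
        urls.extend(links)
        for form in forms:
            rep = analyzer.analyze(form)       # Form Analyzer
            for a in processor.assignments(rep):
                response = processor.submit(rep, a)   # Form Processor
                if responder.has_results(response):   # Response Analyzer
                    store(response)            # keep pages with real results
```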

FORM ANALYSIS
How does a crawler interact with a search form? The crawler builds an "internal form representation":
F = ({E1, E2, …, En}, S, M)
  • {E1, …, En} is the set of n form elements
  • S is submission information, e.g. the submission URL
  • M is meta-information, e.g. the URL of the form page, the web site hosting the form, and links to the form
  • Label(Ei) is the descriptive text describing a field, e.g. "Date"
  • Domain(Ei) is the set of possible values for the field, which can be finite (select box) or infinite (text box)
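A sketch of this representation as Python data structures, using the car-search example from the next slide; the field names are illustrative, not the paper's exact types:

```python
# Sketch of the internal form representation F = ({E1..En}, S, M);
# names are illustrative stand-ins for the paper's data structures.
from dataclasses import dataclass, field

@dataclass
class FormElement:
    label: str                   # Label(E), e.g. "Make"
    domain: list[str] | None     # finite domain (select box) or None for free text

@dataclass
class FormRepresentation:
    elements: list[FormElement]  # {E1, ..., En}
    submission_url: str          # S: submission information
    meta: dict = field(default_factory=dict)  # M: form-page URL, host, links

car_form = FormRepresentation(
    elements=[
        FormElement(label="Make", domain=["Acura", "Lexus"]),
        FormElement(label="Your ZIP", domain=None),  # infinite domain: any string
    ],
    submission_url="http://example.com/search",      # hypothetical URL
    meta={"form_page": "http://example.com/used-cars"},
)
```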

FORM ANALYSIS
Example:
  • Label(E1) = Make; Domain(E1) = {Acura, Lexus, …}
  • Label(E5) = Your ZIP; Domain(E5) = {s | s is a text string}

TASK-SPECIFIC DATABASE
How does a crawler know what keywords to put into the form fields? The crawler has a "task-specific database". For instance, if the task is to search archives pertaining to the automobile industry, the database will contain lists of all car makes and models.
The database has a Label Value Set (LVS) table; each row contains:
  • L – a label, e.g. "Car Make"
  • V = {v1, …, vn} – a graded set of values, e.g. {Toyota, Honda, Mercedes-Benz, …}
  • A membership function MV that assigns a weight to each member of the set V
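A sketch of one LVS row as a label, a graded value set, and a membership function giving each value a weight in [0, 1]; the class and weights are illustrative:

```python
# Illustrative LVS row: label L, graded value set V, membership function M_V.
from dataclasses import dataclass

@dataclass
class LVSEntry:
    label: str                # L, e.g. "Car Make"
    values: dict[str, float]  # V with membership weights M_V(v)

    def weight(self, v: str) -> float:
        """Membership function M_V: weight of v, 0.0 if absent."""
        return self.values.get(v, 0.0)

lvs_table = [
    LVSEntry("Car Make", {"Toyota": 1.0, "Honda": 1.0, "Mercedes-Benz": 0.9}),
    LVSEntry("Year", {"2009": 1.0, "2010": 1.0}),
]
```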

TASK-SPECIFIC DATABASE
The LVS table can be populated through:
  • Explicit initialization by human intervention
  • Built-in entries for commonly used categories, e.g. dates
  • Querying external data sources, e.g. Open Directory Project categories (Regional: North America: United States)
  • The crawler's encounters with forms that have finite-domain fields

TASK-SPECIFIC DATABASE
Computing the weights MV(v):
  • Case 1: Precomputed
  • Case 2: Computed by the respective data-source wrapper
  • Case 3: Computed from crawling experience, following the flow below (sketched in code after this list):
    1. Extract the label of a finite-domain form element E.
    2. If a label was extracted, find it in the LVS table: if an entry (L, V) is found, replace it with (L, V ∪ Domain(E)); otherwise add a new entry to the LVS.
    3. If no label could be extracted, find the entry whose value set most closely resembles Domain(E) and add Domain(E) to that set.
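A hedged sketch of the Case-3 update. It simplifies the LVS to plain (unweighted) sets and uses set overlap as a crude stand-in for the paper's notion of "closely resembles":

```python
# Simplified Case-3 LVS update driven by crawling experience.
def update_lvs(lvs_table, label, domain):
    """Fold a finite-domain form element into the LVS table.

    lvs_table: dict mapping label L -> set of values V
    label:     extracted label, or None if extraction failed
    domain:    Domain(E), the element's finite set of values
    """
    if label is not None:
        if label in lvs_table:
            lvs_table[label] |= set(domain)   # (L, V) -> (L, V U Domain(E))
        else:
            lvs_table[label] = set(domain)    # add a new entry to the LVS
    else:
        # No label extracted: attach Domain(E) to the entry whose value
        # set overlaps it the most.
        best = max(lvs_table,
                   key=lambda l: len(lvs_table[l] & set(domain)),
                   default=None)
        if best is not None:
            lvs_table[best] |= set(domain)
```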

MATCHING FUNCTION
The "matching function" maps values from the database to form fields. Example: form field E1 = Car Make is matched to value v1 = Toyota, and E2 = Car Model is matched to v2 = Prius.
Step 1: Label matching. Normalize the form label and use a string-matching algorithm to compute the minimum edit distance between the form label and all LVS labels.
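A sketch of Step 1, assuming difflib's similarity ratio as a stand-in for the paper's normalized minimum edit distance; the 0.75 cutoff echoes the label matching threshold used in the experiment:

```python
# Label matching sketch: normalize labels, then pick the closest LVS label.
import difflib

def normalize(label: str) -> str:
    return " ".join(label.lower().split())

def match_label(form_label: str, lvs_labels: list[str]) -> str | None:
    """Return the closest (normalized) LVS label, or None below the cutoff."""
    candidates = difflib.get_close_matches(
        normalize(form_label), [normalize(l) for l in lvs_labels],
        n=1, cutoff=0.75)
    return candidates[0] if candidates else None

print(match_label("Car  Make", ["Car Make", "Car Model", "Year"]))  # "car make"
```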

MATCHING FUNCTION
Step 2: Value assignment. Take all possible combinations of value assignments, rank them, and choose the best sets to use for form submission. There are three ranking functions: fuzzy conjunction, average, and probabilistic (computed in the sketch below).
Example: a form with 2 fields, car make and year.
  • Jaguar, 2009, where Mv1(Jaguar) = 0.5 and Mv2(2009) = 1:
    ρfuz = 0.5
    ρavg = ½ (0.5 + 1) = 0.75
    ρprob = 1 − [(1 − 0.5) × (1 − 1)] = 1
  • Toyota, 2010, where Mv1(Toyota) = 1 and Mv2(2010) = 1:
    ρfuz = 1
    ρavg = ½ (1 + 1) = 1
    ρprob = 1 − [(1 − 1) × (1 − 1)] = 1
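The three ranking functions over an assignment's membership weights, reproducing the slide's worked example:

```python
# The three value-assignment ranking functions from the slide.
def rho_fuz(weights):   # fuzzy conjunction: minimum weight
    return min(weights)

def rho_avg(weights):   # average of the weights
    return sum(weights) / len(weights)

def rho_prob(weights):  # probabilistic: 1 - prod(1 - w)
    p = 1.0
    for w in weights:
        p *= (1.0 - w)
    return 1.0 - p

jaguar_2009 = [0.5, 1.0]   # Mv1(Jaguar) = 0.5, Mv2(2009) = 1
print(rho_fuz(jaguar_2009), rho_avg(jaguar_2009), rho_prob(jaguar_2009))
# -> 0.5 0.75 1.0
```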

LAYOUT-BASED INFORMATION EXTRACTION (LITE)
Label extraction method:
  • Prune the form page's layout using a custom layout engine
  • Identify the pieces of text (candidates) physically closest to the form element
  • Rank candidates based on position, font size, etc.
  • Choose the highest-ranked candidate as the label
Results:
Method              Accuracy
LITE                93%
Textual Analysis    72%
Common Form Layout  83%
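An illustrative sketch of the proximity part of LITE's ranking: score text candidates by pixel distance to the form element, closer being better. The real ranking also weighs position and font size; coordinates here are hypothetical:

```python
# Proximity-based candidate ranking, a simplified slice of LITE.
import math

def rank_candidates(element_xy, candidates):
    """candidates: list of (text, (x, y)) pairs with layout coordinates.
    Returns candidates sorted best-first by distance to the element."""
    ex, ey = element_xy
    return sorted(candidates,
                  key=lambda c: math.hypot(c[1][0] - ex, c[1][1] - ey))

label, _ = rank_candidates(
    (120, 40),
    [("Make", (60, 40)), ("Search our inventory", (120, 300))],
)[0]
print(label)  # "Make" -- the physically closest text wins
```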

RESPONSE ANALYSIS
How does the crawler determine whether a response page contains results or an error message?
  • Identify the significant portion of the response page by removing the header, footer, etc., keeping the content in the middle of the page
  • See if the content matches predefined error messages, e.g. "No results," "No matches"
  • Store a hash of the significant portion and assume that if the same hash occurs very often, it is the hash of an error page
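A sketch of the two heuristics, assuming the significant portion has already been extracted; the phrase list and repeat threshold are placeholders:

```python
# Error-page detection sketch: phrase matching plus repeated-hash heuristic.
import hashlib
from collections import Counter

ERROR_PHRASES = ("no results", "no matches")   # illustrative list
seen_hashes = Counter()

def looks_like_error(significant_text: str, repeat_threshold: int = 5) -> bool:
    text = significant_text.lower()
    if any(phrase in text for phrase in ERROR_PHRASES):
        return True
    digest = hashlib.md5(text.encode()).hexdigest()
    seen_hashes[digest] += 1
    # The same "results" for many different queries suggests an error page.
    return seen_hashes[digest] >= repeat_threshold
```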

METRICS
How do we measure the efficiency of the hidden web crawler? Define submission efficiency SE in terms of:
  • Ntotal = total number of forms submitted
  • Nsuccess = total number of submissions that resulted in a response page containing search results
  • Nvalid = number of semantically correct submissions (e.g. inputting "Orange" into a form element labeled "Vegetable" is semantically incorrect)
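The slide omits the formulas themselves (they were likely an image). A sketch under the assumption, consistent with the definitions above, that strict submission efficiency is the fraction of submissions returning results, while a lenient variant credits any semantically correct submission:

```python
# Submission efficiency, assuming SE_strict = Nsuccess/Ntotal and
# SE_lenient = Nvalid/Ntotal; treat these as a reading of the slide,
# not a quotation of the paper.
def se_strict(n_success: int, n_total: int) -> float:
    return n_success / n_total

def se_lenient(n_valid: int, n_total: int) -> float:
    return n_valid / n_total

print(f"{se_strict(85, 100):.1%}")  # hypothetical: 85 of 100 submissions -> 85.0%
```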

EXPERIMENT
Task: a market analyst interested in building an archive of information about the semiconductor industry in the past 10 years. The LVS table was populated from online sources such as the Semiconductor Research Corporation and Lycos Companies Online.

Parameter                                   Value
Number of sites visited                     50
Number of forms encountered                 218
Number of forms chosen for submission       94
Label matching threshold                    0.75
Minimum form size                           3
Value assignment ranking function           ρfuz
Minimum acceptable value assignment rank    0.6

EXPERIMENTAL RESULTS – RANKING FUNCTION
The crawler was executed 3 times, each with a different ranking function.
  • ρfuz and ρavg both achieved submission efficiency above 80% (88.8% and 83.1% respectively, per the slide's chart)
  • ρfuz does better, but fewer forms are submitted compared to ρavg

EXPERIMENTAL RESULTS – MINIMUM FORM SIZE
Effect of minimum form size: the crawler performs better on larger forms, with submission efficiency rising from 78.9% to 88.77% and 88.96% as the minimum form size increases (values from the slide's chart).

CONTRIBUTIONS
  • Introduces HiWE, one of the first publicly available techniques for crawling the hidden web
  • Introduces LITE, a technique that extracts form data by incorporating the physical layout of the HTML page; techniques prior to this were based on pattern recognition over the underlying HTML

PROS
  • Defines a clear performance metric with which to analyze the crawler's efficiency
  • Points out known limitations of the technique, from which future work can be done
  • Directs readers to a technical report that provides a more detailed explanation of the HiWE implementation

CONS
  • Not an automatic approach; requires human intervention
  • Task-specific: requires creation of an LVS table per task
  • The technique has many limitations:
    • Can only retrieve search results from HTML-based forms
    • Cannot support forms driven by JavaScript events, e.g. onclick, onselect
    • No mention of whether forms submitted through HTTP POST were stored/indexed

RELATED WORK
  • USC ISI, extracting data from the Web (1999–2001) [7, 8]: describe the relevant information on a web page with a formal grammar and automatically adapt to web page changes
  • Research at UCLA (2005) [4]: an adaptive approach that automatically generates queries by examining the results of previous queries
  • Google's Deep-Web Crawler (2008) [1]: selects only a small number of input combinations that provide good coverage of the content in the underlying database and adds the resulting HTML pages to the search engine index
  • DeepPeep [6]: tracks 45,000 forms across 7 domains and allows users to search for these forms

Q&A

REFERENCES
[1] J. Madhavan, D. Ko, Ł. Kot, V. Ganapathy, A. Rasmussen, & A. Halevy, "Google's Deep-Web Crawl," Proceedings of the VLDB Endowment, 2008. Available: http://www.cs.cornell.edu/~lucja/Publications/I03.pdf. [Accessed June 13, 2010]
[2] C. Mattmann, "Characterizing the Web." Available: http://sunset.usc.edu/classes/cs572_2010/Characterizing_the_Web.ppt. [Accessed May 19, 2010]
[3] C. Mattmann, "Query Models." Available: http://sunset.usc.edu/classes/cs572_2010/Query_Models.ppt. [Accessed June 10, 2010]
[4] A. Ntoulas, P. Zerfos, & J. Cho, "Downloading Textual Hidden Web Content Through Keyword Queries," Proceedings of the Joint Conference on Digital Libraries, June 2005. Available: http://oak.cs.ucla.edu/~cho/papers/ntoulas-hidden.pdf. [Accessed June 13, 2010]
[5] A. Wright, "Exploring a 'Deep Web' That Google Can't Grasp," The New York Times, February 22, 2009. Available: http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&th&emc=th. [Accessed June 1, 2010]
[6] DeepPeep beta. Available: http://www.deeppeep.org/index.jsp
[7] C. A. Knoblock, K. Lerman, S. Minton, & I. Muslea, "Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach," IEEE Data Engineering Bulletin, 1999. Available: http://www.isi.edu/~muslea/PS/deb-2k.pdf. [Accessed June 28, 2010]
[8] C. A. Knoblock, S. Minton, & I. Muslea, "Hierarchical Wrapper Induction for Semistructured Information Sources," Journal of Autonomous Agents and Multi-Agent Systems, 2001. Available: http://www.isi.edu/~muslea/PS/jaamas-2k.pdf. [Accessed June 28, 2010]