Extracting Data Behind Web Forms
Stephen W. Liddle, David W. Embley, Del T. Scott, Sai Ho Yau
Brigham Young University
Presented by: Helen Chen

Introduction
• Web forms are designed in various ways, using radio buttons, checkboxes, selection lists, text boxes, hidden controls, and even author-defined objects
• According to BrightPlanet.com, the “deep/hidden web” is 500 times larger than the “shallow web”
• Automated form filling is desirable but challenging

Example
From www.autointerface.com

Objective
• Automatically fill out web forms, retrieve all the data behind the forms, and eliminate duplicates
• Domain-independent
• Unbounded domains are excluded from this study

Issues Considered
• Data can only be obtained piecemeal, using multiple queries
• Result pages contain error messages
  – HTTP 404 error pages
  – Messages embedded within a series of tables, frames, or other types of HTML divisions
• Duplicate data are retrieved

Procedure
• Automate form filling
• Process response pages
• Recognize duplicate data
Loop!

Automate form filling
• Parse the HTML page into a parse tree and store the portion of interest
  – Only the portion between the <form> and </form> tags is of interest
• Store particular information
  – Source URL of the page, the action URL to which the form will be submitted, the number of fields, and details for each field (names, types, default values)
• Assign proper values to each field
• Submit the form for CGI processing
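The parsing step can be sketched with Python's standard `html.parser` module. This is only an illustration of keeping the part of a page between the form tags and recording each field's name, type, and default value; the class name `FormExtractor` and the dictionary layout are assumptions made here, not the authors' implementation, and only `<input>` and `<select>` controls are handled.

```python
from html.parser import HTMLParser


class FormExtractor(HTMLParser):
    """Collects the action, method, and fields of each <form> in a page.

    Hypothetical helper: assumes <option> elements carry a value attribute.
    """

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form> ... </form> block
        self._current = None   # form being parsed, or None outside a form
        self._select = None    # <select> field currently collecting options

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = {
                "action": a.get("action", ""),
                "method": a.get("method", "get").lower(),
                "fields": [],
            }
        elif self._current is None:
            return  # ignore everything outside <form> tags
        elif tag == "input":
            self._current["fields"].append({
                "name": a.get("name"),
                "type": a.get("type", "text").lower(),
                "default": a.get("value"),
            })
        elif tag == "select":
            self._select = {"name": a.get("name"), "type": "select",
                            "default": None, "options": []}
            self._current["fields"].append(self._select)
        elif tag == "option" and self._select is not None:
            value = a.get("value")
            self._select["options"].append(value)
            # Default is the first option unless a later one is marked selected.
            if "selected" in a or len(self._select["options"]) == 1:
                self._select["default"] = value

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None
```

After `extractor = FormExtractor()` and `extractor.feed(page_html)`, the list `extractor.forms` holds one entry per form, and the number of fields is `len(form["fields"])`.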

Form Submission
• Method
  – Using the HTTP GET verb
  – Using the HTTP POST verb
• Plan
  – Issue default query
  – Sampling phase
  – Exhaustive phase
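The difference between the two submission methods can be illustrated with the standard `urllib` modules: a GET query carries the field values in the URL, while a POST query carries them in the request body. The function name `build_request` and the example URLs in the usage note are hypothetical, and the sketch only builds the request object without sending it.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_request(source_url, action, method, values):
    """Build (but do not send) the HTTP request for one filled-in form.

    source_url: URL of the page that contained the form
    action:     the form's action attribute, possibly relative
    method:     "get" or "post"
    values:     dict mapping field names to the chosen values
    """
    target = urljoin(source_url, action)   # resolve a relative action URL
    encoded = urlencode(values)            # name=value pairs, percent-encoded
    if method.lower() == "post":
        # POST: the encoded fields travel in the request body.
        return Request(target, data=encoded.encode("ascii"), method="POST")
    # GET: the encoded fields are appended to the URL as a query string.
    separator = "&" if "?" in target else "?"
    return Request(target + separator + encoded, method="GET")
```

For example, `build_request("http://example.com/cars/index.html", "search.cgi", "get", {"make": "ford"})` yields a request whose `full_url` is `http://example.com/cars/search.cgi?make=ford`; passing the result to `urllib.request.urlopen` would actually submit it.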

Form Submission (cont.)
• The user can specify several thresholds
  – Percentage of data retrieved
  – Number of queries issued
  – Number of bytes retrieved
  – Amount of time spent
  – Number of consecutive queries with no new data returned
• Exit the form-filling process when one of the thresholds above is reached or the data are exhausted
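One way to represent these user-specified thresholds is as a small record of optional limits that is checked after each query. The sketch below is an assumption about how such a check could look; the names `Thresholds`, `Progress`, and `first_threshold_reached` are invented for illustration, and a limit left as `None` is treated as "not set".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """User-specified limits; None means the limit is not set."""
    max_fraction_retrieved: Optional[float] = None  # e.g. 0.8 for 80%
    max_queries: Optional[int] = None
    max_bytes: Optional[int] = None
    max_seconds: Optional[float] = None
    max_stale_queries: Optional[int] = None  # consecutive queries, no new data


@dataclass
class Progress:
    """Running totals updated after every query."""
    fraction_retrieved: float = 0.0  # retrieved bytes / estimated data size
    queries: int = 0
    bytes_retrieved: int = 0
    seconds: float = 0.0
    stale_queries: int = 0


def first_threshold_reached(limits, progress):
    """Return the name of the first threshold that has been reached, or None."""
    checks = [
        ("fraction_retrieved", limits.max_fraction_retrieved, progress.fraction_retrieved),
        ("queries", limits.max_queries, progress.queries),
        ("bytes_retrieved", limits.max_bytes, progress.bytes_retrieved),
        ("seconds", limits.max_seconds, progress.seconds),
        ("stale_queries", limits.max_stale_queries, progress.stale_queries),
    ]
    for name, limit, value in checks:
        if limit is not None and value >= limit:
            return name
    return None
```

The form-filling loop would stop as soon as this function returns a name, or when no query combinations remain.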

Form Submission (cont.)
• Estimate the database size
  [Equation: estimated data size $D_i$]
  where
  – $D_i$ is the estimated data size
  – $O_i$ is the number of unique bytes observed after the $i$th query
  – $N$ is the total number of queries
  – $p_i$ is the estimated probability of finding new data in query $i+1$, with
    $$p_i = \frac{\text{number of queries that returned new data}}{i}$$
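The cumulative probability estimate is the only part of this slide whose formula is stated in full, so the sketch below covers just that: after each query it divides the number of queries so far that returned new data by the number of queries issued. The function name and the boolean-list input are assumptions made for illustration.

```python
def new_data_probabilities(returned_new_data):
    """Cumulative estimate p_i after each query.

    returned_new_data: one boolean per query, True if that query
    returned data not seen before.
    Returns [p_1, p_2, ...] with p_i = (queries so far with new data) / i.
    """
    probabilities = []
    hits = 0
    for i, was_new in enumerate(returned_new_data, start=1):
        if was_new:
            hits += 1
        probabilities.append(hits / i)
    return probabilities
```

For the history `[True, True, False, True]`, three of the four queries returned new data, so the last estimate is 3/4.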

Form Submission (cont.)
• Estimate data size with windowed probability
  [Equation: windowed estimate of the data size]
  where $s_i$ is a measure of the standard deviation of $p_i$ over the previous 2 query cycles
• Comment: the windowed probability estimate is NOT as good as the cumulative estimate in practice

Form Submission (cont.)
• Estimate the maximum possible space needed
  [Equation: maximum space estimate]
  where
  – $b_i$ is the size in bytes of the $i$th sample query
  – $N$ is the total number of queries
  – $n$ is the number of sample queries, $n = C$
• Estimate the remaining time required
  [Equation: remaining time estimate]
  where $t_i$ is the total duration of the $i$th sample query

Sampling Phase of the Submission
• Determine the size of a sampling batch (the number of queries to issue at one time)
  $$N = \prod_i |f_i|$$
  where $N$ is the total number of possible combinations and $|f_i|$ is the number of choices for the $i$th factor
  $$c = \max_i |f_i|$$
  where $c$ is the cardinality of the largest factor
  $$C = \max(c, \lceil \log_2 N \rceil)$$
  where $C$ is the size of a sampling batch

Sampling Phase of the Submission
[Figure: example form with a "sort-by" factor; sampled query combinations are marked with x]
N = 4 × 7 = 28
log2 N ≈ 4.8
c = 7
C = max(7, 5) = 7
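The worked example above (N = 28, log2 N ≈ 4.8, c = 7, C = 7) is consistent with taking the batch size C as the larger of c and log2 N rounded up. Assuming that rule, the calculation can be sketched as follows; the function name is hypothetical, and the rounding direction is inferred from the example rather than stated on the slide.

```python
import math


def sampling_batch_size(factor_sizes):
    """Return (N, c, C) for a form whose factors have the given numbers of choices.

    N: total number of possible query combinations (product of factor sizes)
    c: cardinality of the largest factor
    C: sampling batch size, assumed here to be max(c, ceil(log2 N))
    """
    n_combinations = math.prod(factor_sizes)
    largest = max(factor_sizes)
    batch = max(largest, math.ceil(math.log2(n_combinations)))
    return n_combinations, largest, batch
```

With factor sizes 4 and 7, `sampling_batch_size([4, 7])` reproduces the slide's numbers: N = 28, c = 7, C = 7. For a form with many small factors, such as ten two-choice factors, the logarithm dominates instead and C = 10.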

Exhaustive Phase of the Submission
• Let the user specify various thresholds for the completeness of the retrieved data
• Estimate the maximum possible space needed, the maximum remaining time needed, and the data size
• Process additional batches of C query samples until one of the thresholds is reached or all possible combinations are exhausted
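The batch-by-batch loop can be sketched as below. This is a simplified illustration, not the authors' procedure: it enumerates field-value combinations in plain lexicographic order, whereas the slides do not say how the queries in each batch are chosen. The names `run_batches`, `submit`, and `should_stop` are hypothetical; `submit` stands for issuing one query and `should_stop` for the user's threshold check.

```python
import itertools


def run_batches(factors, batch_size, submit, should_stop):
    """Issue queries in batches until a threshold is hit or combinations run out.

    factors:     dict mapping each field name to its list of possible values
    batch_size:  number of queries per batch (C on the slides)
    submit:      callable taking one query, a dict of field name -> value
    should_stop: callable returning True once a user threshold is reached
    Returns (number of queries issued, reason), reason being
    "threshold" or "exhausted".
    """
    names = list(factors)
    # Lazily enumerate every combination of field values (assumed ordering).
    combos = itertools.product(*(factors[name] for name in names))
    issued = 0
    while True:
        batch = list(itertools.islice(combos, batch_size))
        if not batch:
            return issued, "exhausted"
        for values in batch:
            submit(dict(zip(names, values)))
            issued += 1
        # Thresholds are checked once per batch, after its queries complete.
        if should_stop():
            return issued, "threshold"
```

Checking the thresholds only between batches keeps each batch intact, at the cost of possibly overshooting a limit by up to one batch.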

Exhaustive Phase (Improvement)
[Figure: example with two factors, |FA| = 60 and |FB| = 12]

Process response pages
• No-record notification: continue
• Required field missing: require user intervention
• Unexpected failure: timeout
• Default query retrieves all data in one page
• Default query retrieves all data, shown on more than one page: concatenate all records
• Default query does not retrieve all data: sampling and exhaustive phases

Recognize duplicated data
• Modify the copy detection system (CDS) by inserting an <s.> tag after </tr>, <p>, <hr>, </table>, <blockquote>, …
• Strip all HTML tags
• CDS computes hash values for every record separated by <s.>
• CDS compares the new hash values with the hash values of all records retrieved previously
• Remove duplicates and store new data in the repository
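The duplicate-detection steps can be approximated in a few lines. The sketch below is not the copy detection system itself: it uses regular expressions and SHA-1 from `hashlib` as stand-ins, marks record boundaries only after the five tags named on the slide (written exactly as shown, without attributes), and uses hypothetical function names.

```python
import hashlib
import re

# Record boundaries: the tags listed on the slide, matched literally.
RECORD_BREAK = re.compile(r"</tr>|<p>|<hr>|</table>|<blockquote>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def split_records(html):
    """Insert an <s.> marker after each boundary tag, then return the
    tag-free, whitespace-normalized text of every non-empty record."""
    marked = RECORD_BREAK.sub(lambda m: m.group(0) + "<s.>", html)
    records = []
    for chunk in marked.split("<s.>"):
        text = " ".join(ANY_TAG.sub(" ", chunk).split())  # strip HTML tags
        if text:
            records.append(text)
    return records


def add_new_records(html, seen_hashes, repository):
    """Hash each record in a response page and store only unseen ones.

    seen_hashes: set of hashes of all records retrieved previously
    repository:  list receiving the new records
    Returns the number of new records added.
    """
    added = 0
    for record in split_records(html):
        digest = hashlib.sha1(record.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            repository.append(record)
            added += 1
    return added
```

Calling `add_new_records` on each response page with the same `seen_hashes` set and `repository` list accumulates only records that have not been retrieved before; a return value of 0 means the query produced no new data.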

Experimental Results
• Among 13 different Web sites visited, 5 returned all the data with a single query
• The sampling phase takes from a few dozen seconds to several hours
• Storage requirements are modest
• To retrieve 80% of the data, a relatively sparse data pattern requires submitting about 40% of the queries, and a fairly dense data pattern requires about 75% of the queries

Conclusions
• Domain-independent approach for automatically retrieving the data behind a given web form
• Two-phase approach to gathering data
• Analyzing the productivity of the various factors, so as to emphasize those that yield more data earlier in the search process, improves the performance of the system

Future Work
• Automatically retrieve, extract, and integrate just the relevant data from different web sites, using domain-specific ontologies, with respect to user queries
• Unbounded domains, such as text boxes, are to be brought within the scope of the work

Related Works
• Microsoft Passport and Wallet system for e-commerce transactions
• ShopBot for domain-specific comparison shopping
• Commercial ventures that index the hidden web: BrightPlanet.com, InvisibleWeb.com
• HiWE: domain-specific, human-assisted web crawler