- Slide count: 156
Information Extraction from the World Wide Web William W. Cohen Carnegie Mellon University Andrew McCallum University of Massachusetts Amherst KDD 2003
Example: The Problem Martin Baker, a person Genomics job Employers job posting form
Example: A Solution
Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest ContactPhone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.htm OtherCompanyJobs: foodscience.com-Job1
Category = Food Services Keyword = Baker Location = Continental U.S. Job Openings:
Data Mining the Extracted Job Information
IE from Research Papers
IE from Chinese Documents regarding Weather Chinese Academy of Sciences 200k+ documents several millennia old - Qing Dynasty Archives - memos - newspaper articles - diaries
IE from SEC Filings This filing covers the period from December 1996 to September 1997.
ENRON GLOBAL POWER & PIPELINES L.L.C. CONSOLIDATED BALANCE SHEETS (IN THOUSANDS, EXCEPT SHARE AMOUNTS)
                                            SEPTEMBER 30, 1997   DECEMBER 31, 1996
                                                (UNAUDITED)
ASSETS
Current Assets
  Cash and cash equivalents                       $ 54,262            $ 24,582
  Accounts receivable                                8,473               6,301
  Current portion of notes receivable                1,470               1,394
  Other current assets                                 336                 404
  Total Current Assets                              71,730              32,681
Investments in Unconsolidated Subsidiaries         286,340             298,530
Notes Receivable                                    16,059              12,111
Total Assets                                      $374,408            $343,843
LIABILITIES AND SHAREHOLDERS' EQUITY
Current Liabilities
  Accounts payable                                $ 13,461            $ 11,277
  Accrued taxes                                      1,910               1,488
  Total Current Liabilities                         15,371              49,348
Deferred Income Taxes                                  525               4,301
The U.S. energy markets in 1997 were subject to significant fluctuation. Data mine these reports for suspicious behavior, and to better understand what is normal.
What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… (Slots to fill: NAME, TITLE, ORGANIZATION)
What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… IE fills the slots: NAME / TITLE / ORGANIZATION: Bill Gates, CEO, Microsoft; Bill Veghte, VP, Microsoft; Richard Stallman, founder, Free Soft…
What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + clustering + association October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Segmented entities: Microsoft Corporation, CEO, Bill Gates, Microsoft, Bill Veghte, Microsoft VP, Richard Stallman, founder, Free Software Foundation
What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Segmented entities: Microsoft Corporation, CEO, Bill Gates, Microsoft, Bill Veghte, Microsoft VP, Richard Stallman, founder, Free Software Foundation
What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Clustered entities: * Microsoft Corporation, CEO, Bill Gates * Microsoft, Bill Veghte, Microsoft VP * Richard Stallman, founder, Free Software Foundation Resulting table: NAME / TITLE / ORGANIZATION: Bill Gates, CEO, Microsoft; Bill Veghte, VP, Microsoft; Richard Stallman, founder, Free Soft…
IE in Context Create ontology Spider Filter by relevance IE Segment Classify Associate Cluster Load DB Document collection Train extraction models Label training data Database Query, Search Data mine
Why IE from the Web? • Science – Grand old dream of AI: Build large KB* and reason with it. IE from the Web enables the creation of this KB. – IE from the Web is a complex problem that inspires new advances in machine learning. • Profit – Many companies interested in leveraging data currently “locked in unstructured text on the Web”. – Not yet a monopolistic winner in this space. • Fun! – Build tools that we researchers like to use ourselves: Cora & CiteSeer, MRQE.com, FAQFinder, … – See our work get used by the general public. * KB = “Knowledge Base”
Tutorial Outline • IE History • Landscape of problems and solutions • Parade of models for segmenting/classifying: – Sliding window – Boundary finding – Finite state machines – Trees (15 min break) • Overview of related problems and solutions – Association, Clustering – Integration with Data Mining • Where to go from here
IE History Pre-Web • Mostly news articles – DeJong’s FRUMP [1982] • Hand-built system to fill Schank-style “scripts” from news wire – Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96] • Most early work dominated by hand-built models – E.g. SRI’s FASTUS, hand-built FSMs. – But by the 1990’s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98] Web • AAAI ’94 Spring Symposium on “Software Agents” – Much discussion of ML applied to Web. Maes, Mitchell, Etzioni. • Tom Mitchell’s WebKB, ’96 – Build KB’s from the Web. • Wrapper Induction – Initially hand-built, then ML: [Soderland ’96], [Kushmerick ’97], …
What makes IE from the Web Different? Less grammar, but more formatting & linking Newswire Web www.apple.com/retail Apple to Open Its First Retail Store in New York City MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. "Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles." The directory structure, link structure, formatting & layout of the Web is its own new grammar. www.apple.com/retail/soho/theatre.html
Landscape of IE Tasks (1/4): Pattern Feature Domain Text paragraphs without formatting Grammatical sentences and some formatting & links Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR. Non-grammatical snippets, rich formatting & links Tables
Landscape of IE Tasks (2/4): Pattern Scope Web site specific Formatting Amazon.com Book Pages Genre specific Layout Resumes Wide, non-specific Language University Names
Landscape of IE Tasks (3/4): Pattern Complexity E.g. word patterns: Closed set U.S. states He was born in Alabama… The big Wyoming sky… Regular set U.S. phone numbers Phone: (413) 545-1323 The CALD main office can be reached at 412-268-1299 Complex pattern U.S. postal addresses University of Arkansas P.O. Box 140 Hope, AR 71802 Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210 Ambiguous patterns, needing context and many sources of evidence Person names …was among the six houses sold by Hope Feldman that year. Pawel Opalinski, Software Engineer at WhizBang Labs.
Landscape of IE Tasks (4/4): Pattern Combinations Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt. “Named entity” extraction / Single entity: Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut. Binary relationship: Relation: Person-Title, Person: Jack Welch, Title: CEO; Relation: Company-Location, Company: General Electric, Location: Connecticut. N-ary record: Relation: Succession, Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt
Evaluation of Single Entity Extraction TRUTH: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. PRED: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. Precision = # correctly predicted segments / # predicted segments = 2/6. Recall = # correctly predicted segments / # true segments = 2/4. F1 = harmonic mean of Precision & Recall = 1 / (((1/P) + (1/R)) / 2)
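The segment-level metrics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the tutorial; the example spans are invented so that the counts match the slide (2 of 6 predictions correct, 4 true segments):

```python
def evaluate(pred_segments, true_segments):
    """Segment-level precision, recall, and F1: a predicted segment is
    correct only if its (start, end) span exactly matches a true one."""
    pred, true = set(pred_segments), set(true_segments)
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Made-up spans matching the slide's counts: 2 of 6 predictions are
# exactly right, and there are 4 true segments.
p, r, f1 = evaluate(
    pred_segments=[(0, 2), (3, 5), (8, 9), (10, 12), (13, 14), (15, 16)],
    true_segments=[(0, 2), (3, 5), (20, 22), (23, 25)],
)
```

Note that exact-match scoring is strict: a prediction that overlaps a true segment but misses one boundary token counts as both a false positive and a false negative.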
State of the Art Performance • Named entity recognition – Person, Location, Organization, … – F1 in high 80’s or low- to mid-90’s • Binary relation extraction – Contained-in (Location1, Location2), Member-of (Person1, Organization1) – F1 in 60’s or 70’s or 80’s • Wrapper induction – Extremely accurate performance obtainable – Human effort (~30 min) required on each site
Landscape of IE Techniques (1/1): Models. Each model illustrated on the sentence: Abraham Lincoln was born in Kentucky. Lexicons: member? (Alabama, Alaska, …, Wisconsin, Wyoming). Classify Pre-segmented Candidates: classifier, which class? Sliding Window: classifier, which class? Try alternate window sizes. Boundary Models: classifiers for BEGIN and END. Finite State Machines: most likely state sequence? Context Free Grammars: most likely parse? (S, NP, VP, PP, V, NNP, …) Any of these models can be used to capture words, formatting or both. …and beyond
Landscape: Focus of this Tutorial Pattern complexity: closed set, regular, complex, ambiguous. Pattern feature domain: words, words + formatting, formatting. Pattern scope: site-specific, genre-specific, general. Pattern combinations: entity, binary, n-ary. Models: lexicon, regex, window, boundary, FSM, CFG
Sliding Windows
Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement
A “Naïve Bayes” Sliding Window Model [Freitag 1997] …:00 pm Place: Wean Hall Rm 5409 Speaker: Sebastian Thrun … prefix (w_{t-m} … w_{t-1}) contents (w_t … w_{t+n}) suffix (w_{t+n+1} … w_{t+n+m}) Estimate Pr(LOCATION|window) using Bayes rule. Try all “reasonable” windows (vary length, position). Assume independence for length, prefix words, suffix words, content words. Estimate from data quantities like: Pr(“Place” in prefix|LOCATION). If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it. Other examples of sliding window: [Baluja et al 2000] (decision tree over individual words & their context)
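The scoring step above can be sketched as follows. All the probability tables here are invented for illustration (the tutorial gives none), and in practice each window class would get its own tables plus a length-and-position enumeration over the document:

```python
import math

# Hypothetical, made-up tables for the LOCATION class.
LOCATION_TABLES = {
    "prior": 0.1,
    "length": {2: 0.2, 4: 0.3},                # P(content length | LOCATION)
    "prefix": {"Place": 0.4, ":": 0.3},        # P(word in prefix | LOCATION)
    "content": {"Wean": 0.2, "Hall": 0.3, "Rm": 0.2, "5409": 0.05},
    "suffix": {"Speaker": 0.3},
}

def nb_window_logscore(prefix, contents, suffix, tables, smooth=1e-6):
    """Log of prior * P(length) * prod P(prefix w) * prod P(content w)
    * prod P(suffix w), i.e. the naive independence assumptions above."""
    score = math.log(tables["prior"])
    score += math.log(tables["length"].get(len(contents), smooth))
    for table, words in (("prefix", prefix), ("content", contents),
                         ("suffix", suffix)):
        for w in words:
            score += math.log(tables[table].get(w, smooth))
    return score

good = nb_window_logscore(["Place", ":"], ["Wean", "Hall", "Rm", "5409"],
                          ["Speaker"], LOCATION_TABLES)
bad = nb_window_logscore(["the"], ["obscurity", "in"], ["a"],
                         LOCATION_TABLES)
```

A window whose log-score exceeds a tuned threshold would be extracted; working in log space avoids underflow when windows get long.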
“Naïve Bayes” Sliding Window Results Domain: CMU UseNet Seminar Announcements GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. Field / F1: Person Name 30%, Location 61%, Start Time 98%
SRV: a realistic sliding-window-classifier IE system [Freitag AAAI ’98] • What windows to consider? – all windows containing as many tokens as the shortest example, but no more tokens than the longest example • How to represent a classifier? It might: – Restrict the length of window; – Restrict the vocabulary or formatting used before/after/inside window; – Restrict the relative order of tokens; – Etc… <title>Course Information for CS 213</title> <h1>CS 213 C++ Programming</h1>
SRV: a rule-learner for sliding-window classification Rule learning: greedily add conditions to rules, rules to rule set Search metric: SRV algorithm greedily adds conditions to maximize “information gain” To prevent overfitting: rules are built on 2/3 of the data, then their false positive rate is estimated on the 1/3 holdout set. Candidate conditions: … <title>Course Information for CS 213</title> <h1>CS 213 C++ Programming</h1> courseNumber(X) :- tokenLength(X, =, 2), every(X, inTitle, false), some(X, A, <previousToken>, inTitle, true), some(X, B, <>, tripleton, true) “Two tokens, one a 3-char token, starting just after the title”
SRV: a rule-learner for sliding-window classification • Primitive predicates used by SRV: – token(X, W), allLowerCase(W), numerical(W), … – nextToken(W, U), previousToken(W, V) • HTML-specific predicates: – inTitleTag(W), inH1Tag(W), inEmTag(W), … – emphasized(W) = “inEmTag(W) or inBTag(W) or …” – tableNextCol(W, U) = “U is some token in the column after the column W is in” – tablePreviousCol(W, V), tableRowHeader(W, T), …
SRV: a rule-learner for sliding-window classification • Non-primitive “conditions” used by SRV: – every(+X, f, c) = for all W in X: f(W) = c – some(+X, W, <f1, …, fk>, g, c) = exists W: g(fk(…(f1(W))…)) = c – tokenLength(+X, relop, c) – position(+W, direction, relop, c) • e.g., tokenLength(X, >, 4), position(W, fromEnd, <, 2) courseNumber(X) :- tokenLength(X, =, 2), every(X, inTitle, false), some(X, A, <previousToken>, inTitle, true), some(X, B, <>, tripleton, true) Non-primitive conditions make greedy search easier
Rapier: an alternative approach A bottom-up rule learner: [Califf & Mooney, AAAI ’99] initialize RULES to be one rule per example; repeat { randomly pick N pairs of rules (Ri, Rj); let {G1, …, GN} be the consistent pairwise generalizations; let G* = the Gi that optimizes “compression”; let RULES = RULES + {G*} – {R’: covers(G*, R’)} } where compression(G, RULES) = size of RULES – {R’: covers(G, R’)}, and “covers(G, R)” means every example matching G matches R
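One compression step of this loop can be sketched by modeling each rule as a set of conditions: a pairwise generalization keeps only the shared conditions, and a rule with a subset of another's conditions matches everything the other matches. This is a toy illustration under that simplification, not Rapier's actual pattern language, and the function names are mine:

```python
def generalize(r1, r2):
    """Pairwise generalization: keep only the conditions both rules share."""
    return r1 & r2

def covers(g, r):
    """g covers r when g's conditions are a subset of r's, i.e. g
    matches every example r matches (set-of-conditions simplification)."""
    return g <= r

def compress_step(rules):
    """Try all pairwise generalizations; apply the one whose coverage
    removes the most rules (the 'compression' criterion)."""
    best = None  # (rules_removed, generalization, covered_rules)
    for i, r1 in enumerate(rules):
        for r2 in rules[i + 1:]:
            g = generalize(r1, r2)
            if not g:  # skip the trivial empty generalization
                continue
            covered = [r for r in rules if covers(g, r)]
            gain = len(covered) - 1  # rules removed, net of the one added
            if best is None or gain > best[0]:
                best = (gain, g, covered)
    if best is None:
        return rules
    _, g, covered = best
    return [r for r in rules if r not in covered] + [g]

rules = [frozenset({"a", "b", "c"}),   # seed rule from example 1
         frozenset({"a", "b", "d"}),   # seed rule from example 2
         frozenset({"x", "y"})]        # unrelated seed rule
new_rules = compress_step(rules)
```

Here the first two seed rules collapse into their shared generalization {a, b}, while the unrelated rule survives untouched; Rapier iterates such steps until compression stops improving.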
<title>Course Information for CS 213</title> <h1>CS 213 C++ Programming</h1> … courseNum(window1) :- token(window1, ’CS’), doubleton(‘CS’), prevToken(‘CS’, ’CS 213’), inTitle(‘CS 213’), nextTok(‘CS’, ’213’), numeric(‘213’), tripleton(‘213’), nextTok(‘213’, ’C++’), tripleton(‘C++’), … <title>Syllabus and meeting times for Eng 214</title> <h1>Eng 214 Software Engineering for Non-programmers</h1> … courseNum(window2) :- token(window2, ’Eng’), tripleton(‘Eng’), prevToken(‘Eng’, ’214’), inTitle(‘214’), nextTok(‘Eng’, ’214’), numeric(‘214’), tripleton(‘214’), nextTok(‘214’, ’Software’), … Differences dropped; common conditions carried over to the generalization: courseNum(X) :- token(X, A), prevToken(A, B), inTitle(B), nextTok(A, C), numeric(C), tripleton(C), nextTok(C, D), …
Rapier: an alternative approach - Combines top-down and bottom-up learning - Bottom-up to find common restrictions on content - Top-down greedy addition of restrictions on context - Use of part-of-speech and semantic features (from WordNet). - Special “pattern language” based on sequences of tokens, each of which satisfies one of a set of given constraints - <<tok ∈ {‘ate’, ’hit’}, POS ∈ {‘vb’}>, <tok ∈ {‘the’}>, <POS ∈ {‘nn’}>>
Rapier: results – precision/recall
Rule-learning approaches to sliding-window classification: Summary • SRV, Rapier, and WHISK [Soderland KDD ’97] – Representations for classifiers allow restriction of the relationships between tokens, etc. – Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog) – Use of these “heavyweight” representations is complicated, but seems to pay off in results • Can simpler representations for classifiers work?
BWI: Learning to detect boundaries [Freitag & Kushmerick, AAAI 2000] • Another formulation: learn three probabilistic classifiers: – START(i) = Prob(position i starts a field) – END(j) = Prob(position j ends a field) – LEN(k) = Prob(an extracted field has length k) • Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j-i) • LEN(k) is estimated from a histogram
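The scoring rule combines the three models multiplicatively, and a simple decoder just scores every candidate span. A minimal sketch (all probabilities below are invented; real BWI gets START and END from boosted detectors):

```python
def bwi_score(start_p, end_p, len_hist, i, j):
    """Score the candidate field spanning [i, j) as
    START(i) * END(j) * LEN(j - i)."""
    return start_p.get(i, 0.0) * end_p.get(j, 0.0) * len_hist.get(j - i, 0.0)

def best_span(start_p, end_p, len_hist):
    """Exhaustively score every (i, j) pair; return (score, i, j)."""
    return max((bwi_score(start_p, end_p, len_hist, i, j), i, j)
               for i in start_p for j in end_p if j > i)

# Hypothetical per-position probabilities and length histogram:
best = best_span(start_p={2: 0.8, 5: 0.1},
                 end_p={4: 0.7, 6: 0.2},
                 len_hist={1: 0.2, 2: 0.6, 4: 0.1})
```

The LEN histogram acts as a soft length prior, so a confident START paired with a distant END is penalized unless fields of that length were common in training.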
BWI: Learning to detect boundaries • BWI uses boosting to find “detectors” for START and END • Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i). • Each “pattern” is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyUpperCaseLetter, anyNumber, … • Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns
BWI: Learning to detect boundaries Field / F1: Person Name 30%, Location 61%, Start Time 98%
Problems with Sliding Windows and Boundary Finders • Decisions in neighboring parts of the input are made independently from each other. – Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”. – It is possible for two overlapping windows to both be above threshold. – In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
Finite State Machines
Hidden Markov Models HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, … Graphical model / finite state model: a state sequence … s_{t-1}, s_t, s_{t+1} … generates an observation sequence o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8 via transitions between states and per-state observations. Parameters: for all states S = {s_1, s_2, …}: Start state probabilities P(s_1); Transition probabilities P(s_t | s_{t-1}); Observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet. Training: maximize probability of training observations (w/ prior)
IE with Hidden Markov Models Given a sequence of observations: Yesterday Pedro Domingos spoke this example sentence. and a trained HMM (states: person name, location name, background), find the most likely state sequence (Viterbi): Yesterday Pedro Domingos spoke this example sentence. Any words said to be generated by the designated “person name” state are extracted as a person name: Pedro Domingos
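The Viterbi decoding step on this slide can be sketched compactly. The tiny two-state model below ("bg" for background, "name" for person name) and every probability in it are invented for illustration; the tutorial gives no concrete numbers:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, smooth=1e-6):
    """Most likely state sequence, computed in log space with backpointers."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], smooth))
          for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            col[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s].get(o, smooth)))
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

states = ["bg", "name"]
start_p = {"bg": 0.9, "name": 0.1}
trans_p = {"bg": {"bg": 0.8, "name": 0.2},
           "name": {"bg": 0.5, "name": 0.5}}
emit_p = {"bg": {"Yesterday": 0.3, "spoke": 0.3},
          "name": {"Pedro": 0.4, "Domingos": 0.4}}
best_path = viterbi(["Yesterday", "Pedro", "Domingos", "spoke"],
                    states, start_p, trans_p, emit_p)
```

The contiguous run of "name" states then yields the extracted person name, exactly as on the slide.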
HMM Example: “Nymble” [Bikel et al 1998], [BBN “IdentiFinder”] Task: Named Entity Extraction. States: start-of-sentence, end-of-sentence, Person, Org, (five other name classes), Other. Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}). Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t). Train on ~500k words of news wire text. Results (F1): Mixed case English 93%, Upper case English 91%, Mixed case Spanish 90%. Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ’99]
We want More than an Atomic View of Words Would like richer representation of text: many arbitrary, overlapping features of the words, e.g.: identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is “Wisniewski”; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are “and Associates”
Problems with Richer Representation and a Joint Model These arbitrary features are not independent: – Multiple levels of granularity (chars, words, phrases) – Multiple dependent modalities (words, formatting, layout) – Past & future Two choices: Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data! Ignore the dependencies. This causes “over-counting” of evidence (a la naïve Bayes). Big problem when combining evidence, as in Viterbi!
Conditional Sequence Models • We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s, o): – Can examine features, but not responsible for generating them. – Don’t have to explicitly model their dependencies. – Don’t “waste modeling effort” trying to generate what we are given at test time anyway.
Conditional Finite State Sequence Models: From HMMs to CRFs [McCallum, Freitag & Pereira 2000] [Lafferty, McCallum, Pereira 2001] Joint model: P(s, o) over the state sequence s_1 … s_T and observation sequence o_1 … o_T. Conditional model: P(s | o). (A super-special case of Conditional Random Fields.)
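The equation images on this slide did not survive extraction. For reference, the joint (HMM) and conditional (linear-chain) forms they depicted are, in the standard notation of Lafferty, McCallum & Pereira (2001):

```latex
% HMM: joint probability of state and observation sequences
P(\mathbf{s}, \mathbf{o}) = \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)

% Linear-chain conditional model, with feature functions f_k and weights \lambda_k
P(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z(\mathbf{o})}
  \prod_{t=1}^{T} \exp\!\Big( \sum_k \lambda_k f_k(s_{t-1}, s_t, \mathbf{o}, t) \Big),
\qquad
Z(\mathbf{o}) = \sum_{\mathbf{s}'} \prod_{t=1}^{T}
  \exp\!\Big( \sum_k \lambda_k f_k(s'_{t-1}, s'_t, \mathbf{o}, t) \Big)
```

The normalizer Z(o) sums over all state sequences for the given observations, which is what makes the model conditional rather than generative.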
Conditional Random Fields [Lafferty, McCallum, Pereira 2001] 1. FSM special case: linear chain among unknowns, parameters tied across time steps: states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4} conditioned on O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}. 2. In general: CRFs = "conditionally-trained Markov network", arbitrary structure among unknowns. 3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates")
Feature Functions o = Yesterday Pedro Domingos spoke this example sentence. (observations o_1 … o_7; example states s_1 … s_4)
Learning Parameters of CRFs Maximize log-likelihood of parameters Λ = {λ_k} given training data D. Log-likelihood gradient: empirical feature counts minus expected feature counts (plus prior term). Methods: • iterative scaling (quite slow) • conjugate gradient (much faster) • limited-memory quasi-Newton methods, BFGS (super fast) [Sha & Pereira 2002] & [Malouf 2002]
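The gradient formula image on this slide was lost in extraction. The standard CRF log-likelihood gradient it showed, with a Gaussian prior of variance sigma squared, is:

```latex
\frac{\partial L}{\partial \lambda_k}
= \sum_{i} \Big[ F_k(\mathbf{s}^{(i)}, \mathbf{o}^{(i)})
  - \sum_{\mathbf{s}} P_{\Lambda}(\mathbf{s} \mid \mathbf{o}^{(i)})\,
    F_k(\mathbf{s}, \mathbf{o}^{(i)}) \Big]
  - \frac{\lambda_k}{\sigma^2},
\qquad
F_k(\mathbf{s}, \mathbf{o}) = \sum_{t} f_k(s_{t-1}, s_t, \mathbf{o}, t)
```

The first bracketed term is the empirical count of feature k on the training instance, the second is its model expectation (computed with forward-backward), and the last is the prior's shrinkage toward zero.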
Voted Perceptron Sequence Models [Collins 2002] Like CRFs with stochastic gradient ascent and a Viterbi approximation: the update is analogous to the CRF gradient for one training instance. Avoids calculating the partition function (normalizer) Z_o, but uses gradient ascent, not a 2nd-order or conjugate gradient method.
General CRFs vs. HMMs • More general and expressive modeling technique • Comparable computational efficiency • Features may be arbitrary functions of any or all observations • Parameters need not fully specify generation of observations; require less training data • Easy to incorporate domain knowledge • State means only “state of process”, vs. “state of process” and “observational history I’m keeping”
Person name Extraction [McCallum 2001, unpublished]
Person name Extraction
Features in Experiment: Capitalized Xxxxx; Mixed Caps XxXxxx; All Caps XXXXX; Initial Cap X….; Contains Digit xxx5; All lowercase xxxx; Initial X; Punctuation . , : ; ! ( ) etc.; Period; Comma; Apostrophe; Dash; Preceded by HTML tag; Character n-gram classifier says string is a person name (80% accurate); Hand-built FSM person-name extractor says yes (prec/recall ~ 30/95); In stopword list (the, of, their, etc); In honorific list (Mr, Mrs, Dr, Sen, etc); In person suffix list (Jr, Sr, Ph.D, etc); In name particle list (de, la, van, der, etc); In Census lastname list, segmented by P(name); In Census firstname list, segmented by P(name); In locations lists (states, cities, countries); In company name list (“J. C. Penny”); In list of company suffixes (Inc, & Associates, Foundation); Conjunctions of all previous feature pairs, evaluated at the current time step; Conjunctions of all previous feature pairs, evaluated at the current step and one step ahead; All previous features, evaluated two steps ahead; All previous features, evaluated one step behind. Total number of features = ~500k
Training and Testing • Trained on 65k words from 85 pages, 30 different companies’ web sites. • Training takes 4 hours on a 1 GHz Pentium. • Training precision/recall is 96% / 96%. • Tested on a different set of web pages with similar size characteristics. • Testing precision is 92-95%, recall is 89-91%.
Table Extraction from Government Reports Cash receipts from marketings of milk during 1995, at $19.9 billion, were slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat 2/: United States, 1993-95
Year : Number of Milk Cows 1/ (1,000 Head) : Milk Per Milk Cow (Pounds) : Milkfat Per Milk Cow (Pounds) : Percentage of Fat in All Milk Produced : Total Milk (Million Pounds) : Total Milkfat (Million Pounds)
1993 : 9,589 : 15,704 : 575 : 3.66 : 150,582 : 5,514.4
1994 : 9,500 : 16,175 : 592 : 3.66 : 153,664 : 5,623.7
1995 : 9,461 : 16,451 : 602 : 3.66 : 155,644 : 5,694.3
1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves.
Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003]
100+ documents from www.fedstats.gov, labeled line by line with a CRF.
Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}. 74
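The per-line layout features above are cheap to compute. A minimal sketch, with the caveat that the feature names and the whitespace-alignment test are illustrative approximations, not the paper's exact definitions:

```python
def line_features(line, prev=""):
    """Sketch of per-line layout features like those listed above.
    Returns a dict of feature name -> value for one text line."""
    n = max(len(line), 1)
    # column positions where a run of 2+ spaces occurs (crude alignment cue)
    gaps = {i for i in range(len(line) - 1) if line[i] == " " and line[i + 1] == " "}
    prev_gaps = {i for i in range(len(prev) - 1) if prev[i] == " " and prev[i + 1] == " "}
    return {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line[:1] in (" ", "\t"),
        "contains_5_spaces": " " * 5 in line,
        "aligns_with_prev": bool(gaps & prev_gaps),
    }
```

In the CRF these raw features would then be conjoined at the time offsets listed above.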
Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003]
Line labels, percent correct:
• HMM: 65%
• Stateless MaxEnt: 85%
• CRF w/out conjunctions: 52%
• CRF: 95%
Δerror = 85% 75
Named Entity Recognition
Reuters stories on international news:
CRICKET MILLNS SIGNS FOR BOLAND. CAPE TOWN 1996-08-22. South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Train on ~300k words. Labels: PER ORG LOC MISC
Examples: Yayuk Basuki, Innocent Butare, 3M, KDP, Leicestershire, Nirmal Hriday, The Oval, Java, Basque, 1,000 Lakes Rally 76
Automatically Induced Features [McCallum 2003]
Index: Feature
0: inside-noun-phrase (ot-1)
5: stopword (ot)
20: capitalized (ot+1)
75: word=the (ot)
100: in-person-lexicon (ot-1)
200: word=in (ot+2)
500: word=Republic (ot+1)
711: word=RBI (ot) & header=BASEBALL
1027: header=CRICKET (ot) & in-English-county-lexicon (ot)
1298: company-suffix-word (firstmention t+2)
4040: location (ot) & POS=NNP (ot) & capitalized (ot) & stopword (ot-1)
4945: moderately-rare-first-name (ot-1) & very-common-last-name (ot)
4474: word=the (ot-2) & word=of (ot) 77
Named Entity Extraction Results [McCallum & Li, 2003]
Method / F1 / # parameters
• BBN's Identifinder, word features: 79%, ~500k
• CRFs, word features, w/out Feature Induction: 80%, ~500k
• CRFs, many features, w/out Feature Induction: 75%, ~3 million
• CRFs, many candidate features with Feature Induction: 90%, ~60k 78
Inducing State-Transition Structure
• K-reversible grammars [Chidlovskii, 2000]
• Structure learning for HMMs + IE [Seymore et al 1999] [Freitag & McCallum 2000] 79
Limitations of Finite State Models • Finite state models have a linear structure • Web documents have a hierarchical structure – Are we suffering by not modeling this structure more explicitly? • How can one learn a hierarchical extraction model? 80
Tree-based Models 81
• Extracting from one web site – Use site-specific formatting information: e.g., "the JobTitle is a boldfaced paragraph in column 2" – For large well-structured sites, like parsing a formal language • Extracting from many web sites: – Need general solutions to entity extraction, grouping into records, etc. – Primarily use content information – Must deal with a wide range of ways that users present data. – Analogous to parsing natural language • Problems are complementary: – Site-dependent learning can collect training data for a site-independent learner – Site-dependent learning can boost accuracy of a site-independent learner on selected key sites 82
83
User gives first K positive examples (and thus many implicit negative examples). Learner. 84
85
STALKER: Hierarchical boundary finding [Muslea, Minton & Knoblock 99] • Main idea: – To train a hierarchical extractor, pose a series of learning problems, one for each node in the hierarchy – At each stage, extraction is simplified by knowing about the "context." 86
87
(BEFORE=null, AFTER=(Tutorial, Topics)) (BEFORE=null, AFTER=(Tutorials, and)) 88
(BEFORE=null, AFTER=(<, li, >, )) 89
(BEFORE=(: ), AFTER=null) 90
(BEFORE=(: ), AFTER=null) 91
(BEFORE=(: ), AFTER=null) 92
Stalker: hierarchical decomposition of two web sites 93
Stalker: summary and results • Rule format: – "landmark automata" format for rules, which extended BWI's format • E.g.: <a>W. Cohen</a> CMU: Web IE </li> • BWI: BEFORE=(<, /, a, >, ANY, :) • STALKER: BEGIN = SkipTo(<, /, a, >), SkipTo(:) • Top-down rule learning algorithm – Carefully chosen ordering between types of rule specializations • Very fast learning: e.g. 8 examples vs. 274 • A lesson: we often control the IE training data! 94
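A BEGIN rule like the one above is just a sequence of SkipTo steps over a token stream. A minimal sketch (the tokenization of the example snippet is our own assumption):

```python
def skip_to(tokens, pos, landmark):
    """Scan forward from pos for the first occurrence of the landmark
    token sequence; return the index just past it, or None if absent."""
    k = len(landmark)
    for i in range(pos, len(tokens) - k + 1):
        if tokens[i:i + k] == landmark:
            return i + k
    return None

def apply_begin_rule(tokens, rule):
    """Apply a STALKER-style BEGIN rule: a chain of SkipTo(landmark)
    steps; the final position is where extraction starts."""
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None  # rule does not fire on this page
    return pos
```

On the slide's example, BEGIN = SkipTo(<, /, a, >), SkipTo(:) lands just before "Web IE".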
Why low sample complexity is important in "wrapper learning": at training time, only four examples are available, but one would like to generalize to future pages as well… 95
"Wrapster": a hybrid approach to representing wrappers [Cohen, Jensen & Hurst, WWW 2002] • Common representations for web pages include: – a rendered image – a DOM tree (tree of HTML markup & text) • gives some of the power of hierarchical decomposition – a sequence of tokens – a bag of words, a sequence of characters, a node in a directed graph, … • Questions: – How can we engineer a system to generalize quickly? – How can we explore representational choices easily? 96
Example Wrapster predicate
http://wasBang.org/aboutus.html
Rendered page:
  WasBang.com contact info:
  Currently we have offices in two locations:
  – Pittsburgh, PA
  – Provo, UT
DOM tree:
  html
    head …
    body
      p "WasBang.com … info:"
      p "Currently …"
      ul
        li
          a "Pittsburgh, PA"
        li
          a "Provo, UT" 99
Example Wrapster predicate
http://wasBang.org/aboutus.html
Example: p(s1, s2) iff s2 are the tokens below an li node inside a ul node inside s1.
EXECUTE(p, s1) extracts: "Pittsburgh, PA", "Provo, UT"
Page: WasBang.com contact info: Currently we have offices in two locations: – Pittsburgh, PA – Provo, UT 100
Wrapster builders • Builders are based on simple, restricted languages, for example: – Ltagpath: p is defined by tag1, …, tagk, and p_tag1,…,tagk(s1, s2) is true iff s1 and s2 correspond to DOM nodes and s2 is reached from s1 by following a path ending in tag1, …, tagk • EXECUTE(p_ul,li, s1) = {"Pittsburgh, PA", "Provo, UT"} – Lbracket: p is defined by a pair of strings (l, r), and p_l,r(s1, s2) is true iff s2 is preceded by l and followed by r. • EXECUTE(p_in,locations, s1) = {"two"} 101
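An Ltagpath predicate can be executed with a stack of open tags, as in this sketch using Python's stdlib HTML parser (the class and its behavior on nested matching paths are our simplifications, not Wrapster's actual implementation):

```python
from html.parser import HTMLParser

class TagPathExtractor(HTMLParser):
    """Collect the text of every DOM node reached by a path ending in
    the given tag sequence, e.g. ("ul", "li")."""
    def __init__(self, tagpath):
        super().__init__()
        self.tagpath = list(tagpath)
        self.stack = []       # currently open tags
        self.collecting = 0
        self.buf = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if self.stack[-len(self.tagpath):] == self.tagpath:
            self.collecting += 1
            self.buf = []

    def handle_endtag(self, tag):
        if self.collecting and self.stack[-len(self.tagpath):] == self.tagpath:
            self.results.append("".join(self.buf).strip())
            self.collecting -= 1
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.collecting:
            self.buf.append(data)
```

Running it with tagpath ("ul", "li") on the example page yields the two office locations.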
Wrapster builders
For each language L there is a builder B which implements:
• LGG(positive examples of p(s1, s2)): least general p in L that covers all the positive examples (like pairwise generalization) – For Lbracket, longest common prefix and suffix of the examples.
• REFINE(p, examples): a set of p's that cover some but not all of the examples. – For Ltagpath, extend the path with one additional tag that appears in the examples.
• Builders/languages can be combined: – E.g. to construct a builder for (L1 and L2) or (L1 composeWith L2) 102
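The LGG for Lbracket can be sketched as the longest common suffix of the left contexts paired with the longest common prefix of the right contexts (the example contexts below are hypothetical):

```python
def common_prefix(strings):
    """Longest common prefix of a list of strings."""
    p = strings[0]
    for s in strings[1:]:
        i = 0
        while i < min(len(p), len(s)) and p[i] == s[i]:
            i += 1
        p = p[:i]
    return p

def lgg_bracket(examples):
    """LGG for the Lbracket language. examples: (left_context,
    right_context) pairs around positive spans. Returns (l, r): the
    longest common suffix of lefts and common prefix of rights."""
    lefts = [l[::-1] for l, _ in examples]   # reverse to reuse common_prefix
    rights = [r for _, r in examples]
    return common_prefix(lefts)[::-1], common_prefix(rights)
```

A REFINE for Ltagpath would analogously prepend one more tag drawn from the examples' DOM paths.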
Wrapster builders - examples • Compose 'tagpaths' and 'brackets' – E.g., "extract strings between '(' and ')' inside a list item inside an unordered list" • Compose 'tagpaths' and language-based extractors – E.g., "extract city names inside the first paragraph" • Extract items based on position inside a rendered table, or properties of the rendered text – E.g., "extract items inside any column headed by text containing the words 'Job' and 'Title'" – E.g., "extract items in boldfaced italics" 103
Wrapster results (plot: F1 vs. #examples) 107
Broader Issues in IE 109
Broader View Up to now we have been focused on segmentation and classification Create ontology Spider Filter by relevance IE Segment Classify Associate Cluster Load DB Document collection Train extraction models Label training data Database Query, Search Data mine 110
Broader View Now touch on some other issues 3 Create ontology Spider Filter by relevance Tokenize 1 2 IE Segment Classify Associate Cluster Load DB Document collection 4 Train extraction models Label training data Database Query, Search 5 Data mine 111
(1) Association as Binary Classification Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair. Person Role Person-Role (Christos Faloutsos, KDD 2003 General Chair) NO Person-Role ( Ted Senator, KDD 2003 General Chair) YES Do this with SVMs and tree kernels over parse trees. [Zelenko et al, 2002] 112
(1) Association with Finite State Machines [Ray & Craven, 2001] … This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. … DET N N V PREP ART ADJ N V ART N / this enzyme ubc6 localizes to the endoplasmic reticulum with the catalytic domain facing the cytosol. Subcellular-localization(UBC6, endoplasmic reticulum) 113
(1) Association using Parse Trees [Miller et al 2000]
Simultaneously POS tag, parse, extract & associate! Increase the space of parse constituents to include entity and relation tags.
Notation: ch = head constituent category; cm = modifier constituent category; Xp = X of parent node; t = POS tag; w = word.
Parameters, e.g.: P(ch|cp), P(cm|cp, chp, cm-1, wp), P(tm|cm, th, wh), P(wm|cm, th, wh). Examples: P(vp|s), P(per/np|s, vp, null, said), P(per/nnp|per/np, vbd, said), P(nance|per/np, per/nnp, vbd, said).
(This is also a great example of extraction using a tree model.) 114
(1) Association with Graphical Models [Roth & Yih 2002] Capture arbitrary-distance dependencies among predictions. • Random variable over the class of entity #2, e.g. over {person, location, …} • Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …} • Local language models contribute evidence to entity classification and to relation classification. • Dependencies between classes of entities and relations! Inference with loopy belief propagation. 115
(1) Association with Graphical Models [Roth & Yih 2002] Also capture long-distance dependencies among predictions. • Random variable over the class of entity #1, e.g. over {person, location, …} • Diagram: person --lives-in--> person? • Local language models contribute evidence to entity classification and to relation classification. • Dependencies between classes of entities and relations! Inference with loopy belief propagation. 116
(1) Association with Graphical Models [Roth & Yih 2002] Also capture long-distance dependencies among predictions. • Random variable over the class of entity #1, e.g. over {person, location, …} • Diagram: person --lives-in--> location • Local language models contribute evidence to entity classification and to relation classification. • Dependencies between classes of entities and relations! Inference with loopy belief propagation. 117
(1) Association with “Grouping Labels” [Jensen & Cohen, 2001] • Create a simple language that reflects a field’s relation to other fields • Language represents ability to define: – Disjoint fields – Shared fields – Scope • Create rules that use field labels 118
(1) Grouping labels: A simple example Next: Name: recordstart Kites Buy a kite Box Kite $100 Stunt Kite $300 Name: Box Kite Company: Location: Order: Cost: $100 Description: Color: Size: - 120
(2) Grouping labels: A messy example next: Name: recordstart prevlink: Cost Kites Buy a kite Box Kite Stunt Kite Box Kite Great for kids Detailed specs Specs Color: blue Size: small $100 $300 Name: Box Kite Company: Location: Order: Cost: $100 Description: Great for kids Color: blue Size: small pagetype: Product 121
(2) User interface: adding labels to extracted fields 122
(1) Experimental Evaluation of Grouping Labels: fixed the language, then wrapped 499 new sites, all of which could be handled. 123
Broader View Now touch on some other issues 3 Create ontology Spider Filter by relevance Tokenize 1 2 IE Segment Classify Associate Cluster Load DB Document collection Database 4 Train extraction models Query, Search 5 Data mine Label training data Object Consolidation 124
(2) Learning a Distance Metric Between Records [Borthwick, 2000; Cohen & Richman, 2001; Bilenko & Mooney, 2002, 2003] Learn Pr({duplicate, not-duplicate} | record1, record2) with a Maximum Entropy classifier. Do greedy agglomerative clustering using this probability as a distance metric. 125
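The greedy agglomerative step can be sketched as single-link clustering over the learned pairwise duplicate probability; here a simple Jaccard token overlap stands in for the Maximum Entropy classifier's output:

```python
def jaccard(a, b):
    """Stand-in pairwise duplicate probability: token-set overlap."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

def dedup_clusters(records, p_dup, threshold=0.5):
    """Greedy single-link agglomerative clustering: repeatedly merge
    the two clusters containing the most-probable duplicate pair,
    stopping when no cross-cluster pair exceeds the threshold."""
    clusters = [[r] for r in records]
    while True:
        best, bi, bj = threshold, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(p_dup(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, bi, bj = s, i, j
        if bi is None:
            return clusters
        clusters[bi].extend(clusters.pop(bj))
```

A learned classifier would simply replace `jaccard` as the `p_dup` argument.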
(2) String Edit Distance • distance("William Cohen", "Willliam Cohon")
Alignment ("-" marks the gap for the inserted character):
s:    W I L L - I A M _ C O H E N
t:    W I L L L I A M _ C O H O N
op:   C C C C I C C C C C C C S C
cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2 126
(2) Computing String Edit Distance
D(i, j) = min of:
  D(i-1, j-1) + d(si, tj)  // subst/copy
  D(i-1, j) + 1            // insert
  D(i, j-1) + 1            // delete
A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1). Learn these parameters (the costs d). (Slide shows a partial DP matrix for "COHEN" vs. "COHON".) 127
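The recurrence above fills an (m+1) x (n+1) table. A direct sketch with unit insert/delete costs and d(a, b) = 0 iff a == b; the learned-cost version discussed on the next slides would replace these constants with trainable parameters:

```python
def edit_distance(s, t):
    """Standard edit-distance dynamic program."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete everything from s
    for j in range(n + 1):
        D[0][j] = j  # insert everything from t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # subst/copy
                D[i - 1][j] + 1,                           # insert
                D[i][j - 1] + 1,                           # delete
            )
    return D[m][n]
```

Keeping back-pointers to whichever branch won the min recovers the trace, and hence an optimal alignment.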
(2) String Edit Distance Learning [Bilenko & Mooney, 2002, 2003] Precision/recall for MAILING dataset duplicate detection 128
(2) Information Integration [Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001], [Richardson & Domingos 2003] Goal might be to merge results of two IE systems: Name: Introduction to Computer Science; Title: Intro. to Comp. Sci.; Num: 101; Dept: Computer Science; Teacher: Dr. Klüdge; Number: CS 101; Teacher: M. A. Kludge; Time: 9-11 am; TA: John Smith; Name: Data Structures in Java; Topic: Java Programming; Room: 5032 Wean Hall; Start time: 9:10 AM 129
(2) Two further Object Consolidation Issues • Efficiently clustering large data sets by pre-clustering with a cheap distance metric (hybrid of string-edit distance and term-based distances) – [McCallum, Nigam & Ungar, 2000] • Don't simply merge greedily: capture dependencies among multiple merges. – [Cohen, McAllester, Kautz KDD 2000; Pasula, Marthi, Milch, Russell, Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003] 130
Relational Identity Uncertainty with Probabilistic Relational Models (PRMs) [Russell 2001], [Pasula et al 2002], [Marthi, Milch, Russell 2003] (Applied to citation matching, and object correspondence in vision.) (Slide shows a plate model with variables such as id, context, words, surname, distance, fonts, gender, age, …; plate of size N.) 131
A Conditional Random Field for Co-reference [McCallum & Wellner, 2003] (Slide shows three mentions, "…Mr Powell…", "…Powell…", "…she…", with pairwise coreference decisions Y/N and scores -(45), -(-30), +(11), -4.) 132
Inference in these CRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002] (Slide: mentions "…Condoleezza Rice…", "…she…", "…Powell…", "…Mr. Powell…" with pairwise scores 45, -106, -30, -134, 11, 10; this partition scores -22.) 133
Inference in these CRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002] (Slide: same mentions and pairwise scores; the better partition scores 314.) 134
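The objective behind these partition scores can be sketched as a correlation-clustering-style score: reward within-cluster affinity, penalize cut affinity. The weights below are hypothetical, and the exact potentials in McCallum & Wellner differ in detail:

```python
def partition_score(weights, clusters):
    """Score a partition of mentions: add w(i, j) when i and j are in
    the same cluster, subtract it when they are split apart."""
    score = 0
    for (i, j), w in weights.items():
        same = any(i in c and j in c for c in clusters)
        score += w if same else -w
    return score
```

Maximizing this over partitions is the graph-partitioning problem the slide refers to; for small mention sets it can be brute-forced, otherwise approximate cut algorithms are used.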
Broader View Now touch on some other issues 3 Create ontology Spider Filter by relevance Tokenize 1 2 IE Segment Classify Associate Cluster Load DB Document collection 4 Train extraction models Database Query, Search 5 Data mine Label training data 1 135
(3) Automatically Inducing an Ontology [Riloff, '95] Two inputs: (1) a corpus of documents marked relevant/irrelevant; (2) heuristic "interesting" meta-patterns. 136
(3) Automatically Inducing an Ontology [Riloff, '95] Output: Subject/Verb/Object patterns that occur more often in the relevant documents than in the irrelevant ones. 137
Broader View Now touch on some other issues 3 Create ontology Spider Filter by relevance Tokenize 1 2 IE Segment Classify Associate Cluster Load DB Document collection 4 Train extraction models Database Query, Search 5 Data mine Label training data 1 138
(4) Training IE Models using Unlabeled Data [Collins & Singer, 1999]
Consider just appositives and prepositional phrases, e.g. "…says Mr. Cooper, a vice president of …" (NNP + appositive phrase, head=president).
Use two independent sets of features:
  Contents: full-string=Mr._Cooper, contains(Mr.), contains(Cooper)
  Context: context-type=appositive, appositive-head=president
1. Start with just seven rules (and ~1M sentences of NYTimes):
   full-string=New_York → Location
   full-string=California → Location
   full-string=U.S. → Location
   contains(Mr.) → Person
   contains(Incorporated) → Organization
   full-string=Microsoft → Organization
   full-string=I.B.M. → Organization
2. Alternately train & label using each feature set.
3. Obtain 83% accuracy at finding person, location, organization & other in appositives and prepositional phrases!
See also [Brin 1998], [Blum & Mitchell 1998], [Riloff & Jones 1999] 139
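The alternate-train-and-label loop can be sketched as decision-list co-training over the two views. This is a deliberate simplification: the real algorithm ranks candidate rules by confidence, here replaced by a raw count threshold, and the toy examples are made up:

```python
def cotrain(examples, seed_rules, rounds=3, min_count=1):
    """Minimal co-training sketch in the spirit of Collins & Singer '99.
    examples: list of (spelling_features, context_features) sets.
    seed_rules: dict mapping a spelling feature to a label."""
    labels = {}
    ctx_rules = {}
    for _ in range(rounds):
        # 1. label with spelling (contents) rules
        for i, (spelling, _context) in enumerate(examples):
            for f in spelling:
                if f in seed_rules:
                    labels[i] = seed_rules[f]
        # 2. induce context rules from currently labeled examples
        counts = {}
        for i, (_spelling, context) in enumerate(examples):
            if i in labels:
                for f in context:
                    counts[(f, labels[i])] = counts.get((f, labels[i]), 0) + 1
        for (f, y), c in counts.items():
            if c >= min_count:
                ctx_rules[f] = y
        # 3. label remaining examples with context rules
        for i, (_spelling, context) in enumerate(examples):
            for f in context:
                if i not in labels and f in ctx_rules:
                    labels[i] = ctx_rules[f]
    return labels
```

One seed spelling rule is enough to propagate a label through shared contexts to unseen spellings.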
Broader View Now touch on some other issues 3 Create ontology Spider Filter by relevance Tokenize 1 2 IE Segment Classify Associate Cluster Load DB Document collection 4 Train extraction models Database Query, Search 5 Data mine Label training data 1 140
(5) Data Mining: Working with IE Data • Some special properties of IE data: – It is based on extracted text – It is "dirty" (missing or extraneous facts, improperly normalized entity names, etc.) – May need cleaning before use • What operations can be done on dirty, unnormalized databases? – Data-mine it directly. – Query it directly with a language that has "soft joins" across similar, but not identical, keys. [Cohen 1998] – Use it to construct features for learners [Cohen 2000] – Infer a "best" underlying clean database [Cohen, Kautz, McAllester, KDD 2000] 141
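A "soft join" can be sketched in a few lines, in the spirit of WHIRL [Cohen 1998]: join on string similarity rather than key equality. Jaccard token overlap stands in for the TFIDF/cosine measure used there, and the table contents are hypothetical:

```python
def jaccard(a, b):
    """Token-set similarity; stand-in for a TFIDF/cosine measure."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

def soft_join(left, right, key_sim, threshold=0.5):
    """Join two tables of (key, row) pairs on *similar* keys, returning
    matches (lkey, rkey, similarity, lrow, rrow), best matches first."""
    out = []
    for lkey, lrow in left:
        for rkey, rrow in right:
            s = key_sim(lkey, rkey)
            if s >= threshold:
                out.append((lkey, rkey, s, lrow, rrow))
    return sorted(out, key=lambda r: -r[2])
```

The quadratic scan is fine for small tables; the pre-clustering trick cited later in this section is one way to scale it.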
(5) Data Mining: Mutually supportive IE and Data Mining [Nahm & Mooney, 2000]
Extract a large database. Learn rules to predict the value of each field from the other fields. Use these rules to increase the accuracy of IE. (Example DB record shown on slide.)
Sample learned rules:
• platform:AIX & !application:Sybase & application:DB2 → application:LotusNotes
• language:C++ & language:C & application:Corba & title=SoftwareEngineer → platform:Windows
• language:HTML & platform:WindowsNT & application:ActiveServerPages → area:Database
• language:Java & area:ActiveX & area:Graphics → area:Web 142
(5) Working with IE Data • Association rule mining using IE data • Classification using IE data 143
[Cohen, ICML 2000] Idea: do very lightweight “site wrapping” of relevant pages Make use of (partial, noisy) wrappers 144
145
146
147
148
(5) Working with IE Data • Association rule mining using IE data • Classification using IE data – Many features based on lists, tables, etc are proposed – The learner filters these features and decides which to use in a classifier – How else can proposed structures be filtered? 149
(5) Finding "pretty good" wrappers without site-specific training data • Local structure in extraction – Assume a set of "seed examples" similar to the field to be extracted. – Identify small possible wrappers (e.g., simple tagpaths) – Use semantic information to evaluate each wrapper: average minimum TFIDF distance to a known positive "seed example", over all extracted strings – Adopt the best single tagpath • Results on 84 pre-wrapped page types (Cohen, AAAI-99) – 100% equivalent to target wrapper 80% of the time – More conventional learning approach: 100% equivalent to target wrapper 50% of the time (Cohen & Fan, WWW-99) 150
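The evaluation step above can be sketched directly: score each candidate wrapper by the average minimum distance from its extractions to the seed examples, and keep the best. Token-overlap distance stands in for TFIDF distance, and the candidate wrappers and seeds are hypothetical:

```python
def token_dist(a, b):
    """Stand-in for TFIDF distance: 1 - Jaccard token overlap."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return 1 - len(A & B) / len(A | B)

def wrapper_score(extracted, seeds, dist):
    """Average min distance from each extracted string to its nearest
    seed; lower means the wrapper's output looks more like the seeds."""
    return sum(min(dist(x, s) for s in seeds) for x in extracted) / len(extracted)

def pick_wrapper(candidates, seeds, dist):
    """candidates: list of (wrapper_name, extracted_strings) pairs."""
    return min(candidates, key=lambda c: wrapper_score(c[1], seeds, dist))
```

A tagpath that extracts location-like strings beats one that extracts running prose, even though neither was trained on this site.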
151
152
153
154
(5) Working with IE Data • Association rule mining using IE data • Classification using IE data – Many features based on lists, tables, etc are proposed – The learner filters these features and decides which to use in a classifier – How else can proposed structures be filtered? – How else can structures be proposed? 156
List 1 builder predicate Task: classify links as to whether they point to an “executive biography” page 158
List 2 builder predicate 159
List 3 builder predicate 160
Features extracted: { List 1, List 3, …}, { List 1, List 2, List 3, …}, { List 2, List 3, …}, … 161
Experimental results: error reduced by almost half on average. (Plot annotations: "Builder features hurt", "No improvement".) 162
Learning Formatting Patterns "On the Fly": "Scoped Learning" for IE [Blei, Bagnell, McCallum, 2002] [Taskar, Wong, Koller 2003] Formatting is regular on each site, but there are too many different sites to wrap. Can we get the best of both worlds? 163
Scoped Learning Generative Model
1. For each of the D documents:
   a) Generate the multinomial formatting feature parameters φ from p(φ|α)
2. For each of the N words in the document:
   a) Generate the nth category cn from p(cn).
   b) Generate the nth word (global feature) from p(wn | cn, θ)
   c) Generate the nth formatting feature (local feature) from p(fn | cn, φ)
(Plate diagram variables: α, φ, θ, c, w, f; plates of size N and D.) 164
Global Extractor: Precision = 46%, Recall = 75% 167
Scoped Learning Extractor: Precision = 58%, Recall = 75%. ΔError = -22%. 168
Wrap-up 169
IE Resources
• Data
  – RISE, http://www.isi.edu/~muslea/RISE/index.html
  – Linguistic Data Consortium (LDC): Penn Treebank, Named Entities, Relations, etc.
  – http://www.biostat.wisc.edu/~craven/ie
  – http://www.cs.umass.edu/~mccallum/data
• Code
  – TextPro, http://www.ai.sri.com/~appelt/TextPro
  – MALLET, http://www.cs.umass.edu/~mccallum/mallet
  – SecondString, http://secondstring.sourceforge.net/
• Both
  – http://www.cis.upenn.edu/~adwait/penntools.html 170
Where from Here? • Science – Higher accuracy, integration with data mining. – Relational learning, minimizing labeled-data needs, unified models of all four of IE's components. – Multi-modal IE: text, images, video, audio. Multi-lingual. • Profit – SRA, Inxight, Fetch, Mohomine, Cymfony, … you? – Bio-informatics, intelligent tutors, information overload, anti-terrorism • Fun – Search engines that return "things" instead of "pages" (people, companies, products, universities, courses…) – New insights by mining previously untapped knowledge. 171
Thank you! More information: William Cohen: http://www.cs.cmu.edu/~wcohen Andrew McCallum: http://www.cs.umass.edu/~mccallum 172
References
• [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP'97, pp. 194-201.
• [Califf & Mooney 1999] Califf, M. E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
• [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002).
• [Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
• [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.
• [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).
• [Cohen, 2000b] Cohen, W. Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
• [Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
• [DeJong 1982] DeJong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.
• [Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
• [Freitag, 1999] Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
• [Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).
• [Freitag & Kushmerick, 1999] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
• [Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
• [Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118, pp. 15-68.
• [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.
• [Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master's thesis, UC San Diego.
• [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F.: Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000.
• [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), pp. 226-233. 173
References
• [Muslea et al, 1999] Muslea, I.; Minton, S.; Knoblock, C. A.: A Hierarchical Approach to Wrapper Induction. Proceedings of Autonomous Agents-99.
• [Muslea et al, 2000] Muslea, I.; Minton, S.; and Knoblock, C. Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems.
• [Nahm & Mooney, 2000] Nahm, Y.; and Mooney, R. A mutually beneficial integration of data mining and information extraction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pp. 627-632, Austin, TX.
• [Punyakanok & Roth 2001] Punyakanok, V.; and Roth, D. The use of classifiers in sequential inference. Advances in Neural Information Processing Systems 13.
• [Ratnaparkhi 1996] Ratnaparkhi, A.: A maximum entropy part-of-speech tagger, in Proc. Empirical Methods in Natural Language Processing Conference, pp. 133-141.
• [Ray & Craven 2001] Ray, S.; and Craven, M. Representing Sentence Structure in Hidden Markov Models for Information Extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA. Morgan Kaufmann.
• [Soderland 1997] Soderland, S.: Learning to Extract Text-Based Information from the World Wide Web. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97).
• [Soderland 1999] Soderland, S. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1/3): 233-277. 174