  • Number of slides: 90

Make-up Class: Tomorrow (Wed) 10:30-11:45 AM, BY 210 (next to the advising office)

Information Extraction
(Several slides based on those by Ray Mooney and Cohen/McCallum, via Dan Weld’s class)

Intended Use of Semantic Web?
• Pages should be annotated with RDF triples, with links to an RDF-S (or OWL) background ontology.
• E.g. see Jim Hendler’s page…

Database vs. Semantic Web Inference (and the Magellan Story)
• Also: templated extraction as undoing the XML-to-HTML conversion. Templated extraction is by DOM patterns; unstructured extraction is (sort of) by grammar parse-tree patterns. Grammar learning is mostly from positive examples.
(Rinku Patel: to be added)

Who will annotate the data?
• The Semantic Web works if users annotate their pages using some existing ontology (or their own ontology, but with a mapping to other ontologies)
  – But users typically do not conform to standards..
    • and are not patient enough for delayed gratification…
• Three solutions
  – 1. Intercede in the way pages are created (act as if you are helping them write web pages)
    • What if we change MS Frontpage/Claris Homepage so that they (slyly) add annotations?
    • E.g. the Mangrove project at U. Wash.
      – Help users tag their data (allow graphical editing)
      – Provide instant gratification by running services that use the tags.
  – 2. Collaborative tagging!
    • “Folksonomies” (look at the Wikipedia article)
      – Flickr, Technorati, del.icio.us, etc.
    • CBIOC, ESP game, etc.
      – Need to incentivize users to do the annotations..
  – 3. Automated information extraction (next topic)

Folksonomies: the good
• Bottom-up approach to taxonomies/ontologies
  – [In systems like] Furl, Flickr and Del.icio.us... people classify their pictures/bookmarks/web pages with tags (e.g. wedding), and then the most popular tags float to the top (e.g. Flickr's tags or Del.icio.us on the right)...
  – [F]olksonomies can work well for certain kinds of information because they offer a small reward for using one of the popular categories (such as your photo appearing on a popular page). People who enjoy the social aspects of the system will gravitate to popular categories while still having the freedom to keep their own lists of tags.

Works best when many people tag the same info…

Folksonomies… the bad
• On the other hand, not hard to see a few reasons why a folksonomy would be less than ideal in a lot of cases:
  – None of the current implementations have synonym control (e.g. "selfportrait" and "me" are distinct Flickr tags, as are "mac" and "macintosh" on Del.icio.us).
  – Also, there's a certain lack of precision involved in using simple one-word tags--like which Lance are we talking about?
  – And, of course, there's no hierarchy and the content types (bookmarks, photos) are fairly simple.
• For indexing and library people, folksonomies are about as appealing as Wikipedia is to encyclopedia editors.
  – But.. there's some interesting stuff happening around them.

Mass Collaboration (& Mice running the Earth)
• The quality of the tags generated through folksonomies is notoriously hard to control
  – So, design mechanisms that ensure correctness of tags..
    • ESP game makes it fun to …
    • CBIOC and Google Co-op restrict annotation privileges to trusted users..
• It is hard to get people to tag things in which they don’t have personal interest..
  – Find incentive structures..
    • ESP makes it a “game” with points
    • CBIOC and Google Co-op try to promise delayed gratification in terms of improved search later..

Information Extraction (IE)
• Identify specific pieces of information (data) in an unstructured or semi-structured textual document.
• Transform unstructured information in a corpus of documents or web pages into a structured database.
• Applied to different types of text:
  – Newspaper articles
  – Web pages
  – Scientific articles
  – Newsgroup messages
  – Classified ads
  – Medical notes
  – Wikipedia (info boxes)..

Information Extraction vs. NLP?
• Information extraction is attempting to find some of the structure and meaning in the (hopefully template-driven) web pages.
• As IE becomes more ambitious and text becomes more free-form, IE ultimately becomes equal to NLP.
• The Web does give one particular boost to NLP
  – Massive corpora..

MUC
• DARPA funded significant efforts in IE in the early to mid 1990s.
• Message Understanding Conference (MUC) was an annual event/competition where results were presented.
• Focused on extracting information from news articles:
  – Terrorist events
  – Industrial joint ventures
  – Company management changes
• Information extraction is of particular interest to the intelligence community (CIA, NSA).

Other Applications
• Job postings:
  – Newsgroups: Rapier from austin.jobs
  – Web pages: Flipdog
• Job resumes:
  – BurningGlass
  – Mohomine
• Seminar announcements
• Company information from the web
• Continuing education course info from the web
• University information from the web
• Apartment rental ads
• Molecular biology information from MEDLINE

Wikipedia Infoboxes..
• Wikipedia has both unstructured text and structured info boxes..
[Figure: a Wikipedia page with its infobox highlighted]

Sample Job Posting

Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigp$mrs@bilbo.reference.com>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
kimander@memphisonline.com

Extracted Job Template

computer_science_job
id: 56nigp$mrs@bilbo.reference.com
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC DOS OS-2 UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996

Amazon Book Description
….
The Age of Spiritual Machines: When Computers Exceed Human Intelligence
by Ray Kurzweil
List Price: $14.95
Our Price: $11.96
You Save: $2.99 (20%)

Extracted Book Template
Title: The Age of Spiritual Machines: When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:
:

Extraction from Templated Text
• Many web pages are generated automatically from an underlying database.
• Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).
• However, output is intended for human consumption, not machine interpretation.
• An IE system for such generated pages allows the web site to be viewed as a structured database.
• An extractor for a semi-structured web site is sometimes referred to as a wrapper.
• The process of extracting from such pages is sometimes referred to as screen scraping.

Templated Extraction using DOM Trees
• Web extraction may be aided by first parsing web pages into DOM trees.
• Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.
• May still need regex patterns to identify the proper portion of the final CharacterData node.

Sample DOM Tree Extraction
[Figure: DOM tree. The HTML element has children HEADER and BODY. BODY has a B child holding the character data "Age of Spiritual Machines" and a FONT child holding "by" followed by an A element holding "Ray Kurzweil".]
Title: HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData
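To make the path idea concrete, here is a minimal sketch using only Python's standard `html.parser`. The class name `PathExtractor`, the helper `extract`, and the sample page are hypothetical; the page is a guess at markup that would produce the tree on the slide, with `head` standing in for the slide's HEADER node. It keeps a stack of open tags and collects character data whenever the stack equals the target path.

```python
from html.parser import HTMLParser


class PathExtractor(HTMLParser):
    """Collect text whose enclosing tag path equals a target root-to-node path."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = list(target_path)
        self.stack = []      # open tags from the root down to the current node
        self.matches = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates unclosed inner tags).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack == self.target_path:
            self.matches.append(text)


def extract(html, path):
    parser = PathExtractor(path)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical page matching the slide's tree.
PAGE = """<html><head><title>Book</title></head>
<body><b>Age of Spiritual Machines</b>
<font>by <a href="#">Ray Kurzweil</a></font></body></html>"""

title = extract(PAGE, ["html", "body", "b"])
author = extract(PAGE, ["html", "body", "font", "a"])
```

Note that the word "by" sits directly under FONT, so the author path (which ends in A) skips it. Void elements such as `<br>` are not handled in this sketch.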

Template Types
• Slots in a template are typically filled by a substring from the document.
• Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
  – Terrorist act: threatened, attempted, accomplished.
  – Job type: clerical, service, custodial, etc.
  – Company type: SEC code
• Some slots may allow multiple fillers.
  – Programming language
• Some domains may allow multiple extracted templates per document.
  – Multiple apartment listings in one ad

Simple Extraction Patterns
• Specify an item to extract for a slot using a regular expression pattern.
  – Price pattern: “\b\$\d+(\.\d{2})?\b”
• May require a preceding (pre-filler) pattern to identify the proper context.
  – Amazon list price:
    • Pre-filler pattern: “List Price: ”
    • Filler pattern: “\$\d+(\.\d{2})?\b”
• May require a succeeding (post-filler) pattern to identify the end of the filler.
  – Amazon list price:
    • Pre-filler pattern: “List Price: ”
    • Filler pattern: “.+”
    • Post-filler pattern: “”
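The three pattern roles can be tried out with Python's `re` module. This is a sketch under assumptions: the sample text is the Amazon description with markup removed, the post-filler " You Save:" is a stand-in chosen for illustration (the slide's own post-filler did not survive), and the filler uses a lazy `.+?` so it stops at the first post-filler. The leading `\b` of the price pattern is dropped here because a word boundary cannot match between a space and `$`.

```python
import re

TEXT = "List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%)"

# Filler pattern alone: matches every price in the text.
PRICE = re.compile(r"\$\d+(?:\.\d{2})?\b")
all_prices = PRICE.findall(TEXT)

# Pre-filler + filler: only the price that follows "List Price: ".
LIST_PRICE = re.compile(r"List Price: (\$\d+(?:\.\d{2})?)\b")
list_price = LIST_PRICE.search(TEXT).group(1)

# Pre-filler + permissive filler + (assumed) post-filler:
# the post-filler is what bounds the otherwise unconstrained filler.
OUR_PRICE = re.compile(r"Our Price: (.+?) You Save:")
our_price = OUR_PRICE.search(TEXT).group(1)
```

The filler alone is ambiguous (three prices match), which is why the pre-filler context is needed to pick out one slot.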

Simple Template Extraction
• Extract slots in order, starting the search for the filler of the (n+1)th slot where the filler for the nth slot ended. Assumes slots always appear in a fixed order.
  – Title
  – Author
  – List price
  – …
• Make patterns specific enough to identify each filler always starting from the beginning of the document.
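The in-order strategy can be sketched as a loop that carries a search position from one slot to the next. The slot patterns and the flattened page text below are assumptions made for illustration, not patterns from the slides.

```python
import re

# Slots in their assumed fixed order; each pattern captures the filler in group 1.
SLOT_PATTERNS = [
    ("title", re.compile(r"(.+?) by ")),
    ("author", re.compile(r"by (.+?) List Price:")),
    ("list_price", re.compile(r"List Price: (\$\d+(?:\.\d{2})?)")),
]


def fill_template(text):
    template, pos = {}, 0
    for slot, pattern in SLOT_PATTERNS:
        match = pattern.search(text, pos)   # resume where the last filler ended
        if match is None:
            template[slot] = None
            continue
        template[slot] = match.group(1)
        pos = match.end(1)
    return template


TEXT = ("The Age of Spiritual Machines: When Computers Exceed Human Intelligence "
        "by Ray Kurzweil List Price: $14.95 Our Price: $11.96")
book = fill_template(TEXT)
```

Because each search starts at the end of the previous filler, a later slot can never match text that precedes an earlier one, which is exactly the fixed-order assumption the slide names.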

Pre-Specified Filler Extraction
• If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot.
  – Job category
  – Company type
• Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
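As one possible illustration, the sketch below fills a job-category slot with a small multinomial Naive Bayes text classifier. The slides do not say which categorizer to use, so Naive Bayes is a stand-in, and the training postings are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """examples: list of (category, document text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, text in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(model, text):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing; the extra +1 reserves mass for unseen words.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training documents, one category per slot value.
TRAIN = [
    ("clerical", "filing typing data entry office assistant"),
    ("clerical", "receptionist typing phone office"),
    ("custodial", "cleaning floors building maintenance janitor"),
    ("custodial", "janitor night shift cleaning"),
]
model = train_nb(TRAIN)
job_type = classify(model, "office typing position")
```

The filler ("clerical") never has to appear in the posting itself, which is what distinguishes this slot type from substring slots.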

Learning for IE
• Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.
• The alternative is to use machine learning:
  – Build a training set of documents paired with human-produced filled extraction templates.
  – Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Information Extraction from unstructured text

Information Extraction from Unstructured Text:
• Semantic web needs:
  – Tagged data
  – Background knowledge
• (Blue-sky approaches to) automate both
  – Knowledge Extraction
    • Extract base-level knowledge (“facts”) directly from the web
  – Automated tagging
    • Start with a background ontology and tag other web pages
      – Semtag/Seeker

Fielded IE Systems: Citeseer, Google Scholar; Libra
How do they do it? Why do they fail?

What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME    TITLE    ORGANIZATION

Slides from Cohen & McCallum

What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.
Running IE over the same article fills the table:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
Segments picked out of the same article: Microsoft Corporation; CEO; Bill Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation

What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
Segments (* as marked on the slide): * Microsoft Corporation; CEO; Bill Gates; * Microsoft; Bill Veghte; * Microsoft; VP; Richard Stallman; founder; Free Software Foundation

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

IE in Context
[Figure: pipeline diagram with the stages Create ontology, Spider, Filter by relevance, IE (Segment, Classify, Associate, Cluster), Load DB, Database, Query/Search and Data mine, supported by Document collection, Label training data and Train extraction models.]

IE History
Pre-Web
• Mostly news articles
  – De Jong’s FRUMP [1982]
    • Hand-built system to fill Schank-style “scripts” from news wire
  – Message Understanding Conference (MUC), DARPA [’87-’95], TIPSTER [’92-’96]
• Most early work dominated by hand-built models
  – E.g. SRI’s FASTUS, hand-built FSMs.
  – But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Web
• AAAI ’94 Spring Symposium on “Software Agents”
  – Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
• Tom Mitchell’s WebKB, ’96
  – Build KB’s from the Web.
• Wrapper Induction
  – First by hand, then ML: [Doorenbos ’96], [Soderland ’96], [Kushmerick ’97], …

What makes IE from the Web Different?
Less grammar, but more formatting & linking

Newswire:
Apple to Open Its First Retail Store in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.
"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

Web:
www.apple.com/retail
www.apple.com/retail/soho/theatre.html

The directory structure, link structure, formatting & layout of the Web is its own new grammar.

Landscape of IE Tasks (1/4): Pattern Feature Domain
• Text paragraphs without formatting
  Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables

Landscape of IE Tasks (2/4): Pattern Scope
• Web site specific (formatting): Amazon.com book pages
• Genre specific (layout): resumes
• Wide, non-specific (language): university names

Landscape of IE Tasks (3/4): Pattern Complexity
E.g. word patterns:
• Closed set: U.S. states
  – He was born in Alabama…
  – The big Wyoming sky…
• Regular set: U.S. phone numbers
  – Phone: (413) 545-1323
  – The CALD main office can be reached at 412-268-1299
• Complex pattern: U.S. postal addresses
  – University of Arkansas, P.O. Box 140, Hope, AR 71802
  – Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
• Ambiguous patterns, needing context + many sources of evidence: person names
  – …was among the six houses sold by Hope Feldman that year.
  – Pawel Opalinski, Software Engineer at WhizBang Labs.
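The two easiest levels can be sketched in a few lines: a closed set is a lexicon lookup and a regular set is a regular expression. The state list below is truncated and the phone pattern covers only the two formats shown on the slide; both are assumptions made for illustration.

```python
import re

# Closed set: membership in a lexicon (truncated here; single-word states only).
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Regular set: the two U.S. phone formats that appear on the slide.
PHONE = re.compile(r"\(?\b\d{3}\)?[ -]\d{3}-\d{4}\b")


def find_states(text):
    return [word for word in re.findall(r"[A-Za-z]+", text) if word in US_STATES]


def find_phones(text):
    return PHONE.findall(text)


states = find_states("He was born in Alabama. The big Wyoming sky.")
phones = find_phones("Phone: (413) 545-1323. The CALD main office "
                     "can be reached at 412-268-1299")
```

Neither trick carries over to the last row: "Hope" is a town in one example and a first name in the other, so person names need context rather than a list or a pattern.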

Landscape of IE Tasks (4/4): Pattern Combinations
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
• Single entity (“named entity” extraction)
  – Person: Jack Welch
  – Person: Jeffrey Immelt
  – Location: Connecticut
• Binary relationship
  – Relation: Person-Title; Person: Jack Welch; Title: CEO
  – Relation: Company-Location; Company: General Electric; Location: Connecticut
• N-ary record
  – Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt

Evaluation of Single Entity Extraction
TRUTH: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
(The segment highlighting that distinguished TRUTH from PRED on the slide is not preserved here.)

Precision = (# correctly predicted segments) / (# predicted segments) = 2/6
Recall = (# correctly predicted segments) / (# true segments) = 2/4
F1 = harmonic mean of Precision & Recall = 1 / (((1/P) + (1/R)) / 2)
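These segment-level metrics are easy to check numerically. In the sketch below a segment is a (start, end) token span and counts as correct only if it matches a true span exactly. The spans are hypothetical, chosen to reproduce the slide's counts of 2 correct out of 6 predicted and 4 true segments.

```python
def prf1(true_segments, predicted_segments):
    """Exact-match precision, recall and F1 over sets of (start, end) spans."""
    true_set, pred_set = set(true_segments), set(predicted_segments)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    if correct == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1


# Hypothetical token spans: four true names, six predictions, two exact matches.
TRUTH = [(0, 2), (3, 5), (11, 14), (15, 17)]
PRED = [(0, 2), (3, 5), (7, 8), (11, 12), (13, 14), (15, 16)]
precision, recall, f1 = prf1(TRUTH, PRED)
```

With P = 2/6 and R = 2/4 the harmonic mean works out to 0.4, lower than the arithmetic mean, because F1 penalizes the weaker of the two scores. A prediction that covers only part of a name, such as (11, 12), earns no credit under exact matching.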

State of the Art Performance
• Named entity recognition
  – Person, Location, Organization, …
  – F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
  – Contained-in (Location1, Location2), Member-of (Person1, Organization1)
  – F1 in 60’s or 70’s or 80’s
• Wrapper induction
  – Extremely accurate performance obtainable
  – Human effort (~30 min) required on each site

Landscape of IE Techniques (1/1): Models
Each model is illustrated on the sentence "Abraham Lincoln was born in Kentucky."
• Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify pre-segmented candidates: a classifier decides which class each candidate belongs to.
• Sliding window: a classifier decides which class each window belongs to; try alternate window sizes.
• Boundary models: classifiers mark BEGIN and END positions.
• Finite state machines: most likely state sequence?
• Context free grammars: most likely parse (NNP, V, P, NP, PP, VP, S)?
• …and beyond
Any of these models can be used to capture words, formatting or both.

Sliding Windows

Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

(CMU UseNet Seminar Announcement)

A “Naïve Bayes” Sliding Window Model [Freitag 1997]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w(t-m) … w(t-1)    contents: w(t) … w(t+n)    suffix: w(t+n+1) … w(t+n+m)

• Estimate Pr(LOCATION | window) using Bayes rule
• Try all “reasonable” windows (vary length, position)
• Assume independence for length, prefix words, suffix words, content words
• Estimate from data quantities like: Pr(“Place” in prefix | LOCATION)
• If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.

Other examples of sliding window: [Baluja et al 2000] (decision tree over individual words & their context)
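The independence assumption can be sketched as a product of separately estimated factors for length, prefix words, content words and suffix words. The code below is a simplified illustration, not Freitag's implementation: the class name `NBWindowScorer`, the add-one smoothing, the constant `MAX_LEN` and the toy announcements are all assumptions. It scores only the likelihood Pr(window | LOCATION) in log space, so it ranks windows of the same shape; comparing windows of different shapes, or applying the threshold from the slide, would need normalization that is not shown.

```python
import math
from collections import Counter

MAX_LEN = 10   # assumed number of possible field lengths, used for smoothing


class NBWindowScorer:
    def __init__(self, examples, m=2):
        """examples: (tokens, start, end) triples; tokens[start:end] is a true field."""
        self.m = m
        self.length = Counter()
        self.parts = {"prefix": Counter(), "contents": Counter(), "suffix": Counter()}
        vocab = set()
        for tokens, start, end in examples:
            vocab.update(tokens)
            self.length[end - start] += 1
            for part, words in self._split(tokens, start, end).items():
                self.parts[part].update(words)
        self.vocab_size = len(vocab) + 1   # +1 bucket for unseen words

    def _split(self, tokens, start, end):
        return {
            "prefix": tokens[max(0, start - self.m):start],
            "contents": tokens[start:end],
            "suffix": tokens[end:end + self.m],
        }

    @staticmethod
    def _log_p(counter, key, n_outcomes):
        # Add-one smoothed relative frequency, in log space.
        return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

    def score(self, tokens, start, end):
        total = self._log_p(self.length, end - start, MAX_LEN)
        for part, words in self._split(tokens, start, end).items():
            for word in words:
                total += self._log_p(self.parts[part], word, self.vocab_size)
        return total


# Toy training announcements (invented) with the location span marked.
EX1 = "Time : 3:30 pm Place : Wean Hall 7500 Speaker : Jaime Carbonell".split()
EX2 = "Time : 4:00 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
scorer = NBWindowScorer([(EX1, 6, 9), (EX2, 6, 10)])

NEW = "Time : 1:00 pm Place : Doherty Hall 2315 Speaker : Tom Mitchell".split()
# Candidate windows of length 3 that have a full prefix and suffix.
candidates = [(s, s + 3) for s in range(2, len(NEW) - 4)]
best = max(candidates, key=lambda w: scorer.score(NEW, *w))
```

Even though "Doherty" and "2315" never occur in training, the window still wins because "Place :" in the prefix and "Speaker :" in the suffix are strong evidence, which is the role of quantities like Pr("Place" in prefix | LOCATION).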

“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements

Field          F1
Person Name:   30%
Location:      61%
Start Time:    98%

Realistic sliding-window-classifier IE • What windows to consider? – all windows containing as many tokens as the shortest example, but no more tokens than the longest example • How to represent a classifier? It might: – Restrict the length of window; – Restrict the vocabulary or formatting used before/after/inside window; – Restrict the relative order of tokens, etc. • Learning Method – SRV: Top-Down Rule Learning [Freitag, AAAI ’98] – Rapier: Bottom-Up [Califf & Mooney, AAAI ’99] 65
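The "what windows to consider" rule can be sketched as a small enumeration. The function name and the half-open `(start, end)` convention are assumptions made for this illustration.

```python
def candidate_windows(n_tokens, example_lengths):
    """Enumerate (start, end) windows no shorter than the shortest training
    example and no longer than the longest one (end is exclusive)."""
    shortest, longest = min(example_lengths), max(example_lengths)
    return [(start, start + length)
            for length in range(shortest, longest + 1)
            for start in range(n_tokens - length + 1)]

# A 5-token document, with training fields of 2, 3, and 2 tokens
windows = candidate_windows(5, [2, 3, 2])
print(windows)
```

Each candidate window would then be handed to the classifier; the enumeration itself does not depend on which learner (SRV, Rapier, or another) is used.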

Rapier: results – precision/recall 66

Rule-learning approaches to sliding-window classification: Summary • SRV, Rapier, and WHISK [Soderland, KDD ’97] – Representations for classifiers allow restriction of the relationships between tokens, etc. – Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog) – Use of these “heavyweight” representations is complicated, but seems to pay off in results • Can simpler representations for classifiers work? 68

BWI: Learning to detect boundaries [Freitag & Kushmerick, AAAI 2000] • Another formulation: learn three probabilistic classifiers: – START(i) = Prob(position i starts a field) – END(j) = Prob(position j ends a field) – LEN(k) = Prob(an extracted field has length k) • Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j-i) • LEN(k) is estimated from a histogram 69
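The scoring rule START(i) * END(j) * LEN(j-i) can be sketched as below. The probability lists and the length histogram are made-up inputs, and positions are assumed to be boundaries between tokens so that j - i is the field length; the boosted detectors that would produce START and END are not shown.

```python
def best_extraction(start_prob, end_prob, length_hist):
    """Return (i, j, score) maximizing START(i) * END(j) * LEN(j - i)."""
    total = sum(length_hist.values())
    best = (None, None, 0.0)
    for i, p_start in enumerate(start_prob):
        for j, p_end in enumerate(end_prob):
            if j <= i:
                continue  # a field must end after it starts
            # LEN(k) is the relative frequency of length k in the histogram
            p_len = length_hist.get(j - i, 0) / total
            score = p_start * p_end * p_len
            if score > best[2]:
                best = (i, j, score)
    return best

start_prob = [0.1, 0.9, 0.1, 0.1, 0.1]   # hypothetical START(i) values
end_prob = [0.1, 0.1, 0.1, 0.8, 0.2]     # hypothetical END(j) values
length_hist = {2: 8, 3: 2}               # 8 training fields of length 2, 2 of length 3

i, j, score = best_extraction(start_prob, end_prob, length_hist)
print(i, j, round(score, 3))
```

Multiplying in LEN lets a strong start and a strong end be rejected when the span between them has a length rarely seen in training.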

BWI: Learning to detect boundaries • BWI uses boosting to find “detectors” for START and END • Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i). • Each “pattern” is a sequence of – tokens and/or – wildcards like: anyAlphabeticToken, anyNumber, … • Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns 70
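How a single weak detector might be applied can be sketched as follows. The wildcard names come from the slide, but their definitions via `str.isalpha` and `str.isdigit` are assumptions, and neither the boosting loop nor the greedy pattern search is shown.

```python
# Assumed meanings for the two wildcards named on the slide
WILDCARDS = {
    "anyAlphabeticToken": str.isalpha,
    "anyNumber": str.isdigit,
}

def item_matches(pattern_item, token):
    """A pattern item is either a wildcard name or a literal token."""
    test = WILDCARDS.get(pattern_item)
    return test(token) if test else pattern_item == token

def detector_fires(tokens, i, before, after):
    """True if BEFORE matches the tokens ending at position i
    and AFTER matches the tokens starting at position i."""
    if i - len(before) < 0 or i + len(after) > len(tokens):
        return False
    left = tokens[i - len(before):i]
    right = tokens[i:i + len(after)]
    return (all(item_matches(p, t) for p, t in zip(before, left))
            and all(item_matches(p, t) for p, t in zip(after, right)))

tokens = ["Place", ":", "Wean", "Hall", "Rm", "5409"]
before, after = ["Place", ":"], ["anyAlphabeticToken"]
fired = [i for i in range(len(tokens) + 1) if detector_fires(tokens, i, before, after)]
print(fired)
```

In this toy case the detector marks position 2, the boundary just before "Wean", as a candidate field start.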

BWI: Learning to detect boundaries. F1 by field: Person Name 30%, Location 61%, Start Time 98%. 71

Problems with Sliding Windows and Boundary Finders • Decisions in neighboring parts of the input are made independently from each other. – Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”. – It is possible for two overlapping windows to both be above threshold. – In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step. Solution? Joint inference… 72

More Ambitious (Blue Sky) Approaches • The information extraction tasks in fielded applications like Citeseer/Libra are narrowly focused – We assume that we are learning specific relations (e.g. author/title etc.) – We assume that the extracted relations will be put in a database for db-style look-up • Semantic web needs: – Tagged data – Background knowledge • (blue sky approaches to) automate both – Knowledge Extraction • Extract base level knowledge (“facts”) directly from the web – Automated tagging • Start with a background ontology and tag other web pages – Semtag/Seeker Let’s look at state of the feasible art before going to blue-sky... 82

Extraction from Free Text involves Natural Language Processing • If extracting from automatically generated web pages, simple regex patterns usually work. • If extracting from more natural, unstructured, human-written text, some NLP may help. – Part-of-speech (POS) tagging • Mark each word as a noun, verb, preposition, etc. – Syntactic parsing • Identify phrases: NP, VP, PP – Semantic word categories (e.g. from WordNet) • KILL: kill, murder, assassinate, strangle, suffocate • Off-the-shelf software available to do this! – The “Brill” tagger • Extraction patterns can use POS or phrase tags. Analogy to regex patterns on DOM trees for structured text. 83

I. Generate-n-Test Architecture. Generic extraction patterns (Hearst ’92): • “…Cities such as Boston, Los Angeles, and Seattle…” (“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C), … Template Driven Extraction (where the template is in terms of the Syntax Tree) • “Detailed information for several countries such as maps, …” ProperNoun(head(NP)) • “I listen to pretty much all music but prefer country such as Garth Brooks” 84
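The “C such as NP1, NP2, and NP3” pattern can be approximated with a regular expression. This is only a rough sketch: the slide defines the template over a syntax tree, whereas here a noun phrase is assumed to be a run of capitalized words, which also stands in crudely for the ProperNoun(head(NP)) constraint.

```python
import re

# Assumption: a proper-noun phrase is one or more capitalized words
NP = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
SEPARATOR = r", and |, | and "
SUCH_AS = re.compile(rf"(\w+) such as ({NP}(?:(?:{SEPARATOR}){NP})*)")

def extract_is_a(text):
    """Return (instance, class) pairs for every 'C such as NP, ...' match."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        class_name, np_list = match.group(1), match.group(2)
        for instance in re.split(SEPARATOR, np_list):
            pairs.append((instance, class_name))
    return pairs

cities = extract_is_a("Cities such as Boston, Los Angeles, and Seattle are large.")
maps = extract_is_a("Detailed information for several countries such as maps")
music = extract_is_a("I listen to pretty much all music but prefer country such as Garth Brooks")
print(cities, maps, music)
```

The capitalization requirement filters out "maps", but the third sentence still yields a pair relating Garth Brooks to "country", which is the kind of candidate the later assessment step is meant to weed out.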

Test: Assess candidate extractions using Mutual Information (PMI-IR) (Turney ’01). Many variations are possible… 85

...but many things indicate “city”ness. Discriminator phrases f_i: “x is a city”, “x has a population of”, “x is the capital of y”, “x’s baseball team…” • PMI = frequency of I & D co-occurrence • 5-50 discriminators D_i • Each PMI for D_i is a feature f_i • Naïve Bayes evidence combination: Keep the probabilities with the extracted facts. PMI is used for feature selection. NBC is used for learning. Hits used for assessing PMI as well as conditional probabilities. 86

Assessment In Action 1. I = “Yakima” (1,340,000) 2. D = “city” 3. I+D = “Yakima city” (2,760) 4. PMI = (2760 / 1.34M) ≈ 0.002 • I = “Avocado” (1,000,000) • I+D = “Avocado city” (10) PMI = 0.00001 << 0.002 87
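The arithmetic on this slide can be checked with a few lines. The sketch assumes PMI is simply the hit count for instance plus discriminator divided by the hit count for the instance alone, and it assumes 1,000,000 hits for “Avocado”, the value consistent with the slide's PMI of 0.00001 for 10 co-occurrence hits.

```python
def pmi(hits_instance_and_discriminator, hits_instance):
    """Hit-count ratio used as the PMI score for one discriminator."""
    return hits_instance_and_discriminator / hits_instance

yakima = pmi(2_760, 1_340_000)   # "Yakima city" hits / "Yakima" hits
avocado = pmi(10, 1_000_000)     # "Avocado city" hits / assumed "Avocado" hits

print(f"Yakima:  {yakima:.5f}")
print(f"Avocado: {avocado:.5f}")
print(f"ratio:   {yakima / avocado:.0f}x")
```

With these counts the Yakima score works out to about 0.002, roughly two hundred times the Avocado score, so the "city" discriminator separates the two candidates clearly.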

Some Sources of ambiguity • Time: “Clinton is the president” (in 1996). • Context: “common misconceptions...” • Opinion: Elvis… • Multiple word senses: Amazon, Chicago, Chevy Chase, etc. – Dominant senses can mask recessive ones! – Approach: unmasking. ‘Chicago –City’ 88

Chicago City Movie 89

Chicago Unmasked: City sense, Movie sense 90

Impact of Unmasking on PMI. Recessive senses: city, actor, movie. Name (Original, Unmask, Boost): Washington (0.50, 0.99, 96%); Casablanca (0.41, 0.93, 127%); Chevy Chase (0.09, 0.58, 512%); Chicago (0.02, 0.21, 972%). 91

CBioC: Collaborative BioCuration • Motivation – To help get information nuggets out of articles and abstracts and store them in a database. – The challenge is that the number of articles is huge and keeps growing, and that natural language needs to be processed. – The two existing approaches • human curation and • use of automatic information extraction systems – They are not able to meet the challenge, as the first is expensive, while the second is error-prone. 92

CBioC (cont’d) • Approach: We propose a solution that is inexpensive, and that scales up. – Our approach takes advantage of automatic information extraction methods as a starting point. – Based on the premise that if there are a lot of articles, then there must be a lot of readers and authors of these articles. – We provide a mechanism by which the readers of the articles can participate and collaborate in the curation of information. – We refer to our approach as “Collaborative Curation”. 93

Using the C-BioCurator System (cont’d) 94

What is the main difference between KnowItAll and CBioC? Assessment: KnowItAll does it by hits; CBioC by voting. 95

Annotation “The Chicago Bulls announced yesterday that Michael Jordan will...” “The Chicago Bulls announced yesterday that Michael Jordan will...” 96

Semantic Annotation: Named Entity Identification. The simplest task of metadata extraction on NL is to establish a “type” relation between entities in the NL resources and concepts in ontologies. Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt 97

Semantics • Semantic Annotation - The content of annotation consists of some rich semantic information - Targeted not only at human readers of resources but also at software agents - formal: metadata following structural standards; informal: personal notes written in the margin while reading an article - explicit: carries sufficient information for interpretation; tacit: many personal annotations (telegraphic and incomplete) http://www-scf.usc.edu/~csci586/slides/6 98

Uses of Annotation http://www-scf.usc.edu/~csci586/slides/8 99

Objectives of Annotation • Generate Metadata for existing information – e.g., author-tag in HTML – RDF descriptions to HTML – Content description to Multimedia files • Employ metadata for – Improved search – Navigation – Presentation – Summarization of contents http://www.aifb.uni-karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf 100

Annotation. Current practice of annotation for knowledge identification and extraction: is time consuming; needs annotation by experts; is complex. Goal: reduce burden of text annotation for Knowledge Management. www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt 101

SemTag & Seeker • WWW-03 Best Paper Prize • Seeded with TAP ontology (72k concepts) – And ~700 human judgments • Crawled 264 million web pages • Extracted 434 million semantic tags – Automatically disambiguated 102

SemTag • Uses broad, shallow knowledge base • TAP – lexical and taxonomic information about popular objects – Music – Movies – Sports – Etc. 104

SemTag • Problem: – No write access to original document, so how do you annotate? • Solution: – Store annotations in a web-available database 105

SemTag • Semantic Label Bureau – Separate store of semantic annotation information – HTTP server that can be queried for annotation information – Example • Find all semantic tags for a given document • Find all semantic tags for a particular object 106

SemTag • Methodology 107

SemTag • Three phases 1. Spotting Pass: – Tokenize the document – All instances plus 20 word window 2. Learning Pass: – Find corpus-wide distribution of terms at each internal node of taxonomy – Based on a representative sample 3. Tagging Pass: – Scan windows to disambiguate each reference – Finally determined to be a TAP object 108

SemTag • Solution: – Taxonomy Based Disambiguation (TBD) • TBD expectation: – Human tuned parameters used in small, critical sections – Automated approaches deal with bulk of information 110

SemTag • TBD methodology: – Each node in the taxonomy is associated with a set of labels • Cats, Football, Cars all contain “jaguar” – Each label in the text is stored with a window of 20 words (the context) – Each node has an associated similarity function mapping a context to a similarity • Higher similarity: more likely to contain a reference 111

SemTag • Similarity: – Built a 200,000 word lexicon (the 200,100 most common words minus the 100 most common) – 200,000 dimensional vector space – Training: spots (label, context) and correct node – Estimated the distribution of terms for nodes – Standard cosine similarity – TFIDF vectors (context vs. node) 112
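The context-versus-node comparison can be sketched with TF-IDF vectors and cosine similarity. The node term lists, document frequencies, and corpus size below are invented for illustration; SemTag's actual lexicon and weighting details are not given on the slide.

```python
import math
from collections import Counter

def tfidf(tokens, doc_freq, n_docs):
    """Sparse TF-IDF vector as a dict; words outside the lexicon are ignored."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items() if term in doc_freq}

def cosine(u, v):
    """Standard cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical document frequencies over a 100-document sample
N_DOCS = 100
DOC_FREQ = {"the": 100, "jaguar": 20, "engine": 5, "car": 10, "speed": 10,
            "dealer": 5, "cat": 10, "jungle": 5, "prey": 5, "animal": 10}

# Hypothetical term profiles for two taxonomy nodes that both carry the label "jaguar"
nodes = {
    "Cars": tfidf(["engine", "car", "speed", "dealer"], DOC_FREQ, N_DOCS),
    "Cats": tfidf(["cat", "jungle", "prey", "animal"], DOC_FREQ, N_DOCS),
}

context = tfidf(["the", "jaguar", "engine", "dealer"], DOC_FREQ, N_DOCS)
similarities = {name: cosine(context, vector) for name, vector in nodes.items()}
best_node = max(similarities, key=similarities.get)
print(best_node, similarities)
```

Here the context around "jaguar" shares weighted terms only with the Cars profile, so that node receives the higher similarity.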

SemTag • Some internal nodes very popular: – Associate a measurement of how accurate Sim is likely to be at a node – Also, how ambiguous the node is overall (consistency of human judgment) • TBD Algorithm: returns 1 or 0 to indicate whether a particular context c is on topic for a node v • 82% accuracy on 434 million spots 114

Summary • Information extraction can be motivated either as explicating more structure from the data or as an automated route to the Semantic Web • Extraction complexity depends on whether the text you have is “templated” or “free-form” – Extraction from templated text can be done by regular expressions – Extraction from free-form text requires NLP • Can be done in terms of part-of-speech tagging • “Annotation” involves connecting terms in a free-form text to items in the background knowledge – It too can be automated 116