Scalable Information Extraction and Integration
Eugene Agichtein (Microsoft Research, Emory University)
Sunita Sarawagi (IIT Bombay)

The Value of Text Data
- "Unstructured" text data is the primary source of human-generated information
  - Citeseer, comparison shopping, PIM systems, web search, data warehousing
- Managing and utilizing text: information extraction and integration
- Scalability: a bottleneck for deployment
- Relevance to the data mining community
August 2006 Agichtein and Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Example: Answering Queries Over Text
Text: For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted table PEOPLE:
  Name               Title     Organization
  Bill Gates         CEO       Microsoft
  Bill Veghte        VP        Microsoft
  Richard Stallman   Founder   Free Soft..
Query: Select Name From PEOPLE Where Organization = 'Microsoft'
Result: Bill Gates, Bill Veghte
(from William Cohen's IE tutorial, 2003)

Managing Unstructured Text Data
- Information Extraction from text
  - Identify instances of entities and relationships
  - Main approaches and architectures
  - Scaling up to large collections of documents (e.g., web)
- Information Integration
  - Represent information in text data in a structured form
  - Combine/resolve/clean information about entities
  - Entity Resolution & Deduplication
  - Scaling Up: Batch mode/algorithmic issues
- Connections between Information Extraction and Integration
  - Coreference Resolution
  - Deriving values from multiple sources
  - (Web) Question Answering

Part I: Tutorial Outline
- Overview of Information Extraction
  - Entity tagging
  - Relation extraction
- Scaling up Information Extraction
  - Focus on scaling up to large collections (where data mining and ML techniques shine)
  - Other dimensions of scalability

Information Extraction Components

Information Extraction Tasks
- Extracting entities and relations: this tutorial
  - Entities: named (e.g., Person) and generic (e.g., disease name)
  - Relations: entities related in a predefined way (e.g., Location of a Disease outbreak)
- Common extraction subtasks:
  - Preprocessing: sentence chunking, syntactic parsing, morphological analysis
  - Creating rules or extraction patterns: manual, machine learning, and hybrid
  - Applying extraction patterns to extract new information
- Postprocessing and complex extraction: not covered
  - Co-reference resolution
  - Combining Relations into Events and Facts

Related Tutorials
- Previous information extraction tutorials: consult for more details
  - R. Feldman, Information Extraction – Theory and Practice, ICML 2006, http://www.cs.biu.ac.il/~feldman/icml_tutorial.html
  - W. Cohen, A. McCallum, Information Extraction and Integration: an Overview, KDD 2003, http://www.cs.cmu.edu/~wcohen/ie-survey.ppt
  - A. Doan, R. Ramakrishnan, S. Vaithyanathan, Managing Information Extraction, SIGMOD'06
  - N. Koudas, D. Srivastava, S. Sarawagi, Record Linkage: Similarity Measures and Algorithms, SIGMOD 2006

Entity Tagging
- Identifying mentions of entities (e.g., person names, locations, companies) in text
  - MUC (1997): Person, Location, Organization, Date/Time/Currency
  - ACE (2005): more than 100 more specific types
- Hand-coded vs. Machine Learning approaches
- Best approach depends on entity type and domain:
  - Closed class (e.g., geographical locations, disease names, gene & protein names): hand coded + dictionaries
  - Syntactic (e.g., phone numbers, zipcodes): regexes
  - Others (e.g., person and company names): mixture of context, syntactic features, dictionaries, heuristics, etc.
  - "Almost solved" for common/typical entity types
- Non-syntactic entities are computationally expensive to tag

Example: Extracting Entities from Text
- Useful for data warehousing, data cleaning, web data integration
Address: 4089 Whispering Pines Nobel Drive San Diego CA 92122
  House number   Building           Road          City        State   Zip
  4089           Whispering Pines   Nobel Drive   San Diego   CA      92122
Citation: Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002
  Segment (si)   Sequence                                            Label(si)
  S1             Ronald Fagin                                        Author
  S2             Combining Fuzzy Information from Multiple Systems   Title
  S3             Proc. of ACM SIGMOD                                 Conference
  S4             2002                                                Year

Hand-Coded Methods
- Easy to construct in many cases
  - e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
- Easier to debug & maintain
  - Especially if written in a "high-level" language (as is usually the case), e.g., [From Avatar]:
    ContactPattern ← RegularExpression(Email.body, "can be reached at")
    PersonPhone ← Precedes(Person, Precedes(ContactPattern, Phone, D), D)
- Easier to incorporate / reuse domain knowledge
- Can be quite labor intensive to write

Example of Hand-Coded Entity Tagger [Ramakrishnan. G, 2005, Slides from Doan et al., SIGMOD 2006]
Rule 1: This rule will find person names with a salutation (e.g., Dr. Laura Haas) and two capitalized words
  <token>INITIAL</token> <token>DOT</token> <token>CAPSWORD</token>
Rule 2: This rule will find person names where two capitalized words are present in a Person dictionary
  <token>PERSONDICT, CAPSWORD</token> <token>PERSONDICT, CAPSWORD</token>
CAPSWORD: Word starting with uppercase, second letter lowercase. E.g., DeWitt will satisfy it (DEWITT will not)
  \p{Upper}\p{Lower}[\p{Alpha}]{1,25}
DOT: The character '.'
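
The CAPSWORD definition and Rule 2 above can be sketched in Python as follows. This is an illustrative approximation, not the rule engine used in the cited slides: Python's standard `re` module has no `\p{...}` classes, so ASCII ranges stand in for them, and the function names and the sample dictionary are assumptions made for the example.

```python
import re

# ASCII approximation of \p{Upper}\p{Lower}[\p{Alpha}]{1,25}
CAPSWORD = re.compile(r"[A-Z][a-z][A-Za-z]{1,25}")


def is_capsword(token):
    """True if the whole token has the CAPSWORD shape."""
    return CAPSWORD.fullmatch(token) is not None


def rule2_person_names(tokens, person_dict):
    """Rule 2: two adjacent CAPSWORD tokens that both occur in the dictionary."""
    names = []
    for first, second in zip(tokens, tokens[1:]):
        if (is_capsword(first) and is_capsword(second)
                and first in person_dict and second in person_dict):
            names.append(f"{first} {second}")
    return names
```

For instance, `rule2_person_names("We met Laura Haas yesterday".split(), {"Laura", "Haas"})` picks out the single adjacent pair that satisfies both conditions.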

Hand Coded Rule Example: Conference Name
    # These are subordinate patterns.
    # qr// is used instead of double-quoted strings so that escapes such as \d and \s survive.
    my $wordOrdinals = qr/(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fifteenth)/;
    my $numberOrdinals = qr/(?:\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))/;
    my $ordinals = qr/(?:$wordOrdinals|$numberOrdinals)/;
    my $confTypes = qr/(?:Conference|Workshop|Symposium)/;
    # A word starting with a capital letter and ending with 0 or more spaces
    my $words = qr/(?:[A-Z]\w+\s*)/;
    # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
    my $confDescriptors = qr/(?:international\s+|[A-Z]+\s+)/;
    my $connectors = qr/(?:on|of)/;
    # Conference abbreviations like "(SIGMOD…"
    my $abbreviations = qr/(?:\([A-Z]\w\w+[\W\s]*?(?:\d\d+)?\))/;
    # The actual pattern we search for. A typical conference name this pattern will find is
    # "3rd International Conference on Blah (ICBBB-05)"
    my $fullNamePattern = qr/((?:$ordinals\s+$words*|$confDescriptors)?$confTypes(?:\s+$connectors\s+.*?|\s+)?$abbreviations?)(?:\n|\r|\.|<)/;
    ################################
    # Given a <dbworldMessage>, look for the conference pattern
    ################################
    lookForPattern($dbworldMessage, $fullNamePattern);
    #############################
    # In a given <file>, look for occurrences of <pattern>
    # <pattern> is a regular expression
    #############################
    sub lookForPattern {
        my ($file, $pattern) = @_;
        ...  # the remainder of the subroutine is not shown on the slide
    }

Gene & Protein Tagger: AliBaba
- Extract gene names from PubMed abstracts
- Use classifier (Support Vector Machine, SVM)
- Pipeline (diagram): Tokenized Training Corpus, Vector Generator, SVMlight, SVM Model driven Tagger, Post Processor; input: New Text; output: Tagged Text
- Corpus of 7500 sentences
  - 140,000 non-gene words
  - 60,000 gene names
- SVMlight on different feature sets
- Dictionary compiled from Genbank, HUGO, MGD, YDB
- Post-processing for compound gene names

Some Hand Coded Entity Taggers
- FRUMP [DeJong 82]
- CIRCUS / AutoSlog [Riloff 93]
- SRI FASTUS [Appelt, 1996]
- MITRE Alembic (available for use)
- Alias-I LingPipe (available for use)
- OSMX [Embley, 2005]
- DBLife [Doan et al, 2006]
- Avatar [Jayram et al, 2006]

Machine Learning Methods
- Can work well when training data is easy to construct and is plentiful
- Can capture complex patterns that are hard to encode with hand-crafted rules
  - e.g., determine whether a review is positive or negative
  - extract long complex gene names [From AliBaba]: "The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300."
- Can be labor intensive to construct training data
  - Question: how much training data is sufficient?

Popular Machine Learning Methods for IE
- Naive Bayes
- SRV [Freitag-98], Inductive Logic Programming
- Rapier [Califf & Mooney-97]
- Hidden Markov Models [Leek, 1997]
- Maximum Entropy Markov Models [McCallum et al, 2000]
- Conditional Random Fields [Lafferty et al, 2001]
  - Implementations available:
    - Mallet (Andrew McCallum)
    - crf.sourceforge.net (Sunita Sarawagi)
    - MinorThird, minorthird.sourceforge.net (William Cohen)
- For details: [Feldman, 2006 and Cohen, 2004]

Example of State-based ML Method

Extracted Entities: Resolving Duplicates [From Li, Morie, & Roth, AI Magazine, 2005]
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, "I do not speak for my church on public matters, and the church does not speak for me."
Document 3: David Kennedy was born in Leicester, England in 1959. …Kennedy coedited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).

Important Problem, Addressed in Part II
- Appears in numerous real-world contexts
- Plagues many applications
  - Citeseer, DBLife, AliBaba, Rexa, etc.

Outline
- Overview of Information Extraction
  - Entity tagging
  - Relation extraction
- Scaling up Information Extraction
  - Focus on scaling up to large collections (where data mining and ML techniques shine)
  - Other dimensions of scalability

Relation Extraction: Disease Outbreaks
- Extract structured relations from text
Text: May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…
Information Extraction System (e.g., NYU's Proteus)
Disease Outbreaks in The New York Times:
  Date        Disease Name      Location
  Jan. 1995   Malaria           Ethiopia
  July 1995   Mad Cow Disease   U.K.
  Feb. 1995   Pneumonia         U.S.
  May 1995    Ebola             Zaire

Example: Protein Interactions
"We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex."
Extracted interactions (diagram): CBF-A and CBF-C interact, forming the CBF-A-CBF-C complex; CBF-B associates with the CBF-A-CBF-C complex

Relation Extraction
- Typically requires Entity Tagging as preprocessing
- Knowledge Engineering
  - Rules defined over lexical items: "<company> located in <location>"
  - Rules defined over parsed text: "((Obj <company>) (Verb located) (*) (Subj <location>))"
  - Proteus, GATE, …
- Machine Learning-based
  - Learn rules/patterns from examples: Dan Roth 2005, Cardie 2006, Mooney 2005, …
  - Partially-supervised: bootstrap from "seed" examples: Agichtein & Gravano 2000, Etzioni et al., 2004, …
- Recently, hybrid models [Feldman 2004, 2006]

Example Extraction Rule [NYU Proteus]

Example Extraction Patterns: Snowball [AG 2000]
  ORGANIZATION {<'s 0.7> <in 0.7> <headquarters 0.7>} LOCATION
  LOCATION {<- 0.75> <based 0.75>} ORGANIZATION

Accuracy of Information Extraction [Feldman, ICML 2006 tutorial]
- Errors cascade (error in entity tag → error in relation extraction)
- This estimate is optimistic:
  - Holds for well-established tasks
  - Many specific/novel IE tasks exhibit lower accuracy

Outline
- Overview of Information Extraction
  - Entity tagging
  - Relation extraction
- Scaling up Information Extraction
  - Focus on scaling up to large collections (where data mining and ML techniques shine)
  - Other dimensions of scalability

Dimensions of Scalability
- Efficiency/corpus size
  - Years to process a large collection (centuries for the Web)
- Heterogeneity/diversity of information sources
  - Requires many rules (expensive to apply)
  - Many sources/conventions (expensive to maintain rules)
- Accessing required documents
  - Hidden Web databases are not crawlable
- Number of Extraction Tasks (not covered)
  - Many patterns/rules to develop and maintain
  - Open research area

Scaling Up Information Extraction
- Scan-based extraction
  - Classification/filtering to avoid processing documents
  - Sharing common tags/annotations
- General (keyword) index-based techniques
  - QXtract, KnowItAll
- Specialized indexes
  - BE/KnowItNow, Linguist's Search Engine
- Parallelization/Adaptive Processing
  - IBM WebFountain, Google's Map/Reduce
- Application: Question Answering
  - AskMSR, Aranea, Mulder

Scan
Text Database → Extraction System → Output Tokens
1. Retrieve docs from database
2. Process documents
3. Extract output tokens
- Scan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| · (R + P)
  R: time for retrieving a document
  P: time for processing a document

Efficient Scanning for Information Extraction
- 80/20 rule: use few simple rules to capture majority of the cases [PRH 2004]
- Train a classifier to discard irrelevant documents without processing [GHY 2002]
- Share base annotations (entity tags) across multiple tasks

Filtered Scan
Text Database → Classifier → Extraction System → Output Tokens
1. Retrieve docs from database
2. Filter documents
3. Process filtered documents
4. Extract output tokens
- Scan retrieves and processes all documents (until reaching target recall)
- Filtered Scan uses a classifier to identify and process only promising documents (e.g., the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| · (R + F + σ · P)
  R: time for retrieving a document
  F: time for filtering a document
  P: time for processing a document
  σ: classifier selectivity (σ ≤ 1)
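
The two execution-time models (Scan and Filtered Scan) can be compared with a small calculator. This is a minimal sketch: the function and parameter names are chosen for illustration, and the timings in the usage note are made-up inputs rather than measurements from the tutorial.

```python
def scan_time(num_retrieved, r, p):
    """Scan: every retrieved document is both retrieved (r) and processed (p)."""
    return num_retrieved * (r + p)


def filtered_scan_time(num_retrieved, r, f, p, selectivity):
    """Filtered Scan: every document is retrieved (r) and filtered (f);
    only the fraction `selectivity` that passes the classifier is processed (p)."""
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must lie in [0, 1]")
    return num_retrieved * (r + f + selectivity * p)
```

With hypothetical costs r = 0.5 s, f = 0.25 s, p = 2 s over 1000 documents, filtering pays off whenever σ · P + F < P; at σ = 0.25 the filtered scan takes half the time of the plain scan under these assumed numbers.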

Exploiting Keyword and Phrase Indexes
- Generate queries to retrieve only relevant documents
- Data mining problem!
- Some methods in literature:
  - Traversing Query Graphs [AIG 2003]
  - Iteratively refine queries [AG 2003]
  - Iteratively partition document space [Etzioni et al., WWW 2004]
- Case studies: QXtract, KnowItAll

Simple Strategy: Iterative Set Expansion
Text Database → Extraction System → Output Tokens → Query Generation → Text Database
1. Query database with seed tokens (e.g., [Ebola AND Zaire])
2. Process retrieved documents
3. Extract tokens from docs (e.g., <Malaria, Ethiopia>)
4. Augment seed tokens with new tokens
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
  R: time for retrieving a document
  P: time for processing a document
  Q: time for answering a query
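
The four steps above can be sketched as a simple worklist loop. This is an illustrative toy, not the implementation evaluated in the cited work: `search` and `extract` are hypothetical callbacks standing in for the text database's query interface and the extraction system, and `TOY_DOCS` is invented data.

```python
from collections import deque

# Invented toy corpus: document id -> tokens (tuples) an extractor would find.
TOY_DOCS = {
    "d1": {("Ebola", "Zaire"), ("Malaria", "Ethiopia")},
    "d2": {("Malaria", "Ethiopia"), ("SARS", "China")},
    "d3": {("Cholera", "Sudan")},
}


def toy_search(token):
    """Stand-in for a keyword query: ids of documents mentioning the token."""
    return sorted(d for d, tokens in TOY_DOCS.items() if token in tokens)


def toy_extract(doc_id):
    """Stand-in for the extraction system run on one document."""
    return TOY_DOCS[doc_id]


def iterative_set_expansion(seeds, search, extract, max_queries=100):
    tokens, docs, queries = set(seeds), set(), 0
    pending = deque(seeds)
    while pending and queries < max_queries:
        token = pending.popleft()
        queries += 1                       # step 1: one query per token
        for doc_id in search(token):
            if doc_id in docs:
                continue
            docs.add(doc_id)               # step 2: process each new document
            for found in extract(doc_id):  # step 3: extract tokens
                if found not in tokens:
                    tokens.add(found)      # step 4: augment the seed set
                    pending.append(found)
    return tokens, docs, queries
```

Starting from the seed <Ebola, Zaire>, the toy run issues three queries, retrieves d1 and d2, and never reaches <Cholera, Sudan> in d3; the returned counts plug directly into the execution-time formula.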

Querying Graph [AIG 2003]
- The querying graph is a bipartite graph, containing tokens and documents
- Each token (transformed to a keyword query) retrieves documents
- Documents contain tokens
Tokens (t1 to t5): <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>
Documents: d1, d2, d3, d4, d5

Recall Limit: Reachability Graph
Querying graph: tokens t1 to t5, documents d1 to d5
Reachability graph: nodes t1, t2, t3, t4, t5; an edge from t1 to t2 means t1 retrieves document d1 that contains t2
- Upper recall limit: determined by the size of the biggest connected component
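
A reachability graph can be derived from the querying graph and its largest component measured as sketched below. The edge sets in the usage note are invented (the slide's actual edges are not recoverable from the text), and the sketch simplifies by ignoring edge direction, i.e., it computes weakly connected components.

```python
def reachability_graph(retrieves, contains):
    """Edge t -> u when the query for token t retrieves a document containing u."""
    graph = {token: set() for token in retrieves}
    for token, docs in retrieves.items():
        for doc in docs:
            for other in contains.get(doc, ()):
                if other != token:
                    graph[token].add(other)
    return graph


def largest_component(graph):
    """Largest weakly connected component (edge direction ignored)."""
    neighbors = {node: set(out) for node, out in graph.items()}
    for node, out in graph.items():
        for other in out:
            neighbors.setdefault(other, set()).add(node)
    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```

With `retrieves = {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d3"}, "t4": {"d4"}, "t5": {"d5"}}` and `contains = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3"}, "d4": {"t4", "t5"}, "d5": {"t5"}}`, the largest component holds three of the five tokens, so under this simplification recall is capped at 60% no matter which seed is used.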

Reachability Graph for DiseaseOutbreaks

Connected Components Visualization
DiseaseOutbreaks, New York Times 1995

Getting Around Reachability Limit
- KnowItAll:
  - Add keywords to partition documents into retrievable disjoint sets
  - Submit queries with parts of extracted instances
- QXtract:
  - General queries with many matching documents
  - Assumes many documents retrievable per query

QXtract [AG 2003]
Pipeline (diagram): User-Provided Seed Tuples → Seed Sampling → Information Extraction → Classifier Training → Query Generation → Queries
1. Get document sample with "likely negative" and "likely positive" examples.
2. Label sample documents using information extraction system as "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from classifier model/rules.

KnowItAll Architecture [Slides: Zheng Shao, UIUC]
System work flow (diagram): Search Engine Interface, Web Pages, Rule template, Rule, Extractor, Assessor, Database
- Rule template: NP1 "such as" NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) => instanceOf(Class1, head(each(NPList2)))
- Rule: NP1 "such as" NPList2 & head(NP1) = "countries" & properNoun(head(each(NPList2))) => instanceOf(Country, head(each(NPList2)))
- Keywords: "countries such as"
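
Instantiating the rule template for a class and applying it to text can be approximated with regular expressions. This is a rough sketch, not KnowItAll's extractor: there is no noun-phrase chunker here, so `PROPER_NOUN` is a crude stand-in for properNoun(head(...)), and all names below are assumptions for the example.

```python
import re

# Crude stand-in for a proper-noun phrase: capitalized words, optional leading "the".
PROPER_NOUN = r"(?:the\s+)?[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
# List separators: ", and", ",", or "and".
SEPARATOR = r"(?:\s*,\s*and\s+|\s*,\s*|\s+and\s+)"


def instantiate_rule(class_plural):
    """Bind the template's class slot, e.g. 'countries' -> 'countries such as <NPList>'."""
    np_list = rf"{PROPER_NOUN}(?:{SEPARATOR}{PROPER_NOUN})*"
    return re.compile(rf"\b(?i:{re.escape(class_plural)})\s+such\s+as\s+({np_list})")


def extract_instances(rule, text):
    """Return each element of every matched list as a candidate instance."""
    instances = []
    for match in rule.finditer(text):
        for item in re.split(SEPARATOR, match.group(1)):
            instances.append(re.sub(r"^the\s+", "", item))
    return instances
```

Applied with `instantiate_rule("countries")`, the rule pulls candidate instanceOf(Country, X) facts out of phrases like "countries such as North Korea, Iran, India and Pakistan"; in the architecture above these candidates would still need to go through the Assessor.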

KnowItAll Architecture (Cont.)
System work flow (diagram): Search Engine Interface, Web Pages, Rule, Extractor, Assessor, Database
- Web page text: "Countries such as the United Kingdom …"
- Extracted information: the United Kingdom and Canada; India; North Korea, Iran, India and Pakistan; Japan; Iraq, Italy and Spain
- Assessor: frequency of the discriminator phrase "Countries such as X" and of "Country AND X" (e.g., "Country AND the United Kingdom")
- Knowledge: the United Kingdom, Canada, India, North Korea, Iran, …

Using Generic Indexes: Summary
- Order of magnitude scale-up in corpus size
- Indexes are approximate (queries not precise)
- Require many documents to retrieve
- Can we do better?

Index Structures for Information Extraction
- Bindings Engine [CE 2005]
- Indexes of entities: [CGHX 2006], [IBM Avatar]
- Other systems (not covered)
  - Linguist's search engine (P. Resnik et al.): indexes syntactic structures
  - FREE: Indexing regular expressions: J. Cho et al.

Bindings Engine (BE) [Slides: Cafarella 2005]
- Bindings Engine (BE) is a search engine where:
  - No downloads during query processing
  - Disk seeks constant in corpus size
  - #queries = #phrases
- BE's approach:
  - "Variabilized" search query language
  - Pre-processes all documents before query time
  - Integrates variable/type data with inverted index, minimizing query seeks

BE Query Support
Example queries:
  cities such as <NounPhrase>
  President Bush <Verb>
  <NounPhrase> is the capital of <NounPhrase>
  reach me at <phone-number>
- Any sequence of concrete terms and typed variables
- NEAR is insufficient
- Functions (e.g., "head(<NounPhrase>)")

BE Operation
- Like a generic search engine, BE:
  - Downloads a corpus of pages
  - Creates an index
  - Uses index to process queries efficiently
- BE further requires:
  - Set of indexed types (e.g., "NounPhrase"), with a "recognizer" for each
  - String processing functions (e.g., "head()")
- A BE system can only process types and functions that its index supports

Index design
- Search engines handle scale with inverted index
  - Single disk seek per term
  - Mainly sequential reads
- Disk analysis
  - Seeks require ~5 ms, so only 200/sec
  - Sequential reads transfer 10-40 MB/sec
- Inverted index minimizes expensive seeks; BE should do the same
- Parallel downloads are just parallel, distributed seeks; still very costly

Inverted index (diagram): each term (as, billy, cities, friendly, give, mayors, nickels, seattle, such, words) points to a posting list of the form: #docs, docid0, docid1, docid2, …, docid#docs-1

Query: such as
Posting list for "as": 21, 104, 150, …, 322, …, 2501
Posting list for "such": 15, 99, 322, 426, …, 1309
1. Test for equality
2. Advance smaller pointer
3. Abort when a list is exhausted
Returned docs: 322
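
The three-step merge above is the standard sorted-list intersection; a minimal sketch follows. The function name is chosen for illustration, and the lists in the usage note use only the docids visible on the slide, with the elided entries omitted.

```python
def intersect_postings(a, b):
    """Intersect two docid lists that are sorted in ascending order."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):   # 3. abort when a list is exhausted
        if a[i] == b[j]:               # 1. test for equality
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:              # 2. advance the smaller pointer
            i += 1
        else:
            j += 1
    return result
```

Calling `intersect_postings([21, 104, 150, 322, 2501], [15, 99, 322, 426, 1309])` touches each posting at most once, which is why the lookup is dominated by sequential reads rather than seeks.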

Query: "such as"
Positional index (diagram): each posting for a term stores docid, #posns, pos0, pos1, …, pos#pos-1
- In phrase queries, match positions as well
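
Matching positions as well as docids can be sketched with an in-memory positional index. This is a toy illustration with assumed names and a naive whitespace tokenizer, not BE's on-disk layout.

```python
from collections import defaultdict


def build_positional_index(docs):
    """index[term][doc_id] -> list of token positions of term in that document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return {doc_id: [start positions]} for exact occurrences of the phrase."""
    terms = phrase.lower().split()
    if not terms:
        return {}
    postings = [index.get(term, {}) for term in terms]
    # Documents must contain every term before positions are compared.
    common = set(postings[0]).intersection(*postings[1:])
    hits = {}
    for doc_id in sorted(common):
        starts = [p for p in postings[0][doc_id]
                  if all(p + i in postings[i][doc_id] for i in range(1, len(terms)))]
        if starts:
            hits[doc_id] = starts
    return hits
```

A document containing "as such" shares both terms with the query "such as" but is rejected at the position check, which is the extra work a phrase query adds over plain intersection.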

Neighbor Index
- At each position in the index, store "neighbor text" that might be useful
- Let's index <NounPhrase> and <Adj-Term>
"I love cities such as Philadelphia."
Neighbors stored at "I": Left: (none); Right: AdjT: "love"

Neighbor Index
- At each position in the index, store "neighbor text" that might be useful
- Let's index <NounPhrase> and <Adj-Term>
"I love cities such as Philadelphia."
Neighbors stored at "love": Left: AdjT: "I", NP: "I"; Right: AdjT: "cities", NP: "cities"

Neighbor Index
Query: "cities such as <NounPhrase>"
"I love cities such as Philadelphia."
Neighbors stored at "as": Left: AdjT: "such"; Right: AdjT: "Philadelphia", NP: "Philadelphia"

Query: "cities such as <NounPhrase>"
Neighbor index (diagram): each position in a posting additionally stores a block offset, #neighbors, and (neighbor type, string) pairs, e.g., for "as": NPright = "Philadelphia", AdjTleft = "such"
In doc 19, starting at posn 8: "I love cities such as Philadelphia."
1. Find phrase query positions, as with phrase queries
2. If term is adjacent to variable, extract typed value
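
The two query steps above can be sketched with a toy neighbor index that stores, at every term position, the noun phrase immediately to its right. All names are assumptions, the capitalized-run "recognizer" is a hypothetical stand-in for a real NounPhrase recognizer, and the in-memory dictionaries only mimic the idea, not BE's disk format.

```python
from collections import defaultdict


def recognize_np_at(tokens, i):
    """Hypothetical NounPhrase recognizer: a run of capitalized tokens starting at i."""
    j = i
    while j < len(tokens) and tokens[j][:1].isupper():
        j += 1
    return " ".join(tokens[i:j]) if j > i else None


def build_neighbor_index(docs):
    """index[term][doc_id][pos] -> {'NP_right': text} computed once, at indexing time."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        tokens = text.rstrip(".").split()
        for pos, token in enumerate(tokens):
            neighbors = {}
            np_right = recognize_np_at(tokens, pos + 1)
            if np_right:
                neighbors["NP_right"] = np_right
            index[token.lower()][doc_id][pos] = neighbors
    return index


def query_bindings(index, terms, var_type="NP_right"):
    """Answer '<terms> <variable>': match the phrase, then read the stored neighbor."""
    bindings = []
    for doc_id, positions in index.get(terms[0], {}).items():
        for start in positions:
            # Step 1: phrase match on positions.
            if all(start + i in index.get(term, {}).get(doc_id, {})
                   for i, term in enumerate(terms)):
                # Step 2: the last term is adjacent to the variable.
                last = start + len(terms) - 1
                value = index[terms[-1]][doc_id][last].get(var_type)
                if value:
                    bindings.append((doc_id, value))
    return bindings
```

The point of the design is visible in `query_bindings`: it never touches the document text, so the binding "Philadelphia" comes straight out of the index entry for "as".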

Asymptotic Efficiency Analysis
- k concrete terms in query
- B bindings found for query
- N documents in corpus
- T indexed types in corpus
              Query Time (in seeks)   Index Space
  BE          O(k)                    O(N * T)
  Std Model   O(k + B)                O(N)
- B and N scale together; k often small; T often exclusive
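
To make the seek counts concrete, the sketch below turns them into time using the ~5 ms seek figure from the disk analysis. It assumes exactly one seek per concrete term and one per binding, which is a simplification of the asymptotic bounds; the function names and the example values of k and B are illustrative.

```python
SEEK_MS = 5  # approximate cost of one disk seek, in milliseconds


def be_seek_ms(k):
    """BE: one seek per concrete query term, independent of the number of bindings."""
    return k * SEEK_MS


def std_seek_ms(k, b):
    """Standard model: one seek per term plus one per binding (document fetch)."""
    return (k + b) * SEEK_MS
```

For a 3-term query with 10,000 bindings, this gives 15 ms of seeks for BE against roughly 50 seconds for the standard model, under the stated one-seek-per-item assumption.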

Experiment 2: KnowItAll on BE
  Num Extractions   Std Imp/Google   BE     Speedup
  10k               5,976 s          95 s   63x
  50k               29,880 s         95 s   314x
  150k              89,641 s         N/A

BE Summary
- Significant improvement over generic indexes
- Index size grows linearly with number of types
- Some ML-based patterns (e.g., HMMs, CRFs, character models) not supported
- Can we use it for general QA, RE tasks?

Similar Approach: [CGHX 2006]
- Support "relationship" keyword queries over indexed entities
- Top-K support for early termination

Indexing Thousands of Entity Types [Slides from Chakrabarti et al., WWW 2006]

Workload-Driven Indexing

Selecting Types to Index

Parallelization/Adaptive Processing
- Parallelize processing:
  - IBM WebFountain [GCG+2004]
  - Google's Map/Reduce
- Select most efficient access strategy
  - Cost Estimation and Optimization [IAJG 2006]

Map/Reduce Framework

Map/Reduce Framework
- General framework
  - Scales to 1000s of machines
  - Implemented in Nutch
- "Maps" easily to information extraction
  - Map phase:
    - Parse individual documents
    - Tag entities
    - Propose candidate relation tuples
  - Reduce phase:
    - Merge multiple mentions of same relation tuple
    - Resolve co-references, duplicates
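
The mapping of extraction onto the two phases can be sketched in a single process. This is not a distributed implementation and not the API of any particular Map/Reduce system: the function names, the toy regular expression standing in for parsing and entity tagging, and the sample documents are all assumptions for illustration.

```python
import re
from collections import defaultdict

# Toy extractor standing in for parsing + entity tagging + relation proposal.
OUTBREAK = re.compile(r"([A-Z][a-z]+) outbreak in ([A-Z][a-z]+)")


def map_phase(doc_id, text):
    """Emit (candidate relation tuple, doc_id) pairs for one document."""
    for disease, location in OUTBREAK.findall(text):
        yield (disease, location), doc_id


def reduce_phase(relation_tuple, doc_ids):
    """Merge all mentions of the same tuple into one record with its supporting docs."""
    return relation_tuple, sorted(set(doc_ids))


def run_map_reduce(docs):
    groups = defaultdict(list)          # the "shuffle": group map output by key
    for doc_id, text in docs.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

Because each document is mapped independently and each tuple is reduced independently, both loops could be split across machines; real co-reference resolution in the reduce phase would need more than the exact-match merge shown here.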

Cost Optimizer for Text-Centric Tasks

Other Dimensions of Scalability: Managing Complex Features [CNS 2006]
Input: R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998
- Surface features (cheap)
- Database lookup features (expensive!)
  - Many large tables, e.g., Authors: Ronald Fagin, Steve Cook, S. Sudarshan, S. Chakrabarti, Nick Koudas, R. K. Narayan, E. F. Codd, J. Widom
  - Efficient search for top-k most similar entities (inverted index)
1. Batch up to do better than individual top-k?
2. Find top segmentation without top-k matches for all segments?

Other Dimensions of Scalability: Extraction Pattern Discovery [Konig and Brill, KDD 2006]
- Use suffix array to efficiently explore candidate patterns

Application: Web Question Answering
- AskMSR: does not use patterns
  - Simplicity → scalability (cheap to compute n-grams)
  - Challenge: do better than n-grams on web QA
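
Why n-grams are cheap to compute can be seen from a few lines of counting code. The sketch below only illustrates n-gram frequency voting over retrieved snippets; it is not AskMSR's actual pipeline (which the slide does not detail), and the function names and sample snippets are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vote_ngrams(snippets, max_n=3):
    """Count every 1..max_n-gram across all snippets; frequent ones are answer candidates."""
    counts = Counter()
    for snippet in snippets:
        tokens = [t.strip('.,;:!?"').lower() for t in snippet.split()]
        tokens = [t for t in tokens if t]
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts
```

A single linear pass per snippet suffices, with no parsing or pattern matching; a practical system would still need to filter out question words and stopwords before ranking candidates.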

Summary
- Brief overview of information extraction from text
- Techniques to scale up information extraction
  - Scan-based techniques (limited impact)
  - Exploiting general indexes (limited accuracy)
  - Building specialized index structures (most promising)
- Scalability is a data mining problem
  - Querying graphs → link discovery
  - Workload mining for index optimization
  - Must be optimized for specific text mining application?

Related Challenges
- Duplicate entities, relation tuples extracted
- Missing values
  - Extraction errors
  - Information spans multiple documents
- Combining relation tuples into complex events

Break
- Eugene Agichtein, Microsoft & Emory University
  - http://www.mathcs.emory.edu/~eugene/
  - eugene@mathcs.emory.edu
- Next: Scalable Information Integration
  - Core set of techniques to enable large-scale IE, text mining
  - Sunita Sarawagi

References
- [AGI 2005] E. Agichtein. Scaling Information Extraction to Large Document Collections. IEEE Data Engineering Bulletin, 2005.
- [AG 2003] E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. ICDE 2003.
- [AIG 2003] E. Agichtein, P. Ipeirotis, and L. Gravano. Modeling Query-Based Access to Text Databases. WebDB 2003.
- [CDS+2005] M. J. Cafarella, D. Downey, S. Soderland, and O. Etzioni. KnowItNow: Fast, scalable information extraction from the web. HLT/EMNLP 2005.
- [CE 2005] M. J. Cafarella and O. Etzioni. A search engine for natural language applications. WWW 2005.
- [CNS 2006] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. ICDE 2006.
- [CRW 2005] S. Chaudhuri, R. Ramakrishnan, and G. Weikum. Integrating DB and IR technologies: What is the sound of one hand clapping? CIDR 2005.
- [CGHX 2006] K. Chakrabarti, V. Ganti, J. Han, and D. Xin. Ranking Objects Based on Relationships. SIGMOD 2006.
- [CPD 2006] S. Chakrabarti, K. Puniyani, and S. Das. Optimizing Scoring Functions and Indexes for Proximity Search in Type-annotated Corpora. WWW 2006.

References II
- [DBB+2002] S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng. Web question answering: Is more always better? SIGIR 2002.
- [GHY 2002] R. Grishman, S. Huttunen, and R. Yangarber. Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 2002.
- [GCG+2004] D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. How to build a WebFountain: An architecture for very large-scale text analytics. IBM Systems Journal, 2004.
- [IAJG 2006] P. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano. To Search or to Crawl: Towards a Query Optimizer for Text-Centric Tasks. SIGMOD 2006.
- [KRV+2004] R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar: A Database Approach to Semantic Search. SIGMOD 2006.
- [PRH 2004] P. Pantel, D. Ravichandran, and E. Hovy. Towards terascale knowledge acquisition. Conference on Computational Linguistics (COLING), 2004.
- [PE 2005] P. Resnik and A. Elkiss. The linguist's search engine: An overview (demonstration). ACL 2005.
- [PDT 2001] P. D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. European Conference on Machine Learning (ECML), 2001.
- C. König and E. Brill. Reducing the Human Overhead in Text Categorization. KDD 2006.