  • Number of slides: 56

Lecture 24: NLP for IR
Principles of Information Retrieval
Prof. Ray Larson
University of California, Berkeley, School of Information
Tuesday and Thursday 10:30 am - 12:00 pm, Spring 2007
http://courses.ischool.berkeley.edu/i240/s07
IS 240 – Spring 2007, 2007.04.26 - SLIDE 1

Today
• Review
  – Web Search Processing
  – Parallel Architectures (Inktomi - Brewer)
  – Cheshire III Design
  – GRID-based DLs
• NLP for IR
• Text Summarization
Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

Google
• Google maintains (probably) the world's largest Linux cluster (over 15,000 servers)
• These are partitioned between index servers and page servers
  – Index servers resolve the queries (massively parallel processing)
  – Page servers deliver the results of the queries
• Over 8 billion web pages are indexed and served by Google

Ranking: Link Analysis
• Assumptions:
  – If the pages pointing to this page are good, then this is also a good page
  – The words on the links pointing to this page are useful indicators of what this page is about
  – References: Page et al. 98, Kleinberg 98

Ranking: PageRank
• Google uses PageRank
• We assume page A has pages T1 ... Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; d is usually set to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
• PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
• Note that the PageRanks are described as forming a probability distribution over web pages, so that the sum of all web pages' PageRanks is one. Strictly, this holds only when the (1 - d) term is divided by the total number of pages N; with the formula as written, the values sum to the number of pages when every page has at least one outgoing link
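The formula above can be iterated to a fixed point. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the iteration count, and the three-page link graph are assumptions made for illustration. It uses the formula exactly as written on the slide, so the values are not normalised to sum to one.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # start every page at 1.0
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links here contributes PR(T) / C(T).
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items()
                           if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph C is linked from both A and B, so it ends up ranked above B, and the three values sum to 3 (the number of pages) rather than to 1.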

PageRank
Note: these are not real PageRanks, since they include values >= 1
[Figure: example link graph of pages X1, X2, T1 to T8 and A, annotated with values such as Pr=1, Pr=.725, Pr=2.46625 and Pr=4.2544375]

Presentation from DLF Forum April 2005
Digital Library Grid Initiatives: Cheshire 3 and the Grid
Ray R. Larson, University of California, Berkeley, School of Information Management and Systems
Rob Sanderson, University of Liverpool, Dept. of Computer Science
Thanks to Dr. Eric Yen and Prof. Michael Buckland for parts of this presentation

Overview
• The Grid, Text Mining and Digital Libraries
  – Grid Architecture
  – Grid IR Issues
• Cheshire 3: Bringing Search to Grid-Based Digital Libraries
  – Overview
  – Grid Experiments
  – Cheshire 3 Architecture
  – Distributed Workflows

Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan)
• Applications: Astrophysics, Remote sensors, Combustion, Climate, Chemical Engineering, Cosmology, High energy physics, ...
• Application Toolkits: Portals, Collaboratories, Remote Visualization, Remote Computing, Data Grid
• Grid middleware
  – Grid Services: protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
  – Grid Fabric: storage, networks, computers, display devices, etc. and their associated local services

Grid Architecture (ECAI/AS Grid Digital Library Workshop)
• Applications: Astrophysics, Humanities computing, Remote sensors, Digital Libraries, Bio-Medical, Combustion, Cosmology, Climate, Chemical Engineering, High energy physics, ...
• Application Toolkits: Text Mining, Metadata management, Search & Retrieval, Portals, Collaboratories, Remote Visualization, Remote Computing, Data Grid
• Grid middleware
  – Grid Services: protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
  – Grid Fabric: storage, networks, computers, display devices, etc. and their associated local services

Grid IR Issues
• Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e. speed)
• Very large-scale distribution of resources is a challenge for sub-second retrieval
• Unlike most typical Grid processes, IR is potentially less computing intensive and more data intensive
• In many ways Grid IR replicates the process (and problems) of metasearch or distributed search

Today
• Natural Language Processing and IR
  – Based on papers in Reader and on
    • David Lewis & Karen Sparck Jones, "Natural Language Processing for Information Retrieval", Communications of the ACM, 39(1), Jan. 1996
• Text summarization: Lecture from Ed Hovy (USC)

Natural Language Processing and IR
• The main approach in applying NLP to IR has been to attempt to address
  – Phrase usage vs individual terms
  – Search expansion using related terms/concepts
  – Attempts to automatically exploit or assign controlled vocabularies

NLP and IR
• Much early research showed that (at least in the restricted test databases used)
  – Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)
  – Constructing phrases or "pre-coordinated" terms provides only marginal and inconsistent improvements

NLP and IR
• Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods
  – E.g. use of syntactic role relations between terms has shown no improvement in performance over "bag of words" approaches

General Framework of NLP
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

General Framework of NLP
Example input: "John runs."
• Morphological and Lexical Processing: John run+s (John: P-N; run+s: V 3-pre or N plu)
• Syntactic Analysis: S consists of NP (P-N John) and VP (V run)
• Semantic Analysis: Pred: RUN, Agent: John
• Context processing, Interpretation: "John is a student. He runs."

General Framework of NLP
• Morphological and Lexical Processing
  – Tokenization
  – Part of Speech Tagging
  – Inflection/Derivation
  – Compounding
  – Term recognition (Ananiadou)
• Syntactic Analysis
• Semantic Analysis
• Context processing, Interpretation
  – Domain Analysis (Appelt 1999)

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
Stages of the general framework: Morphological and Lexical Processing, Syntactic Analysis, Semantic Analysis, Context processing and Interpretation

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
• Morphological and Lexical Processing: incomplete lexicons
  – Open class words
  – Terms (term recognition)
  – Named entities: company names, locations, numerical expressions

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
• Syntactic Analysis: incomplete grammar
  – Syntactic coverage
  – Domain specific constructions
  – Ungrammatical constructions

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
• Semantic Analysis, Context processing and Interpretation
  – Predefined aspects of information
  – Incomplete domain knowledge
  – Interpretation rules

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
• Morphological and Lexical Processing: most words in English are ambiguous in terms of their parts of speech
  – runs: v/3pre, n/plu
  – clubs: v/3pre, n/plu and two meanings
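To see why part-of-speech ambiguity leads to a combinatorial explosion, the candidate tag sequences of a sentence can be enumerated: their number is the product of the per-word tag counts. This is a minimal sketch; the lexicon `LEXICON` and the helper `tag_sequences` are hypothetical, and only the tags for "runs" and "clubs" come from the slide.

```python
from itertools import product

# Toy lexicon (assumption for illustration). Tag labels follow the slide:
# v/3pre = third-person present verb, n/plu = plural noun, p-n = proper noun.
LEXICON = {
    "john": ["p-n"],
    "runs": ["v/3pre", "n/plu"],
    "clubs": ["v/3pre", "n/plu"],
}


def tag_sequences(words, lexicon=LEXICON):
    """Return every combination of candidate tags, one tag per word."""
    options = [lexicon[w.lower()] for w in words]
    return list(product(*options))


readings = tag_sequences(["John", "runs", "clubs"])
for reading in readings:
    print(reading)
```

Three words already give 1 x 2 x 2 = 4 candidate readings; each further ambiguous word multiplies the count again, before any structural ambiguity is considered.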

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
• Syntactic Analysis: structural ambiguities
• Semantic Analysis: predicate-argument ambiguities

Structural Ambiguities
(1) Attachment ambiguities
  – John bought a car with large seats.
  – John bought a car with $3000.
  – The manager of Yaxing Benz, a Sino-German joint venture
  – The manager of Yaxing Benz, Mr. John Smith
(2) Scope ambiguities
  – young women and men in the room
(3) Analytical ambiguities
  – Visiting relatives can be boring.
Semantic Ambiguities
(1) John bought a car with Mary. $3000 can buy a nice car.
(2) Every man loves a woman.
Co-reference Ambiguities

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
• Syntactic Analysis: structural ambiguities, leading to combinatorial explosion
• Semantic Analysis: predicate-argument ambiguities

Note: Ambiguities vs Robustness
• More comprehensive knowledge (big dictionaries, comprehensive grammar): more robust
• More comprehensive knowledge: more ambiguities
• Adaptability: tuning, learning

Framework of IE
IE as compromise NLP

Techniques in IE
(1) Domain Specific Partial Knowledge: knowledge relevant to information to be extracted
(2) Ambiguities: ignoring irrelevant ambiguities; simpler NLP techniques
(3) Robustness: coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences
(4) Adaptation Techniques: machine learning, trainable systems

General Framework of NLP
• Morphological and Lexical Processing
  – Part of speech tagger: FSA rules, statistic taggers (95 %)
  – Open class words: named entity recognition, e.g. locations, persons, companies, organizations, position names (F-Value 90, domain dependent)
  – Local context, statistical bias
  – Domain specific rules: patterns built around cues such as ", Inc." and "Mr."
  – Machine learning: HMM, decision trees
  – Rules + machine learning
• Syntactic Analysis
• Semantic Analysis
• Context processing, Interpretation
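Domain specific rules of the kind named above (cues such as ", Inc." and "Mr.") can be approximated with regular expressions over local context. This is a minimal sketch under stated assumptions: the two patterns, the function `find_entities`, and the sample sentence are invented for illustration and are not the rules of any particular IE system.

```python
import re

# Hypothetical local-context rules: a run of capitalised words before ", Inc."
# is taken as a company name; a run of capitalised words after "Mr." as a person.
COMPANY = re.compile(r"\b([A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*), Inc\.")
PERSON = re.compile(r"\bMr\. ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")


def find_entities(text):
    """Return (label, name) pairs found by the two rules, persons first."""
    people = [("PERSON", m.group(1)) for m in PERSON.finditer(text)]
    companies = [("COMPANY", m.group(1)) for m in COMPANY.finditer(text)]
    return people + companies


sample = "Mr. John Smith joined Acme Widgets, Inc. last year."
entities = find_entities(sample)
print(entities)
```

Rules like these need no complete lexicon, which is how they cope with open class words, but they are tied to the domain and miss names that lack the cue.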

FASTUS
Based on finite-state automata (FSA); its stages run alongside the general framework of NLP (morphological and lexical processing, syntactic analysis, semantic analysis, context processing and interpretation)
1. Complex Words: recognition of multi-words and proper names
2. Basic Phrases: simple noun groups, verb groups and particles
3. Complex Phrases: complex noun groups and verb groups
4. Domain Events: patterns for events of interest to the application; basic templates are to be built
5. Merging Structures: templates from different parts of the texts are merged if they provide information about the same entity or event
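The cascade idea can be sketched with two of the five stages. This is a rough illustration and not the FASTUS implementation: the tag sets, the functions `basic_phrases` and `domain_events`, and the template fields are all assumptions. Stage 2 is reduced to grouping adjacent tokens by tag class, and stage 4 to matching one pattern (noun group, verb group, noun group).

```python
NG_TAGS = {"DT", "JJ", "NN", "NNP"}   # tags allowed inside a noun group (assumption)
VG_TAGS = {"VBZ", "VBN", "VBD"}       # tags allowed inside a verb group (assumption)


def basic_phrases(tagged):
    """Stage 2 sketch: merge adjacent tokens into noun groups (NG) and verb groups (VG)."""
    phrases = []
    for word, tag in tagged:
        kind = "NG" if tag in NG_TAGS else "VG" if tag in VG_TAGS else "O"
        if phrases and kind != "O" and phrases[-1][0] == kind:
            phrases[-1][1].append(word)      # extend the current group
        else:
            phrases.append((kind, [word]))   # start a new group
    return phrases


def domain_events(phrases):
    """Stage 4 sketch: fill a template wherever the pattern NG VG NG occurs."""
    events = []
    for (k1, w1), (k2, w2), (k3, w3) in zip(phrases, phrases[1:], phrases[2:]):
        if (k1, k2, k3) == ("NG", "VG", "NG"):
            events.append({"agent": " ".join(w1),
                           "action": " ".join(w2),
                           "object": " ".join(w3)})
    return events


tagged = [("a", "DT"), ("Russian", "JJ"), ("tank", "NN"),
          ("invaded", "VBD"), ("Wisconsin", "NNP")]
events = domain_events(basic_phrases(tagged))
print(events)
```

Each stage consumes the output of the previous one, so no stage needs a full parse; a merging stage would then combine templates that refer to the same entity or event.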

Using NLP
• Strzalkowski (in Reader)
  – Text → NLP → NLP representation → database search
  – NLP: TAGGER → PARSER → TERMS

Using NLP
INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.
TAGGED SENTENCE
The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

Using NLP
TAGGED & STEMMED SENTENCE
the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per
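The step from the tagged sentence to the tagged and stemmed sentence can be sketched as follows. The word/tag string is the one from the slides; the lookup table `LEMMAS` and the functions `parse_tagged` and `normalise` are hypothetical stand-ins for the actual stemmer, which the slides do not describe.

```python
TAGGED = ("The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj "
          "hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per")

# Hypothetical lookup table standing in for a real stemmer or lemmatiser.
LEMMAS = {"has": "have", "been": "be", "invaded": "invade"}


def parse_tagged(text):
    """Split 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, tag = token.rsplit("/", 1)  # split on the last slash only
        pairs.append((word, tag))
    return pairs


def normalise(pairs):
    """Lower-case each word and replace it by its base form when one is known."""
    return [(LEMMAS.get(word.lower(), word.lower()), tag) for word, tag in pairs]


stemmed = normalise(parse_tagged(TAGGED))
print(" ".join(f"{word}/{tag}" for word, tag in stemmed))
```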

Using NLP
PARSED SENTENCE
[assert
  [[perf [have]][[verb[BE]]
    [subject [np[n PRESIDENT][t_pos THE] [adj[FORMER]][adj[SOVIET]]]]
    [adv EVER]
    [sub_ord[SINCE [[verb[INVADE]]
      [subject [np [n TANK][t_pos A] [adj [RUSSIAN]]]]
      [object [np [name [WISCONSIN]]]]]

Using NLP
EXTRACTED TERMS & WEIGHTS
President          2.623519     soviet             5.416102
President+soviet  11.556747     president+former  14.594883
Hero               7.896426     hero+local        14.314775
Invade             8.435012     tank               6.848128
Tank+invade       17.402237     tank+russian      16.030809
Russian            7.383342     wisconsin          7.785689
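Part of this term list can be approximated directly from the tagged and stemmed tokens. The sketch below is a simplification and not Strzalkowski's method: the function `head_modifier_terms` is hypothetical, it works from tags rather than from the parse, it emits only noun heads and head+adjective pairs, and it does not compute weights.

```python
def head_modifier_terms(tagged):
    """Emit each noun as a term, plus a head+modifier pair for every adjective before it."""
    terms, pending = [], []
    for word, tag in tagged:
        if tag == "jj":
            pending.append(word)              # adjective waiting for its head noun
        elif tag in ("nn", "np"):
            terms.append(word)
            terms.extend(f"{word}+{adj}" for adj in pending)
            pending = []
        else:
            pending = []                      # any other tag ends the noun group
    return terms


stemmed = [("the", "dt"), ("former", "jj"), ("soviet", "jj"), ("president", "nn"),
           ("have", "vbz"), ("be", "vbn"), ("a", "dt"), ("local", "jj"), ("hero", "nn"),
           ("ever", "rb"), ("since", "in"), ("a", "dt"), ("russian", "jj"), ("tank", "nn"),
           ("invade", "vbd"), ("wisconsin", "np"), (".", "per")]
terms = head_modifier_terms(stemmed)
print(terms)
```

Pairs such as tank+invade depend on the subject and verb relation, so they would need the parsed sentence rather than the tag sequence alone.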

Same Sentence, different sys
INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.
TAGGED SENTENCE (using uptagger from Tsujii)
The/DT former/JJ Soviet/NNP President/NNP has/VBZ been/VBN a/DT local/JJ hero/NN ever/RB since/IN a/DT Russian/JJ tank/NN invaded/VBD Wisconsin/NNP ./.

Same Sentence, different sys
CHUNKED Sentence (chunkparser – Tsujii)
(TOP (S
  (NP (DT The) (JJ former) (NNP Soviet) (NNP President))
  (VP (VBZ has)
    (VP (VBN been)
      (NP (DT a) (JJ local) (NN hero))
      (ADVP (RB ever))
      (SBAR (IN since)
        (S (NP (DT a) (JJ Russian) (NN tank))
           (VP (VBD invaded) (NP (NNP Wisconsin))))
        (. .))))

Same Sentence, different sys
Enju Parser
[Table: Enju parser output for the same sentence, listing for each word its base form, part-of-speech tags, and predicate-argument relations (ROOT, ARG1, ARG2, MOD) to other words of the sentence]

NLP & IR
• Indexing
  – Use of NLP methods to identify phrases
    • Test weighting schemes for phrases
  – Use of more sophisticated morphological analysis
• Searching
  – Use of two-stage retrieval
    • Statistical retrieval
    • Followed by more sophisticated NLP filtering
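Two-stage retrieval can be sketched as a cheap statistical ranking followed by a stricter filter over the few candidates it returns. Everything in this sketch is an assumption made for illustration: the toy collection `DOCS`, the term-overlap score, and the exact-phrase test that stands in for "more sophisticated NLP filtering".

```python
from collections import Counter

DOCS = {  # hypothetical toy collection
    "d1": "a russian tank invaded the town",
    "d2": "the tank of russian dressing was empty",
    "d3": "local hero wins award",
}


def statistical_stage(query, docs, k=2):
    """Stage 1: rank documents by bag-of-words term overlap and keep the top k."""
    q = Counter(query.split())
    scores = {}
    for doc_id, text in docs.items():
        tf = Counter(text.split())
        score = sum(tf[term] * q[term] for term in q)
        if score:
            scores[doc_id] = score
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def phrase_filter(candidates, docs, phrase):
    """Stage 2 stand-in: keep candidates in which the query words occur as a phrase."""
    return [d for d in candidates if phrase in docs[d]]


candidates = statistical_stage("russian tank", DOCS)
final = phrase_filter(candidates, DOCS, "russian tank")
print(candidates, final)
```

Both d1 and d2 contain the two query words and score equally in the first stage; only the second stage separates the document about a Russian tank from the one about Russian dressing. The expensive analysis runs on two documents instead of the whole collection.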

NLP & IR
• Lewis and Sparck Jones suggest research in three areas
  – Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms
  – The classificatory structure over the document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching
  – Using NLP-based methods for searching and matching

NLP & IR Issues
• Is natural language indexing using more NLP knowledge needed?
• Or should controlled vocabularies be used?
• Can NLP in its current state provide the improvements needed?
• How to test?

NLP & IR
• New "Question Answering" track at TREC has been exploring these areas
  – Usually statistical methods are used to retrieve candidate documents
  – NLP techniques are used to extract the likely answers from the text of the documents