a6e24ef78f4b8b58deaa8716cd2d7e26.ppt
- Number of slides: 56
Lecture 24: NLP for IR Principles of Information Retrieval Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00 pm Spring 2007 http://courses.ischool.berkeley.edu/i240/s07 IS 240 – Spring 2007. 04. 26 - SLIDE 1
Today • Review – Web Search Processing – Parallel Architectures (Inktomi - Brewer) – Cheshire III Design – GRID-based DLs • NLP for IR • Text Summarization Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer IS 240 – Spring 2007. 04. 26 - SLIDE 2
Google • Google maintains (probably) the world's largest Linux cluster (over 15,000 servers) • These are partitioned between index servers and page servers – Index servers resolve the queries (massively parallel processing) – Page servers deliver the results of the queries • Over 8 billion web pages are indexed and served by Google IS 240 – Spring 2007. 04. 26 - SLIDE 3
Ranking: Link Analysis • Assumptions: – If the pages pointing to this page are good, then this is also a good page – The words on the links pointing to this page are useful indicators of what this page is about – References: Page et al. 98, Kleinberg 98 IS 240 – Spring 2007. 04. 26 - SLIDE 4
Ranking: PageRank • Google uses PageRank • We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; d is usually set to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: • PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) • Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one IS 240 – Spring 2007. 04. 26 - SLIDE 5
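The formula on this slide can be iterated directly. The following is a minimal sketch, not Google's implementation: the toy link graph, the function name `pagerank` and the fixed iteration count are assumptions made for illustration. One caveat worth seeing in code: with the (1 - d) term written as on the slide, and when every page has at least one outgoing link, the scores sum to the number of pages rather than to one; dividing (1 - d) by the number of pages would give the probability-distribution form.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it points to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}                      # arbitrary starting values
    out_count = {p: len(links.get(p, [])) for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the shares passed on by every page that links to `page`.
            inbound = sum(pr[src] / out_count[src]
                          for src, targets in links.items() if page in targets)
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph page C receives links from both A and B, so it ends up ranked above B, which only receives half of A's score.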
PageRank • Note: these are not real PageRanks, since they include values >= 1 [Figure: example link graph over pages X1, X2, A and T1 to T8, each node labeled with a Pr value (1, .725, 2.46625, 4.2544375)] IS 240 – Spring 2007. 04. 26 - SLIDE 6
Presentation from DLF Forum April 2005 Digital Library Grid Initiatives: Cheshire 3 and the Grid Ray R. Larson University of California, Berkeley School of Information Management and Systems Rob Sanderson University of Liverpool Dept. of Computer Science Thanks to Dr. Eric Yen and Prof. Michael Buckland for parts of this presentation IS 240 – Spring 2007. 04. 26 - SLIDE 11
Overview • The Grid, Text Mining and Digital Libraries – Grid Architecture – Grid IR Issues • Cheshire 3: Bringing Search to Grid-Based Digital Libraries – Overview – Grid Experiments – Cheshire 3 Architecture – Distributed Workflows IS 240 – Spring 2007. 04. 26 - SLIDE 12
Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan.) • Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, … • Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, … • Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc. • Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services IS 240 – Spring 2007. 04. 26 - SLIDE 13
Grid Architecture (ECAI/AS Grid Digital Library Workshop) • Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, Bio-Medical, Humanities computing, Digital Libraries, … • Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, Text Mining, Metadata management, Search & Retrieval, … • Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc. • Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services IS 240 – Spring 2007. 04. 26 - SLIDE 14
Grid IR Issues • Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed) • Very large-scale distribution of resources is a challenge for sub-second retrieval • Unlike most other typical Grid processes, IR is potentially less computing intensive and more data intensive • In many ways Grid IR replicates the process (and problems) of metasearch or distributed search IS 240 – Spring 2007. 04. 26 - SLIDE 15
Today • Natural Language Processing and IR – Based on Papers in Reader and on • David Lewis & Karen Sparck Jones “Natural Language Processing for Information Retrieval” Communications of the ACM, 39(1) Jan. 1996 • Text summarization: Lecture from Ed Hovy (USC) IS 240 – Spring 2007. 04. 26 - SLIDE 16
Natural Language Processing and IR • The main approach in applying NLP to IR has been to attempt to address – Phrase usage vs individual terms – Search expansion using related terms/concepts – Attempts to automatically exploit or assign controlled vocabularies IS 240 – Spring 2007. 04. 26 - SLIDE 17
NLP and IR • Much early research showed that (at least in the restricted test databases tested) – Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically) – Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements IS 240 – Spring 2007. 04. 26 - SLIDE 18
NLP and IR • Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods – E.g., use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches IS 240 – Spring 2007. 04. 26 - SLIDE 19
General Framework of NLP Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 20
General Framework of NLP John runs. Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 21
General Framework of NLP • Input: John runs. • Morphological and Lexical Processing: John run+s. (John: P-N; run+s: V 3-pre or N plu) • Syntactic Analysis • Semantic Analysis • Context processing / Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 22
General Framework of NLP • Input: John runs. • Morphological and Lexical Processing: John run+s. (John: P-N; run+s: V 3-pre or N plu) • Syntactic Analysis: [S [NP [P-N John]] [VP [V run]]] • Semantic Analysis • Context processing / Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 23
General Framework of NLP • Input: John runs. • Morphological and Lexical Processing: John run+s. (John: P-N; run+s: V 3-pre or N plu) • Syntactic Analysis: [S [NP [P-N John]] [VP [V run]]] • Semantic Analysis: Pred: RUN, Agent: John • Context processing / Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 24
General Framework of NLP • Input: John runs. • Morphological and Lexical Processing: John run+s. (John: P-N; run+s: V 3-pre or N plu) • Syntactic Analysis: [S [NP [P-N John]] [VP [V run]]] • Semantic Analysis: Pred: RUN, Agent: John • Context processing / Interpretation: John is a student. He runs. Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 25
General Framework of NLP • Morphological and Lexical Processing: Tokenization, Part of Speech Tagging, Inflection/Derivation, Compounding, Term recognition (Ananiadou) • Syntactic Analysis • Semantic Analysis • Context processing / Interpretation: Domain Analysis (Appelt: 1999) Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 26
Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 27
Difficulties of NLP • (1) Robustness: Incomplete Knowledge – Morphological and Lexical Processing: Incomplete Lexicons: open class words; terms (term recognition); named entities (company names, locations, numerical expressions) – Syntactic Analysis – Semantic Analysis – Context processing / Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 28
Difficulties of NLP • (1) Robustness: Incomplete Knowledge – Morphological and Lexical Processing – Syntactic Analysis: Incomplete Grammar: syntactic coverage; domain specific constructions; ungrammatical constructions – Semantic Analysis – Context processing / Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 29
Difficulties of NLP • (1) Robustness: Incomplete Knowledge – Morphological and Lexical Processing – Syntactic Analysis – Semantic Analysis: Predefined Aspects of Information – Context processing / Interpretation: Incomplete Domain Knowledge, Interpretation Rules Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 30
Difficulties of NLP General Framework of NLP (1) Robustness: Incomplete Knowledge (2) Ambiguities: Combinatorial Explosion Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 31
Difficulties of NLP • (1) Robustness: Incomplete Knowledge • (2) Ambiguities: Combinatorial Explosion – Morphological and Lexical Processing: Most words in English are ambiguous in terms of their parts of speech. runs: v/3pre, n/plu; clubs: v/3pre, n/plu and two meanings – Syntactic Analysis – Semantic Analysis – Context processing / Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 32
Difficulties of NLP • (1) Robustness: Incomplete Knowledge • (2) Ambiguities: Combinatorial Explosion – Morphological and Lexical Processing – Syntactic Analysis: Structural Ambiguities – Semantic Analysis: Predicate-argument Ambiguities – Context processing / Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 33
Structural Ambiguities • (1) Attachment Ambiguities – John bought a car with large seats. – John bought a car with $3000. – The manager of Yaxing Benz, a Sino-German joint venture – The manager of Yaxing Benz, Mr. John Smith • (2) Scope Ambiguities – young women and men in the room • (3) Analytical Ambiguities – Visiting relatives can be boring. • Semantic Ambiguities (1) – John bought a car with Mary. – $3000 can buy a nice car. • Semantic Ambiguities (2) – Every man loves a woman. • Co-reference Ambiguities Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 34
Difficulties of NLP • (1) Robustness: Incomplete Knowledge • (2) Ambiguities: Combinatorial Explosion – Morphological and Lexical Processing – Syntactic Analysis: Structural Ambiguities (Combinatorial Explosion) – Semantic Analysis: Predicate-argument Ambiguities – Context processing / Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 35
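The combinatorial explosion can be made concrete with the part-of-speech ambiguity of "runs" and "clubs" noted on the earlier slide: if each word is tagged independently, the number of candidate tag sequences is the product of the per-word tag counts. The sketch below assumes a tiny hypothetical lexicon (`LEXICON`) and illustrative helper names; it is not taken from any of the systems discussed in the lecture.

```python
from itertools import product
from math import prod

# Hypothetical toy lexicon: each word maps to its possible part-of-speech tags.
LEXICON = {
    "john": ["P-N"],
    "runs": ["V/3pre", "N/plu"],
    "clubs": ["V/3pre", "N/plu"],
}


def tag_sequences(words):
    """Enumerate every combination of tags, one tag per word."""
    options = [LEXICON[w.lower()] for w in words]
    return list(product(*options))


def count_readings(words):
    """Number of tag sequences = product of the per-word ambiguity counts."""
    return prod(len(LEXICON[w.lower()]) for w in words)


sentence = ["John", "runs", "clubs"]
readings = tag_sequences(sentence)
```

Three words already give 1 x 2 x 2 = 4 tag sequences, and each further ambiguous word multiplies that count before structural and predicate-argument ambiguities are even considered.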
Note: Ambiguities vs Robustness • More comprehensive knowledge: More Robust – big dictionaries – comprehensive grammar • More comprehensive knowledge: More ambiguities • Adaptability: Tuning, Learning Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 36
Framework of IE IE as compromise NLP Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 37
Difficulties of NLP • (1) Robustness: Incomplete Knowledge – Morphological and Lexical Processing – Syntactic Analysis – Semantic Analysis: Predefined Aspects of Information – Context processing / Interpretation: Incomplete Domain Knowledge, Interpretation Rules Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDES 38-39
Techniques in IE • (1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted • (2) Ambiguities: Ignoring irrelevant ambiguities; simpler NLP techniques • (3) Robustness: Coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences • (4) Adaptation Techniques: Machine Learning, Trainable systems Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDE 40
General Framework of NLP • Morphological and Lexical Processing – Part of Speech Tagger: FSA rules, Statistic taggers (95 %) – Open class words: Named entity recognition (ex) Locations, Persons, Companies, Organizations, Position names; Local Context, Statistical Bias (F-Value 90, Domain Dependent) • Syntactic Analysis • Semantic Analysis • Context processing / Interpretation • Domain specific rules: Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007
FASTUS • Based on finite-state automata (FSA) – 1. Complex Words: Recognition of multi-words and proper names – 2. Basic Phrases: Simple noun groups, verb groups and particles – 3. Complex Phrases: Complex noun groups and verb groups – 4. Domain Events: Patterns for events of interest to the application; basic templates are to be built – 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event • General Framework of NLP: Morphological and Lexical Processing, Syntactic Analysis, Semantic Analysis, Context processing / Interpretation Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester IS 240 – Spring 2007. 04. 26 - SLIDES 42-44
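The cascade idea can be sketched with regular expressions over part-of-speech tags, since each regular expression is a finite-state recognizer. This is only an illustration of stages 2 and 4 under stated assumptions: the patterns `NOUN_GROUP` and `VERB_GROUP`, the function names and the agent/action/target event template are invented for the example and are not the FASTUS grammar. The tagged tokens come from the lecture's example sentence.

```python
import re

# Tagged tokens from the lecture's example sentence (Penn-style tags).
TAGGED = [("a", "DT"), ("Russian", "JJ"), ("tank", "NN"),
          ("invaded", "VBD"), ("Wisconsin", "NNP")]

# Each pattern is a regular expression over a space-terminated string of tags.
# The patterns are illustrative guesses, not the actual FASTUS rules.
NOUN_GROUP = re.compile(r"(DT )?(JJ )*(NNP?S? )+")
VERB_GROUP = re.compile(r"(VB[DZNP]? )+")


def chunk(tagged):
    """Stage 2 sketch: group tokens into noun groups (NG) and verb groups (VG)."""
    chunks, i = [], 0
    while i < len(tagged):
        tag_string = "".join(tag + " " for _, tag in tagged[i:])
        for label, pattern in (("NG", NOUN_GROUP), ("VG", VERB_GROUP)):
            m = pattern.match(tag_string)
            if m and m.group(0):
                n = len(m.group(0).split())   # number of tags consumed
                chunks.append((label, [w for w, _ in tagged[i:i + n]]))
                i += n
                break
        else:
            chunks.append(("O", [tagged[i][0]]))  # token outside any group
            i += 1
    return chunks


def domain_event(chunks):
    """Stage 4 sketch: fill a hypothetical template from the pattern NG VG NG."""
    labels = [label for label, _ in chunks]
    for j in range(len(labels) - 2):
        if labels[j:j + 3] == ["NG", "VG", "NG"]:
            return {"agent": " ".join(chunks[j][1]),
                    "action": " ".join(chunks[j + 1][1]),
                    "target": " ".join(chunks[j + 2][1])}
    return None


event = domain_event(chunk(TAGGED))
```

Each stage consumes the output of the previous one, so later patterns work over groups rather than raw words; that layering is what lets simple finite-state rules ignore the parts of a sentence that are irrelevant to the application.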
Using NLP • Strzalkowski (in Reader) [Figure: Text → NLP → NLP repres → Dbase search, where the NLP stage consists of TAGGER → PARSER → TERMS] IS 240 – Spring 2007. 04. 26 - SLIDE 45
Using NLP • INPUT SENTENCE: The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin. • TAGGED SENTENCE: The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per IS 240 – Spring 2007. 04. 26 - SLIDE 46
Using NLP • TAGGED & STEMMED SENTENCE: the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per IS 240 – Spring 2007. 04. 26 - SLIDE 47
Using NLP PARSED SENTENCE [assert [[perf [have]][[verb[BE]] [subject [np[n PRESIDENT][t_pos THE] [adj[FORMER]][adj[SOVIET]]]] [adv EVER] [sub_ord[SINCE [[verb[INVADE]] [subject [np [n TANK][t_pos A] [adj [RUSSIAN]]]] [object [np [name [WISCONSIN]]]]] IS 240 – Spring 2007. 04. 26 - SLIDE 48
Using NLP • EXTRACTED TERMS & WEIGHTS: President 2.623519; soviet 5.416102; President+soviet 11.556747; president+former 14.594883; Hero 7.896426; hero+local 14.314775; Invade 8.435012; tank 6.848128; Tank+invade 17.402237; tank+russian 16.030809; Russian 7.383342; wisconsin 7.785689 IS 240 – Spring 2007. 04. 26 - SLIDE 49
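Some of the compound terms on this slide (president+former, president+soviet, hero+local, tank+russian) pair a noun with its adjective modifiers. A rough approximation of that pairing can be computed from the tagged and stemmed sentence alone, as sketched below. This is an assumed simplification, not Strzalkowski's method: his system derives pairs from the parse, which is how it also finds tank+invade, and the weights shown on the slide are not reproduced here. The function name is hypothetical.

```python
# Tagged and stemmed tokens from the lecture's example sentence.
TAGGED = ("the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt "
          "local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn "
          "invade/vbd wisconsin/np")


def head_modifier_pairs(tagged_text):
    """Pair each noun (head) with the adjectives that directly precede it."""
    pairs, pending_adjectives = [], []
    for token in tagged_text.split():
        word, tag = token.rsplit("/", 1)
        if tag == "jj":
            pending_adjectives.append(word)       # remember modifiers
        elif tag in ("nn", "np"):
            pairs.extend(f"{word}+{adj}" for adj in pending_adjectives)
            pending_adjectives = []
        else:
            pending_adjectives = []               # any other tag ends the run
    return pairs


pairs = head_modifier_pairs(TAGGED)
```

For the example sentence this yields president+former, president+soviet, hero+local and tank+russian; relations that cross phrase boundaries, such as subject and verb, need the parser output.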
Same Sentence, different sys • INPUT SENTENCE: The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin. • TAGGED SENTENCE (using uptagger from Tsujii): The/DT former/JJ Soviet/NNP President/NNP has/VBZ been/VBN a/DT local/JJ hero/NN ever/RB since/IN a/DT Russian/JJ tank/NN invaded/VBD Wisconsin/NNP ./. IS 240 – Spring 2007. 04. 26 - SLIDE 50
Same Sentence, different sys CHUNKED Sentence (chunkparser – Tsujii) (TOP (S (NP (DT The) (JJ former) (NNP Soviet) (NNP President) ) (VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero) ) (ADVP (RB ever) ) (SBAR (IN since) (S (NP (DT a) (JJ Russian) (NN tank) ) (VP (VBD invaded) (NP (NNP Wisconsin) ) ) ) (. . ) ) ) IS 240 – Spring 2007. 04. 26 - SLIDE 51
Same Sentence, different sys • Enju Parser [Table: predicate-argument output for the example sentence; each row gives a predicate word with its base form and POS tags, a relation label (ROOT, ARG1, ARG2 or MOD), and the argument word (been, President, hero, tank, Wisconsin, invaded, since) with its base form and POS tags] IS 240 – Spring 2007. 04. 26 - SLIDE 52
NLP & IR • Indexing – Use of NLP methods to identify phrases • Test weighting schemes for phrases – Use of more sophisticated morphological analysis • Searching – Use of two-stage retrieval • Statistical retrieval • Followed by more sophisticated NLP filtering IS 240 – Spring 2007. 04. 26 - SLIDE 53
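Two-stage retrieval can be sketched as a cheap statistical ranking followed by a stricter filter applied only to the top candidates. In the sketch below the toy collection `DOCS`, the tf-idf scoring and the function names are assumptions for illustration, and the second stage is a plain adjacent-phrase check standing in for the "more sophisticated NLP filtering" the slide refers to.

```python
import math
from collections import Counter

# Hypothetical toy collection.
DOCS = {
    "d1": "a russian tank invaded the town",
    "d2": "the tank of russian vodka was empty",
    "d3": "the president visited wisconsin",
}


def tfidf_rank(query, docs, k=2):
    """Stage 1: rank documents by a simple tf-idf sum over the query terms."""
    n = len(docs)
    tokenized = {d: text.split() for d, text in docs.items()}
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        scores[d] = sum(tf[t] * math.log(n / df[t])
                        for t in query.split() if t in df)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ranked if scores[d] > 0][:k]


def phrase_filter(phrase, candidates, docs):
    """Stage 2 stand-in: keep candidates containing the words as an adjacent phrase."""
    return [d for d in candidates if f" {phrase} " in f" {docs[d]} "]


candidates = tfidf_rank("russian tank", DOCS)
results = phrase_filter("russian tank", candidates, DOCS)
```

Both d1 and d2 contain the two query words and score equally in the bag-of-words stage; only d1 survives the second stage, because there the words form a phrase. Running the expensive step over a short candidate list is the point of the design.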
NLP & IR • Lewis and Sparck Jones suggest research in three areas – Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms – The classificatory structure over document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching – Using NLP-based methods for searching and matching IS 240 – Spring 2007. 04. 26 - SLIDE 54
NLP & IR Issues • Is natural language indexing using more NLP knowledge needed? • Or should controlled vocabularies be used? • Can NLP in its current state provide the improvements needed? • How to test? IS 240 – Spring 2007. 04. 26 - SLIDE 55
NLP & IR • New “Question Answering” track at TREC has been exploring these areas – Usually statistical methods are used to retrieve candidate documents – NLP techniques are used to extract the likely answers from the text of the documents IS 240 – Spring 2007. 04. 26 - SLIDE 56


