CSA 4050: Advanced Topics in NLP
Information Extraction I: What is Information Extraction?
November 2003

Sources
• R. Gaizauskas and Y. Wilks, Information Extraction: Beyond Document Retrieval. Technical Report CS-97-10, Department of Computer Science, University of Sheffield, 1997.

What is Information Extraction?
• IE: the analysis of unrestricted text in order to extract information about pre-specified types of entities, relationships, and events.
• Typically, the text is newspaper text or a newswire feed.
• Typically, the pre-specified structure is a class-like object with different data fields.

An Example of Information Extraction
Source text: "19 March – A bomb went off near a power tower in San Salvador, leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb, allegedly detonated by urban guerrilla commandos, blew up a power tower in the northwestern part of San Salvador."
Template structure:
– Incident Type: bombing
– Date: March 19
– Location: San Salvador
– Perpetrator: urban guerrilla commandos
– Target: power tower
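The "class-like object with data fields" from the previous slide maps naturally onto a record type. A minimal sketch in Python (the class name and field names are mine, chosen to mirror the template above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentTemplate:
    """A class-like IE template: one field per pre-specified slot."""
    incident_type: Optional[str] = None
    date: Optional[str] = None
    location: Optional[str] = None
    perpetrator: Optional[str] = None
    target: Optional[str] = None

# Filled template for the San Salvador bombing report above
filled = IncidentTemplate(
    incident_type="bombing",
    date="March 19",
    location="San Salvador",
    perpetrator="urban guerrilla commandos",
    target="power tower",
)
print(filled)
```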

• Different levels of structure can be envisaged:
– Named Entities
– Relationships
– Events
– Scenarios

Examples of Named Entities
• People – John Smith, John, Mr. Smith
• Locations – EU, The Hague, SLT, Piazza Tuta
• Organisations – IBM, The Mizzi Group, University of Malta
• Numerical Quantities – Lm 10, forty per cent, 40%, $10
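As a toy illustration of these entity classes, here is a hedged sketch combining a small gazetteer lookup with a regular expression for numerical quantities; the name lists and patterns are invented for this example, not part of the slides:

```python
import re

# Tiny gazetteer for two of the entity classes above (illustrative only)
GAZETTEER = {
    "ORGANISATION": {"IBM", "The Mizzi Group", "University of Malta"},
    "LOCATION": {"EU", "The Hague", "Piazza Tuta"},
}

# Numerical quantities: Maltese lira amounts, dollar amounts, percentages
QUANTITY = re.compile(r"Lm\s?\d+|\$\d+|\d+\s?%")

def tag_entities(text: str):
    """Return (label, surface form) pairs found in the text."""
    entities = []
    for label, names in GAZETTEER.items():
        for name in names:
            if name in text:
                entities.append((label, name))
    entities += [("QUANTITY", m.group()) for m in QUANTITY.finditer(text)]
    return entities

print(tag_entities("IBM paid Lm 10, i.e. 40% of the deal, in The Hague."))
```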

Examples of Relationships between Named Entities
George Bush₁ is [President₂ of the United States₃]₄
– nation(3)
– president(1, 3)
– coref(1, 4)
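The predicate notation translates directly into data. A small sketch (the mention table and the printing loop are illustrative additions):

```python
# Entity mentions indexed as on the slide; relations stored as
# (predicate, argument indices) tuples.
mentions = {
    1: "George Bush",
    2: "President",
    3: "the United States",
    4: "President of the United States",  # the whole bracketed phrase
}

relations = [
    ("nation", (3,)),       # mention 3 denotes a nation
    ("president", (1, 3)),  # mention 1 is president of mention 3
    ("coref", (1, 4)),      # mentions 1 and 4 denote the same individual
]

for predicate, args in relations:
    print(f"{predicate}({', '.join(mentions[i] for i in args)})")
```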

Examples of Events
• Financial events
– Takeover bids
– Changes of management
• Socio-political events
– Terrorist attacks
– Traffic accidents
• Geographical events
– Natural disasters

Some Differences between IE and IR
• IE extracts relevant information from documents; Information Retrieval (IR) retrieves relevant documents from a collection.
• IE has emerged from research into rule-based systems in computational linguistics; IR is mostly influenced by information theory, probability, and statistics.
• IE is typically based on some kind of linguistic analysis of the source text; IR typically uses a bag-of-words model of the source text.

Why Linguistic Analysis is Necessary
• Active/passive distinction
– BNC Holdings named Ms G. Torretta to succeed Mr. N. Andrews as new chairperson
– Nicholas Andrews was named by Gina Torretta as chairperson of BNC Holdings
• Use of different phrases to mean the same thing
– Ms. Gina Torretta took the helm at BNC Holdings. She succeeds Nick Andrews.
– G. Torretta succeeds N. Andrews as chairperson at BNC Holdings.
• Establishing coreferences
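To make the active/passive point concrete, here is a deliberately naive sketch: two hand-written patterns, one per voice, normalise simplified variants of the sentences above to the same relation tuple. Real systems obtain this from syntactic analysis; the patterns and sentences are invented:

```python
import re

# One pattern per voice for a "succeeds as chairperson" relation.
# Regexes stand in for a parser purely for illustration.
ACTIVE = re.compile(r"(?P<new>[\w. ]+?) succeeds (?P<old>[\w. ]+?) as chairperson")
PASSIVE = re.compile(r"(?P<old>[\w. ]+?) was succeeded by (?P<new>[\w. ]+?) as chairperson")

def extract_succession(sentence: str):
    """Map either voice onto the same (new_chair, old_chair) relation."""
    for pattern in (ACTIVE, PASSIVE):
        m = pattern.search(sentence)
        if m:
            return ("succeeds", m.group("new").strip(), m.group("old").strip())
    return None

print(extract_succession("G. Torretta succeeds N. Andrews as chairperson at BNC Holdings"))
print(extract_succession("N. Andrews was succeeded by G. Torretta as chairperson of BNC Holdings"))
```

Both calls print the same tuple, which is exactly the normalisation a surface bag-of-words model cannot deliver.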

Brief History
• 1960-80: N. Sager, Linguistic String Project: automatically induced information formats for radiology reports.
• 1970s: R. Schank: scripts.
• 1982: G. DeJong, FRUMP: "sketchy scripts" used to process UPI newswire stories in domains such as earthquakes and labour strikes; systematic evaluation.
• 1983: J-P. Zarri: analysis of historical texts by translating text into a semantic metalanguage.
• 1986: ATRANS (S. Lytinen et al.): script-based system for analysis of money-transfer messages between banks.
• 1992: Carnegie Group, JASPER: skims company press releases to fill in templates concerning earnings and dividends.

Message Understanding Conferences
• Conferences aimed at comparing the performance of a number of systems working on IE from naval messages.
• Sponsored by DARPA and organised by the US Naval Command Centre, San Diego.
– Progressively more difficult tasks.
– Progressively more refined evaluation measures.

MUC Tasks
• MUC-1: tactical naval operations reports on ship sightings and engagements. No task definition; no evaluation criteria.
• MUC-3: newswire stories about terrorist attacks. 18-slot templates to be filled. Formal evaluation criteria supplied.
• MUC-6: specific subtasks including named entity recognition, coreference identification, and scenario template extraction.

IE Subtasks
• Named Entity recognition (NE) – finds and classifies names, places, etc.
• Coreference resolution (CO) – identifies identity relations between entities in texts.
• Template Element construction (TE) – adds descriptive information to NE results (using CO).
• Template Relation construction (TR) – finds relations between TE entities.
• Scenario Template production (ST) – fits TE and TR results into specified event scenarios.
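These five subtasks are typically chained into a pipeline. The sketch below shows only the chaining; every function body is a placeholder stub, not a real implementation:

```python
# Stub-level sketch of the NE -> CO -> TE -> TR -> ST pipeline.

def recognise_entities(text):                 # NE
    return [("PERSON", "Gina Torretta"), ("ORG", "BNC Holdings")]

def resolve_coreference(text, entities):      # CO
    return {("PERSON", "Gina Torretta"): ["Ms. Torretta", "She"]}

def build_template_elements(entities, corefs):    # TE
    return [{"type": t, "name": n} for t, n in entities]

def build_template_relations(elements):       # TR
    return [("works_for", elements[0], elements[1])]

def fill_scenario_template(elements, relations):  # ST
    return {"scenario": "management_change", "relations": relations}

def extract(text):
    entities = recognise_entities(text)
    corefs = resolve_coreference(text, entities)
    elements = build_template_elements(entities, corefs)
    relations = build_template_relations(elements)
    return fill_scenario_template(elements, relations)

print(extract("Ms. Gina Torretta took the helm at BNC Holdings."))
```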

Evaluation: the IR Starting Point
(Venn diagram: the target set and the selected set overlap; the intersection contains the true positives, selected items outside the target are false positives, and target items not selected are false negatives.)

Evaluation Metrics
• Starting points are those used for IR, namely recall and precision.

                 Relevant         Not relevant
  Retrieved      tp (true pos)    fp (false pos)
  Not retrieved  fn (false neg)   tn (true neg)

IR Measures: Precision and Recall
• Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
  P = tp/(tp + fp)
• Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
  R = tp/(tp + fn)
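Both formulas translate directly into code; a minimal sketch (the zero-division guards and the sample counts are additions):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of retrieved items that are relevant: tp / (tp + fp)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of relevant items that are retrieved: tp / (tp + fn)."""
    return tp / (tp + fn) if tp + fn else 0.0

# e.g. 6 relevant docs retrieved, 2 irrelevant retrieved, 4 relevant missed
print(precision(6, 2))  # 0.75
print(recall(6, 4))     # 0.6
```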

F-Measure
• Whatever method is chosen to establish P and R, there is a trade-off between them.
• For this reason researchers often use a measure which combines the two.
• F = 1/(α/P + (1 - α)/R) is commonly used, where α is a factor which determines the weighting between P and R.
• When α = 0.5 the formula reduces to the harmonic mean 2PR/(P + R).
• Clearly F is weighted towards P as α approaches 1.
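A sketch of the weighted F-measure, numerically confirming the α = 0.5 reduction to the harmonic mean (the sample P and R values are invented):

```python
def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    """Weighted harmonic combination F = 1 / (alpha/P + (1 - alpha)/R)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

p, r = 0.75, 0.6
print(f_measure(p, r))       # alpha = 0.5: 0.666...
print(2 * p * r / (p + r))   # same value: the plain harmonic mean
print(f_measure(p, r, 0.9))  # ~0.73, pulled towards P = 0.75
```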

Harmonic Mean

  x    y    arithmetic mean   geometric mean   harmonic mean
  50   50   50                50               50
  40   60   50                49               48
  30   70   50                46               42
  20   80   50                40               32
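The table can be regenerated in a few lines; note the slide rounds the geometric means to integers:

```python
from math import sqrt

def means(x: float, y: float):
    arithmetic = (x + y) / 2
    geometric = sqrt(x * y)
    harmonic = 2 * x * y / (x + y)
    return arithmetic, geometric, harmonic

for x, y in [(50, 50), (40, 60), (30, 70), (20, 80)]:
    a, g, h = means(x, y)
    print(f"{x:>3} {y:>3}  arith={a:5.1f}  geom={g:5.1f}  harm={h:5.1f}")
```

The harmonic mean falls fastest as x and y diverge, which is exactly why F at α = 0.5 punishes an imbalance between P and R.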

Evaluation Metrics for IE
• For IE, these measures need to be related to the activity of slot-filling:
– Slot fills can be correct, partially correct, incorrect, missing, or spurious.
– These differences permit the introduction of finer-grained measures of correctness that include overgeneration, undergeneration, and substitution.
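One convention, used by the MUC scoring program, gives partially correct fills half credit; the sketch below assumes that convention, and the counts are invented:

```python
def muc_scores(correct: int, partial: int, incorrect: int,
               missing: int, spurious: int):
    """Slot-filling recall/precision with half credit for partial fills
    (assumed here to follow the MUC scorer's convention)."""
    possible = correct + partial + incorrect + missing   # slots in the answer key
    actual = correct + partial + incorrect + spurious    # fills the system produced
    credited = correct + 0.5 * partial
    recall = credited / possible if possible else 0.0
    precision = credited / actual if actual else 0.0
    return recall, precision

print(muc_scores(correct=6, partial=2, incorrect=1, missing=1, spurious=2))
# (0.7, 0.636...): 7 credited fills out of 10 possible and 11 actual
```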

Recall
• Recall is a measure of how much relevant information a system has extracted from the text.
• It is the ratio of how much information is actually extracted against how much information there is to be extracted, i.e.
  Recall = count of facts extracted / count of possible facts

Precision
• Precision is a measure of how accurate a system is in extracting information.
• It is the ratio of how much correct information is actually extracted against how much information is extracted, i.e.
  Precision = count of correct facts extracted / count of facts extracted
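A worked example with invented counts; note that recall's numerator should count correctly extracted facts, in line with the precision definition above:

```python
possible_facts = 10  # facts in the answer key (assumed for illustration)
extracted = 8        # fills the system produced
correct = 6          # of which this many are correct

recall = correct / possible_facts  # 0.6
precision = correct / extracted    # 0.75
print(recall, precision)
```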

Bare Bones Architecture (from Appelt and Israel 1999)
• Tokenisation – word segmentation
• Morphological & Lexical Processing – POS tagging, word sense tagging
• Syntactic Analysis – preparsing, parsing
• Discourse Analysis – coreference

Generic IE System (Hobbs 1993)
Text Zoner → Preprocessor → Filter → Preparser → Parser → Fragment Combiner → Semantic Interpreter → Lexical Disambiguator → Coreference Resolution → Template Generator

Large Scale IE: LaSIE
• General-purpose IE research system geared towards the MUC-6 tasks.
• Pipelined system with three principal processing stages:
– Lexical preprocessing
– Parsing and semantic interpretation
– Discourse interpretation

LaSIE: Processing Stages
• Lexical preprocessing: reads, tokenises, and tags raw input text.
• Parsing and semantic interpretation: chart parser; best-parse selection; construction of predicate/argument structure.
• Discourse interpretation: adds information from the predicate-argument representation to a world model in the form of a hierarchically structured semantic net.

LaSIE Parse Forest
• It is rare that the analysis contains a unique spanning parse.
• Selection of the best parse is carried out by choosing the sequence of non-overlapping, semantically interpretable categories that covers the most words and consists of the fewest constituents.
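The selection heuristic just described can be sketched as a simple scoring rule over candidate fragment sequences (the fragment representation and the candidates are invented):

```python
# Fragment = (start, end, label) over word positions; candidate sequences of
# non-overlapping fragments are assumed to have been enumerated already.

def score(sequence):
    """Prefer most words covered, then fewest constituents (per the slide)."""
    words_covered = sum(end - start for start, end, _ in sequence)
    constituents = len(sequence)
    return (words_covered, -constituents)

candidates = [
    [(0, 4, "NP"), (4, 9, "VP")],                # 9 words, 2 constituents
    [(0, 2, "NP"), (2, 4, "PP"), (4, 9, "VP")],  # 9 words, 3 constituents
    [(0, 4, "NP"), (5, 8, "VP")],                # 7 words, 2 constituents
]
best = max(candidates, key=score)
print(best)  # the 9-word, 2-constituent sequence wins
```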

LaSIE Discourse Model
(Diagram of the hierarchically structured semantic net; not reproduced here.)

Example Applications of IE
• Finance
• Medicine
• Law
• Police
• Academic research

Future Trends
• Better performance: higher precision and recall.
• User-defined (not expert-defined) IE: minimisation of the role of the expert.
• Integration with other technologies (e.g. IR).
• Multilingual IE.