389d2e837e7a767698a057a8c8617af3.ppt
- Number of slides: 26
Robust Reading: Identification and Tracing of Ambiguous Names Xin Li, Paul Morie, Dan Roth University of Illinois at Urbana-Champaign Presented by Xin Li, UIUC
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, "I do not speak for my church on public matters, and the church does not speak for me."
Document 3: David Kennedy was born in Leicester, England in 1959. … Kennedy coedited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
Why is the Robust Reading problem important?
- Most of the work in NLP is done at the level of mentions.
- We would like to start moving from the mention level to the concept level.
Our solution
- We develop a global probabilistic view of how documents are generated and how names are "sprinkled" into them.
- We formulate the problem as learning the model parameters and making inferences with the learned model.
- Our experimental study showed promising results on New York Times news articles.
Outline
- A generative model of document generation, with three model relaxations.
- Learn the models in a completely unsupervised setting.
- Experimental results.
- Conclusion and future directions.
Generate Document d
The Justice Department has officially ended its inquiry into the assassinations of President John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963. … President Kennedy … JFK …
At the beginning, we have a set of entities in mind.
- A set of entities E: The Justice Department, Dallas, David Kennedy, The House Assassinations Committee
First Step: Select a subset of entities for d.
- Underlying probability distribution: $P(E^d)$.
- $E^d$ (entities in d): The Justice Department, Dallas, The House Assassinations Committee
Second Step: For each entity e, select a representative r.
- Underlying probability distributions: $P(r \mid e)$ and $P(R^d \mid E^d) = \prod_i P(r^d_i \mid e^d_i)$.
- $R^d$ (representatives in d): e.g. President John F. Kennedy
Third Step: For each representative r, select a set of mentions.
- Underlying probability distributions: $P(m \mid r)$ and $P(M^d \mid R^d) \sim \prod_i \prod_{m \in M^d_i} P(m \mid r^d_i)$.
- $M^d$ (actual mentions in d): e.g. President John F. Kennedy, JFK, President Kennedy
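Summary of the generation process (E: the set of all entities e; d: a document)
- Step 1, $P(E^d)$: select the entities $E^d = \{e^d_i\}$.
- Step 2, $P(r \mid e)$: select the representatives $R^d = \{r^d_i\}$, e.g. President John F. Kennedy; House of Representatives.
- Step 3, $P(m \mid r)$: select the mention sets $M^d = \{M^d_i\}$, e.g. {President Kennedy, JFK}; {House of Representatives, The House}.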
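The three generation steps can be sketched as a small sampler. This is only an illustration under assumed toy distributions: the tables `P_E`, `P_R_GIVEN_E`, `P_M_GIVEN_R`, their numbers, and the helper names `draw` and `generate_document` are hypothetical and do not come from the talk. Step 1 is simplified here to independent draws from P(e) with duplicates merged.

```python
import random

# Hypothetical toy distributions; the names and numbers are illustrative only.
P_E = {"John F. Kennedy": 0.5, "House of Representatives": 0.3, "Dallas": 0.2}
P_R_GIVEN_E = {
    "John F. Kennedy": {"President John F. Kennedy": 1.0},
    "House of Representatives": {"House of Representatives": 1.0},
    "Dallas": {"Dallas": 1.0},
}
P_M_GIVEN_R = {
    "President John F. Kennedy": {"President Kennedy": 0.5, "JFK": 0.3, "Kennedy": 0.2},
    "House of Representatives": {"The House": 0.6, "House of Representatives": 0.4},
    "Dallas": {"Dallas": 1.0},
}


def draw(dist, rng):
    """Draw one key from a {value: probability} dictionary."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]


def generate_document(n_draws, n_mentions, rng):
    """Return (entities, representatives, mentions) for one toy document."""
    # Step 1: select the entities of the document (assumed: i.i.d. draws, merged).
    entities = sorted({draw(P_E, rng) for _ in range(n_draws)})
    # Step 2: select one representative r for each entity e via P(r | e).
    reps = {e: draw(P_R_GIVEN_E[e], rng) for e in entities}
    # Step 3: select mentions for each representative r via P(m | r).
    mentions = {
        r: [draw(P_M_GIVEN_R[r], rng) for _ in range(n_mentions)]
        for r in reps.values()
    }
    return entities, reps, mentions


E_d, R_d, M_d = generate_document(n_draws=3, n_mentions=2, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; a real model would learn the three tables rather than fix them by hand.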
Robust Reading
- Assuming we have the model, the fundamental problem is to decide which entities are mentioned in a given document and which entity each mention most likely refers to.
- $(E^d, R^d)^* = \arg\max_{(E^d, R^d)} P(E^d, R^d \mid M^d, \Theta) = \arg\max_{(E^d, R^d)} P(E^d, R^d, M^d \mid \Theta)$
Figure: Models I, II, and III, drawn over the generation structure E (e), $E^d$ ($e^d_i$), $R^d$ ($r^d_i$), $M^d$ ($M^d_i$) for a document d.
Model I (the simplest)
- Example: President Kennedy
- Entities $e_1, e_2, e_3, \ldots$ and mentions $m_1, m_2, m_3, \ldots$
- $P(D) = P(\{(e_i, m_i)\}) = \prod_i P(e_i)\, P(m_i \mid e_i)$
- The most likely entity $e^*$ for a mention m: $e^* = \arg\max_{e \in E} P(e \mid m, \Theta) = \arg\max_{e \in E} P(e)\, P(m \mid e)$.
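The Model I decision rule, choosing the entity that maximizes P(e) P(m | e), can be sketched in a few lines. The tables `P_E` and `P_M_GIVEN_E`, their values, and the function name `most_likely_entity` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy parameters for Model I; the numbers are illustrative only.
P_E = {"John F. Kennedy": 0.7, "David Kennedy": 0.3}
P_M_GIVEN_E = {
    "John F. Kennedy": {"President Kennedy": 0.4, "Kennedy": 0.4, "JFK": 0.2},
    "David Kennedy": {"Kennedy": 0.6, "David Kennedy": 0.4},
}


def most_likely_entity(mention):
    """Return argmax_e P(e) * P(m | e), or None if no entity can emit the mention."""
    scores = {e: P_E[e] * P_M_GIVEN_E[e].get(mention, 0.0) for e in P_E}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.0 else None


print(most_likely_entity("Kennedy"))        # John F. Kennedy (0.28 vs 0.18)
print(most_likely_entity("David Kennedy"))  # David Kennedy
```

With these toy numbers the ambiguous mention "Kennedy" goes to the entity with the larger prior, which is exactly the weakness that a document-level view of the entities is meant to address.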
Model II
- Example: President John F. Kennedy, JFK, Kennedy, President Jonh F. Kennedy
- $P(E^d) = \prod_i P(e^d_i)$
Model III (least relaxation)
- Mentions are selected independently, while entities are selected according to a Markov chain.
- $P(E^d) = P(e^d_0) \prod_{i \geq 1} P(e^d_i \mid e^d_{i-1})$
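The difference between the two entity models can be made concrete by scoring the same entity sequence both ways: independently (Model II) and with a first-order Markov chain (Model III). The sketch below works in log space; the tables `P_E` and `P_NEXT` and their values are hypothetical, and the Markov version assumes a non-empty sequence.

```python
import math

# Hypothetical toy parameters; the values are illustrative only.
P_E = {"Kennedy": 0.5, "Oswald": 0.3, "Dallas": 0.2}
P_NEXT = {  # P(e_i | e_{i-1})
    "Kennedy": {"Kennedy": 0.1, "Oswald": 0.6, "Dallas": 0.3},
    "Oswald": {"Kennedy": 0.5, "Oswald": 0.1, "Dallas": 0.4},
    "Dallas": {"Kennedy": 0.4, "Oswald": 0.4, "Dallas": 0.2},
}


def log_p_model2(entities):
    """Model II: entities are selected independently, so log-probabilities add."""
    return sum(math.log(P_E[e]) for e in entities)


def log_p_model3(entities):
    """Model III: the first entity uses P(e); each later one depends on its predecessor."""
    log_p = math.log(P_E[entities[0]])
    for prev, cur in zip(entities, entities[1:]):
        log_p += math.log(P_NEXT[prev][cur])
    return log_p


seq = ["Kennedy", "Oswald", "Dallas"]
print(math.exp(log_p_model2(seq)))  # 0.5 * 0.3 * 0.2
print(math.exp(log_p_model3(seq)))  # 0.5 * 0.6 * 0.4
```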
Learning the Models
- Unsupervised learning: assuming only that we know $M^d$ for each d in D, we hope to learn $\Theta$ from D and also to label D.
- Truncated EM algorithm, a greedy search algorithm:
  - Initialization: perform local clustering of mentions to find <$E^d$, $R^d$> inside each document, based on a simple similarity metric between names.
  - M-step: $\Theta^* = \arg\max_{\Theta} P(D = \{\langle E^d, R^d, M^d \rangle\} \mid \Theta)$.
  - E-step: $(E^d, R^d)^* = \arg\max_{(E^d, R^d)} P(D = \{\langle E^d, R^d, M^d \rangle\} \mid \Theta)$.
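The alternation of initialization, M-step, and E-step can be sketched as a hard-assignment EM loop. This is a heavily simplified stand-in, not the models from the talk: it assumes a flat list of mentions, uses "same last token" as the simple similarity for initialization, and replaces the appearance model with a smoothed bag-of-tokens distribution per entity. All function names and the smoothing constant `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def init_by_last_token(mentions):
    """Initialization (assumed similarity): mentions sharing a last token share a label."""
    return [m.split()[-1] for m in mentions]


def m_step(mentions, labels, alpha=0.1):
    """Estimate P(e) and per-entity token distributions by smoothed counting."""
    vocab = {t for m in mentions for t in m.split()}
    prior = Counter(labels)
    tokens = defaultdict(Counter)
    for m, e in zip(mentions, labels):
        tokens[e].update(m.split())

    def log_joint(m, e):
        # log P(e) + sum over tokens of log P(token | e), with add-alpha smoothing.
        total = sum(tokens[e].values()) + alpha * len(vocab)
        log_p = math.log(prior[e] / len(labels))
        for t in m.split():
            log_p += math.log((tokens[e][t] + alpha) / total)
        return log_p

    return sorted(prior), log_joint


def e_step(mentions, entities, log_joint):
    """Hard E-step: keep only the single most likely entity for each mention."""
    return [max(entities, key=lambda e: log_joint(m, e)) for m in mentions]


def truncated_em(mentions, max_iters=10):
    labels = init_by_last_token(mentions)
    for _ in range(max_iters):
        entities, log_joint = m_step(mentions, labels)
        new_labels = e_step(mentions, entities, log_joint)
        if new_labels == labels:  # stop once the labeling is stable
            break
        labels = new_labels
    return labels


mentions = ["President John F. Kennedy", "President Kennedy", "Kennedy",
            "David Kennedy", "Dallas"]
labels = truncated_em(mentions)
```

On this toy input the loop converges immediately and still merges "David Kennedy" with the other Kennedy mentions, which shows how much the outcome depends on the quality of the initialization and of the appearance model.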
Parameter Estimation
- In the learning process, assume we have obtained labeled documents $D = \{(E^d, R^d, M^d)\}$ from the previous initialization or E-step.
- We perform maximum likelihood estimation of the model parameters in each M-step.
- The parameters are $P(e)$, $P(e_2 \mid e_1)$, and the appearance probability $P_{W|W}$ (for example, $P(m \mid r)$).
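For the entity parameters, maximum likelihood estimation from labeled documents reduces to relative-frequency counting. The sketch below assumes each labeled document is given as an ordered list of entity names; the function name `estimate_entity_parameters` and the example data are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_entity_parameters(labeled_docs):
    """MLE of P(e) and P(e2 | e1) from per-document entity sequences."""
    unigram = Counter()
    bigram = defaultdict(Counter)
    for entities in labeled_docs:
        unigram.update(entities)
        for e1, e2 in zip(entities, entities[1:]):
            bigram[e1][e2] += 1
    total = sum(unigram.values())
    p_e = {e: c / total for e, c in unigram.items()}
    p_next = {
        e1: {e2: c / sum(nxt.values()) for e2, c in nxt.items()}
        for e1, nxt in bigram.items()
    }
    return p_e, p_next


docs = [["Kennedy", "Oswald", "Dallas"], ["Kennedy", "Dallas"]]
p_e, p_next = estimate_entity_parameters(docs)
print(p_e["Kennedy"])             # 2 of 5 entity occurrences
print(p_next["Kennedy"]["Oswald"])  # 1 of 2 transitions out of "Kennedy"
```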
The Appearance Probability P(m|r)
- Example: representative President John F. Kennedy; mentions Kennedy, JFK, President Kennedy.
- Appearance probability: the probability of one name being transformed from another, $P(\text{President Kennedy} \mid \text{President John F. Kennedy}) = \prod_k P(v_k' \mid v_k)$, where $v_k$ and $v_k'$ are the values of attribute k in the representative and in the mention.
- Attributes: FirstName, LastName, Title, Suffix, Gender.
- $P(v_k' \mid v_k)$ is modelled relationally as a multinomial distribution over a set of predefined values: Identical Writing, Typical Transformation, Non-typical Transformation, Missing Value.
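The attribute-wise product can be sketched as follows. Everything specific here is an assumption made for illustration: the table `P_TYPE` and its numbers, the restriction to three attributes, and in particular the rule that counts abbreviation to an initial as the "typical" transformation.

```python
# Hypothetical multinomial parameters P(type | attribute); illustrative only.
P_TYPE = {
    "title": {"identical": 0.6, "typical": 0.05, "non_typical": 0.05, "missing": 0.3},
    "first_name": {"identical": 0.4, "typical": 0.2, "non_typical": 0.05, "missing": 0.35},
    "last_name": {"identical": 0.85, "typical": 0.05, "non_typical": 0.05, "missing": 0.05},
}


def transformation_type(source, target):
    """Classify how the attribute value `source` appears as `target` in the mention."""
    if target is None:
        return "missing"
    if source == target:
        return "identical"
    # Assumed notion of "typical": abbreviation to an initial, e.g. John -> J.
    if source is not None and target.rstrip(".") == source[:1]:
        return "typical"
    return "non_typical"


def appearance_probability(representative, mention):
    """P(m | r) as a product over attributes of P(v_k' | v_k)."""
    prob = 1.0
    for attribute, table in P_TYPE.items():
        kind = transformation_type(representative.get(attribute), mention.get(attribute))
        prob *= table[kind]
    return prob


rep = {"title": "President", "first_name": "John", "last_name": "Kennedy"}
president_kennedy = {"title": "President", "last_name": "Kennedy"}
j_kennedy = {"first_name": "J.", "last_name": "Kennedy"}
print(appearance_probability(rep, president_kennedy))  # 0.6 * 0.35 * 0.85
print(appearance_probability(rep, j_kennedy))          # 0.3 * 0.2 * 0.85
```

Tying the probability to the transformation type rather than to the actual strings is what lets a small multinomial cover names never seen during learning.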
Experimental Setting
- Data: 300 TREC documents (New York Times), 8000 mentions, 2000 entities.
- Processed with a named entity recognizer: People, Location and Organization.
- Each pair of names is a test example; there are 130,000 positive examples.
- Evaluation: Precision, Recall and F1.
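Pairwise evaluation of this kind can be sketched as below: every pair of mentions is a test example, positive when both mentions refer to the same entity. The function name `pairwise_prf` and the label-list input format are assumptions; the exact evaluation script used in the experiments is not described in the slides.

```python
from itertools import combinations


def pairwise_prf(gold, predicted):
    """Precision, recall and F1 over all mention pairs.

    `gold` and `predicted` hold one entity label per mention, in the same order.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = predicted[i] == predicted[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["jfk", "jfk", "jfk", "david"]
predicted = ["a", "a", "b", "b"]
print(pairwise_prf(gold, predicted))  # one true positive, one false positive, two misses
```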
Performance
- Baseline: predict (m1, m2) as positive iff they have identical writings.
- Discriminative: cluster based on the SoftTFIDF entity similarity metric.
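The identical-writing baseline is simple enough to state directly in code. The sketch assumes exact string equality with no normalization; the function name `identical_writing_pairs` and the example mentions are hypothetical.

```python
from itertools import combinations


def identical_writing_pairs(mentions):
    """Return the index pairs (i, j), i < j, that the baseline predicts as positive."""
    return [
        (i, j)
        for i, j in combinations(range(len(mentions)), 2)
        if mentions[i] == mentions[j]
    ]


mentions = ["Kennedy", "President Kennedy", "Kennedy", "JFK"]
positive_pairs = identical_writing_pairs(mentions)
print(positive_pairs)  # [(0, 2)]
```

Such a baseline cannot link "JFK" to "President Kennedy", and it links every bare "Kennedy" regardless of which person is meant.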
Conclusions
- We presented an unsupervised learning approach to the "Robust Reading" problem.
- We designed a generative model that describes the natural generation process of a document and how names are "sprinkled" into it.
- Our model can achieve promising results (89% F1) on news articles.
Future Directions
- Integrate with more contextual information.
- Integrate with general coreference resolution.
- Integrate with other NLP tasks.
Thank You!
Presented by Xin Li, xli1@uiuc.edu
A demo of this work is at http://l2r.cs.uiuc.edu/~cogcomp/eoh/index.html
The Basic Model
- A global probabilistic view of how documents are generated:
  - a joint distribution over entities, $P(E^d)$;
  - an "author" model that makes sure that at least one mention of an entity is easily identifiable, $P(r \mid e)$;
  - an appearance model governing how mentions are transformed from the "representative" mention, $P(m \mid r)$.
- $P(d) = P(E^d, R^d, M^d) = P(E^d)\, P(R^d \mid E^d)\, P(M^d \mid R^d)$
- $P(D) = \prod_{d \in D} P(d)$