- Number of slides: 59
Identity Resolution in Email Collections. Tamer Elsayed and Douglas W. Oard, Department of Computer Science, UMIACS, and iSchool. CLIP Colloquium, UMD, Feb 2009
Real Problem
- Clinton White House: 32 million emails
- Search request: Tobacco Policy
- Figures shown on the slide: 80,000 and 200,000
- National Archives hired 25 persons for 6 months …
Identity Resolution in Email: Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann
Enron Collection: 55 Sheila's !!
Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.. it's a SURPRISE !
X-From: Sager, Elizabeth
X-To: 'SStack@reliant.com@ENRON'
Hi Shari
Hope all is well. Count me in for the group present.
See ya next week if not earlier
Liza
Elizabeth Sager
713-853-6349
-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.. it's a SURPRISE !
Please call me (713) 207-5233
Thanks!
Shari
Candidate surnames overlaid on the email: weisman, jarnot, maynes, pardo, kirby, nacey, glover, knudsen, ferrarini, rich, boehringer, dey, jones, macleod, lutz, breeden, howard, glover, huckaby, wollam, darling, tweed, jortner, watson, mcintyre, neylon, perlick, chadwick, whanger, advani, birmingham, nagel, hester, kahanek, graves, kenner, foraker, mclaughlin, lewis, tasman, walton, venville, fisher, rappazzo, whitman, petitt, miller, berggren, Dombo, swatek, osowski, Robbins, hollis, kelly, chang
Rank Candidates
Generative Model
1. Choose “person” c to mention: p(c)
2. Choose appropriate “context” X to mention c: p(X | c)
3. Choose a “mention” l: p(l | X, c)
Example labels: “sheila”, GE conference call
3-Step Solution: (1) Identity Modeling; (2) Context Reconstruction; (3) Mention Resolution; Posterior Distribution
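The three generative factors p(c), p(X | c), and p(l | X, c) suggest how the posterior distribution over candidates might be obtained. The following is only a sketch via Bayes' rule; the explicit normalization over a candidate set $C$ is an assumption and is not spelled out on the slides:

```latex
p(c \mid l, X) \;=\;
  \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}
       {\sum_{c' \in C} p(c')\, p(X \mid c')\, p(l \mid X, c')}
```

Ranking candidates only requires the numerator, since the denominator is the same for every candidate c.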
Outline: Introduction and Approach Overview; Computational Model of Identity; Context Reconstruction; Mention Resolution; Evaluation on Existing Collections; Scalable MapReduce Implementation; New Test Collection; Conclusion and Future Work
“Easy References” of Identity
Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.. it's a SURPRISE !
X-From: Sager, Elizabeth
X-To: 'SStack@reliant.com@ENRON'
Hi Shari
Hope all is well. Count me in for the group present.
See ya next week if not earlier
Liza
Elizabeth Sager
713-853-6349
-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.. it's a SURPRISE !
Please call me (713) 207-5233
Thanks!
Shari
Annotations on the email: User Regularities, Email Standards, Email-Client Behavior
Representational Model of Identity (diagram labels, in slide order): sheila glover; 1170 (User Name); sheila.glover@enron.com; 216 (Signature); 19 (Salutation); 932 (Main Headers); 14 (Quoted Headers); sheila; sg; sheila glover; 19 (Signature)
77,240 “non-trivial” models
Computational Model of Identity: identity c, name type t, observed mention m
Candidates. Likelihood: p(“sheila” | c). Diagram labels: Identity Models, Candidates
Outline: Introduction and Approach Overview; Computational Model of Identity; Context Reconstruction; Mention Resolution; Evaluation on Existing Collections; Scalable MapReduce Implementation; New Test Collection; Conclusion and Future Work
Contextual Space: Topical Context, Conversational Context, Local Context
Topical Context: Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann
Contextual Space: Social Context, Topical Context, Conversational Context, Local Context
Social Context: Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann
Formally
- A context of an email is a probability distribution over emails
  - Probability estimated based on type of context
- Contextual Space is a linear combination of 4 contexts
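To make the linear combination concrete, one way to write it is shown below. This is a sketch: the mixing weights $\lambda_k$, their non-negativity, and the sum-to-one constraint are assumptions introduced here so that the combined context remains a probability distribution over emails.

```latex
p(e' \mid e) \;=\; \sum_{k \in \{\text{local},\, \text{conversational},\, \text{topical},\, \text{social}\}}
  \lambda_k \, p_k(e' \mid e),
\qquad \lambda_k \ge 0, \quad \sum_k \lambda_k = 1
```

Here $p_k(e' \mid e)$ denotes the probability that email $e'$ belongs to the context of type $k$ of email $e$.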
Context Expansion (diagram labels: local, topical, conversational, social; people, time, content)
Temporal similarity affects social and topical similarity
Temporal Similarity
- Decay over time
- Gaussian and Linear functions
- Time difference / Rank
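The two decay shapes named on this slide can be sketched as follows. The function names and the `sigma` and `window` parameters are hypothetical, chosen only for illustration; `delta` may be either a time difference or a rank difference, as the slide allows both.

```python
import math


def gaussian_decay(delta: float, sigma: float) -> float:
    """Gaussian decay: 1.0 at delta == 0, falling smoothly as |delta| grows.

    sigma (assumed parameter) controls how quickly similarity fades.
    """
    return math.exp(-(delta ** 2) / (2.0 * sigma ** 2))


def linear_decay(delta: float, window: float) -> float:
    """Linear decay: 1.0 at delta == 0, reaching 0.0 once |delta| >= window.

    window (assumed parameter) is the distance beyond which emails are unrelated.
    """
    return max(0.0, 1.0 - abs(delta) / window)
```

For example, `linear_decay(5, 10)` gives 0.5, and any difference beyond the window gives 0.0.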
Social Similarity
- Two sets of participants (email addresses)
- Binary, Overlap, Jaccard, Both
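A minimal sketch of three of the set-based measures named above, computed over two sets of participant addresses. The exact definitions used in the talk are not given on the slide; in particular, treating "Overlap" as the overlap coefficient (intersection size over the smaller set) is an assumption, and the "Both" variant is left out because the slide does not define it.

```python
from typing import Set


def binary_sim(a: Set[str], b: Set[str]) -> float:
    """1.0 if the two participant sets share at least one address, else 0.0."""
    return 1.0 if a & b else 0.0


def overlap_sim(a: Set[str], b: Set[str]) -> float:
    """Overlap coefficient (assumed definition): |A n B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard_sim(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: |A n B| / |A u B|."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

With `a = {"x@e.com", "y@e.com"}` and `b = {"y@e.com", "z@e.com", "w@e.com"}`, the three measures give 1.0, 0.5, and 0.25 respectively.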
Temporal Effect (diagram labels: Temporal Sim, Social Sim, Normalize, Pure Social Sim, Social Context)
Topical Similarity
- Standard IR similarity: BM25
- Email as a DOCUMENT? (diagram labels: email, reply / forward)
  - Subject
  - Body (+Subject)
  - Root of thread
  - Concatenated path to root
- Combined similarly with temporal similarity
Contextual Space (emails): Social Context, Topical Context, Conversational Context, Local Context
Contextual Space (mentions). Mention graph nodes: “Sheila Tweed”, “jsheila@enron.com”, “Sheila Walton”, “Sheila”, “sheila”, “Sheila”, “sg”. Edge labels: social, topical, conversational
Outline: Introduction and Approach Overview; Computational Model of Identity; Context Reconstruction; Mention Resolution; Evaluation on Existing Collections; Scalable MapReduce Implementation; New Test Collection; Conclusion and Future Work
Mention Resolution: Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann
[1] Context-Free Resolution. Mention graph nodes: “Sheila Tweed”, “jsheila@enron.com”, “Sheila Walton”, “Sheila”, “sheila”, “Sheila”, “sg”. Edge labels: social, topical, conversational. Other labels: Context-Free Resolution, X
[2] Contextual Resolution. Mention graph nodes: “Sheila Tweed”, “jsheila@enron.com”, “Sheila Walton”, “Sheila”, “sheila”, “Sheila”, “sg”. Edge labels: social, topical. Other labels: Context-Free Resolution
Outline: Introduction and Approach Overview; Computational Model of Identity; Context Reconstruction; Mention Resolution; Evaluation on Existing Collections; Scalable MapReduce Implementation; New Test Collection; Conclusion
Test Collections (diagram labels: Enron-all, Enron-subset, Sager, Shapiro)
Collection     Emails     Identities   Mention Queries   Candidates (Min. / Avg. / Max.)
Sager          1,628      627          51                1 / 4 / 11
Shapiro        974        855          49                1 / 8 / 21
Enron-subset   54,018     27,340       78                1 / 152 / 489
Enron-all      248,451    123,783      78                3 / 518 / 1785
Evaluation Measures
Commonly used in “known-item” retrieval
- Success @1 (i.e., Precision @1)
  - One-best
- MRR (Mean Reciprocal Rank)
  - Inverse of the harmonic mean of the ranks r_i of the true answers
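Both measures are straightforward to compute from the rank of the true answer for each mention-query. The sketch below is illustrative only; the function names are hypothetical, and representing a query whose true answer was never returned as `None` (contributing zero) is an assumption.

```python
from typing import Optional, Sequence


def success_at_1(ranks: Sequence[Optional[int]]) -> float:
    """Fraction of queries whose true answer is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/r_i over all queries; a missing answer (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Ranks (1-based) of the true identity for four hypothetical mention-queries.
example_ranks = [1, 2, 4, None]
```

For `example_ranks`, Success @1 is 0.25 and MRR is (1 + 1/2 + 1/4 + 0) / 4 = 0.4375.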
Comparison w/Literature
Collections (in order): Sager, Shapiro, Enron-subset, Enron-all
MRR (Context Expansion Best, Lit. Best): 0.911, 0.889; 0.913, 0.879; 0.91, 0.89, -
Success @ 1 (Context Expansion Best, Lit. Best): 0.863, 0.804; 0.878, 0.779; 0.846, (0.82); 0.821, -
Earlier expansion approach, reported in ACL 2008
Improved expansion: 0.92, 0.87
Limitations
- Resolving single mentions (callout: Scalable Implementation for Resolving All Mentions)
- All mention-queries are sampled from Enron-to-Enron emails
- All mention-queries refer to Enron employees
- Small for train/test split
(callout: New Test Collection)
Outline: Introduction and Approach Overview; Computational Model of Identity; Context Reconstruction; Mention Resolution; Evaluation on Existing Collections; Scalable MapReduce Implementation; New Test Collection; Conclusion
Scalable Implementation
Two bottlenecks:
1. Context expansion of ALL emails: for each email, a ranked list of “similar” emails
2. Resolution of ALL mentions: resolution of one mention depends on resolution of all other mentions in context
Context Expansion of ALL Emails
- Goal: for each email, a ranked list of “similar” emails
  - Need for BOTH social and topical contexts
  - Efficient implementation
- Abstract problem: computing pairwise similarity
Trivial Solution
- Load each vector O(N) times
- Load each term O(df_t^2) times
Goal: a scalable and efficient solution for large collections
Better Solution
Each term contributes only if it appears in …
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores
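The formula from this slide did not survive extraction. As a sketch of the usual formulation, assuming similarity is an inner product of term weights $w_{t,d}$ (this notation is introduced here and is not taken from the slide), the sum over the vocabulary $V$ reduces to a sum over shared terms:

```latex
\mathrm{sim}(d_i, d_j)
  \;=\; \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  \;=\; \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
```

Under this assumption, a term $t$ that occurs in $df_t$ documents yields one partial score per pair of those documents, that is $\binom{df_t}{2} = \tfrac{1}{2}\, df_t (df_t - 1) = O(df_t^2)$ partial scores, which matches the count stated on the slide.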
MapReduce Framework
- (a) Map: input (k1, v1) produces [(k2, v2)]
- (b) Shuffle: group values by keys
- (c) Reduce: (k2, [v2]) produces output [(k3, v3)]
- Handles low-level details transparently
Decomposition (formula parts labeled: map, reduce)
Each term contributes only if it appears in …
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores
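The decomposition can be illustrated with a small in-memory simulation of the map, shuffle, and reduce steps. This is a sketch under stated assumptions, not the authors' implementation: the inverted index layout (term mapped to a list of `(doc_id, weight)` postings), the inner-product similarity, and all function names are hypothetical, and no real MapReduce library is used.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterator, List, Tuple

Posting = Tuple[str, float]   # (doc_id, term weight), assumed layout
Pair = Tuple[str, str]        # (doc_id, doc_id) with the smaller id first


def map_phase(postings: List[Posting]) -> Iterator[Tuple[Pair, float]]:
    """Map: for one term, emit a partial score for every pair of documents
    that contain it. The term's weights are read once, from its postings."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2


def reduce_phase(pair: Pair, partials: List[float]) -> Tuple[Pair, float]:
    """Reduce: sum all partial scores contributed to one document pair."""
    return pair, sum(partials)


def pairwise_similarity(index: Dict[str, List[Posting]]) -> Dict[Pair, float]:
    """Run map over every term, group by pair (shuffle), then reduce."""
    grouped: Dict[Pair, List[float]] = defaultdict(list)
    for postings in index.values():
        for pair, partial in map_phase(postings):
            grouped[pair].append(partial)
    return dict(reduce_phase(pair, partials) for pair, partials in grouped.items())


# Tiny hypothetical index: two terms over three documents.
example_index = {
    "a": [("d1", 2.0), ("d2", 1.0)],
    "b": [("d1", 1.0), ("d2", 3.0), ("d3", 1.0)],
}
```

On `example_index`, the pair `("d1", "d2")` receives 2.0 from term "a" and 3.0 from term "b", giving 5.0. Document pairs that share no term never appear in the output, which is where the saving over the trivial all-pairs approach comes from.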
Expansion Using MapReduce
- Using generic pairwise-similarity for both topical and social expansion
Diagram labels: doc rep. (topical: body/root/path; social: participants); temporal sim model; time window; df-cut; rank cut-off; context graph
Context Mention-Graph. Mention graph nodes: “Sheila Tweed”, “Sheila Walton”, “jsheila@enron.com”, “Sheila”, “sheila”, “Sheila”, “sg”. Edge labels: social, topical, conversational. Other labels: map, reduce, Context-Free Resolution
Packing Resolution System Using MapReduce (diagram labels)
- Threads, Emails, Identity Models, Dict.
- Preprocessing; Mention Recognition and Prior Computation; Prior
- Expansion: Conv., Local, Topical, Social
- Graph: Conv., Local, Topical, Social
- Resolution: Conv., Local, Topical, Social
- Merging Context Resolutions; Posterior Resolution
Outline: Introduction and Approach Overview; Computational Model of Identity; Context Reconstruction; Mention Resolution; Evaluation on Existing Collections; Scalable MapReduce Implementation; New Test Collection; Conclusion
New Test Collection
- Random sample from CMU-Enron collection
- “Annotation + Search” interface available
- Total annotators: 3
- Annotation time: ~50 hours
- Not only resolutions
  - Time, difficulty, confidence, evidence, and comments
- Total mention-queries: 584
- 80% resolvable, 82% of them to Enron domain
- Overall inter-annotator agreement: ~81%
Mention-Query Selection
Distribution of Names Based on Resolution
Distribution Based on Difficulty
Distribution Based on Confidence
Distribution of Time Spent
Evaluation again …
Pairwise Agreement
Annotator labels with counts: a1 195; a2 190; a3 199
Overlap regions, with agreement listed in legend order (Enron-resolvable, Non-enron-resolvable, Unresolvable):
- 27: 16/16 (100%), 2/4 (50%), 4/7 (57%)
- 23: 12/16 (75%), 1/5 (20%), 2/2 (100%)
- 50: 24/27 (89%), 5/12 (42%), 11/11 (100%)
- 50: 35/38 (92%), 2/2 (100%), 6/10 (60%)
Individual Annotator Agreement
Overall Agreement
Agreement Based on Difficulty
Agreement Based on Confidence
Conclusion and Future Work
- Identity resolution by non-participants is feasible
  - And automatic systems for that can be built
  - ~75-90% accurate
- Proposed generative probabilistic model
  - Context expansion using temporal similarity
  - Scalable implementation using “Pairwise Sim with MapReduce”
- Developed largest test collection for the task
  - 80% resolvable, 82% of them to Enron employees
- Effectiveness scales well to large collections
- Efficiency results
- Evaluation using double-assessments
- Iterative approach for “joint resolution”
Thank You!
Related Work
- Diehl et al. (SIAM, 2006)
  - Developed Enron-subset collection
  - Temporal traffic models
  - Candidates must have communicated with sender
- Minkov et al. (SIGIR, 2006)
  - Developed Sager and Shapiro collections
  - Graphical framework
  - Large collections?


