
Identity Resolution in Email Collections
Tamer Elsayed and Douglas W. Oard
Department of Computer Science, UMIACS, and iSchool
CLIP Colloquium, UMD, Feb 2009

Real Problem
- Clinton White House: 32 million emails
- Search request: Tobacco Policy
- National Archives hired 25 persons for 6 months …
- Diagram labels: 80,000 and 200,000

Identity Resolution in Email

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Callout on "Sheila": WHO?

Enron Collection

Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.. it's a SURPRISE !
X-From: Sager, Elizabeth
X-To: 'SStack@reliant.com@ENRON'

Hi Shari
Hope all is well. Count me in for the group present.
See ya next week if not earlier
Liza

Elizabeth Sager
713-853-6349

-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.. it's a SURPRISE !

Please call me (713) 207-5233
Thanks!
Shari

Overlay: 55 Sheila's !!
Surnames shown over the email: weisman, jarnot, maynes, pardo, kirby, nacey, glover, knudsen, ferrarini, rich, boehringer, dey, jones, macleod, lutz, breeden, howard, glover, huckaby, wollam, darling, tweed, jortner, watson, mcintyre, neylon, perlick, chadwick, whanger, advani, birmingham, nagel, hester, kahanek, graves, kenner, foraker, mclaughlin, lewis, tasman, walton, venville, fisher, rappazzo, whitman, petitt, miller, berggren, Dombo, swatek, osowski, Robbins, hollis, kelly, chang

Rank Candidates

Generative Model
1. Choose "person" c to mention: p(c)
2. Choose appropriate "context" X to mention c: p(X | c)
3. Choose a "mention" l: p(l | X, c)

Example: mention "sheila", context "GE conference call"
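
Read as a probabilistic model, the three choices above factor the joint probability, and a posterior over candidate persons follows from Bayes' rule. The block below is a sketch derived only from the three distributions named on the slide; the sum over c' (all candidate persons) is the normalizer and is not shown in the original.

```latex
\begin{align}
p(c, X, l) &= p(c)\, p(X \mid c)\, p(l \mid X, c) \\
p(c \mid l, X) &= \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}
                       {\sum_{c'} p(c')\, p(X \mid c')\, p(l \mid X, c')}
\end{align}
```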

3-Step Solution
(1) Identity Modeling
(2) Context Reconstruction
(3) Mention Resolution

Output: Posterior Distribution

Outline
- Introduction and Approach Overview
- Computational Model of Identity
- Context Reconstruction
- Mention Resolution
- Evaluation on Existing Collections
- Scalable MapReduce Implementation
- New Test Collection
- Conclusion and Future Work

"Easy References" of Identity

Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.. it's a SURPRISE !
X-From: Sager, Elizabeth
X-To: 'SStack@reliant.com@ENRON'

Hi Shari
Hope all is well. Count me in for the group present.
See ya next week if not earlier
Liza

Elizabeth Sager
713-853-6349

-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.. it's a SURPRISE !

Please call me (713) 207-5233
Thanks!
Shari

Sources of easy references (callouts):
- Email Standards
- Email-Client Behavior
- User Regularities

Representational Model of Identity
- Address: sheila.glover@enron.com
- Associated names: "sheila glover", "sheila", "sg"
- Edge labels, in slide order: 1170 (User Name), 216 (Signature), 19 (Salutation), 932 (Main Headers), 14 (Quoted Headers), 19 (Signature)
- 77,240 "non-trivial" models

Computational Model of Identity
[Diagram nodes: identity c, name type t, observed mention m]

Candidates
- Identity Models yield the Candidates
- Likelihood: p("sheila" | c)
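
One simple way to get a likelihood such as p("sheila" | c) from an identity model is a relative-frequency estimate over the names observed for that identity. The sketch below assumes exactly that; the slides do not confirm that this is the estimator used. All counts, the addresses other than sheila.glover@enron.com, and the function names are hypothetical.

```python
from collections import Counter

def mention_likelihood(mention: str, name_counts: Counter) -> float:
    """Relative-frequency estimate of p(mention | c) from observed name counts."""
    total = sum(name_counts.values())
    if total == 0:
        return 0.0
    return name_counts[mention.lower()] / total

# Hypothetical identity models: address -> Counter of observed names.
identity_models = {
    "sheila.glover@enron.com": Counter({"sheila glover": 60, "sheila": 30, "sg": 10}),
    "sheila.tweed@enron.com": Counter({"sheila tweed": 80, "sheila": 20}),
    "kay.mann@enron.com": Counter({"kay mann": 70, "kay": 30}),
}

def candidates(mention: str, models: dict) -> list:
    """Return (address, likelihood) pairs with non-zero likelihood, best first."""
    scored = [(addr, mention_likelihood(mention, counts)) for addr, counts in models.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)

result = candidates("Sheila", identity_models)
```

Identities that never used the name drop out of the candidate list, which is what keeps the later ranking step small.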

Outline
- Introduction and Approach Overview
- Computational Model of Identity
- Context Reconstruction
- Mention Resolution
- Evaluation on Existing Collections
- Scalable MapReduce Implementation
- New Test Collection
- Conclusion and Future Work

Contextual Space
- Topical Context
- Conversational Context
- Local Context

Topical Context

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Date: Fri Dec 15 05:33:00 EST 2000
From: david.oxley@enron.com
To: vince j kaminski
Cc: sheila walton
Subject: Re: Grant Masson

Great news. Lets get this moving along. Sheila can you work out GE letter?
Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone.

Callouts: sheila.walton@enron.com; highlighted terms "Sheila", "GE", "call"

Contextual Space
- Social Context
- Topical Context
- Conversational Context
- Local Context

Social Context

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Date: Tue, 19 Dec 2000 07:00 -0800 (PST)
From: rebecca.walker@enron.com
To: kay.mann@enron.com
Subject: ESA Option Execution

Kay
Can you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland.
Thanks,
Rebecca

Callouts: kay.mann@enron.com (sender of the first email); "Sheila Tweed"

Formally
- A context of an email is a probability distribution over emails
  - Probability estimated based on type of context
- Contextual Space is a linear combination of 4 contexts
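
The linear combination can be sketched as a weighted mixture of four per-context distributions over emails. The email ids, probabilities, and the uniform weights below are made up for illustration; the slides do not give the actual mixing weights.

```python
def combine_contexts(contexts: dict, weights: dict) -> dict:
    """Mix per-context distributions over emails: p(e) = sum_k w_k * p_k(e)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    mixed = {}
    for name, dist in contexts.items():
        w = weights[name]
        for email_id, p in dist.items():
            mixed[email_id] = mixed.get(email_id, 0.0) + w * p
    return mixed

# Hypothetical distributions over three emails, one per context type.
contexts = {
    "local": {"e1": 1.0},
    "conversational": {"e1": 0.5, "e2": 0.5},
    "topical": {"e2": 0.75, "e3": 0.25},
    "social": {"e3": 1.0},
}
# Hypothetical uniform mixing weights.
weights = {"local": 0.25, "conversational": 0.25, "topical": 0.25, "social": 0.25}

space = combine_contexts(contexts, weights)
```

Because each input is a distribution and the weights sum to one, the mixture is again a distribution over emails.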

Context Expansion
- Contexts: local, conversational, topical, social
- Diagram labels: people, time, content
- Temporal similarity affects social and topical similarity

Temporal Similarity
- Decay over time
- Gaussian and Linear functions
- Time difference / Rank
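
The two decay shapes named above can be sketched as follows. The argument `delta` stands for either a time difference or a rank difference; the parameters `sigma` and `window` are hypothetical names and the slides do not state their values.

```python
import math

def gaussian_decay(delta: float, sigma: float) -> float:
    """Gaussian decay: 1.0 at delta = 0, falling smoothly as |delta| grows."""
    return math.exp(-(delta ** 2) / (2 * sigma ** 2))

def linear_decay(delta: float, window: float) -> float:
    """Linear decay: 1.0 at delta = 0, reaching 0.0 at |delta| >= window."""
    return max(0.0, 1.0 - abs(delta) / window)
```

The linear form gives a hard cut-off at the window edge, while the Gaussian form never reaches zero and so needs a separate cut-off if the candidate list is to stay short.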

Social Similarity
- Two sets of participants (email addresses)
- Binary, Overlap, Jaccard, Both
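
A minimal sketch of three of the set similarities listed above, computed over participant address sets. "Overlap" is assumed here to mean the overlap coefficient (intersection size divided by the smaller set size); the slide does not define it, and its "Both" option is left out because the text does not say what it combines. The address for Suzanne Adams is hypothetical.

```python
def binary_sim(a: set, b: set) -> float:
    """1.0 if the two participant sets share at least one address, else 0.0."""
    return 1.0 if a & b else 0.0

def overlap_sim(a: set, b: set) -> float:
    """Overlap coefficient: |A & B| / min(|A|, |B|) (assumed definition)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def jaccard_sim(a: set, b: set) -> float:
    """Jaccard coefficient: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Participants of two emails; the second address in p1 is a hypothetical example.
p1 = {"kay.mann@enron.com", "suzanne.adams@enron.com"}
p2 = {"rebecca.walker@enron.com", "kay.mann@enron.com"}
```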

Temporal Effect
[Diagram: Pure Social Sim and Temporal Sim are combined into Social Sim, which is normalized to give the Social Context]
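
One possible reading of this slide is that a pure social similarity is scaled by a temporal similarity and the results are normalized into a distribution over emails. The sketch below assumes a multiplicative combination with a Gaussian decay; the combination rule, the `sigma` value, and all scores are assumptions rather than details given in the slides.

```python
import math

def social_context(pure_social: dict, day_deltas: dict, sigma: float = 7.0) -> dict:
    """Scale each pure social similarity by a Gaussian temporal similarity,
    then normalize so the scores form a probability distribution over emails."""
    scored = {
        email_id: sim * math.exp(-(day_deltas[email_id] ** 2) / (2 * sigma ** 2))
        for email_id, sim in pure_social.items()
    }
    total = sum(scored.values())
    if total == 0:
        return {}
    return {email_id: value / total for email_id, value in scored.items()}

# Hypothetical inputs: equal social similarity, different distance in days.
pure = {"e1": 0.5, "e2": 0.5}
deltas = {"e1": 1, "e2": 14}
context = social_context(pure, deltas)
```

With equal social similarity, the email closer in time receives the larger share of the probability mass.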

Topical Similarity
- Standard IR similarity: BM25
- Email as a DOCUMENT?
  - Subject
  - Body (+Subject)
  - Root of thread
  - Concatenated path to root
- Combined similarly with temporal similarity

[Thread diagram labels: email, reply / forward]

Contextual Space (emails)
- Social Context
- Topical Context
- Conversational Context
- Local Context

Contextual Space (mentions)
[Diagram: mention graph. Nodes: "Sheila Tweed", "jsheila@enron.com", "Sheila Walton", "Sheila", "sheila", "Sheila", "sg". Edges labeled social, topical, and conversational.]

Outline
- Introduction and Approach Overview
- Computational Model of Identity
- Context Reconstruction
- Mention Resolution
- Evaluation on Existing Collections
- Scalable MapReduce Implementation
- New Test Collection
- Conclusion and Future Work

Mention Resolution

Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Callouts (numbered 1 to 3 on the slide): "?" on the mention; Likelihood: p("sheila" | c); Candidates

Goal: estimate p(c | m, X(m)) and rank accordingly
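
Ranking by the posterior can be sketched by multiplying the three factors of the generative model for each candidate and normalizing. The candidate names come from the slides, but every probability below is invented for illustration and says nothing about which candidate the real system prefers.

```python
def rank_candidates(factors: dict) -> list:
    """factors maps candidate -> (p(c), p(X | c), p(m | X, c)).
    Returns (candidate, posterior) pairs sorted by posterior, best first."""
    unnormalized = {c: pc * px * pm for c, (pc, px, pm) in factors.items()}
    z = sum(unnormalized.values())
    if z == 0:
        return []
    return sorted(
        ((c, value / z) for c, value in unnormalized.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Hypothetical factor values for three candidates.
factors = {
    "Sheila Tweed": (0.5, 0.6, 0.2),
    "Sheila Walton": (0.25, 0.2, 0.4),
    "Sheila Glover": (0.25, 0.4, 0.2),
}
ranking = rank_candidates(factors)
```

Normalizing is not needed for the ranking itself, but it makes scores comparable across mentions with different candidate sets.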

[1] Context-Free Resolution
[Diagram: mention graph. Nodes: "Sheila Tweed", "jsheila@enron.com", "Sheila Walton", "Sheila", "sheila", "Sheila", "sg". Edges labeled social, topical, and conversational. Labels: Context-Free Resolution, X.]

[2] Contextual Resolution
[Diagram: mention graph. Nodes: "Sheila Tweed", "jsheila@enron.com", "Sheila Walton", "Sheila", "sheila", "Sheila", "sg". Edges labeled social and topical. Label: Context-Free Resolution.]

Outline
- Introduction and Approach Overview
- Computational Model of Identity
- Context Reconstruction
- Mention Resolution
- Evaluation on Existing Collections
- Scalable MapReduce Implementation
- New Test Collection
- Conclusion

Test Collections

Collection      Emails     Identities   Mention Queries   Candidates (Min. / Avg. / Max.)
Sager           1,628      627          51                1 / 4 / 11
Shapiro         974        855          49                1 / 8 / 21
Enron-subset    54,018     27,340       78                1 / 152 / 489
Enron-all       248,451    123,783      78                3 / 518 / 1785

[Diagram labels: Enron-all, Enron-subset, Sager, Shapiro]

Evaluation Measures
Commonly used in "known-item" retrieval
- Success @1 (i.e., Precision @1)
  - One-best
- MRR (Mean Reciprocal Rank)
  - Inverse of the harmonic mean of the ranks r_i of the true answers
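
Both measures can be computed from the rank of the true answer for each mention-query. The sketch below uses a made-up list of ranks and also checks the slide's characterization of MRR as the inverse of the harmonic mean of the ranks.

```python
from statistics import harmonic_mean

def success_at_1(ranks: list) -> float:
    """Fraction of queries whose true answer is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

def mean_reciprocal_rank(ranks: list) -> float:
    """Mean of 1/r over the ranks r of the true answers (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the true answer for four mention-queries.
ranks = [1, 2, 1, 4]
s1 = success_at_1(ranks)            # 2 of 4 queries answered at rank 1
mrr = mean_reciprocal_rank(ranks)   # (1 + 1/2 + 1 + 1/4) / 4
```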

Comparison w/Literature

MRR (Context Expansion Best vs. Lit. Best)
- Sager: 0.911 vs. 0.889
- Shapiro: 0.913 vs. 0.879
- Enron-subset: 0.91
- Enron-all: 0.89 vs. -

Success @ 1 (Context Expansion Best vs. Lit. Best)
- Sager: 0.863 vs. 0.804
- Shapiro: 0.878 vs. 0.779
- Enron-subset: 0.846 vs. (0.82)
- Enron-all: 0.821 vs. -

Callouts:
- Earlier expansion approach, reported in ACL 2008
- Improved expansion: 0.92, 0.87

Limitations
- Resolving single mentions
  -> Scalable Implementation for Resolving All Mentions
- All mention-queries are sampled from Enron-to-Enron emails
- All mention-queries refer to Enron employees
- Small for train/test split
  -> New Test Collection

Outline
- Introduction and Approach Overview
- Computational Model of Identity
- Context Reconstruction
- Mention Resolution
- Evaluation on Existing Collections
- Scalable MapReduce Implementation
- New Test Collection
- Conclusion

Scalable Implementation
Two bottlenecks:
1. Context expansion of ALL emails
   For each email: ranked list of "similar" emails
2. Resolution of ALL mentions
   Resolution of one mention depends on resolution of all other mentions in context

Context Expansion of ALL Emails
- Goal: for each email, a ranked list of "similar" emails
  - Need for BOTH social and topical contexts
  - Efficient implementation
- Abstract problem: computing pairwise similarity

Trivial Solution
- Load each vector O(N) times
- Load each term O(df_t^2) times

Goal: scalable and efficient solution for large collections

Better Solution
Each term contributes only if it appears in both documents
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores

MapReduce Framework
- (a) Map: (k1, v1) -> [(k2, v2)]
- (b) Shuffle: group values by keys
- (c) Reduce: (k2, [v2]) -> [(k3, v3)]
- Handles low-level details transparently

[Diagram: input -> map tasks -> shuffling -> reduce tasks -> output]
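
The three phases can be illustrated with a tiny in-memory driver. This is a single-process sketch for reading the key/value signatures, not a distributed runtime; the function `run_mapreduce` and the document-frequency example job are hypothetical names chosen for this illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in memory over (k1, v1) records."""
    # (a) Map: each (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group all v2 values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def df_mapper(doc_id, text):
    """Emit (term, 1) once per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, 1

def df_reducer(term, ones):
    """Sum the ones to get the term's document frequency."""
    yield term, sum(ones)

# Hypothetical three-email collection.
emails = [("e1", "GE conference call"), ("e2", "GE letter"), ("e3", "call me")]
df = dict(run_mapreduce(emails, df_mapper, df_reducer))
```

In a real framework the grouping in step (b) and the distribution of work across machines are what the slide calls low-level details.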

Decomposition
Each term contributes only if it appears in both documents
- Load weights for each term once
- Each term contributes O(df_t^2) partial scores

[Diagram labels: map (per-term partial scores), reduce (sum per document pair)]
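
The decomposition can be sketched as two map/group steps: build postings per term, then emit one partial score per document pair in each postings list and sum them per pair. The sketch runs in a single process and uses raw term frequency as the term weight, which is an assumption; the slides use BM25 for topical similarity. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def index_map(doc_id, text):
    """Emit (term, (doc_id, weight)); weight is raw term frequency (assumed)."""
    tf = defaultdict(int)
    for term in text.lower().split():
        tf[term] += 1
    for term, count in tf.items():
        yield term, (doc_id, count)

def pairs_map(term, postings):
    """Emit one partial score per document pair that shares this term."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2

def shuffle(pairs):
    """Group values by key, as the framework's shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def pairwise_similarity(docs):
    """Inner-product similarity for every document pair sharing a term."""
    postings = shuffle(kv for doc_id, text in docs.items() for kv in index_map(doc_id, text))
    partial = shuffle(kv for term, plist in postings.items() for kv in pairs_map(term, plist))
    return {pair: sum(scores) for pair, scores in partial.items()}

# Hypothetical collection: e1 and e2 share "ge" and "call"; e3 shares nothing.
docs = {"e1": "GE conference call", "e2": "GE letter call call", "e3": "lunch plans"}
sims = pairwise_similarity(docs)
```

Pairs with no shared term never appear in the output, so work is spent only where a term contributes to the score.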

Expansion Using MapReduce
- Using generic pairwise similarity for both topical and social expansion

[Diagram labels: context (topical, social); doc rep. (topical: body/root/path; social: participants); temporal sim model; time; df-cut; rank window; cut-off; context graph]

Context Mention-Graph
[Diagram: mention graph. Nodes: "Sheila Tweed", "jsheila@enron.com", "Sheila Walton", "Sheila", "sheila", "Sheila", "sg". Edges labeled social, topical, and conversational. Labels: Context-Free Resolution; map on several mention nodes, reduce on one.]

Packing Resolution System Using MapReduce
[Pipeline diagram labels]
- Inputs: Threads, Emails, Identity Models
- Preprocessing: Dict.; Mention Recognition and Prior Computation
- Expansion: Conv. Expansion, Local Expansion, Topical Expansion, Social Expansion
- Graphs: Conv. Graph, Local Graph, Topical Graph, Social Graph
- Resolution (each with Prior): Conv. Resolution, Local Resolution, Topical Resolution, Social Resolution, Prior Resolution
- Merging Context Resolutions
- Output: Posterior Resolution

Outline
- Introduction and Approach Overview
- Computational Model of Identity
- Context Reconstruction
- Mention Resolution
- Evaluation on Existing Collections
- Scalable MapReduce Implementation
- New Test Collection
- Conclusion

New Test Collection
- Random sample from CMU-Enron collection
- "Annotation + Search" interface available
- Total annotators: 3
- Annotation time: ~50 hours
- Not only resolutions
  - Time, difficulty, confidence, evidence, and comments
- Total mention-queries: 584
- 80% resolvable, 82% of them to Enron domain
- Overall inter-annotator agreement: ~81%

Mention-Query Selection
[Figure]

Distribution of Names Based on Resolution
[Figure]

Distribution Based on Difficulty
[Figure]

Distribution Based on Confidence
[Figure]

Distribution of Time Spent
[Figure]

Evaluation again …

Pairwise Agreement
[Diagram: annotators a1, a2, a3 with labels 195, 190, and 199]

Agreement per shared set (Enron-resolvable / Non-Enron-resolvable / Unresolvable):
- Set of 27: 16/16 (100%) / 2/4 (50%) / 4/7 (57%)
- Set of 23: 12/16 (75%) / 1/5 (20%) / 2/2 (100%)
- Set of 50: 24/27 (89%) / 5/12 (42%) / 11/11 (100%)
- Set of 50: 35/38 (92%) / 2/2 (100%) / 6/10 (60%)

Individual Annotator Agreement
[Figure]

Overall Agreement
[Figure]

Agreement Based on Difficulty
[Figure]

Agreement Based on Confidence
[Figure]

Conclusion and Future Work
- Identity resolution by non-participants is feasible
  - And automatic systems for that can be built
  - ~90-75% accurate
- Proposed generative probabilistic model
  - Context expansion using temporal similarity
  - Scalable implementation using "Pairwise Sim with MapReduce"
- Developed largest test collection for the task
  - 80% resolvable, 82% of them to Enron employees
- Effectiveness scales well to large collections
- Efficiency results
- Evaluation using double-assessments
- Iterative approach for "joint resolution"

Thank You!

Related Work
- Diehl et al. (SIAM, 2006)
  - Developed Enron-subset collection
  - Temporal traffic models
  - Candidates must have communicated with sender
- Minkov et al. (SIGIR, 2006)
  - Developed Sager and Shapiro collections
  - Graphical framework
  - Large collections?