3c95ab90552c8ed7274c0f467d6e7881.ppt
- Number of slides: 32
Modeling Identity in Archival Collections of Email: A Preliminary Study
Tamer Elsayed and Douglas W. Oard
Institute for Advanced Computer Studies, Department of Computer Science, College of Information Studies
Conference on Email and Anti-Spam (CEAS), July 28th, 2006
Real Problem
- Clinton White House: 32 million emails
- Search request: Tobacco Policy
- Counts shown in the diagram: 80,000 and 200,000
- National Archives hired 25 persons for 6 months …
Email
- Searcher: Participant, Non-participant
- Personal: my own emails; Shneiderman's; Postel's
- Organizational: CS; UMIACS; White House; Enron
- Public: TREC Enterprise; Usenet news; W3C
- Meaning: Modeling Content
- People: Modeling Identity
Identity
[Diagram: an Email connects a Sender, its Receivers, and the people it Mentions. Each of them is described by an Email Address, a Name, and a Nickname. Relations shown: sent, sent email to, received, mentions, mentioned.]
Outline
- Problem
- Identity Resolution Architecture
- Evaluation
- Conclusion
Entity Example
- Email Address: "robert.bruce@enron.com"
- Name: "Robert Bruce", from Main Headers (915) and Quoted Headers (8)
- Nickname: "Bob", from Salutations (7) and Free Signatures (9)
- Signature Block, from Static Signature (140):
  Robert E. Bruce
  Senior Counsel
  Enron North America Corp.
  T (713) 345-7780
  F (713) 646-3393
  robert.bruce@enron.com
Enron Collection
- Example of large organizational collection
- CMU version
  - about half a million emails
  - 133,581 unique email addresses
- ~52% of emails are duplicates!
  - same address, subject, body
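The duplicate criterion named on this slide (same address, subject, body) can be sketched as a hash over those three fields. This is a minimal illustration, not the authors' implementation: the function names, the dict-based message format, and the case and whitespace normalization are all assumptions.

```python
import hashlib

def message_key(sender, subject, body):
    """Hash the three fields the slide names: address, subject, body.

    Lowercasing and whitespace collapsing are assumptions of this sketch;
    the slides do not say how the fields were normalized.
    """
    parts = [" ".join(field.lower().split()) for field in (sender, subject, body)]
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()

def unique_messages(messages):
    """Keep the first message seen for each (address, subject, body) key."""
    seen = set()
    kept = []
    for msg in messages:
        key = message_key(msg["from"], msg["subject"], msg["body"])
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept
```

Keeping the first copy rather than the last is an arbitrary choice here; either preserves one representative per duplicate group.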
Typical Enron Email
[Message Header]
  Message-ID: <1494.1584620.JavaMail.evans@thyme>
  Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
  From: elizabeth.sager@enron.com
  To: sstack@reliant.com
  Subject: RE: Shhhh.. it's a SURPRISE !
  X-From: Sager, Elizabeth
  X-To: 'SStack@reliant.com@ENRON'
[Message Body: Salutation]
  Hi Shari
[Message Body: Main Body]
  Hope all is well. Count me in for the group present. See ya next week if not earlier
[Message Body: Signature Block]
  Liza
  Elizabeth Sager
  713-853-6349
[Quoted Text: Quoted Header]
  -----Original Message-----
  From: SStack@reliant.com@ENRON
  Sent: Monday, July 30, 2001 2:24 PM
  To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
  Cc: ntillett@reliant.com
  Subject: Shhhh.. it's a SURPRISE !
[Quoted Text: Quoted Main Body]
  Please call me (713) 207-5233
[Quoted Text: Quoted Signature]
  Thanks! Shari
Identity Resolution Architecture
[Pipeline diagram, listed from input to output]
- Duplicate Detection, producing unique emails
- Extraction from Main Header
- Body and Quoted Text Separation, producing the main body and quoted headers
- Extraction from Quoted Header
- Salutation Line Detection and Signature Line Detection, producing salutation lines and signature lines
- Nickname Extraction
- Associations: Address-Address, Address-Name, Address-Nickname
- Clustering, producing Entities
Extraction From Main Headers
  Message-ID: <1486175.1075858665169.JavaMail.evans@thyme>
  Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT)
  From: jmathes@nbchamber.com
  To: mark.vandini@enron.com, steve.urbon@enron.com, sapienza.tony@enron.com, o'rourke.tom@enron.com, lyons.tom@enron.com
  Subject: New Email Address
  X-From: Jim Mathes
[Name-Address Association: the From address paired with the X-From name]
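A minimal sketch of the main-header step, assuming the association is formed by pairing the `From` address with the `X-From` display name as the slide's example suggests. It uses Python's standard `email` package; the function name and the lowercasing of the address are assumptions, and the sample header used to exercise it would need to be supplied by the reader.

```python
from email import message_from_string

def name_address_from_headers(raw_message):
    """Return (address, name) from the From and X-From headers, or None.

    Pairing these two headers is an assumption based on the slide's
    example, not a documented rule.
    """
    msg = message_from_string(raw_message)
    address = (msg.get("From") or "").strip().lower()
    name = (msg.get("X-From") or "").strip()
    if address and name:
        return (address, name)
    return None
```

Messages that lack either header yield `None`, so they contribute no association rather than a partial one.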
Extraction From Quoted Headers
  Hi Jeff, Did you get our registration packet? If not, stop by and pick one up because you need it. Make sure you get the one for new students. Shawn
  On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich [SMTP:jdasovic@enron.com] wrote:
  > ok, don't shoot me, but what's the deadline for scheduling for classes?
  > signed,
  > clueless
[Name-Address Association: Jeff Dasovich, jdasovic@enron.com]
  ----------- Forwarded by Elizabeth Sager/HOU/ECT on 02/09/2000 12:02 PM -------------
  "Patricia Young"
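The reply line in this example ("On <date>, Name [SMTP:address] wrote:") carries both a name and an address, so one way to sketch quoted-header extraction is a regular expression over the body. The pattern below is an assumption written for this one header style; real quoted headers vary widely, and the slides do not give the rules actually used.

```python
import re

# Hypothetical pattern for reply lines of the form:
#   On <date>, Jeff Dasovich [SMTP:jdasovic@enron.com] wrote:
QUOTED_HEADER = re.compile(
    r"^On .+?,\s*(?P<name>[^,\[\]]+?)\s*\[SMTP:\s*(?P<address>[^\]\s]+)\]\s*wrote:",
    re.MULTILINE,
)

def name_address_from_quoted(body):
    """Return (address, name) pairs found in quoted reply headers."""
    return [
        (m.group("address").lower(), m.group("name"))
        for m in QUOTED_HEADER.finditer(body)
    ]
```

Other quoted-header styles, such as the "-----Original Message-----" block with its own From line, would need separate patterns.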
Signature & Salutation Detection
  From: susan.scott@enron.com
[Several message bodies are overlaid on this slide; the overlapping text is not recoverable]
  Closings: "Love," / "love," / "Love,"
  Signature lines: "-Sooz" / "sooz" / "Sooz"
  Procurement, Logistics, and Contracts
  Enron Broadband Services, Inc.
  1400 Smith, Suite EB-4573A
  Houston, TX 77002
Nickname Extraction
  From: susan.scott@enron.com
  Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime.
  Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long.
  Have a good afternoon!
  love,
  sooz   [nickname]
  Procurement, Logistics, and Contracts
  Enron Broadband Services, Inc.
  1400 Smith, Suite EB-4573A
  Houston, TX 77002
- 3,151 address-nickname associations
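As a rough sketch of how a nickname might be pulled from a signature line like the one above: look for a closing phrase and accept the next line if it is a single short token. The closing-phrase list, the length bounds, and the function name are hypothetical; the slides do not describe the heuristics actually used.

```python
import re

# Illustrative, hypothetical list of closing phrases.
CLOSINGS = {"love", "thanks", "regards", "cheers"}

def nickname_from_signature(body_lines):
    """Return the token following a closing line such as 'love,', or None."""
    for i, line in enumerate(body_lines[:-1]):
        if line.strip().rstrip(",!").lower() in CLOSINGS:
            candidate = body_lines[i + 1].strip().lstrip("-").strip()
            # Accept only a single alphabetic token of plausible length.
            if re.fullmatch(r"[A-Za-z]{2,15}", candidate):
                return candidate.lower()
    return None
```

The extracted token would then be paired with the sender's address to form an address-nickname association.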
Identifying Entities
- Email Address "robert.bruce@enron.com"
  - Name "Robert Bruce": Main Headers (915), Quoted Headers (8)
  - Nickname "Bob": Salutations (7), Free Signatures (9)
  - Signature Block: Static Signature (140)
    Robert E. Bruce
    Senior Counsel
    Enron North America Corp.
    T (713) 345-7780
    F (713) 646-3393
    robert.bruce@enron.com
- Email Address "rbruce@hotmail.com"
  - Linked to "robert.bruce@enron.com": Main Headers (7)
  - Name "Robert": Quoted Headers (5)
- Totals: 82,084 addr-name; 3,151 addr-nickname; 19,708 addr-addr; 66,715 entities
Outline
- Problem
- Identity Resolution Architecture
- Evaluation
- Conclusion
- Future Work
Stratified Sampling
(each cell is sample size / stratum size; where two cells are given, the first is Weakest Evidence and the second is Stronger Evidence)
Address-Name Associations
- Main headers only: 50 / 29677, 50 / 31248
- Quoted headers only: 50 / 8042, 50 / 3828
- Both headers: 50 / 9289
Address-Nickname Associations
- Salutations only: 50 / 272, 50 / 465
- Signatures only: 50 / 172, 50 / 1754
- Both: 50 / 490
Address-Address Associations
- 50 / 6514, 50 / 4194
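The sampling scheme on this slide draws 50 associations from each stratum regardless of stratum size. A minimal sketch of that idea follows; the function name, the dict-of-lists input, and the fixed seed are assumptions made for illustration.

```python
import random

def stratified_sample(strata, per_stratum=50, seed=0):
    """Draw up to `per_stratum` items uniformly at random from each stratum.

    `strata` maps a stratum label to a list of associations. Strata smaller
    than `per_stratum` are taken whole.
    """
    rng = random.Random(seed)
    sample = {}
    for label, items in strata.items():
        k = min(per_stratum, len(items))
        sample[label] = rng.sample(items, k)
    return sample
```

Because every stratum contributes the same number of judged items, small strata such as "Signatures only" with weak evidence (172 associations) are examined far more densely than large ones.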
Judgment Process
- Incorrect
  - kmpresto@msn.com "home email"
  - terrie.james@enron.com "alexis james-petty"
- Correct but not informative
  - june-deadrick@reliantenergy.com "june deadrick"
  - robbie.lewis@enron.com "robbie lewis"
- Correct and somewhat informative
  - terriecovarrubias@hotmail.com "terrie covarrubias"
  - randal.maffett@enron.com "randy"
- Correct and very informative
  - lemelpe@nu.com "phyllis"
  - piazzet@wharton.upenn.edu "tom"
Evaluation Measures
[Figure labels: Judged Associations, Correct, Very Informative]
Accuracy
- 100% accuracy with multiple sources of evidence.
- Address-name association was nearly perfect
- 80% minimum accuracy in address-nickname
- 96.7% entity accuracy
[Charts: Address-Name Associations, Address-Nickname Associations, Address-Address Associations]
Informativeness
[Charts: Address-Name Associations, Address-Nickname Associations, Address-Address Associations]
Outline
- Problem
- Identity Resolution Architecture
- Evaluation
- Conclusion
Conclusion
- Introduced a computational model of identity
  - a set of simple techniques that together provide a useful baseline
  - assessed its potential utility in the context of one fairly complex email collection
- Automatic detection of nicknames in salutations and signature lines
- The most informative results come from the weakest evidence and are the least accurate
- Accuracy and informativeness are both important
Limitations
- Email address associated with single identity
- Strength of evidence not exploited
- Heuristics hand-tuned for Enron collection
- Focus on personal attributes
- No reconciliation of multiple identities for single person
- No attempt to classify identities as machines or groups
- Recall?
Thank You! Questions?
Backup
Future Work
- extend the model to exploit temporal features and behavioral evidence
- implement machine learning techniques
- perform ablation studies
- characterize the coverage of our methods in more detail
- replicate this work in other contexts
- integrate these techniques with the ultimate applications for which computational models of identity are needed (e.g., social network analysis).
Helping in Judgments
Identity Framework
[Diagram labels: Candidates, Entity, Identity; identity types Person, Group, Machine]
Modeling Identity
- Attributes (stable explicit features)
  - email addresses, names, nicknames, contact info
- Associations
  - Link attributes together
  - Based on observations
- Entities
  - Representation of an identity
  - Set of attributes in undirected graph
    - Linked by weighted associations
Identifying Entities
- First round
  - limited transitive closure
- Merging associations
  - based on unique attributes
  - Address-address associations
- No use of strength of evidence yet
- 66,715 entities
  - Covering 77,420 unique email addresses (58% of all addresses)
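Merging associations into entities can be sketched as finding connected components in the attribute graph. The sketch below uses a union-find structure and computes an unrestricted transitive closure, which is a simplification: the slide describes a limited closure that merges only on unique attributes and address-address links, precisely because a shared name or nickname such as "Bob" would otherwise fuse unrelated people. The function name and the `name:` / `nick:` prefixes in the usage example are assumptions.

```python
def cluster_entities(associations):
    """Group attributes connected by any association into entities.

    `associations` is an iterable of (attribute, attribute) pairs.
    Returns a list of sets, one set of attributes per entity.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in associations:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    entities = {}
    for attr in list(parent):
        entities.setdefault(find(attr), set()).add(attr)
    return list(entities.values())

# Usage with the attributes from the Robert Bruce example:
example = [
    ("robert.bruce@enron.com", "name:robert bruce"),
    ("robert.bruce@enron.com", "nick:bob"),
    ("robert.bruce@enron.com", "rbruce@hotmail.com"),
    ("rbruce@hotmail.com", "name:robert"),
]
entities = cluster_entities(example)  # one entity holding all five attributes
```

The coverage figure on the slide is consistent with the collection size reported earlier: 77,420 / 133,581 ≈ 0.58.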
Related Work
- Attribute/association extraction
- Name recognition and reference resolution
- Applications:
  - Social network analysis
  - Finding experts
Unjudged Associations
[Charts: Address-Name Associations, Address-Nickname Associations, Address-Address Associations]
- Only 19 (~3%)