
888f50350b7a6878d1bb8b8c8821e955.ppt
- Количество слайдов: 23
India Research Lab Auto-grouping Emails for Faster e. Discovery Sachindra Joshi, Danish Contractor, Kenney Ng*, Prasad M Deshpande, and Thomas Hampp* IBM Research – India *IBM Software Group
India Research Lab Outline of the Talk § e. Discovery Process § A new way of e. Discovery Review: Group Level Review § Creating Syntactic Groups § Creating Semantic Groups § Experiments and Conclusion | 3/17/2018
India Research Lab e. Discovery Process § Discovery: Process in pre-trial phase - Produce relevant information § e. Discovery: FRCP 2006 amendment - Produce relevant Electronically Stored Information (ESI) § Emails, chats, word docs, presentations etc. § Huge volumes of ESI - Process is expensive 60% of cases warrant some form of e. Disovery - 4. 8 billion dollars industry in 2011 - | 3/17/2018
India Research Lab e. Discovery Process § High cost due to review stage - Lawsuit between Clinton administration and tobacco companies (U. S. Vs. Philip Morris) Apply Text Mining Techniques to reduce high costs involved in e. Discovery Process | 3/17/2018
India Research Lab Architecture of e. Discovery Review Systems Named entity annotator Language Annotator Signature Annotator | 3/17/2018
India Research Lab Group Level Review § Review groups of documents that are “related” instead of individual documents Mark whole group as responsive/unresponsive or privileged - Efficient and consistent - - Syntactically Similar Documents § Automated messages, Near and exact duplicates - Semantically Similar Documents § Threads, semantic categories | 3/17/2018
India Research Lab Detecting Syntactic Groups: Automated Messages | 3/17/2018
India Research Lab Detecting Near Duplicates § S 1: I am away from 17/2/2011 to 19/2/2011. Please mail xyz@in. ibm. com in case of any need § S 2: I am away from 26/7/2011 to 31/7/2011. Please mail abc@us. ibm. com in case of any need § Notion of Similarity: Resemblance § Use fingerprinting (Rabin) instead of actual chunks. | 3/17/2018
India Research Lab Efficient Detection of Near Duplicates § For a document of length n words there would be - n-K+1 chunks with a window size of K § It suffices to keep for each document a relatively small fixed size signature § Let Sn be the set of permutations of [n] § And let P be chosen uniformly at random over Sn | 3/17/2018
India Research Lab Signature Annotator § In practice choosing the permutations randomly is hard § Use a set of n one-toone functions fi and keep only the smallest value for each fi § Keep only j lowest significant bits for each value | 3/17/2018
India Research Lab Discovering Automated Messages § Generating groups of near duplicate – Index Based Clustering - For each document d in index I do § If d is not covered - Let S = {S 1, S 2, …, Sn} be the signature of document d - D = Query(I, atleast(S, k)) - For each document d’ in D § d’ is covered § Discovering Groups of Automated Messages - Automated Messages, Group of bulk emails, Group of forward emails § Use MD 5 to detect bulk emails. Emails with one segment are automated messages | 3/17/2018
India Research Lab Detecting Semantic Groups: Email Threads § A tree like structure § A link denotes that the child node was written as a reply to the parent node. § Capture the context in which an email was written | 3/17/2018
India Research Lab Detecting Email Threads § Meta data based methods - Headers are not consistently used § Content of old mail remains in the new mail - A segment contains text of only one communication § An email ei contains ej iff ei approximately contains all the segment of ej | 3/17/2018
India Research Lab Method for Thread Detection § Email Segment Generator (ESG) – Creates segments of it where each segment contains content of only one email. § Segment Signature Generator (SSG): – Generates a signature for a segment • Use near duplicate signatures § For practical implementation, we limit on the number of segment signatures (N) that can be associated with an email, e. g. 20 segments. © 2007 IBM Corporation
India Research Lab Method: Processing at Indexing Time w 1 w 2 Word index wn Meta index ESG Signature index © 2007 IBM Corporation
India Research Lab Method: Processing at Query Time Word index Meta index Signature index w 1 w 2 wn q Use Signature Of First Segment Generating Candidate Thread Set © 2007 IBM Corporation
India Research Lab Detecting Email Threads § Given a Candidate Thread Set Identify the email with only root segment - An email ec is child of an email ep if ec minimally contains ep - | 3/17/2018
India Research Lab Creating Semantic Categories § Focus Categories Documents that are likely to be responsive - Legal Content, Financial Communication, Intellectual Property - High recall - § Filter Categories Documents that are likely to be unresponsive - Bulk emails, Private communication, Jokes - High precision - | 3/17/2018
India Research Lab Creating Semantic Categories § Email Segmentation § Pattern based annotation: Use System T based method § Consolidation Each concept is independent - Apply additional constraints over concepts - | 3/17/2018
India Research Lab Experiments – Near Duplicate Detection § Enron Corpus - 517 K emails from 150 users § Measuring precision - - Manually evaluated near duplicate set for 500 queries With more bits precision is 100% even with 40% similarity threshold § Only 33. 3 % emails are unique | 3/17/2018
India Research Lab Experiments – Email Thread Detection § No ground truth for threads § Subject approximation Method: Based on “Re: ”, “Fw: ” etc in subject § Manually verified the results of thread for our method and subject approximation method - The union of correct emails in thread for both approaches is treated as ground truth. | 3/17/2018
India Research Lab Experiments – Semantic Group § Ground truth: Sampled 2200 emails using generic keywords and then manually labeled | 3/17/2018
India Research Lab Conclusions § We developed a framework that allow group level review of documents § We developed methods for finding syntactic groups such as automated messages for creating groups § We developed methods for finding email threads and semantic groups § We showed significant reduction in the review time by using the group level review and integrated the proposed techniques with IBM Infosphere e. Discovery Analyzer product | 3/17/2018
888f50350b7a6878d1bb8b8c8821e955.ppt