
Finding Structure in Noisy Text: Topic Classification and Unsupervised Clustering
Rohit Prasad, Prem Natarajan, Krishna Subramanian, Shirin Saleem, and Rich Schwartz
{rprasad, pnataraj}@bbn.com
Presented by Daniel Lopresti, 8th January 2007

Outline
· Research objectives and challenges
· Overview of supervised classification using HMMs
· Supervised topic classification of newsgroup messages
· Unsupervised topic discovery and clustering
· Rejection of off-topic messages

Objectives
· Develop a system that performs topic-based categorization of newsgroup messages in two modes
· Mode 1 – Supervised classification: topics of interest to the user are known a priori to the system
  – Spot messages that are on topics of interest to a user
  – Requires rejecting “off-topic” messages to ensure low false alarm rates
· Mode 2 – Unsupervised classification: topics of interest to the user are not known
  – Discover topics in a large corpus without human supervision
  – Automatically organize/cluster the messages to support efficient navigation

Challenges Posed by Newsgroup Messages
· Text in newsgroup messages tends to be “noisy”
  – Abbreviations, misspellings
  – Colloquial (non-grammatical) language
  – Discursive structure with frequent switching between topics
  – Lack of context in some messages makes it impossible to understand a message without access to the complete thread
· Supervised classification requires annotation of newsgroup messages with a set of topic labels
  – Every non-trivial message contains multiple topics
  – No completely annotated corpus of newsgroup messages exists
    • By complete annotation we mean tagging each message with ALL relevant topics

Outline
· Research objectives and challenges
Ø Overview of supervised classification using HMMs
· Supervised topic classification of newsgroup messages
· Unsupervised topic discovery and clustering
· Rejection of off-topic messages

Supervised Topic Classification
[Flow diagram] Audio or images from sources such as CNN, NBC, CBS, and NPR are converted to text by ASR or OCR (e.g., “President Clinton dumped his embattled Mexican bailout today. Instead, he announced another plan that doesn’t need congressional approval.”). A topic classifier, using topic models trained on a topic-labeled broadcast news corpus (e.g., Primary Source Media: 5,000 topics, 4–5 topics per story, 40,000 stories per year), assigns several topics to each story, such as “Clinton, Bill”, “Mexico”, “Money”, and “Economic assistance, American”. Applications: news sorting, information retrieval, detection of key events, improved speech recognition.

OnTopic™ HMM Topic Model
· A probabilistic hidden Markov model (HMM) that attempts to capture the generation of a story
· Assumes a story can be on multiple topics, and that different words are related to different topics
· Uses an explicit state for General Language because most words in a story are not related to any topic
· Scalable to a large number of topics; requires only the topic labels of each story to train the model
· Language-independent methodology
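The slide describes the model only qualitatively. As a rough illustration of the kind of topic-mixture scoring such a model implies (each word generated either by the General Language state or by one of the story's topics), here is a minimal sketch; the GL weight, the uniform split over topics, and the smoothing floor are assumptions, not the published OnTopic parameterization.

```python
import math

def story_log_likelihood(words, story_topics, p_word_given_topic, p_word_given_gl,
                         gl_weight=0.8):
    """Score a story under a mixture of its topics plus a General Language (GL) state.

    Each word is assumed to come from GL with weight gl_weight, or from one of the
    story's topics with the remaining mass split uniformly. All probability tables
    here are illustrative placeholders, not trained model parameters.
    """
    topic_weight = (1.0 - gl_weight) / max(len(story_topics), 1)
    log_lik = 0.0
    for w in words:
        p = gl_weight * p_word_given_gl.get(w, 1e-6)
        for t in story_topics:
            p += topic_weight * p_word_given_topic.get((t, w), 1e-6)
        log_lik += math.log(p)
    return log_lik
```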

Outline
· Research objectives and challenges
· Overview of supervised classification using HMMs
Ø Supervised topic classification of newsgroup messages
· Unsupervised topic discovery and clustering
· Rejection of off-topic messages

Experiment Setup
· Performed experiments with two newsgroup corpora
  – Automated Front End (AFE) newsgroup corpus collected by Washington Univ.
  – 20 Newsgroups (NG) corpus from http://people.csail.mit.edu/jrennie/20Newsgroups/
· Assumed the name of the newsgroup is the ONLY topic associated with each message
· Although cost-effective, this assumption leads to inaccuracies in estimating system performance
  – Messages typically contain multiple topics, some of which may be related to the dominant theme of another newsgroup

AFE Newsgroups Corpus
· Google newsgroups data collected by Washington University from 12 diverse newsgroups
· Messages posted to 11 newsgroups are considered in-topic; all messages posted to the “talk.origins” newsgroup are considered off-topic
· Message headers were stripped to exclude the newsgroup name from training and test messages
· Split the corpus into training, test, and validation sets according to the distribution specified in the config.xml file provided by Washington University
  – But since the filenames were truncated we could not select the same messages as Washington University

AFE Newsgroups Corpus

Newsgroup                               #Training    #Test    #Validation
alt.sports.baseball.stl_cardinals           21          33         10
comp.ai.neural_nets                         15          25          7
comp.programming.threads                    31          47         15
humanities.musics.composers.wagner          19          31          9
misc.consumers.frugal_living                10          17          5
misc.writing.moderated                      24          37         12
rec.equestrian                              27          41         13
rec.martial_arts.moderated                  18          29          9
sci.archaelogy.moderated                    46          69         23
sci.logic                                   20          30         10
soc.libraries.talk                          10          17          5
talk.origins (chaff)                       245       10401        122
Total messages (w/o chaff)                 241         376        118
Total messages (w/ chaff)                  486       10777        240
Total words (w/o chaff)                   103K        118K        32K
Total words (w/ chaff)                    187K        3.4M        63K

Closed-set Classification Accuracy on AFE
· Trained OnTopic models on 11 newsgroups
  – Excluded messages from the talk.origins newsgroup because they are “off-topic” w.r.t. the topics of interest
  – Used stemming since some newsgroups had only a few training messages
· Classified 376 in-topic messages
· Achieved overall top-choice accuracy of 91.2%
  – Top-choice accuracy: percentage of times the top-choice (best) topic returned by OnTopic was the correct answer
· Top-choice accuracy was worse on newsgroups with fewer training examples

Closed-set Classification Accuracy (Contd.)

Newsgroup                            #Training Messages   %Top-Choice Accuracy
misc.consumers.frugal_living                 10                  47.1%
soc.libraries.talk                           10                  58.8%
comp.ai.neural_nets                          15                  80.0%
rec.martial_arts.moderated                   18                  86.2%
humanities.musics.composers.wagner           19                 100.0%
sci.logic                                    20                  96.7%
alt.sports.baseball.stl_cardinals            21                 100.0%
misc.writing.moderated                       24                  91.9%
rec.equestrian                               27                  97.6%
comp.programming.threads                     31                 100.0%
sci.archaelogy.moderated                     46                  95.7%
Overall                                     241                  91.2%

“20 Newsgroups” Corpus
· Downloaded the 20 Newsgroups corpus (“20 NG”) from http://people.csail.mit.edu/jrennie/20Newsgroups/
· Corpus characteristics:
  – Messages from 20 newsgroups, with an average of 941 messages per newsgroup
  – Average of 350 threads in each newsgroup
  – Average message length of 300 words (170 words after headers and “replied-to” text are excluded)
  – Some newsgroups are similar – the 20 newsgroups span 6 broad subjects
· Data pre-processing (a rough sketch follows below)
  – Stripped message headers, e-mail IDs, and signatures to exclude newsgroup-related information
· Corpus was split into training, development, and validation sets for topic classification experiments
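The slides list the pre-processing steps but not their implementation. The sketch below shows one plausible way to strip headers, quoted text, e-mail IDs, and signatures from a raw message; the regular expressions, the ">" quote convention, and the "--" signature delimiter are assumptions, not the authors' code.

```python
import re

def strip_message(raw_text):
    """Remove headers, quoted ('replied-to') lines, a trailing signature, and e-mail IDs."""
    # Headers: everything up to the first blank line (RFC 822 style).
    body = re.split(r"\n\s*\n", raw_text, maxsplit=1)[-1]
    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):      # quoted text from earlier messages
            continue
        if line.strip() == "--":               # conventional signature delimiter
            break
        kept.append(line)
    text = "\n".join(kept)
    text = re.sub(r"\S+@\S+", " ", text)       # drop e-mail IDs
    return text
```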

Distribution of Messages Across Newsgroups

Newsgroup                   Total Messages   Unique Threads   Messages Per Thread
alt.atheism                       799               87                9.2
comp.graphics                     973              532                1.8
comp.os.ms-windows.misc           985              479                2.1
comp.sys.ibm.pc.hardware          982              536                1.8
comp.sys.mac.hardware             961              467                2.1
comp.windows.x                    980              773                1.3
misc.forsale                      972              877                1.1
rec.autos                         990              260                3.8
rec.motorcycles                   994              177                5.6
rec.sport.baseball                994              272                3.7
rec.sport.hockey                  999              346                2.9
sci.crypt                         991              216                4.6
sci.electronics                   981              395                2.5
sci.med                           990              314                3.2
sci.space                         987              296                3.3
soc.religion.christian            997              295                3.4
talk.politics.guns                910              145                6.3
talk.politics.mideast             940              307                3.1
talk.politics.misc                775              133                5.8
talk.religion.misc                628              103                6.1
Average                           941              350                3.7

Organization of Newsgroups By Subject Matter (6 broad subjects)
· comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
· rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
· sci.crypt, sci.electronics, sci.med, sci.space
· talk.politics.misc, talk.politics.guns, talk.politics.mideast
· talk.religion.misc, alt.atheism, soc.religion.christian
· misc.forsale

Splits for Training and Testing
· 80:20 split between training and test/validation sets, under three different partitioning schemes (a sketch of each scheme follows below)
· Thread Partitioning: an entire thread is assigned to one of the training, development, or validation sets
· Chronological Partitioning: messages in each thread are split between training, test, and validation; the first 80% go to training and the rest to test and validation
· Random Partitioning: 80:20 split between training and test/validation, without regard to thread or chronology
  – Prior work by other researchers with 20 NG used random partitioning
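The three schemes above can be made concrete with the small sketch below. It assumes each message is a dict carrying a thread identifier and a date (the field names 'thread_id' and 'date' are illustrative, not the corpus format).

```python
import random
from collections import defaultdict

def split_messages(messages, scheme, train_frac=0.8, seed=0):
    """Return (train, heldout) under 'thread', 'chronological', or 'random' partitioning."""
    rng = random.Random(seed)
    if scheme == "random":
        msgs = messages[:]
        rng.shuffle(msgs)
        cut = int(train_frac * len(msgs))
        return msgs[:cut], msgs[cut:]

    threads = defaultdict(list)
    for m in messages:
        threads[m["thread_id"]].append(m)

    if scheme == "thread":
        # Whole threads go to one side or the other.
        ids = list(threads)
        rng.shuffle(ids)
        cut = int(train_frac * len(ids))
        train = [m for t in ids[:cut] for m in threads[t]]
        heldout = [m for t in ids[cut:] for m in threads[t]]
        return train, heldout

    if scheme == "chronological":
        # First 80% of each thread (by date) trains; the rest is held out.
        train, heldout = [], []
        for msgs in threads.values():
            msgs.sort(key=lambda m: m["date"])
            cut = int(train_frac * len(msgs))
            train.extend(msgs[:cut])
            heldout.extend(msgs[cut:])
        return train, heldout

    raise ValueError(f"unknown scheme: {scheme}")
```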

Closed-set Classification Results
· Trained an OnTopic model set consisting of 20 topics
· Classified 2K test messages
  – Two test conditions: one where “replied-to” text (from previous messages) is included, and one where it is stripped from the test message

%Top-Choice Accuracy
Test Message Type          Thread   Chronological   Random
w/o “replied-to” text       74.5        77.8         79.7
w/ “replied-to” text        76.0        79.6         83.2

· Classification accuracy is low due to the following:
  – Significant subject overlap between newsgroups
  – Lack of useful a priori probabilities due to the almost uniform distribution of topics, unlike the AFE newsgroup data

Detailed Results for Thread Partitioning

Newsgroup                   %Top-choice Accuracy   Top Confusion
talk.religion.misc                 29.3            talk.politics.guns
misc.forsale                       51.0            comp.os.ms-windows.misc
talk.politics.misc                 57.5            talk.politics.guns
sci.electronics                    58.3            rec.autos
comp.os.ms-windows.misc            62.0            comp.sys.mac.hardware
alt.atheism                        63.4            soc.religion.christian
comp.graphics                      68.6            comp.windows.x
comp.sys.ibm.pc.hardware           72.6            comp.os.ms-windows.misc
comp.sys.mac.hardware              74.5            comp.sys.ibm.pc.hardware
comp.windows.x                     77.1            comp.sys.ibm.pc.hardware
rec.motorcycles                    81.9            rec.autos
talk.politics.guns                 82.9            sci.crypt
talk.politics.mideast              84.6            rec.motorcycles
soc.religion.christian             87.4            sci.med
sci.crypt                          89.0            talk.politics.guns
rec.sport.baseball                 90.7            rec.sport.hockey
rec.autos                          93.4            misc.forsale
sci.med                            93.6            misc.forsale
rec.sport.hockey                   94.2            rec.sport.baseball
sci.space                          94.6            rec.autos
Overall                            76.0

Detailed Results for Chronological Partitioning

Newsgroup                   %Top-choice Accuracy   Top Confusion
talk.religion.misc                 35.0            alt.atheism
misc.forsale                       53.8            comp.sys.ibm.pc.hardware
comp.os.ms-windows.misc            62.9            comp.windows.x
comp.graphics                      63.9            comp.os.ms-windows.misc
sci.electronics                    64.3            rec.autos
talk.politics.misc                 71.4            talk.politics.guns
alt.atheism                        72.2            soc.religion.christian
comp.sys.ibm.pc.hardware           73.2            comp.os.ms-windows.misc
comp.sys.mac.hardware              75.8            comp.sys.ibm.pc.hardware
comp.windows.x                     81.4            comp.os.ms-windows.misc
rec.motorcycles                    86.7            rec.autos
sci.med                            86.9            sci.space
rec.autos                          88.7            comp.os.ms-windows.misc
talk.politics.guns                 90.1            talk.politics.misc
talk.politics.mideast              90.2            alt.atheism
sci.space                          91.9            comp.graphics
rec.sport.baseball                 92.9            rec.sport.hockey
soc.religion.christian             94.0            alt.atheism
sci.crypt                          96.0            sci.electronics
rec.sport.hockey                   98.0            sci.med
Overall                            79.6

Manual Clustering and Human Review
· Manually clustered newsgroups into 12 topics after reviewing the content of training messages
· Recomputed top-choice classification accuracy using the cluster information

%Top-Choice Accuracy
                  Thread   Chronological   Random
w/o Clustering     76.0        79.6         83.2
w/ Clustering      81.5        84.8         88.2

· Effect of the presence of multiple topics in a message and an incomplete reference topic label set
  – Manually reviewed messages from the 4 categories with the lowest performance for the “Chronological” split
  – Accuracy increases to 88.0% (from 84.8%) following manual rescoring

Cluster Table

Topic Cluster     Newsgroup(s)
Autos             rec.autos, rec.motorcycles
Graphics          comp.graphics
Macintosh         comp.sys.mac.hardware
Misc.forsale      misc.forsale
Politics          talk.politics.guns, talk.politics.mideast, talk.politics.misc
Windows           comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.windows.x
Religion          soc.religion.christian, talk.religion.misc, alt.atheism
Sports            rec.sport.baseball, rec.sport.hockey
sci.crypt         sci.crypt
sci.electronics   sci.electronics
sci.med           sci.med
sci.space         sci.space
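One simple way to recompute top-choice accuracy at the cluster level, as described on the previous slide, is to count a prediction as correct whenever the predicted and reference newsgroups fall in the same cluster. The mapping below follows the table above; the scoring function itself is an illustrative sketch, not necessarily the authors' scoring script.

```python
NEWSGROUP_TO_CLUSTER = {
    "rec.autos": "Autos", "rec.motorcycles": "Autos",
    "comp.graphics": "Graphics",
    "comp.sys.mac.hardware": "Macintosh",
    "misc.forsale": "Misc.forsale",
    "talk.politics.guns": "Politics", "talk.politics.mideast": "Politics",
    "talk.politics.misc": "Politics",
    "comp.os.ms-windows.misc": "Windows", "comp.sys.ibm.pc.hardware": "Windows",
    "comp.windows.x": "Windows",
    "soc.religion.christian": "Religion", "talk.religion.misc": "Religion",
    "alt.atheism": "Religion",
    "rec.sport.baseball": "Sports", "rec.sport.hockey": "Sports",
    "sci.crypt": "sci.crypt", "sci.electronics": "sci.electronics",
    "sci.med": "sci.med", "sci.space": "sci.space",
}

def cluster_accuracy(predicted, reference):
    """Fraction of messages whose predicted newsgroup lands in the same cluster as the reference."""
    hits = sum(NEWSGROUP_TO_CLUSTER[p] == NEWSGROUP_TO_CLUSTER[r]
               for p, r in zip(predicted, reference))
    return hits / len(reference)
```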

Outline
· Research objectives and challenges
· Overview of supervised classification using HMMs
· Supervised topic classification of newsgroup messages
Ø Unsupervised topic discovery and clustering
· Rejection of off-topic messages

The Problem
· Why unsupervised topic discovery and clustering?
  – Topics of interest may not be known a priori
  – It may not be feasible to annotate documents with a large number of topics
· Goals
  – Discover topics and meaningful topic names
  – Cluster topics instead of messages, to automatically organize messages/documents for navigation at multiple levels

Unsupervised Topic Discovery [3]
[Flow diagram] Input documents →
1. Add phrases: frequent phrases, using the MDL criterion; names, using IdentiFinder™ → augmented docs
2. Initial topics for each doc: select words/phrases with the highest tf-idf; keep topic names that occur in more than 3 documents → topic names (a sketch of this step follows below)
3. Topic training (key step): associate many words/phrases with topics; use EM training in the OnTopic™ system → topic models
4. Topic classification: assign topics to all documents → topic-annotated corpus

3. S. Sista et al. An Algorithm for Unsupervised Topic Discovery from Broadcast News Stories. In Proceedings of ACM HLT, San Diego, CA, 2002.
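A minimal sketch of the "initial topics" step above: take each document's highest tf-idf terms as candidate topic names and keep those selected for more than 3 documents. The tokenization, the number of candidates per document, and this particular reading of the ">3 documents" rule are assumptions.

```python
import math
from collections import Counter

def initial_topic_names(docs, per_doc=5, min_doc_count=4):
    """docs: list of token lists (words plus pre-extracted phrases).
    Returns candidate topic names: each document's top tf-idf terms, kept only if
    they were selected for at least min_doc_count documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    selected_in = Counter()
    for doc in docs:
        tf = Counter(doc)
        ranked = sorted(tf, key=lambda t: tf[t] * math.log(n / df[t]), reverse=True)
        for term in ranked[:per_doc]:
            selected_in[term] += 1
    return [t for t, c in selected_in.items() if c >= min_doc_count]
```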

UTD Output (English document)
[Screenshot] News source: Associated Press, November 2001

UTD Output (Arabic document)
[Screenshot] News source: Al_Hayat (Aug–Nov 2001)

Unsupervised Topic Clustering
· Organize automatically discovered topics (rather than documents) into a hierarchical topic tree
· Leaves of the topic tree are the fine-grained topics discovered by the UTD process
· Intermediate nodes are logical collections of topics
· Each node in the topic tree has a set of messages associated with it
  – A message can be assigned to multiple topic clusters by virtue of the multiple topic labels assigned to it by the UTD process
  – Overcomes the single-cluster assignment of a document prevalent in most document clustering approaches
· The resulting topic tree enables browsing of a large corpus at multiple levels of granularity
  – One can reach a message through different sets of logical actions

Topic Clustering Algorithm
· Agglomerative clustering for organizing topics in a hierarchical tree structure
· Topic clustering algorithm (a minimal sketch follows below):
  Step 1: Assign each topic to its own individual cluster
  Step 2: For every pair of clusters, compute the distance between the two clusters
  Step 3: If the smallest distance is lower than a threshold, merge the closest pair into a single cluster and go to Step 2; otherwise stop clustering
· Modification: merge more than two clusters at each iteration to limit the number of levels in the tree
  – Also add other constraints, e.g., limiting the branching factor, number of levels, etc.
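The sketch below implements Steps 1–3 only, with a pluggable distance function (any of the metrics on the next slide); it omits the multi-way-merge modification and the branching-factor constraints, and the stopping threshold is left to the caller.

```python
def agglomerative_topic_clustering(topics, distance, threshold):
    """Start with singleton clusters and repeatedly merge the closest pair
    while its distance is below `threshold`.

    `topics` is a list of topic identifiers; `distance(cluster_a, cluster_b)`
    takes two tuples of topic ids and returns a float."""
    clusters = [(t,) for t in topics]               # Step 1: one cluster per topic
    while len(clusters) > 1:
        # Step 2: distance between every pair of clusters
        pairs = [(distance(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        d, i, j = min(pairs)
        if d >= threshold:                          # Step 3: nothing close enough left
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```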

Distance Metrics for Topic Clustering
· Metrics computed from topic co-occurrences:
  – Co-occurrence probability
  – Mutual information
· Metrics computed from support/key word distributions:
  – Support-word overlap between Ti and Tj
  – Kullback-Leibler (KL) and J-divergence between the two probability mass functions
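For the support/key-word metrics, a hedged sketch of KL divergence, J-divergence, and one simple definition of support-word overlap is shown below; the smoothing constant and the overlap definition (Jaccard) are assumptions.

```python
import math

def kl_divergence(p, q, vocab, eps=1e-9):
    """KL(p || q) over a shared vocabulary; eps smooths zero probabilities."""
    return sum((p.get(w, 0.0) + eps) * math.log((p.get(w, 0.0) + eps) /
                                                (q.get(w, 0.0) + eps))
               for w in vocab)

def j_divergence(p, q, vocab):
    """Symmetric (J) divergence: KL(p||q) + KL(q||p)."""
    return kl_divergence(p, q, vocab) + kl_divergence(q, p, vocab)

def support_word_overlap(words_i, words_j):
    """Fraction of support/key words shared by topics Ti and Tj (Jaccard overlap)."""
    words_i, words_j = set(words_i), set(words_j)
    return len(words_i & words_j) / max(len(words_i | words_j), 1)
```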

Clustering Example
[Figure]

Evaluation of UTC
· Initial topic clustering experiments performed on the 20 NG corpus
  – 3,343 topics discovered from 19K messages
  – Allowed a maximum of 4 topics to be clustered at each iteration
· Evaluation of UTC has been mostly subjective, with a few objective metrics used to evaluate the clustering
· Clustering rate (the rate of increase of clusters with more than one topic) seems to be well correlated with subjective judgments
· A combination of J-divergence and topic co-occurrence seems to result in the most uniform, logical clusters

Key Statistics of the UTC Topic Tree for the 20 NG Corpus
· Measured some key features of the topic tree that could have a significant impact on user experience

Key Feature                  Average   Maximum
Number of Levels                -          6
Branching Factor               2.4         4
No. of topics in a cluster     2.7        22

Screenshot of the UTC-based Message Browser
[Screenshot]

Outline
· Research objectives and challenges
· Overview of supervised classification using HMMs
· Supervised topic classification of newsgroup messages
· Unsupervised topic discovery and clustering
Ø Rejection of off-topic messages

Off-topic Message Rejection
· A significant fraction of the messages processed by the topic classification system are likely to be off-topic
· Rejection problem: design a binary classifier for accepting or rejecting the top-choice topic
  – Accepting a message means asserting that the message contains the top-choice topic
  – Rejecting a message means asserting that the message does not contain the top-choice topic

Rejection Algorithm
· Use the General Language (GL) topic model as the model for off-topic messages
· Compute the ratio of the log-posteriors of the top-choice topic Tj and the GL topic as a relevance score
· Accept the top-choice topic Tj if the relevance score exceeds a threshold (see the sketch below)
· The threshold can be topic-independent or topic-specific
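The acceptance formula on the original slide is an image that did not survive the text export. Based on the description above, a hedged sketch of the rule (assuming the relevance score is the difference of log posteriors between the top-choice topic and the GL topic) is:

```python
import math

def accept_top_choice(posteriors, top_topic, threshold):
    """posteriors: dict mapping topic name (including 'GL') to its posterior
    probability for one message. Accept the top-choice topic if
    log P(top | msg) - log P(GL | msg) >= threshold."""
    score = math.log(posteriors[top_topic]) - math.log(posteriors["GL"])
    return score >= threshold, score
```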

Parametric Topic-Specific Threshold Estimation
· Compute the empirical distribution (mean m and standard deviation s) of the log-likelihood-ratio score over a large corpus of off-topic documents
  – Can assume most messages in the corpus are off-topic
  – Gives more reliable statistics than if computed over on-topic messages
· Normalize the score of a test message before comparing it to a topic-independent threshold (see the sketch below)
· Can be thought of as a transformation of the topic-independent threshold rather than as score normalization
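A minimal sketch of that normalization step, assuming it is a per-topic z-normalization of the relevance score by the off-topic mean and standard deviation (the exact transformation on the slide is not shown in the export):

```python
def normalized_score(score, topic, offtopic_mean, offtopic_std):
    """Normalize a message's relevance score for `topic` by the mean (m) and
    standard deviation (s) of the scores that off-topic messages receive for
    that topic."""
    return (score - offtopic_mean[topic]) / offtopic_std[topic]

def accept_parametric(score, topic, offtopic_mean, offtopic_std, global_threshold):
    """Compare the normalized score against a single topic-independent threshold."""
    return normalized_score(score, topic, offtopic_mean, offtopic_std) >= global_threshold
```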

Parametric Topic-Specific Threshold Estimation (Contd.)
· Perform a null-hypothesis test using the score distribution of the off-topic messages
  – A message that is not off-topic is on-topic
  – A message several standard deviations away from the off-topic mean is very likely to be on-topic
· [Figure] Example histogram of normalized test scores, showing the off-topic and on-topic score distributions (y-axis scaled to magnify the view for on-topic messages)

Non-Parametric Threshold Estimation
· Accept the top-choice topic Tj if its relevance score exceeds a topic-specific threshold a(Tj)
· Select a(Tj) by constrained optimization (one plausible reading is sketched below)
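The acceptance rule and the optimization on this slide are also images lost in the export. One plausible reading, and it is only an assumption, is to choose each topic's threshold as the smallest value whose false-acceptance rate on held-out off-topic scores stays within a budget, which in turn minimizes false rejections of on-topic messages:

```python
def pick_topic_threshold(on_topic_scores, off_topic_scores, max_false_accept=0.01):
    """Choose the lowest threshold whose false-acceptance rate on off-topic scores
    is at most `max_false_accept`; lower thresholds reject fewer on-topic messages."""
    candidates = sorted(set(on_topic_scores) | set(off_topic_scores))
    for th in candidates:
        false_accepts = sum(s >= th for s in off_topic_scores) / len(off_topic_scores)
        if false_accepts <= max_false_accept:
            return th
    return float("inf")   # no threshold meets the constraint; reject everything
```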

Experiment Configuration
· In-topic messages are from 14 newsgroups of the 20 NG corpus
  – Messages from six newsgroups were discarded due to significant subject overlap with the off-topic messages
· Off-topic/chaff messages are from two sources:
  – the talk.origins newsgroup from the AFE corpus
  – a large collection of messages from 4 Yahoo! groups
· Used jack-knifing to estimate rejection thresholds on the Train+Dev set and then applied them to the validation set

Distribution of Messages
Message Type        Train+Dev   Validation
In-topic               5.6K        2.8K
Off-topic/Chaff        9.6K         76K

Comparison of Threshold Estimation Techniques
[Figure]

Comparison of Threshold Estimation Techniques (Contd.)
[Figure]

Comparison of Threshold Estimation Techniques (Contd.)

Rejection Method                              %False Rejections @ 1% False Acceptances
Topic-independent thresholds                                  31.4
Topic-specific thresholds (parametric)                        27.4
Topic-specific thresholds (non-parametric)                    23.7

Conclusions
· HMM-based topic classification delivers performance on the 20 NG and AFE corpora comparable to [1], [2]
· Closed-set classification accuracy on 20 NG data after clustering is slightly worse than on AFE data
  – The key reason is significant subject overlap between the newsgroups
· Clustered categories still exhibited significant subject overlap across clusters
  – The data set creators assign only six different subjects (topics) to the 20 NG set

1. J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In Proceedings of ICML 2003, Washington, D.C., 2003.
2. S. Eick, J. Lockwood, R. Loui, J. Moscola, C. Kastner, A. Levine, and D. Weishar. Transformation Algorithms for Data Streams. In Proceedings of IEEE AAC, March 2005.

Conclusions (Contd.)
· Novel estimation of topic-specific thresholds outperforms a topic-independent threshold for rejection of off-topic messages
· Introduced a novel concept of unsupervised topic clustering for organizing messages
  – Built a demonstration prototype for topic-tree-based browsing of a large corpus of archived messages
· Future work will focus on measuring the utility of UTC for user experience and on objective metrics to evaluate UTC performance