bdc9f125624fbc7aef8c3c59874bf960.ppt
- Number of slides: 1
Improving Newsgroup Clustering by Filtering Author-Specific Words
Yuval Marom and Ingrid Zukerman
School of Computer Science and Software Engineering, Monash University

1 Project Aim

Aim: to use clustering to identify topics in electronic discussions and provide word descriptions of these topics to interested users.

Target application: a help-desk system, but we have used newsgroups as a test-bed. Newsgroups are useful because
• they provide a good approximation to help-desk systems,
• they are readily available on the Internet in large quantities and diversity, and
• they obviate the need for manual tagging of topics, and thus enable automatic evaluation.

2 Dominant Authors

Problem: when people contribute frequently to a newsgroup, their idiosyncratic words dominate the clustering process, as shown in the example below.

In the example shown here, we are clustering email threads from the newsgroup comp.graphics.apps.photoshop, using K-Means. The example shows three clusters and their most characteristic words.

cluster 2: sketches http://www.sover.net... comic smith tony demo realistic...
clusters 1 and 3: photography light convert similar nelson colour view... / saved gif transparent advertisements tom 187@earthlink.net crop unsolicited...

The words that are most characteristic of cluster 2 appear in the signatures of two dominant contributors. That is, the clustering algorithm has created this cluster based on the authors, rather than the topics of discussion. The characteristic words are irrelevant to the topic of discussion.

Examples of postings made by these authors are shown below, as are the authors' highest overall word-usages in the newsgroup. The words that characterize cluster 2 are highlighted in the example postings.

Example posting from “edjhann@hotmail.com”:
To make the type visible make it white put a Stroke on it from the Layer Styles dialog. Or some variation of that.
-Comic book sketches and artwork: http://www.sover.net/~hannigan/edjh.html

Overall word-usage by “edjhann@hotmail.com” in 85 postings:
word: sketches http://www.sover.net... comic artwork layer styles...
frequency: 84 84 6 4...
proportion: 0.988* 0.070 0.047...
* words with a significantly high proportion are filtered

Example posting from “vizrosplugins@yahoo.com”:
The fastest software company is Borland. When I called them to buy JBuilder 5, I was told the current version was 6. But what I got from mail is 7. A month later, I learned 9 was scheduled to release.
Tony G. Smith
Vizros - Realistic 3D page curl plug-ins and more
Demo at http://www.vizros.com/gallery.html

Overall word-usage by “vizrosplugins@yahoo.com” in 47 postings:
word        frequency   proportion
tony        42          0.894*
vizros      36          0.766*
demo        35          0.745*
smith       33          0.702*
realistic   30          0.638*
software    3           0.064
borland     1           0.021
...
* words with a significantly high proportion are filtered

3 Filtering Mechanism

We have built a filtering mechanism that removes undesirable influences of dominant authors in a newsgroup. The mechanism works as follows:
1. Build a “profile” for each person posting to the newsgroup. This profile is a distribution of word posting frequencies, that is, the number of postings in which a word is used.
2. Consider each word in each posting:
   a) calculate a word-usage proportion (word posting frequency divided by the person's total number of postings);
   b) if the proportion is significantly higher than a threshold, filter the word from that posting.

4 Evaluation

Objective: to examine the effect of the filtering mechanism on clustering performance, with respect to
• the topical similarity between the newsgroups, and
• the number of clusters.

Approach:
1. Merge threads from separate newsgroups into a single dataset.
2. Test the ability of the clustering mechanism to separate these threads back into the correct newsgroups.

Issues:
• Clustering is an unsupervised learning mechanism, therefore in order to evaluate clustering performance, we need to determine which clusters match which newsgroups.
• The number of clusters is not always equal to the number of newsgroups.

Solution:
1. Calculate the F-score (details below) for each cluster-newsgroup match.
2. Choose the match with the best F-score.
3. Pool clusters that match the same newsgroup.
4. Calculate an overall F-score as a measure of how well the pooled clusters match the newsgroups.

[Diagram: threads from newsgroups 1 to n are merged into a single dataset, clustered into k clusters, assigned to newsgroups by best F-score match, pooled into n pooled clusters, and compared with the n newsgroups to give the overall F-score.]

F-score calculation:
Precision: $P_{ij} = \frac{|\text{cluster}_i \cap \text{newsgroup}_j|}{|\text{cluster}_i|}$
Recall: $R_{ij} = \frac{|\text{cluster}_i \cap \text{newsgroup}_j|}{|\text{newsgroup}_j|}$
F-score: $F_{ij} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}}$
Overall F-score: $\frac{1}{n} \sum_{j=1}^{n} F_{j}$, where $F_j$ is the F-score between pooled cluster $j$ and newsgroup $j$.

Datasets and results (Dataset 1, Dataset 2, Dataset 3)

[Plot for each dataset: performance (F-score) against number of clusters (k), filter off vs. filter on.]

Newsgroups: lp.hp, comp.text.tex, comp.graphics.apps.photoshop
Results:
• Performance is much poorer without filtering, suggesting that author-specific words create undesirable overlaps between the newsgroups.
• These overlaps are resolved as the value of k increases, because more clusters enable the detection of finer differences between the threads.

Newsgroups: talk.politics.mideast, talk.politics.guns, talk.religion.misc
Results:
• These newsgroups discuss fairly similar topics, so there is a large topical overlap between the threads. Therefore, separating these newsgroups is difficult, yielding a poorer performance.
• However, filtering consistently improves performance, which means that there are also undesirable overlaps created by author-specific words.

Newsgroups: talk.politics.mideast, rec.sport.hockey, sci.space
Results:
• These newsgroups discuss very different topics, so the threads are different enough for the clustering to perform similarly well with and without filtering.
• Nonetheless, filtering has an effect for lower values of k, suggesting that some overlap is created by author-specific words.

5 Conclusion

• We have experimented with newsgroups of varying degrees of topical similarity. The least related newsgroups provide a benchmark for clustering performance, while the more related ones exemplify our target help-desk application.
• The results show that our filtering mechanism generally improves clustering performance, where the magnitude of its effect depends on the topical similarity between the newsgroups and on the number of clusters.
• Author-specific words used by dominant authors can have a detrimental discriminative influence on the clustering of newsgroup threads, and they can also lead to the extraction of uninformative and confusing topic descriptions. Thus, a filtering mechanism such as ours has both quantitative and qualitative benefits.

Acknowledgments

This research was supported in part by Linkage Grant LP0347470 from the Australian Research Council and by an endowment from Hewlett Packard.
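The two steps of the filtering mechanism (build a per-author profile of word posting frequencies, then drop words whose usage proportion is too high) can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the poster does not say how "significantly higher than a threshold" is tested, so the fixed cutoff `threshold=0.5` and the `min_postings` guard for infrequent authors are assumptions, as are all function and parameter names.

```python
from collections import Counter, defaultdict


def build_profiles(postings):
    """Step 1: per-author word posting frequencies.

    `postings` is a list of (author, list_of_words) pairs.
    Returns (totals, profiles): totals[author] is the author's number of
    postings, profiles[author][word] is the number of that author's
    postings in which the word appears at least once.
    """
    totals = Counter()
    profiles = defaultdict(Counter)
    for author, words in postings:
        totals[author] += 1
        # set() so a word repeated within one posting is counted once
        profiles[author].update(set(words))
    return totals, profiles


def filter_postings(postings, threshold=0.5, min_postings=5):
    """Step 2: remove author-specific words from each posting.

    A word is dropped from a posting when the author's word-usage
    proportion (posting frequency / total postings) exceeds `threshold`.
    Both `threshold` and `min_postings` are assumed values: a plain cutoff
    stands in for the unspecified significance test, and authors with
    fewer than `min_postings` postings are left untouched.
    """
    totals, profiles = build_profiles(postings)
    filtered = []
    for author, words in postings:
        n = totals[author]
        if n < min_postings:
            filtered.append((author, list(words)))
            continue
        kept = [w for w in words if profiles[author][w] / n <= threshold]
        filtered.append((author, kept))
    return filtered
```

With the word-usage figures shown for “vizrosplugins@yahoo.com”, a cutoff of this kind would remove "tony" (42/47 ≈ 0.894) but keep "borland" (1/47 ≈ 0.021), which agrees with the starred entries in that list.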
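The evaluation procedure of section 4 (match each cluster to the newsgroup with the best F-score, pool clusters that match the same newsgroup, then compute an overall score) can be sketched as below. This is a minimal sketch under stated assumptions: clusters and newsgroups are represented as sets of thread identifiers, the overall F-score is taken as the mean over newsgroups, and a newsgroup that no cluster matches contributes 0. The function names are hypothetical.

```python
from collections import defaultdict


def f_score(cluster, newsgroup):
    """F-score between a set of clustered threads and a newsgroup's threads."""
    overlap = len(cluster & newsgroup)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(newsgroup)
    return 2 * precision * recall / (precision + recall)


def overall_f_score(clusters, newsgroups):
    """Match, pool, and score.

    `clusters` is a list of sets of thread ids (the k clusters);
    `newsgroups` maps a newsgroup name to its set of thread ids.
    Averaging over newsgroups is an assumption of this sketch.
    """
    pooled = defaultdict(set)
    for cluster in clusters:
        if not cluster:
            continue
        # choose the newsgroup with the best F-score for this cluster
        best = max(newsgroups, key=lambda name: f_score(cluster, newsgroups[name]))
        # pool clusters that match the same newsgroup
        pooled[best] |= cluster
    scores = [f_score(pooled[name], members) for name, members in newsgroups.items()]
    return sum(scores) / len(scores)
```

Pooling is what allows k to exceed the number of newsgroups: several small clusters that all match the same newsgroup are merged before the final score is computed.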