2293e30e36db4542633df564bd1b7c31.ppt
- Number of slides: 18
Grade clustering and seriation of words based on their co-occurrences Emilia Jarochowska & Krzysztof Ciesielski Institute of Computer Science, Poland
Summary Using term co-occurrence data extracted from a newsgroup sample, we seek the most regular arrangement of the terms and show that the resulting pattern allows convenient visualization and clustering.
Clustering of documents and terms: what for? • Improving and grouping search results • Finding synonyms: construction of thesauri, query expansion based on the synonyms of the entered terms • Finding collocations
The common approach to term clustering • Association matrices which quantify term correlations • This global approach does not necessarily adapt well to the local context
The local approach Co-occurrence is identified within a sliding window instead of the whole document, and the counts are arranged into a contingency table (a symmetric matrix).
Material A collection of posts from 20 newsgroups, widely used as a benchmark for text-mining methods: http://people.csail.mit.edu/jrennie/20Newsgroups/ • Groups: comp.windows.x, rec.antiques.radio+phono, rec.sport.hockey, sci.med, talk.religion.misc • 363 automatically selected keywords representing these groups • Selection condition: entropy of within-group frequencies
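The entropy condition mentioned above can be illustrated with a minimal sketch. The slide gives neither the threshold nor the exact selection rule, so the reading used here (a keyword is group-specific when the entropy of its frequencies across groups is low) is an assumption, and the function name `within_group_entropy` is hypothetical.

```python
import math

def within_group_entropy(counts):
    """Shannon entropy (in bits) of a term's frequency distribution across groups.

    `counts` holds the term's frequency in each newsgroup. Low entropy means
    the term is concentrated in few groups (assumed here to mark a good keyword).
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical frequencies of two terms in five groups.
specific_term = [40, 0, 0, 0, 0]   # occurs in one group only
generic_term = [8, 8, 8, 8, 8]     # spread evenly over all groups
```

Under this reading, `specific_term` scores lower than `generic_term`, so a threshold on the entropy would keep the former and drop the latter.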
Methods • Stemming: reducing inflected forms to one representative • HAL (Hyperspace Analogue to Language) • Grade correspondence analysis implemented in the GradeStat program
HAL generates a matrix H in which the cell h_ij is a similarity measure of the terms i and j. If s = (t_1, ..., t_k) is a sentence (an ordered list of terms), then h_ij is the sum, over all sentences in a collection of documents, of the co-occurrences of terms i and j. Several forms of normalization are possible.
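A minimal sketch of such a matrix, assuming unweighted counts, a look-ahead window measured in terms, and a symmetric result. The original HAL model also weights pairs by distance, which is left out here. The function name `hal_matrix` and the `window` parameter are illustrative, not taken from the slides.

```python
from collections import defaultdict

def hal_matrix(sentences, window=5):
    """Sum co-occurrences of term pairs over all sentences.

    Two terms co-occur when they are at most `window` positions apart
    within the same sentence. Returns a dict keyed by (term_i, term_j);
    the table is symmetric, so h[(a, b)] == h[(b, a)].
    """
    h = defaultdict(int)
    for sentence in sentences:
        for pos, term in enumerate(sentence):
            # Slide over the next `window` terms of the same sentence.
            for other in sentence[pos + 1 : pos + 1 + window]:
                h[(term, other)] += 1
                if other != term:
                    h[(other, term)] += 1
    return h

# Toy input: sentences are ordered lists of (already stemmed) terms.
sentences = [["ftp", "server", "unix"], ["ftp", "server"]]
h = hal_matrix(sentences, window=1)
```

With `window=1` only adjacent terms are counted, so "ftp" and "unix" do not co-occur in the first sentence; widening the window to 2 would pair them.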
Grade Correspondence Analysis GCA transforms a data matrix into a probability table and iteratively permutes rows and columns to make it more strongly and regularly positively dependent by maximizing Kendall’s tau.
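The idea can be sketched on a tiny table. This is not the GradeStat algorithm: the iterative search is replaced by a brute-force scan over all row and column permutations, which is only feasible for toy sizes, and Kendall's tau is computed as the probability of concordance minus the probability of discordance for two independent draws from the table. All function names are illustrative.

```python
from itertools import permutations

def to_probability_table(counts):
    """Divide every cell by the grand total so the cells sum to 1."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]

def kendall_tau(p):
    """P(concordant) - P(discordant) for two independent draws from table p."""
    tau = 0.0
    n_rows, n_cols = len(p), len(p[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:      # second cell lies lower right: concordant
                        tau += 2 * p[i][j] * p[k][l]
                    elif l < j:    # second cell lies lower left: discordant
                        tau -= 2 * p[i][j] * p[k][l]
    return tau

def best_arrangement(counts):
    """Brute-force stand-in for GCA: try every ordering, keep the highest tau."""
    p = to_probability_table(counts)
    best = None
    for rows in permutations(range(len(p))):
        for cols in permutations(range(len(p[0]))):
            arranged = [[p[r][c] for c in cols] for r in rows]
            tau = kendall_tau(arranged)
            if best is None or tau > best[0]:
                best = (tau, rows, cols)
    return best

counts = [[0, 5, 0], [5, 0, 0], [0, 0, 5]]
tau, row_order, col_order = best_arrangement(counts)
```

For this toy table the best ordering moves all mass onto the main diagonal, the most regular positive dependence the data allow.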
Regularity and deviation from it In the most regular arrangement possible, the deviation from regularity for each pair of observations or variables can be measured as ar_max - |ar|, where ar is the concentration index of the two distributions describing that particular pair of observations/variables, and ar_max is the respective maximum concentration index.
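A sketch of this measure for two discrete distributions P and Q over the same ordered categories. The slide does not define the concentration index, so the definition used here is an assumption taken from the usual grade data analysis convention: ar = 1 - 2 × (area under the concentration curve of Q relative to P), with ar_max obtained by reordering the categories by increasing ratio q_i / p_i. Function names are illustrative.

```python
def concentration_index(p, q):
    """ar of Q relative to P, categories taken in the given order.

    The concentration curve joins the points (cumulative P, cumulative Q);
    ar = 1 - 2 * area under that piecewise-linear curve (trapezoid rule).
    """
    twice_area = 0.0
    cum_q = 0.0
    for p_i, q_i in zip(p, q):
        twice_area += p_i * (2 * cum_q + q_i)
        cum_q += q_i
    return 1.0 - twice_area

def max_concentration_index(p, q):
    """ar after sorting categories by q_i / p_i, which makes the curve convex."""
    order = sorted(
        range(len(p)),
        key=lambda i: q[i] / p[i] if p[i] > 0 else float("inf"),
    )
    return concentration_index([p[i] for i in order], [q[i] for i in order])

def deviation_from_regularity(p, q):
    """ar_max - |ar| for one pair of distributions."""
    return max_concentration_index(p, q) - abs(concentration_index(p, q))

# Toy pair of distributions over three ordered categories.
p = [1 / 3, 1 / 3, 1 / 3]
q = [0.2, 0.5, 0.3]
deviation = deviation_from_regularity(p, q)
```

The deviation is zero when the given ordering already yields a convex (or concave) concentration curve and grows as the pair departs from that regular shape.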
Overrepresentation maps Contingency matrices are here visualized by means of overrepresentation maps. Overrepresentation is defined as follows:
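The formula from the slide is not present in the extracted text. As a hedged illustration only, the sketch below uses the ratio commonly used for overrepresentation maps in grade data analysis: the observed cell probability divided by the probability expected under independence of rows and columns, p_ij / (p_i. × p_.j). Whether the slide used exactly this form is an assumption, and the function name is illustrative.

```python
def overrepresentation(counts):
    """Observed / expected-under-independence ratio for every cell.

    Values above 1 mark overrepresented cells, values below 1
    underrepresented ones; cells in empty rows or columns are set to 0.
    """
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]
    col_p = [sum(col) / total for col in zip(*counts)]
    return [
        [
            (cell / total) / (r * c) if r * c > 0 else 0.0
            for cell, c in zip(row, col_p)
        ]
        for row, r in zip(counts, row_p)
    ]

# Toy contingency table with all mass on the diagonal.
ratios = overrepresentation([[10, 0], [0, 10]])
```

A map is then obtained by shading each cell according to its ratio, with cell widths and heights proportional to the marginal sums.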
Results
Polarization between groups of terms • Computer-related terms: ftp, server, unix, MIT, Columbia, mac, graphic, video, display, internet • Political and religious terms: murder, belief, kill, faith, Jewish, moral, hell, death, children, shot, war, fire, arm, defense, absolut, burn, Bible
Deviations from regularity • Are themselves more regular than the original data • Thus are better descriptors of the position of a term in the dataset
Clusters [figure: clusters labelled computers, commerce, general; terms shown include war, city, baseball, example, april, house, produce, company, ftp] Examples of seriation [figure: sport, war, religion]
Conclusions • We identified two disjoint groups composed of very specific terms and a group of terms with various affinities to these extremes → a scale obtained in a process of unsupervised learning • Deviation from regularity in the dataset characterizes terms better than raw co-occurrence data alone
Plans for the future Deviation from regularity, used as a criterion in outlier detection, might indicate words used inappropriately for their context, neologisms, etc.
Thank you for your attention http://gradestat.ipipan.waw.pl/english/