

  • Number of slides: 57

SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS
Ravi N. Veer, Prakash S., Vivek Shenoy T

Contents • Introduction • Literature Review • Document Representation • Text Classifiers • Implementation Aspects • Results and Analysis • Conclusion • Future Enhancements • References

INTRODUCTION

 • Current scenario of documents on the Web: structured data and unstructured data. • Information Retrieval (IR) deals with the representation, storage, organization of, and access to information items. This representation of the information is used to manipulate the unstructured data. • Goal of IR: to provide users with those documents which satisfy their information needs.

Objective of the project • To classify the documents in the corpus into various classes. A particular document is assigned to a class if there is relevance between the query and the document. • To provide a comparative study between two classifiers, namely the centroid-based classifier and the k-nearest neighbour classifier.

 • Definition of Information Retrieval (IR): IR is finding material of an unstructured nature that satisfies an information need from within large collections [28]. • Fields of Information Retrieval: there are two categories, general applications of IR and domain-specific applications. • IR process: the IR process is a six-step process, as shown in the next slide.

Problem recognition and acceptance → Query formulation → Query execution → Examination of the result → Information retrieval. Fig.: Schematic representation of Information Retrieval

 • Machine learning

 • Types of Text Classification: • Supervised learning: the training data is labeled with the correct answers, e.g. “spam”. • Unsupervised document classification (document clustering): the documents are grouped without labels or other external information. • Definition of Text Classification: Let C = {c1, c2, ..., cm} be a set of categories and D = {d1, d2, ..., dn} a set of documents. The task of text classification consists in assigning to each pair (ci, dj) of C × D (with 1 ≤ i ≤ m and 1 ≤ j ≤ n) a value of 0 or 1, i.e. the value 1 if the document dj belongs to ci and the value 0 if it does not. This mapping is done with the help of a decision matrix [17].
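The 0/1 assignment above can be pictured as a tiny decision matrix. The following sketch is purely illustrative; the categories, documents, and labeling rule are hypothetical stand-ins, not from the project:

```python
# Hypothetical toy decision matrix: entry [i][j] is 1 if document d_j
# belongs to category c_i, and 0 otherwise.
categories = ["spam", "not-spam"]                                      # C
documents = ["win money now", "meeting at noon", "cheap money offer"]  # D

def label(category, document):
    # Stand-in rule for a real classifier: "spam" iff the word "money" occurs.
    is_spam = "money" in document.split()
    return 1 if (category == "spam") == is_spam else 0

decision_matrix = [[label(c, d) for d in documents] for c in categories]
print(decision_matrix)  # [[1, 0, 1], [0, 1, 0]]
```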

LITERATURE REVIEW

 • Phases of IR development: there are several phases in the development of IR: • 1st phase (1950s-1960s): the research phase. • 2nd phase (1970s): IR struggled for adoption. • 3rd phase (1980s-1990s): IR reached the acceptance phase in terms of free-text search systems. • Nowadays the influence of IR is such that it is moving towards projects in sound and image retrieval, along with electronic provision [26]. • Definition of TC: H. P. Luhn gave a definition for TC in 1958, which marked the start of the text classification era [32]: “…utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the ‘action points’ in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points.”

DOCUMENT REPRESENTATION

 • Need for document representation: the task of information retrieval is to extract relevant documents from a large collection in response to user queries. The documents contain primarily unrestricted text. Document representation basically involves generating a representation R of a document such that for any two similar text items D1 and D2, R(D1) ≈ R(D2), where R is a function known as the relevance of the document, obtained by matching the keywords in the query with the document set. In order to reduce the complexity of the documents and make them easier to handle, we transform each document from its full-text version to a document vector which describes its contents. The terms that occur in a document are the parameters of the document representation; the type of parameters determines the type of the document representation.

 • Different types: • Binary document representation • Term frequency representation (frequency vector) • Probabilistic representation
 • Example documents:
   Document | Content | No. of unique words
   D0 | Gold silver truck | 3
   D1 | Shipment of gold damaged in a fire | 4
   D2 | Delivery of silver arrived in a silver truck | 4
   D3 | Shipment of gold arrived in a truck | 4

 • Binary document representation: here the term “binary” is equivalent to Boolean; documents and queries are both represented as binary term-incidence vectors. That is, a document d is represented by the vector x = (x1, …, xM), where xt = 1 if term t is present in document d and xt = 0 if t is not present in d [22]. • Representation of the example documents:
   Doc id | Arrived | Damaged | Delivery | Fire | Gold | Shipment | Silver | Truck
   D0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1
   D1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0
   D2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1
   D3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1
 • Drawback: it records only the presence or absence of a term, ignoring how often the term occurs in the document.
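The binary vectors above can be reproduced with a short sketch. Stopwords (“of”, “in”, “a”) are dropped so the vocabulary matches the slide's eight terms; this stopword list is an assumption inferred from the table, not stated in the slides:

```python
docs = {
    "D0": "gold silver truck",
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
stopwords = {"of", "in", "a"}  # assumed: the slide's vocabulary omits these

# Vocabulary = all unique non-stopwords, in alphabetical order.
vocab = sorted({w for text in docs.values() for w in text.split()} - stopwords)

def binary_vector(text):
    words = set(text.split())
    return [1 if term in words else 0 for term in vocab]

print(vocab)
print(binary_vector(docs["D2"]))  # [1, 0, 1, 0, 0, 0, 1, 1]
```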

 • Term frequency representation (frequency vector): here every component of the vector is a weight that depends on the number of occurrences of the term in the document. • Representation of the example documents:
   Doc id | Arrived | Damaged | Delivery | Fire | Gold | Shipment | Silver | Truck
   D0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1
   D1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0
   D2 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 1
   D3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1
 • Drawback: this approach does not weight a term in a document with respect to the other documents in the dataset.
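The frequency vectors can be obtained the same way, counting occurrences instead of testing presence (same assumed stopword list as in the binary sketch):

```python
from collections import Counter

docs = {
    "D0": "gold silver truck",
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
stopwords = {"of", "in", "a"}  # assumed stopword list
vocab = sorted({w for text in docs.values() for w in text.split()} - stopwords)

def tf_vector(text):
    counts = Counter(w for w in text.split() if w not in stopwords)
    return [counts[term] for term in vocab]

# "silver" occurs twice in D2, so its component is 2:
print(tf_vector(docs["D2"]))  # [1, 0, 1, 0, 0, 0, 2, 1]
```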

 • Probabilistic representation: in this scheme every component of the vector denotes the probability of occurrence of the corresponding term within the document. The probability of a particular term is found as follows: Probability = (number of occurrences of term t in document d) / (total number of terms in document d). • Representation of the example documents:
   Doc id | Arrived | Damaged | Delivery | Fire | Gold | Shipment | Silver | Truck
   D0 | 0 | 0 | 0 | 0 | 1/3 | 0 | 1/3 | 1/3
   D1 | 0 | 1/4 | 0 | 1/4 | 1/4 | 1/4 | 0 | 0
   D2 | 1/4 | 0 | 1/4 | 0 | 0 | 0 | 2/4 | 1/4
   D3 | 1/4 | 0 | 0 | 0 | 1/4 | 1/4 | 0 | 1/4
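The formula above divides each term count by the number of terms in the document. A minimal sketch, using D1 as the example since its four content words each occur exactly once (stopword list again assumed):

```python
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
}
stopwords = {"of", "in", "a"}  # assumed stopword list
vocab = ["arrived", "damaged", "delivery", "fire", "gold", "shipment", "silver", "truck"]

def prob_vector(text):
    # Keep only content words; each component is count / total content words.
    words = [w for w in text.split() if w not in stopwords]
    counts = Counter(words)
    return [counts[term] / len(words) for term in vocab]

print(prob_vector(docs["D1"]))  # [0.0, 0.25, 0.0, 0.25, 0.25, 0.25, 0.0, 0.0]
```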

 • tf-idf (term frequency – inverse document frequency) representation: the main idea behind tf-idf is that a term occurring infrequently across the collection should be given a higher weight than a term that occurs frequently. • Important definitions in the tf-idf context: t = number of distinct terms in the document collection; tfij = number of occurrences of term tj in document Di (also referred to as the term frequency); dfj = number of documents which contain tj; idfj = log(d/dfj), where d is the total number of documents (the inverse document frequency).

 • Weighting factor of each term: the weighting factor for each term is computed by taking the product of the term frequency and the inverse document frequency: dij = tfij * idfj. • tf-idf assigns to term t a weight in document d that is 1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents); 2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal); 3. lowest when the term occurs in virtually all documents. The values thus computed are then filled into the document vectors.

 • Representation of the example documents (log base 10, d = 4):
   Doc id | Arrived | Damaged | Delivery | Fire | Gold | Shipment | Silver | Truck
   D0 | 0 | 0 | 0 | 0 | 0.12 | 0 | 0.3 | 0.12
   D1 | 0 | 0.6 | 0 | 0.6 | 0.12 | 0.3 | 0 | 0
   D2 | 0.3 | 0 | 0.6 | 0 | 0 | 0 | 0.6 | 0.12
   D3 | 0.3 | 0 | 0 | 0 | 0.12 | 0.3 | 0 | 0.12
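The D0 row can be reproduced with idfj = log(d/dfj) over the four example documents. A base-10 logarithm is assumed, since it yields the slide's 0.12 and 0.3 values; the stopword list is the same assumption as before:

```python
import math
from collections import Counter

docs = {
    "D0": "gold silver truck",
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
stopwords = {"of", "in", "a"}  # assumed stopword list
vocab = sorted({w for text in docs.values() for w in text.split()} - stopwords)
d = len(docs)  # total number of documents

def df(term):
    # document frequency: number of documents containing the term
    return sum(1 for text in docs.values() if term in text.split())

def tfidf_vector(text):
    # weight = tf * idf, with idf = log10(d / df); rounded to 2 decimals
    counts = Counter(w for w in text.split() if w not in stopwords)
    return [round(counts[t] * math.log10(d / df(t)), 2) for t in vocab]

print(tfidf_vector(docs["D0"]))  # [0.0, 0.0, 0.0, 0.0, 0.12, 0.0, 0.3, 0.12]
```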

Text Classifiers

 • Refined definition of Text Classification (TC): TC is defined as the task of approximating the unknown target function Φ: D × C → {T, F}, where Φ is called the classifier [29], C = {c1, ..., c|C|} is a predefined set of categories, and D is a (possibly infinite) set of documents. If Φ(dj, ci) = T, then dj is called a positive example (or a member) of ci; if Φ(dj, ci) = F, then dj is called a negative example (or not a member) of ci. In order to build the classifier we need a set of documents Ω such that the value of Φ(dj, ci) is known for every (dj, ci) ∈ Ω × C. Usually Ω is partitioned into three disjoint sets: Tr (the training set), Va (the validation set), and Te (the test set) [31].

 • Training set: the set of documents observing which the learner builds the classifier. • Validation set: the set of documents on which the engineer fine-tunes the classifier, e.g. choosing, for a parameter p on which the classifier depends, the value that yields the best effectiveness when evaluated on Va. • Test set: the set on which the effectiveness of the classifier is finally evaluated. “Evaluating the effectiveness” means running the classifier on a set of pre-classified documents (Va or Te) and checking the degree of correspondence between the output of the classifier and the pre-assigned classes.

 • Types of classifiers. The following are some of the classifiers [37]: • Naïve Bayes classifier • kNN classifier • Linear classifiers • C4.5 • Support Vector Machines, etc. In this project we mainly concentrate on two classifiers: • Centroid classifier • kNN classifier

 • CENTROID CLASSIFIER • This type of classifier computes a centroid vector for every pre-defined class using all the training documents belonging to that class. • Next, the test document (which must be classified) is compared with all these centroid vectors to compute similarity coefficients. • Finally, the class whose centroid most nearly matches the test document is chosen (i.e. the class whose similarity coefficient score is the highest).

 • Pseudo code of the Centroid Classifier. Step 1) The input documents (under pre-defined categories) are split into a training set and a testing set. Step 2) Scan through the entire training set to identify all the unique words across the entire collection. The total count of unique words decides the length of the document vector. Step 3) For each of the unique terms (as identified in step 2), compute the document frequency (i.e. the total number of documents in which a particular unique term occurs). Step 4) Represent every input training document as a vector. (Here we shall assume that we are using tf-idf weights to represent the input documents; any of the representation schemes explained earlier can also be used.)

Thus a document vector is represented as dtf = (tf1 log(N/df1), tf2 log(N/df2), tf3 log(N/df3), …, tfm log(N/dfm)). Step 5) For every pre-defined class compute a centroid vector, the average of that class's training vectors: C = (1/|S|) Σd∈S d, where S is the training set of the category/class for which the centroid vector is being computed. The m centroid vectors are denoted C1, …, Cm.

Step 6) For every test document d: 1) Use the document frequencies of the various terms computed from the training set to compute the tf-idf representation of d. 2) Compute the similarity coefficient between d and each of the centroid vectors using the normalised cosine measure, cos(d, Ci) = (d · Ci) / (||d|| ||Ci||), where Ci is the centroid vector of a class. 3) Based on the similarity coefficient scores, assign document d to the class with which the score is the highest, i.e. class(d) = arg maxi cos(d, Ci). Thus, using the formulas discussed above, the classification of the document can be done.
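Steps 5 and 6 can be sketched as follows. The two-dimensional toy vectors and class names are hypothetical; the centroid and cosine formulas follow the slides:

```python
import math

def cosine(u, v):
    # normalised cosine measure between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    # Step 5: component-wise average of the class's training vectors
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(test_vec, centroids):
    # Step 6: pick the class whose centroid has the highest cosine score
    return max(centroids, key=lambda c: cosine(test_vec, centroids[c]))

# Hypothetical 2-term training vectors for two classes:
centroids = {
    "sport": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "news": centroid([[0.0, 1.0], [0.1, 0.9]]),
}
print(classify([0.9, 0.1], centroids))  # sport
```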

 • K Nearest Neighbor Classifier • It is an instance-based learning algorithm which has been applied to text categorization. • This classifier first computes the k nearest neighbors of a test document. The similarities of the test document to these k nearest neighbors are then aggregated according to the class of the neighbors, and the test document is assigned to the most similar class (as measured by aggregate similarity) [37]. • Drawbacks: • Each test document must be compared with all the training documents in order to decide its class, which requires a huge amount of computation. • It uses all the features equally in computing similarities, which may lead to poor similarity measures and hence to classification errors.

 • Pseudo code of the KNN Classifier. Step 1) The input documents (under pre-defined categories) are split into a training set and a testing set. Step 2) Scan through the entire training set to identify all the unique words across the entire collection. The total count of unique words decides the length of the document vector. Step 3) Fix a value for k. This value determines the number of nearest neighbors which will be considered during document classification. Step 4) For every test document, compute the similarity coefficient with each of the training documents and record the similarity scores in a hash table. Step 5) Select the top k scores from the hash table.

Step 6) Compute the aggregate score for each class. If several of the k nearest neighbors share a class, the per-neighbor weights of that class are added together and the resulting weighted sum is used as the likelihood score of that class. Sort the scores of the candidate classes and generate a ranked list. In the decision rule, d is the test document being classified, KNN(d) denotes the set of k nearest neighbors of document d, and (dj, ci) represents the classification of document dj with respect to class ci. Step 7) Test document d is assigned to the class that has the highest weighted aggregate score.
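Steps 4 to 7 can be sketched as follows. The training pairs are hypothetical, and cosine similarity stands in for the similarity coefficient of Step 4:

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(test_vec, training, k):
    # Steps 4-5: score against every training document, keep the top k.
    scored = sorted(((cosine(test_vec, v), c) for v, c in training), reverse=True)
    # Steps 6-7: sum the per-neighbor scores per class, pick the highest.
    agg = defaultdict(float)
    for score, cls in scored[:k]:
        agg[cls] += score
    return max(agg, key=agg.get)

# Hypothetical 2-term training vectors with their classes:
training = [([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.0, 1.0], "B")]
print(knn_classify([1.0, 0.05], training, k=2))  # A
```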

Implementation Aspects

 • PERL • PDL, the “Perl Data Language” • PDL is an object-oriented extension to Perl that is designed for scientific and bulk numeric data processing and display. It is a very powerful and at the same time fast array-oriented language. • The PDL concept gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data sets which are essential for scientific computing. • PDL uses Perl ‘objects’ to hold piddle data. An ‘object’ is like a user-defined data type and is a very powerful feature of Perl; PDL creates its own class of ‘PDL’ objects to store piddles.

 • PDL over plain Perl variables • It is impossible to manipulate Perl arrays arithmetically as we would like, e.g. @y = @x * 2. • Perl lists are intrinsically one-dimensional; we can have ‘lists of lists’, but this is not the same thing as a pdl. • Perl lists do not support the range of datatypes that piddles do (byte arrays, integer arrays, single precision, double precision, etc.). • Perl lists consume a lot of memory: at least 20 bytes per number, of which only a few bytes are used for storing the actual value. This is because Perl lists are flexible and can contain text strings as well as numbers. • Perl lists are scattered about memory: the list data structure means consecutive numbers are not stored in a neat block of consecutive memory addresses, as they are in C and other programming languages.

 • Advantages of using the Perl Data Language • Both Perl and PDL are easily available, free of cost, under the open-source license. • Since PDL is an extension of Perl, a Perl programmer has all the powerful features of Perl at hand; even in mainly numerically oriented programming, it is often extremely handy to have access to non-numeric functionality. • Being a Perl package makes PDL extensible and interoperable. • The syntax associated with PDL is very simple, making it a user-friendly package.

 • Usage of pdl in our project: • Binary representation (pdl elements indicate the presence or absence of a word) • Term frequency representation (pdl elements indicate the frequency, i.e. the number of times a word occurs in a file) • Probabilistic representation (pdl elements indicate the probability of occurrence of a word) • Tf-idf representation (pdl elements indicate the product of term frequency and inverse document frequency)

 • Organization of our code • Classifiers: KNN and Centroid, each run over the four representations (binary, term frequency, probabilistic, tf-idf) • Textfiles (contains all the training and testing documents) • Freq (contains the files representing training and testing documents, indicating the frequency of each word in a file) • String (contains all the scripts and the results of classification) • Actuals (contains predefined files which indicate the class to which each file belongs)

 • Scripts of our project. There are 6 different scripts: 1) init.pl 2) main.pl 3) script1.pl 4) script2.pl 5) script3.pl 6) script4.pl. 1) init.pl: This script makes all the necessary folders available for the smooth functioning of the code. It deletes the selected folders (for example, freq, source code/results, etc., which hold all the necessary data) and recreates them.

2) main.pl: This script runs the other scripts sequentially. 3) script1.pl: This script removes unwanted characters from the source file. Note: we are not actually modifying the original source file. 4) script2.pl: This script computes the frequency of each of the unique terms listed in uniquefile.txt.

5) script3.pl: This script mainly performs the task of document classification. 6) script4.pl: The main intention of this script is to generate input for an HTML browser, so as to display the results to the user.

Results & Analysis

The following are the elements of our project: 1) Pre-defined classes: 7 2) Training documents: 651 3) Testing documents: 47. The 7 pre-defined classes:
   S. no. | Class name | No. of documents
   1 | Cricket | 101
   2 | Formula-1 | 90
   3 | Hockey | 109
   4 | Ice-Hockey | 109
   5 | Movies | 122
   6 | Politics | 20
   7 | Religion | 100

 • Results for the Centroid Classifier:
   Representation | Binary | Term frequency | Probabilistic | Tf-idf
   Properly classified | 36 | 38 | 43 | 33
   Misclassified | 11 | 9 | 4 | 14
   Level of classification accuracy | 0.76 | 0.80 | 0.91 | 0.70

 • Level of Accuracy Achieved

 • Results for the KNN Classifier. The following table shows the result of applying the KNN classifier on the document vectors when the value k = 2 is given by the user:
   Representation | Binary | Term frequency | Probabilistic | Tf-idf
   Properly classified | 39 | 41 | 42 | 37
   Misclassified | 8 | 6 | 5 | 10
   Level of classification accuracy | 0.83 | 0.87 | 0.89 | 0.78

 • Level of Accuracy Achieved

The following table shows the result of applying the KNN classifier on the document vectors when the value k = 20 is given by the user:
   Representation | Binary | Term frequency | Probabilistic | Tf-idf
   Properly classified | 44 | 43 | 42 | 37
   Misclassified | 3 | 4 | 5 | 10
   Level of classification accuracy | 0.94 | 0.91 | 0.89 | 0.78

 • Level of Accuracy Achieved

 • Comparison of the Centroid Classifier and the KNN Classifier

Conclusion

About the KNN Classifier: we find that the KNN classifier provides the best results in terms of classification accuracy. Drawbacks: 1) We cannot decide on the ideal value of k. 2) It requires a huge amount of computational resources. 3) It is impractical in the case of very large document collections.

About the Centroid Classifier: it achieves classification accuracy very near to that of KNN. Advantages of the Centroid Classifier over KNN: 1) It does not require a huge amount of computation. 2) It is very quick to decide the results of classification. 3) It is ideally suited to very large input document collections.

Thus we conclude that the Centroid Classifier is better than the KNN Classifier.

Future Enhancements • To increase the number of classes • To build a suitable front end • To integrate the classifiers built into a search engine to provide classification of websites • To enhance the centroid classifier by implementing a weighted centroid classifier • To incorporate a stemming algorithm, e.g. the Porter stemmer • To upgrade the implementation to incorporate standard data collections such as Reuters-21578, TREC-5, TREC-6, the OHSUMED collection, and the 20 Newsgroups data set

References
[1] Ricardo Baeza-Yates, Berthier Ribeiro-Neto, “Modern Information Retrieval”, Addison-Wesley Longman Publishing Co., 1999.
[2] Spoerri, A., “Information Processing & Management”, Proceedings of the IEEE First International Conference on Computer Vision, Volume 43, pp. 1044-1058, 2007.
[3] Forrester, “Coping with complex data”, The Forrester Report, pp. 2-4, April 1995.
[4] W. Bruce Croft, “Intelligent Information Retrieval”, Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, D-Lib Magazine, November 1995.
[5] Simon Colton, “AI Bite”, The Society for the Study of Artificial Intelligence and Simulation of Behaviour, pp. 66-67.

Thank you. . .

Any Questions?