ea4226d4ba594d7d1ebe41ee2a595228.ppt
- Number of slides: 13
Combining Unsupervised Feature Selection Strategy for Automatic Text Categorization
Ping-Tsun Chang, Intelligent Systems Laboratory, Computer Science and Information Engineering, National Taiwan University
Introduction
• In recent research
  – The limits of statistical and computational approaches to natural language understanding
  – The development of machine learning techniques has almost reached its limit
  – Natural language is infinite and nonlinear!
• Unsupervised Feature Selection
Background Knowledge: Text Categorization
• Problem definition: text categorization is the problem of assigning labels to a large number of unlabeled documents, based on a large amount of text data.
[Figure: processing diagram with boxes labeled Sensing, Classification, Segmentation, Post-Processing, Feature Extraction, Decision]
Background Knowledge: Machine Learning
• Using computers to help us perform induction over large amounts of complex pattern data
• Bayesian Learning
• Instance-Based Learning
  – K-Nearest Neighbors
• Neural Networks
• Support Vector Machine
Background Knowledge: Feature Selection
• Information Gain
• Chi-Square
• Mutual Information
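Note (illustrative sketch, not from the slides): the three criteria listed above are commonly computed from a 2×2 term/category contingency table. The function names and the cell names a, b, c, d are assumptions made for this example: a = documents in the category that contain the term, b = documents outside the category that contain it, c = documents in the category without it, d = documents outside the category without it. The mutual information shown is the pointwise form often used in text categorization work; the slide does not say which variant the author used.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category (2x2 table)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means a row or column is empty: no evidence either way.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """Pointwise mutual information (natural log) between term and category."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # the term never co-occurs with the category
    return math.log(a * n / ((a + c) * (a + b)))


def _entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(a, b, c, d):
    """Reduction in category entropy (bits) from knowing whether the term occurs."""
    n = a + b + c + d
    h_category = _entropy([(a + c) / n, (b + d) / n])
    h_conditional = 0.0
    # First pair: documents with the term; second pair: documents without it.
    for in_cat, out_cat in ((a, b), (c, d)):
        total = in_cat + out_cat
        if total:
            h_conditional += total / n * _entropy([in_cat / total, out_cat / total])
    return h_category - h_conditional
```

A term that is independent of the category scores zero on all three measures, while a term that occurs in exactly the documents of one category gets the highest chi-square and information gain for that table.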
Bayesian Classifier
• Recent research
  – Naïve Bayes classifiers are competitive with other techniques in accuracy
  – Fast: a single pass over the data, and new documents are classified quickly
  – ATHENA: EDBT 2000
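Note (illustrative sketch, not from the slides): the "single pass" property can be seen in a minimal multinomial Naïve Bayes classifier, where training only accumulates counts. The class name, method names, and the use of add-one (Laplace) smoothing are assumptions made for this example; the slides do not describe the author's implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.term_counts = defaultdict(Counter)   # term frequencies per category
        self.vocab = set()                        # all terms seen in training

    def train(self, labeled_docs):
        # One pass over the training data: only counts are updated.
        for tokens, label in labeled_docs:
            self.doc_counts[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)

    def classify(self, tokens):
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.doc_counts.items():
            total_terms = sum(self.term_counts[label].values())
            # log P(category) + sum of log P(term | category)
            score = math.log(count / n_docs)
            for term in tokens:
                smoothed = self.term_counts[label][term] + 1
                score += math.log(smoothed / (total_terms + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Classifying a new document costs one lookup per token per category, which is why the method scales well to large document collections.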
Machine Learning Approaches: k-NN Classifier
[Figure: diagram in which the document to be classified is marked "d?"]
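Note (illustrative sketch, not from the slides): a k-NN text classifier assigns the unknown document d the category favored by its k most similar training documents. The function names, the cosine similarity over raw term frequencies, and the similarity-weighted vote are assumptions made for this example; the slide only shows a diagram.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(tokens, training, k=3):
    """Return the category with the largest similarity-weighted vote.

    `training` is a non-empty list of (token_list, category) pairs.
    """
    query = Counter(tokens)
    scored = sorted(
        ((cosine(query, Counter(doc)), label) for doc, label in training),
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity  # closer neighbors count for more
    return votes.most_common(1)[0][0]
```

The summed vote for the winning category can also serve as a rough confidence score, which is the kind of quantity a certainty rule could be applied to.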
Machine Learning Approaches: Support Vector Machine
• Basic hypotheses: consistent hypotheses of the version space
• Project the original training data in space X to a higher-dimensional feature space F via a Mercer kernel K
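Note (illustrative sketch, not from the slides): a Mercer kernel K computes an inner product in the feature space F without building the projected vectors. The slide does not say which kernel is used; as an assumption, this example takes the degree-2 polynomial kernel K(x, z) = (x · z)² on 2-D inputs, whose explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²). All function names are hypothetical.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, evaluated in the input space X."""
    return dot(x, z) ** 2


def feature_map(x):
    """Explicit projection of a 2-D point into the 3-D feature space F."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def kernel_matches_feature_space(x, z, tol=1e-9):
    """Check that K(x, z) equals the inner product of the projected points."""
    return abs(poly_kernel(x, z) - dot(feature_map(x), feature_map(z))) < tol
```

For x = (1, 2) and z = (3, 4), both routes give 121: (1·3 + 2·4)² on one side, and 9 + 48 + 64 on the other. The kernel route never materializes F, which matters when F is very high-dimensional.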
What is Certainty?
• Rule for k-NN
• Rule for SVM
Algorithm for Two-Stage Automatic Text Categorization
ALGORITHM Two-Stage-Text-Categorization (input: document d) returns category C
  Static:
    Trained classifier: Traditional-Classifier
    The feature set: F
    The new feature set from user feedback: Ui for the related category Ci
  For new document d
    C ← Traditional-Classifier(d)
    If d does NOT satisfy the rule of uncertainty
      Return C
    Else
      For all categories Ci
        If d has a feature in F
          C ← Ci
          Return C
        End If
      Cj ← User-Input
      Uj ← Uj + User-Selected
      C ← Cj
    End If
    Return C
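Note (illustrative sketch, not from the slides): the two-stage procedure above can be written as a short Python function. Every name here is hypothetical: `base_classifier` stands in for Traditional-Classifier and is assumed to return a (category, score) pair, `is_uncertain` stands in for the rule of uncertainty (which the slides do not spell out), and `ask_user` stands in for User-Input and User-Selected. The slide writes "feature in F" inside the per-category loop; this sketch assumes the check is against the per-category feedback set Ui, which is one reading and not a confirmed detail.

```python
def two_stage_categorize(doc_tokens, base_classifier, is_uncertain,
                         feedback_features, ask_user):
    """Classify a document, falling back to user feedback when uncertain.

    feedback_features maps each category Ci to its set Ui of user-selected
    features; it is updated in place when the user is consulted.
    """
    # Stage 1: trust the trained classifier when it is confident.
    category, score = base_classifier(doc_tokens)
    if not is_uncertain(score):
        return category

    # Stage 2a: reuse features that users selected for earlier documents.
    tokens = set(doc_tokens)
    for candidate, features in feedback_features.items():
        if tokens & features:
            return candidate

    # Stage 2b: ask the user, then remember the features they picked.
    chosen_category, chosen_features = ask_user(doc_tokens)
    feedback_features.setdefault(chosen_category, set()).update(chosen_features)
    return chosen_category
```

Because the feedback sets grow with each consultation, later uncertain documents that share a user-selected feature are resolved without asking again.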
Determining the Threshold of the Rule
Experiments