Weka: An open-source tool for data analysis and mining with machine learning
Quantitative Data Analysis Colloquium
Centenary College of Louisiana
Mark Goadrich
4/17/2008

Regression lines and correlation
• Find the relationship between two attributes
• Correlation coefficient

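As a minimal sketch of the two ideas on this slide, the snippet below computes a Pearson correlation coefficient and a least-squares regression line for two attributes. The function names and the toy numbers are assumptions made for illustration; they are not taken from Weka or from the talk.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def regression_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy attributes (invented): ys is exactly twice xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

r = pearson_r(xs, ys)
slope, intercept = regression_line(xs, ys)
```

A coefficient near 1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship.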
Categorization
• Can we learn one category based on the others?
• This search for classification lines is called machine learning

Data Sets
• House of Representative Votes
• Labor Relations
• Iris (plant) Discrimination
• Breast Cancer
• Many more at http://archive.ics.uci.edu/ml/
• Table of Features
  – Example is a row
  – Features are discrete or continuous

Weka Time - Explore
• http://www.cs.waikato.ac.nz/ml/weka/
• Open Explorer
• Open Data File
  – ARFF or CSV
• Visualize All
• Visualize Crosstabs

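To make the ARFF option concrete, here is a rough sketch of what such a file looks like and how its header maps onto a table of features. The parser handles only a tiny subset of the format (no quoting, no missing values, no sparse rows) and is not Weka's loader; the relation, attribute names, and data rows are invented for illustration.

```python
SAMPLE_ARFF = """\
% toy data, invented for illustration
@relation toy_plants
@attribute petal_length numeric
@attribute color {red, blue}
@attribute species {a, b}
@data
1.4, red, a
4.7, blue, b
"""

NUMERIC_TYPES = ("numeric", "real", "integer")


def parse_arff(text):
    """Parse a small ARFF subset into (relation, attributes, rows)."""
    relation = None
    attributes = []  # (name, kind); kind is a type name or a list of nominal values
    rows = []        # one dict per example
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                # Continuous features become floats, discrete ones stay strings.
                row[name] = float(value) if kind in NUMERIC_TYPES else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
```

Each entry in `rows` is one example, and each attribute is either continuous (`numeric`) or discrete (a listed set of values), matching the table-of-features view of a data set.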
Discrete: Decision Trees
• Reduce confusion (entropy) in the data by drawing recursive lines
• Result is comprehensible to humans

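The entropy reduction mentioned above can be sketched as follows: measure the entropy of the class labels, then measure how much a split on one feature lowers it (information gain). This is a plain information-gain illustration, not Weka's J48 code (J48 implements C4.5, which uses a gain ratio variant); the feature names and rows are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy examples (invented): "shape" predicts the label perfectly, "size" not at all.
rows = [
    {"shape": "round", "size": "big"},
    {"shape": "round", "size": "small"},
    {"shape": "square", "size": "big"},
    {"shape": "square", "size": "small"},
]
labels = ["yes", "yes", "no", "no"]

gain_shape = information_gain(rows, labels, "shape")
gain_size = information_gain(rows, labels, "size")
```

A tree learner would pick the feature with the larger gain (`shape` here), split the data on it, and repeat the same calculation within each branch.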
Continuous: ANN and SVM
• Artificial Neural Networks simulate activating and thresholding neurons
• Support Vector Machines use a kernel to transform data to higher dimensions

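Both bullets can be illustrated with one small sketch. A single thresholded neuron sums its weighted inputs and fires if the total is positive; lifting the data into a higher dimension can make classes separable by such a neuron when they were not before. The weights, the feature map `lift`, and the data points are hand-picked assumptions for illustration. Real SVMs use the kernel to work in the higher-dimensional space implicitly and learn the separator from data, which this sketch does not do.

```python
def threshold_neuron(inputs, weights, bias):
    """Fire (return 1) when the weighted sum of inputs plus bias is positive."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


# A single neuron computing logical AND with hand-chosen weights.
and_outputs = [
    threshold_neuron((a, b), (1.0, 1.0), -1.5)
    for a in (0, 1)
    for b in (0, 1)
]


def lift(x):
    """Explicit feature map from one dimension to two: x -> (x, x^2)."""
    return (x, x * x)


# In one dimension the outer points (class 1) surround the inner points
# (class 0), so no single threshold on x separates them.
points = [-2.0, -0.5, 0.5, 2.0]
labels = [1, 0, 0, 1]

# After lifting, the line x^2 = 1 separates the classes.
lifted_predictions = [
    threshold_neuron(lift(x), (0.0, 1.0), -1.0) for x in points
]
```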
Weka Time - Classify
• Choose Algorithm
  – J48, MultilayerPerceptron, SMO
• Validate Learning
  – Training set
  – Cross validation
• Visualize output
  – ROC Curves
  – Precision-Recall Curves

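The validation step can be sketched in a few lines: cross validation holds out each fold in turn for testing while training on the rest, and precision and recall summarize how well the predictions match the true labels for one class. The function names and the toy labels are assumptions for illustration, and this sketch does no shuffling or stratification, which a tool such as Weka may handle differently.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_examples) if i not in held]
        yield train, held_out


def precision_recall(actual, predicted, positive):
    """Precision and recall for the class named by positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if p == positive and a == positive)
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    fn = sum(1 for a, p in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Six examples split into three folds; every example is tested exactly once.
splits = list(k_fold_indices(6, 3))

# Toy predictions (invented) scored against the true labels.
actual = ["yes", "yes", "no", "no"]
predicted = ["yes", "no", "yes", "no"]
precision, recall = precision_recall(actual, predicted, "yes")
```

Evaluating on the training set alone tends to overstate accuracy, which is why the held-out folds matter. A precision-recall curve plots pairs like `(recall, precision)` as the classifier's decision threshold varies.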
Future Topics
• Clustering
  – Number and makeup of categories unknown
• Relational Data
  – Features are related within examples
  – Features are related across examples