
Rough Set based Decision Tree for Identifying Vulnerable and Food Insecure Households
Rajni Jain¹, S. Minz² and P. Adhiguru¹
¹ Sr. Scientist, NCAP, Pusa, New Delhi; ² Associate Professor, Jawaharlal Nehru University
Outline
- Problem
- Knowledge Discovery Process
- Data Mining
- Classification Task of Data Mining
- Methodology: RDT
- Dataset for this Study
- Classifier Model Evaluation
Problem of Food Security
- Most often, available funds are scarce.
- The food security program needs to be targeted at the most vulnerable group.
- Exhaustive surveys exclusively for this purpose would be very costly and time consuming.
- Need to learn simple concepts to facilitate identification of target beneficiaries on the basis of morphological characteristics.
Knowledge Discovery in Datasets
Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns -> Interpretation -> Knowledge
- Selection: defines the KDD problem by focusing on a subset of data attributes or data samples on which KDD is to be performed.
- Preprocessing: care must be taken not to induce any unwanted bias; this step includes removing noise and handling missing data.
- Transformation: may combine attributes or discretize continuous attributes.
- Data Mining: many different learning and modeling algorithms are potential candidates for this step.
- Interpretation: turns the discovered patterns into knowledge.
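As a rough illustration of the first three stages only (a sketch: the file name, the column choice and the bin edges are assumptions, and the attribute names anticipate the dataset slides later in the deck):

```python
# Sketch of the selection, preprocessing and transformation stages on a
# hypothetical household survey file; file name, columns and bin edges are
# assumptions, not the actual study data.
import pandas as pd

raw = pd.read_csv("survey.csv")                  # all survey variables
target = raw[["HAGE", "FLSZ", "PCENER"]]         # Selection: focus on a subset of attributes
clean = target.dropna()                          # Preprocessing: handle missing values
transformed = clean.assign(                      # Transformation: discretize a continuous attribute
    HAGE=pd.cut(clean["HAGE"], bins=[0, 40, 51, 120],
                labels=["young", "middle", "old"]))
# Data mining and interpretation follow in the later slides
# (reduct computation and decision tree induction).
```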
Data Mining Tasks
- Classification
  - Decision Tree
  - Decision Rule
- Summarization
- Association rules
- Characteristic rules
Classification
- Step I: A classification algorithm is applied to the training data to produce a model (rules, tree, or formula).
- Step II: The predictive accuracy of the model is estimated; if it is acceptable, proceed.
- Step III: The classification rules are applied to new data to label the class.
Data
- Training data: the data used for developing the model
- Test data: the data used to estimate the evaluation parameters of the model
- New data: condition attributes are known but the decision attribute is not known
Basis of Classification Algorithms
- Rough Sets
- Decision Tree Learning
- Statistics
- Neural Networks
- Genetic Algorithms
None of these methods is suitable for all types of domains.
Methodology: Machine Learning
- Rough Sets
- Decision Tree induction
- Rough set based Decision Tree induction (RDT)
  - Two phases
  - RS for dominant attribute selection
  - J4.8 for decision tree induction
Rough Sets
- Introduced in the early 1980s by Prof. Z. Pawlak, a Polish mathematician
- Indiscernible: objects that look the same (e.g., patients, households)
- Indiscernibility relation

Example (flu patients):
Id  H  M  T   F
1   n  y  h   y
2   y  n  h   y
3   y  y  vh  y
4   n  y  n   n
5   y  n  h   n
6   n  y  vh  y
Indiscernibility Relation (contd.)
For the flu patients table above:
U/IND(H) = {{1, 4, 6}, {2, 3, 5}}
U/IND(F) = {{1, 2, 3, 6}, {4, 5}}
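A minimal sketch of how these indiscernibility classes could be computed; the records reproduce the flu table, while the function and variable names are just illustrative:

```python
# Minimal sketch: compute U/IND(B) for the flu table above by grouping objects
# that share identical values on every attribute in B.  Names are illustrative.
from collections import defaultdict

TABLE = {
    1: {"H": "n", "M": "y", "T": "h",  "F": "y"},
    2: {"H": "y", "M": "n", "T": "h",  "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n",  "F": "n"},
    5: {"H": "y", "M": "n", "T": "h",  "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}

def ind_classes(table, attrs):
    """Equivalence classes of the indiscernibility relation IND(attrs)."""
    groups = defaultdict(set)
    for obj_id, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(obj_id)
    return list(groups.values())

print(ind_classes(TABLE, ["H"]))   # [{1, 4, 6}, {2, 3, 5}]
print(ind_classes(TABLE, ["F"]))   # [{1, 2, 3, 6}, {4, 5}]
```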
Lower and Upper Approximation
[Figure: the large square is the universe; the small squares are the partitions induced by a set of attributes P, with all objects in a partition indiscernible; the oval is the concept X to be defined.]
- Lower approximation of X: the union of the partitions wholly contained in X; here it is {13, 14, 18, 19}.
- Upper approximation of X: the union of the partitions that overlap X; here it is {7, 8, 9, 12, 13, 14, 15, 17, 18, 19, 20, 22, 23, 24}.
- In the second, smaller grid, the lower and upper approximations coincide in {7}, so X is a crisp set there.
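A minimal sketch of the two approximations for the flu example; the blocks listed are what U/IND({H, M, T}) works out to for that table (objects 2 and 5 agree on every symptom), and the variable names are illustrative:

```python
# Minimal sketch of lower and upper approximations.  The blocks are
# U/IND({H, M, T}) for the flu table (objects 2 and 5 agree on every symptom);
# X is the concept "patients with flu" (F = y).
blocks = [{1}, {2, 5}, {3}, {4}, {6}]
X = {1, 2, 3, 6}

lower = set().union(*(b for b in blocks if b <= X))   # blocks wholly inside X
upper = set().union(*(b for b in blocks if b & X))    # blocks that overlap X

print(lower)   # {1, 3, 6}
print(upper)   # {1, 2, 3, 5, 6}; lower != upper, so X is rough w.r.t. {H, M, T}
```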
Important Terms
- Reduct (R): a minimal set of attributes that preserves the IND relation; a decision-relative reduct preserves it with respect to the decision attribute.
- Core (C): the intersection of all reducts.
- Johnson's method computes a single efficient reduct.
- A GA-based algorithm computes multiple reducts.
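Johnson's method is only named here; as a hedged sketch (not Rosetta's implementation), the usual greedy heuristic picks, at each step, the attribute that discerns the most remaining pairs of objects with different decisions:

```python
# Hedged sketch of a Johnson-style greedy heuristic for a decision-relative
# reduct, run on the flu table: collect, for every pair of objects with
# different decisions, the condition attributes that distinguish them, then
# repeatedly pick the attribute covering the most uncovered pairs.
from itertools import combinations

TABLE = {
    1: {"H": "n", "M": "y", "T": "h",  "F": "y"},
    2: {"H": "y", "M": "n", "T": "h",  "F": "y"},
    3: {"H": "y", "M": "y", "T": "vh", "F": "y"},
    4: {"H": "n", "M": "y", "T": "n",  "F": "n"},
    5: {"H": "y", "M": "n", "T": "h",  "F": "n"},
    6: {"H": "n", "M": "y", "T": "vh", "F": "y"},
}
CONDITIONS, DECISION = ["H", "M", "T"], "F"

def johnson_reduct(table, conditions, decision):
    entries = []                                   # discernibility-matrix entries
    for a, b in combinations(table, 2):
        if table[a][decision] != table[b][decision]:
            diff = {c for c in conditions if table[a][c] != table[b][c]}
            if diff:                               # skip inconsistent pairs
                entries.append(diff)
    reduct = set()
    while entries:                                 # greedy set cover over the entries
        best = max(conditions, key=lambda c: sum(c in e for e in entries))
        reduct.add(best)
        entries = [e for e in entries if best not in e]
    return reduct

print(johnson_reduct(TABLE, CONDITIONS, DECISION))   # e.g. {'H', 'T'}
```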
Architecture of the RDT Model
Data -> Reduct Computation Algorithm -> Reduct -> Remove attributes absent from the reduct -> Reduced Training Data -> ID3 Algorithm -> DT
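The two phases of the diagram could be wired together roughly as below. The sketch assumes the household data sits in a pandas DataFrame with a "Decision" column, takes a precomputed reduct (from Rosetta or a heuristic like the one above), and substitutes scikit-learn's entropy-based tree for ID3, which scikit-learn does not ship; the column names are placeholders.

```python
# Rough sketch of the two-phase pipeline above: keep only the attributes in a
# precomputed reduct, then induce a decision tree on the reduced data.  The
# DataFrame, the "Decision" column and the example reduct are assumptions, and
# scikit-learn's entropy-based tree stands in for ID3.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def build_rdt(data: pd.DataFrame, reduct: list, target: str = "Decision"):
    reduced = data[reduct + [target]]              # reduced training data
    X = pd.get_dummies(reduced[reduct])            # one-hot encode categorical attributes
    y = reduced[target]
    tree = DecisionTreeClassifier(criterion="entropy")  # information gain, as in ID3
    tree.fit(X, y)
    return tree, list(X.columns)

# Usage sketch (attribute names echo the later slides but are placeholders here):
# df = pd.read_csv("households.csv")
# tree, feature_names = build_rdt(df, reduct=["CHLD", "HAGE", "FLSIZE", "PEAR"])
```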
Decision Tree
[Figure: an example decision tree with root node CHLD (y/n), an internal node HAGE (young / middle / old / very old), an internal node LAND, and leaves labeled 0 or 1.]
Dataset
- Source: primary survey data of 180 rural households from three villages, collected as part of a project by Dr. P. Adhiguru at the National Centre for Agricultural Economics and Policy Research (NCAP), India.
- The villages cover three different production systems in Dharmapuri district of Tamil Nadu state.
- Actual food intake was measured by the 24-hour recall method; the corresponding nutrient intake was then worked out.
Attributes
- Attributes are the variables in the dataset that are used to describe the objects.
- An attribute is either qualitative or quantitative.
- In a classification problem, two types of attributes are considered:
  - Condition attributes: independent variables
  - Class or decision attribute: dependent variable
Food Groups and Nutrients
- Food Groups
  - Cereals and Millets
  - Pulses
  - Green leafy vegetables
  - Fruits
  - Milk
  - Fats and oils
  - Roots and Tubers
  - Sugar
- Nutrients
  - Protein
  - Energy
  - Calcium
  - Iron
  - Vitamin A
  - Vitamin C
Energy is used as a proxy for measuring food insecurity of the household.
Morphological Attributes
HouseHold_Id
1. Land: whether the household owns land
2. Hedu: highest education of the household head
3. Hage: age of the household head
4. Chld: whether there are children in the family
5. Flsz: number of members in the family
6. PrWm: proportion of women to family size
7. Hstd: whether the household owns a homestead garden
8. Pear: proportion of earning members to family size
PCENER: energy per capita per day in kcal
9. Decision: derived from PCENER
Average Calorie Intake
- In Tamil Nadu, average intake per consumer unit per day: 2347 kcal
- In Tamil Nadu, calorie intake of the lowest decile per consumption unit per day: 1551 kcal
- For all India, calorie intake of the lowest decile per consumption unit per day: 1954 kcal
- To identify the poorest of the poor, the lowest-decile average was used.
- If energy < 1500 kcal, the decision attribute is labeled 0 (poorest of the poor, vulnerable to food insecurity); otherwise it is labeled 1 (not vulnerable to food insecurity).
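The labeling rule translates directly into code; a small sketch using the 1500 kcal threshold stated above:

```python
# The decision attribute derived from per-capita energy intake (kcal/day), as
# stated on this slide: 0 = vulnerable to food insecurity, 1 = not vulnerable.
def decision_label(pcener: float) -> int:
    return 0 if pcener < 1500 else 1

assert decision_label(1400) == 0 and decision_label(2347) == 1
```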
Revisiting the Problem
- Most often, available funds are scarce.
- The food security program needs to be targeted at the most vulnerable group.
- Exhaustive surveys exclusively for this purpose would be very costly and time consuming.
- Need to learn simple concepts to facilitate identification of target beneficiaries on the basis of morphological characteristics.
Concepts to be Learned from the Rural Household Dataset
- Decision Tree: a hierarchical structure with a root node and sub-trees as children
- Rules: the tree may be mapped to rules by traversing the path from the root to each leaf (see the sketch below)
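A minimal sketch of that root-to-leaf mapping; the nested-dict tree representation and the example tree are illustrative, not the exact tree induced in the study:

```python
# Minimal sketch: emit one IF-THEN rule per root-to-leaf path of a tree given
# as nested dicts.  The representation and the example tree are illustrative.
def tree_to_rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf: emit one rule
        return [f"IF {' AND '.join(conditions)} THEN Decision = {node}"]
    (attribute, branches), = node.items()          # single test attribute per node
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, conditions + (f"{attribute} = {value}",))
    return rules

example_tree = {"CHLD": {"y": 0,
                         "n": {"HAGE": {"young": 1, "middle": 1, "old": 0}}}}
for rule in tree_to_rules(example_tree):
    print(rule)
# IF CHLD = y THEN Decision = 0
# IF CHLD = n AND HAGE = young THEN Decision = 1
# ...
```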
Software
- Rosetta for rough set analysis
- Weka for decision tree induction
- C++ programs for interfacing between the two tools
- Excel for evaluation of the classifiers
Description of Learning Algorithms
Algorithm | Description
RS    | Rough set with full discernibility, decision-relative reduct
CJU   | Continuous data, J4.8, unpruned DT
CJP   | Continuous data, J4.8, pruned DT
DID3  | RS-based discretization, no reduct, ID3
RDT   | RS-based discretization, global reduct, ID3
DJU   | Discretized using RS, J4.8, unpruned DT
DJP   | Discretized using RS, J4.8, pruned DT
RJU   | Discretized, global reducts, J4.8, unpruned DT
RJP   | Discretized, global reducts, J4.8, pruned DT
DRJU  | Discretized, dynamic reduct, J4.8, unpruned DT
DRJP  | Discretized using RS, dynamic reduct, J4.8, pruned DT
DT and corresponding rules
Evaluation
- Experiments using 10-fold cross-validation
- Accuracy on test data (A)
- Complexity (S)
- Number of rules (Nr)
- Number of attributes (Na)
- Cumulative Score (CS)
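The accuracy part of this protocol could be reproduced roughly as follows; the feature matrix, the labels and the use of scikit-learn's entropy-based tree in place of ID3/J4.8 are all assumptions of the sketch:

```python
# Hedged sketch of the 10-fold cross-validation accuracy estimate; the synthetic
# data only shows the call shape, and the entropy-based tree stands in for
# ID3/J4.8.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def cv_accuracy(X, y) -> float:
    scores = cross_val_score(DecisionTreeClassifier(criterion="entropy"),
                             X, y, cv=10, scoring="accuracy")
    return scores.mean()

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(180, 8))   # 180 households, 8 binary attributes
y_demo = rng.integers(0, 2, size=180)
print(f"10-fold CV accuracy: {cv_accuracy(X_demo, y_demo):.2f}")
```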
Evaluation of the Simplified DT
- Accuracy = 73%
- Complexity = 43
- Number of rules = 9
- Number of attributes = 4
Class labels: 0 = poorest and vulnerable to food insecurity; 1 = not vulnerable to food insecurity.
Comparing Algorithms using CS
Nutrition Dataset
DT (DRJP) - Nutrition Data
- Accuracy = 73%, Complexity = 43, Attributes = 4, Rules = 9
[Figure: the induced decision tree with root node CHLD (y/n), internal nodes HAGE (< 40, [41, 51), > 51), FLSIZE (< 4, 4, > 4) and PEAR (< 45, [45, 54), > 45), and leaves labeled 0 or 1.]
Benefits
- Cost effective
- Timely
- Simple to understand and implement
- No scope for personal bias
Constraints
- Development or model building requires expertise
- Lack of synergy among disciplines
- Requires an adequate sample of data
- Region specific
- Mindset towards conventional and traditional techniques
References
1. Adhiguru, P. and C. Ramasamy 2003. Agricultural-based Interventions for Sustainable Nutritional Security. Policy Paper 17. NCAP, New Delhi, India.
2. Han, J. and M. Kamber 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann.
3. Hand, D., Mannila, H. and P. Smyth 2001. Principles of Data Mining. PHI.
4. Minz, S. and R. Jain 2003. Rough Set based Decision Tree Model for Classification. In Proc. of the 5th Intl. Conference DaWaK 03, LNCS 2737.
5. Minz, S. and R. Jain 2005. Refining decision tree classifiers using rough set tools. International Journal of Hybrid Intelligent Systems, 2(2): 133-147.
6. Pawlak, Z. 2001. Drawing Conclusions from Data: The Rough Set Way. IJIS 16: 3-11.
7. Polkowski, L. and A. Skowron 2001. Rough Sets in Knowledge Discovery 1 and 2. Heidelberg, Germany: Physica-Verlag.
8. Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
9. Rosetta, rough set toolkit, http://www.idi.ntnu.no/~aleks/rosetta/.
10. Witten, I. H. and E. Frank 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
11. Wroblewski, J. 1998. Genetic algorithms in decomposition and classification problems. In: Polkowski, L. and Skowron, A. (eds.), Rough Sets in Knowledge Discovery 1 and 2. Heidelberg, Germany: Physica-Verlag, 472-492.
12. Ziarko, W. 1993. Variable precision rough set model. Journal of Computer and System Sciences 46: 39-59.
?
Thank You