Multiple Imputation for Household Surveys: A Comparison of Methods


Multiple Imputation for Household Surveys: A Comparison of Methods
Stata Users Group Meeting
Rodrigo Alfaro – Marcelo Fuenzalida
STATA USERS GROUP MEETING SEPTEMBER 09 2008

Outline
§ Multiple Imputation (MI)
§ Methods for MI
§ Chilean Household Surveys
§ Application of MI to EFH
§ Comments and Conclusions
§ Appendix

Multiple Imputation
§ In empirical applications, researchers must often work with incomplete data sets.
§ One “solution” to this problem is the Multiple Imputation (MI) procedure.
§ MI relies on the Missing At Random (MAR) assumption:
  § “…the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis.” (Allison, 2002)
§ In our empirical application we assume MAR.
§ The validity of this assumption is beyond the scope of this analysis.
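The MAR statement quoted above can be written compactly. The notation is an assumption introduced here and does not appear on the slides: Y is the variable with missing values and X stands for the other variables in the analysis.

```latex
\Pr(Y \text{ is missing} \mid Y, X) = \Pr(Y \text{ is missing} \mid X)
```

In words: once X is controlled for, the chance that Y is missing does not depend on the value of Y itself.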

Multiple Imputation
§ MI is based on the assumption that we have good proxies for the distributions of the missing observations in the sample. Given this:
  § We can “fill in the blanks” by taking random realizations.
  § We create m versions of the complete data set in order to reflect the randomness of the procedure.
§ So we exchange our incomplete data set for a set of complete ones. We do not “solve” the missing data problem; we just measure it.
§ Analyses of the multiple data sets can be combined using Rubin’s rules (Rubin, 1987).
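As an illustration of the combining step, here is a minimal Python sketch of Rubin's rules for a single scalar estimate. The function name `rubin_combine` and the toy numbers are hypothetical and are not taken from the presentation. The sketch pools the m point estimates by averaging them and adds the between-imputation variance, inflated by (1 + 1/m), to the average within-imputation variance.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)       # pooled point estimate
    within = mean(variances)      # average within-imputation variance
    between = variance(estimates) # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical results from m = 3 completed data sets.
estimate, total_var = rubin_combine([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
```

The total variance is larger than the within-imputation variance alone, which is how the extra uncertainty from the missing data is "measured" rather than hidden.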

Methods for MI
§ We divide methods into 2 categories:
  § Conditional: Hot-Deck and Univariate.
  § Multivariate: Normal and Chained Equations.
§ Hot-Deck (hotdeck.ado)
  § Replace missing observations with an observed one drawn at random within a specific group, for example males with a college education.
  § From informal conversations: the resulting standard deviations are still small.
§ Univariate (uvis.ado)
  § Regress the variable with missing observations on exogenous variables with no missing values.
  § Draw the estimators (beta and sigma) from their posterior distribution.
  § “Predict” the missing values.
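The hot-deck idea can be sketched in a few lines of Python. This is a simple random hot-deck written for illustration only; it is not the algorithm implemented in hotdeck.ado, and the function name, field names, and toy records are all hypothetical.

```python
import random


def hot_deck(records, target, group_key, rng):
    """Return copies of records with missing target values (None) replaced
    by a randomly drawn observed value from the same group."""
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[group_key], []).append(rec[target])
    completed = []
    for rec in records:
        new = dict(rec)  # copy, so the incomplete data set is left untouched
        if new[target] is None:
            pool = donors.get(new[group_key])
            if pool:  # the value stays missing when the group has no donors
                new[target] = rng.choice(pool)
        completed.append(new)
    return completed


# Hypothetical toy data: income is missing for two respondents.
records = [
    {"group": "male_college", "income": 900},
    {"group": "male_college", "income": 1100},
    {"group": "male_college", "income": None},
    {"group": "female_college", "income": 1000},
    {"group": "female_college", "income": None},
]

# m completed data sets, each produced with its own random draws.
m = 5
completed_sets = [
    hot_deck(records, "income", "group", random.Random(seed)) for seed in range(m)
]
```

Because every imputed value is an observed value from the same group, imputations can never fall outside the observed range, which is consistent with the remark that the resulting standard deviations stay small.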

Methods for MI
§ Chained Equations (ice.ado)
  § Based on the Univariate method, but allowing for missing values in the exogenous variables.
  § Missing values of the exogenous variables are replaced using the reverse equation.
  § Loop over the previous steps.
§ Normal
  § Assume a Multivariate Normal distribution. Estimate the parameters using the EM algorithm (or another initial value), and draw imputations using the Data Augmentation (DA) procedure.
  § Theory relies on the convergence of EM.
  § Not implemented in Stata. Schafer’s stand-alone package.
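The chained-equations loop can be sketched for the simplest case of two variables that both have missing values. This Python sketch is a deliberate simplification and not the ice.ado algorithm: it imputes with a fitted regression line plus random residual noise, and it omits the posterior draw of beta and sigma described for the Univariate method. All names and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, (sse / (n - 2)) ** 0.5


def chained_impute(x, y, cycles, rng):
    """x and y are lists with None for missing values; returns completed copies."""
    miss_x = [i for i, v in enumerate(x) if v is None]
    miss_y = [i for i, v in enumerate(y) if v is None]
    obs_x = [i for i, v in enumerate(x) if v is not None]
    obs_y = [i for i, v in enumerate(y) if v is not None]
    # Starting values: fill each gap with the observed mean of its variable.
    x_mean = mean([x[i] for i in obs_x])
    y_mean = mean([y[i] for i in obs_y])
    cx = [x_mean if v is None else v for v in x]
    cy = [y_mean if v is None else v for v in y]
    for _ in range(cycles):
        # Equation for y: fit on rows where y was observed, then impute y.
        a, b, sd = fit_line([cx[i] for i in obs_y], [cy[i] for i in obs_y])
        for i in miss_y:
            cy[i] = a + b * cx[i] + rng.gauss(0, sd)
        # Reverse equation: fit x on y where x was observed, then impute x.
        a, b, sd = fit_line([cy[i] for i in obs_x], [cx[i] for i in obs_x])
        for i in miss_x:
            cx[i] = a + b * cy[i] + rng.gauss(0, sd)
    return cx, cy


# Hypothetical toy data with one gap in each variable.
x = [1.0, 2.0, 3.0, 4.0, None, 6.0]
y = [2.1, 3.9, None, 8.2, 9.8, 12.1]
cx, cy = chained_impute(x, y, cycles=10, rng=random.Random(0))
```

Running the function m times with different random seeds would give m completed data sets. A full implementation would also draw the regression parameters from their posterior in each cycle.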

Methods for MI
§ Schafer’s stand-alone package: Data.

Methods for MI
§ Schafer’s stand-alone package: EM.

Methods for MI
§ Schafer’s stand-alone package: DA.

Methods for MI
§ Schafer’s stand-alone package: DA (continued).

Chilean Household Surveys
§ We have household surveys with only a few waves: CASEN and EPS.
  § CASEN was created to measure poverty.
  § EPS was created to evaluate the pension system.
§ At the Central Bank we have been using these surveys to analyze the financial fragility of households.
§ However, CASEN and EPS were not created for this purpose. We need new sample designs.
§ In 2007, we started a new survey designed for our purposes: EFH.

Chilean Household Surveys
§ Our surveys have different levels of information:
  § Personal information on each member of the household. For example: age, years of education, labor income, etc.
  § Aggregate information on the household. For example: value of assets (cars, house, financial instruments, etc.) and debts (mortgage, consumer loans, educational loans, etc.).
§ Our variables of interest could be irrelevant for some households in the sample.
  § Many households have loans with retail companies instead of borrowing the money directly from banks.
  § Few households have personal savings invested in financial instruments such as stocks, bonds, etc.

Application of MI to EFH
§ Using conditional methods, we can attach constraints to the imputation procedure.
  § We are able to impute labor income for each member of the household, considering only individual-level variables.
  § At the household level, we can impute “bank loans” in the sub-sample of households that declared having that kind of debt. As exogenous variables we use the age, years of education, and gender of the interviewee.
  § We impute “debt with retail companies” using a different sub-sample but the same exogenous variables.
  § However, we cannot impute “debt with retail companies” using “bank loans”, because the sub-samples may be different.

Application of MI to EFH
§ Multivariate methods imply groups of households.
  § Suppose a household does not own a house. We could pretend that the value of its house is “zero”. However, that would affect the correlation between the value of the house and the total amount of debt.
§ Our first round includes 3 groups defined by credit access.
  § The first group includes households without debts with financial institutions and without any kind of assets.
  § The second group includes households with real assets: cars and primary house.
  § The third group adds households with debts with financial institutions.

Results for EFH Missing Information
§ Low missing rate of information.
§ However, combining variables reduces the sample size.
§ We will concentrate our analysis on the second group.
§ We use a logit transformation to avoid unbounded results.
Source: EFH 2007.
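One way a logit transformation can keep imputed values within bounds is sketched below in Python. The slides do not say which variables were transformed or which bounds were used, so the bounds `lower` and `upper`, the function names, and the numbers are assumptions made purely for illustration. The idea is to impute on the unbounded logit scale and map the result back, so that any imputed value lands strictly between the bounds.

```python
import math


def to_logit(value, lower, upper):
    """Map a value strictly inside (lower, upper) to the unbounded logit scale."""
    p = (value - lower) / (upper - lower)
    return math.log(p / (1 - p))


def from_logit(z, lower, upper):
    """Map a logit-scale value back into the interval (lower, upper)."""
    p = 1 / (1 + math.exp(-z))
    return lower + (upper - lower) * p


# Hypothetical bounds for a debt variable.
lower, upper = 0.0, 1000.0

z = to_logit(250.0, lower, upper)           # an observed value on the logit scale
round_trip = from_logit(z, lower, upper)    # recovers the observed value
high_draw = from_logit(8.0, lower, upper)   # a large imputed draw stays below upper
low_draw = from_logit(-8.0, lower, upper)   # a small imputed draw stays above lower
```

Values exactly at a bound cannot be transformed this way, so in practice they would need separate handling.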

Group 2: Conditional
Source: EFH 2007.

Group 2: Multivariate
§ Households with debt with retail companies.
§ UVIS and HD could have lower variances than the raw data.
§ We note that ICE and NORM are consistent in their standard deviations.
Source: EFH.

Comments and Conclusions
§ In our empirical application, Hot-Deck as well as Univariate imputation have smaller variances than the multivariate methods. Under multivariate imputation we are able to obtain a reasonable standard deviation that reflects the uncertainty of the completed data sets.
§ Moving to multivariate methods, we have ICE and NORM. Both have advantages and disadvantages.
§ ICE is implemented in Stata, with many features available to accommodate several models. In that respect it is more general than NORM.

Comments and Conclusions
§ NORM relies on the EM algorithm in theoretical terms and on DA as the imputation method. For that, we observed that we need convergence and a “reasonable” positive definite matrix.
§ In practical terms, we observed that a high rate of missing data is associated with non-convergence of the EM algorithm and/or some problems with DA.
§ In the case of ICE, we observed that the algorithm does not have a “convergence problem”. We were able to impute data with a high rate of incomplete information.

Comments and Conclusions
§ We think that NORM provides useful information about the stability of the model.
§ For that reason, its implementation in Stata would be a good complement to ICE. Don’t you think?
§ We found 2 versions of the NORM code: miss.sas by Paul Allison and norm.R by Alvaro Novo.
§ We translated the miss SAS routine into Stata ado programming. Allison used the SAS package IML (Interactive Matrix Language). We observed that IML is similar to Mata.

Comments and Conclusions
§ However, the original code was not “optimized” as Schafer suggested in his book.
§ A month ago we moved to the R code. We found that the original Fortran code was included in the R routine, so norm.R allows the Fortran code to be used directly in R.
§ Speeding up for this meeting, we translated 800 lines of Fortran into Mata in a week. However, our translation from R to Mata is not good enough… yet.

Problem
§ Besides the technical issues of MI, we have an unresolved topic on which your opinion is crucial: Aggregation.
§ We want to work at the household level, so individual information must be aggregated somehow.
§ To deal with missing observations at the individual level we could apply “improper imputations”: (1) replacing with zeros, or (2) replacing with a predicted value.
§ Alternatively, we could code the household as “missing” if any member has a missing observation. Because this loses information, we discarded it. Any feasible mixture? Is it possible to add variables in order to account for incomplete information at the individual level?
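The aggregation options listed above can be compared side by side in a short Python sketch. The function name, the rule labels, and the numbers are hypothetical. The returned count of missing members is only one possible answer to the question of adding a variable for incomplete individual information; it is not something the presentation proposes.

```python
def aggregate_household(values, rule, predicted=None):
    """Sum member-level values (None = missing) into a household total.

    Returns (total, number_of_missing_members).
    """
    n_missing = sum(1 for v in values if v is None)
    if rule == "zero":
        # Improper imputation (1): treat missing values as zero.
        total = sum(v for v in values if v is not None)
    elif rule == "predicted":
        # Improper imputation (2): predicted[i] is a hypothetical model
        # prediction for member i, used only where the value is missing.
        total = sum(predicted[i] if v is None else v for i, v in enumerate(values))
    elif rule == "missing":
        # Code the whole household as missing if any member is missing.
        total = None if n_missing else sum(values)
    else:
        raise ValueError("unknown rule: " + rule)
    return total, n_missing


# Hypothetical household with three members; the second income is missing.
incomes = [500.0, None, 300.0]
predictions = [480.0, 400.0, 310.0]

by_zero = aggregate_household(incomes, "zero")
by_prediction = aggregate_household(incomes, "predicted", predictions)
by_missing = aggregate_household(incomes, "missing")
```

The three rules give 800.0, 1200.0, and a missing total for the same household, which shows how much the household-level variable depends on the aggregation choice.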


All Households: Conditional
Source: EFH 2007.

All Households: Conditional
Source: EFH 2007.