e3071d3492e9546569e27aba25368305.ppt
- Number of slides: 23
Preserving edits when perturbing microdata for statistical disclosure control
Natalie Shlomo, Southampton Statistical Sciences Research Institute, University of Southampton
Ton de Waal, Statistics Netherlands
Topics for Discussion
- Statistical Disclosure Control: an example
- PRAM: Post Randomization Method of perturbation
- Preserving edit constraints
- Micro and macro edit constraints
- Method for perturbing data and maintaining edits
- Evaluation study
- Results
- Impact on the risk of re-identification
- Discussion
Statistical Disclosure Control: Example
- Record in microdata:
  - Name of speaker: Ton de Waal
  - Nationality: Dutch
  - Nationality of co-author: Israeli
  - Ever stole candy: yes
- Name of speaker: direct identifier
- Nationality and Nationality of co-author: indirect identifiers
- Stealing candy: sensitive variable
Statistical Disclosure Control: Example
- “Protected” record:
  - Nationality: Dutch
  - Nationality of co-author: Israeli
  - Ever stole candy: yes
- I’m still not happy!
Post-Randomization Perturbation
- Method changes values of categorical variables according to a prescribed probability transition matrix
- p_ij = P(perturbed category = j | original category = i)
- Matrix P = (p_ij) is applied independently to each record
- For each record, the value is changed or left unchanged according to these probabilities and a random draw
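The record-by-record draw described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `pram`, the three categories and the values in `P` are all hypothetical.

```python
import random


def pram(values, categories, P, rng=None):
    """Perturb each value independently with transition matrix P.

    P[i][j] is the probability that original category i is released
    as category j; every row of P must sum to 1.
    """
    rng = rng or random.Random()
    index = {c: i for i, c in enumerate(categories)}
    perturbed = []
    for v in values:
        row = P[index[v]]
        # One random draw per record decides whether the value changes.
        perturbed.append(rng.choices(categories, weights=row, k=1)[0])
    return perturbed


# Hypothetical categories and probabilities, for illustration only.
categories = ["Dutch", "Israeli", "Canadian"]
P = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
original = ["Dutch", "Israeli", "Dutch", "Canadian"]
perturbed = pram(original, categories, P, rng=random.Random(0))
```

With an identity matrix every value is kept, which is a quick sanity check that the rows of P are indexed by the original category.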
Properties of PRAM
- t: vector of original frequencies
- t*: vector of perturbed frequencies
- E(t*|t) = tP
- Invariant PRAM: P selected in a special manner
  - Condition of invariance: t = tP (= E(t*|t))
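The invariance condition t = tP can be checked numerically. The sketch below uses one known two-stage construction: start from any transition matrix P, form the "reverse" matrix Q by Bayes' rule, and use R = PQ. The slides do not say which construction the authors used, so this choice, the function names, and the values of `t` and `P` are assumptions for illustration.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def invariant_matrix(P, t):
    """Return R = P * Q, which satisfies t = tR.

    Assumes every category has a positive expected perturbed frequency.
    """
    n = len(t)
    # s[j] = (tP)_j, the expected perturbed frequency of category j under P.
    s = [sum(t[i] * P[i][j] for i in range(n)) for j in range(n)]
    # Q[j][i] = probability that the original category was i, given perturbed j.
    Q = [[P[i][j] * t[i] / s[j] for i in range(n)] for j in range(n)]
    return matmul(P, Q)


# Hypothetical frequencies and transition matrix.
t = [50.0, 30.0, 20.0]
P = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
R = invariant_matrix(P, t)
expected = [sum(t[i] * R[i][j] for i in range(3)) for j in range(3)]
```

Here `expected` reproduces `t` up to floating-point error, so the perturbed frequencies are unbiased for the original ones without any further correction.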
Example of using PRAM
- Record after PRAM:
  - Nationality: Dutch
  - Nationality of co-author: Canadian
  - Ever stole candy: yes
Edit constraints
- Changing values of categorical variables will cause previously edited records to fail edit constraints:
  - data of low utility
  - an inconsistent record signals to a potential attacker that the record was perturbed
- Example of edit constraint:
  - “two authors of a paper at the UN/ECE Work Session in Ottawa are not from Canada and the Netherlands”
- Attacker knows that the record in our example has been perturbed
Preserving Edit Constraints
Clean Data Set -> Perturbation -> Failed Edits -> Imputation -> Clean Data Set
- Take edits into account as much as possible while applying PRAM
- After PRAM has been applied: correct remaining edit failures by hot-deck imputation
- Correct records while holding the perturbed variables fixed
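The correction step can be sketched as a simple random hot-deck. This is a sketch under stated assumptions, not the authors' procedure: the donor rule (a random clean record with the same value on the perturbed variable), the field names `age` and `marital`, and the single edit are all hypothetical.

```python
import random


def fails_any(record, edits):
    """An edit is a predicate that returns True when the record fails it."""
    return any(edit(record) for edit in edits)


def hot_deck_correct(records, edits, fixed_vars, impute_vars, rng=None):
    """Impute impute_vars from a clean donor; fixed_vars are never changed."""
    rng = rng or random.Random()
    clean = [r for r in records if not fails_any(r, edits)]
    corrected = []
    for r in records:
        if not fails_any(r, edits):
            corrected.append(dict(r))
            continue
        # Donors must agree with the recipient on the (perturbed) fixed variables.
        donors = [d for d in clean if all(d[v] == r[v] for v in fixed_vars)]
        if not donors:
            corrected.append(dict(r))  # no donor available: left unresolved
            continue
        donor = rng.choice(donors)
        fixed = dict(r)
        for v in impute_vars:
            fixed[v] = donor[v]
        corrected.append(fixed)
    return corrected


# Hypothetical edit: under 16 and not single is a failure.
edits = [lambda r: r["age"] < 16 and r["marital"] != "single"]
records = [
    {"age": 15, "marital": "married"},  # fails after age was perturbed
    {"age": 15, "marital": "single"},
    {"age": 40, "marital": "married"},
]
result = hot_deck_correct(
    records, edits, fixed_vars=["age"], impute_vars=["marital"], rng=random.Random(1)
)
```

Copying values from a donor only guarantees a clean record when every variable involved in the failed edits is either fixed and matched, or imputed; with more edits the result should be re-checked.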
Example
- Record after PRAM:
  - Nationality: Dutch
  - Nationality of co-author: Canadian
  - Ever stole candy: yes
- Edited record after PRAM:
  - Nationality:
  - Nationality of co-author:
  - Ever stole candy:
- I’m happy!
Micro Edit Constraints
- Data: 1995 Israel Census sample data, 35,773 individuals aged 15 and over in 15,468 households across all regions and characteristics
- Variable age perturbed: 86 categories
- 14 micro-edits, such as:
  - E1: {Under 16 and ever married} = Failure
  - E2: {Age of marriage under 14} = Failure
  - E3: {Age difference between spouses over 25} = Failure
  - E4: {Age of mother under 14} = Failure
  - E5: {Year of immigration less than year of birth} = Failure
  - E7: {Under 16 and relation is spouse or parent} = Failure
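Edits of this kind are naturally written as predicates on a record. The sketch below encodes three of the listed edits; the field names (`age`, `ever_married`, `age_at_marriage`, `year_of_birth`, `year_of_immigration`) are hypothetical and not taken from the census file.

```python
# Each edit returns True when the record FAILS it.
# Field names are hypothetical; None marks "not applicable".
MICRO_EDITS = {
    "E1": lambda r: r["age"] < 16 and r["ever_married"],
    "E2": lambda r: r["age_at_marriage"] is not None and r["age_at_marriage"] < 14,
    "E5": lambda r: (
        r["year_of_immigration"] is not None
        and r["year_of_immigration"] < r["year_of_birth"]
    ),
}


def failed_edits(record):
    """Return the names of all edits that the record fails."""
    return [name for name, edit in MICRO_EDITS.items() if edit(record)]


consistent = {
    "age": 30,
    "ever_married": True,
    "age_at_marriage": 25,
    "year_of_birth": 1965,
    "year_of_immigration": 1990,
}
inconsistent = {
    "age": 15,
    "ever_married": True,
    "age_at_marriage": 13,
    "year_of_birth": 1980,
    "year_of_immigration": 1975,
}
```

Counting records for which `failed_edits` is empty gives the kind of "no edit failures" total used to compare perturbation methods.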
Macro Edit Constraints
- Let D be a dataset and D(c) the frequency of cell c
- Hellinger Distance
  - Symmetrical distance metric that measures the difference between the original and perturbed probability distributions
  - Information loss defined by a larger Hellinger Distance
- Cramer’s V
  - Measures association between two categorical variables
  - Information loss defined by a reduction in Cramer’s V
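The formulas on this slide did not survive extraction. As a sketch, both measures can be computed with their textbook definitions: the Hellinger distance between two probability distributions (scaled to lie in [0, 1]) and Cramer's V from the chi-square statistic of a two-way table. Whether the authors used exactly these normalisations is an assumption, and the function names are hypothetical.

```python
import math


def hellinger(orig_counts, pert_counts):
    """Hellinger distance between cell distributions given as {cell: count}."""
    n_o = sum(orig_counts.values())
    n_p = sum(pert_counts.values())
    cells = set(orig_counts) | set(pert_counts)
    total = sum(
        (math.sqrt(orig_counts.get(c, 0) / n_o) - math.sqrt(pert_counts.get(c, 0) / n_p)) ** 2
        for c in cells
    )
    return math.sqrt(total / 2.0)


def cramers_v(table):
    """Cramer's V for a two-way frequency table (list of rows).

    Assumes every row and column total is positive.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Information loss in association: V on the original table minus V after perturbation.
v_original = cramers_v([[10, 0], [0, 10]])
v_perturbed = cramers_v([[7, 3], [3, 7]])
```

Identical distributions give a Hellinger distance of 0 and disjoint ones give 1, so larger values indicate more distortion.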
Macro Edit Constraints
- Impact on R²
  - Measures the proportion of “between” variance out of the total variance, i.e. homogeneity of the dependent variable within groupings
  - Information loss defined as the proportion reduced in the “between” sum of squares for perturbed groupings compared to original groupings
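The formula for this measure is missing from the extracted slide. A minimal sketch, assuming "proportion reduced" means 1 - BSS(perturbed) / BSS(original), where BSS is the usual between-group sum of squares; the function names and the toy data are hypothetical.

```python
def between_ss(values, groups):
    """Between-group sum of squares: sum over groups of n_g * (mean_g - mean)^2."""
    overall = sum(values) / len(values)
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return sum(counts[g] * (sums[g] / counts[g] - overall) ** 2 for g in sums)


def bss_loss(values, orig_groups, pert_groups):
    """Assumed definition: share of the original BSS lost after perturbing the grouping."""
    return 1.0 - between_ss(values, pert_groups) / between_ss(values, orig_groups)


# Toy data: a dependent variable grouped by an (original, then perturbed) category.
values = [10.0, 12.0, 30.0, 32.0]
orig_groups = ["a", "a", "b", "b"]
pert_groups = ["a", "b", "b", "a"]  # perturbation mixes the two groups

loss = bss_loss(values, orig_groups, pert_groups)
```

In the toy data the perturbed groups have identical means, so all of the between-group variation is lost and `loss` equals 1.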
Methods of Perturbation
- Perturb variable randomly across all categories
- Perturb within a limited range of the variable, i.e. divide the variable into subgroups and calculate a transition matrix for each subgroup
- Perturb variable within control groups defined by other highly correlated variables
- Compound highly correlated variables
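Perturbing within a limited range amounts to a block-diagonal transition matrix: a category can only move to another category in the same subgroup. The sketch below builds such a matrix with a uniform spread inside each block; the uniform spread, the `stay_prob` parameter and the function names are assumptions for illustration.

```python
def within_group_matrix(categories, group_of, stay_prob):
    """Transition matrix that only allows moves inside the same subgroup."""
    n = len(categories)
    P = [[0.0] * n for _ in range(n)]
    for i, ci in enumerate(categories):
        peers = [
            j for j, cj in enumerate(categories)
            if j != i and group_of(cj) == group_of(ci)
        ]
        if not peers:
            P[i][i] = 1.0  # a singleton subgroup cannot be perturbed
            continue
        P[i][i] = stay_prob
        for j in peers:
            # Remaining probability is spread evenly over the other categories.
            P[i][j] = (1.0 - stay_prob) / len(peers)
    return P


def age_band(age):
    """Two of the narrow age bands, used here only as an example."""
    return "15-17" if age <= 17 else "18-24"


ages = list(range(15, 25))
P = within_group_matrix(ages, age_band, stay_prob=0.7)
```

Because cross-block entries are zero, an age of 15 can become 16 or 17 but never 18, which is what keeps edits such as "under 16 and ever married" from being triggered by large jumps.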
Evaluation Study
- Perturbation of 86 categories of age:
  - Random perturbation across all ages
  - Age perturbed within categories of marital status (married, divorced, widowed and single); invariant matrices calculated for each category
  - Age perturbed within categories of marital status x six age groupings (15-17, 18-24, 25-44, 45-64, 65-74, 75+)
  - Age perturbed within narrow age groupings (15-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-69, 70-74, 75+)
Number of Edit Failures
Method of Perturbation                        No edit failures
Random                                        31,983
Within Marital Status                         33,143
Within Marital Status and Broad Age Groups    35,023
Within Narrow Age Groups                      35,440
Note: large reduction in number of micro edit failures
Macro Edits Results
- Distortion to distribution
- Loss in measures of association
Macro Edits Results
- Impact on R² and shrinking means
- Loss in homogeneity within age
Disclosure Risk Measures
- Percent of unperturbed records in small cells of the key
- Without perturbation, the expected number of correct matches would be measured by 1/F_k, where F_k is the number of records in cell k. Because only a proportion p_d of records were likely not to be perturbed, the expected number of correct matches is p_d/F_k
- Proportion of records perturbed within a 5-year age difference: the higher the proportion, the more likely it is to obtain a correct linkage
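The first two measures can be sketched directly from their definitions. The assumptions here are that "small cells" are identified on the original key with a hypothetical size threshold `max_cell_size`, and that `p_d` is supplied as a single proportion; the function names and data are illustrative only.

```python
from collections import Counter


def match_risk_per_cell(keys, p_d):
    """Expected correct matches p_d / F_k for each cell k of the key."""
    F = Counter(keys)
    return {k: p_d / f for k, f in F.items()}


def pct_unperturbed_in_small_cells(orig_keys, pert_keys, max_cell_size=2):
    """Percent of records in small original cells whose key was not changed."""
    F = Counter(orig_keys)
    small = [(o, p) for o, p in zip(orig_keys, pert_keys) if F[o] <= max_cell_size]
    if not small:
        return 0.0
    return 100.0 * sum(o == p for o, p in small) / len(small)


# Hypothetical key values before and after perturbation.
orig_keys = ["a", "a", "b", "c", "c", "c"]
pert_keys = ["a", "x", "b", "c", "c", "c"]

risk = match_risk_per_cell(orig_keys, p_d=0.8)
pct = pct_unperturbed_in_small_cells(orig_keys, pert_keys, max_cell_size=2)
```

A unique record (F_k = 1) carries the full risk p_d, which is why the share of unperturbed records in small cells is tracked separately.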
Disclosure Risk Results
- Slightly higher percent of unperturbed records for more controlled PRAM
- No increased disclosure risk for more controlled PRAM
Disclosure Risk Results
- Large increase in percent perturbed within a 5-year age band for more controlled PRAM
- Data protector must weigh the increased disclosure risk against the benefits of obtaining higher utility data
Discussion
- Controls in the perturbation raise the utility of the data by minimizing micro and macro edit failures.
- Risk of re-identification increases slightly, depending on the risk measure and the disclosure risk scenario.
- Protecting microdata by PRAM alone leaves high disclosure risk in the microdata; PRAM should be combined with other data masking techniques.
- More sophisticated methods are needed for correcting edit failures on perturbed microdata, based on the principle of minimum change, in order to improve the quality and utility of the data.