- Number of slides: 30

Issues in Designing a Confidentiality Preserving Model Server by Philip M Steel & Arnold Reznek

Talk Outline • Background • Basic design • Description of operation • Confidentiality outline • Constraints on universe formation • Other constraints • Summary

Background • PUBLIC remote access to confidential data • Restriction of queries and responses rather than registering and monitoring the user • Current Population Survey (CPS): employment and economic well-being; demographic supplement • Software development by Synectics • HTML, MySQL, PHP to develop the query … SAS as the statistical package run against the data

Risk Model for Microdata • Intruder has access to record linkage software and identified data sources • Disclosure occurs if the intruder is successful in linking his identified data with the published microdata

Risk Model for a Model Server • Intruder has access to record linkage software and identified data sources • Intruder uses the model server to reconstruct microdata for both the variables overlapping his data sources and a sensitive variable • Disclosure occurs if the intruder is successful in linking his identified data with the reconstructed microdata and has a valid estimate of a sensitive characteristic or value

Basic Design Choice • Enable: choose which functions will operate – Must construct a friendly interface – Limited to the procedures developed – Safe from unknown code • Disable: choose which functions will not operate – User free to program within disabling constraints – No limit on complexity – Must be monitored (human, program or mix)

Operation • User visits web site, chooses data set, explores data, chooses geography and analysis type • User chooses population, constructs model, selects output • Web site constructs code to send behind firewall • Code checked and run against data at Census • Results checked and returned to user

Structure of Confidentiality Rules • Data preparation • Data exploration • Model universe formation • Model statement • Model output

Data exploration rules • Users may request tables for categorical variables and numeric recodes up to e1 dimensions (start e1 = 4, including geography). • Users may transform numeric recodes using a limited set of functions: log, root, square.
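The two exploration rules can be sketched as a small request filter. This is a minimal illustration, not the server's actual code: the function names are hypothetical, "root" is assumed to mean square root, and geography is assumed to count as one dimension.

```python
import math

E1 = 4  # starting value of e1 from the slide; geography counts as a dimension

# Transformations the slide allows on numeric recodes.
# "root" is assumed here to mean the square root.
ALLOWED_TRANSFORMS = {
    "log": math.log,
    "root": math.sqrt,
    "square": lambda x: x * x,
}


def table_request_allowed(variables, includes_geography, e1=E1):
    """Return True if the requested table has at most e1 dimensions."""
    dimensions = len(set(variables)) + (1 if includes_geography else 0)
    return dimensions <= e1


def apply_transform(name, value):
    """Apply one of the permitted transforms; reject anything else."""
    if name not in ALLOWED_TRANSFORMS:
        raise ValueError(f"transform {name!r} is not permitted")
    return ALLOWED_TRANSFORMS[name](value)
```

Because the transform list is a closed whitelist, this follows the "Enable" design choice: anything not named is refused rather than inspected.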

Universe formation: Categorical Variables • Example: Hispanic heads of household with a college degree. • Conditions: X1 = H, X2 = 1, X3 = 5 (table cell) • Implication: data preparation must support safe lower-dimensional tables

Universe formation rules: Categorical Variables • Limit on the number of categorical variables (u1 = 3) • Minimum on the size of the universe selected (u2 = 75)
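A minimal sketch of how the two categorical rules might be checked. The record layout (a list of dictionaries) and the function name are assumptions for illustration; only u1 and u2 come from the slide.

```python
U1 = 3   # maximum number of categorical conditions (slide value)
U2 = 75  # minimum universe size (slide value)


def categorical_universe_allowed(records, conditions, u1=U1, u2=U2):
    """Check a universe defined by equality conditions on categorical variables.

    records    -- list of dicts, one per observation (hypothetical layout)
    conditions -- dict mapping variable name to the required category
    """
    if len(conditions) > u1:
        return False
    size = sum(
        1 for r in records
        if all(r.get(var) == cat for var, cat in conditions.items())
    )
    return size >= u2
```

For the slide's example, the conditions would be something like `{"X1": "H", "X2": 1, "X3": 5}`, which uses exactly the three variables that u1 permits.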

Universe Formation: Numeric Variables • Example: Families in poverty • Condition: Family income < 18,500. Or Family income < 18,501? • Implication: Rounding or pre-assigned cutpoints.
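The concern behind the 18,500 versus 18,501 question is a differencing attack: two universes that differ by one dollar can isolate a single record. The sketch below illustrates this with made-up incomes; the data and function names are hypothetical.

```python
def select_universe(incomes, cutoff):
    """Universe of records with income strictly below the cutoff."""
    return [x for x in incomes if x < cutoff]


def differencing_records(incomes, cutoff_a, cutoff_b):
    """Records that fall in one universe but not the other."""
    low, high = sorted((cutoff_a, cutoff_b))
    return [x for x in incomes if low <= x < high]


# Made-up incomes, for illustration only.
incomes = [12000, 17250, 18500, 21000, 40000]

# With free-form cutoffs, two nearly identical universes isolate one family.
isolated = differencing_records(incomes, 18500, 18501)

# With coarse pre-assigned cutpoints, the difference covers several records.
coarse = differencing_records(incomes, 15000, 25000)
```

Comparing model results on the two nearly identical universes would let an intruder back out the contribution of the isolated record, which is why free-form numeric conditions are not offered.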

Universe formation rules: Numeric variables • Users will select categorical variables first. • Numeric variables can be used only at pre-assigned cutpoints. • The number of observations in the whole CPS universe between cutpoints shall be at least u3 for every numeric variable (start u3 = 80).
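One way to pre-assign cutpoints that satisfy the u3 rule is a greedy pass over the sorted values. This is a sketch under stated assumptions, not the procedure actually used: a cutpoint c is taken to define the condition "value < c", the intervals below the first and above the last cutpoint are also required to hold at least u3 observations, and the function names are hypothetical.

```python
from bisect import bisect_left


def assign_cutpoints(values, u3=80):
    """Greedily choose cutpoints so every interval holds at least u3 values."""
    ordered = sorted(values)
    n = len(ordered)
    cutpoints = []
    i = u3  # candidate index: observations before i lie below the cutpoint
    while i <= n - u3:
        # Only cut between distinct values, so tied observations stay together.
        if ordered[i] != ordered[i - 1]:
            cutpoints.append(ordered[i])
            i += u3
        else:
            i += 1
    return cutpoints


def interval_counts(values, cutpoints):
    """Number of observations in each interval defined by the cutpoints."""
    ordered = sorted(values)
    edges = [0] + [bisect_left(ordered, c) for c in cutpoints] + [len(ordered)]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]
```

The greedy pass yields the finest grid this rule allows from left to right; a production system might instead prefer round numbers that users recognise, checked with `interval_counts`.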

Universe formation rules (cont) • If a cutpoint is used in universe formation, the difference in the size of the model universe obtained by incrementing the cutpoint up or down cannot be less than u4 (start u4 = 4). • The universe for the model must have at least u2 observations (start u2 = 75). • There will be no cutpoints above the 97th percentile of nonzero points or the last half percentile of all points.
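The u4 and u2 rules can be checked at the moment a user picks a cutpoint. The sketch below assumes the model universe is "value < cutpoint" within the already selected categorical universe, and that "incrementing the cutpoint up or down" means moving to the adjacent pre-assigned cutpoint. The function name is hypothetical, and the percentile ceiling is left out because it belongs to data preparation.

```python
from bisect import bisect_left

U2 = 75  # minimum model universe size (slide value)
U4 = 4   # minimum change in universe size between adjacent cutpoints (slide value)


def cutpoint_use_allowed(universe_values, cutpoints, k, u2=U2, u4=U4):
    """Check whether cutpoints[k] may be used to form the model universe.

    universe_values -- the numeric variable for records already selected
                       by the categorical conditions
    cutpoints       -- sorted list of pre-assigned cutpoints
    k               -- index of the cutpoint the user chose
    """
    ordered = sorted(universe_values)

    def size(j):
        # Number of records strictly below cutpoint j.
        return bisect_left(ordered, cutpoints[j])

    chosen = size(k)
    if chosen < u2:
        return False
    neighbours = [j for j in (k - 1, k + 1) if 0 <= j < len(cutpoints)]
    return all(abs(size(j) - chosen) >= u4 for j in neighbours)
```

The u3 rule is enforced on the whole CPS universe, so a small sub-universe can still have very few records between two cutpoints; the u4 test catches that case.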

Model statement rules • At most m1 variables may be used in the model statement (start m1 = 20). • Dummy variables must distinguish at least m2 observations (start m2 = 20). • No interaction term may involve more than 4 variables (m3 = 4). • No model involving 3 or more variables can be fully interacted (m4 = 3).
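The four model statement rules lend themselves to a single validation pass that lists every violation. This sketch makes two assumptions not stated in the slides: a dummy "distinguishes" an observation when it equals 1 for that observation, and "fully interacted" means every combination of two or more model variables appears as an interaction term. All names are hypothetical.

```python
from itertools import combinations


def fully_interacted(variables, interactions):
    """True if every subset of two or more variables appears as a term."""
    terms = {frozenset(t) for t in interactions}
    names = sorted(set(variables))
    return all(
        frozenset(combo) in terms
        for r in range(2, len(names) + 1)
        for combo in combinations(names, r)
    )


def check_model_statement(variables, interactions, dummy_counts,
                          m1=20, m2=20, m3=4, m4=3):
    """Return a list of rule violations; an empty list means the model passes.

    variables    -- names of variables in the model statement
    interactions -- list of tuples of variable names
    dummy_counts -- dict: dummy name -> number of observations where it is 1
    """
    problems = []
    if len(set(variables)) > m1:
        problems.append(f"more than {m1} variables")
    for name, count in dummy_counts.items():
        if count < m2:
            problems.append(f"dummy {name} distinguishes fewer than {m2} observations")
    for term in interactions:
        if len(set(term)) > m3:
            problems.append(f"interaction {term} involves more than {m3} variables")
    if len(set(variables)) >= m4 and fully_interacted(variables, interactions):
        problems.append("model is fully interacted")
    return problems
```

Returning the full list of problems, rather than stopping at the first, lets the interface explain to the user what needs to change.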

Model Output • Residuals will be based on synthetic data • Limit on the number of significant digits? • R² cannot be 1? • Rules for other diagnostics

Synthetic Residuals • Users may see synthetic bar charts or distributions and synthetic 2-way plots. • Synthetic data must be generated from fixed random number starts and topcoded (and bottom-coded where appropriate) at 4 standard deviations from the mean.
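The slides do not say how the synthetic values are drawn, so the following is only a sketch of the two stated requirements: a fixed random number start and coding at 4 standard deviations. It assumes normal draws matched to the mean and standard deviation of the real residuals; the function name and seed value are hypothetical.

```python
import random
import statistics


def synthetic_residuals(residuals, seed=12345, k=4.0):
    """Draw synthetic residuals and top/bottom-code them at k standard deviations.

    Assumes a normal distribution matched to the real residuals' mean and
    standard deviation; the real generator may differ.
    """
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    rng = random.Random(seed)  # fixed start: repeated requests give identical output
    low, high = mean - k * sd, mean + k * sd
    return [min(max(rng.gauss(mean, sd), low), high) for _ in residuals]
```

The fixed seed matters for confidentiality: if each request produced fresh noise, an intruder could average many requests to recover the underlying distribution more precisely.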

Data preparation • The topcode for numeric data needs to be calculated • Cutpoints must be determined • Separate lists of variables for exploration, universe formation, dependent and independent variables, model estimation • Standard recodes added • Inference from the collection of all 4-way categorical tables checked

Major Hurdles • Implementing facility for dummy variables • Presentation of geographic options • Implementing synthetic residuals • Architecture for differing variable roles

Future development • Relaxation of top codes • Implementation of model variance estimation (NSO weighting) • Introduction of new dataset • Introduction of new statistical procedures • Facility to add contextual data or merge files • Use of non-sampled data

Overview • Avoids (as much as possible) tests which accept or reject a user's choice. • Restricts the dimension of the data access. • Has some flexibility in setting system confidentiality parameters. • Changes the intruder model. • Introduces a modification of k-anonymity.

My thanks to Jerry Reiter, Laura Zayatz and Stephen Wenck • http://204.52.186.190/ • Contact: philip. m. [email protected] gov