A package of methods from statistical disclosure control (SDC) and differential privacy (DP) protects microdata files that contain numerical, geo-spatial, and categorical data from linkage and inferential attacks by intruders. The algorithms include standard SDC approaches, randomized mechanisms, and additive noise, with the aim of balancing privacy and utility across a range of statistical analysis methods: exploratory analysis; generation of count data; regression, classification, and clustering analyses; and other statistical and machine learning (ML) models to be specified. Because a DP mechanism need not be secret, the parameters of the noise-addition or randomization mechanism can be made public, and researchers can account for the perturbation in their statistical inferences.
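As a minimal sketch of how public noise parameters let a researcher account for the perturbation, the example below adds Laplace noise with scale b = sensitivity / epsilon to a numeric variable and then subtracts the known noise variance, 2b^2, from the observed variance. The function names, the sensitivity value, and the simulated data are illustrative assumptions, not part of the proposed package, and the sketch does not by itself establish a DP guarantee for record-level release.

```python
import random
import statistics

def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb(values, sensitivity, epsilon, rng):
    """Add Laplace noise with public scale b = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return [v + laplace(scale, rng) for v in values], scale

def corrected_variance(noisy_values, scale):
    """Subtract the known noise variance 2 * b**2 from the observed variance."""
    return statistics.variance(noisy_values) - 2.0 * scale ** 2

if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated "true" values; purely illustrative.
    true_values = [rng.gauss(50.0, 10.0) for _ in range(20000)]
    noisy, b = perturb(true_values, sensitivity=5.0, epsilon=1.0, rng=rng)
    print("naive variance:    ", statistics.variance(noisy))
    print("corrected variance:", corrected_variance(noisy, b))
    print("true variance:     ", statistics.variance(true_values))
```

The correction works only because the noise distribution and its scale are published; the same idea extends to adjusting standard errors and confidence intervals.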
The pragmatic solution comes from the statistical community and mitigates the impact of privacy constraints on the more widely used statistics and data products of national statistical institutes in particular and of data providers in general. A privacy mechanism generates public microdata and tables with numeric precision and categorizations sufficient to serve the purpose of the data release, but with regularized and imputed values as needed to prevent violations of a privacy guarantee.
Simulations and systematic testing have made, and will continue to make, the privacy mechanism more robust to attacks. The design of the mechanism and the statistical/ML expertise of the developers help to anticipate potential intruder attacks and to select the combinations of methods required to counter them. Considering the rapid pace of development of data sources and integration technologies, a static privacy mechanism would not be expected to work effectively for long, much less extend to different forms of public releases of data and statistics. The proposed mechanism applies combinations of methods in layers and through pipelines. More along the lines of a spam filter than an encryption algorithm, it has to adapt to new threats as they appear.
As an initial step in our proposal for data protection, we apply standard SDC methods: (1) consulting users on what data are needed and at what level of aggregation; (2) suppressing variables that are not needed; (3) coarsening quasi-identifiers to satisfy criteria such as k-anonymity.
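The coarsening step can be illustrated with a small sketch that checks k-anonymity before and after recoding an exact age into 10-year bands. The record layout, the band width, and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

def coarsen_age(age, width=10):
    """Recode an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def smallest_class(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values(), default=0)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_class(records, quasi_identifiers) >= k

# Hypothetical microdata with two quasi-identifiers.
records = [
    {"age": 31, "region": "N"},
    {"age": 34, "region": "N"},
    {"age": 37, "region": "N"},
    {"age": 52, "region": "S"},
    {"age": 55, "region": "S"},
]
coarsened = [{**r, "age": coarsen_age(r["age"])} for r in records]
```

With exact ages every record is unique, so the file is only 1-anonymous; after coarsening, the smallest group has two records, so the file is 2-anonymous on these two variables.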
Improvement of previous algorithms.
Whether or not a known relationship exists among fields in a private data set, using prior information to generate a Bayesian posterior distribution yields data that improve estimates of the parameters of a regression, classification, or clustering model. The proposed approach for synthetic data generation assumes a non-informative prior and allows the data to drive parameter estimation, albeit with a small amount of Laplace noise added to each estimating equation. All variables are synthesized in turn using sequential multiple regression models. The same approach of adding noise to the estimating equations applies to the statistical methods available through the remote analysis server.
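The following is a simplified sketch of the idea for two continuous variables, so that each step reduces to a mean model or a simple regression; the general case would regress each variable on all previously synthesized ones. Laplace noise is added to the right-hand side of each estimating equation before solving. The noise scale and the residual standard deviations (sd_x, sd_y) are hypothetical inputs assumed to be supplied or protected separately, and the sketch makes no claim about calibration to a particular epsilon.

```python
import random

def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_mean(x, scale, rng):
    """Solve the perturbed estimating equation sum(x_i - mu) + noise = 0."""
    return (sum(x) + laplace(scale, rng)) / len(x)

def noisy_linear_fit(x, y, scale, rng):
    """Fit y = a + b*x from normal equations with noisy right-hand sides."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    # Perturb each estimating equation with independent Laplace noise.
    sy = sum(y) + laplace(scale, rng)
    sxy = sum(u * v for u, v in zip(x, y)) + laplace(scale, rng)
    det = n * sxx - sx * sx
    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b

def synthesize_pair(x, y, scale, sd_x, sd_y, rng):
    """Synthesize x from its marginal model, then y from its regression on x."""
    mu = noisy_mean(x, scale, rng)
    a, b = noisy_linear_fit(x, y, scale, rng)
    synth_x = [rng.gauss(mu, sd_x) for _ in x]
    synth_y = [a + b * v + rng.gauss(0.0, sd_y) for v in synth_x]
    return synth_x, synth_y
```

Because the noise enters the estimating equations rather than the individual records, its effect on the fitted coefficients shrinks as the sample size grows.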
See response to prior question.
It puts a sharper focus on methods that serve the public interest in high-quality data and research findings, in line with the objectives of a national statistical institute. A narrower focus on query yields and on the development of new algorithms discounts the advantages of combining methods in layers or pipelines that simultaneously reduce bias or error in statistical estimates and the risk of exposing private information. The perspective of the statistical community combines a broader public interest in accurate, representative, and reliable statistics and data products with best practices for using samples and estimates as a basis for scientific inference. Sequential regression modeling allows variables of all types to be synthesized together, thus better preserving joint distributions. Under the remote analysis server, higher utility is likely because the noise is added at the output level.
It estimates the parameters of a privacy-accuracy frontier (PAF) for a range of DP epsilon values. The point on the PAF that corresponds to a given epsilon represents a draw against a given privacy budget. In the case of public-release microdata or tables, the privacy budget maps to an epsilon value. In the remote analysis server case, the server generates the same internal data set prior to computing and releasing analytic results. In the case of synthetic data, the privacy budget is fixed at the stage of generating the data; by the post-processing property of DP, all subsequent analysis of the synthetic data is also DP and does not deplete the privacy budget further. While the accuracy of data analyses may be greater in the case of the remote analysis server, the privacy budget would be the same as in the case of public-release microdata or tables. In both cases the solution computes draws against the privacy budget based on cleaned and categorized data partially imputed from posterior distributions.
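One simple way to picture a privacy-accuracy frontier is to simulate, for each epsilon, the mean absolute error of a Laplace-noised query. The sketch below does this for a single query with sensitivity 1; the accuracy metric, the function names, and the epsilon grid are assumptions made for illustration, not the estimation procedure of the proposal.

```python
import random

def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def frontier(epsilons, sensitivity=1.0, trials=20000, seed=0):
    """Return (epsilon, mean absolute error) pairs for a Laplace-noised query."""
    rng = random.Random(seed)
    points = []
    for eps in epsilons:
        scale = sensitivity / eps
        mae = sum(abs(laplace(scale, rng)) for _ in range(trials)) / trials
        points.append((eps, mae))
    return points

if __name__ == "__main__":
    for eps, err in frontier([0.1, 0.5, 1.0, 2.0]):
        print(f"epsilon={eps:<4} mean absolute error={err:.3f}")
```

For the Laplace mechanism the expected absolute error equals sensitivity / epsilon, so the simulated frontier should fall as epsilon rises; choosing a point on it corresponds to choosing how much of the privacy budget to spend.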
We have focused on traditional types of data releases at national statistical institutes, including microdata, longitudinal data, administrative data, tabular data and other forms of data where a post-data collection process can be incorporated for protecting the confidentiality of data subjects. One form of data that we have not focused on is streamed data where the data protection needs to be done at the time the data is created. This type of data is not traditionally disseminated at national statistical institutes.
Our proposed approaches can be handled using current computing capabilities within a national statistical institute under a post-data collection confidentiality protection process.
Proposed datasets are listed in the concept paper. We recommend using public-use files that are freely available to download from the web for conducting simulation studies and test cases. Geo-coded variables are generally not included in these data sets, and we recommend generating geo-codes through clustering and modeling assumptions for the purpose of developing the DP algorithms. We can then seek arrangements with national statistical institutes and other data providers for which we have restricted data to test the algorithms on real data.
Many academic papers and reports use these secondary data to fit statistical models of all types. Our proposed secondary data also include longitudinal datasets to which more advanced models have been applied in the literature, for example, multilevel models, event history models, and survival models.
To involve the statistical community in the development of the differential privacy standard, it is important to demonstrate that DP is useful for the dissemination of statistical data by national statistical institutes, other survey organizations, and data archives. Therefore, our focus has been on confidentiality protection for these forms of data release.