>>> Advancing Methods in Differential Privacy >>>
Real solutions from the statistical community for differentially private and high quality data releases by national statistical institutes.
Solution Overview - Describe how your approach optimizes the balance of differential privacy and data analysis utility.
A package of methods in statistical disclosure control (SDC) and differential privacy (DP) protects microdata files that contain numerical, geo-spatial, and categorical data from linkage and inferential attacks by intruders. Algorithms include standard SDC approaches, randomized mechanisms, and additive noise, with the aim of balancing privacy and utility for a range of statistical analysis methods: exploratory analysis; generation of count data; regression, classification, and clustering analyses; and other statistical and machine learning (ML) models to be specified. Because a DP mechanism need not be secret, the parameters of the noise addition or randomization mechanism can be made public, and researchers can account for the perturbation in their statistical inferences.
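As an illustration of how published mechanism parameters let researchers correct their inferences, the sketch below applies randomized response to a binary attribute and then inverts the known perturbation. It is a minimal example under stated assumptions, not the mechanism proposed here: the function names, the truth probability p_truth, and the simulated data are all hypothetical.

```python
import random

def randomized_response(values, p_truth, rng):
    """Report the true 0/1 value with probability p_truth, else a fair coin flip."""
    reported = []
    for v in values:
        if rng.random() < p_truth:
            reported.append(v)
        else:
            reported.append(1 if rng.random() < 0.5 else 0)
    return reported

def debiased_proportion(reported, p_truth):
    """Invert E[mean(reported)] = p_truth * pi + (1 - p_truth) / 2 to recover pi."""
    observed = sum(reported) / len(reported)
    return (observed - (1 - p_truth) / 2) / p_truth

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
true_values = [1] * 3000 + [0] * 7000       # hypothetical data, true proportion 0.3
reported = randomized_response(true_values, p_truth=0.5, rng=rng)
estimate = debiased_proportion(reported, p_truth=0.5)
print(round(estimate, 3))
```

With p_truth = 0.5, each report matches the true value with probability 0.75, which corresponds to epsilon = ln 3 for this mechanism. Because p_truth is public, the analyst can remove the bias without seeing any true value.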
The pragmatic solution comes from the statistical community and mitigates the impact of privacy constraints on more widely used statistics and data products of national statistical institutes in particular and data providers in general. A privacy mechanism generates public microdata and tables with numeric precision and categorizations sufficient to serve the purpose of the data release, but with regularized and imputed values as needed to prevent violations of a privacy guarantee.
Simulations and systematic testing have made (and will make) the privacy mechanism more robust to attacks. The design of the mechanism and the statistical/ML expertise of the developers support previewing of potential intruder attacks and selecting combinations of methods required to counter them. Considering the rapid pace of development of data sources and integration technologies, a static privacy mechanism would not be expected to work effectively for long, much less extend to different forms of public releases of data and statistics. The proposed mechanism applies combinations of methods in layers and through pipelines. More along the lines of a spam filter than an encryption algorithm, it has to adapt to new threats as they appear.
Which randomization mechanisms does your solution use?
If other, please list and explain
We carry out an initial step in our proposal for data protection by using standard SDC methods: (1) consultation with users on what data are needed and at what level of aggregation; (2) suppression of variables that are not needed; (3) coarsening of quasi-identifiers to satisfy criteria such as k-anonymity.
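Step (3) can be sketched as follows: coarsen a quasi-identifier and check the size of the smallest group of records that share the same quasi-identifier values. This is a minimal illustration, assuming hypothetical records with an age and a region field and a ten-year age band; it is not the coarsening rule of the proposal.

```python
from collections import Counter

def coarsen_age(age, width=10):
    """Map an exact age to a band such as '20-29'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

def smallest_class_size(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values (the k in k-anonymity)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())

# Hypothetical microdata: every exact age is unique, so k = 1 before coarsening.
records = [
    {"age": 23, "region": "N"},
    {"age": 27, "region": "N"},
    {"age": 25, "region": "N"},
    {"age": 41, "region": "S"},
    {"age": 44, "region": "S"},
    {"age": 48, "region": "S"},
]

k_before = smallest_class_size(records, ["age", "region"])
coarsened = [{**r, "age": coarsen_age(r["age"])} for r in records]
k_after = smallest_class_size(coarsened, ["age", "region"])
print(k_before, k_after)
```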
Is your proposed solution an improvement or modification of previous algorithms in differential privacy or a substantially new algorithm?
improvement of previous algorithms
Provided that there is a known relationship between the fields and the analysis (research question such as regression, classification, clustering), how does your approach determine the number and order of the randomized mechanism being utilized?
Whether or not a known relationship exists among fields in a private data set, using prior information to generate a Bayesian posterior distribution yields data that improve parameter estimates for regression, classification, or clustering models. The proposed approach for synthetic data generation assumes a non-informative prior and allows the data to drive parameter estimation, albeit with a small amount of Laplace noise added to each estimating equation. All variables are synthesized in turn using sequential multiple regression models. Noise is added to the estimating equations in a similar way for the statistical methods available via the remote analysis server.
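One simple way to read "Laplace noise added to each estimating equation" for a linear regression is to perturb the sums that define the normal equations and then solve the noisy system. The sketch below does this for a single predictor. It is an assumption-laden illustration (often called sufficient statistics perturbation), not necessarily the estimator intended here: the clipping to [0, 1], the sensitivity bound of 5, the function names, and the simulated data are all hypothetical.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) draw as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def clip01(v):
    return min(max(v, 0.0), 1.0)

def noisy_linear_fit(xs, ys, epsilon, rng):
    """Fit y = a + b*x from Laplace-perturbed normal-equation sums."""
    xs = [clip01(x) for x in xs]            # bound each record's contribution
    ys = [clip01(y) for y in ys]
    stats = [
        float(len(xs)),                     # n
        sum(xs),                            # sum of x
        sum(x * x for x in xs),             # sum of x^2
        sum(ys),                            # sum of y
        sum(x * y for x, y in zip(xs, ys)), # sum of x*y
    ]
    # Assumption: adding or removing one clipped record changes each of the
    # five sums by at most 1, so the L1 sensitivity is at most 5.
    scale = 5.0 / epsilon
    n, sx, sxx, sy, sxy = (s + laplace(scale, rng) for s in stats)
    det = n * sxx - sx * sx                 # can be unstable for small samples
    slope = (n * sxy - sx * sy) / det
    intercept = (sy - slope * sx) / n
    return intercept, slope

rng = random.Random(1)
xs = [rng.random() for _ in range(20000)]              # hypothetical predictor
ys = [0.2 + 0.5 * x + rng.gauss(0, 0.05) for x in xs]  # hypothetical outcome
intercept, slope = noisy_linear_fit(xs, ys, epsilon=1.0, rng=rng)
print(round(intercept, 3), round(slope, 3))
```

In a sequential synthesis, a fit of this kind would be repeated for each variable in turn, with the total privacy budget divided across the fitted models.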
Provided that there is NOT a known relationship between the fields and the analysis (unknown research question), is there a prescribed sequence of privacy techniques that will always perform the best regardless of data?
See response to prior question.
How does your proposed solution differ from existing solutions? What are the advantages vs existing solutions? Disadvantages?
It puts a sharper focus on methods that serve the public interest in high quality data and research findings according to the objectives of a national statistical institute. A narrower focus on yields of queries and development of new algorithms discounts the advantages of using combinations of methods in layers or pipelines that at the same time reduce bias or error in statistical estimates and the risk of exposure of private information. The perspective of the statistical community combines a broader public interest in accurate, representative, and reliable statistics and data products with best practices for using samples and estimates as a basis for scientific inference. The sequential regression modeling allows all forms of variables to be synthesized together and so better preserves their joint distributions. Under the remote analysis server, higher utility is likely because the noise is added at the output level.
How well does your solution optimize utility for a given privacy budget (the utility-privacy frontier curve) and how does it accomplish this for each of the research classes (regression, classification, clustering, and unknown research question) and each of the data types (numeric, geo-spatial, and class)?
It estimates the parameters of a privacy-accuracy frontier (PAF) for a range of DP epsilon values. The point on the PAF that corresponds to an epsilon represents a draw against a given privacy budget. In the case of public release microdata or tables, the privacy budget maps to an epsilon value. In the remote analysis server case, the server generates the same internal data set prior to computing and releasing analytic results. In the case of synthetic data, the privacy budget is spent once, at the stage of generating the data; all subsequent analysis of the synthetic data is post-processing, so it is also DP and does not further deplete the privacy budget. While the accuracy of data analyses may be greater in the case of the remote analysis server, the privacy budget would be the same as in the case of public release microdata or tables. In both cases, the solution computes draws against the privacy budget based on cleaned and categorized data partially imputed from posterior distributions.
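A frontier of this kind can be traced empirically by simulating a mechanism over a grid of epsilon values and recording an error measure at each point. The sketch below does so for a Laplace-perturbed count. The choice of mean absolute error as the accuracy measure, the sensitivity of 1, the function names, and the epsilon grid are assumptions made for illustration only.

```python
import random

def laplace(scale, rng):
    """Laplace(0, scale) draw as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def frontier(epsilons, sensitivity=1.0, reps=2000, seed=0):
    """Return (epsilon, simulated mean absolute error) pairs for a noisy count."""
    rng = random.Random(seed)
    points = []
    for eps in epsilons:
        scale = sensitivity / eps           # Laplace mechanism noise scale
        mae = sum(abs(laplace(scale, rng)) for _ in range(reps)) / reps
        points.append((eps, mae))
    return points

paf = frontier([0.1, 0.5, 1.0, 2.0])
for eps, mae in paf:
    print(f"epsilon={eps:<4} mean absolute error={mae:.2f}")
```

For Laplace noise the expected absolute error equals sensitivity / epsilon, so the simulated points should fall close to that curve; for mechanisms without a closed form, the same loop yields an empirical frontier.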
Describe other data types and/or research questions that your Solution would handle well. How would performance (in terms of privacy and utility) be maintained and why? Describe other data types and/or research questions that your Solution would not handle well. How would performance (in terms of privacy and utility) degrade and why?
We have focused on traditional types of data releases at national statistical institutes, including microdata, longitudinal data, administrative data, tabular data, and other forms of data where a process applied after data collection can protect the confidentiality of data subjects. One form of data that we have not focused on is streamed data, where the protection needs to be applied at the time the data are created. This type of data is not traditionally disseminated by national statistical institutes.
How do the resource requirements for your Solution scale with the amount of data? Describe how the computational requirements of your Solution at different volumes of data can be handled using current computing technological capabilities. If your Solution requires advances in technology, describe your vision and anticipated availability for the types and scope of technological advances necessary.
Our proposed approaches can be handled using current computing capabilities within a national statistical institute under a post-data collection confidentiality protection process.
Please reference a dataset you suggest utilizing as a use case for testing algorithms. Is there existing regression, classification, and clustering analysis of this data? If so, please describe.
Proposed datasets are listed in the concept paper. We recommend using public-use files that are freely available to download from the web for conducting simulation studies and test cases. Geo-coded variables are generally not included in these data sets, and we recommend generating geo-codes through clustering and modeling assumptions for the purpose of developing the DP algorithms. We can then seek arrangements with national statistical institutes and other data providers for which we have restricted data to test the algorithms on real data.
Using secondary data, there are many academic papers and reports where all types of statistical models are carried out. Our proposed secondary data also include longitudinal datasets where more advanced modeling has been carried out in the literature, for example, multilevel models, event history models and survival models.
Propose an evaluation method for future algorithm testing competitions.
To get the statistical community involved with the development of the differential privacy standard, it is important to demonstrate that DP is useful for the dissemination of statistical data carried out at national statistical institutes and other survey organizations and data archives. Therefore, our focus has been on the confidentiality protection of these forms of data releases.
Document upload - Submit a supporting PDF. Note that the submission should stand alone without the attachment.