We propose to develop DPSyn, an algorithm and open-source software tool for synthesizing microdata while satisfying differential privacy. DPSyn builds on our previous work PriView (published at SIGMOD'14 under the title "PriView: practical differentially private release of marginal contingency tables"). The main idea is to generate a synthetic dataset that approximates many randomly chosen marginal distributions of the input dataset.
Our technique deals with categorical datasets; numerical attributes are first bucketized so that they become categorical. Each marginal is specified by a subset of the attributes, and can be viewed as a projection of the full contingency table onto those attributes. A marginal table consists of L cells; each cell contains a fraction, and all the fractions sum to 1. For example, a marginal table over 8 binary attributes has L=2^8=256 cells, and fully captures the distribution of these 8 attributes. A dataset with 100 binary attributes has a huge number of 8-way marginals (100 choose 8, to be exact). Our approach is to randomly select m marginals of size L (e.g., m=50, L=256), compute these marginals on the input dataset in a way that satisfies differential privacy, and then synthesize a dataset based on these marginals. The rationale for our approach is as follows. Any analysis one may want to conduct on a dataset can be performed using the joint distribution of some subset of attributes. On most subsets of attributes, a synthesized dataset that simultaneously preserves many (from dozens to hundreds of) randomly chosen marginals would have a distribution close to that of the original dataset. Under differential privacy, one can answer counting queries with good accuracy; we therefore use marginals (which are essentially collections of counting queries) to extract information from the input dataset.
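As a concrete illustration of the projection view, a marginal over a subset of attributes can be computed as a normalized frequency table. The following sketch (with a hypothetical toy dataset; the attribute indices and data are illustrative, not from DPSyn) shows a non-private version of this computation:

```python
from collections import Counter

def marginal(records, attrs):
    """Project a categorical dataset onto a subset of attributes and
    return the normalized cell counts (fractions summing to 1).
    `records` is a list of tuples; `attrs` is a tuple of column indices."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    n = len(records)
    return {cell: c / n for cell, c in counts.items()}

# Hypothetical toy dataset with three binary attributes.
data = [(0, 0, 1), (0, 1, 1), (1, 1, 1), (0, 0, 0)]
m = marginal(data, (0, 2))  # 2-way marginal over attributes 0 and 2
# Cell (0, 1) appears in 2 of 4 records, so m[(0, 1)] == 0.5.
```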
Given a dataset as input, the first step is to generate many randomly selected noisy marginals of the dataset. The algorithm decides the parameters m (the number of marginals) and L (the size of each marginal). Analysis in the PriView paper showed that L=256 is approximately optimal for binary attributes; we will study whether the same holds for general categorical attributes. The number m depends on the dataset size and the privacy parameter epsilon. The algorithm then computes these marginals of the input dataset, and finally adds Laplace noise to them so that differential privacy is satisfied. The amount of noise depends on the privacy parameter epsilon and the number of marginals, as publishing each marginal has a global sensitivity of 1.
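Under basic sequential composition, splitting a total budget epsilon evenly over m marginals (each with sensitivity 1 in terms of counts) gives each marginal budget epsilon/m and Laplace scale m/epsilon. A minimal sketch of this noise-injection step, assuming an even budget split (the actual DPSyn allocation may differ):

```python
import numpy as np

def noisy_marginals(count_tables, epsilon, n):
    """Add Laplace noise to a list of marginal count tables so that the
    overall release satisfies epsilon-DP under sequential composition.
    Each marginal has L1 sensitivity 1 (in counts), so each of the m
    marginals gets budget epsilon/m, i.e., Laplace scale m/epsilon."""
    m = len(count_tables)
    scale = m / epsilon
    rng = np.random.default_rng(0)
    # Noise is added to counts, then everything is rescaled to fractions.
    return [(t + rng.laplace(0.0, scale, size=t.shape)) / n
            for t in count_tables]

# Two hypothetical 2x2 marginal count tables from a dataset of n=1000 records.
tables = [np.array([[400., 100.], [300., 200.]]),
          np.array([[700., 50.], [150., 100.]])]
noisy = noisy_marginals(tables, epsilon=1.0, n=1000)
```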
In the second step, we use techniques developed in PriView to make all noisy marginals consistent with each other. The techniques presented in PriView were for binary attributes; we have already extended them to categorical attributes. In PriView, it was shown that one can use these noisy marginals (called private views there) to reconstruct arbitrary marginals with high accuracy. This suggests that these noisy marginals capture a lot of the information in the input dataset, and can be used for a broad range of data analysis tasks.
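One simple ingredient of such post-processing is restoring per-marginal validity after noise injection: a noisy marginal may contain negative cells and may not sum to 1. The sketch below shows only this normalization step; PriView's full consistency technique additionally reconciles overlapping marginals, which is not shown here:

```python
import numpy as np

def normalize_marginal(p):
    """Restore validity of a single noisy marginal: clip negative cells
    to zero and renormalize so the cells sum to 1. (This is a simple
    sketch; it does not perform PriView's cross-marginal consistency.)"""
    q = np.clip(p, 0.0, None)
    s = q.sum()
    return q / s if s > 0 else np.full_like(q, 1.0 / q.size)

noisy = np.array([0.5, -0.1, 0.4, 0.2])  # a hypothetical noisy 4-cell marginal
clean = normalize_marginal(noisy)
# clean has no negative cells and sums to 1
```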
The third step is to generate a synthetic dataset from these private views. PriView only has techniques for reconstructing other marginals; here we propose to develop techniques to synthesize a dataset that approximates these views. This is where we expect to spend most of our effort. We plan to investigate a few alternative methods. One method starts with a randomly generated dataset and gradually changes it to be consistent with the noisy marginals. Another is to use the noisy marginals to construct probabilistic graphical models of the dataset, and then synthesize data from these models. The key challenge is efficiency. To preserve the information in datasets with dozens or more attributes, we expect to use dozens or more noisy marginals, each involving 5-10 attributes. We expect the implemented software tool to be able to generate datasets with millions of records in this setting.
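The first method (start random, gradually adjust) can be sketched as a greedy local search: repeatedly flip one attribute value of one record, and keep the flip only if it reduces the total distance to the target marginals. This is an illustrative sketch for binary attributes, not the final DPSyn algorithm; the function names and the single-marginal target below are our own illustrative choices:

```python
import numpy as np

def synthesize(target, n_records, n_attrs, iters=500, seed=0):
    """Greedy local-search sketch: start from a random binary dataset and
    flip attribute values whenever the flip reduces the total L1 distance
    between the dataset's marginals and the target noisy marginals.
    `target` maps attribute tuples to probability tables (numpy arrays)."""
    rng = np.random.default_rng(seed)
    data = rng.integers(0, 2, size=(n_records, n_attrs))

    def error(d):
        e = 0.0
        for attrs, t in target.items():
            emp = np.zeros_like(t)  # empirical marginal of d on attrs
            for row in d:
                emp[tuple(row[a] for a in attrs)] += 1.0 / len(d)
            e += np.abs(emp - t).sum()
        return e

    cur = error(data)
    for _ in range(iters):
        i = rng.integers(n_records)
        j = rng.integers(n_attrs)
        data[i, j] ^= 1          # tentative flip
        new = error(data)
        if new < cur:
            cur = new            # keep the improvement
        else:
            data[i, j] ^= 1      # revert
    return data

# Hypothetical target: a single 2-way marginal over attributes (0, 1).
target = {(0, 1): np.array([[0.4, 0.1], [0.1, 0.4]])}
synth = synthesize(target, n_records=100, n_attrs=2)
```

Recomputing the full error after every flip is wasteful; an efficient implementation would update only the affected marginal cells incrementally.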
Clearly, the larger m is (i.e., the more marginals), the more marginal information is preserved, and the better the utility. However, a larger m also means more noise in each marginal when the total privacy budget is fixed. Fortunately, a larger dataset means that less privacy budget is needed for each individual marginal: with a dataset 10 times larger, one needs only 1/10 of the privacy budget to obtain a marginal of the same accuracy. Furthermore, if we are willing to go beyond strict epsilon differential privacy and accept weaker notions such as (epsilon, delta) differential privacy, we can use more advanced composition theorems to our advantage. Essentially, if one spends 1/10 of the original budget on each marginal, then one is able to publish many more than 10 times the original number of marginals. This means that our approach is very promising for large datasets.
If non-pure differential privacy is allowed, we will use the Gaussian mechanism instead of the Laplace mechanism, because it has better composition properties, which enables us to publish more marginals under the same privacy budget.
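For intuition, the two calibrations can be compared numerically. The sketch below uses the classic Gaussian mechanism calibration sigma = sqrt(2 ln(1.25/delta)) / epsilon (for sensitivity 1) and an even budget split for Laplace; the specific per-marginal budgets are illustrative assumptions, not DPSyn's actual allocation:

```python
import math

def laplace_std(eps_total, m):
    """Std. dev. of Laplace noise per count cell when a total epsilon
    budget is split evenly over m marginals (L1 sensitivity 1 each):
    scale b = m/eps_total, std = sqrt(2)*b."""
    return math.sqrt(2) * (m / eps_total)

def gaussian_std(eps_per_marginal, delta_per_marginal):
    """Classic Gaussian mechanism calibration for sensitivity 1:
    sigma = sqrt(2 ln(1.25/delta)) / epsilon."""
    return (math.sqrt(2 * math.log(1.25 / delta_per_marginal))
            / eps_per_marginal)

lap = laplace_std(1.0, 50)        # ~70.7 per cell under pure eps=1
gau = gaussian_std(0.1, 1e-6)     # ~53 per cell at (0.1, 1e-6) per marginal
```

Advanced composition then lets each marginal receive a per-marginal epsilon much larger than eps_total/m, which is where the real gain comes from.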
Randomness is also used in selecting the sets of attributes for the marginals, but this is not for the purpose of privacy.
Improvements over previous algorithms
Our approach DPSyn is a general solution that treats all fields equally, regardless of relationships among them. However, if a relationship between the fields and the analysis is known, and the required analysis is known in advance, it is possible to change the way the published marginals are selected. For example, if the goal is to do regression or classification with one attribute serving as the label, it may be beneficial to include that label attribute in all marginals (i.e., each marginal includes the label and a randomly selected set of other attributes). For another example, one can drop columns known to be unrelated to the analysis.
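The label-aware selection strategy can be sketched as follows; the function name and parameters are illustrative, not part of DPSyn's interface:

```python
import random

def select_marginals(attrs, label, m, k, seed=0):
    """Task-aware selection sketch: when a label attribute is known in
    advance, force it into every marginal and fill the remaining k-1
    slots with randomly chosen other attributes."""
    rng = random.Random(seed)
    others = [a for a in attrs if a != label]
    return [tuple([label] + rng.sample(others, k - 1)) for _ in range(m)]

# Select 5 marginals of 4 attributes each; attribute 9 is the label.
marginals = select_marginals(attrs=list(range(10)), label=9, m=5, k=4)
# every selected marginal contains attribute 9
```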
DPSyn is designed as a general-purpose method. The fields are treated equally, and the output contains information about all of them. It is possible that some information will not be well preserved by the synthetic dataset. For example, on a binary dataset of 50 attributes where each attribute is equally likely to be 0 or 1, except that the XOR of all 50 attributes is 0, DPSyn will not be able to preserve this correlation. For such a dataset, any k<50 attributes are mutually independent; only when looking at all 50 attributes together does the correlation appear. Since DPSyn uses k-way marginals, this information is lost. While such pathological examples do exist, our experience with PriView and other prior research leads us to believe that most properties of interest are preserved by the joint information of many marginals.
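This phenomenon can be verified exactly on a small instance. With 4 binary attributes constrained so that their XOR is 0, every 3-way marginal is perfectly uniform, because the omitted attribute is determined by the other three; so no collection of 3-way marginals can reveal the constraint:

```python
import itertools
from collections import Counter

# All records over 4 binary attributes whose XOR is 0 (8 of the 16 tuples).
records = [r for r in itertools.product([0, 1], repeat=4)
           if r[0] ^ r[1] ^ r[2] ^ r[3] == 0]

# Every 3-way marginal of this dataset is exactly uniform: each of the
# 8 possible cells contains exactly one record.
for attrs in itertools.combinations(range(4), 3):
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    assert len(counts) == 8 and all(c == 1 for c in counts.values())
```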
Most existing methods for data analysis under differential privacy are designed for specific tasks. Our approach is more general, and requires no tuning specific to a particular regression, classification, or clustering task. In prior research, we have studied the approach of generating synthetic data from histograms and then performing tasks such as classification and clustering. See, e.g., the following two papers:
Dong Su, Jianneng Cao, Ninghui Li, Min Lyu:
PrivPfC: differentially private data publication for classification. VLDB J. 27(2): 201-223 (2018)
Dong Su, Jianneng Cao, Ninghui Li, Elisa Bertino, Min Lyu, Hongxia Jin:
Differentially Private K-Means Clustering and a Hybrid Approach to Private Optimization. ACM Trans. Priv. Secur. 20(4): 16:1-16:33 (2017)
Surprisingly, we found that this approach of generating synthetic data provides better utility than algorithms specifically designed for classification or clustering. The main limitation of our prior work is that it is for lower-dimensional datasets. When a dataset has many attributes, the histogram involving all attributes (i.e., the full contingency table) becomes too large and too sparse for this approach to work. In PrivPfC, we select just a few attributes that are relevant to the classification task at hand and publish a histogram over them. PrivPfC has two limitations. First, the selection step does not scale when there are more than a few dozen attributes. Second, if one wants to perform any task other than classification, the synthetic dataset constructed from the histogram is unlikely to provide high utility. We have also worked on differentially private frequent itemset mining, which is reported in the following paper.
Ninghui Li, Wahbeh H. Qardaji, Dong Su, Jianneng Cao:
PrivBasis: Frequent Itemset Mining with Differential Privacy. PVLDB 5(11): 1340-1351 (2012)
Even though our paper was published in 2012, and several papers have been published on exactly this topic since then, every such paper uses PrivBasis as the main benchmark for comparison, and no newly proposed approach consistently outperforms PrivBasis. PrivBasis also uses the approach of publishing multiple marginals; however, those marginals were selected so that frequent itemsets are likely to be covered by one or more of them, and the marginals are only used to estimate the counts of itemsets.
Most existing microdata publishing approaches do not satisfy differential privacy and provide only ad hoc privacy protection. However, there are a few proposals for generating synthetic datasets while satisfying differential privacy, which we discuss below.
To publish synthetic datasets, one approach is to train a generative model satisfying differential privacy, and use the generative model to generate a synthetic dataset. For example, PrivBayes (published at SIGMOD'14) publishes a noisy Bayesian network that approximates the data distribution using several low-dimensional marginals (histograms). PrivBayes determines the structure of a Bayesian network by first randomly selecting (using the Exponential Mechanism) an attribute as the first node, and then iteratively selecting another attribute to create a new node, with up to k already-created nodes as the new node's parents. After the structure is determined, PrivBayes perturbs the marginals needed for computing the conditional distributions of the data. We compared against PrivBayes while working on PrivPfC, our previous approach for data classification, and found that PrivBayes does not perform well, because the Bayesian network structure it constructs is often poor.
Besides Bayesian models, other models can also be trained. For example, there are methods that assume the dataset is sampled from some statistical distribution (e.g., a mixture model). These methods first learn parameters from the dataset, and then sample the synthetic dataset from the resulting statistical model. To the best of our knowledge, DPSynthesizer (PVLDB'14) is the only existing work of this kind that satisfies differential privacy. The main limitation of these methods is that only a restricted set of knowledge can be conveyed by the models, and oftentimes the dataset cannot be modeled well. A real-world high-dimensional dataset is unlikely to be well approximated by a model with only a few parameters.
There is also work that uses deep neural networks. For example, Differentially Private Releasing via Deep Generative Model (preprint on arXiv) trains a generative adversarial network (GAN) with differential privacy, by injecting random noise into the optimization procedure (e.g., stochastic gradient descent). However, the purpose of a GAN is to generate data records that look authentic, rather than to follow the original distribution. While we do not have direct experience experimenting with this approach, we do not believe that it can provide high utility.
We optimize utility by taking advantage of the most advanced composition theorems. That is, given the privacy budget (epsilon, delta) and the dataset column count d, we optimize the number of marginals m and the size of each marginal L (we also use l to denote the number of columns in one marginal, so L=2^l for binary attributes), so that the overall expected error is minimized. Specifically, for any combination of l and m, we first calculate the privacy budget (epsilon', delta') allocated to each marginal, using the numerical method proposed in Privacy buckets: Upper and lower bounds for r-fold approximate differential privacy (preprint on arXiv, 2018). We then calculate the expected utility, elaborated below. Finally, the l and m with the minimal expected error are chosen as the configuration of DPSyn.
To calculate the expected utility, we compute the expected amount of noise. Specifically, for any k-way marginal (k is an integer between 3 and 8; we found that the influence of k is not significant), we compute the squared difference between the ground-truth value and the noisy value for each cell, and take the average. Since Laplace/Gaussian noise is added, the variance is known: when epsilon' and delta' are given, we can calculate the amount of noise added, and then derive the variance.
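Putting this together, the expected squared error of one noisy marginal of L = 2^l cells is L times the per-cell noise variance, scaled by 1/n^2 when working with fractions (counts divided by n). The sketch below uses the standard variances, Var = 2/epsilon'^2 for Laplace and sigma^2 with the classic calibration for Gaussian; the function and its parameters are illustrative:

```python
import math

def expected_sse(eps_per_marginal, l, n, mechanism="laplace", delta=None):
    """Expected sum of squared error over one noisy marginal of L = 2**l
    cells, for a dataset of n records. Noise is added to counts, then
    scaled by 1/n to get fractions, so the variance scales by 1/n**2."""
    L = 2 ** l
    if mechanism == "laplace":
        var = 2.0 / eps_per_marginal ** 2      # Var of Lap(1/eps')
    else:
        sigma = math.sqrt(2 * math.log(1.25 / delta)) / eps_per_marginal
        var = sigma ** 2                       # classic Gaussian calibration
    return L * var / n ** 2

# Example: l = 8 (L = 256 cells), eps' = 0.02, n = 1e6 records.
err = expected_sse(0.02, 8, 1_000_000)
```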
DPSyn is designed to work with static, relational, categorical data that is information rich, so that different analysts will want to perform different analysis tasks. For example, census data is a good application for DPSyn. DPSyn can handle dozens of attributes (even more than one hundred if the attributes are binary). We expect the following tasks to be handled well: identifying relationships among attributes, causal inference, anomaly detection, and so on. Since a synthetic dataset is generated, we expect these tasks to achieve roughly the same performance as on the original data.
Below we summarize a few cases where we expect DPSyn not to work well. Note that these cases are equally hard for most other DP-based methods.
First, when the data is updated. Like many other privacy-preserving algorithms, DPSyn cannot currently handle data that is constantly updated, because each release consumes some privacy budget. It is possible, though, to run DPSyn a limited number of times with a split privacy budget.
Second, when many important attributes are numerical and their precision matters greatly. That is, if bucketizing the numerical values already results in too much information loss, then DPSyn does not work well. Furthermore, DPSyn cannot handle content data where the domain is very large or unbounded, such as text or audio/video data.
Third, when the dataset is not relational. Examples include graph data and streaming data. The original notion of differential privacy was proposed to handle counting queries on relational databases. Over the years, different notions have been proposed to handle other data types, e.g., location indistinguishability. But since DPSyn is built on differential privacy, it will have difficulty handling data types other than relational data. Adopting other privacy notions will be an interesting future direction.
In the first step, the method scans the dataset to construct the marginals. The computation grows linearly with the size of the dataset and with the number of marginals. After that, injecting noise is fast, but making the marginals consistent requires more time, since it is essentially an optimization problem. The theoretical running time is unclear, but in our tests the optimization always terminates quickly and outputs reasonable results. Finally, the synthesis step is linear in the size of the target dataset and the number of noisy marginals.
We run experiments on the following four datasets.
- POS (Real world performance of association rule algorithms, SIGKDD'01): A dataset containing merchant transactions of half a million users.
- Kosarak (Frequent itemset mining dataset repository, http://fimi.ua.ac.be/data/): A dataset of click streams on a Hungarian website, containing around one million users.
- Adult (UCI machine learning repository, 2010): A dataset from the UCI machine learning repository. After removing missing values, the dataset contains around 50 thousand records. The numerical attributes are bucketized into categorical attributes.
- US: A dataset from the Integrated Public Use Microdata Series (IPUMS). It has around three million records of the United States census in 2016. There are over one hundred categorical attributes in this dataset.
For these datasets, there exist established analyses for regression/classification/clustering tasks.
Sum of Squared Error (SSE) is the metric we use to measure the accuracy of the marginal information. That is, we compute the ground-truth marginals and calculate the sum of squared differences over all cells.
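The metric itself is straightforward; a minimal sketch on a hypothetical 4-cell marginal:

```python
import numpy as np

def sse(true_marginal, noisy_marginal):
    """Sum of squared error between a ground-truth marginal and a
    noisy/synthetic one, cell by cell."""
    diff = np.asarray(true_marginal) - np.asarray(noisy_marginal)
    return float(np.sum(diff ** 2))

truth = np.array([0.25, 0.25, 0.25, 0.25])
noisy = np.array([0.30, 0.20, 0.25, 0.25])
# sse(truth, noisy) is 0.05**2 + 0.05**2, i.e. about 0.005
```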
For each specific problem, we use an appropriate metric such as the misclassification rate.