Submission

introduction
title
Private Synthetic Data Generation via GANs
short description
We use a Differentially Private Generative Adversarial Network (DP-GAN) to generate private synthetic data for analysis tasks.
Solution Overview - Describe how your approach optimizes the balance of differential privacy and data analysis utility.
Our approach is to generate differentially private synthetic data using Generative Adversarial Networks (GANs). This synthetic data can then be used for a variety of analysis tasks, including classification, regression, clustering, and answering unknown research questions. If the synthetic data are statistically similar to the original (sensitive) data, then analysis on the synthetic data should be accurate with respect to the original database. By generating synthetic data privately, any future analysis on the data will also be private, due to the post-processing guarantees of differential privacy.

GANs are a type of generative model, in which two neural networks, commonly known as the Generator (G_y) and Discriminator (D_w), are trained against each other in a zero-sum game. These neural networks are parameterized by their edge weights---y and w for G_y and D_w, respectively---which specify the function computed by each network. The Generator takes as input a random vector drawn from a known distribution, and produces a new datapoint that (hopefully) has a similar distribution to the true data distribution. If we are given a finite-size database, then the true data distribution can be interpreted as the empirical distribution that would arise from sampling entries of the database with replacement. The Discriminator then tries to detect whether this new datapoint is from the Generator or from the true data distribution. If the Discriminator is too successful in distinguishing between the Generator’s outputs and the true data, then this feedback is used to improve the Generator’s data generation process.

Note that the Discriminator has access to the true data, while the Generator only receives feedback about the true data through the Discriminator’s output. This will be useful in designing differentially private GANs, since only the Discriminator must be made to satisfy differential privacy. The Generator’s update from the Discriminator’s output is simply post-processing, and will therefore be differentially private as well.

We now describe the design of differentially private GANs (DP-GANs), where the Discriminator is trained in a differentially private manner. GANs are typically trained using iterative stochastic gradient descent updates. Once we know the sensitivity of each update to the Discriminator, we add noise proportional to that sensitivity using the Gaussian Mechanism. The overall privacy guarantee of the algorithm follows from composition of these private updates.

We reduce the sensitivity of these gradient descent updates (and hence improve overall accuracy) by clipping the stochastic gradients, essentially ensuring that the gradient will lie in a bounded range. This allows for an upper bound on the magnitude of each update, and hence the sensitivity. We then only need to add noise proportional to the sensitivity to ensure differential privacy. More details of our gradient clipping procedure can be found in the Supporting PDF.
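To make this concrete, here is a minimal sketch of the clip-then-noise step in Python/NumPy. The clipping bound C and noise-multiplier parameter k are assumptions set by the privacy analysis; this illustrates the Gaussian Mechanism applied to a gradient, not our full implementation.

```python
import numpy as np

def gaussian_mechanism_update(grad, C, k, rng=np.random.default_rng()):
    """Clip a gradient to L2 norm at most C, then add Gaussian noise scaled to C."""
    # Clipping bounds the sensitivity of the update at C.
    g_clip = grad * min(1.0, C / (np.linalg.norm(grad) + 1e-12))
    # Noise with standard deviation k*C, proportional to the sensitivity bound.
    return g_clip + rng.normal(0.0, k * C, size=grad.shape)
```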

Once training is complete, our Generator can generate private synthetic data that closely approximates the original data with respect to standard statistical measures (losing only 5-20% of the statistical scores) within a practical privacy limit (epsilon < 10, delta < 10^{-5}). The private synthetic data also retains high accuracy for common machine learning tasks, losing only 1-3% in accuracy relative to non-private synthetic data, when both are generated from moderate-sized datasets. We elaborate on our accuracy metrics in the answers to later questions.

We further improve accuracy using several modifications to the standard GANs training procedure, including smart clipping techniques, warm starts in training, parameter optimization, and tailoring GAN architecture for various data types. Technical details of these modifications will be highlighted throughout the remainder of the submission and in the Supporting PDF. Some of these modifications are analysis specific (e.g., classification, regression, or clustering), while some are general-purpose modifications to optimize the training procedure.

Our algorithm enjoys formal differential privacy guarantees through standard composition. In practice, we can employ a moments accountant (see, e.g., Abadi et al. 2016) to give even tighter privacy bounds. Suppose our algorithm is a composition of T updates, each of which is (eps, delta)-differentially private. Using a moments accountant yields (O(eps T^0.5), delta)-differential privacy for the overall algorithm. Relative to advanced composition, this saves a factor of log(1/delta)^0.5 in the epsilon parameter and a factor of T (>> 1) in the delta, which is a significant improvement. For the same privacy budget, this method yields far better training accuracy than an advanced-composition-based analysis. More details of the moments accountant are in the Supporting PDF.
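As a back-of-the-envelope illustration (the numbers below are assumptions, not our actual budget), one can compare the epsilon given by advanced composition against the O(eps T^0.5) scaling of the moments accountant:

```python
import math

def advanced_composition_eps(eps, T, delta_prime):
    # Epsilon after T-fold advanced composition (Dwork-Rothblum-Vadhan);
    # the overall delta becomes T*delta + delta_prime.
    return eps * math.sqrt(2 * T * math.log(1 / delta_prime)) + T * eps * (math.exp(eps) - 1)

eps_step, T, delta = 0.01, 10_000, 1e-5   # assumed per-update budget and update count
print(advanced_composition_eps(eps_step, T, delta))  # ~5.8
print(eps_step * math.sqrt(T))                       # ~1.0: the O(eps*sqrt(T)) scale
```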
Which randomization mechanisms does your solution use?
Report Noisy Max Mechanism
Other
If other, please list and explain
Gaussian Mechanism
Is your proposed solution an improvement or modification of previous algorithms in differential privacy or a substantially new algorithm?
improvement of previous algorithms
Provided that there is a known relationship between the fields and the analysis (research question such as regression, classification, clustering), how does your approach determine the number and order of the randomized mechanism being utilized?
The algorithmic underpinnings of our approach will be the same for all data analysis tasks: privately generate synthetic data, and use the synthetic data to perform the analysis task. We first describe the algorithmic construction of DP-GANs, and then describe how the algorithm can be tailored to improve accuracy for specific analysis tasks (e.g., regression, classification, clustering).

DP-GANs are trained using noisy clipped stochastic gradient descent updates. Let y and w respectively denote the edge weight parameters of the Generator (G_y) and Discriminator (D_w). An objective of interest for our GAN is the standard minimax objective (which, at the optimal Discriminator, corresponds to the Jensen-Shannon divergence between the true and generated distributions, up to constants), defined as follows:
O(w,y):=E_{x~p_true}[log D_w(x)] + E_{z~p_z}[log(1-D_w(G_y(z)))],
where p_true is the true data distribution and p_z is the distribution of the input noise provided to the Generator. This objective is the value of the zero-sum game played by D and G. In the min-max form of the game, D chooses w to maximize O(w,y) and G chooses y to minimize O(w,y). Their equilibrium strategies will achieve objective value min_y max_w O(w,y). However, since O(w,y) is a non-convex non-concave objective, these optimal strategies are typically not efficiently computable. Instead, we use gradient descent-ascent schemes to allow D and G to iteratively learn their optimal strategies, with added noise to ensure differential privacy. As pointed out before, since only D_w(x) has access to the true data, we need only make D_w private to ensure that the entire GAN is differentially private.

Our (simplified) DP-GAN training procedure is an iterative process with an inner and an outer iteration loop. In each inner iteration, we train the Discriminator using sample batch gradients. We generate m samples x_i and z_i for i in {1, … , m} from p_true and p_z, respectively. We define our objective evaluated on this sample to be:
O_sample(w,y) = (1/m)* sum_i[ log D_w(x_i) + log(1-D_w(G_y(z_i))) ],
where sum_i indicates a sum over all i from 1 to m of the quantity inside the square brackets. Our algorithm then computes the gradient of O_sample with respect to w, and we refer to this gradient as g_w. We clip the gradient g_w to ensure that its norm is upper bounded by some constant C. We call this clipped gradient g_clip_w. To ensure differential privacy, we add Gaussian noise with mean 0 and standard deviation kC (i.e., noise drawn from N(0, (kC)^2)) to g_clip_w, where k is an input noise-multiplier parameter; we refer to this noisy clipped gradient as g_noise_clip_w. Finally, we update w using gradient ascent, i.e., w <-- w + s_D* g_noise_clip_w for step size s_D. It is possible to use more advanced algorithms for this update step (such as Adam or Adagrad, which are commonly used in deep learning), but we use gradient ascent here for simplicity of presentation.

After enough inner iterations for Discriminator D_w to converge (usually a small constant such as 4), we run the outer iteration to update the Generator G_y. This trains G_y through feedback from the updated D_w. The Generator’s parameter y is updated using the same sample batch gradient descent method, although we do not perform clipping or noising on the Generator’s gradient g_y. The update step is simply y <-- y - s_G* g_y for step size s_G. Note that the gradient of O_sample with respect to y does not depend on the sampled data x_i, so no additional noise is needed to preserve privacy. We only perform a single update step to y in each outer iteration.
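The following PyTorch sketch shows one inner (Discriminator) and one outer (Generator) update. The toy architectures, batch size, and hyperparameters are assumptions for illustration, and, following the simplified presentation above, it clips the aggregate batch gradient g_w (per-example clipping, as in Abadi et al., is a common alternative):

```python
import torch
import torch.nn as nn

# Toy networks for 8-dimensional tabular data (illustrative assumptions).
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
D = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
C, k, s_D, s_G, m = 1.0, 1.1, 1e-3, 1e-3, 64  # clip bound, noise multiplier, step sizes, batch size

def discriminator_step(x_real):
    """Inner iteration: noisy clipped gradient ascent on O_sample with respect to w."""
    z = torch.randn(m, 16)
    obj = torch.log(D(x_real)).mean() + torch.log(1 - D(G(z).detach())).mean()
    D.zero_grad()
    obj.backward()
    # Clip g_w to norm at most C, then add N(0, (kC)^2) noise to each coordinate.
    total_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in D.parameters()))
    scale = min(1.0, C / (total_norm.item() + 1e-12))
    with torch.no_grad():
        for p in D.parameters():
            g_noise_clip = p.grad * scale + k * C * torch.randn_like(p.grad)
            p += s_D * g_noise_clip          # ascent: w <- w + s_D * g_noise_clip_w

def generator_step():
    """Outer iteration: plain (noise-free) gradient descent on O_sample with respect to y."""
    z = torch.randn(m, 16)
    obj = torch.log(1 - D(G(z))).mean()      # the E[log D_w(x)] term does not depend on y
    G.zero_grad()
    obj.backward()
    with torch.no_grad():
        for p in G.parameters():
            p -= s_G * p.grad                # descent: y <- y - s_G * g_y
```

Running several discriminator_step calls (e.g., 4) per generator_step, while an accountant tracks the accumulated privacy loss, reproduces the inner/outer structure described above.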

The algorithm keeps track of its accumulated privacy loss from the updates to D_w, and runs the outer iteration until its privacy budget is exhausted. Multiple updates to D_w are performed with noisy clipped gradients for each update to G_y. This ensures that G_y is always trained with respect to the current best differentially private D_w. The complete algorithm and a statement of its privacy guarantees can be found in the Supporting PDF.

When the research task is known in advance (e.g., regression, classification, clustering), the algorithm can be adapted to significantly improve accuracy. We will first reserve some true data to privately train an appropriate model (e.g., a random forest model for classification) up to some satisfactory accuracy on the true data. We will then check the accuracy of this model on synthetic data generated by the Generator at each outer iteration. Since each iteration of the Generator uses a different parameter y, this gives us many different observations of the DP-GAN's performance under different parameter settings. We then use Noisy Argmax to choose a few top-performing sets of trained DP-GAN models over time. This will only affect our privacy budget by a constant factor, and can lead to substantial accuracy improvements. More details can be found in the Supporting PDF.
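A minimal sketch of the Report Noisy Argmax selection step follows; the per-candidate score sensitivity is an input that the analyst must bound, and the Laplace scale 2*sensitivity/epsilon is the standard choice for general score functions:

```python
import numpy as np

def report_noisy_argmax(scores, sensitivity, epsilon, rng=np.random.default_rng()):
    """Privately select the index of the best-scoring candidate model."""
    # Add Laplace noise to each score; only the winning index is released.
    noise = rng.laplace(scale=2 * sensitivity / epsilon, size=len(scores))
    return int(np.argmax(np.asarray(scores) + noise))
```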
Provided that there is NOT a known relationship between the fields and the analysis (unknown research question), is there a prescribed sequence of privacy techniques that will always perform the best regardless of data?
One significant advantage of our synthetic data generation approach is its broad applicability to unknown research questions. The approach described in our answer to the previous question (without the final Noisy Argmax post-processing step) will continue to perform well for unknown research questions. Even without any prior knowledge of the data or analysis task, DP-GANs will continue to privately generate synthetic data that closely matches the true data, and thus should be accurate for a wide variety of research questions.

Further, GANs are an active area of study within the machine learning community, with much ongoing research to extend their applicability to new data types and machine learning tasks. Using DP-GANs to generate synthetic data gives us flexibility to utilize future advances in machine learning as they arise.
How does your proposed solution differ from existing solutions? What are the advantages vs existing solutions? Disadvantages?
Our approach for private data analysis is to generate synthetic data rather than to answer a pre-specified set of queries. Although there are many theoretically promising differentially private algorithms in the query-based model, these algorithms can be inflexible for complicated analysis tasks if the wrong query class is given as input. In practice, data analysts would like to perform different machine learning tasks adaptively over time, so specifying a fixed query class upfront greatly limits their flexibility for future analysis.

There have been many recent works on privately generating synthetic data (see, e.g., a survey by Surendra and Mohan, 2017). We now detail the advantages of our solution over these previous approaches.

Firstly, many algorithms require a class of statistical properties or machine learning tasks as an input, and then generate synthetic data that are representative with respect to only those statistical measures or learning tasks. This approach, like the query-based model, causes analysts to lose flexibility to perform creative analysis on the data in the future. DP-GANs overcome this inflexibility because they do not require data analysis tasks to be specified upfront.

Secondly, many existing algorithms require impractical running time and space. For example, SmallDB (by Blum, Ligett, and Roth) requires running time exponential in the number of data entries. Private Multiplicative Weights (by Hardt and Rothblum) is not space efficient when the size of the data universe is large, which is often the case in practice. DP-GANs, however, can be trained within hours for moderate datasets, and many practical optimizations for training GANs have been developed for large applications. See our answer to the question on resource requirements for more details.

An additional advantage of DP-GANs is the ease of communicating and publishing the final product. To share their private output, query-based algorithms would need to publish answers to many queries, and algorithms that generate synthetic data would naturally want to publish synthesized data. For DP-GANs, however, it is sufficient to publish only the trained parameters of the DP-GAN. This is much more efficient in terms of space- and communication-complexity because the sets of parameters are significantly smaller than the original dataset. An analyst can use these parameters to generate her own arbitrarily large dataset. She can additionally continue training the DP-GAN using her own data, starting from the published parameters. This issue of dynamic and growing data is discussed in more detail later in the question on additional data types for which our algorithm performs well.
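As an illustration, an analyst who receives only the published parameters could regenerate arbitrarily much synthetic data locally. The architecture and file name below are assumptions for the sketch:

```python
import torch
import torch.nn as nn

# Rebuild the published Generator architecture (assumed here) and load the
# DP-trained weights; only parameters, never sensitive data, are shared.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
G.load_state_dict(torch.load("dp_gan_generator.pt"))  # hypothetical file name
with torch.no_grad():
    synthetic = G(torch.randn(1_000_000, 16))  # as many synthetic rows as desired
```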

The main disadvantage of DP-GANs, as with most neural network based learning models, is that no theoretical accuracy guarantees are known. In our case, it means that there is no formal guarantee that synthetic data from trained DP-GANs will share the same statistical properties as the true data. GANs and DP-GANs have been observed to perform well in practice in many application domains. As GANs and deep learning are active topics of research in the machine learning community, we expect to have a better understanding of the practical accuracy guarantees of GANs and DP-GANs in the near future. Our DP-GANs will enjoy ongoing advances in machine learning research as they are developed.

One other small disadvantage of our approach is that the combination of optimization techniques can complicate the task of communicating to the public how privacy is preserved. If the analyst works for a company or government organization that strives for transparency, this communication may be complex. Achieving transparency may also require more work in their front-end implementation if the algorithm is to be published for future use by others.

Our contribution beyond existing work on DP-GANs is to propose a technically promising way to combine state-of-the-art results in neural net architectures, GANs, and optimization techniques in DP-GANs. Specifically, we propose to combine the optimization techniques of smart gradient clipping, warm starting, and task-specific model subselection with several known architectures for GANs to handle different types of data. Each of these techniques is described in more detail in the answer to the next question and in the Supporting PDF.
How well does your solution optimize utility for a given privacy budget (the utility-privacy frontier curve) and how does it accomplish this for each of the research classes (regression, classification, clustering, and unknown research question) and each of the data types (numeric, geo-spatial, and class)?
We use two types of metrics to evaluate the utility and accuracy of our synthetic data. The first metric is simply the closeness of our private synthetic data to the true database. Note that this metric is applicable to any research class, including regression, classification, clustering, or answering an unknown future research question. Previous work has shown that DP-GANs are able to produce synthetic data that are close to the true data (within 5-20% for common statistical measures), using a reasonable privacy budget (epsilon < 10, delta < 10^{-5}). The second metric compares the accuracy of a specific analysis task (e.g., regression, classification, clustering) performed on both the synthetic data and the true data. Previous work has also shown that private synthetic data generated by DP-GANs loses only 1-3% in accuracy relative to both non-private synthetic data and true data for several common machine learning tasks. More details on these accuracy metrics can be found in our answer to the question on evaluation methods.

Beyond simply noting the success of DP-GANs in previous work, we propose additional extensions and modifications that we expect will further improve the utility-privacy frontier. As described in previous answers, DP-GANs work by clipping the gradient (i.e., bounding the sensitivity of an update) in each step of gradient ascent/descent in GAN training, and adding multivariate Gaussian noise to the gradient based on its sensitivity. This noise injection makes each update step differentially private, and overall privacy loss is accumulated as training progresses. We propose the following four modifications to improve utility for a given privacy budget. More details of each modification technique are given in the Supporting PDF.

First, different parameters in a neural network may have different gradients, and hence ought to be clipped and injected with noise differently. This is particularly relevant for gradient coordinates that are small in magnitude, as adding large amounts of noise to these coordinates may significantly harm accuracy. To address this, we use smart clipping techniques to group gradient coordinates according to their relative magnitude, and add noise that only scales with the maximum magnitude in the group. Note that this is a grouping of parameters and not private data entries. Each group is clipped and the corresponding gradient is made appropriately noisy, so the update will remain private with respect to the true data. We also adaptively choose the amount of clipping over time, which further improves accuracy.
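A sketch of this grouped clipping idea follows; how parameters are grouped and the per-group bounds C_j are design choices (assumptions here), not fixed by the algorithm:

```python
import numpy as np

def grouped_clip_and_noise(grad_groups, clip_norms, k, rng=np.random.default_rng()):
    """Clip each parameter group's gradient to its own bound and noise it accordingly."""
    noisy = []
    for g, C in zip(grad_groups, clip_norms):
        g_clip = g * min(1.0, C / (np.linalg.norm(g) + 1e-12))
        # Noise scales only with this group's bound C_j, so coordinates with
        # small gradients are not drowned by noise calibrated to the largest ones.
        noisy.append(g_clip + rng.normal(0.0, k * C, size=g.shape))
    return noisy
```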

Second, motivated by the observation that GANs tend to be unstable at the beginning of their training, researchers have proposed warm-starting the DP-GAN by treating a small (2%) subset of the data as public and using it to train the GAN non-privately. Previous work shows that this improves accuracy by about 15% for standard statistical measures in machine learning. However, releasing even a small subset of sensitive data may raise legal or ethical concerns. Instead, we propose using publicly available data (e.g., 1940 Census data or published clinical trial datasets, in our case). As long as this public data is statistically similar to the sensitive data, it will provide the same accuracy improvements as subsampling the sensitive data. An analyst could similarly use domain knowledge or personal experience as a warm-start for the DP-GAN.

Third, when the research task is known in advance, we can increase the accuracy by optimizing our choice of final parameters. We can check the performance of a (privately) pre-trained model on data generated under each update of the Generator’s parameter settings, and use Report Noisy Argmax to privately select the top-performing parameter values.

Finally, we provide extensions of the existing DP-GAN framework that are designed to handle discrete data, categorical data, geospatial data, and data arising from fat-tailed distributions. We propose encoding discrete data with small values (e.g., values below 15) as short binary strings, and treating discrete data with larger values as real-valued entries. Categorical data can be handled with a newly-developed GAN architecture using the Gumbel-softmax function, which is a continuous approximation of the multinomial distribution parameterized by the softmax function. Geospatial data can be pre-processed by encoding geographical location as a two-dimensional real-valued attribute containing latitude and longitude. If the geospatial attribute describes a region (e.g., city or neighborhood), we can either randomly sample a point within that region or choose the center of the region. If we have data sampled from a fat-tailed distribution, such as the distribution of household size in Census data, we can first cluster the data into majority groups (close to the median) and minority groups (away from the median), and generate synthetic data separately for each group.
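As one concrete piece of this, a Gumbel-softmax layer (sketched below; the temperature tau is an assumed hyperparameter) lets the Generator emit categorical attributes while remaining differentiable. PyTorch also provides a built-in torch.nn.functional.gumbel_softmax.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=0.5):
    """Differentiable approximate sample from a categorical distribution."""
    # Add Gumbel(0, 1) noise to the category logits, then apply a
    # temperature-tau softmax; as tau -> 0 the output approaches one-hot.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)
```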
Describe other data types and/or research questions that your Solution would handle well. How would performance (in terms of privacy and utility) be maintained and why? Describe other data types and/or research questions that your Solution would not handle well. How would performance (in terms of privacy and utility) degrade and why?
Our algorithm is highly flexible and robust, and would continue to perform well for many data types and research questions beyond those considered in this challenge. Below we highlight just a few of the analyses that are made possible with our approach. All references are provided in the Supporting PDF.

Recent work (Xie et al., 2018, and Triastcyn and Faltings, 2018) on DP-GANs demonstrated successful private generation of synthetic image data. Since our algorithm builds upon these previous results, we anticipate that our algorithm should be similarly successful at generating and analyzing image datasets.

Our approach can also handle graph data, which will enable private analysis of social networks, where each node contains the information of an individual. We can compactly represent each node (including its location in the graph and any other relevant attributes) as a vector. Prior work (see Wang et al., 2017) gave techniques for this embedding that allowed the vectors to be interpretable by GANs. Our DP-GAN can then be used to generate synthetic graphs that share the same statistical properties as the original graph. This enables, for example, private link prediction in graphs, where an analyst is interested in learning about relationships between pairs of individuals in a network.

One very practical challenge for differential privacy is dynamically changing data, where entries may be added, deleted, or changed over the course of analysis. Our DP-GAN algorithm can seamlessly adapt to a changing training set because the neural network does not need to be retrained from scratch, but can instead continue training on the new dataset. The number of iterations needed to adapt to a changed dataset can be much smaller than the number needed to restart the algorithm, which vastly decreases the additional privacy loss of running DP-GAN on the new dataset.

In addition to predictive analysis tasks, our approach can also be used for intermediate data analysis tasks like feature selection and dimension reduction, using algorithms such as principal component analysis (PCA), kernel PCA, and linear/generalized discriminant analysis on the synthetic data.

The machine learning community has been extensively studying GANs and extending their applications to new data types and machine learning tasks. Using DP-GANs to generate synthetic data gives us the flexibility to utilize future advances in machine learning as they arise. One such application is active learning, where a learner has access to unlabelled data, and must select a small number of points to label to maximize its learning efficiency. Non-private GANs have been shown to perform well for active learning tasks by suggesting points that would most likely increase the accuracy of the model. However, privacy concerns arise in many practical active learning applications---such as choosing participants for a clinical trial based on their medical history. We propose that DP-GANs can be useful for private active learning.
How do the resource requirements for your Solution scale with the amount of data? Describe how the computational requirements of your Solution at different volumes of data can be handled using current computing technological capabilities. If your Solution requires advances in technology, describe your vision and anticipated availability for the types and scope of technological advances necessary.
Practical GAN training for small and moderate datasets---such as the common benchmark MNIST dataset of images of digits, or clinical data with tens of thousands of entries---can be done within hours on a personal computer.

In the case of large applications, the use of cloud servers for data storage and GPUs for computation is not uncommon. Standard GAN training libraries such as TensorFlow support GPU training with the addition of only a few lines of code. Parallel training is an active topic of study because Stochastic Gradient Descent (SGD) is highly parallelizable, either by processing multiple data points simultaneously in each layer or by performing SGD on multiple mini-batches. These schemes can be implemented on multi-core CPUs alongside GPUs, which handle the more intensive subroutines such as matrix-vector multiplication. As research on parallel computing and GAN-optimized hardware advances, the computational resources required for deep learning will continue to decrease.

In theory, training GANs (and other types of neural networks) is an NP-hard problem, so practitioners should not expect to train our DP-GAN to global optimality in reasonable runtime. Fortunately, our objective is smooth (its gradients do not change too rapidly), so we can hope to reach a local optimum. Being a first-order stationary point (FSP), i.e., a point with zero gradient norm, is a necessary condition for local optimality, so it is important to understand convergence rates to FSPs. For smooth objectives, all well-known first-order stochastic algorithms converge to an FSP. SGD is one famous member of this family; other commonly used algorithms include AdaGrad, Adam, Stochastic Nesterov's AGD, and, more recently, SVRG, SAGA, and RapGrad. For example, SVRG, SAGA, and RapGrad require only O(n^(2/3)/alpha) gradient evaluations to reach a point that is alpha-close to an FSP, for a database of size n.

Unfortunately, reaching an FSP is not sufficient to guarantee a local minimum, as the objective function may have directions of negative curvature along which its value decreases locally. Such points are commonly called saddle points. In minimization problems for neural networks, it is important to escape saddle points. This notion of escaping saddle points while converging to zero-gradient points is called "convergence to second-order stationary points" (SSPs, as opposed to FSPs). Our solution adds noise to the stochastic gradient, which naturally allows the algorithm to escape saddle points with high probability. This yields better training for our Discriminator and thus a better Generator.

Finally, we note that a Generator only needs to be trained once to generate as much synthetic data as desired. Although the up-front cost of training the DP-GAN may be high, this computational cost can be amortized over the generation of an arbitrarily large synthetic dataset. Further, this Generator and any synthetic data it produces can be freely shared without incurring additional privacy cost, due to the post-processing guarantees of differential privacy.
Please reference a dataset you suggest utilizing as a use case for testing algorithms. Is there existing regression, classification, and clustering analysis of this data? If so, Please describe.
Integrated Public Use Microdata Series (IPUMS) is one of the largest population databases available online, consisting of historical samples from both United States and international census records. Census extracts include a wide variety of selectable attributes, spanning numerical, categorical, and geospatial data types. Recently IPUMS put forth a preliminary data release of the 1940 United States full census extract consisting of approximately 130 million entries. This included demographic, economic, and location information on both the household and individual level, including attributes representing income, race, and census district (i.e., geographical location). Given the recency of its release, it has been relatively unexplored by external regression, classification, and clustering analyses, leaving much room for novel investigation.

We believe this to be an interesting and ideal use case for several reasons. First, GANs generally perform with higher accuracy as the number of samples increases. We hypothesize that with 130 million entries, the full census extract will be sufficiently large for our model to approximate the underlying distribution of the data with high accuracy. Second, given that the census attributes span numerical, categorical, and geospatial data types, it represents an instance commonly faced in practice where data types vary, and a situation that our techniques are intended to address. Finally, given that the information included is real information about household- and individual-level attributes, it could easily be interpreted in another context as sensitive information requiring privacy methods like the one we propose.

IPUMS Data:
Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 7.0 [dataset]. Minneapolis: University of Minnesota, 2017. https://doi.org/10.18128/D010.V7.0.

GIS Data:
Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 12.0 [Database]. Minneapolis: University of Minnesota. 2017. http://doi.org/10.18128/D050.V12.0
Propose an evaluation method for future algorithm testing competitions.
We propose two types of metrics to evaluate the performance of any differentially private synthetic data generation algorithm, including ours. This performance should be evaluated under the constraint of small privacy budget (e.g., epsilon < 10, delta < 10^{-5}). The judges could also vary the privacy budget used for evaluation, to see how accuracy of the algorithm changes with the privacy parameters.

The first metric is simply the closeness of our private synthetic data to the true database. Closeness can be measured both visually---for example, by comparing synthetically generated images to images in the true dataset---and statistically using Inception scores and Jensen-Shannon scores that measure the quality and diversity of datasets. Note that this metric is applicable to any research class, including regression, classification, clustering, or answering an unknown future research question.
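One simple instantiation of this closeness metric on a single categorical attribute (a sketch; the column and category inputs are placeholders) compares empirical distributions via the Jensen-Shannon divergence:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_score(true_col, synth_col, categories):
    """Jensen-Shannon divergence between an attribute's empirical distributions."""
    p = np.array([(true_col == c).mean() for c in categories])
    q = np.array([(synth_col == c).mean() for c in categories])
    return jensenshannon(p, q) ** 2   # scipy returns the distance (square root)
```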

The second metric compares the accuracy of a specific analysis task (e.g., regression, classification, clustering) performed on both the synthetic data and the true data. Accuracy can be evaluated by learning a model on the dataset of interest (private synthetic, non-private synthetic, and true), and evaluating the performance of this model on a holdout of the true dataset. These machine learning methods include logistic regression (with transfer learning), random forests, support vector machines, and nearest neighbors, which can all be used for regression, classification, and clustering. This approach could also be used to evaluate the utility of an algorithm for other unknown research questions, beyond regression, classification, and clustering.
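A sketch of this second metric using scikit-learn; the model choice and data splits are placeholders, and any of the listed learners could be substituted:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def utility_scores(train_sets, X_holdout, y_holdout):
    """Train the same model on each dataset and score all on a true-data holdout.

    train_sets maps names ('true', 'private synthetic', 'non-private synthetic')
    to (X, y) training pairs; the returned accuracies are directly comparable.
    """
    scores = {}
    for name, (X, y) in train_sets.items():
        model = RandomForestClassifier(random_state=0).fit(X, y)
        scores[name] = accuracy_score(y_holdout, model.predict(X_holdout))
    return scores
```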

Previous work showed that DP-GANs perform well with respect to both types of metrics. Visually, DP-GANs have been used to generate synthetic images of human-written digits, pictures of bedrooms, scenery of real-world attractions, and faces of celebrities that look realistic to the human eye. Statistically, the scores of private synthetic data are within 5-20% of those of non-private synthetic data also generated by GANs. DP-GANs have been shown to lose only 1-3% in accuracy for several common machine learning tasks, relative to both non-private synthetic data and true data. Furthermore, the synthetic data generated by DP-GANs retains most of the correlation structure among the variables, indicating that accuracy would also be high for arbitrary future analyses (unknown research questions). Since our approach builds upon this previous work, we believe that our algorithm will perform well with respect to our proposed metrics.
Document upload - Submit a supporting PDF. Note that the submission should stand alone without the attachment.
main.pdf
