MDA uses inverse-distance weighted random combination to aggregate identifiable entities into larger anonymized clusters.
Solution Overview - Describe how your approach optimizes the balance of differential privacy and data analysis utility.
Multi-Dimensional Aggregation reduces a dataset to any desired level of identifiability via an extension of geo-spatial aggregation into a multi-dimensional plane. Identifiable records are combined with their most similar neighbors until the desired level of anonymization is achieved. Such a technique leaves the relationships of the non-identifying data intact and preserves the magnitude of the relationship between quasi-identifiers and the remainder of the dataset with a variance penalty.
Dataset preparation involves dimensionality reduction of quasi-identifiers and full expansion of categorical identifiers into binary indicators. The exact form of dimensionality reduction of quasi-identifiers will depend on the goals of the data analysis, but may take the form of reducing dates to their year components (e.g. date of birth to year of birth) or addresses to continuous coordinates. Categorical variables may then be fully expanded into 0/1 indicator variables.
Once data preparation is complete, records are aggregated into groups of unique quasi-identifier combinations. The k-anonymity is calculated for each combination (in this context, the number of observations in the combination-1).
Combinations below the desired threshold of anonymity will have the multi-dimensional distance (ρ) of their quasi-identifiers calculated to each of the other identifier combinations. For many data types, a Pearson's correlation coefficient is a good choice of distance measure. Using a random number generator, a pair of identifier combinations will be selected using weights of 1- ρ 2, and the two combinations will be merged. An average of the quasi-identifiers will be assigned to all the records of the two combinations. At this stage, the two combinations will be indistinguishable while non-identifiable data will be preserved. Starting with the most identifiable record, aggregation will continue until the desired level of anonymity is achieved in the dataset.
The flexibility of this approach allows for maximization of utility through customizability and the preservation of relationships between identifying and non-identifying variables. The aggregation method can be applied on a custom set of identifiers; for example, if one quasi-identifier proves crucial to analysis, it can be left unchanged with other identifiers being further aggregated to compensate. If a variable offers too much identifying information and must be aggregated, it will still offer unbiased analytic estimates, though a variance penalty will be incurred. Privacy can be increased to any desired level.
Which randomization mechanisms does your solution use?
If other, please list and explain
Our solution uses a random aggregation algorithm to de-identify data. For records with an insufficient level of anonymity, the multi-dimensional distance of quasi-identifiers calculated to each of the other identifier combinations. Using a random number generator, a pair of identifier combinations will be selected using weights of 1- ρ 2, and the two combinations will be merged. An average value of each of the quasi-identifiers will be assigned to all the records from the two combinations of quasi identifiers.
Is your proposed solution an improvement or modification of previous algorithms in differential privacy or a substantially new algorithm.
improvement of previous algorithms
Provided that there is a known relationship between the fields and the analysis (research question such as regression, classification, clustering), how does your approach determine the number and order of the randomized mechanism being utilized?
The most identifiable records will be aggregated first. The randomization will occur via a twisted generalized feedback shift register using an arbitrary seed, weighted by multi-dimensional distance. If a given field is known to be important for an analysis, it can be excluded from the aggregation process and will be unchanged by the de-identification, which will leave it intact for analysis. The aggregation of other variables can be increased to compensate.
Provided that there is NOT a known relationship between the fields and the analysis (unknown research question), is there a prescribed sequence of privacy techniques that will always perform the best regardless of data?
For non-identifiable variables in the dataset, the described technique will perform equivalently to the identified dataset in all aspects. For quasi-identifiable variables, the variance of the estimates will increase as identifiers are aggregated.
The structure of the data will affect the performance of the model. The original dimensionality of the quasi-identifiers and the success of the pre-aggregation dimensionality reduction will determine the magnitude of the variance penalty upon aggregation. The amount of required aggregation to achieve a desired level of de-identification is a function of the dimensionality of the quasi-identifiers of the dataset. Reduction of the dimensions will improve the performance of the model.
How does your proposed solution differ from existing solutions? What are the advantages vs existing solutions? Disadvantages?
xisting solutions (geographic aggregation) are limited to geographic areas, which allows for a very limited domain of use. Our proposed solution is able to accommodate identifiers of any data type and number through the expansion of dimensionality beyond space.
In addition to the wider domain of use, the randomized component of our process makes aggregation less predictable and decreases the odds of unmasking.
How well does your solution optimize utility for a given privacy budget (the utility-privacy frontier curve) and how does it accomplish this for each of the research classes (regression, classification, clustering, and unknown research question) and each of the data types (numeric, geo-spatial, and class)?
The flexibility of multi-dimensional aggregation allows for optimization of utility to any privacy budget. Individual projects have differing requirements for privacy and precision of results. Altering the k-anonymization parameter of our solution allows researchers to intuitively adjust the degree of anonymization. The model works equivalently with privacy goals ranging from pairing of records to aggregation of a millions of records.
The restriction of the solution to non-identifiers maximizes utility for any of the research classes (regression, classification, clustering, and unknown research question) and data types (numeric, geo-spatial, and class) as the data is fundamentally unchanged.
Research questions involving the quasi-identifiers would incur a variance penalty that would scale with the privacy budget and the dimensionality of the quasi-identifiers. Higher dimensional data (class>geo-spatial>numeric) would incur a higher variance penalty, as more aggregation would be required to de-identify the data. This increased variance would manifest in the following ways:
Regression- Decreased precision of estimates
Classification- Decreased precision of classification
Clustering- Diminished relationship with aggregated variables, potential type II error.
Depending on the needs of the project, utility can be preserved in key analytic quasi-identifiers by increasing the aggregation of the other quasi-identifiers. Such flexibility maximizes the area under the utility-privacy frontier curve for a given analysis.
Describe other data types and/or research questions that your Solution would handle well. How would performance (in terms of privacy and utility) be maintained and why? Describe other data types and/or research questions that your Solution would not handle well. How would performance (in terms of privacy and utility) degrade and why?
In research questions that require a low to moderate dimensionality of identifiers, our solution would be optimal and would yield a de-identified dataset that is very similar to the original for all analytic purposes. The amount of required aggregation scales to the dimensionality of the identifiers. On a low to moderate dimension dataset, utility would be maximized across any desired privacy level.
While it is unclear what the goals would be, research questions involving very unique identifiers (e.g. exact phone number instead of area code, full date of birth instead of year) would not be well handled by our solution. The cost of privacy would be a dataset that is homogenous across the identifiers and would yield null relationships with identifiable variables (low utility). Relationships among non-identifying variables would be preserved for analytic purposes, however.
How do the resource requirements for your Solution scale with the amount of data? Describe how the computational requirements of your Solution at different volumes of data can be handled using current computing technological capabilities. If your Solution requires advances in technology, describe your vision and anticipated availability for the types and scope of technological advances necessary.
The resource requirements for our solution would scale exponentially to the dimensionality of the quasi-identifiers in the data; the raw number of observations is expected to be a much smaller factor in resource requirements. Distance calculations would occur between every combination of aggregated data cluster, leading to an exponential resource-dimensionality relationship. The overall resource requirement would also depend on the desired anonymity of the dataset, as the algorithm would continue until this is achieved.
The requirements to calculate most distance metrics (including Pearson's correlation) are quite low, and could conceivably be scaled to any required project.
Please reference a dataset you suggest utilizing as a use case for testing algorithms. Is there existing regression, classification, and clustering analysis of this data? If so, Please describe.
BRFSS is a public health database collected by the CDC on a nation-wide survey. It contains hundreds of variables of all types (including quasi-identifiers and geo-spatial information) and has been used for public health research for decades. Analyses of all types have been performed on the data. Some examples include:
Regression: Lu, Minggen & Yang, Wei. (2018). Multivariate Logistic Regression Analysis of Complex Survey Data with Application to BRFSS Data.
Classification: Yoon S., Taha B., Bakken, S. Using a Data Mining Approach to Discover Behavior Correlates of Chronic Disease: A Case Study of Depression. Studies in health technology and informatics. 2014;201:71-78.
Clustering: Elliott, M. R., & West, B. T. (2015). “Clustering by Interviewer”: A Source of Variance That Is Unaccounted for in Single-Stage Health Surveys. American Journal of Epidemiology,182(2), 118-126. doi:10.1093/aje/kwv018
Propose an evaluation method for future algorithm testing competitions.
We suggest a multi-tiered approach including that includes a comparison of results to de-identified data and subject-matter expert evaluation of the models.
Comparison of results to de-identified data may take the form of measuring the area under the privacy-utility frontier curve, where privacy could be measured as the k-anonymization and utility would be a measure of bias or imprecision of the resulting estimates. The algorithm with the lowest area under this curve would ostensibly perform the best in the widest range of real-world analytic and privacy requirements.
Expert evaluation of the data would be crucial for identifying algorithms that are novel, useful, and have a sound theoretical underpinning. A winning algorithm would be expected to exhibit these virtues, and a raw analytic comparison (e.g. area under the privacy-utility curve) could overlook them.
Document upload - Submit a supporting PDF. Note that the submission should stand alone without the attachment.
BEEFERS Unlinkable Data Challenge PDF 3.pdf