NIST PSCR

The Unlinkable Data Challenge: Advancing Methods in Differential Privacy

Propose a mechanism to enable the protection of personally identifiable information while maintaining a dataset's utility for analysis.
stage:
Won
prize:
$40,000

Challenge Overview

The digital revolution has radically changed the way we interact with data. In a pre-digital age, personal data was something that had to be deliberately asked for, stored, and analyzed. The inefficiency of poring over printed or even hand-written data made it difficult and expensive to conduct research. It also acted as a natural barrier that protected personally identifiable information (PII): it was extremely difficult to use a multitude of sources to identify particular individuals included in shared data.

Our increasingly digital world turns almost all our daily activities into data collection opportunities, from the more obvious entry into a webform to connected cars, cell phones, and wearables. Dramatic increases in computing power and innovation over the last decade along with both public and private organizations increasingly automating data collection make it possible to combine and utilize the data from all of these sources to complete valuable research and data analysis.

At the same time, these same increases in computing power and innovation can also be used to the detriment of individuals through linkage attacks: auxiliary and possibly completely unrelated datasets can be combined with records in a dataset that contains sensitive information to identify specific individuals.

This valid privacy concern is unfortunately limiting the use of data for research, including datasets within the Public Safety sector that might otherwise be used to improve protection of people and communities. Due to the sensitive nature of information contained in these types of datasets and the risk of linkage attacks, these datasets can’t easily be made available to analysts and researchers. In order to make the best use of data that contains PII, it is important to disassociate the data from PII. There is a utility vs. privacy tradeoff, however: the more a dataset is altered, the more likely it is that the de-identified dataset will be less useful for analysis and research purposes.

Currently popular de-identification techniques are not sufficient. Either PII is not sufficiently protected, or the resulting data no longer represents the original data. Additionally, it is difficult or even impossible to quantify the amount of privacy that is lost with current techniques.

This competition is about creating new methods, or improving existing methods of data de-identification, in a way that makes de-identification of privacy-sensitive datasets practical. A first phase hosted on HeroX will ask for ideas and concepts, while later phases executed on Topcoder will focus on the performance of developed algorithms.

 

What Can You Do Right Now?

  • Click ACCEPT CHALLENGE above to sign up for the challenge
  • Read the Challenge Guidelines to learn about the requirements and rules
  • Share this challenge on social media using the icons above. Show your friends, your family, or anyone you know who has a passion for discovery.
  • Start a conversation in our Forum to ask questions or connect with other innovators.

Challenge Guidelines

Challenge Overview

The National Institute of Standards and Technology (NIST) promotes U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology in ways that enhance economic security and improve our quality of life.  We have identified potential opportunities for using the most recent and forecasted developments in analytics technologies to gain insights from public safety data which would, in turn, inform decision-making and increase safety.  A critical consideration in moving forward with these opportunities, however, is assuring data privacy.

Databases across the country include information with potentially important research implications: location data collected from mobile devices, which can be used for contingency planning for disaster scenarios; travel data, which can be used to identify safety risks within the industry; hospital and medical record data, which can assist researchers in tracking contagious diseases such as virus outbreaks, the epidemiology of drug abuse, and other health epidemics; and data on patterns of violence in local communities.  However, these databases also include the personally identifiable information of police officers, victims, persons of interest, witnesses, confidential informants, and suspects.

In 2002, a study by Sweeney found that in the United States, the combination of just 3 “quasi-identifiers” (date of birth, 5-digit postal code, and gender) uniquely identifies 87% of the population.  The study demonstrated that by combining a public healthcare information dataset with a publicly available voters’ list and using quasi-identifiers, it is possible to mine the secret health records of all state employees from a published dataset in which only explicit identifiers have been removed.
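
The linkage described above can be sketched in a few lines. This is a minimal illustration, not a reconstruction of the study's method: the records are invented toy data, and the field names (`dob`, `zip`, `gender`) and the function `link_records` are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Toy "de-identified" health records: names removed, quasi-identifiers kept.
# All values are invented for illustration.
health_records = [
    {"dob": "1961-03-14", "zip": "02138", "gender": "F", "diagnosis": "asthma"},
    {"dob": "1975-11-02", "zip": "02139", "gender": "M", "diagnosis": "diabetes"},
    {"dob": "1988-07-21", "zip": "02139", "gender": "F", "diagnosis": "fracture"},
]

# Toy public voter list: names present, no health data.
voter_list = [
    {"name": "Alice Example", "dob": "1961-03-14", "zip": "02138", "gender": "F"},
    {"name": "Bob Example", "dob": "1975-11-02", "zip": "02139", "gender": "M"},
    {"name": "Carol Example", "dob": "1990-01-05", "zip": "02139", "gender": "F"},
]

QUASI_IDENTIFIERS = ("dob", "zip", "gender")


def link_records(sensitive, auxiliary, keys=QUASI_IDENTIFIERS):
    """Join two datasets on quasi-identifiers; return unambiguous matches."""
    index = defaultdict(list)
    for person in auxiliary:
        index[tuple(person[k] for k in keys)].append(person)

    linked = []
    for record in sensitive:
        matches = index.get(tuple(record[k] for k in keys), [])
        if len(matches) == 1:  # exactly one candidate: record is re-identified
            linked.append((matches[0]["name"], record["diagnosis"]))
    return linked


reidentified = link_records(health_records, voter_list)
for name, diagnosis in reidentified:
    print(f"{name} -> {diagnosis}")
```

Neither toy dataset pairs a name with a diagnosis on its own; the pairing appears only once the two are joined on the shared quasi-identifiers.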

Unfortunately, making minor changes to birth dates and other PII does not provide adequate protection against such linkage attacks.  With the advent of “big data” and technological advances in linking data, there are far too many other pieces of data related to each of us that can lead to our identity being uncovered.  Those of us at the “edges” of the datasets are more vulnerable to linkage.  For example, with geo-spatial data, the location of a family living in a remote area (say, one family living in the far reaches of a wilderness area or desert) is harder to protect than the locations of the many families living in an urban area.  Similarly, someone with a less common attribute or combination of attributes is also more vulnerable.
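
One way to see why records at the “edges” are more exposed is to count how many records share each combination of attributes. The sketch below assumes invented data and hypothetical names (`group_sizes`, `region`, `household_size`); a group of size 1 marks a record that is unique on those attributes and therefore easiest to single out.

```python
from collections import Counter

# Invented toy data: many households in an urban area, one in a remote area.
households = (
    [{"region": "urban", "household_size": 4}] * 50
    + [{"region": "urban", "household_size": 2}] * 30
    + [{"region": "remote", "household_size": 4}]
)


def group_sizes(records, keys):
    """Count how many records share each combination of the given attributes."""
    return Counter(tuple(r[k] for k in keys) for r in records)


sizes = group_sizes(households, ("region", "household_size"))
unique_groups = [combo for combo, n in sizes.items() if n == 1]

print(sizes)
print("Unique on these attributes:", unique_groups)
```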

Grant award recipients of the Public Safety Communications Research Division (PSCR) arm of NIST have already begun identifying data collection efforts to support decision making. Consider just two examples from the ten awardees:

  • The Creation of a Unified Analysis Framework and the Data Comparison Center: This research effort proposes a tool to ingest Computer Aided Dispatch (CAD) and Records Management Systems (RMS) data to build a data warehouse and master dataset for the community to foster “data driven decision processes.”  Both the CAD and RMS data include PII and contextual data that allow for identification.
  • Towards Cognitive Assistant Systems for Emergency Response: This project envisions using a verbal interface and natural language processing to build a database of field notes to fuse with other data sources.  This data will include PII, health data, and contextual information.

These types of datasets are expected to proliferate in the near future, and all of them contain sensitive PII that may put individual members of the public at risk.  Due to the sensitive nature of information contained in these types of datasets, these datasets cannot, or should not, be made available to analysts and researchers without PII being protected.  However, it is not enough to simply remove the PII from these datasets: it is well known that records containing sensitive information can still be matched to uniquely identifiable individuals by combining them with auxiliary and possibly completely unrelated datasets (known as a linkage attack).

 

How important is this?

There is no absolute protection that data will not be misused.  Even a dataset that protects individual identities well may, if it gets into the wrong hands, be used for ill purposes.  Weaknesses in the security of the original data can threaten the privacy of individuals.

This challenge is focused on proactively protecting individual privacy while allowing for data to be used by researchers for positive purposes and outcomes.  NIST has strong commitments to both public safety research and the preservation of security and privacy, including the use of de-identification.  NISTIR 8053, De-Identification of Personal Information, October 2015, addresses de-identification terminology, the need for de-identification, and examples and scenarios of use.  NIST Special Publication 800-188, De-Identifying Government Datasets (2016), provides guidance regarding the selection, use and evaluation of de-identification techniques for US government datasets.  It also provides a framework that can be adapted by Federal agencies to frame the governance of de-identification procedures with the ultimate goal of reducing disclosure risk that might result from an intentional data release.  NIST also held the first workshop on de-identification for federal agency stakeholders in June of 2016 to understand government needs.  In January 2017, NISTIR 8062, “An Introduction to Privacy Engineering and Risk Management in Federal Systems,” was published, which includes a privacy risk model.

It is well known that privacy in data release is an important area for the Federal Government (which has an Open Data Policy), state governments, the public safety sector and many commercial non-governmental organizations.  Developments coming out of this competition would hopefully drive major advances in the practical applications of differential privacy for these organizations.

The purpose of this series of competitions is to provide a platform for researchers to develop more advanced differentially private methods that can substantially improve the privacy protection and utility of the resulting datasets. 

Getting Involved - How to Participate

The Challenge

The Unlinkable Data Challenge is a multi-phased Challenge.  This first phase of the Challenge is intended to source detailed concepts for new approaches, inform the final design of up to three subsequent phases, and provide recommendations for matching phase 1 competitors into teams for subsequent phases.  Teams will predict and justify where their algorithm falls with respect to the utility-privacy frontier curve.

In this phase, competitors are asked to propose how to de-identify a dataset using less than the available privacy budget (see Appendix A), while also maintaining the dataset’s utility for analysis.  For example, the de-identified data, when put through the same analysis pipeline as the original dataset, should produce comparable results (e.g., similar coefficients in a linear regression model, or a classifier that produces similar predictions on subsamples of the data).
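
Appendix A, which defines the privacy budget, is not reproduced here. The sketch below therefore assumes the common reading in which the budget is a total ε that each noisy release consumes additively (basic sequential composition), and it uses the standard Laplace mechanism for a counting query. The class `PrivacyBudget`, the function `private_count`, and the data are hypothetical and invented for illustration.

```python
import random


class PrivacyBudget:
    """Tracks cumulative epsilon spent under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, budget, rng):
    """Release a count with Laplace noise; a count has sensitivity 1."""
    budget.spend(epsilon)
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
ages = [rng.randint(18, 90) for _ in range(1000)]  # invented data

budget = PrivacyBudget(total_epsilon=1.0)
noisy_over_65 = private_count(ages, lambda a: a > 65, 0.4, budget, rng)
noisy_under_30 = private_count(ages, lambda a: a < 30, 0.4, budget, rng)

true_over_65 = sum(1 for a in ages if a > 65)
print(f"true={true_over_65}, noisy={noisy_over_65:.1f}, spent={budget.spent}")
```

Two releases at ε = 0.4 each leave part of the assumed budget of 1.0 unspent; a third release at the same ε would be refused by `spend`.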

This phase of the Challenge seeks Conceptual Solutions that describe how to use and/or combine methods in differential privacy to mitigate privacy loss when publicly releasing datasets in a variety of industries such as public safety, law enforcement, healthcare/biomedical research, education, and finance.  We are limiting the scope to addressing research questions and methodologies that require regression, classification, and clustering analysis on datasets that contain numerical, geo-spatial, and categorical data.
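
For the regression analyses in scope, the "comparable results" check can be sketched as fitting the same model to the original and the released data and comparing coefficients. This is only one possible utility measure, not the challenge's scoring method. The names `fit_line` and `coefficient_gap` are hypothetical, the data is invented, and the noise added here is a placeholder for the output of a real de-identification method; it carries no privacy guarantee.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def coefficient_gap(original, released):
    """Absolute difference between fitted (slope, intercept) pairs."""
    (s0, i0), (s1, i1) = fit_line(*original), fit_line(*released)
    return abs(s0 - s1), abs(i0 - i1)


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Invented original data: y is roughly 2x + 5.
xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [2.0 * x + 5.0 + rng.gauss(0, 1) for x in xs]

# Placeholder "release": perturb y with zero-mean noise. This stands in for
# the output of a real de-identification method and claims no guarantee.
released_ys = [y + rng.gauss(0, 3) for y in ys]

slope_gap, intercept_gap = coefficient_gap((xs, ys), (xs, released_ys))
print(f"slope gap={slope_gap:.3f}, intercept gap={intercept_gap:.3f}")
```

Small gaps suggest the release preserves this particular analysis; analogous checks for classification or clustering would need their own comparison measures.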

To compete in this phase, we are asking that you propose a new algorithm utilizing existing or new randomized mechanisms with a justification of how this will optimize privacy and utility across different analysis types.  We are also asking you to propose a dataset that you believe would make a good use case for your proposed algorithm, and provide a means of comparing your algorithm and other algorithms.
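
As a concrete instance of an existing randomized mechanism, the sketch below implements classic randomized response for a single yes/no attribute. It is offered as background, not as a suggested entry. The function names and data are hypothetical; the privacy parameter follows the standard result that reporting the truth with probability p, and a fair coin flip otherwise, satisfies ε = ln((1 + p) / (1 - p)).

```python
import math
import random


def randomized_response(truth, p_truth, rng):
    """Report the true bit with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def epsilon_for(p_truth):
    """Privacy parameter of this mechanism: ln((1 + p) / (1 - p))."""
    return math.log((1 + p_truth) / (1 - p_truth))


def estimate_fraction(reports, p_truth):
    """Debias the observed 'yes' rate to estimate the true 'yes' rate."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) / 2) / p_truth


rng = random.Random(42)  # fixed seed so the sketch is repeatable
true_bits = [i < 3000 for i in range(10000)]  # invented data: 30% "yes"

p = 0.5
reports = [randomized_response(b, p, rng) for b in true_bits]
estimate = estimate_fraction(reports, p)

print(f"epsilon={epsilon_for(p):.3f}, estimated yes rate={estimate:.3f}")
```

Raising `p` makes the estimate more accurate but increases ε, which is the privacy and utility tradeoff a proposal is asked to justify.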

All submissions must be made using the submission form provided.  Submissions will be judged using the listed criteria and scoring scheme.

Teams that participate in the HeroX challenge, as well as newly formed teams that did not participate, can proceed to a leader-board-driven competition on Topcoder, the Algorithm Competition #1.  It is anticipated that Competition #1 will be followed by iterative improvements in the Algorithm Sprint, and finish with a final Challenge to further boost performance in the Algorithm Competition #2.  Where a competitor's algorithm falls with respect to the utility-privacy frontier curve will determine who wins subsequent Topcoder Competitions.
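
The exact definition of the utility-privacy frontier curve is not given in this section. As a rough illustration only, the sketch below treats it as a generic Pareto frontier over (privacy loss ε, analysis error) pairs, where lower is better on both axes. The function `pareto_frontier`, the algorithm labels, and the numbers are all invented.

```python
def pareto_frontier(results):
    """Return entries not dominated on (epsilon, error); lower is better for both."""
    frontier = []
    for name, eps, err in results:
        dominated = any(
            (e2 <= eps and r2 <= err) and (e2 < eps or r2 < err)
            for _, e2, r2 in results
        )
        if not dominated:
            frontier.append((name, eps, err))
    return sorted(frontier, key=lambda item: item[1])


# Invented results: (algorithm, privacy loss epsilon, analysis error).
results = [
    ("A", 0.1, 0.40),
    ("B", 0.5, 0.20),
    ("C", 0.5, 0.35),  # dominated by B: same epsilon, higher error
    ("D", 1.0, 0.10),
    ("E", 2.0, 0.15),  # dominated by D: more privacy loss and more error
]

frontier = pareto_frontier(results)
print([name for name, _, _ in frontier])
```

Under this reading, an algorithm on the frontier cannot be beaten on privacy and error at the same time by any other entry.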

Information covering the State of the Art in De-identification and Evaluating Privacy and Utility, along with links to additional resources, is included in Appendices A and B.  Appendix C includes a sample Use Case.

 

The Prize

A total prize purse of up to $50,000 is available for this challenge. The planned awards are as follows:

  • $15,000 - Grand Prize
  • $10,000 - Runner up prize
  • $5,000 - Honorable Mention Prize
  • $10,000 - Two $5,000 People’s Choice Prizes

 

The Timeline

Pre-registration begins           February 1, 2018

Open to submissions              April 25, 2018

Submission deadline               July 26, 2018 @ 5pm ET

People’s Choice Voting           August 14 - August 28, 2018

Winners Announced               September 12, 2018

 

How do I win?

To be eligible for an award, your proposal must, at minimum:

  • Meet the eligibility requirements stated below and in the Challenge Specific Agreement.
  • Satisfy the Judging Scorecard requirements
  • Thoughtfully address the Submission Form questions
  • Be scored higher than your competitors!

Judging Criteria

 

  • Analysis Class - Regression: Differential Privacy Capability and Utility (weight: 15). The balance of privacy and utility protected and the quality of evidence that privacy/utility will be protected at this level for each type of data (numerical, geo-spatial, and categorical data).
  • Analysis Class - Classification: Differential Privacy Capability and Utility (weight: 15). The balance of privacy and utility protected and the quality of evidence that privacy/utility will be protected at this level for each type of data (numerical, geo-spatial, and categorical data).
  • Analysis Class - Clustering: Differential Privacy Capability and Utility (weight: 15). The balance of privacy and utility protected and the quality of evidence that privacy/utility will be protected at this level for each type of data (numerical, geo-spatial, and categorical data).
  • Analysis Class - Unknown research question: Differential Privacy Capability and Utility (weight: 15). How does the solution handle a case where a dataset needs privacy protected, but the research questions are unknown?
  • Thoroughness in Self-Evaluation (weight: 5). The competitor answered the questions thoroughly, including the question about what use cases the Solution would not handle well.
  • Innovation (weight: 20). Subjective determination of uniqueness and likeliness to lead to greater future improvements than other Solutions.
  • Computing Requirements/Feasibility (weight: 5). Feasibility of using this Solution for larger volume use cases.
  • Robustness & Generalizability (weight: 5). The Solution handles the provided classes and types of data well and can handle other use case classes and types of data.  This could also include the ability to vary the balance between privacy and utility.
  • Dataset suggestion (weight: 5). A dataset is proposed that contains numerical, geo-spatial, and classification types of data as well as existing exploratory data analysis like regression, clustering, and classification analysis.

 

Rules

The NIST official rules are posted on Challenge.gov and a full copy of those rules is in the Challenge Specific Agreement on the HeroX Unlinkable Data Challenge: Advancing Methods in Differential Privacy website.  The following information is provided as reference and is not all-inclusive.

Contestant Eligibility:

To be eligible for the cash prizes, each contestant or team of contestants must include an Official Representative who is an individual age 18 or older at the time of entry and a U.S. citizen or permanent resident of the United States or its territories.  In the case of a private entity, the business shall be incorporated in and maintain a primary place of business in the United States or its territories.  Contestants may not be a Federal entity or Federal employee acting within the scope of their employment.  NIST Guest Researchers, as well as direct recipients of NIST funding awards through any Center of Excellence established by NIST, are eligible to enter, but are not eligible to receive cash awards. Non-NIST Federal employees acting in their personal capacities should consult with their respective agency ethics officials to determine whether their participation in this Competition is permissible.  Contestants, including individuals and private entities, must not have been convicted of a felony criminal violation under any Federal law within the preceding 24 months and must not have any unpaid Federal tax liability that has been assessed, for which all judicial and administrative remedies have been exhausted or have lapsed, and that is not being paid in a timely manner pursuant to an agreement with the authority responsible for collecting the tax liability.  Contestants must not be suspended, debarred, or otherwise excluded from doing business with the Federal Government.  Multiple individuals and/or legal entities may collaborate as a group to submit a single entry and a single individual from the group must be designated as an official representative for each entry.  That designated individual will be responsible for meeting all entry and evaluation requirements.

Teams:

Contest submissions can be from an individual or a team. If a team of individuals, a corporation, or an organization is selected as a prize winner, NIST will award a single dollar amount to the winning Team(s) and each Team, whether consisting of a single or multiple contestants, is solely responsible for allocating any prize amount among its member contestants as they deem appropriate.  NIST will not arbitrate, intervene, advise on, or resolve any matters between entrant members.  It will be up to the winning Team(s) to reallocate the prize money among its member contestants, if they deem it appropriate.

Payments:

The prize competition winners will be paid prizes directly from NIST.  Prior to payment, winners will be required to verify eligibility.  The verification process with the agency includes providing the full legal name, tax identification number or social security number, and the routing number and bank account number to which the prize money can be deposited directly.

Intellectual Property:

Any applicable intellectual property rights to an Entry will remain with the Participant. By participating in the prize challenge, the Participant is not granting any rights in any patents, pending patent applications, or copyrights related to the technology described in the Entry. However, by submitting an Entry, the Participant is granting NIST, NASA, and any parties acting on their behalf certain limited rights as set forth herein.

By submitting an Entry, the Participant grants to NIST, NASA, and any parties acting on their behalf the right to review the Entry, to describe the Entry in any materials created in connection with this competition, and to screen and evaluate the Entry.  NIST and NASA, and any parties acting on their behalf will also have the right to publicize Participant’s name and, as applicable, the names of Participant’s team members and/or Organization which participated in submitting the Entry following the conclusion of the Competition.

As part of its submission, the Participant must provide written consent granting NIST, NASA, and any parties acting on their behalf, a royalty-free, non-exclusive, irrevocable, worldwide license to display publicly and use for promotional purposes the Participant’s entry (“demonstration license”).  This demonstration license includes posting or linking to the Participant’s entry on NIST and NASA’s websites, including the Competition Website, and partner websites, and inclusion of the Participant’s Entry in any other media, worldwide.

Registration and Submissions:

Submissions must be made online (only), via upload to the HeroX.com website, on or before 5pm ET on July 26, 2018. Any uploads must be in PDF format. No late submissions will be accepted.

Winner Selection:

Grand Prize, Runner Up, and Honorable Mention winners will be selected per the official Judging Criteria.  Final determination of the winners will be made at the sole discretion of NIST. Scores and feedback from NIST will not be shared. 

The $5,000 People's Choice awards will be awarded based on the number of votes received during the voting period.  A competitor is eligible to win both a Judges' Award and a People's Choice Award.

All votes are subject to review.  Any competitor using unfair methods to solicit votes will be automatically disqualified from the challenge.

Entries that are eligible for the Voting stage will become viewable to the public. Depending on the number of entries received, either all, or a selected shortlist, will move on to the Voting stage.

Additional Information

  • By participating in the challenge, each competitor agrees to submit only their original idea. Any indication of "copying" amongst competitors is grounds for disqualification.
  • All applications will go through a process of due diligence; any application found to be misrepresentative, plagiarized, or sharing an idea that is not their own will be automatically disqualified.
  • Submissions must be made in English.
  • All ineligible applicants will be automatically removed from the competition with no recourse or reimbursement.
  • No purchase or payment of any kind is necessary to enter or win the competition.
  • Void wherever restricted or prohibited by law.