The digital revolution has radically changed the way we interact with data. In a pre-digital age, personal data was something that had to be deliberately asked for, stored, and analyzed. The inefficiency of pouring over printed or even hand-written data made it difficult and expensive to conduct research. It also acted as a natural barrier that protected personally identifiable information (PII) -- it was extremely difficult to use a multitude of sources to identify particular individuals included in shared data.
Our increasingly digital world turns almost all our daily activities into data collection opportunities, from the more obvious entry into a webform to connected cars, cell phones, and wearables. Dramatic increases in computing power and innovation over the last decade along with both public and private organizations increasingly automating data collection make it possible to combine and utilize the data from all of these sources to complete valuable research and data analysis.
At the same time, these same increases in computing power and innovations can also be used to the detriment of individuals through linkage attacks: auxiliary and possibly completely unrelated datasets in combination with records in the dataset that contain sensitive information can be used to determine uniquely identifiable individuals.
This valid privacy concern is unfortunately limiting the use of data for research, including datasets within the Public Safety sector that might otherwise be used to improve protection of people and communities. Due to the sensitive nature of information contained in these types of datasets and the risk of linkage attacks, these datasets can’t easily be made available to analysts and researchers. In order to make the best use of data that contains PII, it is important to disassociate the data from PII. There is a utility vs. privacy tradeoff however, the more that a dataset is altered, the more likely that there will be a reduced utility of the de-identified dataset for analysis and research purposes.
Currently popular de-identification techniques are not sufficient. Either PII is not sufficiently protected, or the resulting data no longer represents the original data. Additionally, it is difficult or even impossible to quantify the amount of privacy that is lost with current techniques.
This competition is about creating new methods, or improving existing methods of data de-identification, in a way that makes de-identification of privacy-sensitive datasets practical. A first phase hosted on HeroX will ask for ideas and concepts, while later phases executed on Topcoder will focus on the performance of developed algorithms.