SAP's Data Anonymization Challenge


Protect personally identifying information in semi-structured data while preserving the utility of the dataset for machine learning tasks.
Stage: Enter | Prize: $60,000

Challenge Overview

Welcome to SAP's Data Anonymization Challenge!

This challenge will award up to $60,000 in total prizes for solutions that can identify personally identifying information (PII) within semi-structured data and then anonymize it.

 

Why This Matters

Access to data is a major differentiator for businesses in today's global marketplace and can be instrumental in breaking down data silos, extracting business intelligence through machine learning, and using AI-driven insights to deliver better experiences. However, a critical prerequisite for pursuing these opportunities is assuring data privacy.

Semi-structured text documents are an essential part of many business processes, for example invoices, sales orders, or payment advices. Translating this semi-structured data into structured data is essential for further downstream processing and automation.

To foster research and development of machine learning approaches for document processing, it is necessary to allow researchers to work with large amounts of realistic documents. To comply with data protection regulation, companies have to anonymize documents and remove any personally identifying information. This redaction should be done in a way that produces realistic-looking documents and minimizes negative impact on machine learning model training.

Openness is a key principle for SAP as it is the foundation for co-innovation and integration. We are embracing open standards and open source, and are providing rapid access to data and business processes through open APIs, so customers and partners can turn data into value as easily as possible. In return, it is crucial for SAP to use the power and speed of communities to innovate even faster. In this spirit, the winning solutions from the challenge will be open-sourced. Openness creates more value for everybody.
 

Your Challenge

In this challenge, you will work with a set of 25,000 invoices from the public RVL-CDIP dataset [1,2]. Some of the invoices are low-quality scans or contain handwritten notes. You are also welcome to use other datasets that are available to you to train your model and maximize its generalizability.

Your tasks are as follows:

  1. Build a model that can identify the bounding boxes of the following types of personally identifying information:
    1. Personal names
    2. Signatures and handwritten notes
  2. Develop a system to redact the content of the bounding boxes with a realistic replacement such that the anonymized data remains effective training data for machine learning tasks. This will require effort to preserve the style, orientation, imperfections and complexities of the original data. Anonymized substitute text that is simpler and clearer to read than the original data will result in a less comprehensive and less effective training data set.

As ground truth for training your models, you will be provided with a sample of bounding boxes for the personally identifying information specified above.
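For Task 1, one possible starting point (not part of the official challenge materials) is to fine-tune an off-the-shelf object detector on the provided bounding boxes. The sketch below uses torchvision's Faster R-CNN; the two PII classes and the target format are assumptions derived from the task description above, not a prescribed approach.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Assumed classes, derived from the task description:
# 0 = background, 1 = personal name, 2 = signature / handwritten note.
NUM_CLASSES = 3

def build_pii_detector():
    # Start from a COCO-pretrained detector and replace the box predictor
    # so it outputs the two PII classes plus background.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
    return model

# During training, torchvision expects one target dict per image:
#   {"boxes": FloatTensor[N, 4] in (x1, y1, x2, y2) pixel coordinates,
#    "labels": Int64Tensor[N]}
# which is the shape the provided ground-truth bounding boxes would be
# converted into (the exact label file format is not assumed here).
```

An OCR-plus-NER pipeline could complement this for printed names, but handwritten notes and signatures generally call for an image-level detector.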

 

 


Challenge Updates

1 week to submit to Task 2 Leaderboard and Q&A Webinar

Nov. 12, 2019, 3 p.m. PST by Kyla Jeffrey

1 Week Left to Submit to Task 2 Leaderboard

That's right, your leaderboard submissions are due on November 19th at 5 pm PT (Los Angeles). Please see our last update for full details on how to submit.

 

Attend our Q & A Webinar LIVE

Join SAP for a webinar where you will have the opportunity to ask your questions LIVE! You won't want to miss this chance to get the inside scoop on SAP's Data Anonymization Challenge!

Submit your questions in advance when you register or by commenting on this update.

Time: Nov. 21 at 8:00 a.m. Pacific Time (Los Angeles)

 



Task 2 Leaderboard - Entries due Nov 19

Nov. 8, 2019, 11:20 a.m. PST by Kyla Jeffrey

Hi Everyone,

Today we are releasing the images for the Task 2 Leaderboard.  Your anonymized images for the Task 2 Leaderboard will be due November 19th at 5:00 pm Pacific Time (Los Angeles).

 

What is Task 2?

Develop a system to replace the content in the bounding boxes, as given in task 1, with realistic dummy content such that:

  1. Personally identifying information is removed, and
  2. The anonymized data remains effective training data for machine learning tasks. This will require efforts to preserve the style, orientation, imperfections and complexities of the original data. Anonymized substitute text that is simpler and clearer to read than the original data will result in a less comprehensive and less effective training data set.
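As a concrete illustration of requirement 2 above, here is a deliberately naive redaction baseline, not the expected solution: it paints over a box with the surrounding background colour and writes plain dummy text. The (x1, y1, x2, y2) box order and the grayscale assumption are illustrative; a competitive entry would also reproduce the original font, skew, and scan noise.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def redact_box(img, box, replacement="JANE DOE"):
    """Naive baseline: erase the box with the local background colour and
    write dummy text on top. Box is assumed to be (x1, y1, x2, y2) pixels."""
    img = img.convert("L")  # RVL-CDIP scans are effectively grayscale
    x1, y1, x2, y2 = box
    arr = np.asarray(img)
    pad = 5  # estimate the page background from the box plus a small border around it
    patch = arr[max(y1 - pad, 0):y2 + pad, max(x1 - pad, 0):x2 + pad]
    background = int(np.median(patch))
    draw = ImageDraw.Draw(img)
    draw.rectangle([x1, y1, x2, y2], fill=background)
    draw.text((x1 + 2, y1 + 2), replacement, fill=0, font=ImageFont.load_default())
    return img

# Usage sketch (hypothetical file name and box):
# anonymized = redact_box(Image.open("invoice.png"), (120, 340, 360, 380))
# anonymized.save("invoice_anonymized.png")
```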

 

Task 2 Leaderboard

Please anonymize the following images and submit them as individual PNG files to the Task 2 Leaderboard.

Download the image set here.

Note that the images below have boxes drawn around the parts that have to be anonymized. In the resources zip you will find the raw images together with label data indicating the bounding boxes.
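For a quick visual check of what has to be anonymized, a sketch like the following can overlay the provided bounding boxes on the raw images. The archive name, the label file name, and the structure of each label entry are assumptions for illustration only; adjust them to the actual layout of the resources zip.

```python
import io
import json
import zipfile
from PIL import Image, ImageDraw

ARCHIVE = "task2_leaderboard.zip"   # assumed archive name
LABEL_FILE = "labels.json"          # assumed label file name
# Each entry is assumed to look like
# {"id": "<image name>", "boxes": [[x1, y1, x2, y2], ...]}.

with zipfile.ZipFile(ARCHIVE) as zf:
    for entry in json.loads(zf.read(LABEL_FILE)):
        img = Image.open(io.BytesIO(zf.read(entry["id"]))).convert("RGB")
        draw = ImageDraw.Draw(img)
        for x1, y1, x2, y2 in entry.get("boxes", []):
            draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        img.save("preview_" + entry["id"].replace("/", "_"))
```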


Preparing for the Task 2 Leaderboard

Nov. 1, 2019, 8:28 a.m. PDT by Kyla Jeffrey

Hi Everyone,

 

The images for the Task 2 Leaderboard will be released next week on Friday November 8th, 2019.  Your anonymized images for the Task 2 Leaderboard will be due November 19th at 5:00 pm Pacific Time (Los Angeles).

What is Task 2?

Develop a system to replace the content in the bounding boxes, as given in task 1, with realistic dummy content such that:

  1. Personally identifying information is removed, and
  2. The anonymized data remains effective training data for machine learning tasks. This will require efforts to preserve the style, orientation, imperfections and complexities of the original data. Anonymized substitute text that is simpler and clearer to read than the original data will result in a less comprehensive and less effective training data set.

Sample Images

As you prepare for Task 2, please take a look at these example images which give you an idea of the complexity. Note, these are not the images that will be used for the Task 2 Leaderboard.

EASY IMAGES

imagesm_mnl50c00_ti31689076.png
imagese_epd40c00_ti10161367.png

MEDIUM IMAGES

imagesf_fbt10c00_2085530535.png
imagesj_jmh62d00_86463216.png

HARD IMAGES

imagesl_lwa54c00_80701781.png
imagesz_zhy61c00_2084061355.png

Task 2 Leaderboard Deadline Change

Oct. 28, 2019, 12:30 p.m. PDT by Liz Treadwell

We want to give everyone additional time to work on their submissions to the Task 2 leaderboard, so we've extended the deadline from October 31st to November 19th. To make sure you have the correct deadline time, you can view it in the Timeline section here.

 

If you have any questions, please post them in the challenge forum.


Important Update - Guidelines Changes

Oct. 24, 2019, 1:08 p.m. PDT by Liz Treadwell

Hi everyone,

We wanted to let you know that we have updated the scoring system for Task 1. We would like to shift the focus to the tasks that are important to SAP. Hence, we are making the following changes:

  • Merge the handwritten and signature classes into a single signature class.
  • Remove invoices with personal addresses or personal telephone numbers from the test set.
  • You can download the updated test file list here.

We’ve also added the option to label an invoice as not anonymizable due to bad quality. If an invoice is labeled as such, it will count with a score of 0.35. This corresponds roughly to the score of predicting no boxes for an invoice.

To label an invoice as bad quality, just add "bad_quality": true to its label, e.g., {"id": "imagesl_lgq30c00_ti01410925.png", "bad_quality": true}
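A small helper along these lines could assemble the label file; the overall submission structure and the box format shown here are assumptions, and only the bad_quality flag itself comes from this update.

```python
import json

def write_labels(predictions, bad_quality_ids, path="labels.json"):
    """predictions: {image_id: [[x1, y1, x2, y2, class_name], ...]} (assumed format)
    bad_quality_ids: image ids your pipeline decided cannot be anonymized."""
    labels = []
    for image_id, boxes in predictions.items():
        if image_id in bad_quality_ids:
            # Flag format taken from the update above.
            labels.append({"id": image_id, "bad_quality": True})
        else:
            labels.append({"id": image_id, "boxes": boxes})
    with open(path, "w") as f:
        json.dump(labels, f, indent=2)

# Example: flag one unreadable scan, keep predicted boxes for another.
write_labels(
    predictions={
        "imagesl_lgq30c00_ti01410925.png": [],
        "imagesm_mnl50c00_ti31689076.png": [[120, 340, 360, 380, "signature"]],
    },
    bad_quality_ids={"imagesl_lgq30c00_ti01410925.png"},
)
```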

In addition, we updated the zip file containing the scoring algorithm. You can download it here. We made sure that predictions made for the previous scoring will still work with the current algorithm. We also submitted an empty label set flagging every invoice as bad quality to the leaderboard. This gives a baseline prediction of 0.35.

If you have already submitted an entry to the leaderboard, you may see an update to your score based on this new scoring mechanism.

The motivation behind the above changes is that we wanted to simplify the challenge while still maintaining full anonymization of the invoices. In more detail, the merging change is a direct simplification. The removal of invoices with personal addresses and phone numbers is due to those labels occurring in less than 10% of cases. Lastly, the addition of a “bad quality” label is motivated by the fact that in practical applications those invoices can be handled manually.

We hope these updates will help as you continue to develop your solutions. Remember, you can submit an entry to the leaderboard at any time to see how it performs. You can also update your leaderboard entry once per day with any improvements you’ve made. If you have any questions regarding the above changes, please comment directly in the challenge forum.

