SAP's Data Anonymization Challenge

Protect personally identifying information in semi-structured data while preserving the utility of the dataset for machine learning tasks.

Challenge Overview

Welcome to SAP's Data Anonymization Challenge!

This challenge will award up to $60,000 in total prizes for solutions that can identify personally identifying information (PII) within semi-structured data and then anonymize it.


Why This Matters

Access to data is a major differentiator for businesses in today's global marketplace and can be instrumental in breaking down data silos, extracting business intelligence through machine learning and using AI-driven insights to deliver better experiences. However, pursuing these opportunities depends critically on ensuring data privacy.

Semi-structured text documents, such as invoices, sales orders, and payment advices, are an essential part of many business processes. Translating this semi-structured data into structured data is essential for further downstream processing and automation.

To foster research and development of machine learning approaches for document processing, it is necessary to allow researchers to work with large amounts of realistic documents. To comply with data protection regulation, companies have to anonymize documents and remove any personally identifying information. This redaction should be done in a way that produces realistic-looking documents and minimizes negative impact on machine learning model training.

Openness is a key principle for SAP as it is the foundation for co-innovation and integration. We are embracing open standards and open source, and are providing rapid access to data and business processes through open APIs, so customers and partners can turn data into value as easily as possible. In return, it is crucial for SAP to use the power and speed of communities to innovate even faster. In this spirit, the winning solutions from the challenge will be open-sourced. Openness creates more value for everybody.

Your Challenge

In this challenge, you will work with a set of 25,000 invoices from the public RVL-CDIP Dataset [1,2]. Some of the invoices are (low quality) scans or contain handwritten notes. You are also welcome to use other datasets that are available to you to train your model and maximize its generalizability.

Your tasks are as follows:

  1. Build a model that can identify the bounding boxes of the following types of personally identifying information:
    1. Personal names, and
    2. Signatures and handwritten notes.
  2. Develop a system to redact the content of the bounding boxes with a realistic replacement such that the anonymized data remains effective training data for machine learning tasks. This will require efforts to preserve the style, orientation, imperfections and complexities of the original data. Anonymized substitute text that is simpler and clearer to read than the original data will result in a less comprehensive and less effective training data set.

As ground truth for your model's training, you will be provided with a sample of bounding boxes for the personally identifying information specified above.
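The two tasks can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Box` type, the label names, the grayscale list-of-lists image, and the use of intersection over union (a common way to compare a predicted box with a ground-truth box, not necessarily the metric used in this challenge). The flat background fill in `redact` is only a naive baseline; it does not produce the realistic replacement that Task 2 asks for.

```python
from dataclasses import dataclass
from statistics import median
from typing import List

# Hypothetical image type: grayscale, row-major, pixel values 0..255.
Image = List[List[int]]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int
    label: str  # hypothetical labels, e.g. "name", "signature", "handwriting"


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (0.0 when they do not overlap)."""
    inter_w = max(0, min(a.right, b.right) - max(a.left, b.left))
    inter_h = max(0, min(a.bottom, b.bottom) - max(a.top, b.top))
    inter = inter_w * inter_h
    area_a = (a.right - a.left) * (a.bottom - a.top)
    area_b = (b.right - b.left) * (b.bottom - b.top)
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def redact(image: Image, box: Box) -> Image:
    """Return a copy of a non-empty image with the box filled by the median
    of the one-pixel ring around it (a crude background estimate)."""
    height, width = len(image), len(image[0])
    top, bottom = max(0, box.top), min(height, box.bottom)
    left, right = max(0, box.left), min(width, box.right)

    # Collect the pixels that border the box but lie outside it.
    ring = [
        image[y][x]
        for y in range(max(0, top - 1), min(height, bottom + 1))
        for x in range(max(0, left - 1), min(width, right + 1))
        if not (top <= y < bottom and left <= x < right)
    ]
    fill = int(median(ring)) if ring else 255  # fall back to white

    out = [row[:] for row in image]  # leave the input untouched
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out
```

A competitive solution would replace the flat fill with generated content that matches the style, orientation and imperfections of the original, as described in Task 2.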




Challenge Updates

Results of SAP’s Data Anonymization Challenge

Feb. 7, 2020, 8 a.m. PST by Kyla Jeffrey

Thank you to everyone who participated in SAP’s Data Anonymization Challenge.

Results of Task 1:

  1. Alexandre Wermann: 44.2% (deemed ineligible*)
  2. Ovidiu Dobre: 34.3%
  3. LH42: deemed ineligible**

Results of Task 2:

  1. Ovidiu Dobre
  2. Alexandre Wermann
  3. LH42: deemed ineligible**

Unfortunately, this time no competitor produced a solution that accomplished both Task 1 and Task 2 well enough to be open-sourced.

However, in recognition of participants’ efforts, SAP will distribute the following rewards among the top three contestants:

  1. Alexandre Wermann: $0*
  2. Ovidiu Dobre: $6,666.67 ($5,000 Consolation Prize + $1,666.67 for invitation to final scoring)
  3. LH42: $1,666.67 for invitation to final scoring

*Alexandre Wermann is ineligible to receive prize money as an SAP employee.

**LH42’s submission was deemed ineligible due to a dependency on external services that are not part of the required test environment.

Two Days Remaining!

Dec. 10, 2019, 7:22 p.m. PST by Kyla Jeffrey

This is your official two-day reminder!

Please ensure you submit your entries to the Task 1 Leaderboard by Thursday, December 12 at 5 pm Pacific Time (Los Angeles). This is your last opportunity to do so! Here's what you need to know:

  • You do not need to have submitted previously to submit now.
  • You can submit once per day, before 12 noon Pacific Time, and have your submission scored. We strongly recommend submitting early to ensure your submission is in the correct format for scoring.
  • Review the Challenge Guidelines in full to ensure you are clear on all requirements.
  • If you have any questions, please review the challenge forum and the recording from the Q&A webinar.

Final Scoring Submission Requirements

Following Thursday's submission deadline, top teams will be invited to the Final Scoring. If you are invited to the final scoring, you will have one week to submit your files. We have updated the files required to make the submission easier for finalists. Here's what you will need to submit:


  1. One zip file containing the following for Task 1 & Task 2:
    1. Code that is executable by SAP on the environment specified in Test Environment of the Challenge Guidelines.
    2. README file with clear and concise instructions for running the code. Please also provide the expected time for training and the expected time for inference.
    3. For any machine learning model you are using, you should provide us with the trained model as well as instructions on how to retrain the model.
    4. Source code and documentation.
  2. Description of your algorithm and approach (1 - 5 page PDF document). Please include any citations to existing research, list any additional datasets you have used, and provide links to those datasets.

Submissions due Thursday!

Dec. 9, 2019, 11:16 p.m. PST by Kyla Jeffrey

Reminder that your submission to the Task 1 Leaderboard is due Thursday at 5 pm Pacific Time (Los Angeles).

You don't need to have submitted prior to now in order to participate. However, you must submit by this deadline to remain in the competition. 

For more information, please review the tips and the overview of what to expect that were shared last week.

Deadline to Submit: December 12th

Dec. 6, 2019, 11:47 a.m. PST by Kyla Jeffrey


The deadline to submit your Task 1 Leaderboard Submission is December 12th at 5 pm PT (Los Angeles time). 

You must submit by this deadline in order to be invited to the final scoring. You do not need to have submitted to the Task 2 Leaderboard to submit to this deadline.



  1. Ensure you submit to the Task 1 Leaderboard early if you have not already. This will ensure that you have submitted in a format that allows us to score your entry. If you receive a -1 on the leaderboard, your submission was not scored. Please reach out to us if this is the case for you.
  2. The leaderboard is scored every day at 12 noon Pacific Time (Los Angeles). You can submit once per day to be rescored. Please wait until your entry has been scored before bringing it back into editing mode and submitting it again.
  3. Review the Challenge Guidelines in full to ensure you are clear on all requirements.
  4. If you have any questions, please review the challenge forum and the recording from the Q&A webinar.


What can you expect over the next few weeks?

Task 1 Leaderboard Closes and Finalists Invited to Final Scoring: 12/12/2019 at 5:00 pm Pacific Time

  • Top scoring competitors from the Task 1 leaderboard will be invited to submit to the Final Scoring as finalists.
  • The finalists will have one week to prepare their final submission.

Finalists’ Deadline for Submitting to Final Scoring: 12/19/2019 at 5:00 pm Pacific Time

  • Final submissions due for all finalists invited to the Final Scoring. 
  • Please ensure you have some time available between 12/12 and 12/19 to submit to final scoring. The details for this are in the Challenge Guidelines.

Evaluation Period: 12/20/2019 to early 2020

  • SAP will score finalist submissions as per the Scoring Metrics below.
  • Finalists must be available to debug and troubleshoot their code during this period to ensure SAP can successfully execute it in the specified Test Environment. Failure to do so will result in disqualification.


Task 2 Leaderboard Images

Dec. 2, 2019, 8:06 p.m. PST by Kyla Jeffrey

Curious how the teams are doing on anonymizing images for Task 2? Here is the first image anonymized by our three finalists. 

Please note that the leaderboard rank was based on all three images and we are only displaying one image here.

1st Place: 

2nd Place:

3rd Place: 
