This challenge will award up to $60,000 in total prizes for solutions that can identify personally identifying information (PII) within semi-structured data and then anonymize it.
Access to data is a major differentiator for businesses in today's global marketplace and can be instrumental in breaking down data silos, extracting business intelligence through machine learning, and using AI-driven insights to deliver better experiences. However, a critical consideration in pursuing these opportunities is ensuring data privacy.
Semi-structured text documents, such as invoices, sales orders, or payment advices, are an essential part of many business processes. Translating this semi-structured data into structured data is essential to allow further downstream processing and automation.
To foster research and development of machine learning approaches for document processing, it is necessary to allow researchers to work with large amounts of realistic documents. To comply with data protection regulation, companies have to anonymize documents and remove any personally identifying information. This redaction should be done in a way that produces realistic-looking documents and minimizes negative impact on machine learning model training.
Openness is a key principle for SAP as it is the foundation for co-innovation and integration. We are embracing open standards and open source, and are providing rapid access to data and business processes through open APIs, so customers and partners can turn data into value as easily as possible. In return, it is crucial for SAP to use the power and speed of communities to innovate even faster. In this spirit, the winning solutions from the challenge will be open-sourced. Openness creates more value for everybody.
In this challenge, you will work with a set of 25,000 invoices from the public RVL-CDIP Dataset [1,2]. Some of the invoices are (low quality) scans or contain handwritten notes. You are also welcome to use other datasets that are available to you to train your model and maximize its generalizability.
Your tasks are as follows:
As ground truth for your model's training, you will be provided with a sample of bounding boxes of the personally identifying information as specified above.
SAP’s Data Anonymization Challenge will award up to $60,000 in total prizes.
In addition to the monetary prizes offered, the winner will be invited to a special event by SAP.
A successful solution will accomplish the following:
Personally Identifying Information is defined as:
Pre-Registration Launches: 9/29/2019
Challenge Details Released: 9/5/2019
Task 1 Leaderboard Opens: 9/30/2019
Task 2 Leaderboard Submission Deadline: 11/19/2019 at 5:00 pm Pacific Time
Task 1 Leaderboard Closes and Finalists Invited to Final Scoring: 12/12/2019 at 5:00 pm Pacific Time
Finalists’ Deadline for Submitting to Final Scoring: 12/19/2019 at 5:00 pm Pacific Time
Evaluation Period: 12/20/2019 to early 2020
Winners Announced: Early 2020
Example of Bounding Box for Task 1:
Example of Redacted Invoice for Task 2:
Please note the colored boxes and labels are for illustrative purposes only and should not be included in your submission.
Task 1 Leaderboard Submission:
Task 2 Leaderboard Submission:
Final Submission for Finalists Invited to Final Scoring:
The labels for the bounding boxes should be written to a file containing one JSON document per line. Each JSON document is an object with the keys id, bounding_boxes, and bad_quality. The value of id is a string that identifies the PNG image of the invoice. The value of bounding_boxes is an array of JSON objects, each denoting one bounding box. Each bounding box object has the keys x0, y0, x1, y1, and label. The values of x0, y0, x1, and y1 are floating-point numbers in the unit interval that give the coordinates of the bounding box as a ratio. The value of label must be one of "name" or "handwritten" and denotes the type of content in the bounding box. The value of bad_quality is a boolean indicating that the invoice could not be processed due to bad quality.
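As an illustration of the schema described above, the following minimal sketch builds one record, checks it against the stated rules, and serializes it as a single JSON line. The id value and the validate_record helper are hypothetical and not part of the challenge materials; the coordinates are made up.

```python
import json

# Labels named in the schema description above.
ALLOWED_LABELS = {"name", "handwritten"}


def validate_record(record):
    """Return True if the record follows the schema described in the text."""
    if not isinstance(record.get("id"), str):
        return False
    if not isinstance(record.get("bad_quality"), bool):
        return False
    boxes = record.get("bounding_boxes")
    if not isinstance(boxes, list):
        return False
    for box in boxes:
        if box.get("label") not in ALLOWED_LABELS:
            return False
        for key in ("x0", "y0", "x1", "y1"):
            value = box.get(key)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            if not 0.0 <= value <= 1.0:
                return False
    return True


record = {
    "id": "example_invoice",  # hypothetical id, for illustration only
    "bounding_boxes": [
        {"x0": 0.10, "y0": 0.20, "x1": 0.35, "y1": 0.24, "label": "name"},
    ],
    "bad_quality": False,
}

# One JSON document per line: json.dumps emits no newlines by default.
line = json.dumps(record)
```

Writing one such line per invoice yields a file in the one-document-per-line layout described above.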
Your code will be tested in the following environment:
Please ensure your solution works without modification in the test environment.
Task 1 and Task 2 will be scored separately. All teams will be scored on the first task, but only the top scorers of the first task will be invited to Final Scoring and receive scoring for the second task.
Task 1: Scoring for Task 1 will be based on the number of predicted bounding boxes that match the ground truth in both label and box placement. We say that the placement of a bounding box B1 matches that of another bounding box B2 if their intersection over union is greater than 0.8. In detail, let a(B) be the area of the bounding box B, and let B1 ∩ B2 and B1 ∪ B2 be the intersection and union of B1 and B2, respectively. Then B1 matches B2 if a(B1 ∩ B2) / a(B1 ∪ B2) > 0.8.
For an invoice, a predicted bounding box matches a ground truth bounding box if both the labels and the box placements match.
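The matching rule above can be sketched in a few lines. This is an illustrative reading of the definition, not the provided scorer; it assumes boxes are dictionaries in the submission format with x0 <= x1 and y0 <= y1, and the function names are hypothetical.

```python
def area(x0, y0, x1, y1):
    """Area of an axis-aligned box; zero if the box is empty."""
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def iou(b1, b2):
    """Intersection over union of two boxes given as dicts with x0, y0, x1, y1."""
    inter = area(
        max(b1["x0"], b2["x0"]),
        max(b1["y0"], b2["y0"]),
        min(b1["x1"], b2["x1"]),
        min(b1["y1"], b2["y1"]),
    )
    union = (
        area(b1["x0"], b1["y0"], b1["x1"], b1["y1"])
        + area(b2["x0"], b2["y0"], b2["x1"], b2["y1"])
        - inter
    )
    return inter / union if union > 0 else 0.0


def boxes_match(pred, truth, threshold=0.8):
    """A prediction matches if the labels agree and the IoU exceeds the threshold."""
    return pred["label"] == truth["label"] and iou(pred, truth) > threshold
```

Note that the comparison is strict: a pair with an intersection over union of exactly 0.8 does not count as a match under the definition above.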
Let m be the number of matching bounding boxes. Further, let n_gt be the number of ground truth bounding boxes, and let n_pred be the number of predicted bounding boxes. The score for a document and category of PII is defined as score = (m / n_gt) * 0.75^(n_pred - m) for n_gt > 0 and score = 0.75^(n_pred - m) for n_gt = 0. So the score is the ratio of correctly identified boxes, penalized by 25% for each additionally predicted box that does not appear in the ground truth data.
The score for the whole document is the average of the scores for each category of PII.
Documents flagged as bad quality will receive a score of 0.35 regardless of the bounding boxes.
We have provided a sample implementation of the above scorer in the download section.
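The scoring formulas above can also be written out directly. The sketch below is one reading of the text and is not the provided task1_score.py; the function names are hypothetical, the match count m is taken as given, and the bad-quality flag is simply passed in as an argument.

```python
def category_score(m, n_gt, n_pred):
    """Score for one document and one PII category.

    m:      number of matching bounding boxes
    n_gt:   number of ground truth bounding boxes
    n_pred: number of predicted bounding boxes
    """
    # Each predicted box without a match multiplies the score by 0.75.
    penalty = 0.75 ** (n_pred - m)
    if n_gt > 0:
        return m / n_gt * penalty
    return penalty


def document_score(counts_by_category, bad_quality=False):
    """Average of the category scores, or 0.35 for a document flagged as bad quality.

    counts_by_category maps a category label to a tuple (m, n_gt, n_pred).
    """
    if bad_quality:
        return 0.35
    scores = [category_score(*counts) for counts in counts_by_category.values()]
    return sum(scores) / len(scores)
```

For example, finding one of two "name" boxes while predicting one extra box gives 0.5 * 0.75 = 0.375 for that category; if the document has no "handwritten" boxes and none are predicted, that category scores 1, and the document score is the average of the two.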
Task 2: The second task will be scored by assessing the realism of the anonymized invoices. We will use your submission of anonymized invoices from the provided dataset and other sources. We will manually check if anonymized invoices can be distinguished from real invoices.
Your submission to the second task has to be designed in such a way that the content of the new bounding box does not contain any personally identifying information from the original document. Please make sure to include an explanation in your model description.
For this challenge, you will work with a set of 25,000 invoices and invoice-like documents from the RVL-CDIP Dataset [1,2]. Some of the invoices are (low quality) scans or contain handwritten notes.
Additionally, we have labeled invoices never previously seen by competitors that will be used as the test dataset for the leaderboard and final scoring.
ZIP-File containing the download-script, labels for task 1, an implementation of the scorer for task 1, and dummy implementations of task 1 and 2 - Access Here.
Downloading of Data Set
Downloading the dataset could take a few hours. Run the following command. It requires the convert tool from ImageMagick.
This script should generate a directory invoices/ containing 25,000 PNG files.
The file labels.jsonl contains labels for 1000 invoices. The field id matches the filename in invoices/.
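A minimal sketch for loading the labels, assuming labels.jsonl uses the same one-JSON-document-per-line layout as the submission format described earlier. The parse_labels helper is hypothetical and not part of the provided ZIP file.

```python
import json


def parse_labels(lines):
    """Parse an iterable of JSON lines into a dict keyed by the id field."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        labels[record["id"]] = record
    return labels


# Typical use, once the ZIP file has been unpacked:
# with open("labels.jsonl", encoding="utf-8") as handle:
#     labels = parse_labels(handle)
```

Keying the records by id makes it easy to look up the labels for a given file in invoices/.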
Task 1 Sample
task1_sample.py is a dummy solution for task 1. It creates correctly formatted, random labels. For instance, the following command will create random labels for the invoices in labels.jsonl.
jq -r .id labels.jsonl | ./task1_sample.py - > labels_sample.jsonl
This command requires jq.
Task 1 Scoring
A sample implementation of the scoring algorithms is contained in task1_score.py and metrics.py. The following command will score the random labels created against the reference labels.
./task1_score.py labels.jsonl labels_sample.jsonl
Task 2 Sample
task2_sample.py is a dummy solution for task 2. It redacts sensitive information by drawing a solid box over it.
It can be run as follows.
./task2_sample.py labels.jsonl out
This command will save the redacted images in the directory out/.
This script also has an option to outline the bounding boxes for inspection. This option is triggered by adding the --box outline parameter, i.e., ./task2_sample.py --box outline labels.jsonl out2.
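Drawing over a labeled region requires converting the ratio coordinates of a bounding box into pixel coordinates. The sketch below shows one way to do this, assuming x values scale with the image width and y values with the image height; the to_pixel_box helper and the rounding choice are assumptions, not taken from task2_sample.py.

```python
import math


def to_pixel_box(box, width, height):
    """Convert a ratio-based box to integer pixel coordinates.

    The lower bounds are rounded down and the upper bounds are rounded up,
    so that the pixel box fully covers the labeled region.
    """
    left = max(0, math.floor(box["x0"] * width))
    top = max(0, math.floor(box["y0"] * height))
    right = min(width, math.ceil(box["x1"] * width))
    bottom = min(height, math.ceil(box["y1"] * height))
    return left, top, right, bottom
```

Rounding outward errs on the side of covering slightly more than the labeled region rather than leaving a sliver of the original content visible.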
To be eligible to compete, you must comply with all the terms of the challenge as defined in the Challenge-Specific Agreement.
Submissions must be made in English. All challenge-related communication will be in English.
No specific qualifications or expertise in the field of data anonymization is required.
Registration and Submissions:
Intellectual Property Rights:
If Challenge Sponsor notifies Innovator that Submission is eligible for a Prize, Innovator will be considered qualified as a finalist (“Finalist”). Challenge Sponsor will require all content and assets developed by Finalists as part of their Submissions to be licensed under the Creative Commons CC BY (4.0) license. Challenge Sponsor will also require all code developed by Finalists as part of their Submissions to be licensed under the Apache License 2.0. Once all development has been completed, all designs, code, content, and assets developed by all Finalists will be released under the Creative Commons CC BY (4.0) and Apache License 2.0.
Selection of Winners:
Prizes will be awarded based on the Winning Criteria section above. In the case of a tie, the winner(s) will be selected based on the highest votes from the Judges.
In the case of no winner, SAP reserves the right to withhold the Prize amount. In place of the original prize amount, SAP will issue a Consolation Prize of at least $5,000 USD to the team or individual closest to the winning solution and a Consolation Prize of $2,000 USD to the second closest team or individual.
[1] A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015. https://www.cs.cmu.edu/~aharley/rvl-cdip/
[2] The Legacy Tobacco Document Library (LTDL), University of California, San Francisco, 2007. http://legacy.library.ucsf.edu/