Alpana Samanta

Understanding Dataset Usage

We are the ML Challengers team and are currently working on SAP's Data Anonymization Challenge.
We need some details in order to complete the task.

1. We have built a generalized ML algorithm to identify personal information and the corresponding bounding boxes.
However, we have the following concerns:
1) The dataset contains train, test and val files, and we are quite confused about how to use them. Their content is just the image file path and the file type, which may not be sufficient for identifying the personal information in a document. It would be useful if we had to identify the file type, but that is not mentioned in the challenge description/scope.
2) Regarding the handwritten text, can we assume it is only in English?
3) Do we have to work only with the invoice file type? It is the only document type referred to in the challenge.

Please provide us with the necessary inputs so that we can proceed further.
Thanks in advance,
ML Challengers Team
Alexander Kreuzer
Hi Alpana,

1)+3) The labels relevant for this challenge are contained in labels.jsonl.
The challenge will work only on the invoices from rvl-cdip.tar.gz.
labels_only.tar.gz is only needed for selecting those invoices.
The invoices will be used in PNG format and you would have to convert them.

You can run ./ to do all that for you.
After running this script you can find the invoices in the directory ./invoices.
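
As a rough sanity check once the script has finished, something like the following Python sketch could be used. It assumes that `labels.jsonl` is in JSON Lines format (one JSON document per line, as the extension suggests) and that the PNG files sit somewhere under `./invoices`. The function names are made up for illustration, and since the fields inside each record are not described here, the sketch only reports the keys it finds.

```python
import json
from pathlib import Path


def load_labels(path):
    """Parse a JSON Lines file: one JSON document per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


def count_pngs(directory):
    """Count PNG files anywhere below the given directory."""
    return sum(
        1
        for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() == ".png"
    )


def summarize(root="."):
    """Collect a few numbers about the downloaded data (hypothetical helper)."""
    root = Path(root)
    labels = load_labels(root / "labels.jsonl")
    first = labels[0] if labels else None
    return {
        "label_records": len(labels),
        # The record schema is an unknown here, so only the keys are listed.
        "first_record_keys": sorted(first) if isinstance(first, dict) else None,
        "png_files": count_pngs(root / "invoices"),
    }
```

Calling `summarize()` from the directory in which the script was run should give a quick impression of whether the download and conversion went through.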

2) The majority of the handwriting is in English. However, we do not guarantee that all handwriting is.

Let me know if you have any further questions.

- Alex

Aquib Azim
I believe an illustrative tutorial showing the process and results of running bash -> ./ would be very helpful, so that people understand what the outcome looks like and we know how well our milestone is going or could go.

Alexander Kreuzer
Hi Aquib,

I will give some more info on running the download script in this and the following posts.
Attached here is a screenshot of a Linux system after extracting the zip that you can find in the resources section.

Alexander Kreuzer
After running the download script via `bash` (this might take several hours), you will arrive at something like what is shown in the following screenshot.
As you can see, several files were created.
Only `labels.jsonl` and the directory `invoices` are important for this challenge. If you like, you can delete the directory `images` and the files `labels.tar.gz` and `rvl-cdip.tar.gz`.
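
If you prefer to do that cleanup from Python, a minimal sketch could look like the one below. It uses only the three names mentioned above; the function name and the `dry_run` switch are made up for illustration. By default it only reports what it would remove, so nothing is deleted by accident.

```python
import shutil
from pathlib import Path

# Names of the intermediate artifacts as given in this post.
INTERMEDIATE = ("images", "labels.tar.gz", "rvl-cdip.tar.gz")


def remove_intermediate(root=".", dry_run=True):
    """Remove the intermediate download artifacts below root.

    Returns the list of paths that were (or, with dry_run, would be) removed.
    """
    affected = []
    for name in INTERMEDIATE:
        target = Path(root) / name
        if not target.exists():
            continue  # already gone or never created
        affected.append(target)
        if dry_run:
            continue
        if target.is_dir():
            shutil.rmtree(target)  # directory: remove recursively
        else:
            target.unlink()  # regular file
    return affected
```

Run `remove_intermediate()` first to see the list, then `remove_intermediate(dry_run=False)` to actually free the disk space. `labels.jsonl` and `invoices` are not touched.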

I hope this helps. If you have any more questions, please do not hesitate to ask.

- Alex
Modified on Oct. 30, 2019, 6:03 p.m. PDT
