# SAP Data Anonymization Challenge

## Downloading the Data Set

Run the following command. It requires `convert` from [ImageMagick](https://imagemagick.org/).

```bash
./download.sh
```

This script should generate a directory `invoices/` containing 25,000 PNG files.
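
To confirm that the download completed, the number of PNG files can be counted. The following is a minimal sketch, not part of the challenge scripts; the function name `count_png_files` is hypothetical, and it assumes the images sit directly inside `invoices/` with a `.png` extension.

```python
from pathlib import Path


def count_png_files(directory: str = "invoices") -> int:
    """Count regular files with a .png extension directly inside `directory`.

    Returns 0 if the directory does not exist.
    """
    root = Path(directory)
    if not root.is_dir():
        return 0
    # Assumption: images are stored flat, not in subdirectories.
    return sum(
        1 for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".png"
    )


if __name__ == "__main__":
    print(f"invoices/ contains {count_png_files('invoices')} PNG files")
```

A count below the expected 25,000 suggests that the download or the conversion step was interrupted.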

## Training data

The file `labels.jsonl` contains labels for 1000 invoices. The field `id` matches the filename in `invoices/`.
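
As a sanity check, the labels can be loaded and their `id` values compared against the files in `invoices/`. This is a sketch under assumptions: the helper names `load_labels` and `missing_ids` are hypothetical, and because it is not stated whether `id` includes the `.png` extension, the check accepts both forms.

```python
import json
from pathlib import Path


def load_labels(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def missing_ids(records, directory="invoices"):
    """Return the ids that have no matching file in `directory`."""
    root = Path(directory)
    known = set()
    if root.is_dir():
        for p in root.iterdir():
            # Assumption: an id may be the full filename or the name
            # without its extension, so both are accepted.
            known.add(p.name)
            known.add(p.stem)
    return [r["id"] for r in records if str(r["id"]) not in known]


if __name__ == "__main__":
    if Path("labels.jsonl").exists():
        labels = load_labels("labels.jsonl")
        print(f"{len(labels)} labels, {len(missing_ids(labels))} without an image")
```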

## Task 1 Sample

`task1_sample.py` is a dummy solution for task 1. It creates correctly formatted, random labels.
For instance, the following command will create random labels for the invoices in `labels.jsonl`.

```bash
jq -r .id labels.jsonl | ./task1_sample.py -  > labels_sample.jsonl
```

This command requires [jq](https://stedolan.github.io/jq/).
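
If `jq` is not available, the ID extraction step can be approximated in Python. The sketch below assumes that every line of `labels.jsonl` is a JSON object with a string `id` field; the script name `extract_ids.py` and the function `iter_ids` are hypothetical.

```python
import json
import sys


def iter_ids(lines):
    """Yield the `id` field of each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)["id"]


# Usage (hypothetical file name): python3 extract_ids.py labels.jsonl
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as fh:
        for invoice_id in iter_ids(fh):
            print(invoice_id)
```

Saved as `extract_ids.py`, its output could be piped into the sample script in place of the `jq` call: `python3 extract_ids.py labels.jsonl | ./task1_sample.py - > labels_sample.jsonl`.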

## Task 1 Scoring

A sample implementation of the scoring algorithms is contained in `task1_score.py` and `metrics.py`.
The following command scores the random labels in `labels_sample.jsonl` against the reference labels in `labels.jsonl`.

```bash
./task1_score.py labels.jsonl labels_sample.jsonl
```

## Task 2 Sample

`task2_sample.py` is a dummy solution for task 2. It redacts sensitive information by drawing a solid box over it.

It can be run as follows.

```bash
./task2_sample.py labels.jsonl out
```

This command will save the redacted images in the directory `out`.

This script also has an option to outline the bounding boxes for inspection. This option is enabled by adding the `--box outline` parameter, e.g., `./task2_sample.py --box outline labels.jsonl out2`.
