Task 4: Information extraction and entity typing from long documents with complex layouts

Task definition

This challenge concerns information acquisition and inference in natural language processing. Extracting information from real, long documents requires handling complex page layouts and integrating entities found across multiple pages, text sections, tables, plots, forms, etc. To encourage progress in deeper and more complex information extraction, we present a dataset in which systems have to find the most important information about different types of entities in formal documents. These entities are not only the standard classes of named entity recognition (NER) systems (e.g. person, location or organisation), but also the roles that entities play in the whole document (e.g. chairman of the board, date of issue).

Data description

The data consists of two folders (train and validate). Each of these folders contains a `.csv` file with ground truth values for each report. For each report, the `.pdf` (raw input), `.txt` (text input) and `.hocr` (text input with positional information) files are placed in the `reports/{report_id}` folder.
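As a rough illustration of working with this layout, the sketch below collects the files of one report and reads word positions from hOCR markup. It assumes that the `reports` folder sits directly inside `train` or `validate`, and that words are marked with the common hOCR class `ocrx_word` and a `bbox x0 y0 x1 y1` entry in the `title` attribute; the names `report_files` and `HocrWords` are hypothetical helpers, not part of the dataset.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# hOCR stores coordinates in the title attribute, e.g. "bbox 10 20 30 40; x_wconf 95".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def report_files(data_dir, report_id):
    """Map each file suffix ('.pdf', '.txt', '.hocr') to its path for one report.

    Assumes the layout <data_dir>/reports/<report_id>/ (an assumption).
    """
    folder = Path(data_dir) / "reports" / str(report_id)
    return {p.suffix: p for p in sorted(folder.iterdir()) if p.is_file()}


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None

    def handle_data(self, data):
        text = data.strip()
        if self._bbox is not None and text:
            self.words.append((text, self._bbox))
            self._bbox = None
```

A parser instance is fed the contents of a `.hocr` file, e.g. `parser = HocrWords(); parser.feed(path.read_text(encoding="utf-8")); parser.close()`, after which `parser.words` holds the words with their boxes.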

The dataset contains 2216 unique records. Two example ground truth rows:

208910;ZAKŁADY MAGNEZYTOWE ROPCZYCE SA;2012-08-30;2012-01-01;2012-06-30;39-100;Ropczyce;ul. Przemysłowa;1;[('2012-08-30', 'Józef Siwiec', 'Prezes Zarządu'), ('2012-08-30', 'Marian Darłak', 'Wiceprezes Zarządu'), ('2012-08-30', 'Robert Duszkiewicz', 'Wiceprezes Zarządu')]
118734;PC GUARD S.A.;2009-08-31;2009-01-01;2009-06-30;60-467;Poznań;Jasielska 16;16;[('2009-08-31', 'Dariusz Grześkowiak', 'Prezes Zarządu'), ('2009-08-31', 'Mariusz Bławat', 'Członek Zarządu')]


The `ground_truth.csv` files included in these directories contain the following columns:

  • *id* - unique identifier of a specific financial report
  • *company* - name of the company
  • *drawing_date* - date which specifies when the financial report was submitted
  • *period_from* - start of the obligation period
  • *period_to* - end of the obligation period
  • *postal_code* - postal code of the company
  • *city* - the city where the company is registered
  • *street* - the name of the street where the company is registered
  • *street_no* - the street (building) number at which the company is registered
  • *people* - members/chairmen of the company management. A cell contains a list of tuples.

Each tuple has the following form: (<date of signature>, <name and surname>, <position>), e.g. ('2019-12-16', 'Jan Kowalski', 'Prezes Zarządu').
For each column, at most 15% of documents may have an incorrect ground truth value.
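The column layout above can be read with a few lines of Python. This is a minimal sketch, assuming the rows are semicolon-separated (as in the example rows) and that the *people* cell is a Python-style literal list; whether the real files start with a header row is not stated here, so the code simply skips a first cell equal to `id`. The names `COLUMNS` and `parse_rows` are hypothetical.

```python
import ast
import csv
import io

# Column order as listed in the description above.
COLUMNS = ["id", "company", "drawing_date", "period_from", "period_to",
           "postal_code", "city", "street", "street_no", "people"]


def parse_rows(text):
    """Parse semicolon-separated ground truth rows into a list of dicts."""
    records = []
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        if not row or row[0] == "id":  # skip blank lines and a possible header
            continue
        record = dict(zip(COLUMNS, row))
        # `people` holds a list of (date, name, position) tuples as a literal.
        people = record.get("people", "")
        record["people"] = ast.literal_eval(people) if people else []
        records.append(record)
    return records


sample = ("118734;PC GUARD S.A.;2009-08-31;2009-01-01;2009-06-30;60-467;Poznań;"
          "Jasielska 16;16;[('2009-08-31', 'Dariusz Grześkowiak', 'Prezes Zarządu'), "
          "('2009-08-31', 'Mariusz Bławat', 'Członek Zarządu')]")

for date, name, position in parse_rows(sample)[0]["people"]:
    print(date, name, position)
```

`ast.literal_eval` is used instead of `eval` so that only literals are accepted from the file.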

Test data

Please click the link below to download the test data:


Submissions will be evaluated according to the F1 measure.
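For orientation, F1 is the harmonic mean of precision and recall. The sketch below computes a micro-averaged F1 over sets of (document id, field, value) items; this item representation, the exact-match comparison and the function name `f1_score` are assumptions made for illustration, and the official scoring may normalise or aggregate values differently.

```python
def f1_score(expected, predicted):
    """Micro-averaged F1 over two collections of (doc_id, field, value) items."""
    expected, predicted = set(expected), set(predicted)
    true_positives = len(expected & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


gold = {
    (118734, "city", "Poznań"),
    (118734, "postal_code", "60-467"),
    (118734, "drawing_date", "2009-08-31"),
}
pred = {
    (118734, "city", "Poznań"),
    (118734, "postal_code", "60-467"),
    (118734, "drawing_date", "2009-08-30"),  # wrong value
    (118734, "street", "Jasielska"),         # not present in gold
}

# 2 correct items: precision 2/4, recall 2/3, so F1 = 4/7
print(round(f1_score(gold, pred), 4))
```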

Coverage in the whole dataset (%):

company-present 88.16
street-present 88.99
drawing_date-present 93.40
postal_code-present 94.70
city-present 98.85
street_no-present 98.92
period_from-present 99.57
period_to-present 99.75
people-present 100.00
