PolEval 2020 :: Tasks

Task 1: Post-editing and rescoring of automatic speech recognition results

The goal of this task is to build a system that converts a sequence of words produced by a specific automatic speech recognition (ASR) system into another sequence of words that more accurately reflects the actual spoken utterance. The training data will consist of utterance pairs, each containing the ASR output for an utterance and the correct transcription of that utterance.
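
To make the shape of the training data concrete, the sketch below compares one ASR output with its reference transcription and lists the word-level edits a post-editing system would have to produce. It is only an illustration: the function name `word_edits` and the example utterances are invented here, and the use of Python's `difflib` for alignment is an assumption, not the method or data format prescribed by the task.

```python
from difflib import SequenceMatcher


def word_edits(asr_output: str, reference: str):
    """List the word-level edits that turn the ASR output into the reference.

    Hypothetical helper for illustration; not part of the task's tooling.
    """
    hyp, ref = asr_output.split(), reference.split()
    # autojunk=False keeps frequent words from being ignored during matching.
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    return [
        (tag, hyp[i1:i2], ref[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented utterance pair: (ASR output, correct transcription).
pair = ("to jest test owy plik", "to jest testowy plik")
for tag, wrong, right in word_edits(*pair):
    print(tag, wrong, "->", right)
```

Each returned tuple names the operation (`replace`, `delete` or `insert`), the words in the ASR output, and the words that should appear instead.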

Task details

Task 2: Morphosyntactic tagging of Middle, New and Modern Polish

Morphosyntactic disambiguation is one of the classic NLP problems. For nearly ten years, the development and evaluation of morphosyntactic taggers for Polish have focused on a single dataset, namely NKJP1M. Our shared task provides an opportunity to build new systems, or tune existing ones, and test them in a slightly different setting of more diverse and less standardised historical data. Although the data may seem unusual and atypical for everyday applications of NLP tools, the best performing solutions may be deployed in a growing number of projects aimed at building historical corpora of various periods of Polish.

Task details

Task 3: Word sense disambiguation

The main aim of the task is to identify the correct meaning (sense) of each ambiguous word appearing in the text. In this task we would like to focus on two families of methods: knowledge-based approaches to word sense disambiguation, which use available language resources (mainly wordnets and existing thesauri), and weakly supervised approaches, which combine a small amount of annotated data (a seed) with bootstrapping methods (semi-supervised learning) and knowledge extracted from large unstructured textual corpora.
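
As a rough illustration of what a knowledge-based approach can look like, the sketch below implements a simplified Lesk-style heuristic: it picks the sense whose gloss shares the most words with the context. The sense inventory, the sense labels and the glosses are invented for this example and stand in for a real wordnet. The function does no lemmatisation, which a realistic system for Polish would need.

```python
def simplified_lesk(word: str, context: str, sense_glosses: dict) -> str:
    """Pick the sense whose gloss overlaps most with the context words.

    Hypothetical helper; `sense_glosses` maps a sense label to its gloss.
    """
    context_words = {w.lower() for w in context.split()} - {word.lower()}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        # Strict comparison: on a tie, the sense listed first wins.
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Toy, invented inventory for the ambiguous Polish noun "zamek".
ZAMEK_SENSES = {
    "castle": "warowna budowla mieszkalna króla lub rycerza",
    "lock": "urządzenie do zamykania drzwi na klucz",
}

print(simplified_lesk("zamek", "włożył klucz w zamek i otworzył drzwi", ZAMEK_SENSES))
```

Here the context shares "klucz" and "drzwi" with the second gloss and nothing with the first, so the "lock" sense is selected.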

Task details

Task 4: Information extraction and entity typing from long documents with complex layouts

The challenge is about information acquisition and inference in the field of natural language processing. Collecting information from real, long documents means dealing with complex page layouts and integrating the entities found across multiple pages and text sections, tables, plots, forms, etc. To encourage progress in deeper and more complex information extraction, we present a dataset in which systems have to find the most important information about different types of entities in formal documents. These entities are not only the standard classes used by named entity recognition (NER) systems (e.g. person, location or organisation), but also the roles that entities play within whole documents (e.g. chairman of the board, date of issue).

Task details
