Task 1: Post-editing and rescoring of automatic speech recognition results
The goal of this task is to create a system that converts a sequence of words produced by a specific automatic speech recognition (ASR) system into another sequence of words that more accurately reflects the actual spoken utterance. The training data will consist of utterance pairs, each containing the ASR output for an utterance and the correct transcription of that utterance.
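As an illustration of what an utterance pair might look like in code, the following is a minimal Python sketch. The class name `UtterancePair`, its field names and the example sentences are hypothetical and are not taken from the task data. The word-level error rate shown here is only an assumed way of comparing a hypothesis with its reference; the description above does not state which evaluation metric is used.

```python
from dataclasses import dataclass


@dataclass
class UtterancePair:
    # Hypothetical field names, chosen for illustration only.
    asr_output: str  # word sequence produced by the ASR system
    reference: str   # correct transcription of the same utterance


def word_edit_distance(hyp, ref):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def word_error_rate(pair):
    """Word edits needed to reach the reference, per reference word."""
    ref = pair.reference.split()
    hyp = pair.asr_output.split()
    if not ref:
        raise ValueError("reference transcription is empty")
    return word_edit_distance(hyp, ref) / len(ref)


# Invented example pair: the ASR output is missing one word.
pair = UtterancePair(asr_output="the cat sat on mat",
                     reference="the cat sat on the mat")
print(word_error_rate(pair))
```

Under this assumed metric, a post-editing system would be useful if its output had a lower error rate against the reference than the raw ASR output.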
Task 2: Morphosyntactic tagging of Middle, New and Modern Polish
Morphosyntactic disambiguation is a classic NLP problem. For nearly ten years the development and evaluation of morphosyntactic taggers for Polish have focused on a single dataset, namely NKJP1M. Our shared task provides an opportunity to build new systems, or tune existing ones, and test them in a slightly different environment of more diverse and less standardised historical data. Although the data may seem atypical for everyday applications of NLP tools, the best performing solutions may be deployed in a growing number of projects aimed at building historical corpora of various periods of Polish.
Task 3: Word sense disambiguation
The main aim of the task is to identify the correct meaning (sense) of each ambiguous word appearing in the text. In this task we would like to focus on two kinds of approaches to word sense disambiguation: knowledge-based approaches, which use available language resources (mainly wordnets and existing thesauri), and weakly supervised approaches, which combine a small amount of annotated data (a seed) with bootstrapping methods (semi-supervised learning) and knowledge extracted from large unstructured textual corpora.
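To make the knowledge-based setting more concrete, here is a minimal sketch of a simplified Lesk-style heuristic in Python. It picks the sense whose dictionary gloss shares the most words with the context. This is one well-known knowledge-based technique, shown only as an illustration and not as a method prescribed by the task. The sense inventory `SENSES`, its sense identifiers, its glosses and the stopword list are invented for the example; a real system would draw glosses from a wordnet or thesaurus.

```python
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
             "that", "is", "for", "at"}

# Toy sense inventory (hypothetical identifiers and glosses).
SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits of money and gives loans",
        "bank.river": "the sloping land beside a river or other body of water",
    }
}


def tokenize(text):
    """Lowercase the text and return its set of words without punctuation."""
    words = {w.strip(".,;:!?()").lower() for w in text.split()}
    return words - {""}


def simplified_lesk(word, context, inventory=SENSES):
    """Return the sense of `word` whose gloss overlaps most with the context."""
    context_words = tokenize(context) - STOPWORDS - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in inventory[word].items():
        gloss_words = tokenize(gloss) - STOPWORDS
        overlap = len(context_words & gloss_words)
        if overlap > best_overlap:  # on a tie, the earlier sense is kept
            best_sense, best_overlap = sense_id, overlap
    return best_sense


print(simplified_lesk("bank", "She deposited money at the bank to repay her loans"))
print(simplified_lesk("bank", "They walked along the river bank near the water"))
```

Exact word overlap is a crude signal, since inflected forms do not match their glosses, so a practical system for a morphologically rich language would lemmatise both the context and the glosses first.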
Task 4: Information extraction and entity typing from long documents with complex layouts
The challenge is about information acquisition and inference in the field of natural language processing. Collecting information from real, long documents requires dealing with complex page layouts and integrating entities found across multiple pages, text sections, tables, plots, forms, etc. To encourage progress in deeper and more complex information extraction, we present a dataset in which systems have to find the most important information about different types of entities in formal documents. These entities are not only the standard classes of named entity recognition (NER) systems (e.g. person, location or organisation), but also the roles that entities play in whole documents (e.g. chairman of the board, date of issue).