Results
Task 1: Post-editing and rescoring of ASR results
Gold-standard data
Results
The number of submissions exceeded our expectations! We are amazed at how many experiments the participants managed to perform, and even where a result was not the best, the conclusions will make a valuable contribution to this area of research. To this end, we would like to encourage all the authors to prepare at least a short write-up of their chosen method(s).
Designing an interesting competition turned out to be more difficult than anticipated. If the ASR system used were too good, the post-editing problem would be too minor to be interesting; a very poor system, on the other hand, would make the language too difficult to analyze. As a compromise, the chosen system had an average word error rate (among the systems we tested) of 27.6%. Participants who decided to use the lattice as their input could count on an oracle word error rate of 17.7%, which is effectively a floor for the error rate of the given ASR system. In other words, the system likely made many errors that were unrecoverable from the post-editing perspective, so an improvement of even a few absolute percentage points makes a significant difference.
The results were very close in a few cases. To make the assessment a bit more interesting, we calculated both the error rate against the reference (which was not known to the submitters) and the rate of changes against the single best ASR output (which was known to the submitters). The latter shows how much each submission changed the original, i.e. how "brave" it was, because even a small number of changes could yield a decent result.
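As a rough illustration of how these two numbers can be obtained, below is a minimal sketch in Python. It assumes simple whitespace tokenisation and plain word-level edit distance; it is not the organisers' evaluation script, and the example strings are made up.

```python
# Minimal sketch: word error rate (WER) against the hidden reference and the
# rate of changes against the ASR one-best output. Tokenisation and
# normalisation are simplified assumptions; the official scoring may differ.

def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,              # drop a hypothesis word
                curr[j - 1] + 1,          # drop a reference word
                prev[j - 1] + (h != r),   # substitute or keep
            ))
        prev = curr
    return prev[-1]

def error_rate(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

post_edited  = "ala ma kota i psa"     # hypothetical post-edited transcript
reference    = "ala ma kota oraz psa"  # hypothetical gold reference
asr_one_best = "ala ma kota psa"       # hypothetical ASR one-best output

print(f"WER:     {100 * error_rate(post_edited, reference):.1f}%")
print(f"Changes: {100 * error_rate(post_edited, asr_one_best):.1f}%")
```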
Without further ado, the results were as follows:
Submission | Affiliation | WER % | Changes % |
---|---|---|---|
KRS + spaces | UJ, AGH | 25.9 | 3.6 |
KRS | UJ, AGH | 26.9 | 1.6 |
Polbert | https://skok.ai/ | 26.9 | 2.1 |
BiLSTM-CRF edit-operations tagger | Adam Mickiewicz University | 24.7 | 6.2 |
base-4g-rr | Samsung R&D Institute Poland | 27.7 | 2.0 |
t-REx_k10 | Uniwersytet Wrocławski | 24.9 | 14.2 |
t-REx_k5 | Uniwersytet Wrocławski | 25.0 | 14.2 |
t-REx_fbs | Uniwersytet Wrocławski | 24.31 | 17.2 |
PJA_CLARIN_1k | Polish-Japanese Academy of Information Technology | 33.5 | 9.1 |
PJA_CLARIN_10k | Polish-Japanese Academy of Information Technology | 32.0 | 9.6 |
PJA_CLARIN_20k | Polish-Japanese Academy of Information Technology | 31.8 | 9.9 |
PJA_CLARIN_40k | Polish-Japanese Academy of Information Technology | 31.8 | 10.3 |
PJA_CLARIN_50k | Polish-Japanese Academy of Information Technology | 31.8 | 10.2 |
CLARIN_SEJM_40k | Polish-Japanese Academy of Information Technology | 33.7 | 19.1 |
CLARIN_SEJM_50k | Polish-Japanese Academy of Information Technology | 32.5 | 17.7 |
MLM+bert_base_polish | | 73.9 | 2.1 |
tR-Ex_xk | Uniwersytet Wrocławski, Instytut Informatyki | 25.7 | 18.1 |
tR-Ex_fbs | Uniwersytet Wrocławski, Instytut Informatyki | 24.31 | 17.2 |
tR-Ex_fx | Uniwersytet Wrocławski, Instytut Informatyki | 25.0 | 23.3 |
tR-Ex_kxv2 | Uniwersytet Wrocławski, Instytut Informatyki | 25.5 | 17.1 |
The submission titled "MLM+bert_base_polish" covered only 171 out of 462 files and cannot be directly compared with the others. Its result on the submitted files alone was 26.8%. Evaluated on the same 171 files, the winning submission scores 23.4%, so "MLM+bert_base_polish" would not have won in this comparison either.
The winning submission by the Wrocław University team, titled "flair-bigsmall", was submitted twice (both submissions were identical) and also lacked the output for 2 files. The result above was calculated assuming these two files were completely incorrect; if they were excluded from the evaluation, the result would be 24.0%.
It was a close call between the Wrocław University and Adam Mickiewicz University teams, but ultimately the former won. Once again, we would like to thank all the teams for participating!
Task 2: Morphosyntactic tagging of Middle, New and Modern Polish
Gold-standard data
Results
Submission | Affiliation | Accuracy | Acc. on known | Acc. on unknown (ign) | Acc. on manual |
---|---|---|---|---|---|
Alium-1.25 | | 0.8880 | 0.8985 | 0.4295 | 0.2427 |
Alium-1000 | | 0.8880 | 0.8985 | 0.4306 | 0.2427 |
KFTT train | UJ, AGH | 0.9564 | 0.9600 | 0.7991 | 0.6661 |
KFTT train+devel wo_morf | UJ, AGH | 0.9563 | 0.9595 | 0.8191 | 0.6730 |
KFTT train+devel | UJ, AGH | 0.9573 | 0.9607 | 0.8102 | 0.6781 |
Simple Baseline: COMBO | Allegro.pl, Melodice.org | 0.9284 | 0.9363 | 0.5838 | 0.5232 |
CMC Graph Heuristics | Wrocław University of Science and Technology | 0.9121 | 0.9214 | 0.5072 | 0.1670 |
Simple Baselines: XLM-R | Allegro.pl, Melodice.org | 0.9499 | 0.9562 | 0.6770 | 0.6850 |
Eight solutions for this task were submitted by four contestants. The results achieved are far better than we anticipated. Tagging of historical Polish can be expected to be more difficult than tagging contemporary language: the tagset includes more features, some of them describing very rare phenomena; the number of tokens unknown to the morphological analyser is larger (2.25% vs. 1.26%); and the word order is less stable (with many discontinuous constructions). Yet the results compare favourably to those of PolEval 2017 Task 1(A) for contemporary language (http://2017.poleval.pl/index.php/results/). The best overall accuracy is 95.7%, compared to 94.6% in PolEval 2017. The most striking improvement lies in tagging tokens unknown to the morphological analyser: 81.9% compared to 67% in PolEval 2017.
These results require further study, which will hopefully lead to interesting discussions during the PolEval 2020 conference session, but generally we can conclude that the presented systems not only improve the tagging of historical texts, but also provide better taggers for contemporary Polish, which is a great achievement.
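For illustration, here is a minimal sketch of how overall accuracy and accuracy on tokens unknown to the morphological analyser can be separated. The token-aligned triple format and the toy tags are assumptions made for this sketch, not the official evaluation procedure.

```python
# Sketch: splitting tagging accuracy into known / unknown ("ign") tokens.
# The (gold, predicted, known) triple format is an assumption made for this
# illustration; the official PolEval scorer works differently.

def accuracy_breakdown(tokens):
    """tokens: iterable of (gold_tag, predicted_tag, known_to_analyser)."""
    correct = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for gold, pred, known in tokens:
        correct[known] += int(gold == pred)
        total[known] += 1
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return {
        "accuracy": overall,
        "acc_known": correct[True] / max(total[True], 1),
        "acc_unknown": correct[False] / max(total[False], 1),
    }

# Toy example: two tokens known to the analyser, one unknown to it.
toy = [
    ("subst:sg:nom:f", "subst:sg:nom:f", True),
    ("adj:sg:nom:f:pos", "subst:sg:nom:f", True),
    ("subst:pl:gen:m3", "subst:pl:acc:m3", False),
]
print(accuracy_breakdown(toy))
```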
Task 3: Word sense disambiguation
Gold-standard data
Results
Submission | Affiliation | Precision KPWr | Recall KPWr | Precision Sherlock | Recall Sherlock |
---|---|---|---|---|---|
Polbert for WSD (v2) | skok.ai | 0.599296 | 0.588727 | 0.592263 | 0.576850 |
Polbert for WSD | skok.ai | 0.564432 | 0.550860 | 0.564384 | 0.542966 |
PolevalWSDv1 | | 0.318547 | 0.231085 | 0.291732 | 0.200867 |
Task 4: Information extraction and entity typing from long documents with complex layouts
Submission | Affiliation | F1 score |
---|---|---|
CLEX | Wrocław University of Science and Technology | 0.651±0.019 |
double_big | Poznan University of Technology; WIZIPISI | 0.606±0.017 |
300_xgb | Poznan University of Technology; WIZIPISI | 0.592±0.015 |
double_small | Poznan University of Technology; WIZIPISI | 0.588±0.018 |
300_RF | Poznan University of Technology; WIZIPISI | 0.587±0.015 |
middle_big | Poznan University of Technology; WIZIPISI | 0.585±0.016 |
100_RF | Poznan University of Technology; WIZIPISI | 0.584±0.016 |
Multilingual BERT + Random Forest | skok.ai | 0.440±0.014 |
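The ± values suggest an uncertainty estimate around each F1 score; the exact procedure is not described here. Purely as an illustration, the sketch below computes micro-averaged F1 with a non-parametric bootstrap interval over documents, which is one common way such intervals are obtained. The bootstrap and the per-document counts are assumptions, not the task's official scoring.

```python
# Sketch: micro-averaged F1 with a bootstrap confidence interval over
# documents. Both the bootstrap and the per-document (tp, fp, fn) counts
# are assumptions for illustration, not the task's official scoring.
import random

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def bootstrap_f1(per_doc_counts, n_resamples=1000, seed=0):
    """per_doc_counts: list of (tp, fp, fn) tuples, one tuple per document."""
    rng = random.Random(seed)
    point = f1(*map(sum, zip(*per_doc_counts)))
    scores = sorted(
        f1(*map(sum, zip(*[rng.choice(per_doc_counts) for _ in per_doc_counts])))
        for _ in range(n_resamples)
    )
    low, high = scores[int(0.025 * n_resamples)], scores[int(0.975 * n_resamples)]
    return point, low, high

# Toy usage with three documents' (tp, fp, fn) counts.
print(bootstrap_f1([(10, 3, 2), (7, 5, 4), (12, 1, 6)]))
```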