Results
Task 1: Post-editing and rescoring of ASR results
Gold-standard data
Results
The number of submissions exceeded our expectations! We are amazed at how many experiments the participants managed to perform, and even where a result was not the best, the conclusions will make a valuable contribution to this area of research. To this end, we would like to encourage all the authors to prepare at least a short write-up of their chosen method(s).
Designing an interesting competition turned out to be more difficult than anticipated. If the ASR system used were too good, the post-editing problem would be too minor to be interesting; a very poor system, on the other hand, would make the language too difficult to analyze. As a compromise, the chosen system had an average word error rate (among the systems we tested) of 27.6%. Participants who decided to use the lattice as their input could count on an oracle word error rate of 17.7%, which is effectively a floor for the error rate of the given ASR system. In other words, the system likely made many errors that were unrecoverable from the post-editing perspective, so an improvement of even a few absolute percentage points makes a significant difference.
The results were very close in a few cases. To make the assessment a bit more interesting, we calculated both the error rate against the reference (which was not known to the submitters) and the rate of changes against the single best ASR output (which was known to the submitters). The latter shows how much each submission changed the original, i.e. how "brave" it was, because even a small number of changes could yield a decent result.
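As a rough illustration of how these two numbers can be obtained, below is a minimal sketch in Python. It assumes simple whitespace tokenisation and plain word-level edit distance; it is not the organisers' evaluation script, and the example strings are made up.

```python
# Minimal sketch: word error rate (WER) against the hidden reference and the
# rate of changes against the ASR one-best output. Tokenisation and
# normalisation are simplified assumptions; the official scoring may differ.

def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,              # drop a hypothesis word
                curr[j - 1] + 1,          # drop a reference word
                prev[j - 1] + (h != r),   # substitute or keep
            ))
        prev = curr
    return prev[-1]

def error_rate(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

post_edited  = "ala ma kota i psa"     # hypothetical post-edited transcript
reference    = "ala ma kota oraz psa"  # hypothetical gold reference
asr_one_best = "ala ma kota psa"       # hypothetical ASR one-best output

print(f"WER:     {100 * error_rate(post_edited, reference):.1f}%")
print(f"Changes: {100 * error_rate(post_edited, asr_one_best):.1f}%")
```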
Without further ado, the results were as follows:
Submission | Affiliation | WER % | Changes % |
---|---|---|---|
KRS + spaces | UJ, AGH | 25.9 | 3.6 |
KRS | UJ, AGH | 26.9 | 1.6 |
Polbert | https://skok.ai/ | 26.9 | 2.1 |
BiLSTM-CRF edit-operations tagger | Adam Mickiewicz University | 24.7 | 6.2 |
base-4g-rr | Samsung R&D Institute Poland | 27.7 | 2.0 |
t-REx_k10 | Uniwersytet Wrocławski | 24.9 | 14.2 |
t-REx_k5 | Uniwersytet Wrocławski | 25.0 | 14.2 |
t-REx_fbs | Uniwersytet Wrocławski | 24.31 | 17.2 |
PJA_CLARIN_1k | Polish-Japanese Academy of Information Technology | 33.5 | 9.1 |
PJA_CLARIN_10k | Polish-Japanese Academy of Information Technology | 32.0 | 9.6 |
PJA_CLARIN_20k | Polish-Japanese Academy of Information Technology | 31.8 | 9.9 |
PJA_CLARIN_40k | Polish-Japanese Academy of Information Technology | 31.8 | 10.3 |
PJA_CLARIN_50k | Polish-Japanese Academy of Information Technology | 31.8 | 10.2 |
CLARIN_SEJM_40k | Polish-Japanese Academy of Information Technology | 33.7 | 19.1 |
CLARIN_SEJM_50k | Polish-Japanese Academy of Information Technology | 32.5 | 17.7 |
MLM+bert_base_polish | | 73.9 | 2.1 |
tR-Ex_xk | Uniwersytet Wrocławski, Instytut Informatyki | 25.7 | 18.1 |
tR-Ex_fbs | Uniwersytet Wrocławski, Instytut Informatyki | 24.31 | 17.2 |
tR-Ex_fx | Uniwersytet Wrocławski, Instytut Informatyki | 25.0 | 23.3 |
tR-Ex_kxv2 | Uniwersytet Wrocławski, Instytut Informatyki | 25.5 | 17.1 |
The submission titled "MLM+bert_base_polish" covered only 171 out of 462 files and cannot be directly compared with the others. Its result on the submitted files alone was 26.8%. Evaluated on the same 171 files, the winning submission scores 23.4%, so "MLM+bert_base_polish" would not have won in this comparison either.
The winning submission by the Wrocław University team, titled "flair-bigsmall", was submitted twice (both submissions were identical) and also lacked the output for 2 files. The result above was calculated assuming these two files were completely incorrect; if they were excluded from the evaluation, the result would be 24.0%.
It was a close call between the Wrocław University and Adam Mickiewicz University teams, but ultimately the former won. Once again, we would like to thank all the teams for participating!
Task 2: Morphosyntactic tagging of Middle, New and Modern Polish
Gold-standard data
Results
Submission | Affiliation | Accuracy | Acc. on known | Acc. on unknown (ign) | Acc. on manual |
---|---|---|---|---|---|
Alium-1.25 | | 0.8880 | 0.8985 | 0.4295 | 0.2427 |
Alium-1000 | | 0.8880 | 0.8985 | 0.4306 | 0.2427 |
KFTT train | UJ, AGH | 0.9564 | 0.9600 | 0.7991 | 0.6661 |
KFTT train+devel wo_morf | UJ, AGH | 0.9563 | 0.9595 | 0.8191 | 0.6730 |
KFTT train+devel | UJ, AGH | 0.9573 | 0.9607 | 0.8102 | 0.6781 |
Simple Baseline: COMBO | Allegro.pl, Melodice.org | 0.9284 | 0.9363 | 0.5838 | 0.5232 |
CMC Graph Heuristics | Wrocław University of Science and Technology | 0.9121 | 0.9214 | 0.5072 | 0.1670 |
Simple Baselines: XLM-R | Allegro.pl, Melodice.org | 0.9499 | 0.9562 | 0.6770 | 0.6850 |
Eight solutions for this task were submitted by four contestants. The results achieved are far better than we anticipated. Tagging of historical Polish can be expected to be more difficult than tagging contemporary language: the tagset includes more features, some of them describing very rare phenomena; the number of tokens unknown to the morphological analyser is larger (2.25% vs. 1.26%); and the word order is less stable (with many discontinuous constructions). Yet the results compare favourably to those of PolEval 2017 Task 1(A) for contemporary language (http://2017.poleval.pl/index.php/results/). The best overall accuracy is 95.7%, compared to 94.6% in PolEval 2017. The most striking improvement lies in tagging tokens unknown to the morphological analyser: 81.9% compared to 67% in PolEval 2017.
These results require further study, which will hopefully lead to interesting discussions during the PolEval 2020 conference session, but generally we can conclude that the presented systems not only improve the tagging of historical texts, but also provide better taggers for contemporary Polish, which is a great achievement.
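For illustration, here is a minimal sketch of how overall accuracy and accuracy on tokens unknown to the morphological analyser can be separated. The token-aligned triple format and the toy tags are assumptions made for this sketch, not the official evaluation procedure.

```python
# Sketch: splitting tagging accuracy into known / unknown ("ign") tokens.
# The (gold, predicted, known) triple format is an assumption made for this
# illustration; the official PolEval scorer works differently.

def accuracy_breakdown(tokens):
    """tokens: iterable of (gold_tag, predicted_tag, known_to_analyser)."""
    correct = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for gold, pred, known in tokens:
        correct[known] += int(gold == pred)
        total[known] += 1
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return {
        "accuracy": overall,
        "acc_known": correct[True] / max(total[True], 1),
        "acc_unknown": correct[False] / max(total[False], 1),
    }

# Toy example: two tokens known to the analyser, one unknown to it.
toy = [
    ("subst:sg:nom:f", "subst:sg:nom:f", True),
    ("adj:sg:nom:f:pos", "subst:sg:nom:f", True),
    ("subst:pl:gen:m3", "subst:pl:acc:m3", False),
]
print(accuracy_breakdown(toy))
```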
Task 3: Word sense disambiguation
Gold-standard data
Results
Submission | Affiliation | Precision KPWr | Recall KPWr | Precision Sherlock | Recall Sherlock |
---|---|---|---|---|---|
Polbert for WSD (v2) | skok.ai | 0.599296 | 0.588727 | 0.592263 | 0.576850 |
Polbert for WSD | skok.ai | 0.564432 | 0.550860 | 0.564384 | 0.542966 |
PolevalWSDv1 | | 0.318547 | 0.231085 | 0.291732 | 0.200867 |
Task 4: Information extraction and entity typing from long documents with complex layouts
Submission | Affiliation | F1 score |
---|---|---|
CLEX | Wrocław University of Science and Technology | 0.651±0.019 |
double_big | Poznan University of Technology; WIZIPISI | 0.606±0.017 |
300_xgb | Poznan University of Technology; WIZIPISI | 0.592±0.015 |
double_small | Poznan University of Technology; WIZIPISI | 0.588±0.018 |
300_RF | Poznan University of Technology; WIZIPISI | 0.587±0.015 |
middle_big | Poznan University of Technology; WIZIPISI | 0.585±0.016 |
100_RF | Poznan University of Technology; WIZIPISI | 0.584±0.016 |
Multilingual BERT + Random Forest | skok.ai | 0.440±0.014 |
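The ± values suggest an uncertainty estimate around each F1 score; the exact procedure is not described here. Purely as an illustration, the sketch below computes micro-averaged F1 with a non-parametric bootstrap interval over documents, which is one common way such intervals are obtained. The bootstrap and the per-document counts are assumptions, not the task's official scoring.

```python
# Sketch: micro-averaged F1 with a bootstrap confidence interval over
# documents. Both the bootstrap and the per-document (tp, fp, fn) counts
# are assumptions for illustration, not the task's official scoring.
import random

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def bootstrap_f1(per_doc_counts, n_resamples=1000, seed=0):
    """per_doc_counts: list of (tp, fp, fn) tuples, one tuple per document."""
    rng = random.Random(seed)
    point = f1(*map(sum, zip(*per_doc_counts)))
    scores = sorted(
        f1(*map(sum, zip(*[rng.choice(per_doc_counts) for _ in per_doc_counts])))
        for _ in range(n_resamples)
    )
    low, high = scores[int(0.025 * n_resamples)], scores[int(0.975 * n_resamples)]
    return point, low, high

# Toy usage with three documents' (tp, fp, fn) counts.
print(bootstrap_f1([(10, 3, 2), (7, 5, 4), (12, 1, 6)]))
```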