LREC 2020 Paper Dissemination (6/10)

LREC 2020 was not held in Marseille this year and only the Proceedings were published.

The ELRA Board and the LREC 2020 Programme Committee now feel that those papers should be disseminated again, in a thematically oriented way, shedding light on specific “topics/sessions”.

Packages with several sessions will be disseminated every Tuesday for 10 weeks, from Nov 10, 2020 until the end of January 2021.

Each session lists the papers’ titles and authors, with the corresponding abstract (for ease of reading) and URL, in the same manner as the Book of Abstracts we used to print and distribute at LRECs.

We hope that you discover interesting, even exciting, work that may be useful for your own research.

Group of papers sent on December 15, 2020


Machine Translation


LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition

Benjamin Beilharz, Xin Sun, Sariya Karimova and Stefan Riezler

We present a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies. The quality of audio and sentence alignments has been checked by a manual evaluation, showing that speech alignment quality is in general very high. The sentence alignment quality is comparable to well-used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for German speech recognition and for end-to-end German-to-English speech translation.


SEDAR: a Large Scale French-English Financial Domain Parallel Corpus

Abbas Ghaddar and Philippe Langlais

This paper describes the acquisition, preprocessing and characteristics of SEDAR, a large scale English-French parallel corpus for the financial domain. Our extensive experiments on machine translation show that SEDAR is essential to obtain good performance on finance. We observe a large gain in the performance of machine translation systems trained on SEDAR when tested on finance, which makes SEDAR suitable to study domain adaptation for neural machine translation. The first release of the corpus comprises 8.6 million high quality sentence pairs that are publicly available for research at


JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus

Makoto Morishita, Jun Suzuki and Masaaki Nagata

Recent machine translation algorithms mainly rely on parallel corpora. However, since the availability of parallel corpora remains limited, only some resource-rich language pairs can benefit from them. We constructed a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. We constructed the parallel corpus by broadly crawling the web and automatically aligning parallel sentences. Our collected corpus, called JParaCrawl, amassed over 8.7 million sentence pairs. We show how it includes a broader range of domains and how a neural machine translation model trained with it works as a good pre-trained model for fine-tuning specific domains. The pre-training and fine-tuning approaches achieved or surpassed performance comparable to model training from the initial state and reduced the training time. Additionally, we trained the model with an in-domain dataset and JParaCrawl to show how we achieved the best performance with them. JParaCrawl and the pre-trained models are freely available online for research purposes.


Neural Machine Translation for Low-Resourced Indian Languages

Himanshu Choudhary, Shivansh Rao and Rajesh Rohilla

A large number of significant assets are available online in English, and these are frequently translated into native languages to ease information sharing among local people who are not very familiar with English. However, manual translation is a tedious, costly, and time-consuming process. To this end, machine translation is an effective approach to convert text to a different language without any human involvement. Neural machine translation (NMT) is one of the most proficient translation techniques amongst all existing machine translation systems. In this paper, we apply NMT to two language pairs involving highly morphologically rich Indian languages, i.e. English-Tamil and English-Malayalam. We propose a novel NMT model using multi-head self-attention along with pre-trained Byte-Pair-Encoded (BPE) and MultiBPE embeddings to develop an efficient translation system that overcomes the OOV (Out Of Vocabulary) problem for low-resourced, morphologically rich Indian languages for which few translations are available online. We also collected corpora from different sources, addressed the issues with these publicly available data and refined them for further use. We used the BLEU score to evaluate system performance. Experimental results and a survey confirmed that our proposed translator (24.34 and 9.78 BLEU) outperforms Google Translate (9.40 and 5.94 BLEU) on the two pairs, respectively.


Content-Equivalent Translated Parallel News Corpus and Extension of Domain Adaptation for NMT

Hideya Mino, Hideki Tanaka, Hitoshi Ito, Isao Goto, Ichiro Yamada and Takenobu Tokunaga

In this paper, we deal with two problems in Japanese-English machine translation of news articles. The first problem is the quality of parallel corpora. Neural machine translation (NMT) systems suffer degraded performance when trained with noisy data. Because there is no clean Japanese-English parallel data for news articles, we build a novel parallel news corpus consisting of Japanese news articles translated into English in a content-equivalent manner. This is the first content-equivalent Japanese-English news corpus translated specifically for training NMT systems. The second problem involves the domain-adaptation technique. NMT systems suffer degraded performance when trained with mixed data having different features, such as noisy data and clean data. Though the existing methods try to overcome this problem by using tags for distinguishing the differences between corpora, it is not sufficient. We thus extend a domain-adaptation method using multi-tags to train an NMT model effectively with the clean corpus and existing parallel news corpora with some types of noise. Experimental results show that our corpus increases the translation quality, and that our domain-adaptation method is more effective for learning with the multiple types of corpora than existing domain-adaptation methods are.


NMT and PBSMT Error Analyses in English to Brazilian Portuguese Automatic Translations

Helena Caseli and Marcio Inácio

Machine Translation (MT) is one of the most important natural language processing applications. Independently of the applied MT approach, an MT system automatically generates an equivalent version (in some target language) of an input sentence (in some source language). Recently, a new MT approach has been proposed: neural machine translation (NMT). NMT systems have already outperformed traditional phrase-based statistical machine translation (PBSMT) systems for some pairs of languages. However, any MT approach outputs errors. In this work we present a comparative study of MT errors generated by an NMT system and a PBSMT system trained on the same English-Brazilian Portuguese parallel corpus. This is the first study of this kind involving NMT for Brazilian Portuguese. Furthermore, the analyses and conclusions presented here point out the specific problems of NMT outputs in relation to PBSMT ones and also give many insights into how to implement automatic post-editing for an NMT system. Finally, the corpora annotated with MT errors generated by both PBSMT and NMT systems are also available.


Evaluation Dataset for Zero Pronoun in Japanese to English Translation

Sho Shimazu, Sho Takase, Toshiaki Nakazawa and Naoaki Okazaki

In natural language, we often omit some words that are easily understandable from the context. In particular, pronouns of subject, object, and possessive cases are often omitted in Japanese; these are known as zero pronouns. In translation from Japanese to other languages, we need to find a correct antecedent for each zero pronoun to generate a correct and coherent translation. However, it is difficult for conventional automatic evaluation metrics (e.g., BLEU) to focus on the success of zero pronoun resolution. Therefore, we present a hand-crafted dataset to evaluate whether translation models can resolve the zero pronoun problems in Japanese to English translations. We manually and statistically validate that our dataset can effectively evaluate the correctness of the antecedents selected in translations. Through the translation experiments using our dataset, we reveal shortcomings of an existing context-aware neural machine translation model.


Better Together: Modern Methods Plus Traditional Thinking in NP Alignment

Ádám Kovács, Judit Ács, Andras Kornai and Gábor Recski

We study a typical intermediary task to Machine Translation, the alignment of NPs in the bitext. After arguing that the task remains relevant even in an end-to-end paradigm, we present simple, dictionary- and word vector-based baselines and a BERT-based system. Our results make clear that even state of the art systems relying on the best end-to-end methods can be improved by bringing in old-fashioned methods such as stopword removal, lemmatization, and dictionaries.


Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

Haiyue Song, Raj Dabre, Atsushi Fujita and Sadao Kurohashi

Lectures translation is a case of spoken language translation and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a framework for parallel corpus mining which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in a multistage fine-tuning based domain adaptation for high-quality lectures translation. For Japanese-English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests some guidelines to gather and clean corpora, mine parallel sentences, address noise in the mined data, and create high-quality evaluation splits. For the sake of reproducibility, we have released our code for parallel data creation.
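As a rough illustration of alignment by cosine similarity over sentence vectors, the sketch below pairs each source sentence with its most similar target sentence and keeps the pair only above a cutoff. It is not the authors' released code: the function names, the greedy best-match strategy, and the threshold value are assumptions made for this example, and the vectors are presumed to come from some sentence encoder applied after machine translation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(src_vecs, tgt_vecs, threshold=0.8):
    """Greedy alignment (hypothetical helper): for each source vector pick the
    best-scoring target vector, and keep the pair if its score reaches the threshold."""
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_score = -1, -1.0
        for j, v in enumerate(tgt_vecs):
            score = cosine(u, v)
            if score > best_score:
                best_j, best_score = j, score
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Toy vectors standing in for real sentence embeddings.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 2.0], [3.0, 0.0]]
aligned = align(src, tgt, threshold=0.9)
```

In this toy run the third source vector matches neither target well enough, so it is left unaligned; raising or lowering the threshold trades corpus size against noise.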


Being Generous with Sub-Words towards Small NMT Children

Arne Defauw, Tom Vanallemeersch, Koen Van Winckel, Sara Szoc and Joachim Van den Bogaert

In the context of under-resourced neural machine translation (NMT), transfer learning from an NMT model trained on a high resource language pair, or from a multilingual NMT (M-NMT) model, has been shown to boost performance to a large extent. In this paper, we focus on so-called cold start transfer learning from an M-NMT model, which means that the parent model is not trained on any of the child data. Such a set-up enables quick adaptation of M-NMT models to new languages. We investigate the effectiveness of cold start transfer learning from a many-to-many M-NMT model to an under-resourced child. We show that sufficiently large sub-word vocabularies should be used for transfer learning to be effective in such a scenario. When adopting relatively large sub-word vocabularies we observe increases in performance thanks to transfer learning from a parent M-NMT model, both when translating to and from the under-resourced language. Our proposed approach involving dynamic vocabularies is both practical and effective. We report results on two under-resourced language pairs, i.e. Icelandic-English and Irish-English.


Document Sub-structure in Neural Machine Translation

Radina Dobreva, Jie Zhou and Rachel Bawden

Current approaches to machine translation (MT) either translate sentences in isolation, disregarding the context they appear in, or model context at the level of the full document, without a notion of any internal structure the document may have. In this work we consider the fact that documents are rarely homogeneous blocks of text, but rather consist of parts covering different topics. Some documents, such as biographies and encyclopedia entries, have highly predictable, regular structures in which sections are characterised by different topics. We draw inspiration from Louis and Webber (2014) who use this information to improve statistical MT and transfer their proposal into the framework of neural MT. We compare two different methods of including information about the topic of the section within which each sentence is found: one using side constraints and the other using a cache-based model. We create and release the data on which we run our experiments - parallel corpora for three language pairs (Chinese-English, French-English, Bulgarian-English) from Wikipedia biographies, which we extract automatically, preserving the boundaries of sections within the articles.


An Evaluation Benchmark for Testing the Word Sense Disambiguation Capabilities of Machine Translation Systems

Alessandro Raganato, Yves Scherrer and Jörg Tiedemann

Lexical ambiguity is one of the many challenging linguistic phenomena involved in translation, i.e., translating an ambiguous word with its correct sense. In this respect, previous work has shown that the translation quality of neural machine translation systems can be improved by explicitly modeling the senses of ambiguous words. Recently, several evaluation test sets have been proposed to measure the word sense disambiguation (WSD) capability of machine translation systems. However, to date, these evaluation test sets do not include any training data that would provide a fair setup measuring the sense distributions present within the training data itself. In this paper, we present an evaluation benchmark on WSD for machine translation for 10 language pairs, comprising training data with known sense distributions. Our approach for the construction of the benchmark builds upon the wide-coverage multilingual sense inventory of BabelNet, the multilingual neural parsing pipeline TurkuNLP, and the OPUS collection of translated texts from the web. The test suite is available at


MEDLINE as a Parallel Corpus: a Survey to Gain Insight on French-, Spanish- and Portuguese-speaking Authors’ Abstract Writing Practice

Aurélie Névéol, Antonio Jimeno Yepes and Mariana Neves

Background: Parallel corpora are used to train and evaluate machine translation systems. To alleviate the cost of producing parallel resources for evaluation campaigns, existing corpora are leveraged. However, little information may be available about the methods used for producing the corpus, including translation direction. Objective: To gain insight on MEDLINE parallel corpus used in the biomedical task at the Workshop on Machine Translation in 2019 (WMT 2019). Material and Methods: Contact information for the authors of MEDLINE articles included in the English/Spanish (EN/ES), English/French (EN/FR), and English/Portuguese (EN/PT) WMT 2019 test sets was obtained from PubMed and publisher websites. The authors were asked about their abstract writing practices in a survey. Results: The response rate was above 20%. Authors reported that they are mainly native speakers of languages other than English. Although manual translation, sometimes via professional translation services, was commonly used for abstract translation, authors of articles in the EN/ES and EN/PT sets also relied on post-edited machine translation. Discussion: This study provides a characterization of MEDLINE authors’ language skills and abstract writing practices. Conclusion: The information collected in this study will be used to inform test set design for the next WMT biomedical task.


JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation

Zhuoyuan Mao, Fabien Cromieres, Raj Dabre, Haiyue Song and Sadao Kurohashi

Neural machine translation (NMT) needs large parallel corpora for state-of-the-art translation quality. Low-resource NMT is typically addressed by transfer learning which leverages large monolingual or parallel corpora for pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence to Sequence) are extremely effective in boosting NMT quality for languages with small parallel corpora. However, they do not account for linguistic information obtained using syntactic analyzers which is known to be invaluable for several Natural Language Processing (NLP) tasks. To this end, we propose JASS, Japanese-specific Sequence to Sequence, as a novel pre-training alternative to MASS for NMT involving Japanese as the source or target language. JASS is joint BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence) pre-training which focuses on Japanese linguistic units called bunsetsus. In our experiments on ASPEC Japanese–English and News Commentary Japanese–Russian translation we show that JASS can give results that are competitive with if not better than those given by MASS. Furthermore, we show for the first time that joint MASS and JASS pre-training gives results that significantly surpass the individual methods indicating their complementary nature. We will release our code, pre-trained models and bunsetsu annotated data as resources for researchers to use in their own NLP tasks.


A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?

Julia Ive, Lucia Specia, Sara Szoc, Tom Vanallemeersch, Joachim Van den Bogaert, Eduardo Farah, Christine Maroti, Artur Ventura and Maxim Khalilov

We introduce a machine translation dataset for three pairs of languages in the legal domain with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples including a source sentence, the respective machine translation by a neural machine translation system, a post-edited version of such translation by a professional translator, and - where available - the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting by-product of our post-editing analysis suggests that neural systems built with publicly available general domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.


Linguistically Informed Hindi-English Neural Machine Translation

Vikrant Goyal, Pruthwik Mishra and Dipti Misra Sharma

Hindi-English Machine Translation is a challenging problem, owing to multiple factors including the morphological complexity and relatively free word order of Hindi, in addition to the lack of sufficient parallel training data. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. To overcome the data sparsity issue caused by the lack of large parallel corpora for Hindi-English, we propose a method to employ additional linguistic knowledge which is encoded by different phenomena depicted by Hindi. We generalize the embedding layer of the state-of-the-art Transformer model to incorporate linguistic features like POS tag, lemma and morph features to improve the translation performance. We compare the results obtained on incorporating this knowledge with the baseline systems and demonstrate significant performance improvements. Although Transformer NMT models are highly effective at learning language constructs, we show that the use of specific features further helps to improve the translation performance.


A Test Set for Discourse Translation from Japanese to English

Masaaki Nagata and Makoto Morishita

We made a test set for Japanese-to-English discourse translation to evaluate the power of context-aware machine translation. For each discourse phenomenon, we systematically collected examples where the translation of the second sentence depends on the first sentence. Compared with a previous study on test sets for English-to-French discourse translation (Bawden et al., NAACL 2018), we needed different approaches to make the data because Japanese has zero pronouns and represents different senses in different characters. We improved the translation accuracy using context-aware neural machine translation, and the improvement mainly reflects the betterment of the translation of zero pronouns.


An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages

Aaron Mueller, Garrett Nicolai, Arya D. McCarthy, Dylan Lewis, Winston Wu and David Yarowsky

In this work, we explore massively multilingual low-resource neural machine translation. Using translations of the Bible (which have parallel structure across languages), we train models with up to 1,107 source languages. We create various multilingual corpora, varying the number and relatedness of source languages. Using these, we investigate the best ways to use this many-way aligned resource for multilingual machine translation. Our experiments employ a grammatically and phylogenetically diverse set of source languages during testing for more representative evaluations. We find that best practices in this domain are highly language-specific: adding more languages to a training set is often better, but too many harms performance; the best number depends on the source language. Furthermore, training on related languages can improve or degrade performance, depending on the language. As there is no one-size-fits-most answer, we find that it is critical to tailor one's approach to the source language and its typology.


TDDC: Timely Disclosure Documents Corpus

Nobushige Doi, Yusuke Oda and Toshiaki Nakazawa

In this paper, we describe the details of the Timely Disclosure Documents Corpus (TDDC). TDDC was prepared by manually aligning the sentences from past Japanese and English timely disclosure documents in PDF format published by companies listed on the Tokyo Stock Exchange. TDDC consists of approximately 1.4 million parallel sentences in Japanese and English. TDDC was used as the official dataset for the 6th Workshop on Asian Translation to encourage the development of machine translation.


MuST-Cinema: a Speech-to-Subtitles corpus

Alina Karakanta, Matteo Negri and Marco Turchi

Growing needs in localising audiovisual content in multiple languages through subtitles call for the development of automatic solutions for human subtitling. Neural Machine Translation (NMT) can contribute to the automatisation of subtitling, facilitating the work of human subtitlers and reducing turn-around times and related costs. NMT requires high-quality, large, task-specific training data. The existing subtitling corpora, however, are missing both alignments to the source language audio and important information about subtitle breaks. This poses a significant limitation for developing efficient automatic approaches for subtitling, since the length and form of a subtitle directly depends on the duration of the utterance. In this work, we present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles. The corpus is comprised of (audio, transcription, translation) triplets. Subtitle breaks are preserved by inserting special symbols. We show that the corpus can be used to build models that efficiently segment sentences into subtitles and propose a method for annotating existing subtitling corpora with subtitle breaks, conforming to the constraint of length.


On Context Span Needed for Machine Translation Evaluation

Sheila Castilho, Maja Popović and Andy Way

Despite increasing efforts to improve evaluation of machine translation (MT) by going beyond the sentence level to the document level, the definition of what exactly constitutes a "document level" is still not clear. This work deals with the context span necessary for a more reliable MT evaluation. We report results from a series of surveys involving three domains and 18 target languages designed to identify the necessary context span as well as issues related to it. Our findings indicate that, despite the fact that some issues and spans are strongly dependent on domain and on the target language, a number of common patterns can be observed so that general guidelines for context-aware MT evaluation can be drawn.


A Multilingual Parallel Corpora Collection Effort for Indian Languages

Shashank Siripragada, Jerin Philip, Vinay P. Namboodiri and C V Jawahar

We present sentence aligned parallel corpora across 10 Indian Languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English - many of which are categorized as low resource. The corpora are compiled from online sources which have content shared across languages. The corpora presented significantly extend present resources that are either not large enough or are restricted to a specific domain (such as health). We also provide a separate test corpus compiled from an independent online source that can be independently used for validating the performance in 10 Indian languages. Alongside, we report on the methods of constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval using deep neural network based methods.


To Case or not to case: Evaluating Casing Methods for Neural Machine Translation

Thierry Etchegoyhen and Harritxu Gete

We present a comparative evaluation of casing methods for Neural Machine Translation, to help establish an optimal pre- and post-processing methodology. We trained and compared system variants on data prepared with the main casing methods available, namely translation of raw data without case normalisation, lowercasing with recasing, truecasing, case factors and inline casing. Machine translation models were prepared on WMT 2017 English-German and English-Turkish datasets, for all translation directions, and the evaluation includes reference metric results as well as a targeted analysis of case preservation accuracy. Inline casing, where case information is marked along lowercased words in the training data, proved to be the optimal approach overall in these experiments.
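To make the idea of inline casing concrete, here is a minimal sketch in which every token is lowercased and a marker token records its original case, so that the case can be restored after translation. The marker strings `<T>` and `<U>`, their placement after the word, and the function names are assumptions made for this illustration; the abstract does not specify the exact scheme used in the experiments.

```python
TITLE, UPPER = "<T>", "<U>"  # hypothetical case markers


def encode_inline_case(tokens):
    """Lowercase tokens and append a marker after title-case or all-caps words.
    Mixed-case tokens such as 'iPhone' are simply lowercased in this sketch,
    so their original casing is not recoverable."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += [tok.lower(), UPPER]
        elif tok[:1].isupper() and tok[1:] == tok[1:].lower():
            out += [tok.lower(), TITLE]
        else:
            out.append(tok.lower())
    return out


def decode_inline_case(tokens):
    """Apply each marker to the token that precedes it and drop the marker."""
    out = []
    for tok in tokens:
        if tok == UPPER and out:
            out[-1] = out[-1].upper()
        elif tok == TITLE and out:
            out[-1] = out[-1][:1].upper() + out[-1][1:]
        else:
            out.append(tok)
    return out


sentence = ["The", "NMT", "model", "I", "trained", "in", "2017"]
encoded = encode_inline_case(sentence)
restored = decode_inline_case(encoded)
```

The model then only ever sees lowercased vocabulary plus two extra symbols, which is the property that distinguishes this approach from truecasing or training on raw cased data.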


The MARCELL Legislative Corpus

Tamás Váradi, Svetla Koeva, Martin Yamalov, Marko Tadić, Bálint Sass, Bartłomiej Nitoń, Maciej Ogrodniczuk, Piotr Pęzik, Verginica Barbu Mititelu, Radu Ion, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek, Andraz Repar, Matjaž Rihtar and Janez Brank

This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
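For readers unfamiliar with CoNLL-U Plus, the sketch below reads such a file generically: the column names are taken from the `# global.columns` header line, and each token line is split on tabs into a dictionary. The function name is hypothetical, and the two extra columns in the sample (`EX:NE`, `EX:TERM`) are placeholders invented for this example; the actual names of the four MARCELL-specific columns are not given in the abstract.

```python
def parse_conllu_plus(text):
    """Parse CoNLL-U Plus text into (column_names, sentences).
    Each sentence is a list of {column_name: value} dictionaries."""
    columns = None
    sentences, current = [], []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # Header declaring the column layout of the file.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, raw text, ...)
        elif not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            fields = line.split("\t")
            if len(fields) != len(columns):
                raise ValueError("unexpected number of columns: %r" % line)
            current.append(dict(zip(columns, fields)))
    if current:
        sentences.append(current)
    return columns, sentences


# Sample with the ten CoNLL-U columns plus two placeholder extra columns.
sample_columns = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD",
                  "DEPREL", "DEPS", "MISC", "EX:NE", "EX:TERM"]
sample_row = ["1", "Law", "law", "NOUN", "_", "_", "0", "root", "_", "_", "O", "B-TERM"]
sample = ("# global.columns = " + " ".join(sample_columns) + "\n"
          "# sent_id = 1\n" + "\t".join(sample_row) + "\n\n")
parsed_columns, parsed_sentences = parse_conllu_plus(sample)
```

Reading the layout from the header rather than hard-coding it means the same function handles a file with fourteen columns, as described above, without modification.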


ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

Felipe Soares, Mark Stevenson, Diego Bartolome and Anna Zaretskaya

Google Patents is one of the most important sources of patent information. A striking characteristic is that many of its abstracts are presented in more than one language, thus making it a potential source of parallel corpora. This article presents the development of a parallel corpus from the open access Google Patents dataset in 74 language pairs, comprising more than 68 million sentences and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm for the largest 22 language pairs, while the others were abstract (i.e. paragraph) aligned. We demonstrate the capabilities of our corpus by training Neural Machine Translation (NMT) models for the main 9 language pairs, with a total of 18 models. Our parallel corpus is freely available in TSV format and with a SQLite database, with complementary information regarding patent metadata.


Corpora for Document-Level Neural Machine Translation

Siyou Liu and Xiaojun Zhang

Instead of translating sentences in isolation, document-level machine translation aims to capture discourse dependencies across sentences by considering a document as a whole. In recent years, there has been growing interest in modelling larger context for state-of-the-art neural machine translation (NMT). Although various document-level NMT models have shown significant improvements, there nonetheless exist three main problems: 1) compared with sentence-level translation tasks, the data for training robust document-level models are relatively low-resourced; 2) experiments in previous work are conducted on their own datasets which vary in size, domain and language; 3) proposed approaches are implemented on distinct NMT architectures such as recurrent neural networks (RNNs) and self-attention networks (SANs). In this paper, we aim to alleviate the low-resource and under-universality problems for document-level NMT. First, we collect a large number of existing document-level corpora, which cover 7 language pairs and 6 domains. In order to address resource sparsity, we construct a novel document parallel corpus in Chinese-Portuguese, which is a non-English-centred and low-resourced language pair. Besides, we implement and evaluate the commonly-cited document-level method on top of the advanced Transformer model with universal settings. Finally, we not only demonstrate the effectiveness and universality of document-level NMT, but also release the preprocessed data, source code and trained models for comparison and reproducibility.


OpusTools and Parallel Corpus Diagnostics

Mikko Aulamo, Umut Sulubacak, Sami Virpioja and Jörg Tiedemann

This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for the identification of potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.


Literary Machine Translation under the Magnifying Glass: Assessing the Quality of an NMT-Translated Detective Novel on Document Level

Margot Fonteyne, Arda Tezcan and Lieve Macken

Several studies (covering many language pairs and translation tasks) have demonstrated that translation quality has improved enormously since the emergence of neural machine translation systems. This raises the question whether such systems are able to produce high-quality translations for more creative text types such as literature and whether they are able to generate coherent translations on document level. Our study aimed to investigate these two questions by carrying out a document-level evaluation of the raw NMT output of an entire novel. We translated Agatha Christie's novel The Mysterious Affair at Styles with Google’s NMT system from English into Dutch and annotated it in two steps: first all fluency errors, then all accuracy errors. We report on the overall quality, determine the remaining issues, compare the most frequent error types to those in general-domain MT, and investigate whether any accuracy and fluency errors co-occur regularly. Additionally, we assess the inter-annotator agreement on the first chapter of the novel.


Handle with Care: A Case Study in Comparable Corpora Exploitation for Neural Machine Translation

Thierry Etchegoyhen and Harritxu Gete

We present the results of a case study in the exploitation of comparable corpora for Neural Machine Translation. A large comparable corpus for Basque-Spanish was prepared, on the basis of independently-produced news by the Basque public broadcaster EiTB, and we discuss the impact of various techniques to exploit the original data in order to determine optimal variants of the corpus. In particular, we show that filtering in terms of alignment thresholds and length-difference outliers has a significant impact on translation quality. The impact of tags identifying comparable data in the training datasets is also evaluated, with results indicating that this technique might be useful to help the models discriminate noisy information, in the form of informational imbalance between aligned sentences. The final corpus was prepared according to the experimental results and is made available to the scientific community for research purposes.


The FISKMÖ Project: Resources and Tools for Finnish-Swedish Machine Translation and Cross-Linguistic Research

Jörg Tiedemann, Tommi Nieminen, Mikko Aulamo, Jenna Kanerva, Akseli Leino, Filip Ginter and Niko Papula

This paper presents FISKMÖ, a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish. The goal of the project is the compilation of a massive parallel corpus out of translated material collected from web sources, public and private organisations and language service providers in Finland with its two official languages. The project also aims at the development of open and freely accessible translation services for those two languages for the general purpose and for domain-specific use. We have released new data sets with over 3 million translation units, a benchmark test set for MT development, pre-trained neural MT models with high coverage and competitive performance and a self-contained MT plugin for a popular CAT tool. The latter enables offline translation without dependencies on external services making it possible to work with highly sensitive data without compromising security concerns.


Multiword Expression aware Neural Machine Translation

Andrea Zaninello and Alexandra Birch

Multiword Expressions (MWEs) are a frequently occurring phenomenon found in all natural languages that is of great importance to linguistic theory, natural language processing applications, and machine translation systems. Neural Machine Translation (NMT) architectures do not handle these expressions well and previous studies have rarely addressed MWEs in this framework. In this work, we show that annotation and data augmentation, using external linguistic resources, can improve both translation of MWEs that occur in the source, and the generation of MWEs on the target, and increase performance by up to 5.09 BLEU points on MWE test sets. We also devise an MWE score to specifically assess the quality of MWE translation which agrees with human evaluation. We make available the MWE score implementation – along with MWE-annotated training sets and corpus-based lists of MWEs – for reproduction and extension.


Morphology and Tagging

An Enhanced Mapping Scheme of the Universal Part-Of-Speech for Korean

Myung Hee Kim and Nathalie Colineau

When mapping a language specific Part-Of-Speech (POS) tag set to the Universal POS tag set (UPOS), it is critical to consider the individual language’s linguistic features and the UPOS definitions. In this paper, we present an enhanced Sejong POS mapping to the UPOS in accordance with the Korean linguistic typology and the substantive definitions of the UPOS categories. This work updated one third of the Sejong POS mapping to the UPOS. We also introduced a new mapping for the KAIST POS tag set, another widely used Korean POS tag set, to the UPOS.


Finite State Machine Pattern-Root Arabic Morphological Generator, Analyzer and Diacritizer

Maha Alkhairy, Afshan Jafri and David Smith

We describe and evaluate the Finite-State Arabic Morphologizer (FSAM) – a concatenative (prefix-stem-suffix) and templatic (root-pattern) morphologizer that generates and analyzes undiacritized Modern Standard Arabic (MSA) words, and diacritizes them. Our bidirectional unified-architecture finite state machine (FSM) is based on morphotactic MSA grammatical rules. The FSM models the root-pattern structure related to semantics and syntax, making it readily scalable unlike stem-tabulations in prevailing systems. We evaluate the coverage and accuracy of our model, with coverage being percentage of words in Tashkeela (a large corpus) that can be analyzed. Accuracy is computed against a gold standard, comprising words and properties, created from the intersection of UD PADT treebank and Tashkeela. Coverage of analysis (extraction of root and properties from word) is 82%. Accuracy results are: root computed from a word (92%), word generation from a root (100%), non-root properties of a word (97%), and diacritization (84%). FSAM’s non-root results match or surpass MADAMIRA’s, and root result comparisons are not made because of the concatenative nature of publicly available morphologizers.


An Unsupervised Method for Weighting Finite-state Morphological Analyzers

Amr Keleg, Francis Tyers, Nick Howell and Tommi Pirinen

Morphological analysis is one of the tasks that have been studied for years. Different techniques have been used to develop models for performing morphological analysis. Models based on finite state transducers have proved to be more suitable for languages with low available resources. In this paper, we have developed a method for weighting a morphological analyzer built using finite state transducers in order to disambiguate its results. The method is based on a word2vec model that is trained in a completely unsupervised way using raw untagged corpora and is able to capture the semantic meaning of the words. Most of the methods used for disambiguating the results of a morphological analyzer relied on having tagged corpora that need to be manually built. Additionally, the method developed uses information about the token irrespective of its context unlike most of the other techniques that heavily rely on the word's context to disambiguate its set of candidate analyses.


Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction

Danushka Bollegala, Ryuichi Kiryo, Kosuke Tsujino and Haruki Yukawa

Language-independent tokenisation (LIT) methods that do not require labelled language resources or lexicons have recently gained popularity because of their applicability in resource-poor languages. Moreover, they compactly represent a language using a fixed size vocabulary and can efficiently handle unseen or rare words. On the other hand, language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources. Unlike subtokens produced by LIT methods, LST methods produce valid morphological subwords. Despite the contrasting trade-offs between LIT vs. LST methods, their performance on downstream NLP tasks remains unclear. In this paper, we empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages. Our experimental results covering eight languages show that LST consistently outperforms LIT when the vocabulary size is large, but LIT can produce comparable or better results than LST in many languages with comparatively smaller (i.e. less than 100K words) vocabulary sizes, encouraging the use of LIT when language-specific resources are unavailable, incomplete or a smaller model is required. Moreover, we find smoothed inverse frequency (SIF) to be an accurate method to create word embeddings from subword embeddings for multilingual semantic similarity prediction tasks. Further analysis of the nearest neighbours of tokens shows that semantically and syntactically related tokens are closely embedded in subword embedding spaces.


A Supervised Part-Of-Speech Tagger for the Greek Language of the Social Web

Maria Nefeli Nikiforos and Katia Lida Kermanidis

The increasing volume of communication via microblogging messages on social networks has created the need for efficient Natural Language Processing (NLP) tools, especially for unstructured text processing. Extracting information from unstructured social text is one of the most demanding NLP tasks. This paper presents the first part-of-speech tagged data set of social text in Greek, as well as the first supervised part-of-speech tagger developed for such data sets.


Bag & Tag'em - A New Dutch Stemmer

Anne Jonker, Corné de Ruijt and Jornt de Gruijl

We propose a novel stemming algorithm that is both robust and accurate compared to state-of-the-art solutions, yet addresses several of the problems that current stemmers face in the Dutch language. The main issue is that most current stemmers cannot handle 3rd person singular forms of verbs and many irregular words and conjugations, unless a (nearly) brute-force approach is used. Our algorithm combines a new tagging module with a stemmer that uses tag-specific sets of rigid rules: the Bag & Tag’em (BT) algorithm. The tagging module is developed and evaluated using three algorithms: Multinomial Logistic Regression (MLR), Neural Network (NN) and Extreme Gradient Boosting (XGB). The stemming module’s performance is compared with that of current state-of-the-art stemming algorithms for the Dutch Language. Even though there is still room for improvement, the new BT algorithm performs well in the sense that it is more accurate than the current stemmers and faster than brute-force-like algorithms. The code and data used for this paper can be found at:


Glawinette: a Linguistically Motivated Derivational Description of French Acquired from GLAWI

Nabil Hathout, Franck Sajous, Basilio Calderone and Fiammetta Namer

Glawinette is a derivational lexicon of French that will be used to feed the Démonette database. It has been created from the GLAWI machine readable dictionary. We collected couples of words from the definitions and the morphological sections of the dictionary and then selected the ones that form regular formal analogies and that instantiate frequent enough formal patterns. The graph structure of the morphological families has then been used to identify for each couple of lexemes derivational patterns that are close to the intuition of the morphologists.


BabyFST - Towards a Finite-State Based Computational Model of Ancient Babylonian

Aleksi Sahala, Miikka Silfverberg, Antti Arppe and Krister Lindén

Akkadian is a fairly well resourced extinct language that does not yet have a comprehensive morphological analyzer available. In this paper we describe a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language, that can achieve coverage of up to 97.3% and recall of up to 93.7% on a token-level lemmatization and POS-tagging task from a transcribed input. Since Akkadian word forms exhibit a high degree of morphological ambiguity, in that only 20.1% of running word tokens receive a single unambiguous analysis, we attempt a first pass at weighting our finite-state transducer, using existing extensive Akkadian corpora which have been partially validated for their lemmas and parts-of-speech but not the entire morphological analyses. The resultant weighted finite-state transducer yields a moderate improvement so that for 57.4% of the word tokens the highest ranked analysis is the correct one. We conclude with a short discussion on how morphological ambiguity in the analysis of Akkadian could be further reduced with improvements in the training data used in weighting the finite-state transducer as well as through other, context-based techniques.


Morphological Analysis and Disambiguation for Gulf Arabic: The Interplay between Resources and Methods

Salam Khalifa, Nasser Zalmout and Nizar Habash

In this paper we present the first full morphological analysis and disambiguation system for Gulf Arabic. We use an existing state-of-the-art morphological disambiguation system to investigate the effects of different data sizes and different combinations of morphological analyzers for Modern Standard Arabic, Egyptian Arabic, and Gulf Arabic. We find that in very low-resource settings, morphological analyzers help boost the performance of the full morphological disambiguation task. However, as the size of resources increases, the value of the morphological analyzers decreases.


Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus

Eleni Metheniti and Guenter Neumann

Multilingual, inflectional corpora are a scarce resource in the NLP community, especially corpora with annotated morpheme boundaries. We are evaluating a generated, multilingual inflectional corpus with morpheme boundaries, generated from the English Wiktionary (Metheniti and Neumann, 2018), against the largest, multilingual, high-quality inflectional corpus of the UniMorph project (Kirov et al., 2018). We confirm that the generated Wikinflection corpus is not of such quality as UniMorph, but we were able to extract a significant amount of words from the intersection of the two corpora. Our Wikinflection corpus benefits from the morpheme segmentations of Wiktionary/Wikinflection and from the manually-evaluated morphological feature tags of the UniMorph project, and has 216K lemmas and 5.4M word forms, in a total of 68 languages.


Introducing a Large-Scale Dataset for Vietnamese POS Tagging on Conversational Texts

Oanh Tran, Tu Pham, Vu Dang and Bang Nguyen

This paper introduces a large-scale human-labeled dataset for the Vietnamese POS tagging task on conversational texts. To this end, we propose a new tagging scheme (with 36 POS tags) consisting of exclusive tags for special phenomena of conversational words, develop the annotation guideline and manually annotate 16.310K sentences using this guideline. Based on this corpus, a series of state-of-the-art tagging methods has been conducted to estimate their performances. Experimental results showed that the Conditional Random Fields model using both automatically learnt features from deep neural networks and handcrafted features yielded the best performance. This model achieved 93.36% in the accuracy score which is 1.6% and 2.7% higher than the model using either handcrafted features or automatically-learnt features, respectively. This result is also a little bit higher than the model of fine-tuning BERT by 0.94% in the accuracy score. The performance measured on each POS tag is also very high with >90% in the F1 score for 20 POS tags and >80% in the F1 score for 11 POS tags. This work provides the public dataset and preliminary results for follow-up research on this interesting direction.


UniMorph 3.0: Universal Morphology

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden and David Yarowsky

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological paradigms for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. We have implemented several improvements to the extraction pipeline which creates most of our data, so that it is both more complete and more correct. We have added 66 new languages, as well as new parts of speech for 12 languages. We have also amended the schema in several ways. Finally, we present three new community tools: two to validate data for resource creators, and one to make morphological data available from the command line. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland. This paper details advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018.


Building the Spanish-Croatian Parallel Corpus

Bojana Mikelenić and Marko Tadić

This paper describes the building of the first Spanish-Croatian unidirectional parallel corpus, which has been constructed at the Faculty of Humanities and Social Sciences of the University of Zagreb. The corpus is comprised of eleven Spanish novels and their translations to Croatian done by six different professional translators. All the texts were published between 1999 and 2012. The corpus has more than 2 Mw, with approximately 1 Mw for each language. It was automatically sentence segmented and aligned, as well as manually post-corrected, and contains 71,778 translation units. In order to protect the copyright and to make the corpus available under permissive CC-BY licence, the aligned translation units are shuffled. This limits the usability of the corpus for research of language units at sentence and lower language levels only. There are two versions of the corpus in TMX format that will be available for download through META-SHARE and CLARIN ERIC infrastructure. The former contains plain TMX, while the latter is lemmatised and POS-tagged and stored in the aTMX format.


DerivBase.Ru: a Derivational Morphology Resource for Russian

Daniil Vodolazsky

Russian morphology has been studied for decades, but there is still no large high coverage resource that contains the derivational families (groups of words that share the same root) of Russian words. The number of words used in different areas of the language grows rapidly, thus the human-made dictionaries published a long time ago cannot cover the neologisms and the domain-specific lexicons. To fill this resource gap, we have developed a rule-based framework for deriving words and we applied it to build a derivational morphology resource named DerivBase.Ru, which we introduce in this paper.


Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning

Stig-Arne Grönroos, Sami Virpioja and Mikko Kurimo

Data-driven segmentation of words into subword units has been used in various natural language processing applications such as automatic speech recognition and statistical machine translation for almost 20 years. Recently it has become more widely adopted, as models based on deep neural networks often benefit from subword units even for morphologically simpler languages. In this paper, we discuss and compare training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning. Using English, Finnish, North Sami, and Turkish data sets, we show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model than its original recursive training algorithm. The improved optimization also leads to higher morphological segmentation accuracy when compared to a linguistic gold standard. We publish implementations of the new algorithms in the widely-used Morfessor software package.


Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian

Ranka Stankovic, Branislava Šandrih, Cvetana Krstev, Miloš Utvić and Mihailo Skoric

The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East and Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated, which resulted in around 98% PoS-tagging precision per token for both new models. The sr_basic annotated dataset will also be published.


Fine-grained Morphosyntactic Analysis and Generation Tools for More Than One Thousand Languages

Garrett Nicolai, Dylan Lewis, Arya D. McCarthy, Aaron Mueller, Winston Wu and David Yarowsky

Exploiting the broad translation of the Bible into the world's languages, we train and distribute morphosyntactic tools for approximately one thousand languages, vastly outstripping previous distributions of tools devoted to the processing of inflectional morphology. Evaluation of the tools on a subset of available inflectional dictionaries demonstrates strong initial models, supplemented and improved through ensembling and dictionary-based reranking. Likewise, a novel type-to-token based evaluation metric allows us to confirm that models generalize well across rare and common forms alike.


Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus

Mohamed Balabel, Injy Hamed, Slim Abdennadher, Ngoc Thang Vu and Özlem Çetinoğlu

Code-switching has become a prevalent phenomenon across many communities. It poses a challenge to NLP researchers, mainly due to the lack of available data needed for training and testing applications. In this paper, we introduce a new resource: a corpus of Egyptian-Arabic code-switch speech data that is fully tokenized, lemmatized and annotated for part-of-speech tags. Besides the corpus itself, we provide annotation guidelines to address the unique challenges of annotating code-switch data. Another challenge that we address is the fact that Egyptian Arabic orthography and grammar are not standardized.


Getting More Data for Low-resource Morphological Inflection: Language Models and Data Augmentation

Alexey Sorokin

We investigate how to improve the quality of low-resource morphological inflection without annotating more data. We examine two methods, language models and data augmentation. We show that the model whose decoder additionally uses the states of the language model improves the model quality by 1.5% in combination with both baselines. We also demonstrate that the augmentation of data improves performance by 9% on average when adding 1000 artificially generated word forms to the dataset.


Visual Modeling of Turkish Morphology

Berke Özenç and Ercan Solak

In this paper, we describe the steps in a visual modeling of Turkish morphology using diagramming tools. We aimed to make modeling easier and more maintainable while automating much of the code generation. We released the resulting analyzer, MorTur, and the diagram conversion tool, DiaMor, as free, open-source utilities. The MorTur analyzer is also publicly available on its web page as a web service. MorTur and DiaMor are part of our ongoing efforts in building a set of natural language processing tools for Turkic languages under a consistent framework.


Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic

Jón Daðason, David Mollberg, Hrafn Loftsson and Kristín Bjarnadóttir

In this paper, we present a character-based BiLSTM model for splitting Icelandic compound words, and show how varying amounts of training data affect the performance of the model. Compounding is highly productive in Icelandic, and new compounds are constantly being created. This results in a large number of out-of-vocabulary (OOV) words, negatively impacting the performance of many NLP tools. Our model is trained on a dataset of 2.9 million unique word forms and their constituent structures from the Database of Icelandic Morphology. The model learns how to split compound words into two parts and can be used to derive the constituent structure of any word form. Knowing the constituent structure of a word form makes it possible to generate the optimal split for a given task, e.g., a full split for subword tokenization, or, in the case of part-of-speech tagging, splitting an OOV word until the largest known morphological head is found. The model outperforms other previously published methods when evaluated on a corpus of manually split word forms. This method has been integrated into Kvistur, an Icelandic compound word analyzer.


Morphological Segmentation for Low Resource Languages

Justin Mott, Ann Bies, Stephanie Strassel, Jordan Kodner, Caitlin Richter, Hongzhi Xu and Mitchell Marcus

This paper describes a new morphology resource created by Linguistic Data Consortium and the University of Pennsylvania for the DARPA LORELEI Program. The data consists of approximately 2000 tokens annotated for morphological segmentation in each of 9 low resource languages, along with root information for 7 of the languages. The languages annotated show a broad diversity of typological features. A minimal annotation scheme for segmentation was developed such that it could capture the patterns of a wide range of languages and also be performed reliably by non-linguist annotators. The basic annotation guidelines were designed to be language-independent, but included language-specific morphological paradigms and other specifications. The resulting annotated corpus is designed to support and stimulate the development of unsupervised morphological segmenters and analyzers by providing a gold standard for their evaluation on a more typologically diverse set of languages than has previously been available. By providing root annotation, this corpus is also a step toward supporting research in identifying richer morphological structures than simple morpheme boundaries.




CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin and Edouard Grave

Pre-training text representations has led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.


On the Robustness of Unsupervised and Semi-supervised Cross-lingual Word Embedding Learning

Yerai Doval, Jose Camacho-Collados, Luis Espinosa Anke and Steven Schockaert

Cross-lingual word embeddings are vector representations of words in different languages where words with similar meaning are represented by similar vectors, regardless of the language. Recent developments which construct these embeddings by aligning monolingual spaces have shown that accurate alignments can be obtained with little or no supervision, which usually comes in the form of bilingual dictionaries. However, the focus has been on a particular controlled scenario for evaluation, and there is no strong evidence on how current state-of-the-art systems would fare with noisy text or for language pairs with major linguistic differences. In this paper we present an extensive evaluation over multiple cross-lingual embedding models, analyzing their strengths and limitations with respect to different variables such as target language, training corpora and amount of supervision. Our conclusions put in doubt the view that high-quality cross-lingual embeddings can always be learned without much supervision.


Building an English-Chinese Parallel Corpus Annotated with Sub-sentential Translation Techniques

Yuming Zhai, Lufei Liu, Xinyi Zhong, Gabriel Illouz and Anne Vilnat

Human translators often resort to different non-literal translation techniques besides the literal translation, such as idiom equivalence, generalization, particularization, semantic modulation, etc., especially when the source and target languages have different and distant origins. Translation techniques constitute an important subject in translation studies, which help researchers to understand and analyse translated texts. However, they receive less attention in developing Natural Language Processing (NLP) applications. To fill this gap, one of our long term objectives is to have a better semantic control of extracting paraphrases from bilingual parallel corpora. Based on this goal, we suggest this hypothesis: it is possible to automatically recognize different sub-sentential translation techniques. For this original task, since there is no dedicated data set for English-Chinese, we manually annotated a parallel corpus of eleven genres. Fifty sentence pairs for each genre have been annotated in order to consolidate our annotation guidelines. Based on this data set, we conducted an experiment to classify between literal and non-literal translations. The preliminary results confirm our hypothesis. The corpus and code are available. We hope that this annotated corpus will be useful for linguistic contrastive studies and for fine-grained evaluation of NLP tasks, such as automatic word alignment and machine translation.


Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajic, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers and Daniel Zeman

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the universal guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.


EMPAC: an English–Spanish Corpus of Institutional Subtitles

Iris Serrat Roozen and José Manuel Martínez Martínez

The EuroparlTV Multimedia Parallel Corpus (EMPAC) is a collection of subtitles in English and Spanish for videos from the European Parliament’s Multimedia Centre. The corpus has been compiled with the EMPAC toolkit. The aim of this corpus is to provide a resource to study institutional subtitling on the one hand, and, on the other hand, facilitate the analysis of web accessibility to institutional multimedia content. The corpus covers a time span from 2009 to 2017 and is made up of 4,000 texts amounting to two and a half million tokens for every language, corresponding to approximately 280 hours of video. This paper provides 1) a review of related corpora; 2) a review of typical compilation methodologies of subtitle corpora; 3) a detailed account of the corpus compilation methodology followed; and, 4) a description of the corpus. In the conclusion, the key findings are summarised regarding formal aspects of the subtitles conditioning the accessibility to the multimedia content of the EuroparlTV.


Cross-Lingual Word Embeddings for Turkic Languages

Elmurod Kuriyozov, Yerai Doval and Carlos Gómez-Rodríguez

There has been an increasing interest in learning cross-lingual word embeddings to transfer knowledge obtained from a resource-rich language, such as English, to lower-resource languages for which annotated data is scarce, such as Turkish, Russian, and many others. In this paper, we present the first viability study of established techniques to align monolingual embedding spaces for Turkish, Uzbek, Azeri, Kazakh and Kyrgyz, members of the Turkic family which is heavily affected by the low-resource constraint. Those techniques are known to require little explicit supervision, mainly in the form of bilingual dictionaries, hence being easily adaptable to different domains, including low-resource ones. We obtain new bilingual dictionaries and new word embeddings for these languages and show the steps for obtaining cross-lingual word embeddings using state-of-the-art techniques. Then, we evaluate the results using the bilingual dictionary induction task. Our experiments confirm that the obtained bilingual dictionaries outperform previously available ones, and that word embeddings from a low-resource language can benefit from resource-rich closely related languages when they are aligned together. Furthermore, evaluation on an extrinsic task (sentiment analysis on Uzbek) shows that monolingual word embeddings can, although slightly, benefit from cross-lingual alignments.
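As a rough illustration of the bilingual dictionary induction task mentioned in this abstract, the sketch below retrieves, for each source word, the nearest target word by cosine similarity in an already aligned space and scores the result with precision at 1. The two-dimensional vectors, the tiny Uzbek-Turkish word list, and the function names are all hypothetical and chosen only to make the example self-contained; they are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def induce_dictionary(src_emb, tgt_emb):
    """Map every source word to its nearest target word (nearest-neighbour retrieval)."""
    return {
        s: max(tgt_emb, key=lambda t: cosine(sv, tgt_emb[t]))
        for s, sv in src_emb.items()
    }

def precision_at_1(induced, gold):
    """Share of source words whose top-ranked translation is in the gold dictionary."""
    hits = sum(1 for s, t in induced.items() if t in gold.get(s, set()))
    return hits / len(induced) if induced else 0.0

# Hypothetical toy vectors, assumed to live in one shared (aligned) space.
src = {"suv": [1.0, 0.0], "kitob": [0.0, 1.0], "yulduz": [0.9, 0.3]}
tgt = {"su": [0.9, 0.1], "kitap": [0.1, 0.9], "yıldız": [0.6, 0.6]}
gold = {"suv": {"su"}, "kitob": {"kitap"}, "yulduz": {"yıldız"}}

induced = induce_dictionary(src, tgt)
print(induced)
print(precision_at_1(induced, gold))
```

In this toy setup the third source word is deliberately placed closer to the wrong target, so the score falls below 1 and the metric has something to measure.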


How Universal are Universal Dependencies? Exploiting Syntax for Multilingual Clause-level Sentiment Detection

Hiroshi Kanayama and Ran Iwamoto

This paper investigates clause-level sentiment detection in a multilingual scenario. Aiming at a high-precision, fine-grained, configurable, and non-biased system for practical use cases, we have designed a pipeline method that makes the most of syntactic structures based on Universal Dependencies, avoiding machine-learning approaches that may cause obstacles to our purposes. We achieved high precision in sentiment detection for 17 languages and identified the advantages of common syntactic structures as well as issues stemming from structural differences on Universal Dependencies. In addition to reusable tips for handling multilingual syntax, we provide a parallel benchmarking data set for further research.


Multilingual Culture-Independent Word Analogy Datasets

Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė and Marko Robnik-Šikonja

In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We designed the monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
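For readers unfamiliar with the word analogy task, here is a minimal sketch of the widely used vector-offset formulation: given "a is to b as c is to ?", pick the word whose vector is closest to b - a + c. The abstract does not state which scoring method the authors used, so this formulation is an assumption, and the three-dimensional toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(emb, a, b, c):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Hypothetical toy embeddings; real evaluations would load e.g. fastText vectors.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}

print(solve_analogy(toy, "man", "woman", "king"))
```

Accuracy on an analogy dataset is then simply the share of quadruples for which the returned word matches the expected one.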


GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

Marta R. Costa-jussà, Pau Li Lin and Cristina España-Bonet

We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract a corpus balanced in gender. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the path to standardize procedures to produce gender-balanced datasets.


SpiCE: A New Open-Access Corpus of Conversational Bilingual Speech in Cantonese and English

Khia A. Johnson, Molly Babel, Ivan Fong and Nancy Yiu

This paper describes the design, collection, orthographic transcription, and phonetic annotation of SpiCE, a new corpus of conversational Cantonese-English bilingual speech recorded in Vancouver, Canada. The corpus includes high-quality recordings of 34 early bilinguals in both English and Cantonese—to date, 27 have been recorded for a total of 19 hours of participant speech. Participants completed a sentence reading task, storyboard narration, and conversational interview in each language. Transcription and annotation for the corpus are currently underway. Transcripts produced with Google Cloud Speech-to-Text are available for all participants, and will be included in the initial SpiCE corpus release. Hand-corrected orthographic transcripts and force-aligned phonetic transcripts will be released periodically, and upon completion for all recordings, comprise the second release of the corpus. As an open-access language resource, SpiCE will promote bilingualism research for a typologically distinct pair of languages, of which Cantonese remains understudied despite there being millions of speakers around the world. The SpiCE corpus is especially well-suited for phonetic research on conversational speech, and enables researchers to study cross-language within-speaker phenomena for a diverse group of early Cantonese-English bilinguals. These are areas with few existing high-quality resources.


Identifying Cognates in English-Dutch and French-Dutch by means of Orthographic Information and Cross-lingual Word Embeddings

Els Lefever, Sofie Labat and Pranaydeep Singh

This paper investigates the validity of combining more traditional orthographic information with cross-lingual word embeddings to identify cognate pairs in English-Dutch and French-Dutch. In a first step, lists of potential cognate pairs in English-Dutch and French-Dutch are manually labelled. The resulting gold standard is used to train and evaluate a multi-layer perceptron that can distinguish cognates from non-cognates. Fifteen orthographic features capture string similarities between source and target words, while the cosine similarity between their word embeddings represents the semantic relation between these words. By adding domain-specific information to pretrained fastText embeddings, we are able to obtain good embeddings for words that did not yet have a pretrained embedding (e.g. Dutch compound nouns). These embeddings are then aligned in a cross-lingual vector space by exploiting their structural similarity (cf. adversarial learning). Our results indicate that although the classifier already achieves good results on the basis of orthographic information, the performance further improves by including semantic information in the form of cross-lingual word embeddings.


Lexicogrammatic translationese across two targets and competence levels

Maria Kunilovskaya and Ekaterina Lapshinova-Koltunski

This research employs genre-comparable data from a number of parallel and comparable corpora to explore the specificity of translations from English into German and Russian produced by students and professional translators. We introduce an elaborate set of human-interpretable lexicogrammatic translationese indicators and calculate the amount of translationese manifested in the data for each target language and translation variety. By placing translations into the same feature space as their sources and the genre-comparable non-translated reference texts in the target language, we observe two separate translationese effects: a shift of translations into the gap between the two languages and a shift away from either language. These trends are linked to the features that contribute to each of the effects. Finally, we compare the translation varieties and find out that the professionalism levels seem to have some correlation with the amount and types of translationese detected, while each language pair demonstrates a specific socio-linguistically determined combination of the translationese effects.


UniSent: Universal Adaptable Sentiment Lexica for 1000+ Languages

Ehsaneddin Asgari, Fabienne Braune, Benjamin Roth, Christoph Ringlstetter and Mohammad Mofrad

In this paper, we introduce UniSent universal sentiment lexica for 1000+ languages. Sentiment lexica are vital for sentiment analysis in absence of document-level annotations, a very common scenario for low-resource languages. To the best of our knowledge, UniSent is the largest sentiment resource to date in terms of the number of covered languages, including many low-resource ones. In this work, we use a massively parallel Bible corpus to project sentiment information from English to other languages for sentiment analysis on Twitter data. We introduce a method called DomDrift to mitigate the huge domain mismatch between Bible and Twitter by a confidence weighting scheme that uses domain-specific embeddings to compare the nearest neighbors for a candidate sentiment word in the source (Bible) and target (Twitter) domain. We evaluate the quality of UniSent in a subset of languages for which manually created ground truth was available, Macedonian, Czech, German, Spanish, and French. We show that the quality of UniSent is comparable to manually created sentiment resources when it is used as the sentiment seed for the task of word sentiment prediction on top of embedding representations. In addition, we show that emoticon sentiments could be reliably predicted in the Twitter domain using only UniSent and monolingual embeddings in German, Spanish, French, and Italian. With the publication of this paper, we release the UniSent sentiment lexica at


CanVEC - the Canberra Vietnamese-English Code-switching Natural Speech Corpus

Li Nguyen and Christopher Bryant

This paper introduces the Canberra Vietnamese-English Code-switching corpus (CanVEC), an original corpus of natural mixed speech that we semi-automatically annotated with language information, part of speech (POS) tags and Vietnamese translations. The corpus, which was built to inform a sociolinguistic study on language variation and code-switching, consists of 10 hours of recorded speech (87k tokens) between 45 Vietnamese-English bilinguals living in Canberra, Australia. We describe how we collected and annotated the corpus by pipelining several monolingual toolkits to considerably speed up the annotation process. We also describe how we evaluated the automatic annotations to ensure corpus reliability. We make the corpus available for research purposes.


A Spelling Correction Corpus for Multiple Arabic Dialects

Fadhl Eryani, Nizar Habash, Houda Bouamor and Salam Khalifa

Arabic dialects are the non-standard varieties of Arabic commonly spoken -- and increasingly written on social media -- across the Arab world.  Arabic dialects do not have standard orthographies, a challenge for natural language processing applications. In this paper, we present the MADAR CODA Corpus, a collection of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) represented in the Conventional Orthography for Dialectal Arabic (CODA) in parallel with their raw original form. The sentences come from the Multi-Arabic Dialect Applications and Resources (MADAR) Project and are in parallel across the cities (2,000 sentences from each city).  This publicly available resource is intended to support research on spelling correction and text normalization for Arabic dialects.  We present results on a bootstrapping technique we use to speed up the CODA annotation, as well as on the degree of similarity across the dialects before and after CODA annotation.


A Dataset for Multi-lingual Epidemiological Event Extraction

Stephen Mutuvi, Antoine Doucet, Gael Lejeune and Moses Odeo

This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can be used not only for information extraction, but also for other natural language processing (NLP) tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (ProMED) platform, which provides current information about outbreaks of infectious diseases globally. Among the key pieces of information present in the articles is the uniform resource locator (URL) to the online news sources where the outbreaks were originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report on experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language (DAnIEL) system. DAnIEL is a multilingual news surveillance system that leverages unique attributes associated with news reporting to extract events: repetition and saliency. The system has wide geographical and language coverage, including low-resource languages. In addition, we compare different classification approaches in terms of their ability to differentiate between epidemic-related and unrelated news articles that constitute the corpus.


Swiss-AL: A Multilingual Swiss Web Corpus for Applied Linguistics

Julia Krasselt, Philipp Dressen, Matthias Fluor, Cerstin Mahlow, Klaus Rothenhäusler and Maren Runte

The Swiss Web Corpus for Applied Linguistics (Swiss-AL) is a multilingual (German, French, Italian) collection of texts from selected web sources. Unlike most other web corpora it is not intended for NLP purposes, but rather designed to support data-based and data-driven research on societal and political discourses in Switzerland. It currently contains 8 million texts (approx. 1.55 billion tokens), including news and specialist publications, governmental opinions, and parliamentary records, web sites of political parties, companies, and universities, statements from industry associations and NGOs, etc. A flexible processing pipeline using state-of-the-art components allows researchers in applied linguistics to create tailor-made subcorpora for studying discourse in a wide range of domains. So far, Swiss-AL has been used successfully in research on Swiss public discourses on energy and on antibiotic resistance.


Analysis of GlobalPhone and Ethiopian Languages Speech Corpora for Multilingual ASR

Martha Yifiru Tachbelie, Solomon Teferra Abate and Tanja Schultz

In this paper, we present the analysis of GlobalPhone (GP) and speech corpora of Ethiopian languages (Amharic, Tigrigna, Oromo and Wolaytta). The aim of the analysis is to select speech data from GP for the development of a multilingual Automatic Speech Recognition (ASR) system for the Ethiopian languages. To this end, phonetic overlaps among GP and Ethiopian languages have been analyzed. The result of our analysis shows that there is much phonetic overlap among Ethiopian languages although they are from three different language families. From GP, Turkish, Uyghur and Croatian are found to have much overlap with the Ethiopian languages. On the other hand, Korean has less phonetic overlap with the rest of the languages. Moreover, the morphological complexity of the GP and Ethiopian languages, reflected by type-to-token ratio (TTR) and out-of-vocabulary (OOV) rate, has been analyzed. Both metrics indicated the morphological complexity of the languages. Korean and Amharic have been identified as extremely morphologically complex compared to the other languages. Tigrigna, Russian, Turkish, Polish, etc. are also among the morphologically complex languages.
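The two metrics named in this abstract can be sketched in a few lines. The definitions below are the common textbook ones (distinct types over tokens, and the share of test tokens unseen in training); the abstract does not spell out the exact variants used, so treat this as an assumption. The sentences are toy data.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

def oov_rate(train_tokens, test_tokens):
    """Share of test tokens whose type never occurs in the training tokens."""
    if not test_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)

# Toy data: whitespace tokenization is a simplification.
train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()

print(type_token_ratio(train))  # 5 types over 6 tokens
print(oov_rate(train, test))    # 2 unseen tokens over 6 tokens
```

A morphologically rich language produces many distinct surface forms, so both values tend to be higher than for a language with little inflection on text of comparable size.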


Multilingualization of Medical Terminology: Semantic and Structural Embedding Approaches

Long-Huei Chen and Kyo Kageura

The multilingualization of terminology is an essential step in the translation pipeline, to ensure the correct transfer of domain-specific concepts. Many institutions and language service providers construct and maintain multilingual terminologies, which constitute important assets. However, the curation of such multilingual resources requires significant human effort; though automatic multilingual term extraction methods have been proposed so far, they are of limited success as term translation cannot be satisfied by simply conveying meaning, but requires the terminologists' and domain experts' knowledge to fit the term within the existing terminology. Here we propose a method to encode the structural property of a term by aligning their embeddings using graph convolutional networks trained from separate languages. We observe that the structural information can augment the semantic methods also explored in this work, and recognize that the unique nature of terminologies allows our method to fully take advantage and produce superior results.


Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo and Wolaytta

Solomon Teferra Abate, Martha Yifiru Tachbelie, Michael Melese, Hafte Abera, Tewodros Abebe, Wondwossen Mulugeta, Yaregal Assabie, Million Meshesha, Solomon Afnafu and Binyam Ephrem Seyoum

Automatic Speech Recognition (ASR) is one of the most important technologies to support spoken communication in modern life. However, its development benefits from a large speech corpus. The development of such a corpus is expensive and most human languages, including the Ethiopian languages, do not have such resources. To address this problem, we have developed four large (about 22 hours) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo and Wolaytta. To assess the usability of the corpora for speech processing, we have developed ASR systems for each language. In this paper, we present the corpora and the baseline ASR systems we have developed. We have achieved word error rates (WERs) of 37.65%, 31.03%, 38.02%, 33.89% for Amharic, Tigrigna, Oromo and Wolaytta, respectively. These results show that the corpora are suitable for further investigation towards the development of ASR systems. Thus, the research community can use the corpora to further improve speech processing systems. From our results, it is clear that the collection of text corpora to train strong language models for all of the languages is still required, especially for Oromo and Wolaytta.
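Word error rate, the metric reported in this abstract, is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below implements that standard definition with whitespace tokenization; it is not the authors' scoring script, and the sentences are toy data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.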


Incorporating Politeness across Languages in Customer Care Responses: Towards building a Multi-lingual Empathetic Dialogue Agent

Mauajama Firdaus, Asif Ekbal and Pushpak Bhattacharyya

Customer satisfaction is an essential aspect of customer care systems. It is imperative for such systems to be polite while handling customer requests/demands. In this paper, we present a large multi-lingual conversational dataset for English and Hindi. We choose data from Twitter having both generic and courteous responses between customer care agents and aggrieved users. We also propose strong baselines that can induce courteous behaviour in generic customer care response in a multi-lingual scenario. We build a deep learning framework that can simultaneously handle different languages and incorporate polite behaviour in the customer care agent's responses. Our system is competent in generating responses in different languages (here, English and Hindi) depending on the customer’s preference and also is able to converse with humans in an empathetic manner to ensure customer satisfaction and retention. Experimental results show that our proposed models can converse in both the languages and the information shared between the languages helps in improving the performance of the overall system. Qualitative and quantitative analysis shows that the proposed method can converse in an empathetic manner by incorporating courteousness in the responses and hence increasing customer satisfaction.


WikiBank: Using Wikidata to Improve Multilingual Frame-Semantic Parsing

Cezar Sas, Meriem Beloucif and Anders Søgaard

Frame-semantic annotations exist for a tiny fraction of the world's languages. Wikidata, however, links knowledge base triples to texts in many languages, providing a common, distant supervision signal for semantic parsers. We present WikiBank, a multilingual resource of partial semantic structures that can be used to extend pre-existing resources rather than creating new man-made resources from scratch. We also integrate this form of supervision into an off-the-shelf frame-semantic parser and allow cross-lingual transfer. Using Google's Sling architecture, we show significant improvements on the English and Spanish CoNLL 2009 datasets, whether training on the full available datasets or small subsamples thereof.


Multilingual Corpus Creation for Multilingual Semantic Similarity Task

Mahtab Ahmed, Chahna Dixit, Robert E. Mercer, Atif Khan, Muhammad Rifayat Samee and Felipe Urra

In natural language processing, the performance of a semantic similarity task relies heavily on the availability of a large corpus. Various monolingual corpora are available (mainly English), but multilingual resources are very limited. In this work, we describe a semi-automated framework to create a multilingual corpus which can be used for the multilingual semantic similarity task. The similar sentence pairs are obtained by crawling bilingual websites, whereas the dissimilar sentence pairs are selected by applying topic modeling and an OpenAI GPT model on the similar sentence pairs. We focus on websites in the government, insurance, and banking domains to collect English-French and English-Spanish sentence pairs; however, this corpus creation approach can be applied to any other industry vertical provided that a bilingual website exists. We also show experimental results for multilingual semantic similarity to verify the quality of the corpus and demonstrate its usage.


CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

Changhan Wang, Juan Pino, Anne Wu and Jiatao Gu

Spoken language translation has recently witnessed a resurgence in popularity, thanks to the development of end-to-end models and the creation of new corpora, such as Augmented LibriSpeech and MuST-C. Existing datasets involve language pairs with English as a source language, involve very specific domains or are low resource. We introduce CoVoST, a multilingual speech-to-text translation corpus from 11 languages into English, diversified with over 11,000 speakers and over 60 accents. We describe the dataset creation methodology and provide empirical evidence of the quality of the data. We also provide initial benchmarks, including, to our knowledge, the first end-to-end many-to-one multilingual models for spoken language translation. CoVoST is released under CC0 license and free to use. We also provide additional evaluation data derived from Tatoeba under CC licenses.


A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking

Hideki Nakayama, Akihiro Tamura and Takashi Ninomiya

Visually-grounded natural language processing has become an important research direction in the past few years. However, the majority of the available cross-modal resources (e.g., image-caption datasets) are built in English and cannot be directly utilized in multilingual or non-English scenarios. In this study, we present a novel multilingual multimodal corpus by extending the Flickr30k Entities image-caption dataset with Japanese translations, which we name Flickr30k Entities JP (F30kEnt-JP). To the best of our knowledge, this is the first multilingual image-caption dataset where the captions in the two languages are parallel and have the shared annotations of many-to-many phrase-to-region linking. We believe that phrase-to-region as well as phrase-to-phrase supervision can play a vital role in fine-grained grounding of language and vision, and will promote many tasks such as multilingual image captioning and multimodal machine translation. To verify our dataset, we performed phrase localization experiments in both languages and investigated the effectiveness of our Japanese annotations as well as multilingual learning realized by our dataset.


Multilingual Dictionary Based Construction of Core Vocabulary

Winston Wu, Garrett Nicolai and David Yarowsky

We propose a new functional definition and construction method for core vocabulary sets for multiple applications based on the relative coverage of a target concept in thousands of bilingual dictionaries. Our newly developed core concept vocabulary list derived from these dictionary consensus methods achieves high overlap with existing widely utilized core vocabulary lists targeted at applications such as first and second language learning or field linguistics. Our in-depth analysis illustrates multiple desirable properties of our newly proposed core vocabulary set, including their non-compositionality. We employ a cognate prediction method to recover missing coverage of this core vocabulary in massively multilingual dictionary construction, and we argue that this core vocabulary should be prioritized for elicitation when creating new dictionaries for low-resource languages for multiple downstream tasks including machine translation and language learning.


Common Voice: A Massively-Multilingual Speech Corpus

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers and Gregor Weber

The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.


Massively Multilingual Pronunciation Modeling with WikiPron

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy and Kyle Gorman

We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.


HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment

Anssi Yli-Jyrä, Josi Purhonen, Matti Liljeqvist, Arto Antturi, Pekka Nieminen, Kari M. Räntilä and Valtter Luoto

Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-Finnish bitexts (texts accompanied by a translation) were constructed manually in order to create an analytical concordance (Luoto et al., eds. 1997) for a Finnish Bible translation.  The creators of the bitexts recently secured the publisher's permission to release its fine-grained alignment, but the alignment was still dependent on proprietary, third-party resources such as a copyrighted text edition and proprietary morphological analyses of the source texts.  In this paper, we describe a nontrivial editorial process starting from the creation of the original one-purpose database and ending with its reconstruction using only freely available text editions and annotations.  This process produced an openly available dataset that contains (i) the source texts and their translations, (ii) the morphological analyses, (iii) the cross-lingual morpheme alignments.


ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English

Injy Hamed, Ngoc Thang Vu and Slim Abdennadher

In this paper, we present our ArzEn corpus, an Egyptian Arabic-English code-switching (CS) spontaneous speech corpus. The corpus is collected through informal interviews with 38 Egyptian bilingual university students and employees held in a soundproof room. A total of 12 hours are recorded, transcribed, validated and sentence segmented. The corpus is mainly designed to be used in Automatic Speech Recognition (ASR) systems; however, it also provides a useful resource for analyzing the CS phenomenon from linguistic, sociological, and psychological perspectives. In this paper, we first discuss the CS phenomenon in Egypt and the factors that gave rise to the current language. We then provide a detailed description of how the corpus was collected, giving an overview of the participants involved. We also present statistics on the CS involved in the corpus, as well as a summary of the effort exerted in the corpus development, in terms of number of hours required for transcription, validation, segmentation and speaker annotation. Finally, we discuss some factors contributing to the complexity of the corpus, as well as Arabic-English CS behaviour that could pose potential challenges to ASR systems.


Cross-lingual Named Entity List Search via Transliteration

Aleksandr Khakhmovich, Svetlana Pavlova, Kira Kirillova, Nikolay Arefyev and Ekaterina Savilova

Out-of-vocabulary words are still a challenge in cross-lingual Natural Language Processing tasks, for which transliteration from source to target language or script is one of the solutions. In this study, we collect a personal name dataset in 445 Wikidata languages (37 scripts), train Transformer-based multilingual transliteration models on 6 high- and 4 less-resourced languages, compare them with bilingual models from Merhav and Ash (2018) and determine that multilingual models perform better for less-resourced languages. We discover that intrinsic evaluation, i.e. comparison to a single gold standard, might not be appropriate in the task of transliteration due to its high variability. For this reason, we propose using extrinsic evaluation of transliteration via the cross-lingual named entity list search task (e.g. personal name search in a contacts list). Our code and datasets are publicly available online.



Serial Speakers: a Dataset of TV Series

Xavier Bost, Vincent Labatut and Georges Linares

For over a decade, TV series have been drawing increasing interest, both from the audience and from various academic fields. But while most viewers are hooked on the continuous plots of TV serials, the few annotated datasets available to researchers focus on standalone episodes of classical TV series. We aim at filling this gap by providing the multimedia/speech processing communities with “Serial Speakers”, an annotated dataset of 155 episodes from three popular American TV serials: “Breaking Bad”, “Game of Thrones” and “House of Cards”. “Serial Speakers” is suitable both for investigating multimedia retrieval in realistic use case scenarios, and for addressing lower level speech related tasks in especially challenging conditions. We publicly release annotations for every speech turn (boundaries, speaker) and scene boundary, along with annotations for shot boundaries, recurring shots, and interacting speakers in a subset of episodes. Because of copyright restrictions, the textual content of the speech turns is encrypted in the public version of the dataset, but we provide the users with a simple online tool to recover the plain text from their own subtitle files.


Image Position Prediction in Multimodal Documents

Masayasu Muraoka, Ryosuke Kohita and Etsuko Ishii

Conventional multimodal tasks, such as caption generation and visual question answering, have allowed machines to understand an image by describing or being asked about it in natural language, often via a sentence. Datasets for these tasks contain a large number of pairs of an image and the corresponding sentence as an instance. However, a real multimodal document such as a news article or Wikipedia page consists of multiple sentences with multiple images. Such documents require an advanced skill of jointly considering the multiple texts and multiple images, beyond a single sentence and image, for the interpretation. Therefore, aiming at building a system that can understand multimodal documents, we propose a task called image position prediction (IPP). In this task, a system learns plausible positions of images in a given document. To study this task, we automatically constructed a dataset of 66K multimodal documents with 320K images from Wikipedia articles. We conducted a preliminary experiment to evaluate the performance of a current multimodal system on our task. The experimental results show that the system outperformed simple baselines while the performance is still far from human performance, which thus poses new challenges in multimodal research.


Visual Grounding Annotation of Recipe Flow Graph

Taichi Nishimura, Suzushi Tomori, Hayato Hashimoto, Atsushi Hashimoto, Yoko Yamakata, Jun Harashima, Yoshitaka Ushiku and Shinsuke Mori

In this paper, we provide a dataset that gives visual grounding annotations to recipe flow graphs. A recipe flow graph is a representation of the cooking workflow, which is designed with the aim of understanding the workflow from natural language processing. Such a workflow will increase its value when grounded to real-world activities, and visual grounding is a way to do so. Visual grounding is provided as bounding boxes to image sequences of recipes, and each bounding box is linked to an element of the workflow. Because the workflows are also linked to the text, this annotation gives visual grounding with workflow's contextual information between procedural text and visual observation in an indirect manner. We subsidiarily annotated two types of event attributes with each bounding box: “doing-the-action” or “done-the-action”. As a result of the annotation, we got 2,300 bounding boxes in 272 flow graph recipes. Various experiments showed that the proposed dataset enables us to estimate contextual information described in recipe flow graphs from an image sequence.


Building a Multimodal Entity Linking Dataset From Tweets

Omar Adjali, Romaric Besançon, Olivier Ferret, Hervé Le Borgne and Brigitte Grau

The task of Entity linking, which aims at associating an entity mention with a unique entity in a knowledge base (KB), is useful for advanced Information Extraction tasks such as relation extraction or event detection. Most of the studies that address this problem rely only on textual documents while an increasing number of sources are multimedia, in particular in the context of social media where messages are often illustrated with images. In this article, we address the Multimodal Entity Linking (MEL) task, and more particularly the problem of its evaluation. To this end, we propose a novel method to quasi-automatically build annotated datasets to evaluate methods on the MEL task. The method collects text and images to jointly build a corpus of tweets with ambiguous mentions along with a Twitter KB defining the entities. We release a new annotated dataset of Twitter posts associated with images. We study the key characteristics of the proposed dataset and evaluate the performance of several MEL approaches on it.


A Multimodal Educational Corpus of Oral Courses: Annotation, Analysis and Case Study

Salima Mdhaffar, Yannick Estève, Antoine Laurent, Nicolas Hernandez, Richard Dufour, Delphine Charlet, Geraldine Damnati, Solen Quiniou and Nathalie Camelin

This corpus is part of the PASTEL (Performing Automated Speech Transcription for Enhancing Learning) project aiming to explore the potential of synchronous speech transcription and application in specific teaching situations. It includes 10 hours of different lectures, manually transcribed and segmented. The main interest of this corpus lies in its multimodal aspect: in addition to speech, the courses were filmed and the written presentation supports (slides) are made available. The dataset may then serve research in multiple fields, from speech and language to image and video processing. The dataset will be freely available to the research community. In this paper, we first describe in detail the annotation protocol, including a detailed analysis of the manually labeled data. Then, we propose some possible use cases of the corpus with baseline results. The use cases concern scientific fields from both speech and text processing, with language model adaptation, thematic segmentation and transcription to slide alignment.


Annotating Event Appearance for Japanese Chess Commentary Corpus

Hirotaka Kameko and Shinsuke Mori

In recent years, there has been a surge of interest in natural language processing related to the real world, such as symbol grounding, language generation, and non-linguistic data search by natural language queries. Researchers usually collect pairs of text and non-text data for research. However, the text and non-text data are not always a “true” pair. We focused on the shogi (Japanese chess) commentaries, which are accompanied by game states as a well-defined “real world”. For analyzing and processing texts accurately, considering only the given states is insufficient, and we must consider the relationship between texts and the real world. In this paper, we propose “Event Appearance” labels that show the relationship between events mentioned in texts and those happening in the real world. Our event appearance label set consists of temporal relation, appearance probability, and evidence of the event. Statistics of the annotated corpus and the experimental result show that there exists temporal relation which skillful annotators realize in common. However, it is hard to predict the relationship only by considering the given states.


Offensive Video Detection: Dataset and Baseline Results

Cleber Alcântara, Viviane Moreira and Diego Feijo

Web-users produce and publish high volumes of data of various types, such as text, images, and videos. The platforms try to restrain their users from publishing offensive content to keep a friendly and respectful environment and rely on moderators to filter the posts. However, this method is insufficient due to the high volume of publications. The identification of offensive material can be performed automatically using machine learning, which needs annotated datasets. Among the published datasets in this area, the Portuguese language is underrepresented, and videos are little explored. We investigated the problem of offensive video detection by assembling and publishing a dataset of videos in Portuguese containing mostly textual features. We ran experiments using popular machine learning classifiers used in this domain and reported our findings, alongside multiple evaluation metrics. We found that using word embedding with Deep Learning classifiers achieved the best results on average. CNN architectures, Naive Bayes, and Random Forest ranked top among different experiments. Transfer Learning models outperformed Classic algorithms when processing video transcriptions, but scored lower using other feature sets. These findings can be used as a baseline for future work on this subject.


Adding Gesture, Posture and Facial Displays to the PoliModal Corpus of Political Interviews

Daniela Trotta, Alessio Palmero Aprosio, Sara Tonelli and Annibale Elia

This paper introduces a multimodal corpus in the political domain, which on top of transcribed face-to-face interviews presents the annotation of facial displays, hand gestures and body posture. While the fully annotated corpus consists of 3 interviews for a total of 90 minutes, it is extracted from a larger available corpus of 56 face-to-face interviews (14 hours) that has been manually annotated with information about metadata (i.e. tools used for the transcription, link to the interview etc.), pauses (used to mark a pause either between or within utterances), vocal expressions (marking non-lexical expressions such as burp and semi-lexical expressions such as primary interjections), deletions (false starts, repetitions and truncated words) and overlaps. In this work, we describe the additional level of annotation relating to nonverbal elements used by three Italian politicians who belonged to three different political parties and who, at the time of the talk show, were all candidates for the presidency of the Council of Ministers. We also present the results of some analyses aimed at identifying existing relations between the proxemics phenomena and the linguistic structures in which they occur in order to capture recurring patterns and differences in the communication strategy.


E:Calm Resource: a Resource for Studying Texts Produced by French Pupils and Students

Lydia-Mai Ho-Dac, Serge Fleury and Claude Ponton

The E:Calm resource is constructed from French student texts produced in a variety of usual contexts of teaching. The distinction of the E:Calm resource is to provide an ecological data set that gives a broad overview of texts written at elementary school, high school and university. This paper describes the whole data processing: encoding of the main graphical aspects of the handwritten primary sources according to the TEI-P5 norm; spelling standardizing; POS tagging and syntactic parsing evaluation.


Introducing MULAI: A Multimodal Database of Laughter during Dyadic Interactions

Michel-Pierre Jansen, Khiet P. Truong, Dirk K.J. Heylen and Deniece S. Nazareth

Although laughter has gained considerable interest from a diversity of research areas, there still is a need for laughter specific databases. We present the Multimodal Laughter during Interaction (MULAI) database to study the expressive patterns of conversational and humour related laughter. The MULAI database contains 2 hours and 14 minutes of recorded and annotated dyadic human-human interactions and includes 601 laughs, 168 speech-laughs and 538 on- or offset respirations. This database is unique in several ways: 1) it focuses on different types of social laughter including conversational- and humour related laughter, 2) it contains annotations from participants, who understand the social context, on how humorous they perceived themselves and their interlocutor during each task, and 3) it contains data rarely captured by other laughter databases including participant personality profiles and physiological responses. We use the MULAI database to explore the link between acoustic laughter properties and annotated humour ratings over two settings. The results reveal that the duration, pitch and intensity of laughs from participants do not correlate with their own perception of how humorous they are; however, the acoustics of laughter do correlate with how humorous they are perceived to be by their conversational partner.


The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis

Nelleke Oostdijk, Hans van Halteren, Erkan Bașar and Martha Larson

We report on a case study of text and images that reveals the inadequacy of simplistic assumptions about their connection and interplay. The context of our work is a larger effort to create automatic systems that can extract event information from online news articles about flooding disasters. We carry out a manual analysis of 1000 articles containing a keyword related to flooding. The analysis reveals that the articles in our data set cluster into seven categories related to different topical aspects of flooding, and that the images accompanying the articles cluster into five categories related to the content they depict. The results demonstrate that flood-related news articles do not consistently report on a single, currently unfolding flooding event and we should also not assume that a flood-related image will directly relate to a flooding-event described in the corresponding article. In particular, spatiotemporal distance is important. We validate the manual analysis with an automatic classifier demonstrating the technical feasibility of multimedia analysis approaches that admit more realistic relationships between text and images. In sum, our case study confirms that closer attention to the connection between text and images has the potential to improve the collection of multimodal information from news articles.


LifeQA: A Real-life Dataset for Video Question Answering

Santiago Castro, Mahmoud Azab, Jonathan Stroud, Cristina Noujaim, Ruoyao Wang, Jia Deng and Rada Mihalcea

We introduce LifeQA, a benchmark dataset for video question answering that focuses on day-to-day real-life situations. Current video question answering datasets consist of movies and TV shows. However, it is well-known that these visual domains are not representative of our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA, and we apply several state-of-the-art video question answering models to provide benchmarks for future research. The full dataset is publicly available at
