Issue #6 | July 2023
- Language Resources
- Legal Issues
- ELRA/ELDA Projects
- Evaluation Campaigns
LRs in the ELRA Catalogue this month
Since April 2023, 1 new written corpus, 66 new monolingual lexicons and 1 new speech corpus have been made available in our catalogue. Moreover, 4 speech resources are now available at reduced fees.
1. New Language Resources
Archives of "El Mundo" Newspaper – Years 2020-2022
This corpus consists of 45,658 articles in Spanish from the electronic archives of the "El Mundo" newspaper between 2020 and 2022. A few articles also come from other related media: El Mundo Alicante, El Mundo Andalucía, El Mundo Baleares, El Mundo Catalunya, El Mundo Valéncia and Expansión. The number of articles available per year is as follows:
- 2020: 15,073 articles
- 2021: 14,461 articles
- 2022: 16,124 articles
Total: 45,658 articles
All articles are provided in text format, including HTML tags. This data is released thanks to Unidad Editorial Información General, S.L.U., Spain.
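Since the articles are delivered as text that still contains HTML tags, users who need plain text will have to strip the markup first. The following is a minimal sketch using only the Python standard library; the helper names `TagStripper` and `strip_html` and the sample string are hypothetical, and the actual file layout of the corpus is not assumed here.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.parts.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join("".join(self.parts).split())


def strip_html(raw):
    """Return the text content of an HTML string, without its tags."""
    parser = TagStripper()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


# Illustrative input only; not taken from the corpus.
sample = "<p>Hola <b>mundo</b></p>"
print(strip_html(sample))
```

A parser-based approach is used here rather than a regular expression because it copes better with attributes and nested tags; a dedicated HTML library may be preferable for large-scale processing.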
This corpus may also be obtained by individual year, as follows:
- Archives of "El Mundo" Newspaper – Year 2020
- Archives of "El Mundo" Newspaper – Year 2021
- Archives of "El Mundo" Newspaper – Year 2022
Bitext Lexical Datasets
The series of Bitext Lexical Datasets for the generic vocabulary includes Lemmas, POS tagging, Frequency, Named Entities and Offensive features. Depending on the dataset and language, other syntactic and morphological features are also provided. The 15 languages listed below are available.
As a complement to these datasets, 11 Language Variants datasets can also be obtained:
- Arabic (MSA) dataset and Arabic Language Variants dataset consisting of Arabic Gulf, Arabic Najdi, Arabic Egypt and Arabic MSA variants,
- Chinese (Simplified) dataset, Chinese (Traditional) dataset, and Chinese Language Variants dataset (Simplified + Traditional),
- Dutch dataset and Dutch Language Variants dataset consisting of Netherlands and Belgium variants,
- English dataset and English Language Variants dataset consisting of United States, United Kingdom and India variants,
- Finnish dataset and Finnish Language Variants dataset consisting of Standard and Colloquial Finnish variants,
- French dataset and French Language Variants dataset consisting of France, Canada and Switzerland variants,
- German dataset and German Language Variants dataset consisting of Germany and Switzerland variants,
- Indonesian dataset,
- Italian dataset and Italian Language Variants dataset consisting of Italy and Switzerland variants,
- Malay dataset,
- Norwegian (Bokmal) dataset and Norwegian Language Variants dataset consisting of Bokmal and Nynorsk variants,
- Portuguese dataset and Portuguese Language Variants dataset consisting of Portugal and Brazil variants,
- Spanish dataset and Spanish Language Variants dataset consisting of Spain, North America, Central America, Andes and Southern Cone variants,
- Ukrainian dataset.
Bitext Synthetic Data
The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals in English and Spanish. They cover the most common intents for each vertical and include a large number of example utterances for each intent, with optional entity/slot annotations for each utterance. The data are distributed as models or open text files.
For each language, the following verticals are available:
- Automotive: 52 intents (English, Spanish)
- Retail banking: 26 intents (English, Spanish)
- Education: 37 intents (English, Spanish)
- Event and ticketing: 25 intents (English, Spanish)
- Field Service: 27 intents (English, Spanish)
- Healthcare: 40 intents (English, Spanish)
- Hospitality: 24 intents (English, Spanish)
- Insurance: 38 intents (English, Spanish)
- Legal: 29 intents (English, Spanish)
- Manufacturing: 34 intents (English, Spanish)
- Media Streaming: 24 intents (English, Spanish)
- Mortgage and loans: 39 intents (English, Spanish)
- Moving and storage: 29 intents (English, Spanish)
- Real estate and construction: 28 intents (English, Spanish)
- Restaurant/bar chains: 30 intents (English, Spanish)
- Retail Ecomm: 34 intents (English, Spanish)
- Telecommunication: 26 intents (English, Spanish)
- Travel: 33 intents (English, Spanish)
- Utilities: 21 intents (English, Spanish)
- Wealth management: 24 intents (English, Spanish)
Persian Kids’ Speech Corpus
The Persian Kids’ Speech Corpus consists of speech signals recorded from 286 children (141 girls, 145 boys), aged 6 to 9, through an Andreas Mic Anti-Noise microphone and a Premium Speechmike headphone. The recorded data was manually checked and labeled, resulting in a corpus of 162,395 samples with a total duration of 33 hours and 44 minutes. The samples are distributed as follows:
- 29,057 words (478 minutes)
- 17,429 sub-words (260 minutes)
- 43,838 syllables (485 minutes)
- 70,078 phonemes (765 minutes)
- 1,993 extra vocabulary (36 minutes)
The resulting speech corpus covers all 29 Persian phonemes, 118 syllables, 56 sub-words, and 711 words, and is particularly applicable to speech recognition and linguistics studies.
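As a quick consistency check, the per-category figures listed above can be summed and compared with the stated totals. The short sketch below does this arithmetic; the dictionary keys are just labels chosen for illustration.

```python
# (samples, minutes) per unit type, copied from the breakdown above.
breakdown = {
    "words": (29_057, 478),
    "sub-words": (17_429, 260),
    "syllables": (43_838, 485),
    "phonemes": (70_078, 765),
    "extra vocabulary": (1_993, 36),
}

total_samples = sum(samples for samples, _ in breakdown.values())
total_minutes = sum(minutes for _, minutes in breakdown.values())
hours, minutes = divmod(total_minutes, 60)

print(total_samples)                # 162395
print(f"{hours} h {minutes} min")   # 33 h 44 min
```

Both results match the totals given in the corpus description (162,395 samples, 33 hours and 44 minutes).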
2. Reduced fees for the following speech resources
- Chinese Mandarin (South) database
- Chinese Mandarin (North) database
- Japanese Kids Speech database (Lower Grade)
- Japanese Kids Speech database (Upper Grade)
The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. This aims to ensure that LRs are correctly identified and, consequently, properly referenced when used in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.
- 16 new ISLRN numbers assigned between April and June 2023.
- A total of 3,353 ISLRN numbers assigned since January 2014.
- A total of 270 distinct languages.
The latest LRs for which an ISLRN number was requested and accepted are as follows:
- ALLIES Corpus - ISLRN: 397-116-696-859-2
- LORELEI Indonesian Representative Language Pack - ISLRN: 426-032-969-008-0
- LORELEI Zulu Representative Language Pack - ISLRN: 957-076-577-258-4
- Moroccan Arabic - English Lexical Database - ISLRN: 107-292-828-045-8
- Penn Korean Universal Dependency Treebank - ISLRN: 522-574-570-040-8
- DEFT English Light and Rich ERE Annotation - ISLRN: 712-226-273-489-1
More about ISLRN.
ELRA Legal Issues Publications
The Committee version of the AI Act has been approved by the European Parliament
On May 11, 2023, the European Parliament’s Internal Market and Civil Liberties Committees voted to approve a new version of the upcoming AI Act, which will now enter its trilogue phase with the European Commission and the Council of the European Union.
In this amended version, Members of the European Parliament included new transparency requirements for foundation models such as GPT to enhance guarantees on fundamental rights.
The press release is available on the website of the European Parliament and the voted text is available here.
US Secretary of Commerce claims United States' compliance with the EU-US Data Privacy Framework
On July 3, 2023, the Secretary of Commerce issued a statement claiming that the United States had fulfilled its commitments to implement the EU-US Data Privacy Framework, which was negotiated following the landmark Schrems II ruling of the Court of Justice of the European Union (CJEU).
This comes in addition to the Attorney General's designation of the European Union and the countries of the European Economic Area as ‘qualifying states’ for the implementation of the redress mechanism established in the EU-US Data Privacy Framework.
The full statement is available here.
The European Commission adopts the EU-US Data Privacy Framework
On July 10, 2023, the European Commission announced that it had adopted the new EU-US Data Privacy Framework. Following the ruling of the Court of Justice of the European Union, which invalidated the former Privacy Shield, this framework aims to regulate data transfers from the European Union to the United States.
By this decision, the European Commission approves the changes made by the US administration to ensure an adequate level of protection for the data of European citizens.
The press release is available here and the full adequacy decision is available here.
Meta fails to dismiss GDPR infringement claims raised by competition authorities
On July 4, 2023, the Court of Justice of the European Union (CJEU) found that a competition authority, here the German competition authority, could take claims of GDPR infringement into consideration while examining a claim of abuse of a dominant position.
In this case, the Court found that the competition authority lawfully examined this infringement only insofar as it served to establish an abuse of a dominant position, and that, in doing so, competition authorities must cooperate with national supervisory authorities.
The full text of the decision is available here and the press release following the decision is available here.
Stanford Researchers’ Review of Foundation Models' Compliance with the Draft AI Act
Stanford researchers published a review of the compliance of different foundation model providers, such as OpenAI (ChatGPT), with the proposed AI Act.
Their study presents a graded assessment of the conduct of AI model providers with respect to the requirements of the current draft.
A summary of their research and graded assessment is available here.
DG Connect University on LLMs
On June 6, 2023, the DG CONNECT of the European Commission and the Language Data Space initiative jointly organised a University course and workshop on the subject of Large Language Models (LLMs). The workshop was entitled Large Language Models: Overview, Limitations and Opportunities.
This event featured several talks on various topics linked to LLMs: technical and legal aspects, ethical aspects and biases, and challenges. The capacities, limitations and evaluation of LLMs were also addressed.
The list of presentations and the slides are available here, and a recording of the event is also available on YouTube.
AI, Regulation and Decision Making Workshop
On June 27, 2023, PRAIRIE (Paris Artificial Intelligence Research Institute) held a workshop on Artificial Intelligence, Regulation and Decision Making.
This workshop featured 3 different presentations on the topic.
The first one was given by Dr. Anita Burgun and focused on the use of artificial intelligence applications in the medical field and their potentially harmful consequences. She gave examples such as a chatbot used for nutrition advice that was unable to detect eating disorders, or the use of MRI images to detect suicidal ideation.
To address such developments, she proposed a framework in which doctors and other medical professionals would stay in the loop to help in the development of the applications. She also presented the notion of “Augmented Intelligence”, where an application based on artificial intelligence would only assist doctors in making their decisions.
The second talk was given by Franziska Poszler on the impact of decision support systems on human ethical decision making. In her talk, she presented a literature review of the different decision support systems and a taxonomy of those systems. She also presented the challenges raised by the way human ethical decision making can be influenced by relying too much on decision support systems.
The final talk, by Thierry Poibeau, provided some insights into the notion of bias and representation in artificial intelligence. He made some proposals to address this issue, such as providing data statements to document the gaps in data and clearly identify the populations not represented in the data. Other proposals included providing statements to define the target use of the applications and debiasing the data.
JSALT 2023 - Workshop on data collection and annotation
On June 29, 2023, ELDA participated in the workshop entitled “Special Day on Data Collection and Annotation”, organised by the University of Le Mans as part of the JSALT Summer School. During this event, two presentations were given by ELDA.
The first one, given by Dr Victoria Arranz, described the activities led by ELDA in the production and annotation of datasets as well as its participation in European data sharing infrastructures.
The second one, given by Mickaël Rigault, focused on the legal challenges of data production and distribution and touched upon intellectual property, the protection of personal data and a review of the current legal cases surrounding the use of large language models.
Information on the on-going projects
Common European Language Data Space (LDS)
The Common European Language Data Space (LDS) project was launched on January 19, 2023. This 3-year project aims at establishing a European platform and marketplace for the collection, creation, sharing and re-use of multilingual and multimodal language data.
The service contract has been established between the European Commission and a consortium of four partners, among which the tasks are distributed as follows:
German Research Center for Artificial Intelligence (DFKI), coordinator
- coordination and support of the project,
- establishment of 2 governance bodies, namely CELT and CELT+ (Centre of Excellence for Language Technologies),
- promotional activities through conference attendance and information channels,
- Language Data Space website.
Evaluations and Language Resources Distribution Agency (ELDA)
- development of a multi-stakeholder data and services governance scheme,
- organisation and management of events,
- data protection compliance.
Athena Research and Innovation Center in Information, Communication and Knowledge Technologies (ILSP)
- development and implementation of a sustainable language data ecosystem blueprint,
- Language Data Space deployment,
- implementation of multi-stakeholder data and services governance scheme,
- proof-of-deployment-concept projects.
In this regard, ELDA has continued the work planned within the different tasks:
Development of a Multi-Stakeholder Data and Services Governance Scheme
The main objective of this task is to design a governance scheme for the development of a European Multi-Stakeholder Data and Services infrastructure. A collaboration has been established with other Data Spaces as well as with the Data Spaces Support Centre (DSSC) to ensure a large inter-Data Spaces synergy on the identification of governance schemes used by others. ELDA is participating actively in the DSSC's expert group meetings and focusing on the reviewing of the different governance schemes.
Event Organisation and Management
A large number of dedicated workshops and conferences are planned to take place throughout the LDS initiative. Two workshops have already been held:
- First DIGITAL Stakeholders’ Workshop: the aim of these workshops is to facilitate collaboration and exchange between the DIGITAL Programme stakeholders and the LDS, focusing on this Programme’s objectives that consist of bringing digital technology to businesses, citizens and public administrations. The first DIGITAL Stakeholders’ Workshop was held on May 16, 2023 and, as the first of the series, it was dedicated to the LDS and the DSSC.
- First Technology Workshop: this series of workshops aims to provide a thorough overview and practical demonstrations of the latest technology solutions which are available on the market or from the open-source community and that can be applied by means of the LDS platform and marketplace. The first workshop of this kind was organised jointly with the DG CONNECT as part of their University courses (cf. the event review presented in the previous section). The workshop was held on June 6, 2023.
Data Protection Compliance
This task aims to ensure Data Protection Compliance, with a particular focus on EUDPR and/or GDPR. ELDA has defined a Data Protection Concept (DPC), which is a document describing all personal data processing activities inherent to the Language Data Space (LDS) or enabled by the LDS. It will serve as reference documentation for the project’s compliance with GDPR and EUDPR rules. Version 1.0 of the DPC was submitted to the European Commission on March 31, 2023, and further work has followed towards an updated version of the document. This will continue to evolve alongside the development and deployment of the LDS infrastructure.
Language Technology Solutions - CNECT/LUX/2022/OP/0030
This call for tenders from the European Commission was published within the Digital Europe programme (DIGITAL). It aims to achieve three specific goals:
- facilitate uptake by SMEs, NGOs, public administration, and academia of European machine translation services for websites;
- support the creation of open-source European language speech recognition solutions;
- carry out market studies on language technologies and widely disseminate their results to foster the take-up of language technologies in Europe.
ELRA, through its operational body ELDA, is involved in two of the funded projects which are described below.
LOT 1 - Solutions Supporting the Use of Automated Translations on Websites
The European Multilingual Web (EMW) project is entering its second semester of activities (it started on December 12, 2022). Its main goal is to provide a set of ready-to-use open-source solutions for the automated translation of websites.
In this framework, ELDA will be in charge of managing the Helpdesk team that will help users report issues and ask questions about the solutions. This activity is foreseen to start in October 2023. In June, ELDA carried out an analysis of the helpdesk technical solution that will need to be implemented to allow the smooth management of the task. This implies the use of a dedicated email address to be managed through a ticketing system. To enable the selection of the appropriate ticketing system, a detailed analysis of pre-selected tools was submitted to the EC, taking into consideration GDPR issues in particular, as well as a number of features that will be necessary for an optimal exploitation of the tool.
LOT 2 – Language Technologies Solutions
The project was officially launched on December 13, 2022.
The consortium operating in this project is coordinated by Brno University of Technology (BUT) with the participation of TILDE and ELDA. Three main tasks are being performed with the participation of all members of the consortium, which are:
Task 1: A comprehensive market study of Automatic Speech Recognition (ASR) solutions. This includes an overview of the main actors and techniques of the domain as well as the availability of speech and related transcription data for ASR. This task is mainly carried out by ELDA. In March 2023, a first draft of the study was submitted to the European Commission (EC). Since the validation of this study, ELDA and TILDE have been conducting interviews with ASR market actors and experts. These interviews represent a new phase of the market study that will result in an updated document to be submitted to the European Commission in December 2023.
Task 2: Creation of an open-source speech recognition prototype solution for three under-represented European languages (Czech, Estonian, and Greek). This task is mainly carried out by BUT and TILDE, with ELDA holding an advisory position. As for Task 1, the European Commission has accepted a document describing the solution’s architecture and key features. An alpha version is under development, and test phases are planned using data from both open sources and Task 3.
Task 3: Collection and partial transcription (one third) of speech data for the three above-mentioned European under-resourced languages. A total of 4,500 hours will be packaged per language under the responsibility and coordination of ELDA. The data will be used to train the solution developed in Task 2 as well as to constitute three corpora that will be delivered to the European Commission. In March 2023, a first legal analysis was sent to the EC, which has since provided the expected rights assessments to include in the final delivery. Both data collection and negotiation with data providers have started. Following the first data collection, the use of a partially automatic transcription step has been studied, with revision being carried out by native speakers of each language. The objective of this approach is to optimise the full transcription of the first 1,500 hours and to speed it up in conjunction with the testing phase of the solution developed in Task 2.
Evaluation Campaigns
- Arabic Reverse Dictionary Shared Task 2023 - https://anlp.ai/sharedTask/
- ArAIEval - Persuasion Techniques and Disinformation Detection in Arabic Text - https://araieval.gitlab.io/
- CoCo4MT Workshop Shared Task - https://sites.google.com/view/coco4mt/shared-task
- HaSpeeDe 3 (Hate Speech Detection) shared task within Evalita 2023 - http://www.di.unito.it/~tutreeb/haspeede-evalita23/
- Multilingual Terminology Extraction Shared Task at BUCC 2023 - https://comparable.limsi.fr/bucc2023/bucc2023-task.html
- NADI (Nuanced Arabic Dialect Identification) Shared Task 2023 - https://nadi.dlnlp.ai/
- Qur'an QA 2023 - https://sites.google.com/view/quran-qa-2023
- WAT2023 English-Hindi Multi-Modal Translation Task - https://ufal.mff.cuni.cz/hindi...
- WAT2023 English-Malayalam Multi-Modal Translation Task - https://ufal.mff.cuni.cz/malay...
- WAT2023 English-Bengali Multi-Modal Translation Task - https://ufal.mff.cuni.cz/benga...
- WojoodNER Shared Task 2023 - https://dlnlp.ai/st/wojood/
News from ELRA
First Call for Papers
Lingotto Conference Centre in Turin (Italy)
20-25 May, 2024
Two international key players in the area of computational linguistics, the ELRA Language Resources Association (ELRA) and the International Committee on Computational Linguistics (ICCL), are joining forces to organize the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) to be held in Torino, Italy on 20-25 May, 2024.
All deadlines are 11:59PM UTC-12:00 (“anywhere on Earth”).
- 22 September 2023: Paper anonymity period starts
- 13 October 2023: Final submissions due (long, short and position papers)
- 13 October 2023: Workshop/Tutorial proposal submissions due
- 22–29 January 2024: Author rebuttal period
- 5 February 2024: Final reviewing
- 19 February 2024: Notification of acceptance
- 25 March 2024: Camera-ready due
- 20-25 May 2024: LREC-COLING 2024 conference
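Because the deadlines are expressed in UTC-12:00 ("anywhere on Earth"), authors may find it useful to convert them to UTC or to their own time zone. The sketch below shows one way to do this with the Python standard library; the function name `aoe_deadline_in_utc` is hypothetical and the example date is the 13 October 2023 submission deadline listed above.

```python
from datetime import datetime, timedelta, timezone

# "Anywhere on Earth" is a fixed offset of UTC-12:00.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline_in_utc(year, month, day):
    """Return the 11:59PM AoE deadline of the given date, expressed in UTC."""
    local = datetime(year, month, day, 23, 59, tzinfo=AOE)
    return local.astimezone(timezone.utc)


deadline = aoe_deadline_in_utc(2023, 10, 13)
print(deadline.isoformat())  # 2023-10-14T11:59:00+00:00
```

In other words, a deadline of 13 October at 11:59PM AoE falls on 14 October at 11:59 UTC. Calling `deadline.astimezone()` with no argument converts the result to the local time zone of the machine running the code.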
Language Resources and Evaluation Journal
Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications.
Volume 57 - Issue 2, June 2023
Regular issue including a number of papers in Open Access (pages 493-944)
News from the community
CNRS Data Management Guide of good practices for research data
The CNRS and the Data Lab are making a guide of good practices for research data available on their website.
This guide aims to facilitate the management and production of research data and elaborates on the challenges of data management in open science contexts throughout the data lifecycle, such as:
- Understanding and conceiving the data collection process
- Design and planning of research projects
- Technical issues surrounding data collection
- Technical issues of data processing
- Preservation and archiving of data
- Publication and dissemination of datasets.
The guide provides concrete examples and solutions on how to tackle the different challenges of data management in research and provides valuable resources and knowledge for all research projects.
The full book (in French only) is available here.
In memoriam, Thierry Declerck (1959-2023)
The Board of ELRA and the ELDA team express their deep sadness at the sudden passing of Thierry Declerck on June 27, 2023, while he was attending the eLex 2023 Conference in Brno. Thierry was a consultant at DFKI and a lecturer at the Language Science and Technology Department of Saarland University. He was involved in numerous European projects and initiatives. He also took an active part in several scientific communities and associations, including CLARIN and EURALEX, where many praise his commitment, dedication and generosity.
Thierry joined the ELRA Board and the LREC Programme Committee in 2012, and held the position of Vice-President from 2016 to 2018.
For all of us, he was a great colleague, bright and witty; for some, he was a long-standing friend who will be dearly missed. Our thoughts and condolences go to his family and his friends.