The notebook below provides answers to some of the important questions that may arise when an institution or a company considers evaluating HLT systems.
Any evaluation has pragmatically chosen goals. In HLT evaluation, the goal can be summarized by the following questions:
In general, an evaluation starts with the description of
The EAGLES (Expert Advisory Group on Language Engineering Standards) evaluation working group identified the 7 major steps for a successful evaluation:
An evaluation can be user-oriented when an evaluation metric focuses on the users’ satisfaction. However, this kind of methodology is particularly suitable for the evaluation of ready-to-sell products. Because the HLT Evaluation portal focuses on the evaluation of research systems rather than commercial products ready for the market, we describe evaluation as a technology that can serve two needs: validating the HLT system as a product and producing useful feedback for further improvement of the system. There are many different types of evaluation, depending on the object being evaluated and the purpose of the evaluation. (See Survey of the State of the Art in Human Language Technology. Editorial Board: R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen and V. Zue. Managing editors: G. B. Varile, A. Zampolli, 1996.)
When a system's processing consists of several components associated with different stages, an additional distinction should be made between intrinsic evaluation, designed to evaluate each component independently, and extrinsic evaluation, designed to assess the overall performance of the system.
Usability evaluation
In evaluation design, the emphasis has traditionally been put on measuring how well a system's performance meets specific functional requirements. Usability is generally ignored because there are no objective criteria for usability.
ISO 9241, one of the ISO standards that apply to usability and ergonomics, provides the information that needs to be taken into account when specifying or evaluating usability in terms of measures of user performance and satisfaction. ISO 13407 specifies the user-centered design process needed to achieve usability and quality-in-use goals.
The Common Industry Format for usability test reports, developed within the NIST Industry USability Reporting (IUSR) Project, has been approved as an ISO standard. The document will be called: “ISO 25062 Software Engineering - Software Quality and Requirements Evaluation - Common Industry Format for Usability Test Reports” (ISO Catalogue).
Comparative evaluation
Comparative evaluation is a paradigm in which a set of participants compare the results of their systems using the same data and control tasks with metrics that are agreed upon. Usually this evaluation is performed in a number of successive evaluation campaigns with open participation. For every campaign, the results are presented and compared in special workshops where the methods used by the participants are discussed and contrasted.
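The core of this paradigm, with every participant scored on the same test data using one agreed metric, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the system names, the toy part-of-speech labels, the choice of exact-match accuracy as the agreed metric, and the helper names `accuracy` and `run_campaign`. Real campaigns define their own tasks, data and metrics.

```python
from typing import Callable, Dict, List, Tuple


def accuracy(hypotheses: List[str], references: List[str]) -> float:
    """Agreed metric (assumed here): fraction of outputs matching the reference."""
    if not references:
        raise ValueError("the test set is empty")
    if len(hypotheses) != len(references):
        raise ValueError("a submission must cover every test item")
    correct = sum(h == r for h, r in zip(hypotheses, references))
    return correct / len(references)


def run_campaign(
    submissions: Dict[str, List[str]],
    references: List[str],
    metric: Callable[[List[str], List[str]], float] = accuracy,
) -> List[Tuple[str, float]]:
    """Score every submission on the same references and rank by score."""
    scores = {name: metric(hyp, references) for name, hyp in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Shared test set (hypothetical gold labels) and three hypothetical submissions.
references = ["NOUN", "VERB", "NOUN", "ADJ"]
submissions = {
    "system_a": ["NOUN", "VERB", "VERB", "ADJ"],
    "system_b": ["NOUN", "VERB", "NOUN", "ADJ"],
    "system_c": ["VERB", "VERB", "VERB", "ADJ"],
}

ranking = run_campaign(submissions, references)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because the data and the metric are fixed in advance, the scores are directly comparable across participants, which is what makes the workshop discussion of methods meaningful.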
The experience with comparative evaluation in the USA and in Europe has shown that the approach leads to a significant improvement of the performance of the evaluated technologies. A consequence is often the production of high-quality resources: the evaluation requires the development of annotated data and test sets, since the participants need data for training and testing their systems. In addition, the availability of language resources during campaigns enables all researchers in a particular field to evaluate, benchmark and compare the performance of their systems.
The general mission of HLT evaluation is to enable improvement of the quality of language engineering products. It is essential for validating research hypotheses, for assessing progress and for choosing between research alternatives.
Reasons to evaluate:
There are in general two main testing techniques for system measurement: glass box and black box, which correspond approximately to intrinsic and extrinsic evaluation. In the former approach, the test data is built by taking into account the individual components of the tested system. In the latter approach, however, the test data is chosen, for a given application, only according to the specified relations between input and output, without considering the internal components.
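The difference between the two techniques can be sketched on a toy two-stage system. The pipeline below (a tokenizer followed by a lexicon-based tagger), its lexicon and its test cases are all hypothetical and serve only to show where each kind of test data comes from: the glass-box cases are written with knowledge of one internal component, while the black-box cases are written from the input/output specification alone.

```python
from typing import List, Tuple

# Hypothetical toy lexicon used by the tagging component.
LEXICON = {"dogs": "NOUN", "bark": "VERB"}


def tokenize(text: str) -> List[str]:
    """Component 1: lowercase the text and split it on whitespace."""
    return text.lower().split()


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Component 2: look each token up in the lexicon, defaulting to UNK."""
    return [(token, LEXICON.get(token, "UNK")) for token in tokens]


def pipeline(text: str) -> List[Tuple[str, str]]:
    """Whole system: raw text in, tagged tokens out."""
    return tag(tokenize(text))


# Glass box: test data targets one internal component (the tokenizer),
# here probing its case folding and whitespace handling.
glass_box_cases = [
    ("Dogs  BARK", ["dogs", "bark"]),
    ("", []),
]
glass_box_ok = all(tokenize(inp) == expected for inp, expected in glass_box_cases)

# Black box: test data is chosen only from the specified input/output
# relation of the whole system, ignoring how it is built internally.
black_box_cases = [
    ("Dogs bark", [("dogs", "NOUN"), ("bark", "VERB")]),
    ("Cats bark", [("cats", "UNK"), ("bark", "VERB")]),
]
black_box_ok = all(pipeline(inp) == expected for inp, expected in black_box_cases)

print("glass box passed:", glass_box_ok)
print("black box passed:", black_box_ok)
```

If the tokenizer were replaced by a different implementation, the glass-box cases might need to be rewritten, whereas the black-box cases would stay valid as long as the specified input/output behaviour is unchanged.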