The notebook below provides answers to some of the important questions that may arise when an institution or a company considers evaluating human language technology (HLT) systems.
Any evaluation has pragmatically chosen goals. In HLT evaluation, the goal can be summarized by the following questions:
- “Which one is better?” The goal of evaluation is to compare different systems for a given application.
- “How good is it?” The evaluation aims to determine the degree of desired qualities of a system.
- “Why is it bad?” The goal is to identify the weaknesses of a system to guide further development.
An evaluation can be user-oriented when the evaluation metric focuses on the users’ satisfaction. This kind of methodology is particularly suitable for the evaluation of ready-to-sell products. Because the HLT Evaluation portal focuses on the evaluation of research systems rather than commercial products ready for the market, we describe evaluation as a technology that can serve both to validate an HLT system as a product and to produce useful feedback for further improvement of the system.
There are many different types of evaluation, depending on the object being evaluated and the purpose of the evaluation. (See Survey of the State of the Art in Human Language Technology. Editorial Board: R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen and V. Zue. Managing editors: G. B. Varile, A. Zampolli, 1996.)
- Research evaluation aims to validate a new idea or to assess the amount of improvement it brings over older methods.
- Usability evaluation aims to measure the level of usability of a system. Typically it measures how well the system enables users to achieve a specified goal in an efficient manner.
- Diagnostic evaluation attempts to determine how worthwhile a funding evaluation program has been for a given technology.
- Performance evaluation aims to assess the performance and relevance of a technology for solving a well-defined problem.
When the processing performed by a system consists of several components associated with different stages, a further distinction should be made between intrinsic evaluation, designed to evaluate each component independently, and extrinsic evaluation, which assesses the overall performance of the system.
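The distinction can be illustrated with a minimal sketch. Everything in it is invented for illustration: the two components (`tokenize`, `tag`), the three test items and the choice of plain accuracy as the metric are assumptions, not part of any real HLT system or evaluation campaign.

```python
def accuracy(predicted, gold):
    """Fraction of items where the prediction equals the reference."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical component 1: a naive tokenizer that only strips full stops.
def tokenize(text):
    return text.lower().replace(".", "").split()


# Hypothetical component 2: a toy polarity tagger working on tokens.
def tag(tokens):
    return "positive" if "good" in tokens else "negative"


# The complete system chains the two components.
def pipeline(text):
    return tag(tokenize(text))


# Invented test items: raw text, reference tokens, reference label.
test_set = [
    ("Good, fast system.", ["good", "fast", "system"], "positive"),
    ("Bad output.", ["bad", "output"], "negative"),
    ("Really good.", ["really", "good"], "positive"),
]
texts = [text for text, _, _ in test_set]
gold_tokens = [tokens for _, tokens, _ in test_set]
gold_labels = [label for _, _, label in test_set]

# Intrinsic: each component is scored alone, fed with reference input.
tokenizer_score = accuracy([tokenize(t) for t in texts], gold_tokens)
tagger_score = accuracy([tag(tokens) for tokens in gold_tokens], gold_labels)

# Extrinsic: only the end-to-end output of the whole system is scored.
system_score = accuracy([pipeline(t) for t in texts], gold_labels)

print(f"tokenizer (intrinsic): {tokenizer_score:.2f}")
print(f"tagger (intrinsic):    {tagger_score:.2f}")
print(f"system (extrinsic):    {system_score:.2f}")
```

On this toy data the extrinsic score (0.67) shows that the system fails on one item, while the intrinsic scores (0.67 for the tokenizer, 1.00 for the tagger) locate the failure in the tokenizer, which leaves the comma attached to "good,".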
In general, an evaluation starts with the description of
- the object of the evaluation;
- classes of users;
- the measurable attributes of systems or evaluation criteria along with the metrics.
The EAGLES (Expert Advisory Group on Language Engineering Standards) evaluation working group identified the 7 major steps for a successful evaluation:
1. Why is the evaluation being done?
2. Elaborate a task model
3. Define top level quality characteristics
4. Produce detailed requirements for the system under evaluation, on the basis of steps 2 and 3
5. Devise the metrics to be applied to the system for the requirements produced under step 4
6. Design the execution of the evaluation
7. Execute the evaluation
There are in general two main testing techniques for system measurement: glass box and black box, which correspond approximately to intrinsic and extrinsic evaluation respectively. In the glass-box approach, the test data is built by taking into account the individual components of the tested system. In the black-box approach, however, the test data is chosen, for a given application, only according to the specified relations between input and output, without considering the internal components.
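The difference in how the test data is built can be sketched as follows. The system under test (a toy English pluralizer with an internal rule selector, `choose_rule`) and all test cases are hypothetical and serve only to contrast the two techniques.

```python
def choose_rule(noun):
    """Internal component: select the name of a suffix rule for the noun."""
    if noun.endswith(("s", "x", "ch", "sh")):
        return "add_es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return "y_to_ies"
    return "add_s"


def pluralize(noun):
    """System under test: apply the selected rule to build the plural."""
    rule = choose_rule(noun)
    if rule == "add_es":
        return noun + "es"
    if rule == "y_to_ies":
        return noun[:-1] + "ies"
    return noun + "s"


def pass_rate(function, cases):
    """Share of (input, expected output) pairs that the function satisfies."""
    return sum(function(inp) == expected for inp, expected in cases) / len(cases)


# Black box: cases come only from the input/output specification
# ("a singular noun goes in, its plural comes out").
black_box_cases = [
    ("cat", "cats"),
    ("box", "boxes"),
    ("city", "cities"),
    ("day", "days"),
    ("mouse", "mice"),
]

# Glass box: cases are built from knowledge of the internal component,
# with at least one case per branch of choose_rule.
glass_box_cases = [
    ("bus", "add_es"),
    ("church", "add_es"),
    ("city", "y_to_ies"),
    ("day", "add_s"),
]

black_box_score = pass_rate(pluralize, black_box_cases)
glass_box_score = pass_rate(choose_rule, glass_box_cases)

print(f"black box: {black_box_score:.2f}")
print(f"glass box: {glass_box_score:.2f}")
```

In this sketch the glass-box cases all pass because they only exercise the rules the component implements, whereas the black-box case for "mouse" fails: the specification covers irregular plurals that the internal design does not.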
The general mission of HLT evaluation is to enable improvement of the quality of language engineering products. It is essential for validating research hypotheses, for assessing progress and for choosing between research alternatives.
Reasons to evaluate:
- Validate research hypotheses
- Assess progress
- Choose between research alternatives
- Identify promising technology (market)
- Give feedback to funding agencies (European Commission)
- Benchmark systems
For evaluation design, the emphasis has traditionally been put on measuring how well systems meet specific functional requirements. Usability is generally ignored because there are no objective criteria for it.
ISO 9241, one of the ISO standards that apply to usability and ergonomics, provides the information that needs to be taken into account when specifying or evaluating usability in terms of measures of user performance and satisfaction. ISO 13407 specifies the user-centered design process needed to achieve usability and quality-in-use goals.
The Common Industry Format, developed within the NIST Industry USability Reporting (IUSR) Project for usability test reports, has been approved as an ISO standard. The document will be called: “ISO 25062 Software Engineering - Software Quality and Requirements Evaluation - Common Industry Format for Usability Test Reports” (ISO Catalogue).
Comparative evaluation is a paradigm in which a set of participants compare the results of their systems using the same data and control tasks with metrics that are agreed upon. Usually this evaluation is performed in a number of successive evaluation campaigns with open participation. For every campaign, the results are presented and compared in special workshops where the methods used by the participants are discussed and contrasted.
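A single campaign round can be sketched as follows, under stated assumptions: the participant names, their outputs, the reference labels and the use of accuracy as the agreed metric are all invented for illustration and do not come from any actual campaign.

```python
def accuracy(predicted, gold):
    """Agreed metric (assumed here): share of outputs matching the reference."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def rank_systems(submissions, gold, metric):
    """Score every submission on the same data and sort best first."""
    scores = {name: metric(output, gold) for name, output in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Shared test data: reference labels for five items (hypothetical).
gold = ["N", "V", "D", "N", "V"]

# Each participant runs its own system on the same inputs and submits
# the outputs (hypothetical).
submissions = {
    "system_a": ["N", "V", "D", "N", "N"],
    "system_b": ["N", "N", "D", "V", "N"],
    "system_c": ["N", "V", "D", "N", "V"],
}

ranking = rank_systems(submissions, gold, accuracy)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because every submission is scored by the same function on the same reference data, the resulting scores are directly comparable, which is what makes the discussion at the campaign workshop meaningful.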
The experience with comparative evaluation in the USA and in Europe has shown that the approach leads to a significant improvement in the performance of the evaluated technologies. A frequent by-product is the production of high-quality resources, because the evaluation requires the development of annotated data and test sets: the participants need data for training and testing their systems. The availability of language resources during campaigns also enables all researchers in a particular field to evaluate, benchmark and compare the performance of their systems.