A brief introduction to summarisation
One of the biggest challenges in automatic text summarisation is the evaluation of the system output. Evaluation is an important and necessary task: it allows us to assess the performance of a summarisation system, compare the results produced by different systems, and understand whether computer-generated summaries are as good as human-generated ones.
The challenges of automatic evaluation of summaries
This is a very difficult task for a number of reasons:
1. There is no clear notion of what constitutes a good summary. Although there are attempts to define criteria to guide the evaluation, these tend to be subjective and prone to criticism.
2. Human variability. The most common way to evaluate automatic summaries is to compare them with human-made model summaries (also known as reference summaries or gold standard summaries). But when creating extracts, different people (the annotators) rarely agree on which sentences are important and thus tend to choose different sentences for their summaries. As a consequence, inter-annotator agreement tends to be low (a small sketch of one common agreement measure follows this list). People are also not very consistent over time: the same person may choose a different set of sentences at different times, which results in low intra-annotator agreement.
3. Semantic equivalence. It is not uncommon to find two or more sentences in a document that express the same meaning using different words, a phenomenon known as paraphrasing. This makes evaluation even more difficult, because it implies that there may be more than one “good summary” for a given text document.
4. Continued reliance on humans. Humans are very good at identifying important content and producing well-formed summaries, so the evaluation process still tends to rely on them to create gold standard summaries or to act as judges of the system’s output. This greatly increases the cost of an evaluation, making it a time-consuming and expensive task.
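To give a feel for how agreement between annotators can be quantified (point 2 above), here is a minimal sketch of Cohen's kappa computed over two annotators' binary sentence selections. The annotations and the function are invented purely for illustration; real evaluations may prefer other coefficients, such as Fleiss' kappa, when more than two annotators are involved.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary sentence selections."""
    n = len(labels_a)
    # Observed agreement: fraction of sentences both annotators label the same way.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's overall selection rate.
    rate_a = sum(labels_a) / n
    rate_b = sum(labels_b) / n
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Toy annotations: 1 = sentence chosen for the extract, 0 = left out.
annotator_a = [1, 0, 1, 0, 0, 1, 0, 0]
annotator_b = [1, 0, 0, 1, 0, 1, 0, 0]
print(f"kappa = {cohen_kappa(annotator_a, annotator_b):.2f}")  # kappa = 0.47
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance; extractive summarisation studies often report values well below 1, which is precisely the human-variability problem described above.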
Despite these challenges, extensive evaluations of automatic summarisation systems have been carried out since as far back as the 1960s, greatly helping to refine and compare existing systems. Let's now touch on the evaluation methods that have been used.
Methods for evaluating text summarisation systems
Methods for automatic summarisation evaluation can be broadly classified along two dimensions:
1. how humans interact with the evaluation process (online vs. offline evaluation)
2. what is measured (intrinsic vs. extrinsic evaluation).
Online evaluation requires the direct involvement of humans in the assessment of the system’s results according to set guidelines.
In contrast, offline evaluation does not require direct human intervention as it usually compares the system’s output with a previously defined set of gold standard summaries, using measures such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation). This makes offline evaluation more attractive than online evaluation, because it is not directly influenced by human subjectivity, it is repeatable and faster, and it allows evaluators to quickly notice whether a change in the system improves or degrades its performance.
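To make this concrete, the sketch below computes a simplified ROUGE-N recall: the proportion of n-grams in the reference summary that also appear in the candidate summary. The example sentences and the whitespace tokenisation are simplifications for illustration; the full ROUGE toolkit also handles details such as stemming, multiple references, and precision and F-measure variants.

```python
from collections import Counter

def ngrams(tokens, n):
    """Contiguous n-grams of a token list, with their counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: overlapping n-grams / total n-grams in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped counts of matching n-grams
    return overlap / max(sum(ref.values()), 1)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(f"ROUGE-1 recall: {rouge_n_recall(candidate, reference, 1):.2f}")  # 0.83
print(f"ROUGE-2 recall: {rouge_n_recall(candidate, reference, 2):.2f}")  # 0.60
```

Because the score is computed purely from n-gram overlap with the gold standard, it can be recomputed automatically after every change to the system, which is exactly what makes offline evaluation fast and repeatable.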
Intrinsic evaluation methods, as the name suggests, test the summarisation system in and of itself, either by asking humans to judge the linguistic quality, coherence and informativeness of the automatic summaries or by comparing them with human-authored model summaries.
Alternatively, extrinsic evaluation methods test the summarisation system based on how useful the machine-generated summaries are for completing a certain task, such as relevance assessment or reading comprehension. This kind of task-oriented evaluation can be of extraordinary practical value to providers and users of summarisation technology, due to its focus on the application (e.g. producing an effective report or presentation based on summaries, finding relevant documents on a given topic in a large collection, or correctly answering questions about the source document using the summaries alone).
If a human can perform such tasks in less time and without loss of accuracy, then the system is considered to perform well. However, extrinsic evaluation requires careful planning and is resource-intensive, so it is not well suited to monitoring the performance of a summarisation system during development. In such contexts, intrinsic evaluation is usually preferred.