How faithful is text summarisation?

What's this about?

GenAI and Large Language Models (LLMs) can be used to summarise reports, meeting notes, research papers and many other texts, including book-length works, provided the text is split into chunks. This is necessary because most LLMs at this point cannot ingest text over a certain number of words.
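As a rough illustration of this chunking approach (a minimal sketch, not tied to any particular product), the code below assumes a hypothetical call_llm() function that sends a prompt to whichever LLM is being used and returns its text response:

```python
# Minimal sketch of chunk-then-combine summarisation.
# `call_llm` is a hypothetical stand-in for a real LLM API call; the chunk
# size is illustrative and depends on the model's context window.

def chunk_text(text: str, max_words: int = 2000) -> list[str]:
    """Split a long text into chunks small enough for the model to ingest."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarise_long_text(text: str, call_llm) -> str:
    """Summarise each chunk, then merge the partial summaries into one."""
    partial_summaries = [
        call_llm(f"Summarise the following passage:\n\n{chunk}")
        for chunk in chunk_text(text)
    ]
    return call_llm(
        "Combine these partial summaries into a single coherent summary:\n\n"
        + "\n\n".join(partial_summaries)
    )
```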

QuillBot's summarizer is trusted by millions worldwide to condense long articles, papers, or documents into key summary paragraphs using state-of-the-art AI.

QuillBot website

Measuring the faithfulness of GenAI text summarisation.

Researchers at the Allen Institute for AI and Princeton University set out to evaluate how faithfully various LLMs summarised books. They chose books published after the current LLMs had been trained, to avoid data contamination. Human readers assessed how well each LLM had summarised the books by evaluating claims extracted from each summary using ChatGPT, itself an LLM-based GenAI system. The researchers called the process Faithfulness Annotations in Book-Length Summarisation (FABLES).


Pipeline for collecting faithfulness annotations in book-length summarization (FABLES).

SOURCE: Evaluating faithfulness and content selection in book-length summarization, Yekyung Kim et al., arXiv:2404.01261v1 [cs.CL], 1 Apr 2024.

Extracting claims for evaluation.

The long-form summaries of the fiction books were decomposed into “atomic claims” for human readers to evaluate. The claims were produced automatically by prompting ChatGPT-4 to generate claims that had to be fully understandable on their own and, as far as possible, “situated within its relevant temporal, locational, and causal context”.
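A simplified sketch of this decomposition step is shown below; the prompt wording and the call_llm() helper are illustrative assumptions rather than the exact prompt used in the paper:

```python
# Sketch of decomposing a summary into standalone "atomic claims".
# The prompt text and `call_llm` helper are illustrative only; the paper's
# actual prompts are described in Yekyung Kim et al., April 2024.

CLAIM_PROMPT = (
    "Decompose the following book summary into a numbered list of atomic claims. "
    "Each claim must be fully understandable on its own and, where possible, "
    "situated within its relevant temporal, locational, and causal context.\n\n"
    "Summary:\n{summary}"
)

def extract_claims(summary: str, call_llm) -> list[str]:
    """Ask the LLM for atomic claims and split its numbered list into strings."""
    response = call_llm(CLAIM_PROMPT.format(summary=summary))
    claims = []
    for line in response.splitlines():
        line = line.strip()
        if line and line[0].isdigit():  # keep only the numbered claim lines
            claims.append(line.split(".", 1)[-1].strip())
    return claims
```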


Example summary of “Romantic Comedy” by Curtis Sittenfeld, output by Claude 3 Opus. Adapted from Figure 2 in Yekyung Kim et al., April 2024.


Extracted claims output by ChatGPT-4 from the example summary produced by Claude 3 Opus. Claim numbers correspond to the annotated portions of the summary shown above. Prompts were engineered to ensure the claims were understandable and situated within the relevant context. Adapted from Figure 2 in Yekyung Kim et al., April 2024.

Human validation by the authors of a random sample of 100 extracted claims demonstrated 100% precision (i.e., each claim can be traced to the summary without any extra or incorrect information).

Yekyung Kim et al

Allen Institute for AI & Princeton.

How LLMs performed.

Some LLMs did better than others, with Claude 3 Opus performing the best. None of them was completely faithful to the text of the books they were summarising.

Faithfulness of various LLMs in summarising fiction books, as evaluated by human readers on the accuracy of claims extracted from the summaries by ChatGPT. Derived from data in Yekyung Kim et al., April 2024.
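As an assumption about how these headline figures are derived, if faithfulness is reported as the share of extracted claims that human readers judge to be supported by the book, a per-model score can be computed along the following lines (the annotation record format here is illustrative):

```python
# Illustrative computation of a per-model faithfulness score, assuming each
# annotation records the model name and whether the claim was judged faithful.
from collections import defaultdict

def faithfulness_by_model(annotations: list[dict]) -> dict[str, float]:
    """Return, for each model, the fraction of its claims judged faithful."""
    totals = defaultdict(int)
    faithful = defaultdict(int)
    for ann in annotations:  # e.g. {"model": "Claude 3 Opus", "faithful": True}
        totals[ann["model"]] += 1
        faithful[ann["model"]] += ann["faithful"]
    return {model: faithful[model] / totals[model] for model in totals}
```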

A qualitative analysis of FABLES reveals that the majority of claims marked as unfaithful are related to events or states of characters and relationships. Furthermore, most of these claims can only be invalidated via multi-hop reasoning over the evidence, highlighting the task's complexity and its difference from existing fact-verification settings.

Yekyung Kim et al

Allen Institute for AI & Princeton.

Types of error.

The researchers at the Allen Institute for AI and Princeton University produced a taxonomy of errors. The percentage of summaries displaying each type of error is shown in the table below, with omissions, factual errors and chronology being the most problematic.


Percentage of summaries per model identified with specific issues, shown in red boxes, based on annotator comments. The green boxes indicate categories where the models received compliments. Adapted from Table 6 in Yekyung Kim et al., 2024.

... omission of key information plagues all LLM summarizers.

Yekyung Kim et al

Allen Institute for AI & Princeton.

Trustworthy or not?

This evaluation provides an important assessment of whether current GenAI systems can be trusted to produce a faithful summary of the texts that they are fed. In particular, it is alarming that such a high percentage of summaries were noted as having factual errors, along with a not dissimilar percentage of omissions. This should cause us to pause and consider whether such systems should be relied upon at all in business, education and many other fields.

References

Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer. Evaluating faithfulness and content selection in book-length summarization. UMass Amherst, Adobe, Allen Institute for AI, Princeton University. arXiv:2404.01261, April 2024.

Impact on Human Values

Human values risk analysis for text summarisation.

Truth & Reality

HIGH RISK

Significant percentage of Factual errors

Omissions

Authentic Relationships

LOW RISK

MEDIUM RISK

Replaces humans in producing summaries

Privacy & Freedom

HIGH RISK

LLMs use copyright data

Moral Autonomy

LOW RISK

Cognition & Creativity

MEDIUM RISK

Can impact critical thinking and creativity

Governance Pillars

Transparency

Companies are opaque about what data they trained LLMs on, although most acknowledge that copyrighted data was used.

Independent “faithfulness” metrics should be published.

Justice

Copyright has clearly been infringed; lawsuits are currently the only redress.

Accountability

Companies should be held to account for infringement of copyright, and for the output of LLMs where there is consequential loss.

Policy Recommendations

Organisations deploying a chatbot for use by the public or clients must be accountable for the output of the chatbot where there is consequential loss due to unfaithful summarisation of documents. Legislation may be needed to assign ‘product’ liability where the chatbot is the ‘product’.

Copyright protection should be enforced, with no exception made for AI companies. Chapter 8 of the House of Lords report cited deals with copyright in some detail and highlights various policy options and the limitations of different approaches, such as licensing and opt-in or opt-out of data crawling on websites.

Developers and companies should be required to make information available on what data their systems have been trained on and what accuracy can be expected based on independent tests.
