Out-of-Domain Finetuning to Bootstrap Hallucination Detection
How to use open-source, permissive-use data and collect less labeled samples for our tasks.
It’s easy to finetune models for our specific tasks; we just need a couple hundred or a few thousand samples. However, collecting these samples is costly and time-consuming. What if, we could bootstrap our tasks with out-of-domain data? We’ll explore this idea here.
The task is to finetune a model to detect factual inconsistencies aka hallucinations. We’ll focus on news summaries in the Factual Inconsistency Benchmark (FIB). It is a challenging dataset—even after finetuning for 10 epochs on the training split, the model still does poorly on the validation split (middle in image below).
To examine how out-of-domain data helps, we re-instantiate our model and finetune on Wikipedia summaries for 3 epochs before finetuning for 10 epochs on FIB. This led to PR AUC of 0.85 (right below)—a 23% improvement over finetuning on FIB alone. Also, the probability distributions are better separated, making the model more suitable for production where we have to pick a threshold to classify a summary as inconsistent.
Evals on non-finetuned vs. task-specific finetuning vs. out-of-domain + task-specific finetuning
Okay, that was the high-level overview. Next, let’s dive into the details and visuals.
Classifying factual inconsistencies via NLI
We can detect factually inconsistent summaries via the natural language inference (NLI) task. The NLI task works like this: Given a premise sentence and a hypothesis sentence, the goal is to predict if the hypothesis is entailed by, neutral, or contradicts the premise.
Entailment, neutral, and contradiction labels in NLI (source)
We can apply NLI to abstractive summaries too. Here, the source document is the premise and the summary is the hypothesis. Thus, if the summary contradicts the source, that’s a factual inconsistency aka hallucination.
Entailment, neutral, and contradiction labels in abstractive summaries (source)
For the model, we’ll use a variant of BART that has been finetuned on the MNLI dataset. By default, an NLI model outputs probabilities for entailment (label = 2), neutral (label = 1), and contradiction (label = 0). To get the probability of factual inconsistency, we’ll drop the neutral dimension, apply a softmax on the remaining two dimensions, and take the probability of contradiction. It’s really just a single line of code:
def get_prob_of_contradiction(logits: torch.Tensor) -> torch.Tensor: """ Returns probability of contradiction aka factual inconsistency.def get_prob_of_contradiction(logits: torch.Tensor) -> torch.Tensor: """ Returns probability of contradiction aka factual inconsistency.Args: logits (torch.Tensor): Tensor of shape (batch_size, 3). The second dimension represents the probabilities of contradiction, neutral, and entailment.
Returns: torch.Tensor: Tensor of shape (batch_size,) with probability of contradiction.
Note: This function assumes the probability of contradiction is in index 0 of logits. """
Drop neutral logit (index=1), softmax, and get prob of contradiction (index=0)
prob = F.softmax(logits[:, [0, 2]], dim=1)[:, 0]
return prob`
Finetuning to classify factual inconsistencies in FIB
The Factual Inconsistency Benchmark (FIB) is a dataset for evaluating whether language models can spot factual inconsistencies in summaries. It is based on news summaries from CNN/Daily Mail and XSUM, with 100 and 500 news articles from each respectively. Each news article comes with a pair of summaries where one is factually consistent with the source document while the other isn’t.
Given a gold summary and a distractor, can we assign a higher score to the factually consistent summary vs. the factually inconsistent summary? (source)
Four of the authors annotated the summaries, and each summary was annotated by two annotators. First, they annotated reference summaries as factually consistent or inconsistent. Then, they edited the factually inconsistent summaries to be factually consistent with an emphasis on making minimal edits. Most edits involved removing or changing keywords or phrases that were not found in the news article.
To prepare for finetuning and evaluation, we first excluded the CNN/Daily Mail documents and summaries—they were found to be low quality during visual inspection. Then, we balanced the dataset by keeping one consistent summary and one inconsistent summary for each news article. Finally, we split it into training and validation sets, where validation made up 20% of the data (i.e., 100 news articles). To prevent data leakage, the same news article doesn’t appear across training and validation splits.
Upon visual inspection, it seems challenging to distinguish between consistent vs. inconsistent summaries. The latter tends to contain words from or similar to the source document but phrased to be factually inconsistent. Here’s the first sample in the dataset:
Source: Vehicles and pedestrians will now embark and disembark the Cowes ferry separately following Maritime and Coastguard Agency (MCA) guidance. Isle of Wight Council said its new procedures were in response to a resident’s complaint. Councillor Shirley Smart said it would “initially result in a slower service”. Originally passengers and vehicles boarded or disembarked the so-called “floating bridge” at the same time. Ms Smart, who is the executive member for economy and tourism, said the council already had measures in place to control how passengers and vehicles left or embarked the chain ferry “in a safe manner”. However, it was “responding” to the MCA’s recommendations “following this complaint”. She added: “This may initially result in a slower service while the measures are introduced and our customers get used to the changes.” The service has been in operation since 1859.
Inconsistent summary: A new service on the Isle of Wight’s chain ferry has been launched following a complaint from a resident.
Consistent summary: Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.
Evaluating the non-finetuned model on FIB confirms our intuition: The model struggled and had low ROC AUC and PR AUC (below). Furthermore, probability distributions for consistent vs. inconsistent summaries weren’t well separated—this makes it unusable in production where we have to pick a threshold to classify outputs as inconsistent.
The non-finetuned model performs badly on FIB
After finetuning for 10 epochs via QLoRA, ROC AUC and PR AUC improved slightly to 0.68 - 0.69 (below). Nonetheless, the separation of probabilities is still poor. Taking a threshold of 0.8, recall is a measly 0.02 while precision is 0.67.
Even after finetuning for 10 epochs, performance isn't much better
Pre-finetuning on USB to improve performance on FIB
It seems that finetuning solely on FIB for 10 epochs wasn’t enough. What if we pre-finetune on a different dataset before finetuning on FIB?
The Unified Summarization Benchmark (USB) is made up of eight summarization tasks including abstractive summarization, evidence extraction, and factuality classification. While FIB documents are based on news, USB documents are based on a different domain—Wikipedia. Labels for factual consistency were created based on edits to summary sentences; inconsistent and consistent labels were assigned to the before and after versions respectively. Here’s the first sample in the dataset:
Source: Wendy Jane Crewson was born in Hamilton, Ontario, the daughter of June Doreen (née Thomas) and Robert Binnie Crewson. Also in 2012, Crewson began playing Dr. Dana Kinny in the CTV medical drama “Saving Hope”, for which she received Canadian Screen Award for Best Supporting Actress in a Drama Program or Series in 2013.
Inconsistent (before): Wendy Jane Crewson (born May 9, 1956) is a Canadian actress and producer.
Consistent (after): Wendy Jane Crewson is a Canadian actress.
Nonetheless, the labeling methodology isn’t perfect. Summaries that were edited due to grammatical or formatting errors were also labeled as non-factual. For example, in the second sample, the only difference between the consistent and inconsistent summary is that the latter is missing the word “the”. IMHO, both summaries are factually consistent.
Source: When she returned to Canada, Crewson landed a leading role in the television movie “War Brides” (1980) directed by Martin Lavut, for which she received her first ACTRA Award nomination. From 1980 to 1983, she starred in the CBC drama series, “Home Fires”, a family saga set in Toronto during World War II. In 1991, Crewson appeared in her first breakthrough role in the American drama film “The Doctor” starring William Hurt.
Inconsistent (before): She began her career appearing on Canadian television, before her breakthrough role in 1991 dramatic film “The Doctor”.
Consistent (after): She began her career appearing on Canadian television, before her breakthrough role in the 1991 dramatic film “The Doctor”.
Cleaning the USB data is beyond the scope of this write-up so we’ll just use the data as is. (Surprisingly—or unsurprisingly—finetuning on the imperfect USB data was still helpful.) For the training and validation sets, we’ll adopt the split provided by the authors.
First, we re-instantiate the model before finetuning on the USB training split for 3 epochs with the same QLoRA parameters. Then, we evaluate on the FIB validation split.
Finetuning on USB data didn’t seem to help with the FIB validation split. (To be clear, we’ve not finetuned on the FIB data (yet) so the poor performance is expected.) ROC AUC and PR AUC barely improved (below), and the separation of probabilities appears just as bad. Taking the same threshold of 0.8, recall is 0.10 while precision is 0.59. Considering that the FIB data is balanced, precision of 0.59 is barely better than a coin toss.
Finetuning on 3 epochs of USB didn't seem to help much with FIB...
But what happens when we add 10 epochs of finetuning on FIB? Our model now achieves PR AUC of 0.85—this is a 23% improvement over finetuning on FIB alone (PR AUC = 0.69). More importantly, the probability distributions of consistent vs. inconsistent summaries are better separated (right below), making the model more suitable for production. Relative to finetuning solely on FIB, at the same threshold of 0.8, we’ve increased recall from 0.02 to 0.50 (25x) and precision from 0.67 to 0.91 (+35%).
Or did it? Adding 10 epochs of FIB led to greatly improved performance
Considering that evals for the USB-finetuned (PR AUC = 0.60) and non-finetuned (PR AUC = 0.56) models weren’t that different, this improvement is unexpected. It suggests that, though finetuning on USB didn’t directly improve performance on FIB, the model did learn something useful that enabled the same 10 epochs of FIB-finetuning to achieve far better results. This also means that it may be tricky to directly assess the impact of data blending.
Takeaway: Pre-finetuning on Wikipedia summaries improved factual inconsistency classification in news summaries, even though the former is out-of-domain.
In other words, we bootstrapped on Wikipedia summaries to identify factually inconsistent news summaries. Thus, we may not need to collect as much finetuning data for our tasks if there are open-source, permissive-use datasets that are somewhat related. While this may be obvious for some, it bears repeating that transfer learning for language modeling can extend beyond pretraining to finetuning (also see InstructGPT and its predecessor).
(Also, it’ll likely work just as well if we blend both datasets and finetune via a single stage instead of finetuning in multiple stages on each dataset separately. That said, doing it in stages helped with understanding and visualizing the impact of each dataset.)
• • •
I hope you found this as exciting to read as it was for me to run these experiments. The code is available here. What interesting findings and tricks have you come across while finetuning your own models? Also, what other approaches work well for factual inconsistency detection? Please DM me or leave a comment below!
References
-
Tam, Derek, et al. “Evaluating the factual consistency of large language models through summarization.” arXiv preprint arXiv:2211.08412 (2022).
-
Krishna, Kundan, et al. “USB: A Unified Summarization Benchmark Across Tasks and Domains.” arXiv preprint arXiv:2305.14296 (2023).
-
Lewis, Mike, et al. “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.” arXiv preprint arXiv:1910.13461 (2019).
-
Williams, Adina, Nikita Nangia, and Samuel R. Bowman. “A broad-coverage challenge corpus for sentence understanding through inference.” arXiv preprint arXiv:1704.05426 (2017).
-
Nallapati, Ramesh, et al. “Abstractive text summarization using sequence-to-sequence rnns and beyond.” arXiv preprint arXiv:1602.06023 (2016).
-
Narayan, Shashi, Shay B. Cohen, and Mirella Lapata. “Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization.” arXiv preprint arXiv:1808.08745 (2018).
-
Dettmers, Tim, et al. “Qlora: Efficient finetuning of quantized llms.” arXiv preprint arXiv:2305.14314 (2023).
-
Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in Neural Information Processing Systems 35 (2022): 27730-27744.
-
Stiennon, Nisan, et al. “Learning to summarize with human feedback.” Advances in Neural Information Processing Systems 33 (2020): 3008-3021.
Appendix
You may be wondering how the model performed on USB. I excluded it from the main write-up so as not to distract from the FIB dataset—there were already so many graphs!
Performance of non-finetuned model on USB: The non-finetuned model struggled with the USB validation split and was barely able to distinguish between factually consistent vs. factually inconsistent summaries. The bulk of the probability density for both labels is at 0.0. This may be due to the labeling methodology where some grammatical or formatting errors were incorrectly labeled as factually inconsistent. See the second sample here.
Similarly, the non-finetuned model performs poorly on USB
Performance of finetuned model on USB: After three epochs, PR AUC increased from 0.62 to 0.92. Also, the separation of distributions is closer to what we’d like to see in production. Nonetheless, given the concerns with the labels, it’s hard to say if it’s learned to classify factual inconsistencies or other errors such as grammar and formatting.
But after 3 epochs it does much better
Thanks to Shreya Shankar for reading drafts of this—the highlights in the sample summaries were her brilliant idea.
If you found this useful, please cite this write-up as:
Yan, Ziyou. (Nov 2023). Out-of-Domain Finetuning to Bootstrap Hallucination Detection. eugeneyan.com. https://eugeneyan.com/writing/finetuning/.
or
@article{yan2023finetune, title = {Out-of-Domain Finetuning to Bootstrap Hallucination Detection}, author = {Yan, Ziyou}, journal = {eugeneyan.com}, year = {2023}, month = {Nov}, url = {https://eugeneyan.com/writing/finetuning/} }@article{yan2023finetune, title = {Out-of-Domain Finetuning to Bootstrap Hallucination Detection}, author = {Yan, Ziyou}, journal = {eugeneyan.com}, year = {2023}, month = {Nov}, url = {https://eugeneyan.com/writing/finetuning/} }Share on:
Join 11,800+ readers getting updates on machine learning, RecSys, LLMs, and engineering.
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
open-sourceLLM Cost Tracking and Spend Management for Engineering Teams
<p>Your team ships a feature using GPT-4, it works great in staging, and then production traffic hits. Suddenly you are burning through API credits faster than anyone expected. Multiply that across three providers, five teams, and a few hundred thousand requests per day. Good luck figuring out where the money went.</p> <p>We built <a href="https://git.new/bifrost" rel="noopener noreferrer">Bifrost</a>, an open-source LLM gateway in Go, and cost tracking was one of the first problems we had to solve properly. This post covers what we learned, how we designed spend management into the gateway layer, and what the alternatives look like. You can get started with the <a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer">setup guide</a> in under a minute.</
Top 5 Enterprise AI Gateways to Reduce LLM Cost and Latency
<h2> TL;DR </h2> <p>If you're running LLM workloads in production, you already know that cost and latency eat into your margins fast. An AI gateway sits between your app and the LLM providers, giving you caching, routing, failover, and budget controls in one layer. This post breaks down five enterprise AI gateways, what each one does well for cost and latency, and where they fall short. <a href="https://git.new/bifrost" rel="noopener noreferrer">Bifrost</a> comes out ahead on raw latency (less than 15 microseconds overhead per request), but each tool has its own strengths depending on your stack.</p> <h2> Why Cost AND Latency Matter Together </h2> <p>If you're building with LLMs, you have probably already noticed that optimizing for cost alone can tank your latency, and vice versa. Switchi
How Bifrost Reduces GPT Costs and Response Times with Semantic Caching
<h2> TL;DR </h2> <p>Every GPT API call costs money and takes time. If your app sends the same (or very similar) prompts repeatedly, you are paying full price each time for answers you already have. <a href="https://git.new/bifrost" rel="noopener noreferrer">Bifrost</a>, an open-source LLM gateway, ships with a <a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer">semantic caching</a> plugin that uses dual-layer caching: exact hash matching plus <a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer">vector similarity search</a>. Cache hits cost zero. Semantic matches cost only the embedding lookup. This post walks you through how it works and how to set it up.</p> <h2> The cost problem with GPT API calls </h2> <p
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Releases
Sweden’s Redpine launches with €1.1 million backing to reduce AI hallucinations through licensed data - EU-Startups
<a href="https://news.google.com/rss/articles/CBMi0wFBVV95cUxOcHdKbWh4N042Z19NVThCNUdVdGFPWkY4ZE56V1NyeE1WR2d6dGY2cG1QU1RUc0syUUJ6bUIzUEktQ1pKSHZJdEo3QUJTLS1IVFk1Rkx5bml6eVZqc2V4QnpTelVPN2ppX05ldllwS2VsN2dtZVlYMEZUVm1feDRtMkQyb1d2NEFHWW9nUFBfS3Y5WEE3UG44SVNFV3hvNE9xUzQyclhoaV9jeFk3UnNoRmp1Tm9IckduRmZMdmZQWDJuVGRGQTdWOVpsZTBTa2FiREgw?oc=5" target="_blank">Sweden’s Redpine launches with €1.1 million backing to reduce AI hallucinations through licensed data</a> <font color="#6f6f6f">EU-Startups</font>
Herd MSL and Publicis Groupe ANZ launch AI search visibility service - Mi-3.com.au
<a href="https://news.google.com/rss/articles/CBMipAFBVV95cUxQdUhMSzZGTUFxdUFHZ1E1TTl3QWdSQmxNX2U1ejZTWndzcEhKendqb3hvUFlDLUUzMEkxQW5yamFHR1M5R3VlbnlueGdreVAwLXBNM25zc2lFRnpEZFJvcEF1bVZKdnI4N3hvd19TdEZDeHFfQTJIbjZqS094c2xSY3ZOMXB1S1hXQ0xLYlI0MjhEa2M4TTZXMVBvQTVwV0dRdmRGMA?oc=5" target="_blank">Herd MSL and Publicis Groupe ANZ launch AI search visibility service</a> <font color="#6f6f6f">Mi-3.com.au</font>
AI Search Barely Cites Syndicated News Or Press Releases - Search Engine Journal
<a href="https://news.google.com/rss/articles/CBMiogFBVV95cUxObGg0U2ZLMDNLTy0yZ1hsYmFIYW9QVnBvcThGdWVRNWVvVUtqS0oyUW5SWWFvVmN5ZDVtOEdYQXVvdmdKbnduQkhJVy1TLXQ3U1ltVThrOWVESDhfeE5YeFlIVWw0TVNWZXBGWXVETGJObmNTZ1pOeXJGYU9hRzNsel9qdEV2cmJjSV93LWJCZTdRV3VjdVh4cE1kanhDcDVwX0E?oc=5" target="_blank">AI Search Barely Cites Syndicated News Or Press Releases</a> <font color="#6f6f6f">Search Engine Journal</font>
White House Releases National AI Legislative Framework, as Competing Congressional Proposals Sharpen the Federal-State Divide — AI: The Washington Report - mintz.com
<a href="https://news.google.com/rss/articles/CBMivAFBVV95cUxPNEdTTWVFS004SGxOLXZYY0Z4TmRfWkpJMGMyVnlQcWR0cW1HSHRIYmN1U1czWjMzMmlyM1ZsSjcxU19lLV93bnhraXRRVGp5Z251TWtTSFFHNmFYdVRhenF1MXNpQmN1TnpJV2E1ZVhTRXJfb1pWWjZBd2tXWWpyaDA5a1p3YWEzdTlMWjdTM2xnX18tVDhGcTFWM19zYm5Pd19UdlVmSXkxRGc2R0c4TTNwYVZiTFJRNlZ3Uw?oc=5" target="_blank">White House Releases National AI Legislative Framework, as Competing Congressional Proposals Sharpen the Federal-State Divide — AI: The Washington Report</a> <font color="#6f6f6f">mintz.com</font>
Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!