AI models just can’t seem to stop making things up. As two recent studies point out, that proclivity underscores prior warnings not to rely on AI advice for anything that really matters.
One thing AI makes up quite often is the names of software packages.
As we noted earlier this year, Lasso Security found that large language models (LLMs), when generating sample source code, will sometimes invent names of software package dependencies that don’t exist.
That’s scary, because criminals could easily register a package under a name commonly produced by AI services and stuff it with malware. Then they just have to wait for a hapless developer to accept an AI’s suggestion and pull the booby-trapped dependency into their project.
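The obvious countermeasure is to treat any dependency an AI suggests with the same suspicion you’d give a stranger’s USB stick. As a purely illustrative sketch – the allowlist file and script below are our own invention, not anything prescribed by the research discussed here – a cautious developer might gate installs on a hand-curated list of approved packages:

```python
# Sketch of a pre-install sanity check against a hand-curated allowlist.
# The file name and script are our own illustration, not tooling from
# either research team.
import sys
from pathlib import Path

ALLOWLIST_FILE = Path("approved-packages.txt")  # hypothetical: one approved name per line

def is_approved(package: str) -> bool:
    """Return True only if the package name appears in the local allowlist."""
    approved = {
        line.strip().lower()
        for line in ALLOWLIST_FILE.read_text().splitlines()
        if line.strip()
    }
    return package.lower() in approved

if __name__ == "__main__":
    candidate = sys.argv[1]
    if not is_approved(candidate):
        print(f"Refusing to vouch for unvetted package: {candidate}")
        sys.exit(1)
    print(f"{candidate} is on the allowlist - hand it to your package manager")
```

Run something like that against a suggested name before letting pip or npm anywhere near it; anything not on the list gets a human look first.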
Researchers from the University of Texas at San Antonio, the University of Oklahoma, and Virginia Tech recently looked at 16 LLMs used for code generation to explore their penchant for making up package names.
In a preprint paper titled “We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs,” the authors explain that hallucinations are one of the unresolved shortcomings of LLMs.
That’s perhaps not lost on the lawyers who last year used generative AI to cite non-existent court cases in legal briefs, and then had to make their own apologies to affected courts. But among those who find LLMs genuinely useful for coding assistance, it’s a point that bears repeating.
“Hallucinations are outputs produced by LLMs that are factually incorrect, nonsensical, or completely unrelated to the input task,” according to authors Joseph Spracklen, Raveen Wijewickrama, A H M Nazmus Sakib, Anindya Maiti, Bimal Viswanath, and Murtuza Jadliwala. “Hallucinations present a critical obstacle to the effective and safe deployment of LLMs in public-facing applications due to their potential to generate inaccurate or misleading information.”
Maybe not “we’ve bet on the wrong horse” critical – more like “manageable with enough marketing and lobbying” critical.
LLMs already have been deployed in public-facing applications, thanks to the enthusiastic sellers of AI enlightenment and cloud vendors who just want to make sure all the expensive GPUs in their datacenters see some utilization. And developers, to hear AI vendors tell it, love coding assistant AIs. They apparently improve productivity and leave coders more confident in the quality of their work.
Even so, the researchers wanted to assess the likelihood that generative AI models will fabulate bogus packages. So they used 16 popular LLMs, both commercial and open source, to generate 576,000 code samples in JavaScript and Python, which rely respectively on the npm and PyPI package repositories.
The results left something to be desired.
“Our findings reveal that the average percentage of hallucinated packages is at least 5.2 percent for commercial models and 21.7 percent for open source models, including a staggering 205,474 unique examples of hallucinated package names, further underscoring the severity and pervasiveness of this threat,” the authors state.
The 30 tests run from the set of research prompts produced 2.23 million package references in generated code – about 20 percent of which (440,445) turned out to be hallucinations. Of those, 205,474 were unique non-existent package names that could not be found in PyPI or npm.
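The underlying test is simple enough: a generated package name either resolves in the public registry or it doesn’t. A rough sketch of that kind of existence check against PyPI and npm – our own illustration of the idea, not the researchers’ actual measurement pipeline – might look like this:

```python
# Rough illustration of testing whether generated package names exist
# in PyPI or npm; not the researchers' actual measurement pipeline.
import urllib.error
import urllib.request

def exists_on_pypi(name: str) -> bool:
    """True if PyPI's JSON API knows the package, False on a 404."""
    try:
        urllib.request.urlopen(f"https://pypi.org/pypi/{name}/json", timeout=10)
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

def exists_on_npm(name: str) -> bool:
    """True if the npm registry knows the package, False on a 404."""
    try:
        urllib.request.urlopen(f"https://registry.npmjs.org/{name}", timeout=10)
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

if __name__ == "__main__":
    for candidate in ["requests", "surely-not-a-real-package-xyz"]:
        print(candidate, "pypi:", exists_on_pypi(candidate), "npm:", exists_on_npm(candidate))
```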
What’s noteworthy here – beyond the fact that commercial models are four times less likely than open source models to fabricate package names – is that these results show four to six times fewer hallucinations than Lasso Security’s figures for GPT-3.5 (5.76 percent vs 24.2 percent) and GPT-4 (4.05 percent vs 22.2 percent). That counts for something.
Reducing the likelihood of package hallucinations comes at a cost. Using the DeepSeek Coder 6.7B and CodeLlama 7B models, the researchers implemented two mitigations: Retrieval Augmented Generation (RAG), which supplies a list of valid package names to help guide prompt responses, and supervised fine-tuning, which filters out invented packages and retrains the model. The result was reduced hallucination – at the expense of code quality.
“The code quality of the fine-tuned models did decrease significantly, -26.1 percent and -3.1 percent for DeepSeek and CodeLlama respectively, in exchange for substantial improvements in package hallucination rate,” the researchers wrote.
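The RAG half of that mitigation amounts to stuffing the prompt with names the model is allowed to use. Here’s a minimal sketch of the idea, with a toy package list and made-up prompt wording rather than anything lifted from the paper:

```python
# Minimal sketch of RAG-style grounding for package suggestions.
# The retrieval source and prompt wording are illustrative assumptions,
# not the paper's implementation.
from difflib import get_close_matches

# In practice this would be a snapshot of PyPI/npm names; a toy list here.
KNOWN_PACKAGES = ["requests", "numpy", "pandas", "flask", "httpx", "rich"]

def retrieve_candidates(task: str, k: int = 5) -> list[str]:
    """Pick known package names loosely related to the task description."""
    hits: list[str] = []
    for word in task.lower().split():
        hits += get_close_matches(word, KNOWN_PACKAGES, n=k, cutoff=0.6)
    # Deduplicate while preserving order; fall back to a default slice.
    return list(dict.fromkeys(hits)) or KNOWN_PACKAGES[:k]

def build_prompt(task: str) -> str:
    """Prepend a list of valid package names so the model stays on the menu."""
    valid = ", ".join(retrieve_candidates(task))
    return (
        f"Only import packages from this list: {valid}.\n"
        f"Task: {task}\n"
        "Write the Python code."
    )

if __name__ == "__main__":
    print(build_prompt("fetch a web page and parse the JSON response"))
```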
Size matters too
In the other study exploring AI hallucination, José Hernández-Orallo and colleagues at the Valencian Research Institute for Artificial Intelligence in Spain found that LLMs become more unreliable as they scale up.
The researchers looked at three model families: OpenAI’s GPT, Meta’s LLaMA and BigScience’s open source BLOOM. They tested the various models against scaled-up versions (more parameters) of themselves, with questions about addition, word anagrams, geographical knowledge, science, and information-oriented transformations.
They found that while the larger models – those shaped with fine-tuning and more parameters – are more accurate in their answers, they are less reliable.
That’s because the smaller models will avoid responding to some prompts they can’t answer, whereas the larger models are more likely to serve up a plausible but wrong answer. So a greater share of the inaccurate responses are outright incorrect answers, with a commensurate reduction in avoided ones.
This trend was particularly noticeable in OpenAI’s GPT family. The researchers found that GPT-4 will answer almost anything, where prior model generations would decline to respond in the absence of a reliable prediction.
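Put crudely, every prompt lands in one of three buckets – correct, incorrect, or avoided – and scaling shifts mass from the third bucket into the second. A toy illustration, with numbers invented for the purpose rather than taken from the paper:

```python
# Toy illustration of the avoided-to-incorrect shift described above.
# The counts are invented for demonstration; they are not from the study.
def breakdown(correct: int, incorrect: int, avoided: int) -> str:
    total = correct + incorrect + avoided
    wrong_among_answers = incorrect / (correct + incorrect)
    return (f"accuracy {correct / total:.0%}, avoided {avoided / total:.0%}, "
            f"wrong answers among responses given {wrong_among_answers:.0%}")

print("smaller model:", breakdown(correct=50, incorrect=10, avoided=40))
print("scaled model: ", breakdown(correct=60, incorrect=35, avoided=5))
# The scaled model scores better overall, yet anyone reading only the answers
# it actually gives runs into proportionally more confident nonsense.
```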
Further compounding the problem, the researchers found that humans are bad at evaluating LLM answers – misclassifying incorrect answers as correct somewhere between roughly 10 and 40 percent of the time.
Based on their findings, Hernández-Orallo and his co-authors argue, “relying on human oversight for these systems is a hazard, especially for areas for which the truth is critical.”
This is a long-winded way of rephrasing Microsoft’s AI boilerplate, which warns not to use AI for anything important.
“[E]arly models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook,” the researchers conclude.
“These findings highlight the need for a fundamental shift in the design and development of general-purpose artificial intelligence, particularly in high-stakes areas for which a predictable distribution of errors is paramount.” ®