New Study Shows ChatGPT Invents or Botches Most Citations in Research
A team of Australian scientists has delivered a stark warning to academics leaning on AI to speed up their work: ChatGPT’s newest model, GPT-4o, still produces an alarming number of fake or flawed citations. The Deakin University researchers found that more than half of the references the system generated for mental-health literature reviews were either fabricated outright or riddled with inaccuracies.
In their experiment, the researchers asked GPT-4o to craft six literature reviews across three psychiatric conditions. The chatbot produced 176 citations, and 19.9 percent of them (35 references) were entirely made up. Even among the 141 references that actually existed, nearly half (45.4 percent) contained mistakes ranging from incorrect publication years to bogus page numbers and broken digital object identifiers (DOIs).
Just 77 citations, or 43.8 percent, were both real and correct. The rest—56.2 percent—were unusable for scientific purposes, a finding the authors say should trouble any academic who relies on generative AI to support scholarship. The study, published in JMIR Mental Health, also examined when and why the model was especially prone to errors.
The fake citations frequently appeared legitimate at first glance. GPT-4o provided DOIs for 33 of the 35 fabricated entries, and 64 percent of those links sent users to actual published papers that had nothing to do with the AI-generated claims. Another 36 percent were pure fiction—non-functioning or invalid DOIs that went nowhere. In either case, the references were completely disconnected from the content ChatGPT had written.
Lead researcher Jake Linardon and his team tested how accuracy shifted with topic familiarity and prompt specificity. They chose major depressive disorder, binge eating disorder, and body dysmorphic disorder, conditions with dramatically different public profiles and volumes of research. Depression is widely studied, with hundreds of clinical trials on digital therapies. Body dysmorphic disorder, by contrast, has far fewer digital-treatment publications.
The differences were striking. When GPT-4o wrote about major depressive disorder, only 6 percent of the citations were fabricated. But when it covered binge eating disorder and body dysmorphic disorder, those numbers shot up to 28 percent and 29 percent. Even among the citations that were real, accuracy varied wildly: 64 percent for depression, 60 percent for binge eating disorder, and a mere 29 percent for body dysmorphic disorder.
The researchers then compared general summaries with narrowly focused reviews. For binge eating disorder, specificity mattered enormously: fabrication jumped to 46 percent for specialized requests, compared with 17 percent when the AI wrote general overviews. The pattern did not hold uniformly across all disorders, but it showed that highly specific prompts can dramatically increase the hallucination rate in some areas.
These findings come at a time when AI use in scientific work is exploding. Nearly 70 percent of mental-health researchers report using ChatGPT for tasks like literature summarization and early manuscript writing. Many praise the efficiency boost, but the risk of misleading content remains a serious concern.
The authors warn that citation errors aren’t minor inconveniences—they damage scientific integrity. Citations are the scaffolding of academic discourse, guiding readers to supporting evidence and linking new work to existing knowledge. When references point to unrelated or nonexistent material, the entire chain of scholarship falters.
The study highlighted that DOIs were the most error-prone element of AI-generated references, with a 36.2 percent failure rate. Problems with author lists were least common at 14.9 percent, but publication dates, journals, volume numbers, and page ranges all showed significant error levels.
Linardon’s team stresses that every AI-generated reference requires verification against original sources. They encourage academic journals to adopt more stringent safeguards—such as using plagiarism-detection software to flag citations that don’t match any known database entry. Universities, they add, should create clear rules around AI use in scholarly writing, including training researchers to spot fabricated references and requiring transparency about AI involvement.
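As a rough illustration of what automated reference checking might look like, the short Python sketch below queries the public Crossref REST API for a cited DOI and compares the returned record's title against the citation. It is not part of the study's method, and the DOI and title in the example are hypothetical.

import requests
from difflib import SequenceMatcher

CROSSREF_API = "https://api.crossref.org/works/"  # public Crossref REST API

def check_citation(doi, cited_title, threshold=0.6):
    """Rough verdict for one AI-generated reference (illustrative only)."""
    resp = requests.get(CROSSREF_API + doi, timeout=10)
    if resp.status_code != 200:
        return "invalid DOI: no Crossref record found"
    real_title = (resp.json()["message"].get("title") or [""])[0]
    similarity = SequenceMatcher(None, cited_title.lower(), real_title.lower()).ratio()
    if similarity < threshold:
        return "DOI resolves, but to an unrelated paper: " + real_title
    return "DOI and title match; still verify authors, year, journal, and pages"

# Hypothetical reference, not taken from the study
print(check_citation("10.1000/xyz123", "Digital interventions for binge eating disorder"))

A check like this catches the two failure modes the study describes, nonexistent DOIs and DOIs that resolve to unrelated papers, but it cannot substitute for reading the cited source itself.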
Importantly, the study found no sign that more advanced AI models have solved the hallucination problem. While direct comparisons across versions are difficult, citation fabrication remained prevalent across every test condition, despite expectations that GPT-4o would perform more reliably.
The authors argue that topic maturity and public familiarity heavily shape citation reliability. AI may be safer for well-established subjects but becomes increasingly unreliable when handling niche or newly emerging research fields. Accuracy, in other words, is not random—it is tightly linked to the strength of the underlying training data.
For now, the researchers say ChatGPT should function only as a starting tool, one that can help outline ideas or generate draft material, but never as a source of dependable citations. Human oversight remains essential, and verification cannot be outsourced.
The study also raises broader questions about how generative AI systems should be designed for academic use. If topic-based predictability can indicate when hallucinations are more likely, AI platforms might incorporate built-in alerts or verification prompts for specialized or sparse research areas.
As journals and funding agencies increasingly require explicit AI-use disclosures, the findings underscore why these policies matter. Without strong editorial safeguards, fabricated references could pass through peer review, seep into published work, distort future research, and create long-term damage across scientific disciplines.
The researchers caution that the challenge isn’t merely individual—it’s systemic. Once false citations enter the academic ecosystem, they can spread through citation networks like contaminants. Preventing that outcome requires institutional policies, editorial vigilance, and a clear understanding that while AI can accelerate research tasks, it cannot yet be trusted to anchor the scientific record.
