British geneticist interested in splicing, RNA decay, and synthetic biology. This is my blog focusing on my adventures in computational biology. 

CompBio 029: It is January 2025 and AI hallucination of academic references is still a huge problem

So I have been impressed with the ability of AI tools to write code (see blog post here) and, more recently, their ability to take in a document you have written and write an accurate summary. Not the free tier of ChatGPT from OpenAI, but the paid versions of ChatGPT and Claude AI from Anthropic. So I wanted to revisit something that I knew them to be terrible at before, to see if 2025 was really the year that I could stop writing literature reviews and manuscript introductions and let the Large Language Models (LLMs) have all the fun.

I took a subject matter that I know very well, synthetic gene circuit platforms in plants, and asked the different LLMs this:

Prompt: Please write me a paragraph that summaries the latest advances in synthetic gene circuits for plants with references (Harvard style) and a reference list at the end of the paragraph. 

Free ChatGPT

The first reference is VERY wrong: wrong year, and while the first and last authors are OK, the middle author names seem off. The journal name and article title are at least OK. The second reference is also wrong: wrong author list, wrong issue and wrong pages, although the year, title, journal and some of the authors are OK.

Paid ChatGPT (4o mini)

I do not have a paid account for ChatGPT, so thank you to Patrick Gong (Bluesky or LinkedIn) for running my prompt through his account and sharing the results:

All of these papers are complete fabrications. Astonishing!

Claude AI

I have a free account with Anthropic, so I can get a few answers from their Claude 3.5 Sonnet model, see below:

Claude gave a lot of references, but all of them are complete fabrications. The output from Claude did come with a warning:

Note: As my knowledge cutoff is April 2024, I should mention that you should verify these references, particularly those from 2024, as I may have generated plausible but incorrect citations.

But 5/6 of the references Claude generated were dated pre-2024, and yet all were still complete fabrications.

So somehow, the old (free) ChatGPT model appeared to be the best. It only had two incorrect references, and both had some basis in reality.

My personal take-home message is that LLMs have no concept of truth: LLMs are not trained to tell the truth and are not punished for lying. They are a tool, a tool with clear limitations and we must recognise these limitations if we are going to use them. Whether a student or a tenured professor, do not just accept their output as truth. Always verify.
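For anyone who wants to automate part of the "always verify" step, here is a minimal Python sketch of the kind of field-by-field check I did by hand above. It is not what I actually used for this post. It assumes the public Crossref REST API (`https://api.crossref.org/works` with the `query.bibliographic` parameter) as the source of truth, and the function names `crossref_best_match` and `compare_citation`, as well as the layout of the `claimed` dictionary, are my own hypothetical choices.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_best_match(citation_text):
    """Return Crossref's top bibliographic match for a free-text citation, or None.

    Hypothetical helper: it needs network access and is not called by the
    comparison function below.
    """
    query = urllib.parse.urlencode(
        {"query.bibliographic": citation_text, "rows": 1}
    )
    with urllib.request.urlopen(f"{CROSSREF_WORKS}?{query}", timeout=30) as resp:
        items = json.load(resp)["message"]["items"]
    return items[0] if items else None


def compare_citation(claimed, record):
    """Compare an LLM-supplied reference with a Crossref-style record.

    `claimed` is an assumed layout: a dict with "title", "year", "journal"
    and "authors" (a list of family names, in order).
    Returns a dict saying which fields agree.
    """

    def norm(text):
        # Lower-case and collapse whitespace so trivial differences are ignored.
        return " ".join(text.lower().split())

    rec_title = (record.get("title") or [""])[0]
    rec_journal = (record.get("container-title") or [""])[0]
    date_parts = record.get("issued", {}).get("date-parts") or [[None]]
    rec_year = date_parts[0][0] if date_parts[0] else None
    rec_families = [norm(a.get("family", "")) for a in record.get("author", [])]

    return {
        "title": norm(claimed["title"]) == norm(rec_title),
        "year": claimed["year"] == rec_year,
        "journal": norm(claimed["journal"]) == norm(rec_journal),
        "authors": [norm(a) for a in claimed["authors"]] == rec_families,
    }
```

A search like this will usually return some nearest match even for a fabricated reference, so a hit on its own proves nothing. It is the comparison of title, year, journal and author list that matters, and anything flagged as a mismatch still needs a human to look at the actual paper.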

CompBio 028: Python vs R: an endless war