How General-Purpose LLMs Are Deepening the Reproducibility Crisis in Life-Science Research
General-purpose large language models fabricate roughly one in five biomedical citations, fail to produce deterministic outputs for regulated tasks, and arrive in a landscape where 80% of legacy bioinformatics workflows already cannot execute. The evidence paints a dual picture: LLMs can revive decayed computational pipelines and automate documentation - but without citation grounding, they accelerate the very crisis they promise to fix.
Bioinformatics Workflow Decay: 80% of Taverna Pipelines Are Dead
Before LLMs entered the picture, the reproducibility crisis was already severe at the infrastructure level. An analysis of the myExperiment repository - once the primary hub for sharing bioinformatics pipelines - found that nearly 80% of tested Taverna workflows fail to execute or reproduce their original results (DOI: 10.48550/arXiv.2511.19510). Workflows published between 2007 and 2009 exhibit failure rates exceeding 80%, and the official retirement of the Taverna system by Apache in 2020 left thousands of validated scientific methods stranded.
The root cause is rarely incorrect science. Approximately 50% of workflow decay is attributed to volatile third-party resources: unavailable web services, inaccessible databases, and unannounced API changes. Many Taverna-era pipelines relied on SOAP-based web services that have since been replaced by RESTful APIs, rendering original service calls inoperable. Missing example data, insufficient environment documentation, and incomplete parameter records compound the problem.
CodeR3 - LLM-driven workflow revival
The CodeR3 framework uses large language models to parse legacy .t2flow XML files, reconstruct the conceptual experiment, replace deprecated SOAP endpoints with modern equivalents, and translate the result into Snakemake or Python pipelines suitable for Docker containerisation. Automated revival covers 80–90% of the effort; human domain expertise validates the scientific plausibility of outputs when original ground truth no longer exists (DOI: 10.48550/arXiv.2511.19510).
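A first triage step for this kind of revival can be sketched in a few lines: scan a workflow file for service URLs and flag the ones that look like SOAP descriptors. This is a minimal illustration and not the CodeR3 implementation. The sample XML is a hypothetical, heavily simplified fragment rather than the real .t2flow schema, and the `?wsdl` / `.wsdl` suffix test is only a heuristic.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: service endpoints appear as plain URLs in element text or attributes.
URL_PATTERN = re.compile(r'https?://[^\s"<>]+')


def find_service_urls(xml_text):
    """Return each distinct URL found in the XML, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        chunks = [element.text or ""] + list(element.attrib.values())
        for chunk in chunks:
            for url in URL_PATTERN.findall(chunk):
                if url not in urls:
                    urls.append(url)
    return urls


def looks_like_soap(url):
    """Heuristic: WSDL descriptors conventionally end in '?wsdl' or '.wsdl'."""
    return url.lower().endswith(("?wsdl", ".wsdl"))


# Hypothetical workflow fragment for illustration only.
SAMPLE = """<workflow>
  <processor name="fetch_sequence">
    <wsdl>http://example.org/services/SeqFetch?wsdl</wsdl>
  </processor>
  <processor name="render" endpoint="https://example.org/api/v2/render"/>
</workflow>"""

all_urls = find_service_urls(SAMPLE)
soap_urls = [u for u in all_urls if looks_like_soap(u)]
print(soap_urls)
```

Each flagged endpoint still needs a human or model-assisted decision about which modern service, if any, replaces it.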
A parallel study found that roughly 75% of code released alongside published research papers could not run without errors (DOI: 10.48550/arXiv.2511.19510). Even when researchers shared their code and data on request, external teams successfully reproduced scientific results in only about 60% of cases - and 93% of requests for data sharing went unanswered entirely (DOI: 10.48550/arXiv.2210.02593).
Citation Fabrication: 19.9% of GPT-4o References in Biomedical Research Are Invented
The most quantified failure mode of general-purpose LLMs in biomedicine is citation fabrication. In a systematic within-domain test of GPT-4o generating literature reviews on mental health topics, 19.9% (35 of 176) of citations were entirely fabricated - no identifiable source existed in any database (PMID: 41223407). Among the remaining real citations, only 54.6% were fully accurate; specific error types included incorrect DOIs (37.8%), wrong volume numbers (30%), and erroneous issue numbers (27.9%).
The fabrication rate was not uniform. Less-studied disorders like body dysmorphic disorder and binge eating disorder showed fabrication rates of 29% and 28% respectively, compared to 6% for the well-documented major depressive disorder. Specialised prompts amplified the problem: for binge eating disorder, specialised reviews had a 46% fabrication rate versus 17% for general overviews (PMID: 41223407).
DOI integrity failure in fabricated citations
Among fabricated GPT-4o citations that included a DOI, 64% were valid DOIs that resolved to irrelevant papers - meaning the model attached real identifiers to invented claims. The remaining 36% were completely non-functional DOIs. This is worse than random noise: a valid DOI pointing to the wrong paper creates a false trail of evidence that can survive cursory verification (PMID: 41223407).
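The distinction between a dead DOI and a live DOI that points at the wrong paper can be made mechanical by comparing the cited title with the title registered for that DOI. The sketch below assumes a caller-supplied `lookup_title` function (hypothetical; in practice it would query a DOI registry) and an arbitrary similarity threshold of 0.85. The DOIs and titles in the stub registry are placeholders, not real records.

```python
from difflib import SequenceMatcher


def normalise(title):
    """Lower-case the title and collapse punctuation and whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def check_doi(cited_title, doi, lookup_title, threshold=0.85):
    """Classify a citation's DOI as matching, mismatched, or non-functional.

    lookup_title is a hypothetical callable returning the registered title
    for a DOI, or None when the DOI does not resolve.
    """
    registered = lookup_title(doi)
    if registered is None:
        return "non_functional"
    similarity = SequenceMatcher(
        None, normalise(cited_title), normalise(registered)
    ).ratio()
    return "match" if similarity >= threshold else "resolves_to_other_paper"


# Stub registry with placeholder entries, standing in for a real metadata service.
STUB_REGISTRY = {
    "10.0000/placeholder-a": "Placeholder study of protein folding kinetics",
}

print(check_doi("Placeholder Study of Protein Folding Kinetics.",
                "10.0000/placeholder-a", STUB_REGISTRY.get))
print(check_doi("Cognitive therapy outcomes in binge eating disorder",
                "10.0000/placeholder-a", STUB_REGISTRY.get))
print(check_doi("Any title", "10.0000/missing", STUB_REGISTRY.get))
```

A check that only tests whether the DOI resolves would pass the second case, which is exactly the failure the paragraph above describes.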
Probabilistic Inconsistency: Why LLMs Fail Deterministic Biomedical Tasks
Reproducibility requires determinism: the same input must produce the same output. LLMs are probabilistic by design - identical prompts can yield different responses across runs unless parameters like temperature are strictly controlled (PMID: 38722813). In regulated environments, this is disqualifying.
A direct evaluation of GPT-3.5 and GPT-4 for medical named entity recognition (NER) found that even with fixed seeds and controlled parameters, both models failed to demonstrate reproducible results. The study concluded that the lack of reproducibility, combined with the opacity of externally hosted systems, undermines the use of proprietary models in GxP-validated workflows (PMID: 39661234).
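Run-to-run reproducibility of this kind can be quantified with a simple agreement score over repeated calls on the same input. The sketch below is an illustration rather than the cited study's protocol: it assumes each run's NER output has already been collected as a set of (text, label) pairs, and the example entities are invented.

```python
from collections import Counter


def modal_agreement(outputs):
    """Fraction of runs that produced the single most common output."""
    if not outputs:
        raise ValueError("at least one run is required")
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)


def is_reproducible(outputs):
    """Strict criterion: every run returned exactly the same output."""
    return len(set(outputs)) == 1


# Hypothetical outputs from three runs of the same prompt on the same text.
# frozenset makes each run hashable and ignores entity order.
runs = [
    frozenset({("ibuprofen", "DRUG"), ("headache", "SYMPTOM")}),
    frozenset({("ibuprofen", "DRUG"), ("headache", "SYMPTOM")}),
    frozenset({("ibuprofen", "DRUG")}),  # one run dropped an entity
]

print(modal_agreement(runs), is_reproducible(runs))
```

For a validated workflow the strict criterion is the relevant one; the agreement fraction is mainly useful for reporting how far a system falls short of it.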
| Failure dimension | General-purpose LLM | Citation-grounded system (e.g. BioSkepsis) |
|---|---|---|
| Citation fabrication rate | ~20% (GPT-4o; PMID: 41223407) | 0% by design - retrieves real PMIDs only |
| Output determinism | Non-reproducible across runs (PMID: 39661234) | Anchored to retrieved literature; verifiable per run |
| Bibliographic accuracy | 45.4% of real citations contain at least one error (PMID: 41223407) | DOIs verified against PubMed at retrieval |
| Reasoning faithfulness | Gemma-7b faithfulness <0.1 (DOI: 10.48550/arXiv.2410.14399) | Claims tied to specific PMIDs; auditable |
| Harmful clinical statements | 2.4% flagged as potentially harmful (PMID: 41874150) | Expert-level verification pipeline |
Detecting LLM Confabulations in Biomedical Reasoning
LLMs do not merely make errors - they produce "confabulations," defined as claims that are wrong, arbitrary, and often scientifically plausible (PMID: 38898292). Semantic entropy methods, developed to detect these confabulations, compute uncertainty at the level of meaning rather than specific word sequences. The approach works across datasets and tasks without task-specific training data, making it applicable to novel biomedical questions.
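The core idea of scoring uncertainty over meanings rather than strings can be shown with a toy calculation: sample several answers, group those that mean the same thing, and take the entropy of the group frequencies. This is a simplified sketch under stated assumptions, not the published method. The `toy_same_meaning` check is a placeholder that only normalises strings, where a real system would need a genuine meaning-equivalence test, and the entropy here is estimated from cluster counts alone.

```python
import math


def cluster_by_meaning(answers, same_meaning):
    """Greedily group answers; each answer joins the first cluster it matches."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(answers, same_meaning):
    """Shannon entropy (in nats) of the distribution over meaning clusters."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    clusters = cluster_by_meaning(answers, same_meaning)
    probs = [len(cluster) / len(answers) for cluster in clusters]
    return sum(-p * math.log(p) for p in probs)


def toy_same_meaning(a, b):
    """Placeholder equivalence: identical after dropping case and punctuation."""
    strip = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return strip(a) == strip(b)


consistent = ["BRCA1", "brca1.", "Brca1"]        # same meaning, different strings
scattered = ["BRCA1", "TP53", "EGFR", "KRAS"]    # four different meanings

print(semantic_entropy(consistent, toy_same_meaning))
print(semantic_entropy(scattered, toy_same_meaning))
```

Answers that differ only in wording collapse into one cluster and score zero, while answers that disagree in substance score high, which is the signal used to flag a likely confabulation.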
However, the reasoning gap extends beyond text generation. The SylloBio-NLI benchmark tested LLMs on formal syllogistic reasoning with biomedical content and found that the Gemma-7b model achieved a faithfulness score below 0.1 - meaning it almost never adjusted its predictions appropriately when the truth value of a premise was altered (DOI: 10.48550/arXiv.2410.14399). Zero-shot accuracy on complex tasks like molecular cloning scenarios remained near or below random guessing (DOI: 10.48550/arXiv.2407.10362).
Multi-agent systems amplify - not reduce - hallucination risk
Multi-agent systems (MAS) can improve diagnostic accuracy in oncology from 30.3% to 87.2%, but at a cost: token consumption rises 15–50× compared to standalone models, and initial hallucinations can cascade across the agent collective. In a gastrointestinal oncology evaluation, 2.4% of MAS statements were flagged as potentially harmful by human experts (PMID: 41874150).
Emerging Safeguards: From RAG Pipelines to Standardised Reporting
The research community is converging on a set of complementary safeguards. Retrieval-Augmented Generation (RAG) architectures force models to link claims to verifiable primary sources - PMIDs and DOIs - rather than relying on parametric memory. This approach underlies tools that automate the generation of BioCompute Objects (BCOs) from published papers and code repositories, reducing the overhead of retroactive compliance with documentation standards (DOI: 10.48550/arXiv.2409.15076).
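The grounding constraint can be enforced after generation with a simple audit: every PMID cited in the draft must belong to the set the retriever actually returned, and sentences with no citation are surfaced for review. The sketch below is one possible check, not a description of any specific tool. The function name, the naive sentence splitter, and the placeholder PMID `00000000` are all assumptions made for illustration.

```python
import re

PMID_PATTERN = re.compile(r"PMID:\s*(\d+)")


def audit_citations(generated_text, retrieved_pmids):
    """Report cited PMIDs missing from the retrieved set, and uncited sentences."""
    # Naive splitter: breaks after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", generated_text)
                 if s.strip()]
    ungrounded = []
    uncited = []
    for sentence in sentences:
        cited = PMID_PATTERN.findall(sentence)
        if not cited:
            uncited.append(sentence)
        for pmid in cited:
            if pmid not in retrieved_pmids and pmid not in ungrounded:
                ungrounded.append(pmid)
    return {"ungrounded_pmids": ungrounded, "uncited_sentences": uncited}


retrieved = {"38898292"}  # identifiers returned by the retrieval step
draft = (
    "Semantic entropy scores uncertainty over meanings (PMID: 38898292). "
    "Most legacy workflows no longer run (PMID: 00000000). "
    "Grounding solves everything."
)

report = audit_citations(draft, retrieved)
print(report)
```

This catches identifiers the model produced from parametric memory; it does not confirm that a grounded source actually supports the sentence citing it, which still requires reading the source.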
A seven-step "Safe and Transparent" workflow for LLM-assisted clinical trials mandates expert sign-off at checkpoints for literature selection and statistical verification (PMID: 41111869). Proposed reporting extensions - COREQ+LLM for qualitative research (PMID: 40991937) and PRISMA-AI for systematic reviews - aim to require disclosure of model versions, temperature settings, and prompting strategies so that AI-assisted syntheses become auditable.
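The disclosure items named above (model version, temperature, prompting strategy, expert sign-off) lend themselves to a machine-readable record stored alongside each AI-assisted output. The sketch below shows one hypothetical shape for such a record. The field names are illustrative and are not taken from COREQ+LLM, PRISMA-AI, or the seven-step workflow, and the model name is a placeholder.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LLMUsageRecord:
    """Hypothetical disclosure record for one LLM-assisted step."""
    model_name: str
    model_version: str
    temperature: float
    prompt: str
    checkpoint: str          # e.g. "literature selection"
    reviewer_signoff: str    # expert who approved the output at this checkpoint

    def to_json(self):
        payload = asdict(self)
        # A hash lets auditors confirm the exact prompt without ambiguity.
        payload["prompt_sha256"] = hashlib.sha256(
            self.prompt.encode("utf-8")
        ).hexdigest()
        return json.dumps(payload, indent=2, sort_keys=True)


record = LLMUsageRecord(
    model_name="example-model",
    model_version="2025-01-01",
    temperature=0.0,
    prompt="Summarise the included trials on the primary outcome.",
    checkpoint="literature selection",
    reviewer_signoff="Reviewer A",
)

print(record.to_json())
```

Keeping the record immutable (`frozen=True`) and serialising with sorted keys makes two records easy to diff when a synthesis is re-run.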
At the infrastructure level, the Model Context Protocol (MCP) provides a standardised semantic layer that allows LLMs to query fragmented bioinformatics web services reliably. Implementations across GEO, STRING, and UCSC Cell Browser demonstrate that MCP can operationalise FAIR principles for autonomous agents (PMID: 41729821). Meanwhile, scholars advocate for open-source LLM infrastructure to prevent proprietary API costs from creating academic "caste systems" (PMID: 38722813).
Snakemaker - non-invasive pipeline tracking
Snakemaker leverages generative AI to non-invasively track terminal activity and convert ad-hoc analysis scripts into sustainable, modular Snakemake pipelines. The tool lowers the activation energy required to move from prototype to production-quality code, addressing the 75% failure rate of published research code (DOI: 10.48550/arXiv.2409.15076).
The Research Landscape: From BioBERT to Autonomous Lab-Pilots
The scientific landscape has evolved through three distinct phases. The early phase (2019–2022) was dominated by domain-specific pre-training: BioBERT demonstrated that masked language modelling on PubMed abstracts was essential for effective named entity recognition (PMID: 31501885), and BioGPT established records in relation extraction by generating natural-language triplets rather than structured text (PMID: 36156661).
The stable phase (2022–2024) shifted to zero-shot clinical reasoning and standardised benchmarking. The Med-PaLM study reported that Flan-PaLM, a 540-billion-parameter model, reached state-of-the-art accuracy of 67.6% on MedQA (USMLE-style questions) - surpassing the prior state of the art by more than 17 percentage points (PMID: 37438534). This period also consolidated the hallucination problem: systematic reviews catalogued ethical, copyright, transparency, and legal concerns alongside the risk of bias and fabricated content (PMID: 36981544).
The emerging phase (2025–present) focuses on "agentic" systems and the scAInce paradigm. Research is entering a "co-pilot to lab-pilot" transition in which AI no longer merely interprets knowledge but increasingly generates and executes it (PMID: 40951330). The MCPmed initiative provides a standardised, machine-actionable layer for bioinformatics web services, preparing the infrastructure for next-generation research agents (PMID: 41729821).
Who Benefits from Citation-Grounded Biomedical AI
Systematic reviewers and meta-analysts
Citation grounding eliminates the 19.9% fabrication rate at source. Every claim is tied to a verifiable PMID or DOI, and automated verification rounds flag misattributed references before they enter the review. General-purpose LLMs require manual cross-checking of every reference - a process that scales poorly across hundreds of citations.
Bioinformatics researchers and pipeline maintainers
RAG-grounded synthesis paired with standardised documentation (BCOs, Snakemake) helps researchers move from decayed workflows to reproducible, containerised pipelines. BioSkepsis anchors every analytical claim to the literature, making the documentation audit trail verifiable rather than aspirational.
Journal editors and peer reviewers
Automated citation verification detects the DOI-integrity failures that general-purpose LLMs introduce - the 64% of fabricated DOIs that resolve to irrelevant papers. BioSkepsis provides an auditable evidence chain from claim to source, reducing the burden on reviewers screening AI-assisted submissions.
Frequently asked questions
What percentage of LLM-generated citations in biomedical research are fabricated?
In a systematic test of GPT-4o across mental health research topics, 19.9% (35 of 176) of citations were entirely fabricated. Among those that were not fabricated, only 54.6% were fully accurate - the rest contained errors in DOIs, volume numbers, or issue numbers (PMID: 41223407).
How many legacy bioinformatics workflows still execute successfully?
Approximately 80% of Taverna bioinformatics workflows in the myExperiment repository fail to execute or reproduce original results. The primary causes are volatile third-party web services, API drift from SOAP to REST, and missing contextual documentation (DOI: 10.48550/arXiv.2511.19510).
Can LLMs produce reproducible outputs for regulated biomedical tasks?
No - not reliably. Both GPT-3.5 and GPT-4 failed to produce reproducible results in medical named entity recognition tasks, even when using fixed seeds and controlled parameters. This probabilistic inconsistency undermines their use in GxP-validated systems (PMID: 39661234).
What is the "scAInce" paradigm in life-science research?
scAInce describes a shift in which scientific practice is optimised for machine interpretability and rigour. It encompasses the transition from LLMs as "co-pilots" to "lab-pilots" - systems that not only interpret knowledge but increasingly generate and execute experiments autonomously (PMID: 40951330).
How does Retrieval-Augmented Generation reduce LLM hallucination in biomedicine?
RAG architectures force models to link every claim to a verifiable primary source (PMID or DOI), rather than relying on the model's internal parameterised memory. This anchors outputs to the published literature and makes fabrication detectable through automated verification.
What reporting standards are emerging for AI-assisted biomedical research?
Extensions such as COREQ+LLM mandate the disclosure of model versions, temperature settings, and prompting strategies used in qualitative research. A seven-step Safe and Transparent workflow proposes expert sign-off at checkpoints for literature selection and statistical verification (PMID: 40991937; PMID: 41111869).
How does BioSkepsis address the citation fabrication problem differently from general-purpose LLMs?
BioSkepsis uses a citation-grounded RAG pipeline that retrieves real PMIDs and DOIs from the primary literature, then runs automated verification rounds to flag unverified or misattributed references before they reach the user. General-purpose LLMs generate citations from parametric memory, producing the ~20% fabrication rate documented in the literature.
Stop Citing Hallucinations - Ground Your Biomedical Research
BioSkepsis retrieves real PMIDs, verifies every citation, and flags misattributions before they reach your manuscript. Replace the ~20% fabrication rate with zero-fabrication, auditable evidence synthesis.
Sources & further reading
- Linardon J, Jarman HK, McClure Z, et al. Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models. JMIR Ment Health. 2025;12:e80371. PMID: 41223407
- Farquhar S, Kossen J, Kuhn L, Gal Y. Detecting hallucinations in large language models using semantic entropy. Nature. 2024;630(8017):625-630. PMID: 38898292
- Bail CA. Can Generative AI improve social science? Proc Natl Acad Sci U S A. 2024;121(21):e2314021121. PMID: 38722813
- Dietrich J, Hollstein A. Performance and Reproducibility of Large Language Models in Named Entity Recognition. Drug Saf. 2024;48(3):287-303. PMID: 39661234
- Frutuoso J. Building a Safe and Transparent Workflow for LLM-Assisted Clinical Trials. Cureus. 2025;17(9):e92571. PMID: 41111869
- Hartung T. AI, agentic models and lab automation for scientific discovery - the beginning of scAInce. Front Artif Intell. 2025;8:1649155. PMID: 40951330
- Fehring L, Frings J, Rust P, et al. Extension of the Consolidated Criteria for Reporting Qualitative Research Guideline to Large Language Models (COREQ+LLM). JMIR Res Protoc. 2025;14:e78682. PMID: 40991937
- Flotho M, Diks IF, Flotho P, et al. MCPmed: a call for Model Context Protocol-enabled bioinformatics web services. Brief Bioinform. 2026;27(1). PMID: 41729821
- Spieser J, Balapour A, Meller J, Patra KC, Shamsaei B. A Review of Multi-Agent AI Systems for Biological and Clinical Data Analysis. Methods Protoc. 2026;9(2):33. PMID: 41874150
- Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model. Bioinformatics. 2020;36(4):1234-1240. PMID: 31501885
- Luo R, Sun L, Xia Y, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23(6). PMID: 36156661
- Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180. PMID: 37438534
- Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review. Healthcare (Basel). 2023;11(6):887. PMID: 36981544
- CodeR3: Automated Revival of Legacy Bioinformatics Workflows. arXiv. DOI: 10.48550/arXiv.2511.19510
- RAG-Driven BioCompute Object Generation. arXiv. DOI: 10.48550/arXiv.2409.15076
- SylloBio-NLI: Evaluating Biomedical Syllogistic Reasoning. arXiv. DOI: 10.48550/arXiv.2410.14399
- LAB-Bench: Benchmarking LLMs on Practical Life-Science Tasks. arXiv. DOI: 10.48550/arXiv.2407.10362
- Reproducibility and Reusability of Scientific Artifacts. arXiv. DOI: 10.48550/arXiv.2210.02593