Reproducibility in Biomedical AI: A Three-Run Test of BioSkepsis, and How to Use It

June 17, 2026

Reviewed

We asked BioSkepsis the same mechanistic question three times: what is the evidence that SGLT2 inhibitors protect the heart independent of glucose lowering? Then we measured what stayed the same and what changed. The mechanisms, the foundational papers, and every verified PMID held; the ranked list order and the specific per-claim citations drifted. What follows is that result, reported as it came out, and a practical guide to working with it.

Why reproducibility matters for biomedical literature AI

An evidence tool that gives a different answer every time is hard to trust and impossible to cite. But for a synthesis engine running over a living literature, perfect run-to-run identity is neither achievable nor, by itself, the right target. What matters is which layers are stable. A stable conclusion built on a fixed foundational base, with verifiable citations, is reproducible in the sense a scientist actually needs, even if the exact list of supporting papers shifts.

So we measured BioSkepsis layer by layer rather than as a single pass-or-fail. We ran one question three times on the Pro tier, two sessions back to back and a third several hours later, and compared the output along seven dimensions. The query was held constant: what is the mechanistic evidence that SGLT2 inhibitors confer cardiovascular benefit independent of glucose lowering, with key supporting papers and PMIDs?

Results: what stayed stable and what drifted across runs

The table below summarises every dimension we measured. A general-purpose LLM run the same way typically gives a different prose answer each time with no fixed evidence base and no way to audit which sources are real; the contrast is the point.

Reproducibility by layer across three SGLT2 cardioprotection runs
Dimension | Stability | What we observed
PMID correctness | Stable | Every checked PMID across all three runs resolved to a real PubMed record matching the claim, including 2026 papers.
Full-text download status | Stable | Deterministic by publisher. No paper changed downloadable/failed status between runs.
Mechanistic conclusions | Stable | Same five pillars every run: SGLT2-independence, NHE1 ion homeostasis, metabolic reprogramming, anti-inflammatory action, hemodynamic/sympathetic effects.
Foundational-paper core | Stable | Six papers appeared in the foundational panel of all three runs.
Full corpus membership | Moderate | Core stable; margins differed. 126 vs 113 vs 119 papers retrieved.
Per-claim citations | Variable | Only three PMIDs were cited in the answer text of all three runs; the rest of the supporting set differed.
Ranked list order | Weak | No two runs shared a top-of-list ordering.
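
If you want a number behind labels like "Stable" or "Variable" for your own runs, one option is the mean pairwise Jaccard similarity of the sets each run produced. This is a minimal sketch of that idea, not the scoring we used for the table above; the function names and the placeholder identifiers are assumptions for illustration only.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared items divided by all distinct items; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs: list[set[str]]) -> float:
    """Average the Jaccard similarity over every pair of runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder identifiers: not real PMIDs and not data from this test.
runs = [
    {"p1", "p2", "p3", "p4"},
    {"p1", "p2", "p3", "p5"},
    {"p1", "p2", "p3", "p6"},
]
score = mean_pairwise_jaccard(runs)
print(f"{score:.2f}")  # 0.60: each pair shares 3 of 5 distinct items
```

Computed separately for the foundational panel, the full corpus, and the per-claim citations, the same measure lets you compare layers on one scale.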

The stable mechanistic core, every run

All three runs converged on direct cardiac NHE1 inhibition as the central glucose-independent mechanism, anchored by the same papers showing empagliflozin lowers myocardial cytoplasmic Na+ (PMID 27752710) as a class effect across empagliflozin, dapagliflozin, and canagliflozin (PMID 29197997), with benefit preserved in SGLT2-knockout mice via an NHE1-nitric oxide pathway (PMID 39046464).

The variable layer: which supporting papers get named

Run 1 emphasised CGRP-mediated vasodilation, erythropoiesis, and epicardial adipose tissue. Run 2 surfaced a STAT1-STING senescence axis (PMID 39044275) and central sympathoinhibition via the hypothalamus (PMID 41658515). Run 3 introduced late sodium current and NaV1.5, and Parkin-independent mitophagy. Same conclusions, different evidence drawn from a deep redundant literature.

The foundational-paper layer is the most stable biomedical signal

BioSkepsis computes its foundational-papers panel from how often papers are co-cited across the full retrieved corpus, not from which papers the language model happens to name in prose. Because co-citation structure is a stable property of the literature itself, this layer reproduces well at its core.
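
To make the idea of co-citation concrete: two papers are co-cited when a third paper lists both in its references. The sketch below counts, for each cited paper, how often it appears alongside another cited paper across a corpus. It is an illustration of the general technique under our own assumptions, not the BioSkepsis implementation; `cocitation_strength` and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_strength(reference_lists: dict[str, list[str]]) -> Counter:
    """For each cited paper, count its co-citation pairings over the corpus.

    reference_lists maps a citing paper to the papers it references.
    """
    strength: Counter = Counter()
    for refs in reference_lists.values():
        # Every unordered pair in one reference list is one co-citation.
        for a, b in combinations(sorted(set(refs)), 2):
            strength[a] += 1
            strength[b] += 1
    return strength


# Toy corpus with placeholder labels, not real papers.
corpus = {
    "citing_1": ["A", "B", "C"],
    "citing_2": ["A", "B"],
    "citing_3": ["A", "D"],
}
panel = cocitation_strength(corpus).most_common(2)
print(panel)  # [('A', 4), ('B', 3)]
```

Because the score depends on the reference lists of the whole corpus rather than on generated prose, heavily co-cited papers keep rising to the top even when a few papers at the margin of the corpus change.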

Six papers appeared in the foundational panel of every run: the two mechanistic anchors (PMID 27752710, PMID 29197997) and the four landmark outcome trials, DECLARE-TIMI 58 (PMID 30415602), DAPA-HF (PMID 31535829), EMPEROR-Reduced, and EMPA-REG OUTCOME (PMID 26378978). For a question about glucose-independent cardioprotection, that is the correct anchor set, and it held three times. What varied was the periphery: which secondary trials and mechanistic reviews rounded out the panel, tracking the slightly different corpus each run retrieved. The co-citation counts themselves shift between runs, so treat the panel as a stable membership of core papers, not an identical ranking.

Full-text availability is deterministic, not random

BioSkepsis reads the full text of open-access papers to ground its answers. A fair question is whether the same paper is read in one run but skipped in another. In this test it was not. Across papers that appeared in more than one run, none changed status. Open-access papers downloaded every time; paywalled papers failed every time with the same reason; open-access papers whose publishers block automated retrieval stayed blocked every time.

One caveat for honest reading of the numbers: the "failed" count mixes three different causes that should not be conflated. Genuine paywall is true unavailability. "Open access but the publisher blocked automatic retrieval" means the content exists but the crawler was refused. And "too many papers for your plan" is a tier quota, not a retrieval failure at all. A headline failure count overstates true unavailability if these are lumped together.
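
One way to keep the three causes apart is to tally them separately before reporting any headline figure. The sketch below assumes you have recorded a cause for each failed paper yourself; the `FailureCause` names and the example counts are hypothetical and are not figures from this test.

```python
from collections import Counter
from enum import Enum


class FailureCause(Enum):
    PAYWALL = "paywall"
    CRAWLER_BLOCKED = "open access, automated retrieval blocked"
    PLAN_QUOTA = "too many papers for plan"


def summarise_failures(causes: list[FailureCause]) -> dict[str, int]:
    """Split a headline failure count into its three distinct causes."""
    counts = Counter(causes)
    return {
        "headline_failed": len(causes),
        "truly_unavailable": counts[FailureCause.PAYWALL],
        "exists_but_blocked": counts[FailureCause.CRAWLER_BLOCKED],
        "plan_quota": counts[FailureCause.PLAN_QUOTA],
    }


# Hypothetical log of failure causes for one run.
failures = (
    [FailureCause.PAYWALL] * 4
    + [FailureCause.CRAWLER_BLOCKED] * 3
    + [FailureCause.PLAN_QUOTA] * 5
)
summary = summarise_failures(failures)
print(summary)
```

In this made-up example the headline count is 12, but only 4 papers are truly unavailable.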

Honest pros and cons of this behaviour

This design has real strengths and real limits. Both belong in the open.

Strengths

  • Conclusions are stable: you get the same mechanistic answer to the same question, run to run.
  • Citations are real and auditable: every checked PMID resolved correctly, with sentence-level grounding to the source passage.
  • The verification layer is reproducible: across runs it consistently excluded contradictory and overreaching papers, including one paper that argues empagliflozin does not inhibit NHE1, correctly kept out of support.
  • Full-text availability is deterministic, so a paper's evidence either contributes consistently or not at all.
  • The foundational core reproduces, giving a fixed anchor set for a field.

Limits

  • The specific per-claim citation list varies between runs; two runs can produce two different reference lists for the same conclusion.
  • The ranked order of retrieved results is not reproducible, so list position should not be read as a fixed importance score.
  • The full corpus is curated, not exhaustive; this complements, and does not replace, a comprehensive database search for a formal systematic review.
  • Some open-access papers cannot be auto-retrieved because publishers block crawling, which narrows what full text can be read in a given run.

A guide to reproducible biomedical evidence retrieval, run to run

The variability above is manageable once you know which layer to trust. These practices apply to BioSkepsis specifically, and most apply to any AI literature tool.

Step 1: Anchor on the foundational panel, not the list order

Treat the foundational-papers panel and the verified citations as your stable signal. Do not treat the ranked order of the retrieved list as an importance score; it shifts between runs and is the weakest layer.

Step 2: Run the query two or three times and take the union

Because a redundant literature supports the same conclusion through different papers, a second or third run surfaces additional valid supporting evidence. Take the union of cited PMIDs across runs for the most complete, defensible reference set, and note which papers recur in all runs as your highest-confidence core.
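
The union and the recurring core are plain set operations once you have the cited PMIDs from each run. A minimal sketch, assuming you have copied each run's cited PMIDs into a set by hand; `combine_runs` is our own helper name and the identifiers are placeholders, not PMIDs from this test.

```python
def combine_runs(runs: list[set[str]]) -> tuple[set[str], set[str]]:
    """Return (union, core): every PMID cited in any run, and those cited in all."""
    union = set().union(*runs)
    core = set.intersection(*runs) if runs else set()
    return union, core


# Placeholder identifiers standing in for the PMIDs cited in each run.
run_1 = {"1001", "1002", "1003", "1004"}
run_2 = {"1001", "1002", "1003", "1005"}
run_3 = {"1001", "1002", "1003", "1006", "1007"}

union, core = combine_runs([run_1, run_2, run_3])
print(sorted(union))  # the fullest reference set
print(sorted(core))   # the highest-confidence recurring papers
```

Papers in `union` but not in `core` are still valid support; they simply surfaced in only some runs.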

Step 3: Read the unverified-citations panel

BioSkepsis lists papers it excluded and why. This is where the engine shows its working: a paper removed for contradicting the claim, or for not reporting the specific data asserted. Reading it tells you how conservative the answer is and surfaces genuine counter-evidence worth checking.

Step 4: Export the corpus to fix your reference set

Once you have a run you trust, export the citations (RIS, BibTeX, CSV, or Zotero sync). Exporting freezes the evidence base for downstream writing, so run-to-run variability no longer affects the manuscript you build on it.
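
If you also want to show later that the frozen set has not changed, you can keep a checksum next to the export. The sketch below does not use the BioSkepsis export itself; it assumes you have the PMIDs as plain strings and writes its own one-column CSV. The helper name `freeze_reference_set` is hypothetical.

```python
import csv
import hashlib
import io


def freeze_reference_set(pmids: list[str]) -> tuple[str, str]:
    """Return (csv_text, sha256_hex) for a de-duplicated, sorted PMID list."""
    ordered = sorted(set(pmids), key=int)  # numeric order, independent of input order
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["pmid"])
    writer.writerows([pmid] for pmid in ordered)
    text = buffer.getvalue()
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, digest


# Two PMIDs cited in this article, given in two different orders.
text_a, digest_a = freeze_reference_set(["27752710", "26378978"])
text_b, digest_b = freeze_reference_set(["26378978", "27752710", "27752710"])
print(text_a)
print(digest_a == digest_b)  # True: same set, same checksum
```

Record the checksum in your lab notebook or manuscript repository; any later change to the reference set changes it.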

Step 5: Verify the load-bearing PMIDs yourself

Every PMID we checked resolved correctly, but good practice is to open the two or three citations a claim actually rests on and confirm them in PubMed before you publish. The sentence-level links make this fast, and it is the habit that separates defensible synthesis from blind trust.
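
For more than a handful of PMIDs, the lookup can be scripted against the public NCBI E-utilities ESummary endpoint. The sketch below is independent of BioSkepsis. It assumes the JSON layout ESummary returns for PubMed, a "result" object keyed by PMID with a "title" field and an "error" field for unknown IDs; check that against the current NCBI documentation before relying on it. The helper names are our own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(pmids: list[str]) -> str:
    """Build an ESummary request for a batch of PMIDs."""
    query = urlencode({"db": "pubmed", "id": ",".join(pmids), "retmode": "json"})
    return f"{ESUMMARY}?{query}"


def titles_from_esummary(payload: dict, pmids: list[str]) -> dict:
    """Map each PMID to its title, or None when no usable record came back."""
    result = payload.get("result", {})
    titles = {}
    for pmid in pmids:
        record = result.get(pmid)
        if not record or "error" in record:
            titles[pmid] = None  # assumed shape for an unknown PMID
        else:
            titles[pmid] = record.get("title")
    return titles


def fetch_titles(pmids: list[str]) -> dict:
    """Network call: fetch summaries and return PMID -> title."""
    with urlopen(esummary_url(pmids), timeout=30) as response:
        payload = json.load(response)
    return titles_from_esummary(payload, pmids)


# Usage (requires network access):
# print(fetch_titles(["27752710", "29197997"]))
```

The script only confirms that a record exists and shows its title. Whether the paper actually supports the claim is still a judgment you make by reading it.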

Frequently asked questions

Does BioSkepsis return identical output every time I run the same query?

No. The mechanistic conclusions, the foundational-paper core, and the citation-verification behaviour are stable across runs. The full retrieved corpus, the ranked order of results, and the specific papers cited for each claim vary, because the targeted sub-queries are regenerated each run and the literature is large and redundant.

Were any of the cited PMIDs fabricated?

No. In an independent check against PubMed of a sample spanning all three runs, every PMID resolved to a real record whose title, authors, and findings matched the claim it was attached to, including very recent 2026 articles.

Why does the list of papers change between runs if the answer is the same?

Because a mature evidence base is redundant: many papers support the same mechanism. Each run regenerates its targeted sub-queries, so it surfaces a different valid subset of supporting evidence for the same conclusion rather than a different conclusion.

Is full-text availability reproducible across runs?

Yes. Downloadability is deterministic by publisher. An open-access paper downloads every run; a paywalled paper fails every run; a crawler-blocked open-access paper stays blocked. No paper flipped status between runs in this test.

How do I get the most reproducible answer from BioSkepsis?

Anchor on the foundational-papers panel and the verified citations rather than transient list order, run the query two or three times and take the union of cited PMIDs, read the unverified-citations panel to see what was excluded and why, and export the corpus so your reference set is fixed for downstream work.

What is the single most stable signal?

Two are tied: PMID correctness and full-text availability. Neither varied at all across the three runs. The mechanistic conclusions and the six-paper foundational core are close behind.

Run the SGLT2 test yourself on BioSkepsis

Open the three research threads listed under Sources & further reading, then ask the same question on your own account and compare the foundational core, the verified PMIDs, and the excluded citations. The free tier is enough to reproduce this test.

Sources & further reading

  1. Baartscheer A, et al. Empagliflozin decreases myocardial cytoplasmic Na+ through inhibition of the cardiac Na+/H+ exchanger in rats and rabbits. Diabetologia, 2016. PMID: 27752710.
  2. Uthman L, et al. Class effects of SGLT2 inhibitors in mouse cardiomyocytes and hearts: inhibition of Na+/H+ exchanger, lowering of cytosolic Na+ and vasodilation. Diabetologia, 2017. PMID: 29197997.
  3. Chen S, et al. Empagliflozin prevents heart failure through inhibition of the NHE1-NO pathway, independent of SGLT2. Basic Research in Cardiology, 2024. PMID: 39046464.
  4. Wiviott SD, et al. Dapagliflozin and Cardiovascular Outcomes in Type 2 Diabetes (DECLARE-TIMI 58). NEJM, 2019. PMID: 30415602.
  5. McMurray JJV, et al. Dapagliflozin in Patients with Heart Failure and Reduced Ejection Fraction (DAPA-HF). NEJM, 2019. PMID: 31535829.
  6. Zinman B, et al. Empagliflozin, Cardiovascular Outcomes, and Mortality in Type 2 Diabetes (EMPA-REG OUTCOME). NEJM, 2015. PMID: 26378978.
  7. Mourad O, et al. Single cell transcriptomic analysis of SGLT2 expression supports an indirect or off-target role for the cardioprotective benefits of empagliflozin in heart failure. Scientific Reports, 2025. PMID: 40065073.
  8. Voorrips SN, et al. Myocardial ketone body oxidation contributes to empagliflozin-induced improvements in cardiac contractility in murine heart failure. European Journal of Heart Failure, 2025. PMID: 40069113.
  9. Run 1 research thread: app.bioskepsis.ai/research/reproducibility-test1-mechanistic-evidence-that-sglt2/AEpnDzra4v08XoxQSQoFig
  10. Run 2 research thread: app.bioskepsis.ai/research/reproducibility-test2-mechanistic-evidence-that-sglt2/hubbedUzDZA-DqlWxVRIRA
  11. Run 3 research thread: app.bioskepsis.ai/research/reproducibility-test3-mechanistic-evidence-that-sglt2/4okGfQJXe387tCLUdvQX4w