Jonathan Jacobs

Reproducible Does Not Mean Correct

Jonathan Jacobs — Fri, 12 Jun 2026 01:41:02 GMT

First off - thanks to everyone who read and restacked my first article on Substack last week (Metadata as Infrastructure). Over a decade ago I used to blog quite often, and it’s honestly pretty refreshing to be getting back into it. Thanks for your support and subs. 🙏

Science has spent the last several years talking a lot about reproducibility. Fair enough. It matters. If someone publishes a result and nobody else can reproduce it, that’s obviously a problem. Then there’s a whole separate issue where someone tries and fails to reproduce the results of a published work, but they themselves fail to report it.

But in genomics (and “omics” more broadly), I think there’s a slightly more uncomfortable problem sitting right in front of us: you can perfectly reproduce the same results as a previously published work but still be wrong.

I think this distinction matters more than most people realize, and it also has huge implications for the future of the AIxBio space.

To clarify a bit here - a lot of the discussion around reproducibility tends to flatten a few concepts that are not actually the same thing. Let me count the ways…

Reproducibility means you can rerun the same analysis on the same data and get the same answer. Win! Everyone is happy, right? Not necessarily.
Replicability means an independent effort more or less lands in the same place. Like, another lab orders the same materials, tries to do the same experiment, and (more or less) sees the same thing or gets the same results. Most scientists would also count this as a Win!
Correctness, however, is a different question entirely. Correctness is like asking: is the underlying biological claim actually true?

My recent pre-print [1] on genomics data provenance and accuracy goes into this in a more formal way, but essentially I’m saying that in genomics, reproducibility is necessary, but it is absolutely not sufficient to assert the correctness of any particular biological finding. I’ve started thinking about this as a sort of “reproducible wrongness”.

That’s when a workflow is technically stable, computationally recoverable, maybe even elegant, but is still wrong. It could be wrong for a lot of reasons too: the input data was borked, the reference data was contaminated, the underlying assumptions of the tool were just wrong, etc. (call me old fashioned, but I personally love elegant computational biology algorithms that are now useless - have you heard of Nussinov’s 1978 paper on RNA folding? It’s gorgeous, but just wrong [2]). Essentially, when this happens you can get the same answer over and over again, not because you found truth, but because you built a system that is consistently engineered to produce the wrong result. +1 Win for Reproducibility! But also a +1 Fail for Science.

Surprisingly, this is not a hypothetical edge case either. It happens all the time. A really good recent example is the cancer microbiome mess that many of you may have seen in Substack feeds or socials recently. If you missed it, here’s the TLDR:

A 2020 paper in Nature reported that many cancer types had distinct microbial signatures and that machine-learning models could classify cancers with extremely high accuracy using those signals [3]. It was a big deal. The paper got a lot of attention, it was cited widely, other groups reused the data (🤦‍♂️), and it even helped justify commercial enthusiasm around microbiome-based cancer diagnostics. On the surface, this looked like exactly the kind of “AI finds hidden biology!” story people (and investors) LOVE to hear.

Enter Steven Salzberg. (I can almost hear the boxing ring bell…)

Now, if you’re not familiar with Salzberg’s work - you should be. He’s not only one of my heroes of bioinformatics, but he’s also made it kind of a hobby to publish papers that throw shade on the work of other scientists for being - well… just sloppy. To name a few, there was a 2014 paper showing that tons of microbial “reference” genomes had pieces of cow and sheep genomes in them (among others) [4]. He must have been like “Oh wow, this is kind of fun… people hate this shit…” because later came a 2018 paper showing how microbial reference genome databases were still massively contaminated with eukaryotic genome sequences [5]. Then came a 2020 paper showing that over 2 MILLION entries in GenBank were contaminated [6]. There are others, but I’ll stop there.

Regarding the Microbiome = Cancer Diagnostic BS paper above… Salzberg minced no words once he took a look. In mBio, Gihawi, Salzberg, and colleagues went back through the data in Rob Knight’s paper and argued that the original result was built on two major problems [7]. First, a lot of the reads being called “microbial” were apparently human reads that slipped through filtering and then matched contaminated (AGAIN! didn’t they fix this? Sheesh) draft microbial genomes in the reference database. Second, the normalization process introduced artificial cancer-type-specific patterns that ML models could then exploit. In other words, the classifier did not fail to produce a result; it produced a very strong result from the wrong signal. Go follow Ref 7 and read the full paper… it’s glorious TBH. It’s also almost a perfect illustration of what I mean here by “reproducible wrongness”.

The issue here was that the workflow was stable enough to generate a compelling answer from contaminated inputs and computational artifacts. And because the answer looked clean and high-performing, other people built on it.

Think about this for a minute…

17 authors
multiple peer reviewers reviewing for Nature
at least one Editor

all got it wrong. It wasn’t until a separate set of scientists tried to reproduce the results and took a careful look at HOW those results were produced that it became clear the result was reproducible, but incorrect.

And again, this is just one example, but unfortunately there are many similar stories. I won’t burden this post with all of them (I maintain a list TBH…), but the HeLa cell story is another one you should be aware of - potentially 20,000+ peer reviewed papers were invalidated by the discovery that HeLa cells had widely contaminated other cell lines being passed around between labs (instead of being sourced from a verified culture collection…) [8].

Here’s the part I think the genomics community still struggles to say out loud:

Getting the same answer twice is not the same as getting the biology right.

So, when we’re all high on the power of AIxBio, building incredible models on (in some cases) “trillions” of genes, can we also please just pump the brakes a bit and ask: but does this actually make sense? Does the result fit the biology? Where did the training data come from? How was it produced? Is this data what we think it is? [9] How can we test this using an orthogonal, independent method to really determine if we have both a reproducible AND a correct result?

Sometimes the sample is wrong. It was mislabeled. It was contaminated. It was a derivative passed around five labs instead of the original material. It was taxonomically outdated! (Hello Scott - are you reading this?) It was missing enough provenance that nobody can say with confidence what it actually was.

Sometimes the reference layer is wrong. Public archives are full of indispensable data, but don’t be lulled into the comfort zone and start thinking they are some kind of magical truth engine. They are not. If a reference genome is contaminated, misclassified, or obsolete, then every downstream alignment, taxonomic classifier, or benchmark using it can become wrong in a highly reproducible way. Please repeat that out loud 10 times.
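To make that concrete, here is a minimal toy sketch of how a contaminated reference produces the same wrong call every single time. Everything in it is made up for illustration: the sequences, the names `microbe_X` and `microbe_Y`, the `classify` helper, and the tiny k-mer length of 5. It is not how any real classifier is implemented, just the general shape of the failure.

```python
K = 5  # toy k-mer length (assumption); real classifiers use much longer k-mers


def kmers(seq, k=K):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(read, references):
    """Assign a read to the reference sharing the most k-mers with it."""
    read_kmers = kmers(read)
    scores = {name: len(read_kmers & kmers(ref)) for name, ref in references.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"


# Hypothetical toy sequences, invented for this example.
human_fragment = "ACGTTGCAAGGCTTAACG"
microbe_x_clean = "TTGACCGGTATCGGATCCAAGT"
microbe_y_clean = "GGCATTCCGATAGCTAGGAC"

# A "draft genome" of microbe_X that accidentally includes human sequence.
microbe_x_contaminated = microbe_x_clean + human_fragment

contaminated_db = {"microbe_X": microbe_x_contaminated, "microbe_Y": microbe_y_clean}
clean_db = {"microbe_X": microbe_x_clean, "microbe_Y": microbe_y_clean}

# A human read that slipped through host filtering.
read = human_fragment[2:16]

# Rerun as often as you like: the answer is stable, and stably wrong.
calls = [classify(read, contaminated_db) for _ in range(5)]
print(calls)                      # five identical "microbe_X" calls
print(classify(read, clean_db))   # "unclassified"
```

The pipeline is perfectly reproducible in both cases. The only thing that changed between the wrong answer and the honest one is the reference.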

And sometimes the pipeline is borked in a stable but unnoticed way. Bad normalization. Batch effects. Data leakage (Leaky Pipes! John are you reading this?). Thresholding choices. Preprocessing steps that quietly attach labels or distort the signal. Machine learning is especially good at exploiting whatever stable pattern is available, whether that pattern is biology or nonsense. The model does not care. If the artifact separates the classes, it will happily learn the artifact. Win! (no… that is NOT a win).
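Here is a small simulated sketch of a model learning the artifact. The setup is entirely hypothetical: two made-up cancer classes with identical underlying “biology”, a processing offset (`BATCH_OFFSET`) that only hits one class, and a bare-bones nearest-centroid classifier standing in for a real ML model. None of this is taken from the papers discussed above.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

BATCH_OFFSET = 5.0  # hypothetical technical artifact, not biology


def make_samples(n_per_class, with_artifact):
    """Return (value, label) pairs. Both classes share the same biology (mean 0)."""
    samples = []
    for label in ("cancer_A", "cancer_B"):
        for _ in range(n_per_class):
            value = rng.gauss(0.0, 1.0)
            if with_artifact and label == "cancer_A":
                value += BATCH_OFFSET  # the artifact tracks the label
            samples.append((value, label))
    return samples


def fit_centroids(samples):
    """Compute the mean feature value for each label."""
    centroids = {}
    for label in {lab for _, lab in samples}:
        values = [v for v, lab in samples if lab == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def predict(value, centroids):
    """Predict the label whose centroid is closest to the value."""
    return min(centroids, key=lambda lab: abs(value - centroids[lab]))


def accuracy(samples, centroids):
    correct = sum(predict(v, centroids) == lab for v, lab in samples)
    return correct / len(samples)


train = make_samples(100, with_artifact=True)
centroids = fit_centroids(train)

# Held-out data from the SAME flawed pipeline: looks fantastic.
acc_same_pipeline = accuracy(make_samples(100, with_artifact=True), centroids)

# Data without the artifact: the "signal" is gone.
acc_clean = accuracy(make_samples(100, with_artifact=False), centroids)

print(f"same pipeline: {acc_same_pipeline:.2f}, artifact removed: {acc_clean:.2f}")
```

Held-out accuracy on data from the same pipeline comes out near perfect, while accuracy on artifact-free data sits around chance. A standard train/test split would never catch this, because the test set inherits the same artifact.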

At a smaller scale, reproducible wrongness wastes time, burns money [10], and sends people chasing false leads. In AIxBio, it can become infrastructure. If models are trained on datasets with provenance gaps, contaminated references, mislabeled samples, or hidden technical biases, then the models may look performant while encoding failure modes inherited from the corpus itself. A big enough training set does not save you from this. In some cases it just industrializes the problem. That is part of why so many serious groups are already filtering, curating, and rebuilding training corpora rather than treating all public genomics data as interchangeable raw material.

This is also why I keep coming back to the same point from my last post: public sequence archives are essential, but they are mostly archives, not universally validated reference resources. An accession number means something was submitted, formatted well enough to get through, and stored. It does not mean the biological material was authentic, the metadata were complete, the taxonomy is current, the workflow was sound, or the downstream claim is correct.

Thank you for reading. Please leave a comment with your thoughts, and if you like what I’m doing here - please subscribe to my Substack!


(… and for the record - I have great respect for the work of Rob Knight and his group. He’s truly one of the pioneers of the microbiome / metagenomics field. It’s just unfortunate he got this one so wrong)

1. Jacobs, J. (2026) “The Case for Data Provenance and Authenticity in Genomics”. Zenodo. doi:10.5281/zenodo.20565193.

2. Nussinov, Ruth, George Pieczenik, Jerrold R. Griggs, and Daniel J. Kleitman. “Algorithms for Loop Matchings.” SIAM Journal on Applied Mathematics 35, no. 1 (1978): 68–82. https://doi.org/10.1137/0135006.

3. Poore, Gregory D., Evguenia Kopylova, Qiyun Zhu, et al. “RETRACTED ARTICLE: Microbiome Analyses of Blood and Tissues Suggest Cancer Diagnostic Approach.” Nature 579, no. 7800 (2020): 567–74. https://doi.org/10.1038/s41586-020-2095-1.

4. Merchant, Samier, Derrick E. Wood, and Steven L. Salzberg. “Unexpected Cross-Species Contamination in Genome Sequencing Projects.” PeerJ 2 (November 2014): e675. https://doi.org/10.7717/peerj.675.

5. Lu, Jennifer, and Steven L. Salzberg. “Removing Contaminants from Databases of Draft Genomes.” PLOS Computational Biology 14, no. 6 (2018): e1006277. https://doi.org/10.1371/journal.pcbi.1006277.

6. Steinegger, Martin, and Steven L. Salzberg. “Terminating Contamination: Large-Scale Search Identifies More than 2,000,000 Contaminated Entries in GenBank.” Genome Biology 21, no. 1 (2020): 115. https://doi.org/10.1186/s13059-020-02023-1.

7. Gihawi, Abraham, Yuchen Ge, Jennifer Lu, et al. “Major Data Analysis Errors Invalidate Cancer Microbiome Findings.” mBio 14, no. 5 (2023): e01607-23. https://doi.org/10.1128/mbio.01607-23.

8. Horbach, Serge P. J. M., and Willem Halffman. “The Ghosts of HeLa: How Cell Line Misidentification Contaminates the Scientific Literature.” PLOS ONE 12, no. 10 (2017): e0186281. https://doi.org/10.1371/journal.pone.0186281.

9. Earlier in my career, I used to lead a pathogen genomics and microbial forensics group. I had a plaque on the wall of my office, made by one of my team members, that said “what’s in your tube is not in your database”. This was a daily reminder that whatever is in your DB is almost always wrong to begin with.

10. Freedman, Leonard P., Iain M. Cockburn, and Timothy S. Simcoe. “The Economics of Reproducibility in Preclinical Research.” PLOS Biology 13, no. 6 (2015): e1002165. https://doi.org/10.1371/journal.pbio.1002165.

Metadata as Infrastructure

Jonathan Jacobs — Sat, 06 Jun 2026 11:31:12 GMT

This is my first Substack post. So be nice 😊 and naturally I’m starting with the fun stuff: Metadata.

Please try to contain yourself.

I recently wrote a review paper on data provenance and authenticity in genomics. It’s currently a pre-print over on Zenodo, but I’m hoping to see it in a peer reviewed journal soon (I’ll update here when it ships). It’s also the product of an ongoing labor of love of mine over the last year or so: ~110+ papers read and reviewed, two half-baked manuscripts written and abandoned along the way, and then finally over spring break this year I buckled down and just hammered it out. One of those rare victories against writer’s block on topics that are pretty central to my main areas of research: genomics data quality, data provenance, authenticity, and ultimately trust. Of course, the academic version linked above has all the citations, caveats, examples, and the usual careful phrasing we like to use when we want to say “this is kind of a mess” without writing “this is kind of a mess.”

This Substack is sort of the snarky TLDR version. (although it’s sort of long too… 😐)

My main point is actually pretty simple: genomics has built an enormous global data commons, but we often treat (public) sequence data as more trustworthy than it actually is.

To be clear - I’m not dunking on GenBank, SRA, ENA, DDBJ, or any of the other public databases that basically hold modern genomics together with duct tape, Perl scripts (yes, Perl scripts), and frankly heroic amounts of institutional memory. These resources are essential. They are one of the great scientific infrastructure wins of the last 40 years. But they are mostly archives, not verified sources of truth. In fact, INSDC’s own policies specifically call out that the accuracy of the data is not the responsibility of the database admins - but of the submitters of the data. This small, but important, point is overlooked by most people IMHO.

Put another way - a sequence having an accession number does not mean that the organism it was obtained from was correctly identified, that the sample came from a verified source, or that the lab methods were captured well enough to reproduce the data. There’s no certainty that the sequencing methods or bioinformatics tools used to produce or assemble the sequence are reproducible, let alone that the metadata is complete enough for someone else to make sense of it later.

Having an accession number just means someone submitted a string of [A, T, G, C] characters, and provided the minimum metadata needed to get it through the gates, and nothing about the sequence looked “strange” (e.g., contaminated, etc.).

Obviously, it’s still useful. But at the end of the day this is not the same thing as saying the data is trustworthy or carries enough provenance.

For over two decades we’ve been piling up this data (over an exabyte now?) with this noble vision that it will be preserved and reused for the benefit of all. And now, we’re on the doorstep of this AIxBio revolution, feeding this data into training models, and… we’re surprised when these models efficiently learn the wrong things about our data because the labels (e.g., the metadata tied to these sequences) are borked in one way or another.

Michael Koeris recently wrote that “data is the moat,” but that the moat has to exist at the national capability level, not just as another VC-backed drug-discovery wedge. I totally agree with the main points in his work. But I’d add: a moat that is protecting noisy, poorly described biological data is not what we want, but it’s the moat we have at the moment. He proposes building “data factories” to create high value datasets, properly labeled and structured, and designed a priori for model building (which is different than data stockpiled after a slew of publications). This is a fantastic idea, and one that some groups (like mine at ATCC tbh) fortunately have already started doing. Focus on data provenance, the process, quality standards, and comparability; document everything; operate like a production factory. I think the idea is starting to take off too, as some much larger groups are beginning to do this behind very high VC / biopharma backed paywalls as well. Why? Because there’s a realization taking place across the AIxBio space that genomics data in the public domain is so “sparse and clumpy” (something Niall Lennon said during his talk at SFA2F in May) that it may just be better to systematically recreate it than to waste time trying to use it for training.

Bauer LeSavage made a related point in a recent Dimension Research post on training data for bio-AI models: the LLM playbook of just scaling up data does not port cleanly into biology. (it’s a fantastic Substack post btw) Biology needs quality, context, diversity, and actual ground truth. “Scalemaxxing” your way through messy biology is a great way to build a very confident model that has memorized everyone’s artifacts.

This is where provenance becomes more than paperwork. Provenance is the evidence chain. What do I mean by “provenance”? It’s basically all the answers to the questions like:

Where did the biological material come from?
Was it the original strain, a lab derivative, a contaminated culture, a mislabeled sample, or something passed around between 10 labs before it was sequenced?
What instrument produced the data?
What basecaller did they use?
What assembler was used to make that genome? And with what parameters?
What reference database did they use for the contamination screen? What version of that database? What QC threshold did they use for IS contaminated vs. ISNOT contaminated?
Was the physical material preserved anywhere, or did it disappear into the great freezer of vibes? (ok - I know this isn’t possible for everything, esp. some clinical, microbiome, environmental, etc. samples… but it’s an omnipresent question for my group: where can I get the original material that was used to make this genome? And very often the answer comes up: Not Possible. 😔)
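If you wanted to capture those questions in a structured way, a minimal sketch might look like the record below. To be clear, this is a hypothetical schema: the `ProvenanceRecord` class, its field names, and the example values are invented for illustration and are not any existing metadata standard.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance schema; field names are illustrative only."""
    accession: str
    material_source: Optional[str] = None          # where the material came from
    passage_history: Optional[str] = None          # original, derivative, passed between labs...
    instrument: Optional[str] = None
    basecaller: Optional[str] = None
    assembler: Optional[str] = None
    assembler_parameters: Optional[str] = None
    contamination_db: Optional[str] = None
    contamination_db_version: Optional[str] = None
    contamination_threshold: Optional[float] = None
    physical_material_available: Optional[bool] = None

    def missing_fields(self):
        """List every provenance question that was left unanswered."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# A made-up record with only the bare minimum filled in.
record = ProvenanceRecord(
    accession="EXAMPLE_0001",
    instrument="hypothetical sequencer",
    assembler="hypothetical assembler",
)

missing = record.missing_fields()
print(f"{len(missing)} unanswered provenance questions: {missing}")
```

The point of a structure like this is not the specific fields, but that the gaps become visible and countable at submission time instead of being discovered years later.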

These questions are not administrative trivia. They determine whether a dataset can be reproduced, replicated (not the same thing as reproduced btw…), reused, compared, and ultimately whether the results can be trusted.

One of the needs in biotech is making implicit assumptions explicit - something Dr. Shelby recently called out as well. Perhaps most relevant to this post is her “Fork 2” in her post on whether frontier labs will build vs. buy vs. acquihire (lol) proprietary data. It also sort of loops in Lennon’s comment (above). The AIxBio space is grappling with a very big assumption right now: is the public genomics layer “good enough” to serve as the substrate for the next era of AI biology? Maybe it is in some places. In other places, maybe not. That assumption needs to be dragged into the light and poked with a stick IMHO. We need to curate and close metadata and provenance gaps where we can and perhaps, as Lennon and Koeris have proposed, just recreate the foundational data “correctly” to set things right where needed.

For most of my career, at nearly every conference I’ve attended, colleagues complain about the quality and issues associated with public data, especially as it relates to ingesting, re-curating, “data wrangling”, etc. and the time and inherently error prone aspects of it. It’s not an impossible problem, it’s just labor intensive to fix and it doesn’t scale well. This second point becomes a real serious problem for AI models that need HUGE datasets, except in some cases (like using the incredibly well maintained and curated data in the PDB to train AlphaFold).

I’m almost done - but there’s two more points I want to make.

Metadata is (admittedly) sort of boring.

(ok to some people)

It’s often the stuff that is perceived as having minimal value on the individual sample level. On the bit level so to speak. But at scale… it becomes hugely important. So, in a way, it’s a matter of perception.

Interestingly, when Henry Lee writes about how execution is now the bottleneck in science - he’s actually touching on this same issue. Ideas are cheap. Lab execution is expensive, slow, and full of friction. I’d argue provenance (or lack of it TBH) lives exactly in that friction layer. It is the annoying operational substrate that lets you know what happened, why it happened, and whether anyone should believe it happened. It’s also the part where, when done manually, one-sample-at-a-time, it becomes hugely tedious. Suddenly some aspects of the experiment aren’t so necessary to keep track of (labels get lost), or we don’t pay attention to the details when we should (labels are wrong). Automation, even on small scales, can have a huge positive impact on improving metadata capture and provenance quality across the entire life cycle from materials to data to results. So too, dare I say, can sticking to established standards for documentation (e.g., ISO 9000) even when you’re doing basic research. How many freshly minted graduate students have I hired that had previous experience working under an ISO quality assurance framework? Zero. (which alone says quite a bit about the priority for trainees around quality, standards, and reproducibility)

OK - my last point: Culture collections and biorepositories still matter, and should for quite a while yet.

Yes, I am biased here (I work for ATCC). But I am also right.

One of the fun things I dug up for my background research in the review paper (here’s the link again) was the original 1978 proposal by Walter Goad to create the Los Alamos DNA Data Bank. This was the predecessor to GenBank (it was renamed in 1984). It was super interesting to read - but what stood out to me was Goad’s assertion as to the importance and value of maintaining, wherever possible, the physical reference materials from which the DNA sequence data had originally been derived. He writes about these “voucher specimens” as being essential for long term reproducibility studies and a requirement for any sequence to get a “verified” status (everything else in the data bank was labeled “unverified”). Now - interestingly… in 1984 when GenBank was created, because they were being overwhelmed with so much data even then, they dropped the requirement for “voucher specimens”, and the “verified” / “unverified” status labels for all sequences were dropped too. That was the point where the DNA Data Bank became a DNA Data Archive. This important change, I think, was lost on many researchers at the time, and remains an obscure but important point today.

I’m of the strong opinion that culture collections and biorepositories should systematically resequence everything in their collections to establish ground-truth for the physical materials held within them, curate the current and historical metadata records associated with those materials, and identify (and potentially remove) sequences in the public domain that are unable to be “verified”. There should be a large scale, national effort to convert these resources into the “Data Factories” that Michael Koeris and others have called for, producing authenticated high-quality ground truth data for everything held within their freezers and LN tanks.

Fortunately, on a relatively small scale at least, self-funded projects like the ATCC Genome Portal and the NCTC3000 project are examples where these repositories are building these ground truth datasets. “Authenticated genomes” are being created where physical materials are matched with genome references and end to end provenance is maintained throughout. But, honestly, these are not at the scale needed to truly accelerate AIxBio model development, despite having access to a massive diversity of materials (primarily due to funding limitations).

The AIxBio community needs more of these authenticated reference “omics” datasets to seed model training. These are datasets where the metadata and provenance linkages from materials to data to results are maintained in a structured and standardized way - far better than what has been done historically.

If AI-enabled biology is going to be trained on the public genomics commons, then it needs more than scale. It needs provenance. It needs curation. It needs versioning. It needs better metadata captured at the point of submission, not recovered five years later by someone digging around in supplemental tables from the original publication (I’ve been there…).

The boring metadata about the sample, how it was handled and processed, and about the data, how it was created, and (importantly) how it is structured, is now a critical part of leapfrogging AIxBio model development. Not just “more data”.

Metadata is now infrastructure. Let’s just keep that front and center.