Latest News

Enterprise Innovators

April 28, 2026

AI Has Parallelized Biomedical Research, But The Verification Layer Needs To Match The Throughput

Weill Cornell's Olivier Elemento deploys sub-agents to audit every AI-generated finding. He told The Read Replica why layered verification is the only safeguard against scaling flawed science.

Credit: The Read Replica

I prompt it to verify every reference and it will spin off five sub-agents. Each of them does a web search for a particular reference, pulls it from PubMed and brings it back. If you don't do that, I can guarantee that hallucinations will occur. The verification step is crucial.

Olivier Elemento

Director, Englander Institute for Precision Medicine

Weill Cornell Medicine

Science's replication track record was shaky long before AI entered the picture. A 2016 Nature survey found that more than 70% of researchers had tried and failed to reproduce another scientist's experiments, and the Open Science Collaboration's landmark 2015 study replicated fewer than 40 out of 100 published psychology experiments. The causes have always been structural: publication bias toward positive results, institutional pressure to publish volume, and selective reporting. AI removed the friction that kept the gap manageable.

Olivier Elemento is the Director of Weill Cornell Medicine's Englander Institute for Precision Medicine, where his research group combines big data, AI, and genomic profiling to accelerate the discovery of cancer treatments. With more than 450 published papers and the development of New York State's first approved whole-exome sequencing test for oncology behind him, he's now using AI to push through his hardest problems faster than at any point in his career. He's also watching the verification gap widen with every new capability. "These tools make it look like everything is always working," Elemento told The Read Replica. "And the mistakes propagate themselves."

When Elemento talks about what AI is doing to research, he's talking about work where computational output has clinical consequences and where a flawed conclusion doesn't just waste grant money, but can misdirect treatment decisions for real patients.

With speed comes surface area

"I've rarely been able to do as much research as I've been able to do in the past few months in my entire scientific life," Elemento said. "I'm throwing the hardest problems at it, and it seems like there's nothing it can't do. It's a revolution for dry lab research." Where research once meant training a student, waiting for results, and iterating over months on a single thread, Elemento now fires off analyses across multiple directions and pivots based on what comes back in minutes. He doesn't log into a cluster anymore. The AI logs in, runs the analysis, and returns results. If the analysis needs external data, the AI pulls it, merges it, and processes it without Elemento touching a terminal. "Instead of testing one hypothesis at a time, you can test many," he said. "I think of research as a tree of exploration. It used to be that you could follow just a few branches, but now you can broadly investigate multiple directions at the same time."

But the parallel exploration creates a verification surface that scales with every new branch. Each simultaneous investigation generates its own conclusions, its own intermediate results, its own citations. A researcher checking five branches at once has far less time per branch than one who spent three months on a single thread. Multiply that across a lab, and the verification debt compounds fast.

The pattern maps onto a problem enterprise teams already know. Security operations centers have dealt with a version of this for years: SIEM platforms that ingest everything, flag everything, and leave analysts drowning in alerts they can't meaningfully triage. The volume itself isn't the problem. It's the lack of structured validation at the point of generation. In science, the equivalent is a result that looks plausible, gets included in a paper, and propagates through the literature because nobody verified it before it shipped. A Nature analysis from earlier this year found that tens of thousands of 2025 publications may contain invalid references generated by AI, and a separate review of NeurIPS 2025, one of the world's top machine learning conferences, found over 100 hallucinated citations across 51 accepted papers that had survived full peer review. "I'm concerned that we're going to see investigations that are flawed because of a lack of this verification step," Elemento said.

Elemento's response to the verification gap is architectural. When he uses AI to draft papers, hallucinated citations are a known failure mode, so rather than catch them manually, he builds agent-based verification directly into the workflow. "I prompt it to verify every reference and it will spin off five sub-agents," Elemento said. "Each of them does a web search for a particular reference, pulls it from PubMed and brings it back. And if it doesn't match what the initial draft says, it corrects it. If you don't do that, I can guarantee that hallucinations will occur. The verification step is crucial."
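The fan-out pattern Elemento describes can be sketched in a few lines. This is a minimal, hypothetical illustration, not his actual tooling: `fetch_record` stands in for a sub-agent that would search PubMed, and the stubbed `PUBMED` lookup table replaces real web searches.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Reference:
    citation_key: str
    claimed_title: str

def verify_references(refs, fetch_record, max_agents=5):
    """Fan out one verification task per reference, capped at max_agents.

    fetch_record is a stand-in for a sub-agent that looks up the canonical
    record for a citation (returning None when no record exists).
    """
    def check(ref):
        record = fetch_record(ref.citation_key)
        if record is None:
            return (ref.citation_key, "not_found")   # likely hallucinated
        if record.strip().lower() != ref.claimed_title.strip().lower():
            return (ref.citation_key, "mismatch")    # real paper, wrong claim
        return (ref.citation_key, "verified")

    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        return dict(pool.map(check, refs))

# Stubbed "PubMed" for illustration; a real sub-agent would do a web search.
PUBMED = {"smith2021": "Genomic profiling in oncology"}

refs = [
    Reference("smith2021", "Genomic profiling in oncology"),
    Reference("ghost2024", "A paper that does not exist"),  # hallucinated
]
report = verify_references(refs, lambda key: PUBMED.get(key))
```

The key property is that every reference gets its own independent check, so verification throughput scales with the drafting throughput instead of falling behind it.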

The principle underneath maps to a pattern showing up across industries. At Snap, engineering teams run six agents on every pull request, each with a scoped responsibility, from security scanning to documentation review. No single agent has final authority. The code ships only after the full pipeline validates it. Elemento's workflow follows the same logic: the drafting agent writes, the verification agents audit, and the human reviews the audited output rather than the raw output.
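The "no single agent has final authority" rule reduces to an AND over independent verdicts. A toy sketch, with two invented reviewer functions standing in for scoped agents (the names and checks here are illustrative, not Snap's actual pipeline):

```python
# Hypothetical scoped reviewers; each returns True (pass) or False (block).
def security_scan(change: str) -> bool:
    return "password" not in change.lower()   # e.g. flag hardcoded secrets

def docs_review(change: str) -> bool:
    return len(change.strip()) > 0            # e.g. require a description

AGENTS = {"security": security_scan, "docs": docs_review}

def run_pipeline(change: str, agents: dict):
    """Every agent reviews the change independently; it ships only if the
    full pipeline passes. Approval is the AND of all scoped verdicts."""
    findings = {name: agent(change) for name, agent in agents.items()}
    return all(findings.values()), findings

ok, findings = run_pipeline("refactor auth module", AGENTS)
```

Any single failing verdict blocks the change, which is what keeps one overconfident agent from shipping a flawed result on its own.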

Verification has to be automated with the same intensity as the research itself. A researcher exploring a hundred branches simultaneously needs a verification architecture that can keep pace. Human review alone doesn't scale, for the same reason manual alert triage doesn't scale in a SOC processing tens of thousands of events per day.

Certainty as a default setting

Even with layered agent verification, AI tools present their output with a confidence that doesn't correlate with accuracy. "These AI models tend to be very optimistic about their own work," Elemento said. "They can make you feel like a result is very high confidence and reliable while in fact it's not. That's a psychological adjustment."

The scientist's job shifts accordingly. Instead of spending weeks deep in a single dataset, the researcher now switches context across dozens of concurrent investigations, scanning each one with enough skepticism to catch what the verification agents missed. Elemento draws a parallel to checking a trainee's work: a senior scientist develops intuition over years for what results should look like, when confidence intervals are missing, when a figure doesn't match the underlying data. AI compresses that dynamic. The trainee used to come back after a week; the AI comes back after a minute. The verification instinct needs to be the same, applied at a cadence that didn't exist before.

There's a reason for optimism underneath the challenge. The bias that drives most replication failures, whether it's attachment to a hypothesis or selective reporting of results, runs into a structural obstacle when every interaction with an AI tool is recorded. "It maintains a log of everything that you've done," Elemento said. "Every prompt, every answer, every result. If the results don't match up with what you want, you can't remove or ignore it. It's there in the record."
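The tamper-evidence Elemento points to can be made explicit with hash chaining, the same idea behind append-only audit logs. A minimal sketch, assuming a simple in-memory log rather than any specific tool's implementation:

```python
import hashlib
import json

class ResearchLog:
    """Append-only log of prompts and results. Each entry's hash covers the
    previous entry's hash, so deleting or editing any record breaks the chain."""

    def __init__(self):
        self.entries = []

    def append(self, prompt: str, result: str) -> str:
        prev = self.entries[-1]["hash"] if self.entries else "genesis"
        payload = json.dumps(
            {"prompt": prompt, "result": result, "prev": prev}, sort_keys=True
        )
        entry = {
            "prompt": prompt,
            "result": result,
            "prev": prev,
            "hash": hashlib.sha256(payload.encode()).hexdigest(),
        }
        self.entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute every hash; any edited or dropped entry fails the check."""
        prev = "genesis"
        for e in self.entries:
            payload = json.dumps(
                {"prompt": e["prompt"], "result": e["result"], "prev": prev},
                sort_keys=True,
            )
            if hashlib.sha256(payload.encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

Selective reporting becomes detectable rather than invisible: a researcher can still ignore a negative result, but can't quietly remove it from the record.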

If semi-autonomous AI research systems become widespread, they could build something that has never existed in science: large-scale databases of negative results. Journals don't publish failed experiments. There's no centralized repository of what's been tried and didn't work. AI systems that log everything by default could change that, not through a policy mandate but as a byproduct of how the tools work.

From papers to provenance

The current model was designed for a world where analysis was the hard part. Generating data was expensive; interpreting it was where the intellectual contribution lived. AI inverts that. "The real value is shifting toward the data itself," said Elemento. "Once you have data, using data science and AI to analyze it has become close to free. What's going to make scientific groups unique is the data they can produce."

He envisions a future where the unit of scientific exchange is a computational model rather than a static paper. Researchers would publish data alongside machine-readable interpretations, structured so that other AI tools can access, validate, and build on the work. The paper becomes less central. The model, the data, and the full provenance chain become what matters. "It could become an exchange of models as opposed to an exchange of publications."

That vision has infrastructure requirements the research community hasn't built for. If data and models become the unit of exchange, the layer that stores, versions, and governs access to that data becomes the critical infrastructure for scientific credibility. Provenance tracking has to be native. Access controls have to be granular enough to govern who and what can query the data, especially as AI agents do more of the querying autonomously. Versioning has to capture the relationship between a model and the specific data it was trained on. Audit logs have to be immutable, because the entire value proposition depends on the community trusting the record.
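Those four requirements can be sketched together in a toy data store. Everything here is illustrative: the class names, the grant model, and the audit format are assumptions, not any particular database's API.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetVersion:
    name: str
    content_hash: str  # fingerprint of the exact bytes a model was trained on

@dataclass(frozen=True)
class ModelRecord:
    model_id: str
    trained_on: DatasetVersion  # provenance: model -> specific data version

class DataStore:
    """Toy store: content-hashed versions, per-agent grants, append-only audit."""

    def __init__(self):
        self.datasets = {}  # name -> (DatasetVersion, payload)
        self.grants = {}    # agent -> set of dataset names it may query
        self.audit = []     # append-only log of every access attempt

    def register(self, name: str, payload: bytes) -> DatasetVersion:
        version = DatasetVersion(name, hashlib.sha256(payload).hexdigest())
        self.datasets[name] = (version, payload)
        return version

    def grant(self, agent: str, name: str):
        self.grants.setdefault(agent, set()).add(name)

    def query(self, agent: str, name: str) -> bytes:
        allowed = name in self.grants.get(agent, set())
        self.audit.append((agent, name, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{agent} has no grant on {name}")
        return self.datasets[name][1]
```

Even in this toy form, the pieces the article names are visible: `DatasetVersion` pins a model to the exact bytes it saw, `grants` scopes what each agent can touch, and `audit` records denials as well as successes.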

These are database-layer requirements. Engineering teams across industries are converging on the same ones as AI agents generate workloads that outpace human oversight: correctness guarantees, lineage tracking, scoped access, governance enforced where the data lives. The research world is arriving at the same conclusion through a completely different path.

The field that couldn't reliably reproduce results at human speed now has the tools to make every result reproducible by default. Whether that happens depends on whether the verification architecture, from agent-checked references all the way down to the database that stores the underlying data, gets built with the same urgency as the tools generating the results.