GenAI is Instagram science. Are we training machines on academic fiction?

A shaky confidence

GenAI systems like ChatGPT and Claude are Instagram science at scale—polished, confident, and built on shaky foundations. We’ve created systems that synthesize scientific knowledge with the same superficial authority that influencers use to sell wellness products, except these systems are now guiding medical treatments and engineering decisions, and being used to educate future researchers worldwide.

The problem isn’t the LLMs themselves, but the foundation they are trained on: the system of scientific publication. The publication biases and statistical shortcuts that created the reproducibility crisis are now being encoded into artificial intelligence at unprecedented scale.

Replication vs recital

The scientific literature feeding our AI systems has well-documented reliability problems, and the publications documenting the replication crisis keep piling up. This isn’t a case of widespread scientific misconduct; it’s the result of misaligned incentives and an academic and research system that rewards spectacular findings with grants, career advancement, and fame. The problem is that this very replication crisis is now being amplified by AI systems without any checks or balances.

P-hacking as statistical optimism

Because of very real constraints (limited budgets, recruitment bottlenecks, small animal cohorts), most studies are underpowered. And underpowered studies create deep uncertainty about what “statistical significance” really means: the lower the power, the higher the false discovery rate among the “significant” results. In practice, that means underpowered studies are far more likely to label random noise as a meaningful signal.
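To make that concrete, here is a minimal back-of-the-envelope sketch in Python. The numbers (a 10% base rate of true hypotheses, the usual 0.05 alpha) are illustrative assumptions, not figures from any specific field:

```python
# A back-of-the-envelope sketch: how the false discovery rate among
# "significant" results depends on statistical power.
# Assumed numbers: 10% of tested hypotheses are true, alpha = 0.05.
def false_discovery_rate(prior_true, alpha, power):
    """Share of p < alpha results that are actually false positives."""
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)

for power in (0.80, 0.30, 0.12):
    fdr = false_discovery_rate(prior_true=0.10, alpha=0.05, power=power)
    print(f"power = {power:.0%}: {fdr:.0%} of significant findings are false")
# power = 80%: ~36% false; power = 30%: ~60%; power = 12%: ~79%
```

Even a well-powered field produces a surprising share of false positives under these assumptions; at 12% power, most “discoveries” are noise.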

But the academic system doesn’t reward caution. Journals almost never publish “we found nothing,” and careers don’t advance on null results. So researchers are incentivized to probe their data until something crosses the p<0.05 threshold—whether by testing subgroups, swapping statistical models, or slicing the variables a different way. Clinical trials try to curb this by requiring pre-registered protocols, but outside that domain flexibility reigns.
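How much this flexibility inflates the error rate is easy to simulate. The toy sketch below assumes ten candidate subgroups, twenty samples per arm, and no real effect anywhere; an analyst who stops at the first “significant” subgroup still comes home with a result far more often than the nominal 5%:

```python
# A toy simulation of subgroup hunting: the data is pure noise, but testing
# ten subgroups and stopping at the first p < 0.05 still yields "findings".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_subgroups, n_per_arm = 1_000, 10, 20

hacked_hits = 0
for _ in range(n_studies):
    for _ in range(n_subgroups):
        treated = rng.normal(0.0, 1.0, n_per_arm)  # no true effect anywhere
        control = rng.normal(0.0, 1.0, n_per_arm)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            hacked_hits += 1
            break  # the first "significant" subgroup gets written up

print(f"{hacked_hits / n_studies:.0%} of pure-noise studies report p < 0.05")
# Expect roughly 1 - 0.95**10, i.e. about 40%, instead of the nominal 5%.
```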

The result is a published record swollen with false positives. The effects that survive peer review are systematically biased toward the largest, most dramatic, and therefore least reliable estimates. And statistical significance itself says nothing about the size of an effect—or whether the finding matters in the real world.

When AI systems are trained on this literature, they don’t inherit the structure of reality. They inherit the statistical fingerprints of motivated reasoning, dressed up in the language of science.

But surely, multiple studies found that…

Thomas Kuhn observed that “normal science” tends to protect dominant theories rather than rigorously test them. What we have now makes Kuhn’s observations look quaint. We’ve built institutional machinery that systematically filters out inconvenient evidence:

  • Publication bias eliminates contradictory results before they reach the literature
  • Peer review naturally favors findings that confirm existing beliefs
  • Career advancement depends on producing “significant” results
  • Statistical flexibility allows determined researchers to find supporting evidence in almost any dataset

This isn’t malicious—it’s the natural result of well-intentioned people operating within poorly designed incentive structures. But the cumulative effect is a scientific literature that systematically over-represents confirming evidence while under-representing challenges to established thinking.

An LLM is Kuhn on steroids

LLMs have become the ultimate paradigm amplifiers. They don’t just learn from biased literature—they become extraordinarily confident advocates for whatever patterns emerge from their training data, regardless of the underlying evidence quality. LLMs simply regurgitate patterns from an inherently flawed literature and generate endless confident variations on existing themes.

Transformers (the basis for LLMs) are sophisticated autocomplete, not reasoning engines. They excel at mimicking the language of scientific authority without possessing any actual understanding of scientific methods or uncertainty.

A transformer outputs “Studies show 73% efficacy with p<0.05” because that phrase appears frequently in papers, not because it understands what statistical power means or whether the underlying studies were competently designed. It’s pattern matching all the way down.

LLMs operate in pure information space: they cannot run experiments, learn from failures, or generate empirical evidence for or against a hypothesis. Without intervention, false beliefs can persist indefinitely, amplified and refined through each generation of training. This becomes increasingly problematic as researchers start to rely on LLMs for their own research. We risk creating a downward spiral of propagating falsehoods.

The Bayesian AI

While “traditional” frequentist statistics is not great at expressing uncertainty, Bayesian statistics is different: it lets us state our assumptions explicitly and calculate the probability that, given the underlying data and those assumptions, a signal is a true signal.

Imagine what working with such an AI system would look like. Once you have formulated your priors, it could output something like this: “Given typical statistical power in this field (12%), publication bias patterns (85% positive results), and effect size inflation (2.3x average), I estimate a 23% probability this finding represents reality.”
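Under the hood, such an estimate could come from something like a standard positive-predictive-value calculation. The sketch below is illustrative only; the 11% prior of true hypotheses is an assumption chosen for the example, not how the 23% above would actually be derived in a full model that also corrects for publication bias and effect-size inflation:

```python
# An illustrative positive-predictive-value calculation: the probability
# that a published "significant" finding reflects a real effect.
# Assumed inputs: ~11% of tested hypotheses true, 12% power, alpha = 0.05.
def prob_finding_is_real(prior_true, power, alpha):
    """P(effect is real | the study reports p < alpha)."""
    true_hits = prior_true * power
    false_hits = (1 - prior_true) * alpha
    return true_hits / (true_hits + false_hits)

estimate = prob_finding_is_real(prior_true=0.11, power=0.12, alpha=0.05)
print(f"Estimated probability the finding represents reality: {estimate:.0%}")
# ~23%. Publication bias and effect-size inflation would enter a fuller
# model as additional correction terms on top of this simple update.
```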

On paper, we have the tools to build such a system: LLMs can extract findings, mature libraries exist for Bayesian inference, and we can design the UX/UI for entering priors and other data sources to take into account. But while paper is forgiving of technical requirements, reality is a different beast, especially when it comes to incentives and economics.

Our entire research culture, from academic grants to preclinical pharmaceutical research, is geared to promote positive confirmation bias. What will happen when uncertainty takes front and center stage?

Are we more comfortable with a culture of supposedly data-driven decision making, even when we know the data is flawed and cherry-picked to fit the agenda of the day? What are the economics of intellectual honesty and uncertainty?

Build the arXiv for Failures

Maybe we don’t need to build the Bayesian AI to answer the questions above. Maybe we can build an experiment that would simultaneously generate the training data for our future AI: the arXiv for failures. Digital publication is “free”, and a failure to replicate a given study can make for a very short, templated report.

The Falsification Journal could offer structured templates for replication attempts, automated diffs against the original paper’s methods and materials, aggregation across multiple failed replications, access to raw data, and content-moderation tools to prevent competing research groups from weaponizing “false falsifications” against each other.
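Purely as an illustration of how lightweight such a template could be, here is a sketch of a replication-attempt record. Every field name is an assumption, not a proposed standard:

```python
# An illustrative replication-attempt record; all field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class ReplicationAttempt:
    original_doi: str                 # paper being replicated
    replicating_lab: str
    outcome: str                      # "replicated" | "failed" | "inconclusive"
    effect_size_original: float
    effect_size_replication: float
    sample_size: int
    protocol_deviations: list[str] = field(default_factory=list)  # diff vs. original methods
    raw_data_url: str | None = None   # link to underlying data, if shared

attempt = ReplicationAttempt(
    original_doi="10.0000/placeholder",   # hypothetical identifier
    replicating_lab="Example Lab",
    outcome="failed",
    effect_size_original=0.80,
    effect_size_replication=0.05,
    sample_size=120,
    protocol_deviations=["different reagent supplier"],
)
```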

From my own experience in pharmaceutical research, all of our failures to replicate academic results simply stayed institutional knowledge. Imagine how much effort could be saved across many other labs if there were an outlet for these failures, and how much additional research capital (money, time, and brainpower) could be redirected toward answering new hypotheses.

Such a repository of our accumulated failures is also our accumulated learnings, and the basis for developing better methodologies and awareness. And it could serve as the training data for our Bayesian AI systems.

Bayesian or autonomous

As long as LLMs are not connected to the real world and cannot conduct experiments, they will promote whatever flawed picture of reality we present them with. We can either choose to acknowledge this problem and build hybrid systems that incorporate and present this uncertainty, or we can wait for our robotic overlords to actually be able to generate empirical evidence and create their own feedback loops.

Irrespective of what we choose, the current LLM paradigm solely promotes “normal science”, Kuhn on steroids; there is no Popper making bold conjectures and overthrowing the status quo.