
molecule.one team responsible for the winning entry in the $1M challenge

Trustworthy Retrosynthesis: How a Culture of Chemist-AI Collaboration Won $1M and Led to a Strategic Partnership with W.R. Grace

December 11, 2025

In chemical manufacturing, a single breakthrough route to a molecule can be worth hundreds of millions of dollars.

Deep neural networks hold tremendous promise for retrosynthesis—the key strategic phase of designing a new synthesis—by augmenting and accelerating human intuition with creative ideas. We pioneered this category commercially with our first deep-learning-based solution in 2019.

Against this backdrop, Standard Industries launched a $1M challenge for AI-assisted retrosynthesis planning, which we won.

On the back of that success, we signed a multi-year strategic partnership with W.R. Grace, a Standard Industries company, to jointly develop novel routes for peptide building blocks using our retrosynthesis platform and high-throughput laboratory.

But why did we win the Challenge in the first place? In this blog post, we unveil for the first time the technology behind our retrosynthesis platform, which was the starting point for our Challenge entry and ultimately led to the partnership.

You can find more details in our preprint: "Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers".

Hallucinations in Chemistry are Everywhere

Hallucinations are now at the core of the debate about AI impact. They erode trust, and the need to double-check AI outputs cancels much of the time-saving benefit.

However, much less attention is paid to hallucinations in chemistry. In our view, they are the most important problem in the field. We learned this firsthand after commercializing the first generative deep-learning product for chemistry in 2019.

Indeed, a single hallucinated reaction can undermine trust in any synthesis-planning tool.

Figure 1. Example of a hallucinated reaction produced by a retrosynthesis system. It has no known precedent in major reaction classes or databases.

In our tests, five state-of-the-art systems (IBM RXN; AiZynthFinder/LocalRetro; RootAligned; RetroChimera) still produced a notable share of hallucinated reactions.

Secret #1: A Culture Centered on Chemist Evaluations

Our first "secret" is that we have developed internal processes and software to evaluate chemical correctness. Together with our chemists, we have used these tools to rate over 4,500 reactions to date.

We assign every reaction to one of four categories:

  • Nonsense – effectively infeasible, a hallucination.
  • Rather not – technically possible but doubtful or unattractive.
  • Worthwhile – promising but not guaranteed.
  • Safe bet – considered fully reliable.
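The four categories map naturally onto a small data structure. The sketch below is illustrative only: the `Rating` enum, the `summarize` helper, and the example ratings are hypothetical names and data, not part of our internal tooling. It shows how a hallucination rate could be read off a list of chemist ratings as the share of reactions rated Nonsense.

```python
from collections import Counter
from enum import Enum


class Rating(Enum):
    """The four chemist rating categories (hypothetical encoding)."""
    NONSENSE = "Nonsense"        # effectively infeasible, a hallucination
    RATHER_NOT = "Rather not"    # technically possible but doubtful or unattractive
    WORTHWHILE = "Worthwhile"    # promising but not guaranteed
    SAFE_BET = "Safe bet"        # considered fully reliable


def summarize(ratings):
    """Return the fraction of rated reactions that fall into each category."""
    counts = Counter(ratings)
    total = len(ratings)
    return {r: (counts[r] / total if total else 0.0) for r in Rating}


# Made-up ratings for five proposed reactions (illustrative data only).
example = [Rating.SAFE_BET, Rating.NONSENSE, Rating.WORTHWHILE,
           Rating.SAFE_BET, Rating.RATHER_NOT]
shares = summarize(example)

# Hallucination rate = share of reactions rated "Nonsense".
hallucination_rate = shares[Rating.NONSENSE]
```

Tracking the full distribution, rather than only the Nonsense share, also makes it possible to see whether a change trades hallucinations for fewer Safe bet reactions.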

This process has become a cornerstone of our culture: every change to the system over the last five years has been evaluated against this living dataset.

Secret #2: RetroTrim: An Ensemble of Chemistry- and DL-Inspired Models

These evaluations have guided our development over the last five years. Our core insight is that chemists judge reactions in interpretable ways that can be simulated computationally but are hard to learn directly from existing reaction corpora.

We built RetroTrim, based on three novel models, each capturing a different aspect of how chemists evaluate reactions for correctness. Let's describe them in turn.

  • Reaction Prior scores reactions by estimating the likelihood that a generative model would produce them (token-by-token). We also estimate probabilities of alternative disconnections at different atom sites to assess regioselectivity.
  • GraphScore is a classifier trained to discriminate between real reactions and synthetically generated negative reactions.
  • Finally, Retrieval Score retrieves nearest-neighbor precedents from the training set to flag low-precedent, high-mismatch proposals. It is designed to catch hallucinations of the more gross kind: reactions with no precedents and significant mismatches between the target and reactants.
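To make two of these ideas concrete, here is a minimal sketch under loudly stated assumptions. It is not the RetroTrim implementation: the per-token probabilities are made up, string feature sets stand in for real reaction fingerprints, and all function names are hypothetical. The first function shows the general idea of a token-by-token sequence likelihood; the other two show a nearest-precedent similarity using the Tanimoto coefficient, which is our choice for this illustration. GraphScore is omitted because it requires a trained classifier.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum of log-probabilities of each token given the preceding tokens."""
    return sum(math.log(p) for p in token_probs)


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_precedent_similarity(query, precedents):
    """Similarity of the query to its closest precedent (0.0 if there are none)."""
    return max((tanimoto(query, p) for p in precedents), default=0.0)


# Hypothetical per-token probabilities from a generative model.
confident = sequence_log_likelihood([0.9, 0.8, 0.95])
unlikely = sequence_log_likelihood([0.9, 0.05, 0.95])  # one very surprising token

# Hypothetical feature sets standing in for reaction fingerprints.
precedents = [{"amide", "coupling", "amine"}, {"ester", "hydrolysis"}]
well_precedented = nearest_precedent_similarity({"amide", "coupling", "acid"}, precedents)
unprecedented = nearest_precedent_similarity({"c-h", "insertion"}, precedents)
```

In this toy setup, a single surprising token pulls the sequence likelihood down sharply, and a proposal that shares no features with any precedent gets a similarity of zero. Both are signals a scorer could use to flag a proposal for scrutiny.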

Each of the models blends together deep learning and chemical insights that were contributed by different people. Please refer to our preprint for technical details.

These models are then ensembled to produce a single estimate of reaction plausibility (see Figure 2).

Figure 2. Our core innovation is ensembling three novel models to jointly evaluate reaction plausibility.
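This post does not spell out how the three scores are combined, so the following is only a sketch of one plausible scheme and not a description of RetroTrim. It assumes each scorer outputs a value normalized to [0, 1] and uses a conservative rule in which the weakest score determines the result. The `ScoredReaction` class, the `min` rule, and the 0.5 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredReaction:
    name: str
    reaction_prior: float    # assumed normalized to [0, 1]
    graph_score: float       # assumed normalized to [0, 1]
    retrieval_score: float   # assumed normalized to [0, 1]


def ensemble_plausibility(r):
    """Conservative combination (an assumption): the weakest scorer decides."""
    return min(r.reaction_prior, r.graph_score, r.retrieval_score)


def trim(reactions, threshold=0.5):
    """Keep only reactions whose ensemble plausibility meets the threshold."""
    return [r for r in reactions if ensemble_plausibility(r) >= threshold]


# Made-up scores for two candidate reactions.
candidates = [
    ScoredReaction("A", 0.9, 0.8, 0.7),
    ScoredReaction("B", 0.9, 0.9, 0.1),  # looks likely, but has no close precedent
]
kept = [r.name for r in trim(candidates)]
```

Under this rule, reaction B is rejected even though two of the three scorers rate it highly, which illustrates why scorers that capture different aspects of plausibility can catch errors that any single one would miss.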

Results: The Ensemble Is Required

Only the full ensemble brought the number of detected hallucinations on our benchmark set down to zero.

The results are shown below. Five systems representing the state of the art in the field (IBM RXN, AiZynthFinder and LocalRetro, RootAligned, RetroChimera) all produce a significant number of hallucinated reactions.

Figure 3. Evaluation of state-of-the-art synthesis planning systems on 32 molecules using chemists' feedback. Other systems generated multiple hallucinations. Our system generated more highly accurate pathways while returning no hallucinations.

Challenge Final and Thoughts from the Core Team

The technology described above was the starting point for our Challenge entry.

In the final stage of the Challenge, we had to generate routes live for challenging targets that we had not seen before. The screenshot below shows the generated route for one of the tested targets.

Figure 4. Pathway for one of the targets tested during the Challenge.

Our focus on reliability paid off, given the novelty and difficulty of the targets tested in the final phase. As the announcement put it:

"One of the things that set Molecule.One apart from the other finalists was the relative lack of hallucinations in the chemical pathways that it generated."

We want to close this blog post by sharing the core team's thoughts on the Challenge.

As our CTO, Stan Jastrzębski, recalls:

"The judges picked very complex targets for which our system, initially, returned no routes using our default settings. We realized we had little time to improve the system."

So what allowed us to turn the ship around? Here is what our Machine Learning Lead, Michał Sadowski, says:

"Crucial to our success was that we established a closed loop with the cheminformatics team, which let us iterate quickly."

It was indeed crucial that we already had an established process for developing the system with both deep learning and chemistry expertise.

Fresh perspectives also played a key role. Maria Wyrzykowska joined us in early 2025 and started immediately on the project. She had a tremendous impact, which was remarkable given she had just started. As she recalls:

"I loved being able to have an impact from day one."

Our CEO remembers that everyone advised against applying for the final due to other commitments. We are happy he persisted!

All in all, the common theme is that our success came directly from listening to and integrating many perspectives.

Conclusions

Externally, the takeaway is simple: our method produced the most accurate and creative routes, with no hallucinations detected.

Internally, we believe we succeeded because of the culture we have built. One of our core values is React Ideas: we use everyone's ideas, regardless of seniority or background, to find solutions combining deep learning and chemistry expertise.

You can read the full paper on arXiv.

Next up: discovering chemistry that no human would propose in a synthesis plan.
