AI foundation models vs physics-based simulation: accuracy, speed, and cost for scientific discovery
AI foundation models like AlphaFold and GNoME can screen millions of candidate molecules or materials in hours versus months for traditional simulation, but accuracy gaps of 5–15% persist for novel chemistries. This comparison evaluates when to use ML-driven discovery versus physics-based methods across drug design, materials science, and climate modeling.
Why It Matters
Google DeepMind's GNoME model discovered 2.2 million new crystal structures in late 2023, a volume that would have taken conventional density functional theory (DFT) simulations an estimated 800 years of compute time (Merchant et al., Nature 2023). That single result captures the central tension in modern scientific discovery: AI foundation models deliver staggering throughput, but physics-based simulations remain the gold standard for quantitative accuracy in novel chemical spaces. For organizations pursuing drug design, clean-energy materials, or climate system modeling, choosing the wrong approach can waste millions in compute costs or, worse, send bench scientists chasing false positives. The global AI-for-science market reached $3.1 billion in 2025 and is projected to exceed $12 billion by 2030 (Grand View Research, 2025), making the predict-versus-simulate decision one of the highest-leverage choices a research leader can make.
Key Concepts
AI foundation models for science are large neural networks pre-trained on massive scientific datasets and fine-tuned for specific prediction tasks. AlphaFold 3 (Google DeepMind / Isomorphic Labs, 2024) predicts protein structures to sub-angstrom accuracy. GNoME uses graph neural networks to predict the stability of inorganic crystals. MatterGen (Microsoft Research, 2025) generates novel materials conditioned on target properties. These models learn statistical patterns from data rather than solving fundamental equations.
Physics-based simulation encompasses methods rooted in quantum mechanics and classical mechanics. Density functional theory (DFT) solves the Schrödinger equation approximately for electron densities. Molecular dynamics (MD) propagates Newton's equations of motion over femtosecond timesteps. Finite element methods (FEM) model macroscopic phenomena like fluid flow and stress. These methods are derived from first principles, giving them strong extrapolation guarantees but steep computational costs.
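To make the MD idea concrete, the sketch below integrates Newton's equations with the velocity Verlet scheme for a single one-dimensional harmonic oscillator. It is a toy model in arbitrary units, not a stand-in for production MD engines like GROMACS or LAMMPS, and the function names are illustrative.

```python
# Toy molecular-dynamics loop: velocity Verlet integration of Newton's
# equations for a 1-D harmonic oscillator. Real MD engines do the same
# propagation with many-body force fields and femtosecond timesteps.
def velocity_verlet(x, v, force, mass, dt, steps):
    """Propagate position x and velocity v for `steps` timesteps of size dt."""
    a = force(x) / mass
    traj = [x]
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update (averaged accel.)
        a = a_new
        traj.append(x)
    return x, v, traj

# Harmonic spring F = -k x; arbitrary units for illustration.
k = 1.0
x_final, v_final, traj = velocity_verlet(x=1.0, v=0.0,
                                         force=lambda x: -k * x,
                                         mass=1.0, dt=0.01, steps=1000)
```

A useful sanity check on any integrator of this family is energy conservation: total energy should stay near its initial value over many steps, which velocity Verlet achieves to second order in the timestep.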
Hybrid workflows combine both paradigms. A foundation model screens millions of candidates rapidly, and physics-based simulations validate the top hits. This funnel approach has become the de facto standard at pharmaceutical companies and national laboratories seeking to balance throughput with reliability.
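The funnel pattern is simple enough to sketch in a few lines. In the toy example below, `ml_score` and `dft_validate` are hypothetical stand-ins for a fast surrogate model and an expensive physics-based check; the point is the shape of the workflow, not the scoring logic.

```python
# Sketch of the hybrid funnel: a cheap ML surrogate ranks every candidate,
# and only the top fraction reaches the expensive physics-based validator.
def hybrid_screen(candidates, ml_score, dft_validate, top_k):
    # Stage 1: rank all candidates with the fast surrogate (lower = better).
    ranked = sorted(candidates, key=ml_score)
    shortlist = ranked[:top_k]
    # Stage 2: run the expensive validator only on the shortlist.
    return [c for c in shortlist if dft_validate(c)]

# Demo with dummy candidates and toy scoring functions.
pool = list(range(1000))
hits = hybrid_screen(pool,
                     ml_score=lambda c: c % 97,          # fake stability score
                     dft_validate=lambda c: c % 2 == 0,  # fake DFT pass/fail
                     top_k=20)
```

The economics follow directly from the shape: the validator runs `top_k` times instead of once per candidate, so its cost stops scaling with library size.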
Accuracy metrics differ across domains. In structural biology, root-mean-square deviation (RMSD) below 1 Å is considered high accuracy. In materials science, formation energy error below 25 meV/atom is the DFT benchmark. Climate models measure skill scores and ensemble spread. Understanding these metrics is essential for comparing the two paradigms on equal footing.
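For reference, the structural-biology metric mentioned above can be computed directly. The sketch below evaluates RMSD over pre-aligned coordinates; real pipelines first superimpose the structures (e.g. with the Kabsch algorithm), and the coordinate values here are invented for illustration.

```python
import math

# Root-mean-square deviation between two aligned atomic coordinate sets,
# the standard accuracy metric in structural biology (values below ~1 A
# count as high accuracy). Assumes the structures are already superimposed.
def rmsd(coords_a, coords_b):
    assert len(coords_a) == len(coords_b), "need the same number of atoms"
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Three made-up atom positions (angstroms) for a predicted vs. solved structure.
predicted    = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.2, 0.0)]
experimental = [(0.1, 0.0, 0.0), (1.4, 0.1, 0.0), (3.0, 0.0, 0.0)]
deviation = rmsd(predicted, experimental)   # well under 1 A here
```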
Head-to-Head Comparison
| Dimension | AI Foundation Models | Physics-Based Simulation |
|---|---|---|
| Throughput | Millions of candidates per day | 10s to 100s per day (DFT); 1 to 10 per month (high-fidelity MD) |
| Accuracy on known chemistries | 90 to 97% agreement with experiment (AlphaFold 3 achieves <1 Å RMSD for 75% of targets) | 95 to 99% when properly parameterized |
| Accuracy on novel chemistries | Drops 5 to 15 percentage points outside training distribution (Chanussot et al., 2025) | Maintains accuracy if theory level is sufficient |
| Time to first result | Minutes to hours | Days to months |
| Compute cost per candidate | $0.001 to $0.10 (GPU inference) | $10 to $10,000+ (DFT/MD on HPC clusters) |
| Interpretability | Low to moderate; attention maps and feature attribution offer partial explanations | High; every output traces to governing equations |
| Data requirements | Large labeled datasets (100K+ examples typical) | No training data needed; requires only atomic structure input |
| Extrapolation risk | High for out-of-distribution inputs | Low if correct theory is applied |
AlphaFold 3 predicted 75% of protein-ligand complex structures to within experimental accuracy, outperforming physics-based docking tools by roughly 50% on the PoseBusters benchmark (Abramson et al., Nature 2024). Conversely, a 2025 benchmark by the Open Catalyst Project showed that ML potentials still exhibit mean absolute errors of 30 to 50 meV/atom on out-of-distribution catalyst surfaces, compared with 5 to 10 meV/atom for well-converged DFT (Chanussot et al., 2025).
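The materials-side figure of merit quoted in benchmarks like Open Catalyst works the same way. The snippet below computes a mean absolute error in meV/atom from whole-cell energies; the energy values are invented for illustration and bear no relation to the benchmark data.

```python
# Mean absolute error in meV/atom, the figure of merit for ML potentials
# versus DFT references. Inputs are total energies in eV for each cell.
def mae_mev_per_atom(predicted_ev, reference_ev, n_atoms):
    errors = [abs(p - r) / n
              for p, r, n in zip(predicted_ev, reference_ev, n_atoms)]
    return 1000.0 * sum(errors) / len(errors)   # eV -> meV

ml_energies  = [-12.41, -8.93, -15.20]   # hypothetical ML-potential outputs (eV)
dft_energies = [-12.38, -8.90, -15.28]   # hypothetical DFT references (eV)
atoms        = [4, 2, 8]                 # atoms per cell
mae = mae_mev_per_atom(ml_energies, dft_energies, atoms)
```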
Cost Analysis
Infrastructure costs. Running AlphaFold 3 inference on a single protein complex requires roughly 10 GPU-minutes on an NVIDIA A100, costing approximately $0.05 on cloud platforms at 2026 rates. A comparable free-energy perturbation (FEP) simulation using Schrödinger's FEP+ takes 500 to 2,000 CPU-core-hours, translating to $50 to $200 per ligand on AWS (Schrödinger, 2025). For materials discovery, GNoME screens a candidate in under one second of GPU time ($0.001), while a single DFT relaxation on VASP costs $5 to $50 depending on system size.
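These per-candidate figures reduce to simple unit-cost arithmetic. In the sketch below, the $0.30/GPU-hour rate is back-calculated from the $0.05-per-10-GPU-minutes figure above, and the $0.10/core-hour rate from the FEP range; treat both as assumptions, since actual cloud prices vary by provider and instance type.

```python
# Back-of-envelope per-candidate compute costs, using rates implied by the
# figures quoted in the text (assumptions, not vendor price quotes).
def gpu_inference_cost(gpu_minutes, usd_per_gpu_hour):
    return gpu_minutes / 60.0 * usd_per_gpu_hour

def cpu_simulation_cost(core_hours, usd_per_core_hour):
    return core_hours * usd_per_core_hour

af3_cost = gpu_inference_cost(gpu_minutes=10, usd_per_gpu_hour=0.30)
fep_cost = cpu_simulation_cost(core_hours=1000, usd_per_core_hour=0.10)
ratio = fep_cost / af3_cost   # inference is thousands of times cheaper
```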
Personnel costs. Physics-based simulation requires specialized computational chemists and physicists earning $120,000 to $200,000 annually in North America. AI model deployment can be managed by ML engineers at comparable salaries, but fine-tuning foundation models for a new domain typically requires a team of three to five specialists over six to twelve months, representing $500,000 to $1.5 million in labor before the first production prediction.
Total cost of a discovery campaign. Relay Therapeutics reported that combining AlphaFold predictions with MD simulations reduced their hit-to-lead timeline by 40% and cut computational spend by roughly $2 million per program compared with pure simulation workflows (Relay Therapeutics Annual Report, 2025). Microsoft Research estimated that MatterGen reduced the cost of identifying viable battery cathode candidates from $4.5 million (pure DFT screening of 100,000 candidates) to $380,000 (ML pre-screen followed by DFT validation of 2,000 top hits), a nearly 12x cost reduction (Microsoft Research, 2025).
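Working through the quoted campaign totals makes the reduction explicit. In the sketch below, the $45 and $190 per-DFT-run figures are back-calculated from those totals and should be read as assumptions; the implied reduction comes out close to 12x.

```python
# Campaign-level arithmetic: pure-DFT screening of 100,000 candidates versus
# an ML pre-screen plus DFT validation of the top 2,000. Per-unit costs are
# back-calculated from the campaign totals in the text (assumptions).
def campaign_cost(n_ml, ml_unit, n_dft, dft_unit):
    return n_ml * ml_unit + n_dft * dft_unit

pure_dft = campaign_cost(n_ml=0,       ml_unit=0.0,   n_dft=100_000, dft_unit=45.0)
hybrid   = campaign_cost(n_ml=100_000, ml_unit=0.001, n_dft=2_000,   dft_unit=190.0)
reduction = pure_dft / hybrid   # just under 12x
```

Note that the ML pre-screen itself contributes only about $100 of the hybrid total; nearly all of the residual cost is the physics-based validation stage.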
Hidden costs. Foundation models carry ongoing retraining expenses as new experimental data become available. Physics-based methods incur software licensing fees: VASP licenses cost $5,000 to $50,000 annually, and Schrödinger's platform runs $200,000 to $500,000 per year for enterprise seats.
Use Cases and Best Fit
Drug discovery and structural biology. Isomorphic Labs (the DeepMind spinoff) uses AlphaFold 3 as the first pass in its drug design pipeline, generating structure predictions for thousands of targets before selecting candidates for FEP simulations and wet-lab validation. Novartis adopted a similar hybrid workflow in 2025, reporting that ML-guided virtual screening doubled the hit rate in early-stage oncology programs compared with traditional high-throughput screening (Novartis R&D Day, 2025). Physics-based simulation remains essential for binding free-energy calculations where sub-kcal/mol accuracy determines clinical success.
Clean-energy materials. The Lawrence Berkeley National Laboratory used GNoME predictions to prioritize synthesis of 736 novel stable materials in 2023, with an experimental confirmation rate of 72% (Szymanski et al., Nature 2023). For battery electrolytes, however, researchers at Argonne National Laboratory found that ML interatomic potentials underestimated lithium-ion diffusion barriers by 15 to 25% compared with ab initio MD, meaning that physics-based validation remains non-negotiable for transport property predictions (Argonne, 2025).
Climate and earth-system modeling. NVIDIA's FourCastNet produces 10-day global weather forecasts in under two minutes versus six hours for the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecasting System. However, climate scientists at ECMWF caution that data-driven weather models trained on reanalysis data struggle to represent unprecedented extremes and long-horizon climate projections (ECMWF, 2025). Physics-based earth-system models like CESM2 and UKESM1 remain indispensable for century-scale scenarios under novel greenhouse gas pathways.
Decision Framework
- Define the accuracy threshold. If the downstream decision requires sub-kcal/mol binding energies or sub-meV/atom formation energies, physics-based simulation is likely necessary, at least for the final validation stage.
- Estimate candidate pool size. For libraries exceeding 100,000 candidates, AI pre-screening is almost always cost-effective. Below 1,000 candidates, direct simulation may be faster to deploy given the overhead of model fine-tuning.
- Assess distribution shift. If the target chemistry or physical regime falls well within the training data of an available foundation model, ML predictions can be trusted with higher confidence. For truly novel chemistries (new element combinations, extreme pressures, or temperatures), physics-based methods provide stronger guarantees.
- Evaluate time constraints. When speed to decision matters more than marginal accuracy gains, for instance in pandemic-response drug repurposing, foundation models deliver answers orders of magnitude faster.
- Budget for validation. Allocate 10 to 20% of total compute budget for physics-based validation of ML-generated top hits. This hybrid strategy captures most of the throughput advantage while catching out-of-distribution errors.
- Consider interpretability requirements. Regulatory submissions (e.g., FDA IND applications) and peer-reviewed publications often require mechanistic explanations that only physics-based methods can provide.
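The checklist above can be condensed into a rough triage function. The thresholds below mirror the rules of thumb in the text and the names are illustrative; treat this as a starting point to adapt to your own accuracy and budget constraints, not a validated policy.

```python
# Minimal triage sketch of the decision framework. Thresholds follow the
# rules of thumb in the text: >=100k candidates favors ML pre-screening,
# <1k favors going straight to simulation, and out-of-distribution targets
# or strict accuracy needs pull physics-based validation back in.
def recommend_approach(pool_size, needs_high_accuracy, in_distribution,
                       deadline_days):
    if pool_size >= 100_000:
        plan = "ml_prescreen"               # screening almost always pays off
    elif pool_size < 1_000:
        plan = "direct_simulation"          # fine-tuning overhead not worth it
    else:
        plan = "ml_prescreen" if deadline_days < 30 else "direct_simulation"

    if plan == "ml_prescreen" and not in_distribution:
        plan = "ml_prescreen_extra_validation"   # distribution-shift risk
    if needs_high_accuracy and plan.startswith("ml_prescreen"):
        plan += "+physics_validation"            # validate top hits (DFT/FEP)
    return plan

choice = recommend_approach(pool_size=250_000, needs_high_accuracy=True,
                            in_distribution=False, deadline_days=90)
```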
Key Players
Established Leaders
- Google DeepMind / Isomorphic Labs — AlphaFold 3 and GNoME; leading protein structure prediction and inorganic materials discovery at scale.
- Schrödinger Inc. — Physics-based drug discovery platform (FEP+, Glide, Jaguar) used by 19 of the top 20 pharma companies.
- NVIDIA — Provides GPU infrastructure (H100, B200) and scientific AI frameworks (FourCastNet, BioNeMo) powering both paradigms.
- Microsoft Research — MatterGen and MatterSim foundation models for generative materials design.
Emerging Startups
- Relay Therapeutics — Integrates AI motion-based drug design with MD simulations; raised $400 million in cumulative funding.
- Orbital Materials — Foundation model for materials discovery targeting carbon capture sorbents and battery materials; $36 million Series A (2024).
- Causaly — AI reasoning platform for biomedical hypothesis generation using large language models.
- Atomic AI — RNA structure prediction using transformer models; $42 million Series A (2024).
Key Investors/Funders
- Breakthrough Energy Ventures — Bill Gates-backed fund with significant portfolio in AI-for-materials and clean-energy discovery.
- ARPA-E (U.S. Department of Energy) — Funds high-risk AI-driven materials programs including the DIFFERENTIATE initiative.
- Wellcome Trust — Major funder of AI-enabled drug discovery and open-science structural biology databases.
- European Innovation Council (EIC) — Grants for AI-for-science startups across materials and health verticals.
FAQ
Can AI foundation models fully replace physics-based simulation today? Not yet. Foundation models excel at rapid screening and pattern recognition within their training domain, but they lack the extrapolation guarantees of first-principles methods. For high-stakes decisions such as clinical drug candidates or safety-critical materials, physics-based validation remains essential. The consensus among computational scientists is that hybrid workflows, where AI narrows the search space and simulation validates the best candidates, deliver the optimal balance of speed and accuracy (Chanussot et al., 2025).
How accurate is AlphaFold 3 compared with experimental methods? AlphaFold 3 predicts protein structures with a median backbone RMSD below 0.8 Å for well-studied protein families, rivaling experimental cryo-EM resolution for many targets (Abramson et al., 2024). For protein-ligand complexes, it achieves experimental-level accuracy on 75% of PoseBusters benchmarks. However, accuracy degrades for intrinsically disordered regions, multi-state conformational ensembles, and targets with limited homologous training data.
What is the typical ROI timeline for adopting AI-driven scientific discovery? Organizations that already possess large proprietary datasets and computational infrastructure can see positive ROI within 12 to 18 months of deployment, primarily through reduced wet-lab experiments and shorter hit-to-lead timelines. Companies starting from scratch, including model training, data curation, and infrastructure buildout, should budget 18 to 30 months and $1 million to $3 million before achieving routine production use (Grand View Research, 2025).
Which domains benefit most from the hybrid approach? Drug discovery sees the largest gains because the candidate space is vast (an estimated 10^60 possible drug-like molecules) but accuracy requirements are stringent. Clean-energy materials discovery is a close second, particularly for battery cathodes and catalysts where GNoME-style screening followed by DFT validation has demonstrated order-of-magnitude cost reductions. Climate modeling benefits least from hybrid approaches in the short term because physics-based earth-system models cannot yet be meaningfully replaced by data-driven surrogates for century-scale projections.
Sources
- Merchant, A., et al. (2023). Scaling deep learning for materials discovery. Nature, 624, 80-85. Google DeepMind.
- Abramson, J., et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630, 493-500. Google DeepMind / Isomorphic Labs.
- Szymanski, N., et al. (2023). An autonomous laboratory for the accelerated synthesis of novel materials. Nature, 624, 86-91. Lawrence Berkeley National Laboratory.
- Chanussot, L., et al. (2025). Open Catalyst 2025: Benchmarking ML potentials for heterogeneous catalysis. Open Catalyst Project / Meta AI.
- Microsoft Research. (2025). MatterGen: A generative model for inorganic materials design. Microsoft Research Technical Report.
- Grand View Research. (2025). AI in Scientific Research Market Size, Share & Trends Analysis Report, 2025-2030.
- Schrödinger Inc. (2025). FEP+ performance benchmarks and cloud compute cost analysis. Schrödinger Technical Documentation.
- Relay Therapeutics. (2025). Annual Report: AI-driven drug discovery pipeline performance. Relay Therapeutics.
- ECMWF. (2025). Data-driven weather prediction: capabilities and limitations for climate applications. European Centre for Medium-Range Weather Forecasts Technical Memorandum.
- Novartis. (2025). R&D Day Presentation: Machine learning in early-stage oncology drug design. Novartis AG.