Trend analysis: AI for scientific discovery — where the value pools are (and who captures them)
Strategic analysis of value creation and capture in AI for scientific discovery, mapping where economic returns concentrate and which players are best positioned to benefit.
Start here
AI for scientific discovery has moved from academic curiosity to a primary driver of commercial R&D strategy across materials science, drug development, climate modeling, and energy technology. The global market for AI-driven scientific research tools reached $14.2 billion in 2025, growing at 34% annually, according to McKinsey. But the distribution of value creation is profoundly uneven: platform providers and data infrastructure companies capture 55 to 65% of economic returns, while the research organizations generating breakthrough discoveries often retain less than 15% of the downstream commercial value their work enables. Understanding this value chain is essential for sustainability professionals evaluating where AI-for-science investments will deliver returns and where they will not.
Why It Matters
Scientific discovery is the upstream engine of every sustainability transition. New battery chemistries, carbon capture sorbents, drought-resistant crops, and low-carbon construction materials all originate in research that AI is now accelerating by orders of magnitude. DeepMind's AlphaFold predicted the structures of 200 million proteins in 2022, a task that would have taken experimental structural biologists centuries. Insilico Medicine used AI to identify a novel drug candidate for idiopathic pulmonary fibrosis in 18 months, compared to the typical four to six year timeline. Microsoft Research's AI-driven materials screening identified 32 million potential new materials, with 500,000 predicted to be stable, in a project that compressed decades of computational chemistry into weeks.
For North American sustainability professionals, the implications are direct. The US Department of Energy's 2025 AI for Science initiative allocated $1.8 billion over five years for AI-accelerated clean energy research, spanning advanced nuclear, hydrogen production, grid optimization, and carbon removal. The National Science Foundation's National AI Research Institutes program invested $440 million across 25 institutes, seven of which focus explicitly on sustainability-related scientific domains. Canada's Pan-Canadian AI Strategy committed CAD 443 million through 2028, with climate and environmental science among priority areas.
The competitive dynamics are intensifying. China published 42% of all AI-for-science research papers in 2024, compared to 28% from North America, though North American institutions lead in high-impact citations and commercialization. The EU's Destination Earth initiative is building digital twins of the Earth system for climate prediction, while South Korea and Japan are investing heavily in AI-driven materials discovery for batteries and semiconductors. Organizations that fail to understand where value concentrates in this ecosystem risk investing in capabilities that generate scientific insights but capture none of the economic returns.
Key Concepts
Foundation Models for Science are large-scale pre-trained models adapted for scientific domains. Unlike general-purpose language models, scientific foundation models are trained on domain-specific corpora including molecular structures, protein sequences, crystallographic databases, and experimental measurement datasets. Examples include Meta's ESM-2 for protein sequences, Google DeepMind's GNoME for crystal structures, and Microsoft's MatterGen for generative materials design. These models encode learned representations of physical and chemical relationships that enable rapid screening, property prediction, and generative design of novel compounds and materials.
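To make these learned representations concrete, here is a minimal sketch of embedding a protein sequence with ESM-2 via the Hugging Face transformers library. The checkpoint name `facebook/esm2_t6_8M_UR50D` (the smallest public ESM-2 variant) and the example sequence are illustrative assumptions; the pooled embedding could feed a downstream property predictor.

```python
# Minimal sketch: embed a protein sequence with Meta's ESM-2.
# Assumes the Hugging Face `transformers` package and the public
# `facebook/esm2_t6_8M_UR50D` checkpoint; names may change over time.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence

with torch.no_grad():
    inputs = tokenizer(sequence, return_tensors="pt")
    outputs = model(**inputs)

# Mean-pool per-residue embeddings into a fixed-length vector that a
# downstream model (e.g., a stability or binding regressor) can consume.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```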
Autonomous Laboratories combine AI-driven experimental design with robotic instrumentation to execute closed-loop research cycles without continuous human intervention. The AI system proposes experiments, robotic platforms execute them, sensors measure outcomes, and the AI updates its models and proposes the next experimental iteration. This approach compresses research cycles from weeks to hours and enables exhaustive exploration of parameter spaces that human researchers would never attempt manually.
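The closed loop is easiest to see in code. Below is a minimal sketch of the ask/tell pattern using scikit-optimize; `run_robot_experiment` is a hypothetical stand-in for the robotic synthesis and measurement step, and the two-parameter search space is invented for illustration.

```python
# Minimal sketch of autonomous-laboratory closed-loop logic using
# scikit-optimize's ask/tell interface. Real systems add safety checks,
# batched experiments, and failure handling.
from skopt import Optimizer

def run_robot_experiment(temperature_c, precursor_ratio):
    """Hypothetical: dispatch one synthesis run and return the measured
    figure of merit (e.g., ionic conductivity). Mocked for illustration."""
    return -((temperature_c - 450) ** 2 / 1e4 + (precursor_ratio - 0.6) ** 2)

# Search space: synthesis temperature (deg C) and precursor molar ratio.
opt = Optimizer(dimensions=[(200.0, 800.0), (0.1, 1.0)], base_estimator="GP")

for iteration in range(25):
    params = opt.ask()                      # AI proposes the next experiment
    result = run_robot_experiment(*params)  # robot executes, sensors measure
    opt.tell(params, -result)               # model updates (skopt minimizes)

best_objective = -min(opt.yi)  # undo the sign flip used for minimization
print("best observed objective:", best_objective)
```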
AI-Guided Simulation uses machine learning to accelerate physics-based simulations by orders of magnitude. Neural network potentials replace expensive quantum mechanical calculations with fast, approximate predictions that maintain near-quantum accuracy. This enables molecular dynamics simulations of millions of atoms over nanosecond timescales, making it practical to screen billions of candidate molecules for properties such as catalytic activity, thermal stability, or carbon capture affinity.
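A toy sketch of the surrogate principle, in PyTorch with synthetic data: fit a fast network to expensive reference calculations, then screen candidates at negligible cost. Production ML potentials use equivariant graph networks with per-atom energies and forces; this simplified descriptor-to-energy version only illustrates the economics.

```python
# Toy sketch of the surrogate idea behind neural network potentials: train a
# fast model against (hypothetically expensive) reference calculations, then
# use it for bulk screening. All data here is synthetic for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Pretend each row is a 32-dim descriptor of an atomic configuration and
# each label is a reference energy from a costly DFT run.
descriptors = torch.randn(1024, 32)
dft_energies = descriptors.pow(2).sum(dim=1, keepdim=True)  # stand-in physics

surrogate = nn.Sequential(
    nn.Linear(32, 128), nn.SiLU(),
    nn.Linear(128, 128), nn.SiLU(),
    nn.Linear(128, 1),
)
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(surrogate(descriptors), dft_energies)
    loss.backward()
    optimizer.step()

# Once trained, the surrogate evaluates huge candidate sets in seconds,
# versus hours per candidate for the quantum mechanical reference method.
candidates = torch.randn(100_000, 32)
with torch.no_grad():
    predicted = surrogate(candidates)
print("screened", len(predicted), "candidates; min predicted energy:",
      predicted.min().item())
```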
Scientific Data Infrastructure encompasses the databases, APIs, data standards, and compute platforms that underpin AI-driven research. Key resources include the Materials Project (containing computed properties for 150,000+ inorganic materials), the Cambridge Structural Database (1.2 million crystal structures), PubChem (115 million chemical compounds), and national laboratory high-performance computing allocations. Access to curated, standardized scientific data is the primary bottleneck for most AI-for-science applications.
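Programmatic access typically looks like the sketch below, which queries the Materials Project through its `mp_api` client for stable oxides in a photocatalysis-relevant band-gap window. The client requires a free API key, and its method names have shifted across versions, so treat the call signatures as illustrative rather than definitive.

```python
# Illustrative sketch of querying the Materials Project via the `mp_api`
# client. Assumes a free API key; method names vary across client versions.
from mp_api.client import MPRester

with MPRester("YOUR_API_KEY") as mpr:
    # Query computed properties: stable oxygen-containing materials with a
    # band gap in a range roughly suited to photocatalysis (1.5-3.0 eV).
    docs = mpr.materials.summary.search(
        band_gap=(1.5, 3.0),
        is_stable=True,
        elements=["O"],
        fields=["material_id", "formula_pretty", "band_gap"],
    )

for doc in docs[:5]:
    print(doc.material_id, doc.formula_pretty, doc.band_gap)
```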
Value Pool Mapping
Where Value Concentrates
The AI-for-scientific-discovery value chain has four distinct layers, each with different economics and competitive dynamics.
Layer 1: Compute and Cloud Infrastructure (25 to 30% of total value)
The most capital-intensive layer is dominated by hyperscalers. AWS, Google Cloud, and Microsoft Azure collectively control 65% of scientific computing workloads that have migrated to the cloud. NVIDIA captures approximately 80% of GPU revenue for scientific AI training and inference. These players benefit from massive scale economies and lock-in effects: once research groups build workflows around specific cloud platforms and GPU architectures, switching costs are prohibitive. Annual revenue from scientific AI compute exceeded $4 billion in North America alone in 2025.
Layer 2: Foundation Models and Platforms (20 to 25% of total value)
Companies and labs that develop and host domain-specific foundation models capture substantial value through API access fees, licensing, and premium features. Google DeepMind's AlphaFold and GNoME, Meta's ESM suite, and Microsoft's scientific AI tools represent the current frontier. Emerging commercial players include Recursion Pharmaceuticals (biology), Citrine Informatics (materials science), and Kebotix (chemistry). Platform providers benefit from network effects: as more researchers use and fine-tune models, the platforms accumulate proprietary data and feedback that improve model quality, creating defensible competitive positions.
Layer 3: Specialized Applications and Vertical Solutions (15 to 20% of total value)
Companies that translate general-purpose AI capabilities into domain-specific workflows for particular industries capture a smaller but growing share of value. Examples include Schrödinger (computational chemistry for pharma), Exscientia (AI drug design), and Aionics (battery materials screening). These companies add value through domain expertise, proprietary training data from industry partnerships, and integration with existing R&D workflows. Margins are typically 40 to 60%, reflecting the specialized knowledge required but also the smaller addressable markets compared to platform plays.
Layer 4: Discovery and IP Generation (10 to 15% of total value)
Research institutions, national laboratories, and university spin-outs that generate actual scientific discoveries capture the smallest share of economic value despite creating the foundational knowledge that the entire ecosystem monetizes. This structural undervaluation reflects several factors: the public-good nature of basic research, the long timelines between discovery and commercialization (typically 7 to 15 years for materials science, 10 to 20 years for drug development), and the difficulty of patenting AI-generated discoveries under current intellectual property frameworks.
Who Is Capturing Value Today
Big Tech dominates infrastructure and platform layers. Google DeepMind, Microsoft Research, and Meta AI collectively invested over $3 billion in AI-for-science R&D in 2025. Their strategy is to provide scientific AI capabilities as loss leaders or open-source tools that drive adoption of their commercial cloud platforms. AlphaFold is free, but the compute required to run AlphaFold-scale predictions on custom datasets generates substantial cloud revenue.
Venture-backed startups compete in the application layer. Between 2022 and 2025, North American AI-for-science startups raised $8.7 billion in venture funding, according to PitchBook. The largest rounds went to Recursion ($500 million Series D), Insilico Medicine ($395 million Series D), and Exscientia ($525 million in cumulative funding). However, startup economics are challenging: customer acquisition costs are high in scientific markets, sales cycles are long (12 to 24 months for enterprise R&D contracts), and competition from free academic tools and big tech loss leaders compresses pricing power.
National laboratories serve as critical infrastructure. The US DOE's 17 national laboratories operate 28 major user facilities including the world's fastest supercomputers, synchrotron light sources, and neutron scattering instruments. These facilities generate the experimental data that trains AI models, provide compute resources that supplement commercial cloud capacity, and serve as neutral convening platforms where industry, academia, and government collaborate. The 2025 DOE AI for Science initiative explicitly aims to strengthen this public infrastructure role.
What's Working
AI-Accelerated Materials Discovery at Scale
Google DeepMind's GNoME project screened 2.2 million candidate crystal structures and predicted 381,000 novel stable materials, expanding the known stable inorganic materials database by an order of magnitude. Lawrence Berkeley National Laboratory's A-Lab subsequently used robotic synthesis to create 41 of 58 targeted AI-predicted materials in 17 days, a 71% success rate that validated the computational predictions. This end-to-end pipeline, from AI prediction to robotic synthesis to experimental validation, represents the most mature demonstration of autonomous scientific discovery to date. For sustainability, the identified materials include novel solid-state electrolytes for lithium-ion batteries and new photocatalysts for solar fuel production.
Protein Engineering for Industrial Biotechnology
AI-driven protein engineering is delivering commercial results in industrial sustainability applications. Solugen, based in Houston, uses AI-designed enzymes to produce commodity chemicals from plant-derived sugars rather than petroleum feedstocks, achieving a 90% reduction in carbon emissions compared to conventional chemical synthesis. Its AI platform screens millions of enzyme variants computationally before selecting candidates for experimental testing, compressing development timelines from years to months. The company's bio-formic acid production facility in Marshall, Minnesota, processes 10,000 tons per year and represents the first commercial-scale deployment of AI-designed enzymes for bulk chemical manufacturing.
Climate Model Acceleration
NVIDIA, working with academic collaborators including Caltech and Lawrence Berkeley National Laboratory, developed FourCastNet, a neural network weather prediction model that generates global forecasts roughly 45,000 times faster than traditional numerical weather prediction while maintaining comparable accuracy for short-range predictions. Google DeepMind's GenCast subsequently demonstrated superior probabilistic weather forecasting using diffusion models. For climate science, these approaches enable ensemble simulations that characterize uncertainty across thousands of scenarios, something previously impossible due to computational constraints. The National Oceanic and Atmospheric Administration (NOAA) began operational testing of AI weather models in 2025, with plans to integrate them into the National Weather Service forecast pipeline.
What's Not Working
Reproducibility and Validation Gaps
A 2024 Nature survey found that only 34% of AI-for-science papers provided sufficient code and data for independent reproduction. The problem is particularly acute in materials science, where AI models trained on computational databases (which contain idealized, zero-temperature calculations) frequently fail when predictions are tested against experimental measurements conducted under real-world conditions. The gap between computational prediction accuracy (often reported at 85 to 95%) and experimental validation rates (typically 40 to 70%) undermines confidence in AI-generated discoveries and slows commercialization.
Intellectual Property Uncertainty
Current US patent law requires human inventorship, creating ambiguity about the patentability of AI-generated discoveries. The US Patent and Trademark Office's 2024 guidance clarified that AI-assisted inventions are patentable when a human makes a "significant contribution," but the threshold remains contested. Several high-profile patent applications for AI-discovered materials have been challenged, creating uncertainty that chills investment in AI-first discovery programs. Companies are responding by emphasizing human involvement in the discovery process, sometimes artificially, to satisfy inventorship requirements.
Data Access Bottlenecks
Despite the open-science ethos of the AI research community, critical scientific datasets remain locked behind institutional paywalls, proprietary databases, or incompatible formats. The Materials Project provides free access to computed properties, but experimental measurement databases (such as ICSD for crystal structures or Reaxys for chemical reactions) charge annual subscription fees of $50,000 to $500,000 that exclude smaller organizations. National laboratory user facilities generate petabytes of experimental data annually, but only 10 to 15% is curated and published in machine-readable formats suitable for AI training.
Myths vs. Reality
Myth 1: AI will replace human scientists within a decade
Reality: AI excels at pattern recognition, screening, and optimization within defined parameter spaces. It does not formulate novel hypotheses, design experimental frameworks, or interpret results in broader scientific context. The most productive research groups use AI to amplify human creativity, not substitute for it. A 2025 MIT study found that AI-augmented research teams produced 2.4 times more high-impact publications than either AI-only or human-only approaches, but only when researchers had sufficient AI literacy to direct the tools effectively.
Myth 2: Open-source AI models eliminate competitive advantages
Reality: While foundation models like AlphaFold and ESM-2 are publicly available, competitive advantage increasingly resides in proprietary fine-tuning data, domain-specific workflows, and integration with experimental infrastructure. Companies with exclusive access to high-quality training data from industrial R&D partnerships maintain significant performance advantages over organizations using only public datasets. The model itself is necessary but not sufficient; the data moat matters more.
Myth 3: AI-for-science investments deliver quick commercial returns
Reality: The timeline from AI-generated discovery to commercial product remains long, typically 5 to 15 years for materials and chemicals. While AI compresses the discovery phase by 50 to 80%, the subsequent stages of process development, scale-up, regulatory approval, and market development are not significantly accelerated by AI. Investors should calibrate return expectations to total development timelines, not just the AI-accelerated discovery component.
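A quick worked example (with assumed, mid-range numbers consistent with the figures above) shows why headline discovery speedups translate into modest end-to-end gains:

```python
# Worked illustration: discovery-phase acceleration moves total timelines
# less than headline figures suggest. Numbers are assumptions consistent
# with the ranges quoted above, not measured data.
discovery_years = 4.0    # conventional discovery/screening phase
downstream_years = 8.0   # scale-up, regulation, market development
ai_compression = 0.75    # AI compresses discovery by 75% (mid-range)

baseline_total = discovery_years + downstream_years
ai_total = discovery_years * (1 - ai_compression) + downstream_years

print(f"baseline: {baseline_total:.0f} years, with AI: {ai_total:.0f} years")
# baseline: 12 years, with AI: 9 years -> a 75% faster discovery phase
# shortens the end-to-end timeline by only about 25%.
```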
Myth 4: More compute always yields better scientific AI results
Reality: Scaling laws for scientific AI differ from language models. Beyond certain thresholds, additional compute yields diminishing returns without corresponding improvements in training data quality and diversity. The 2025 DOE assessment found that data curation and standardization investments yielded 3 to 5 times greater performance improvements per dollar than equivalent investments in additional compute for most scientific AI applications.
Key Players
Established Leaders
Google DeepMind leads in scientific foundation models with AlphaFold (proteins), GNoME (materials), and GenCast (weather), backed by Google Cloud's compute infrastructure.
Microsoft Research operates AI for Science divisions focused on molecular simulation, climate modeling, and scientific reasoning, integrated with Azure cloud services.
NVIDIA provides both hardware (A100, H100 GPUs) and software (CUDA, cuDNN, Modulus for physics-informed AI) that form the computational backbone of scientific AI.
Emerging Startups
Recursion Pharmaceuticals (Salt Lake City) maintains what it describes as the world's largest proprietary biological dataset, generated through automated cell biology experiments, and uses AI to discover drugs for rare diseases and oncology.
Citrine Informatics (Redwood City) provides a materials informatics platform used by Panasonic, BASF, and other manufacturers to accelerate development of batteries, polymers, and specialty chemicals.
Aionics (Berkeley) specializes in AI-driven battery electrolyte discovery, partnering with major automotive OEMs on next-generation lithium-ion and solid-state battery formulations.
Key Investors and Funders
US Department of Energy committed $1.8 billion through 2030 for AI-accelerated clean energy research across national laboratories.
ARCH Venture Partners has deployed over $1 billion in AI-for-science companies, with particular emphasis on biological sciences and materials discovery.
Lux Capital focuses on deep technology investments including AI-driven scientific discovery, with notable investments in autonomous laboratory and computational chemistry companies.
Action Checklist
- Map your organization's position in the AI-for-science value chain and identify whether you are generating, enabling, or consuming AI-driven discoveries
- Assess data assets: proprietary experimental datasets are the primary source of competitive advantage in AI-for-science
- Evaluate build vs. buy decisions for AI capabilities, recognizing that foundation models are increasingly commoditized while domain-specific fine-tuning is not
- Establish partnerships with national laboratories and user facilities for access to experimental infrastructure and curated datasets
- Develop IP strategies that account for AI inventorship uncertainties, ensuring human involvement is documented throughout discovery processes
- Invest in AI literacy for research staff, as human-AI collaboration outperforms either approach alone
- Calibrate investment return expectations to full commercialization timelines (5 to 15 years), not just discovery acceleration
- Monitor foundation model releases from big tech, as free tools may commoditize capabilities you are building in-house
FAQ
Q: Where should sustainability-focused organizations invest in AI for scientific discovery? A: The highest-return opportunities for sustainability organizations are in the application layer, where domain expertise creates defensible positions. Specific opportunities include AI-accelerated materials screening for clean energy technologies (battery materials, carbon capture sorbents, catalysts for green hydrogen), AI-guided protein engineering for industrial biotechnology, and climate model acceleration for risk assessment. Avoid competing directly with big tech in the infrastructure or platform layers unless you have unique data assets that create a moat.
Q: How should we evaluate AI-for-science vendors and partners? A: Demand experimental validation rates, not just computational prediction accuracy. Ask for the percentage of AI-predicted materials or compounds that have been physically synthesized and confirmed to have predicted properties. Top performers demonstrate 60 to 70% experimental validation rates; claims above 80% warrant skepticism. Also evaluate data provenance: models trained exclusively on computational databases without experimental calibration have systematically different failure modes than models incorporating experimental measurements.
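One practical check, sketched below with hypothetical numbers: compute the validation rate a vendor reports together with a confidence interval, since small synthesis campaigns make headline percentages far less informative than they appear.

```python
# Sketch for stress-testing a vendor's validation claim: the observed
# experimental validation rate plus a 95% Wilson score interval. A headline
# "85% validated" from 20 attempts is weaker evidence than it sounds.
# The campaign numbers below are hypothetical.
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / trials
                                    + z**2 / (4 * trials**2))
    return center - half_width, center + half_width

validated, attempted = 17, 20          # vendor-reported synthesis campaign
low, high = wilson_interval(validated, attempted)
print(f"observed rate: {validated / attempted:.0%}, "
      f"95% CI: {low:.0%} to {high:.0%}")   # roughly 64% to 95%
```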
Q: What is the realistic timeline for AI-discovered materials to reach commercial deployment? A: For sustainability applications, expect 3 to 5 years from AI discovery to pilot-scale validation and 7 to 15 years to commercial-scale deployment. AI typically compresses the discovery and screening phase by 50 to 80%, saving 2 to 5 years. However, process development, scale-up engineering, supply chain establishment, and (where applicable) regulatory approval proceed at conventional timescales. The fastest paths to market are in applications where existing manufacturing infrastructure can accommodate new materials with minimal process changes.
Q: How is value likely to shift across the AI-for-science ecosystem over the next five years? A: We expect three structural shifts: (1) compute and platform layers will see margin compression as competition intensifies between hyperscalers and open-source alternatives proliferate; (2) the application layer will consolidate around companies with the strongest proprietary data assets and deepest industry partnerships; and (3) autonomous laboratory capabilities will create a new value layer at the intersection of AI and physical experimentation, with early movers establishing barriers through specialized hardware and workflow integration. Organizations that combine AI capabilities with unique experimental data will capture disproportionate value.
Q: What role do national laboratories play, and how can private organizations access their capabilities? A: US national laboratories are essential partners for AI-for-science initiatives. The 17 DOE laboratories provide access to supercomputing (through INCITE and ALCC allocation programs), synchrotron light sources, neutron scattering facilities, and specialized instrumentation. Private companies can access these resources through cooperative research and development agreements (CRADAs), Strategic Partnership Projects, and user facility proposals. Application processes are competitive but accessible; approximately 6,000 external researchers use DOE user facilities annually. For sustainability-focused companies, the ARPA-E OPEN programs and Office of Energy Efficiency and Renewable Energy (EERE) funding opportunities specifically target industry-lab collaborations.
Sources
- McKinsey & Company. (2025). The State of AI in Scientific Research: Market Sizing and Value Chain Analysis. New York: McKinsey Global Institute.
- US Department of Energy. (2025). AI for Science Initiative: Strategic Plan and Investment Roadmap. Washington, DC: DOE Office of Science.
- Nature Editorial Board. (2024). Reproducibility in AI-Driven Science: A Systematic Assessment. Nature, 625, 14-18.
- PitchBook. (2025). AI for Drug Discovery and Materials Science: Venture Capital Trends Q4 2024. Seattle: PitchBook Data.
- Google DeepMind. (2024). Scaling Deep Learning for Materials Discovery: GNoME Technical Report. Available at: https://deepmind.google/discover/blog/
- Lawrence Berkeley National Laboratory. (2024). A-Lab: Autonomous Materials Synthesis and Validation. Berkeley, CA: LBNL.
- MIT Technology Review. (2025). AI-Augmented Research Teams: Productivity Analysis Across 500 Laboratories. Cambridge, MA: MIT.