SandboxAQ Releases Public Database Containing 5.2 Million Synthetic Protein-Ligand Structures
SandboxAQ has launched what it claims is the largest public dataset of protein-ligand complexes with associated binding affinity data. The database, called SAIR (Structurally Augmented IC50 Repository), contains approximately 5.2 million synthetic 3D structures covering over 1 million protein-ligand systems, designed to support the development and benchmarking of AI models in drug discovery.
Context
Spun out of Alphabet in 2022, SandboxAQ operates at the intersection of AI and quantum simulation, with a dedicated biopharma arm known as AQBioSim. Backed by over $300 million in funding (~$5B valuation) and partnerships with AstraZeneca, Sanofi, and the Michael J. Fox Foundation, the company develops Large Quantitative Models (LQMs) to accelerate molecular design and biological simulation.
The Database
The resource combines data from public affinity datasets such as BindingDB and ChEMBL with predicted 3D complex structures generated using Boltz-1, a co-folding foundation model (its successor, Boltz-2, was released recently). Rather than relying on a single prediction, SAIR includes multiple candidate structures for each protein-ligand pair to better represent structural uncertainty. The candidates were then filtered using affinity prediction models, retaining only those consistent with experimentally observed binding data.
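As a rough illustration of that filtering step, the Python sketch below keeps only candidate structures whose predicted affinity falls close to the measured value. It is not SandboxAQ's pipeline: the Candidate record, the filter_candidates function, the one-log-unit tolerance, and the example values are all hypothetical, and the company has not described its matching criterion in this detail. The only standard piece is the conversion from IC50 to pIC50.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one predicted structure of a protein-ligand pair."""
    pair_id: str              # illustrative pair identifier, not a SAIR field name
    pose_id: int              # index of the candidate structure for that pair
    predicted_ic50_nm: float  # affinity predicted from the structure, in nM


def pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return 9.0 - math.log10(ic50_nm)


def filter_candidates(candidates, measured_ic50_nm, tolerance=1.0):
    """Keep candidates whose predicted pIC50 lies within `tolerance` log units
    of the measured pIC50. The tolerance value is an assumption for this sketch."""
    kept = []
    for c in candidates:
        measured = measured_ic50_nm.get(c.pair_id)
        if measured is None:
            continue  # no experimental value to compare against
        if abs(pic50(c.predicted_ic50_nm) - pic50(measured)) <= tolerance:
            kept.append(c)
    return kept


# Made-up example values, for illustration only.
candidates = [
    Candidate("P1-L1", 0, 120.0),
    Candidate("P1-L1", 1, 50000.0),
    Candidate("P2-L7", 0, 8.0),
]
measured = {"P1-L1": 100.0, "P2-L7": 10.0}
kept = filter_candidates(candidates, measured)
print([(c.pair_id, c.pose_id) for c in kept])
```

Comparing on the pIC50 scale rather than on raw IC50 values is a common convention, since measured affinities span several orders of magnitude.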
SAIR is intended to address one of the central bottlenecks in AI-driven drug discovery: the lack of large, high-quality datasets that pair protein-ligand 3D structures with experimentally measured binding affinities. Existing tools like AlphaFold and OpenFold focus primarily on protein structures, while most machine learning models for binding prediction are trained on 2D chemical representations or sequences because paired structure-affinity data is scarce.
The SAIR dataset was built using SandboxAQ’s Large Quantitative Model (LQM) infrastructure and generated on Nvidia DGX Cloud, with workflow-level optimization delivering a 2x improvement in GPU utilization, according to the company.
The dataset is freely available under a CC BY-NC-SA 4.0 license for non-commercial use. Commercial use is also permitted at no cost, subject to submission of a usage request. The developers believe that AI models trained on SAIR could predict binding affinities up to 1,000x faster than physics-based simulation methods.
According to SandboxAQ, SAIR is expected to support applications such as training new structural AI models, benchmarking biofoundation models, and calibrating affinity predictors. The team is also exploring potential extensions of the resource, including the generation of parallel datasets beyond small molecules, as part of a broader roadmap toward whole-cell modeling.
In October 2024, SandboxAQ announced a collaboration with Sanofi to identify biomarkers during clinical development using its LQMs. The project uses causal filtration of biomedical knowledge graphs to surface mechanistically supported hypotheses from literature and data. The goal is to support identification of safety and efficacy biomarkers, especially in later-stage development. According to SandboxAQ, these models avoid the scale and precision constraints of natural language LLMs by relying on internally generated synthetic training data.