Tahoe, Arc Institute, and Biohub Collaborate to Release Largest Open Dataset for Virtual Cell Modeling
Tahoe Therapeutics (formerly Vevo), Arc Institute, and CZI’s Biohub have announced a joint initiative to generate what is reportedly the largest and most perturbation-rich single-cell dataset available for virtual cell model development. The dataset will comprise over 120 million single-cell profiles and 225,000 drug–patient perturbation interactions and will be released open source as part of a shared open science commitment.
The collaboration merges Tahoe’s Mosaic platform with Arc’s scBaseCount and Biohub’s CELLxGENE datasets, extending previous efforts that have powered foundational AI models such as STATE, Tahoe-x1, and TranscriptFormer. The new dataset is expected to be more than four times richer in perturbation diversity than Tahoe-100M, a 2025 release that has seen over 250,000 downloads and broad adoption for model training and benchmarking.

"With 50 cell lines in each Mosaic pool, we measure how various cell lines, with varying baseline genetics and gene expression, respond to various treatments. And using a cell village means we eliminate batch effects and allow extreme parallelization."
Image credit: Tahoe Therapeutics
All three organizations are active developers of virtual cell technologies—machine learning systems trained to simulate cellular behavior, drug response, and disease biology:
- Tahoe Therapeutics develops AI-based models of human cells using large-scale perturbative single-cell datasets;
- Arc Institute focuses on multi-omics and computational biology to investigate disease mechanisms;
- Biohub builds AI-integrated biological measurement platforms to study and reprogram cellular systems at scale.
This marks the first time major contributors in the field have pooled resources to create a foundational training corpus at this scale.
Data will be generated using Tahoe’s Mosaic high-throughput platform, and the full dataset will be made accessible via the Arc Virtual Cell Atlas after a period of internal access and integration. According to public statements, the project is backed by a multimillion-dollar commitment from all three groups.
The initiative reflects broader institutional efforts to align large-scale biological experimentation with modern AI research infrastructure, aiming to accelerate development of generalizable in silico cell models for biomedical discovery.
In June 2025, Arc Institute launched the Virtual Cell Challenge, an open AI benchmarking competition focused on predicting cellular responses to genetic perturbations. The challenge, described in a paper published in Cell, offered $175,000 in prizes for models that can generalize predictions from one set of cell types to held-out types, using high-depth single-cell RNA sequencing data. The competition was backed by NVIDIA, 10x Genomics, and Ultima Genomics.
For further context on the broader landscape of AI-powered virtual cell modeling—including foundation models, data infrastructure, and the evolution of in silico experimentation—see our deep dive, Building the Virtual Cell: AI Foundation Models and Billion-Cell Datasets.