Tahoe, Arc Institute, and Biohub Collaborate to Release Largest Open Dataset for Virtual Cell Modeling

by Anastasiia Rohozianska   •   Jan. 13, 2026

Tahoe Therapeutics (formerly Vevo), Arc Institute, and CZI’s Biohub have announced a joint initiative to generate what is reportedly the largest and most perturbation-rich single-cell dataset available for virtual cell model development. The dataset will comprise over 120 million single-cell profiles and 225,000 drug–patient perturbation interactions, and it will be released as open source in line with a shared open science commitment.

The collaboration merges Tahoe’s Mosaic platform with Arc’s scBaseCount and Biohub’s CELLxGENE datasets, extending previous efforts that have powered foundational AI models such as STATE, Tahoe-x1, and TranscriptFormer. The new dataset is expected to be more than four times richer in perturbation diversity than Tahoe-100M, a 2025 release that has seen over 250,000 downloads and broad adoption for model training and benchmarking.

"With 50 cell lines in each Mosaic pool, we measure how various cell lines, with varying baseline genetics and gene expression, respond to various treatments. And using a cell village means we eliminate batch effects and allow extreme parallelization."

Image credit: Tahoe Therapeutics

All three organizations are active developers of virtual cell technologies—machine learning systems trained to simulate cellular behavior, drug response, and disease biology:

  • Tahoe Therapeutics develops AI-based models of human cells using large-scale perturbative single-cell datasets; 
  • Arc Institute focuses on multi-omics and computational biology to investigate disease mechanisms; 
  • Biohub builds AI-integrated biological measurement platforms to study and reprogram cellular systems at scale.

This marks the first time major contributors in the field have pooled resources to create a foundational training corpus at this scale. 

Data will be generated using Tahoe’s Mosaic high-throughput platform, and the full dataset will be accessible via the Arc Virtual Cell Atlas following internal access and integration. According to public statements, the project is backed by a multimillion-dollar commitment from all three groups.

The initiative reflects broader institutional efforts to align large-scale biological experimentation with modern AI research infrastructure, aiming to accelerate development of generalizable in silico cell models for biomedical discovery.

In June 2025, Arc Institute ran the Virtual Cell Challenge, an open AI benchmarking competition focused on predicting cellular responses to genetic perturbations. The challenge, published in Cell, offered $175,000 in prizes for models that can generalize predictions from one set of cell types to held-out types, using high-depth single-cell RNA sequencing data. The competition was backed by NVIDIA, 10x Genomics, and Ultima Genomics.
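
To make the idea of generalizing to held-out cell types concrete, here is a minimal Python sketch of a split in which the test cell types never appear in training. It is an illustration only, not the challenge's actual evaluation protocol: the function name `split_by_cell_type`, the record fields, and the toy labels are all hypothetical.

```python
import random


def split_by_cell_type(records, n_held_out, seed=0):
    """Split records so that held-out cell types never appear in training.

    `records` is a list of dicts with a "cell_type" key (hypothetical schema).
    """
    cell_types = sorted({r["cell_type"] for r in records})
    if not 0 < n_held_out < len(cell_types):
        raise ValueError("n_held_out must leave at least one training cell type")

    # Choose the held-out cell types reproducibly from the sorted list.
    rng = random.Random(seed)
    held_out = set(rng.sample(cell_types, n_held_out))

    train = [r for r in records if r["cell_type"] not in held_out]
    test = [r for r in records if r["cell_type"] in held_out]
    return train, test, held_out


# Toy data: 4 made-up cell types x 3 made-up perturbation labels.
records = [
    {"cell_type": ct, "perturbation": p}
    for ct in ("type_A", "type_B", "type_C", "type_D")
    for p in ("perturbation_1", "perturbation_2", "control")
]

train, test, held_out = split_by_cell_type(records, n_held_out=1)
train_types = {r["cell_type"] for r in train}
test_types = {r["cell_type"] for r in test}
print(f"held out: {sorted(held_out)}, train={len(train)}, test={len(test)}")
```

Splitting by cell type rather than by individual cell means a model is scored only on cell types it has never seen, which is the kind of generalization the challenge description refers to.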

For further context on the broader landscape of AI-powered virtual cell modeling—including foundation models, data infrastructure, and the evolution of in silico experimentation—see our deep dive, Building the Virtual Cell: AI Foundation Models and Billion-Cell Datasets.
