Into the Unknown — How Artificial Intelligence Can Help Biotech Companies Chart the Dark Genome

With DeepMind’s release of AlphaGenome, the so-called “dark genome” has taken center stage in biotech news. AlphaGenome is a unified, “all-in-one” deep learning model for genome exploration that predicts multimodal properties and variant effects, in both coding and non-coding DNA, from up to 1 million base pairs of DNA sequence. In the preprint first published on June 25, 2025, the authors showed that AlphaGenome outperforms current state-of-the-art models in the majority of tasks and holds great potential to improve the interpretation of non-coding variants.

But what exactly lies within these dark, or non-coding, regions that make up 98% of our 3.1 billion–letter genome?

For some years, a number of biotech companies and their pharma partners have been exploring the parts of the genome that go beyond the roughly 20,000 human genes mapped by the Human Genome Project in the early 2000s. Their quest to uncover new disease mechanisms and drug targets within this vast non-coding space has been enabled by both advanced sequencing technologies, which make the dark genome accessible, and artificial intelligence algorithms, which help assign function to these previously uncharted regions.

Driving discovery: How technology helps transform disease understanding

Genetic variation is linked to disease in both subtle and more direct ways. In the rare disease field in particular, 80% of the more than 6,000 rare diseases can likely be attributed to genetic causes. Estimates suggest that around 90% of disease-relevant genetic variation lies in the dark, or non-coding, areas of the genome.

Traditional next-generation sequencing (NGS), which relies on complex algorithms to reconstruct sequences based on short reads (50–300 base pairs), works well for unique regions but struggles with repetitive, ambiguous regions typical of the dark genome. In contrast, long-read sequencing generates much longer reads—often several kilobases or more—providing greater continuity and enabling accurate mapping of challenging, previously inaccessible genomic regions.
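The mapping problem described above can be illustrated with a minimal Python sketch. It assumes a toy reference (two unique flanks around a short tandem repeat, all invented for illustration) and uses exact string matching in place of a real aligner: a short read that lies entirely inside the repeat fits at several positions, while a longer read that reaches into both flanks fits at only one.

```python
def count_placements(read: str, reference: str) -> int:
    """Count exact-match start positions of `read` in `reference` (overlaps allowed)."""
    count = 0
    start = reference.find(read)
    while start != -1:
        count += 1
        start = reference.find(read, start + 1)
    return count


# Toy reference, invented for illustration: unique flank + tandem repeat + unique flank.
LEFT_FLANK = "ACGTTGCA"
REPEAT = "CAG" * 10
RIGHT_FLANK = "TTGACCGA"
reference = LEFT_FLANK + REPEAT + RIGHT_FLANK

short_read = "CAGCAGCAG"      # lies entirely within the repeat
long_read = reference[4:42]   # covers the whole repeat plus part of each flank

short_hits = count_placements(short_read, reference)
long_hits = count_placements(long_read, reference)

print(f"short read ({len(short_read)} bases): {short_hits} possible placements")
print(f"long read ({len(long_read)} bases): {long_hits} possible placement(s)")
```

Real aligners tolerate mismatches and score candidate placements rather than demanding exact matches, but the underlying ambiguity is the same: a read can only be placed uniquely if it contains sequence that occurs once in the reference.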

Research initiatives such as the telomere-to-telomere (T2T) sequencing project, completed in 2022, have implemented long-read sequencing to assess previously uncharted regions, achieving near-complete sequencing of all chromosomes. With the increased accuracy provided by the latest product releases, long-read sequencing is moving closer to becoming part of routine genetic testing. The inclusion of long-read data in analyses of disease relevance might greatly expand the detection of disease-relevant genetic variations, especially in the rare disease area.

From dark matter to drug targets: How biotechs explore the dark genome

The exploration of the dark genome has the potential to unveil a treasure trove of therapeutic opportunities across various disease areas, which has attracted a number of biotech companies and pharma partnerships to the space. Pathogenic non-coding variants are often linked to disturbances in gene regulatory elements such as promoters, enhancers, silencers, and a variety of regulatory RNA species.

Swiss biotech HAYA Therapeutics and Boston-based NextRNA Therapeutics have set their sights on long non-coding RNAs (lncRNAs) as potential therapeutic targets in cardiovascular disease and oncology, and big pharma is taking notice. HAYA, which received $65 million in Series A funding in May 2025, has a collaboration with Eli Lilly worth up to $1 billion to discover lncRNA-based drug targets for obesity and other metabolic diseases. NextRNA has entered a collaboration with Bayer for the discovery and development of small molecules targeting lncRNAs in cancer, worth up to $547 million. As initial clinical trials of drugs targeting lncRNAs begin, we will soon learn more about the validity of this new target class.

Transposable elements (TEs), which make up almost half of our genome, are another feature that has sparked the interest of multiple biotechs. TEs are current or former mobile genetic elements: DNA sequences that can change their position, or “jump”, within the genome. Activation of virus-like transposable elements, such as long interspersed nuclear elements (LINEs) and human endogenous retroviruses (HERVs), has been implicated in neurodegenerative disorders like Alzheimer’s disease and amyotrophic lateral sclerosis (ALS).

US companies Transposon Therapeutics and Rome Therapeutics are targeting the biology of LINE-1, a prominent transposon that by itself accounts for 17% of the genome. Transposon’s TPN-101, originally conceived as an antiretroviral drug for HIV-1, inhibits LINE-1 reverse transcriptase (RT). TPN-101 showed initially encouraging results in a Phase 2 trial for progressive supranuclear palsy and was fast-tracked by the FDA. Rome Therapeutics focuses on the potential inflammatory role of LINE-1 and is testing LINE-1 RT inhibition as a non-immunosuppressive therapeutic strategy in autoimmune diseases such as type I interferonopathies, for example systemic lupus erythematosus (SLE) and cutaneous lupus erythematosus (CLE). The company further aims to explore LINE-1 biology in cancer and neurodegenerative diseases.

Danish biotech HERVolution aims to harness the therapeutic potential of HERVs for cancer and aging-related neurodegenerative diseases. The company employs rationally designed HERV antigens to overcome the immune system’s “self-tolerance” and induce potent and durable anti-HERV immune responses to tackle cancer, metabolic, and other age-related diseases. However, Swiss biotech GeNeuro’s Phase 2 study in HERV-W ENV-positive patients suffering from post-COVID-19 neuropsychiatric syndromes showed no meaningful improvement with the HERV-targeting monoclonal antibody temelimab compared to placebo.

The dark genome also offers opportunities for the development of cancer immunotherapies. Evaxion targets HERVs as neoantigens for cancer immunotherapy, leveraging its AI‑Immunology™ platform to identify ERVs reactivated in cancer cells and to design precision cancer vaccines. Similarly, Enara Bio uses its proprietary EDAPT® platform to discover novel “Dark Antigens”—peptide–HLA targets derived from non-coding genomic regions in tumors—to develop TCR-directed immunotherapies and therapeutic vaccines.

Biotech (location)	Therapeutic area focus	Dark genome element targeted	Recent funding events	Pharma partnerships / collaborations
Enara Bio (Oxford, UK)	Cancer immunotherapy (TCR-based)	“Dark antigens” from non-coding/transcribed regions	Series B $32.5M (Oct 2024), backed by Pfizer and Merck KGaA	Boehringer Ingelheim; GWU collaboration
Rome Therapeutics (Boston, MA, USA)	Autoimmune, cancer, neurodegeneration	Repetitive non-coding elements (LINEs, SINEs, HERVs)	Series B extension $72M (Sep 2023), backed by J&J and BMS	-
Transposon Therapeutics (San Diego, CA, USA)	Neurodegenerative diseases, aging	LINE-1 reverse transcriptase in transposons / repeat elements	$4.68M	-
HERVolution Therapeutics (Copenhagen, Denmark)	Cancer, metabolic & aging therapeutics	HERV (human endogenous retrovirus) antigens	Series A €11.7M (Dec 2024), backed by Serum Institute of India and the European Innovation Council (EIC) Fund	-
Evaxion (Hørsholm, Denmark)	Cancer & infectious disease vaccines	Endogenous retroviruses (ERVs), repetitive elements	Public offering $10.8M (Jan 2025)	MSD/Merck: EVX-B2/B3 options, clinical trial collaboration on EVX-01 with Keytruda; Afrigen/WHO; Gates Foundation
GeNeuro (Geneva, Switzerland)	MS, ALS, long COVID (autoimmune/neurology)	HERV-derived envelope proteins	Debt-restructuring moratorium (May 2025)	NINDS/NIH collaboration
Nucleome Therapeutics (Oxford, UK)	Autoimmune, rare, precision medicine	3D non-coding genomic regulatory variants	Series A £37.5M (~$47M) (Oct 2022), backed by Pfizer, Merck KGaA and J&J	Strategic partnership with J&J
Lucid Genomics (Berlin, Germany)	Rare disease diagnostics	Non-coding regulatory DNA variants	Pre-seed €1.3M (Sep 2022)	-
NextRNA Therapeutics (Boston, MA, USA)	Multi-indication via lncRNA therapies	Long non-coding RNAs (lncRNAs)	Series A $46.8M (Mar 2022)	Collaboration with Bayer (up to $547M)
HAYA Therapeutics (Lausanne, Switzerland)	Heart failure, fibrosis, obesity	Long non-coding RNAs (lncRNAs)	Series A $65M (May 2025), backed by Sofinnova Partners and Eli Lilly	Eli Lilly partnership (up to $1B)
Amaroq Therapeutics (Auckland, New Zealand)	Cancer	Long non-coding RNAs (lncRNAs) in tumors	Seed $14M (Oct 2023)	-
Flamingo Therapeutics (Leuven, Belgium & San Diego, USA)	Oncology	lncRNAs (e.g. MALAT1)	€1.7M grant (2023)	Alliance with Ionis Pharma

Illuminating the dark genome – How AI / ML can aid interpretation of non-coding DNA

While our understanding of the dark genome and its links to disease is growing, the interpretability of genetic variation within those unmapped genetic territories remains challenging—due to the sheer amount of dark DNA and the often more subtle outcomes of variation in regulatory regions compared to protein-coding regions.

Various AI models for genome exploration have been developed, including, most recently, DeepMind’s AlphaGenome, which predicts multimodal effects of single-nucleotide variants within the non-coding parts of the genome. A number of biotech companies, such as Nucleome, Lucid Genomics, HAYA, and Evaxion, also employ artificial intelligence and machine learning (AI/ML) algorithms to better explore and exploit genetic elements within the dark genome.

For example, Nucleome’s platform interprets single-nucleotide polymorphisms (SNPs) in a cell-type-specific manner, aiming to discover new targets and biomarkers with an initial focus on autoimmune diseases. Nucleome will explore these further through a strategic partnership with J&J that started in October 2024.
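One common way sequence-based models score a single-nucleotide variant is to predict an output for the reference sequence, predict again with the alternate base substituted, and take the difference. The Python sketch below illustrates only that pattern. The predictor `toy_activity` is a hypothetical stand-in that measures similarity to a single invented motif; it is not AlphaGenome’s or any company’s actual model or API.

```python
MOTIF = "TATAAA"  # hypothetical regulatory motif, chosen only for illustration


def toy_activity(sequence: str) -> float:
    """Stand-in for a learned model: best fraction of motif positions matched in any window."""
    best = 0
    for i in range(len(sequence) - len(MOTIF) + 1):
        window = sequence[i:i + len(MOTIF)]
        matches = sum(a == b for a, b in zip(window, MOTIF))
        best = max(best, matches)
    return best / len(MOTIF)


def apply_snv(sequence: str, position: int, alt: str) -> str:
    """Return a copy of `sequence` with the base at 0-based `position` replaced by `alt`."""
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single base: A, C, G or T")
    return sequence[:position] + alt + sequence[position + 1:]


def variant_effect(sequence: str, position: int, alt: str) -> float:
    """Predicted activity of the alternate sequence minus that of the reference."""
    return toy_activity(apply_snv(sequence, position, alt)) - toy_activity(sequence)


reference = "GGCCTATAAAGGCC"                    # motif occupies positions 4 to 9
disrupting = variant_effect(reference, 6, "C")  # changes a base inside the motif
neutral = variant_effect(reference, 0, "A")     # changes a base outside the motif

print(f"variant inside motif:  {disrupting:+.3f}")
print(f"variant outside motif: {neutral:+.3f}")
```

A real model replaces `toy_activity` with a trained network that emits many output tracks per cell type, but the comparison of reference against alternate predictions stays the same.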

Lucid Genomics’ TAD annotation (TADA) tool allows prioritization of SNPs as well as structural variants. Structural variants (typically ≥1 kb in size), such as deletions, duplications, or translocations, are prevalent in the dark genome and have been linked to many diseases. Lucid’s algorithm annotates structural variation in the context of its functional environment by using the boundaries of so-called topologically associating domains (TADs), which make up the long-range regulatory architecture of genes. Combining this approach with a disease-specific expert decision system, the company, which raised €1.3 million in pre-seed funding in 2022, aims to explore full genomes to improve clinical genomics for better diagnosis and to aid drug discovery and development in rare disease and beyond.
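The idea of annotating a structural variant by its position relative to TAD boundaries can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the TADA implementation: the boundary coordinates, the `StructuralVariant` class, and the two-label output are all hypothetical, and a real tool would combine many more features.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralVariant:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    kind: str   # e.g. "DEL" or "DUP"


# Hypothetical TAD boundary positions per chromosome, sorted; invented for illustration.
TAD_BOUNDARIES: dict[str, list[int]] = {"chr1": [500_000, 1_200_000, 2_000_000]}


def boundaries_hit(sv: StructuralVariant,
                   boundaries: dict[str, list[int]] = TAD_BOUNDARIES) -> list[int]:
    """Return the TAD boundary positions that fall inside [sv.start, sv.end)."""
    positions = boundaries.get(sv.chrom, [])
    lo = bisect_left(positions, sv.start)  # first boundary >= start
    hi = bisect_left(positions, sv.end)    # first boundary >= end
    return positions[lo:hi]


def annotate(sv: StructuralVariant) -> str:
    """Label a variant by whether it covers at least one TAD boundary."""
    hit = boundaries_hit(sv)
    if hit:
        return f"{sv.kind} covers {len(hit)} TAD boundary position(s)"
    return f"{sv.kind} lies within a single TAD"


intra_tad = StructuralVariant("chr1", 600_000, 650_000, "DEL")
boundary_spanning = StructuralVariant("chr1", 1_150_000, 1_250_000, "DEL")

print(annotate(intra_tad))
print(annotate(boundary_spanning))
```

Variants that remove or duplicate a boundary are of particular interest because they may expose genes to regulatory elements from a neighboring domain, which is why boundary overlap is a natural first feature for prioritization.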

In a 2024 study, investigators at Johns Hopkins University used a machine learning technique they termed ARTEMIS (Analysis of RepeaT EleMents in dISease) to decipher the role of repetitive DNA elements in cancer. In a large-scale effort, the team analyzed tumor DNA and cell-free plasma samples from 1,975 cancer patients, discovering 820 previously unknown repeat elements altered in human cancer. Notably, repeat elements were enriched fifteenfold on average within 736 genes known to drive cancers, a finding that holds promise for better understanding the complex genomes of cancer cells and for finding new therapeutic avenues.

Further exploration of the dark genome may help improve diagnostic rates in rare disease and contribute to target discovery and the development of new therapeutic modalities. Personalized diagnoses also hold promise for individualized gene-editing–based approaches, as recently demonstrated for the first time in an infant with a rare, previously incurable disease at Children’s Hospital of Philadelphia.
