DNA language models reshape biology with AI foundation models

By Craig Nash
Tech writer at All Things Geek. Covers artificial intelligence, semiconductors, and computing hardware.

DNA language models are AI tools trained on massive genomic datasets to identify and predict patterns in DNA sequences, treating genetic code as a universal programming language shared by every living organism on Earth. This shift from analyzing text to analyzing nucleotides is reshaping how scientists approach mutation prediction, genome design, and disease research. The field has accelerated dramatically in 2025, with multiple breakthrough models demonstrating that machines can now read, write, and think in the language of life itself.

Key Takeaways

  • Evo 2, trained on 9.3 trillion nucleotides across 128,000 whole genomes, predicts mutations and designs synthetic genomes
  • AlphaGenome analyzes up to 1 million DNA base-pairs at single-letter resolution and was trained in 4 hours with half the compute of its predecessor, Enformer
  • Some DNA language models, such as ENBED, are moving from encoder-only architectures to encoder-decoder designs for sequence-to-sequence tasks
  • Foundation models still struggle with non-coding DNA prediction across all human cell types
  • Models like ENBED can identify enhancers, promoters, splice sites, and generate viral mutations with byte-level precision

The Largest DNA Model Ever Built

Evo 2 represents a watershed moment in generative biology. Developed by researchers at UC Berkeley, Arc Institute, UCSF, Stanford, and NVIDIA, Evo 2 is described by its developers as the largest AI model in biology, trained on over 9.3 trillion nucleotides from more than 128,000 whole genomes and metagenomic data spanning more than 100,000 species. This scale dwarfs its predecessor, Evo 1, which was trained only on the genomes of single-celled organisms. By expanding to the genomes of multicellular organisms, Evo 2 achieves what Patrick Hsu, UC Berkeley assistant professor of bioengineering and Arc Institute co-founder, describes as a generalist understanding of the tree of life. According to Hsu, the model enables machines to read, write, and think in the language of nucleotides, unlocking applications from predicting disease-causing mutations in humans to designing potential code for artificial life.

The practical capabilities are striking. Evo 2 can predict the effects of many kinds of genetic mutations, flag likely disease-causing variants in human genomes, generate new genome sequences as long as those of simple bacteria, and uncover gene sequence patterns shared by organisms separated by large evolutionary distances. This isn’t purely theoretical: the researchers have demonstrated these abilities across diverse biological domains, marking a genuine inflection point in how scientists approach genome analysis and engineering.
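One common way a sequence model can score a mutation is to compare how likely the mutated sequence is against the reference sequence under the model. The sketch below illustrates that idea only. It does not use Evo 2; a tiny Markov model fitted on a made-up repeat sequence stands in for the language model, and the names `fit_markov`, `log_likelihood`, and `variant_score` are hypothetical helpers introduced for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def fit_markov(sequences, k=2):
    """Count which base follows each k-base context (toy stand-in for a DNA language model)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts

def log_likelihood(seq, counts, k=2):
    """Sum of log P(base | previous k bases), using add-one smoothing for unseen events."""
    total = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})
        num = ctx.get(seq[i], 0) + 1
        den = sum(ctx.values()) + len(ALPHABET)
        total += math.log(num / den)
    return total

def variant_score(ref, pos, alt, counts, k=2):
    """Log-likelihood ratio of mutated vs. reference sequence.

    A negative score means the model finds the mutated sequence less likely.
    """
    mutated = ref[:pos] + alt + ref[pos + 1:]
    return log_likelihood(mutated, counts, k) - log_likelihood(ref, counts, k)

# Made-up training data: a simple ATG repeat, not real genomic sequence.
model = fit_markov(["ATGATGATGATG" * 3])
reference = "ATGATGATGATG"
print(variant_score(reference, 5, "C", model))  # negative: G->C breaks the learned pattern
```

In this toy setting, a substitution that breaks the pattern the model has learned receives a negative score, while leaving the reference base unchanged scores exactly zero. Real models replace the Markov counts with a learned network, but the comparison of mutated against reference likelihood follows the same shape.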

Competing Architectures and Specialized Models

While Evo 2 dominates in scale, other DNA language models are pursuing different architectural strategies. AlphaGenome, developed by Google DeepMind, analyzes up to 1 million DNA base-pairs at single-letter resolution while predicting regulatory activity and variant effects. Remarkably, AlphaGenome was trained in just 4 hours using half the computational resources required by its predecessor, Enformer, demonstrating that larger context windows don’t necessarily demand exponential compute increases. The model is already available via API, with full release for fine-tuning on custom datasets pending.

ENBED (Ensemble Nucleotide Byte-level Encoder-Decoder) represents another architectural innovation: an encoder-decoder Transformer foundation model that processes DNA with byte-level precision. Unlike encoder-only models, ENBED’s sequence-to-sequence design enables tasks like identifying enhancers, promoters, and splice sites; detecting sequencing errors such as base mismatches and insertions; annotating biological function; and even generating and validating influenza virus mutations. GENERator, another emerging model, has 1.2 billion parameters and a context length of 98,000 nucleotides. It was trained on 386 billion base-pairs of eukaryotic DNA and specializes in generating protein-coding sequences and optimizing enhancers.
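To make "byte-level" concrete, the sketch below contrasts one-token-per-nucleotide encoding with overlapping k-mer tokens, a scheme some other DNA models use. This is an illustration of the general idea, not ENBED's actual tokenizer; the helper names and the example sequences are made up for this purpose.

```python
def byte_tokens(seq):
    """One token per nucleotide: the ASCII byte value of each letter."""
    return list(seq.encode("ascii"))

def kmer_tokens(seq, k=3):
    """Overlapping k-mer tokens, shown here only for comparison."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def changed_positions(a, b):
    """Indices at which two equal-length token lists differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

ref = "ACGTACGT"
alt = "ACGTTCGT"  # single substitution at index 4 (A -> T)

print(byte_tokens(ref))                                        # [65, 67, 71, 84, 65, 67, 71, 84]
print(changed_positions(byte_tokens(ref), byte_tokens(alt)))   # [4]
print(changed_positions(kmer_tokens(ref), kmer_tokens(alt)))   # [2, 3, 4]
```

With byte-level tokens, a single-base change touches exactly one token, which is one plausible reason this granularity suits tasks such as spotting base mismatches. With overlapping 3-mers, the same change spreads across three tokens.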

Where DNA Language Models Still Fall Short

Despite their power, DNA language models face real limitations. Foundation models in genomics struggle to predict non-coding DNA activity across all human cell types, requiring data-efficient on-device strategies to supplement their predictions. This gap matters because non-coding DNA, which doesn’t directly encode proteins, plays a crucial regulatory role in gene expression and cell differentiation. A single large model may not capture the cell-type-specific behavior of regulatory elements, meaning researchers still need specialized, fine-tuned approaches for certain applications.

This limitation suggests that the future of genomic AI may not rest entirely on scaling foundation models. Instead, hybrid approaches that combine large-scale pre-trained models with smaller, specialized models trained on cell-type-specific data may prove more effective. The field is still early enough that architectural choices—encoder versus encoder-decoder, context window length, training dataset composition—remain open questions with real performance implications.
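A hybrid setup of this kind can be pictured as a frozen general-purpose embedding plus one small trainable "head" per cell type. The sketch below is a toy under loud assumptions: `embed` is a hypothetical stand-in that returns base-composition fractions rather than a real foundation-model embedding, the cell-type names are placeholders, and the sequences and activity labels are invented purely so the example runs.

```python
import math

def embed(seq):
    """Hypothetical stand-in for a frozen foundation-model embedding: A/C/G/T fractions."""
    n = len(seq)
    return [seq.count(base) / n for base in "ACGT"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, epochs=500, lr=0.5):
    """Fit a small logistic-regression head on frozen embeddings for one cell type."""
    w, b = [0.0] * 4, 0.0
    for _ in range(epochs):
        for seq, label in examples:
            x = embed(seq)  # the embedding itself is never updated
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(head, seq):
    """Predicted probability that the element is active in this head's cell type."""
    w, b = head
    return sigmoid(sum(wi * xi for wi, xi in zip(w, embed(seq))) + b)

# Invented toy labels: 1 = regulatory element active, 0 = inactive.
toy_data = {
    "cell_type_a": [("GCGCGGCC", 1), ("GGGCCCGC", 1), ("ATATAATT", 0), ("TTAATATA", 0)],
    "cell_type_b": [("GCGCGGCC", 0), ("GGGCCCGC", 0), ("ATATAATT", 1), ("TTAATATA", 1)],
}
heads = {cell: train_head(examples) for cell, examples in toy_data.items()}

for cell, head in heads.items():
    print(cell, round(predict(head, "GGCCGGCC"), 3))
```

The same sequence gets a different predicted activity from each head, which is the point of the design: the expensive shared representation is computed once, and only the small per-cell-type component needs cell-type-specific training data.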

Implications for Plant Biology and Crop Engineering

The source article frames DNA language models as transformative for plant biology, yet the research evidence emphasizes applications across human genomics, disease research, and synthetic biology. Evo 2’s training on more than 100,000 species across the tree of life does provide a generalist understanding that could theoretically apply to plant genomes, and mutation prediction capabilities are directly relevant to crop improvement. However, the research described here offers no specific evidence of plant-focused applications or breakthroughs. The potential is real, since plants face pressing challenges from climate change and food security, but the current momentum in DNA language models centers on human disease, regulatory prediction, and synthetic organism design. Plant biology researchers will benefit from these tools, but the field will likely need specialized fine-tuning and plant-specific datasets to unlock their full potential for agriculture.

What’s Next for DNA Language Models

The convergence of scale, architectural diversity, and API availability suggests DNA language models are transitioning from research curiosity to practical tool. Evo 2’s 2025 release and AlphaGenome’s January 2026 Nature publication mark inflection points where these models are moving from papers to deployable systems. The diversity of approaches—from Evo 2’s massive scale to ENBED’s encoder-decoder design to AlphaGenome’s efficient context handling—indicates the field is still exploring the design space rather than converging on a single winning architecture.

The real test will be how these models perform on tasks outside their training distribution. Predicting mutations in well-studied human genes is one thing; designing novel plant defense mechanisms or engineering microbes for industrial applications is another. As these tools move from academic labs into the hands of biotech companies and agricultural researchers, their limitations will become clearer, and the need for domain-specific fine-tuning will likely intensify.

Can DNA language models predict disease mutations accurately?

Yes. Evo 2 specifically identifies disease-causing mutations in human genomes as one of its core capabilities, and AlphaGenome predicts variant effects across regulatory regions. However, accuracy varies by mutation type and genomic context—predicting effects in well-characterized genes is more reliable than predicting effects in non-coding regions or novel variants.

How do DNA language models differ from traditional bioinformatics tools?

Traditional tools rely on hand-crafted rules and sequence alignment; DNA language models learn patterns directly from massive genomic datasets without explicit programming. This allows them to generalize across species and discover non-obvious relationships in genetic code that rule-based systems would miss.

Are DNA language models available to researchers outside large institutions?

Partially. AlphaGenome is available via API access, and other models like ENBED have been published with their code available. However, training your own DNA language model requires significant computational resources, so most researchers will rely on fine-tuning pre-trained models rather than training from scratch.

DNA language models represent a genuine shift in how biology is practiced. By treating genetic code as a universal language that machines can learn, analyze, and generate, these tools are collapsing the distinction between reading genomes and writing them. The models built in 2025 and 2026 will likely seem primitive in a decade, but they mark the moment when artificial intelligence moved from analyzing human language to analyzing the language of life itself. For plant biologists, disease researchers, and synthetic biologists, the practical question is no longer whether these tools work—it’s how quickly they can integrate them into their workflows.

Edited by the All Things Geek team.

Source: TechRadar
