Massive AI Model Trained on 9.3 Trillion Nucleotides Now Publicly Available

Researchers worldwide can now access Evo 2, a massive new foundation model that deciphers the genetic code across all domains of life. Unveiled as the largest publicly available AI model for genomic data, Evo 2 was developed on the NVIDIA DGX Cloud platform on Amazon Web Services (AWS) through a collaboration led by the nonprofit biomedical research organization Arc Institute and Stanford University. The model supports analysis of DNA, RNA, and proteins across a wide range of species.

Trained on an unprecedented dataset of over 9 trillion nucleotides—the fundamental building blocks of DNA and RNA—Evo 2 is designed to drive biomolecular research, from predicting protein structures to identifying novel molecules. It also provides insights into how gene mutations impact biological function, accelerating advances in precision medicine and synthetic biology.

The Evo 2 code is now publicly accessible from Arc’s GitHub. It is also integrated into the NVIDIA BioNeMo framework, as part of a collaboration between Arc Institute and NVIDIA to accelerate computational biomedical research.

A Giant Leap for Biotechnological Research

Researchers at the Arc Institute, in collaboration with NVIDIA and top academic institutions, have unveiled Evo 2, the largest publicly available artificial intelligence model for genomic data to date. Trained on over 9.3 trillion nucleotides from 128,000 genomes, the model can decode and design genetic sequences across all three domains of life.

Pushing the Boundaries of AI-Powered Biology

Built on machine learning innovations, Evo 2 is trained on the DNA of more than 100,000 species, allowing it to detect intricate genetic patterns that would take human researchers years to uncover. With capabilities that extend to predicting disease-causing mutations and designing entirely new genomes, this AI-driven model is set to transform our understanding of life itself.

The development team, composed of scientists from Arc Institute and NVIDIA alongside researchers from Stanford University, UC Berkeley, and UC San Francisco, will release details about Evo 2 in a preprint on February 19, 2025. A user-friendly platform, Evo Designer, will accompany the publication, enabling researchers worldwide to interact with the model.

The fully open-source Evo 2 is available via Arc’s GitHub and seamlessly integrates with the NVIDIA BioNeMo framework, including as an NVIDIA NIM microservice for easy, secure AI deployment.

To further enhance interpretability, the team collaborated with Goodfire, an AI research lab, to develop a mechanistic visualizer that exposes the key biological insights the model identifies in genomic sequences. By providing access to training data, inference code, and model weights, the Arc Institute is setting a new benchmark for open-source biological AI.

From Evo 1 to Evo 2: Scaling Up the Tree of Life

Expanding upon its predecessor, Evo 1, which focused solely on single-cell organisms, Evo 2 is the largest AI model trained in the biological sciences to date. With an extensive dataset spanning bacteria, archaea, phages, plants, and multicellular eukaryotic species, including humans, the model is poised to accelerate breakthroughs in medicine, genetics, and synthetic biology.

Thinking in the Language of Nucleotides

Patrick Hsu, Arc Institute Co-Founder and Core Investigator, and Assistant Professor of Bioengineering at the University of California, Berkeley, stated:

“Our development of Evo 1 and Evo 2 represents a key moment in the emerging field of generative biology, as the models have enabled machines to read, write, and think in the language of nucleotides. Evo 2 has a generalist understanding of the tree of life that’s useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life. We’re excited to see what the research community builds on top of these foundation models.”

Brian Hie, the preprint’s other co-senior author, who is an Assistant Professor of Chemical Engineering at Stanford University, Dieter Schwarz Foundation Stanford Data Science Faculty Fellow, and Arc Institute Innovation Investigator, said:

“Just as the world has left its imprint on the language of the Internet used to train large language models, evolution has left its imprint on biological sequences. These patterns, refined over millions of years, contain signals about how molecules work and interact.”

Training AI at an Unprecedented Scale

Training Evo 2 took months on more than 2,000 NVIDIA H100 GPUs, accessed through the NVIDIA DGX Cloud AI platform on AWS. The model processes genetic sequences of up to one million nucleotides at a time, which allows it to identify long-range relationships within genomes. The research team used an advanced AI architecture, StripedHyena 2, developed with contributions from OpenAI co-founder Greg Brockman, to train Evo 2 on 30 times more data than its predecessor.
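To give a sense of what a one-million-nucleotide limit means in practice, here is a minimal sketch that cuts a longer sequence into windows no larger than that limit. It is not Evo 2's code or API: the function name split_into_windows and the constant MAX_CONTEXT are hypothetical names chosen for illustration, and the simple non-overlapping split is an assumption.

```python
# Illustrative only: hypothetical names, not part of Evo 2 or BioNeMo.
MAX_CONTEXT = 1_000_000  # one million nucleotides, the limit cited in the article


def split_into_windows(sequence: str, max_len: int = MAX_CONTEXT) -> list[str]:
    """Split a nucleotide string into consecutive windows of at most max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Step through the sequence in strides of max_len; the last window may be shorter.
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


if __name__ == "__main__":
    # Tiny demo with a window of 4 so the result is easy to read.
    print(split_into_windows("ACGTACGTAC", max_len=4))
```

A non-overlapping split like this loses any relationship that spans a window boundary, which is one reason a larger window matters for long-range patterns.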

Disease Prediction & Genetic Engineering

In tests on variants of the breast cancer-linked BRCA1 gene, Evo 2 achieved over 90% accuracy in predicting which mutations are benign and which are pathogenic. Predictions of this kind can save the time and resources otherwise needed for experimental validation, and the capability is expected to accelerate drug development, genetic therapies, and personalized medicine.
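For readers unfamiliar with how an accuracy figure like this is computed, the sketch below compares model predictions against known labels. It does not reproduce the Evo 2 evaluation: the Variant class, the score convention (higher means more likely pathogenic), the 0.5 threshold, and the four example variants are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str         # placeholder identifier, not a real BRCA1 variant
    score: float      # hypothetical model score; higher = more likely pathogenic
    pathogenic: bool  # label from experimental or clinical evidence


def accuracy(variants: list[Variant], threshold: float = 0.5) -> float:
    """Fraction of variants whose thresholded score matches the known label."""
    if not variants:
        raise ValueError("need at least one variant")
    correct = sum((v.score >= threshold) == v.pathogenic for v in variants)
    return correct / len(variants)


if __name__ == "__main__":
    # Made-up data: three of the four predictions agree with the labels.
    demo = [
        Variant("variant_a", 0.9, True),
        Variant("variant_b", 0.2, False),
        Variant("variant_c", 0.7, False),
        Variant("variant_d", 0.8, True),
    ]
    print(accuracy(demo))  # 0.75
```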

Beyond disease prediction, Evo 2 opens doors for drug discovery applications. Hani Goodarzi, an Arc Core Investigator and an Associate Professor of Biochemistry and Biophysics at the University of California, San Francisco, stated:

“If you have a gene therapy that you want to turn on only in neurons to avoid side effects, or only in liver cells, you could design a genetic element that is only accessible in those specific cells. This precise control could help develop more targeted treatments with fewer side effects.”

A Platform for Future Innovation

Dave Burke, Chief Technology Officer at Arc Institute, said:

“In a loose way, you can think of the model almost like an operating system kernel—you can have all of these different applications that are built on top of it. From predicting how single DNA mutations affect a protein’s function to designing genetic elements that behave differently in different cell types, as we continue to refine the model and researchers begin using it in creative ways, we expect to see beneficial uses for Evo 2 we haven’t even imagined yet.”

Ethical Considerations & Responsible AI Development

To address potential safety concerns, the team excluded pathogens that infect humans and other complex organisms from Evo 2’s training set; human genomes themselves remain part of the training data. Additionally, the model is designed not to return productive answers to queries about these pathogens. Tina Hernandez-Boussard, Professor at Stanford University, and her lab played a key role in implementing safeguards for responsible AI deployment.

Arc Institute to Accelerate Discoveries Like Never Before

Arc Institute, established in 2021 with $650 million in funding from its founding donors, supports long-term scientific challenges by providing researchers with multiyear funding and cutting-edge resources. Core investigators receive state-of-the-art lab space and renewable eight-year funding terms, allowing them to focus on groundbreaking research rather than continuous grant applications. Arc Institute researchers are pursuing advancements in cancer, immune dysfunction, and neurodegeneration.

By combining this unique research environment with accelerated computing from NVIDIA, Arc scientists can analyze larger datasets and accelerate discoveries in ways never before possible.

An Open-access Framework

Anthony Costa, director of digital biology at NVIDIA, said:

“Evo 2 has fundamentally advanced our understanding of biological systems. By overcoming previous limitations in the scale of biological foundation models with a unique architecture and the largest integrated dataset of its kind, Evo 2 generalizes across more known biology than any other model to date — and by releasing these capabilities broadly, the Arc Institute has given scientists around the world a new partner in solving humanity’s most pressing health and disease challenges.”

The related preprint, “Genome modeling and design across all domains of life with Evo 2,” and the sister machine learning paper, “Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale,” are available on bioRxiv.

