Zymrael committed
Commit 78c715a
1 Parent(s): 567369e

Update README.md

Files changed (1)
  1. README.md +14 -9
README.md CHANGED
@@ -35,18 +35,21 @@ model = AutoModelForCausalLM.from_pretrained(
 Evo is a biological foundation model capable of long-context modeling and design.
 
 Evo uses the [StripedHyena architecture](https://github.com/togethercomputer/stripedhyena) to enable modeling of sequences at a single-nucleotide, byte-level resolution with near-linear scaling of compute and memory relative to context length.
-Evo has 7 billion parameters and is trained on OpenGenome, a prokaryotic whole-genome dataset containing ~300 billion tokens.
+Evo has 7 billion parameters and is trained on [OpenGenome](https://huggingface.co/datasets/LongSafari/open-genome), a prokaryotic whole-genome dataset containing ~300 billion tokens.
 
-Technical details about Evo can be found in our preprint and our accompanying blog posts. Evo was collaboratively developed by the [Arc Institute](https://arcinstitute.org/) and TogetherAI.
+We describe Evo in the paper ["Sequence modeling and design from molecular to genome scale with Evo"](https://www.science.org/doi/10.1126/science.ado9336).
 
 As part of our commitment to open science, we release **weights of 15 intermediate pretraining checkpoints** for phase 1 and phase 2 of pretraining. The checkpoints are available as branches of the corresponding HuggingFace repository.
 
 **Evo-1 (Phase 2)** is our **longer context model** in the Evo family, trained at a context length of 131k and tested on generation of sequences of length >650k.
 
+We provide the following model checkpoints:
 | Checkpoint Name | Description |
 |----------------------------------------|-------------|
 | `evo-1-8k-base` | A model pretrained with 8,192 context. We use this model as the base model for molecular-scale finetuning tasks. |
-| `evo-1-131k-base` | A model pretrained with 131,072 context using `evo-1-8k-base` as the initialization. We use this model to reason about and generate sequences at the genome scale. |
+| `evo-1-131k-base` | A model pretrained with 131,072 context using `evo-1-8k-base` as the base model. We use this model to reason about and generate sequences at the genome scale. |
+| `evo-1-8k-crispr` | A model finetuned using `evo-1-8k-base` as the base model to generate CRISPR-Cas systems. |
+| `evo-1-8k-transposon` | A model finetuned using `evo-1-8k-base` as the base model to generate IS200/IS605 transposons. |
 
 ### Model Architecture
 
@@ -80,19 +83,21 @@ The main classes are:
 StripedHyena is a mixed precision model. Make sure to keep your `poles` and `residues` in `float32` precision, especially for longer prompts or training.
 
 
-
 ### Disclaimer
 
-To use StripedHyena outside of the playground, you will need to install custom kernels. Please follow the instructions from the [standalone repository](https://github.com/togethercomputer/stripedhyena).
+To use StripedHyena, you will need to install custom kernels. Please follow the instructions from the [standalone repository](https://github.com/togethercomputer/stripedhyena).
 
 ## Cite
 
 ```
 @article{nguyen2024sequence,
-   author = {Eric Nguyen and Michael Poli and Matthew G. Durrant and Armin W. Thomas and Brian Kang and Jeremy Sullivan and Madelena Y. Ng and Ashley Lewis and Aman Patel and Aaron Lou and Stefano Ermon and Stephen A. Baccus and Tina Hernandez-Boussard and Christopher Ré and Patrick D. Hsu and Brian L. Hie},
-   journal = {Arc Institute manuscripts},
+   author = {Eric Nguyen and Michael Poli and Matthew G. Durrant and Brian Kang and Dhruva Katrekar and David B. Li and Liam J. Bartie and Armin W. Thomas and Samuel H. King and Garyk Brixi and Jeremy Sullivan and Madelena Y. Ng and Ashley Lewis and Aaron Lou and Stefano Ermon and Stephen A. Baccus and Tina Hernandez-Boussard and Christopher Ré and Patrick D. Hsu and Brian L. Hie},
    title = {Sequence modeling and design from molecular to genome scale with Evo},
-   url = {https://arcinstitute.org/manuscripts/Evo},
+   journal = {Science},
+   volume = {386},
+   number = {6723},
+   pages = {eado9336},
    year = {2024},
+   doi = {10.1126/science.ado9336},
+   URL = {https://www.science.org/doi/abs/10.1126/science.ado9336},
 }
 ```
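The diff above notes that intermediate pretraining checkpoints are published as branches of the HuggingFace repository and lists several checkpoint names. A minimal loading sketch of how those pieces could fit together is shown below. This is an illustration, not part of the README: the `togethercomputer/` namespace, the `evo_repo_id` and `load_evo` helpers, and the branch names for intermediate checkpoints are assumptions to verify against the model card.

```python
# Hypothetical sketch (not from the README diff): loading the Evo checkpoints
# listed in the table. The `togethercomputer/` repo namespace and the use of
# repo branches for intermediate pretraining checkpoints are assumptions.


def evo_repo_id(checkpoint_name: str, namespace: str = "togethercomputer") -> str:
    """Map a checkpoint name from the table to a HuggingFace repo id."""
    return f"{namespace}/{checkpoint_name}"


def load_evo(checkpoint_name: str, revision: str = "main"):
    """Load an Evo checkpoint.

    Intermediate pretraining checkpoints are exposed as repository branches,
    selectable via `revision`. StripedHyena ships custom modeling code, so
    `trust_remote_code=True` is required, and the custom kernels mentioned
    in the Disclaimer must be installed.
    """
    # Imported lazily so the helpers above stay importable without transformers.
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    repo = evo_repo_id(checkpoint_name)
    config = AutoConfig.from_pretrained(repo, trust_remote_code=True, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        repo, config=config, trust_remote_code=True, revision=revision
    )
    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, revision=revision)
    return model, tokenizer
```

For example, `load_evo("evo-1-131k-base")` would fetch the long-context model from the main branch, while passing a branch name as `revision` would select one of the released intermediate pretraining checkpoints, assuming the repo layout matches this sketch.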