---
license: apache-2.0
tags:
- pretrained
- mistral
- DNA
- virus (both rna and dna)
- biology
- genomics
---

# Model Card for Mistral-DNA-v1-138M-virus (Mistral for DNA)

The Mistral-DNA-v1-138M-virus Large Language Model (LLM) is a pretrained generative DNA text model with 17.31M parameters x 8 experts = 138.5M parameters.
It is derived from the Mistral-7B-v0.1 model, which was simplified for DNA: the number of layers and the hidden size were reduced.
The model was pretrained on around 15,071 virus genomes longer than 1 kb.

The virus genome database was downloaded from https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&VirusLineage_ss=taxid:10239&SourceDB_s=RefSeq.
NB: the DNA sequences were used, not the RNA sequences.

For full details of this model, please read our [GitHub repo](https://github.com/raphaelmourad/Mistral-DNA).

## Model Architecture

Like Mistral-7B-v0.1, it is a transformer model with the following architecture choices:
- Grouped-Query Attention
- Sliding-Window Attention
- Byte-fallback BPE tokenizer

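A minimal sketch (not from the original model card): the settings listed above can be inspected from the checkpoint's configuration. The attribute names below follow the standard Mistral/Mixtral configs and may not all be present for this checkpoint, so `getattr` with a default is used.

```
from transformers import AutoConfig

# Assumption: attribute names follow the standard Mistral/Mixtral configs.
config = AutoConfig.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-virus", trust_remote_code=True)
print("layers:", getattr(config, "num_hidden_layers", None))
print("hidden size:", getattr(config, "hidden_size", None))
print("key/value heads (grouped-query attention):", getattr(config, "num_key_value_heads", None))
print("sliding window:", getattr(config, "sliding_window", None))
print("experts:", getattr(config, "num_local_experts", None))
```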

## Load the model from Hugging Face

```
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-virus", trust_remote_code=True)  # same tokenizer as DNABERT2
model = AutoModel.from_pretrained("RaphaelMourad/Mistral-DNA-v1-138M-virus", trust_remote_code=True)
```
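
As a quick, illustrative check (not part of the original card), you can see how the byte-fallback BPE tokenizer segments a DNA string:

```
# Illustrative: print the BPE tokens produced for a short DNA sequence.
print(tokenizer.tokenize("TGATGATTGGCGCGGCTAGGATCGGCT"))
```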

## Calculate the embedding of a DNA sequence

```
dna = "TGATGATTGGCGCGGCTAGGATCGGCT"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # shape: [1, sequence_length, 256]

# Embedding with max pooling over the sequence dimension
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expected: torch.Size([256])
```
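
Mean pooling over the sequence dimension is a common alternative way to summarize the hidden states (an illustrative variant, not from the original card):

```
# Illustrative alternative: embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expected: torch.Size([256])
```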

## Troubleshooting

Ensure you are using a stable version of Transformers, 4.34.0 or newer.
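
To check which version is installed:

```
# Print the installed Transformers version; it should be 4.34.0 or newer.
import transformers
print(transformers.__version__)
```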

## Notice

Mistral-DNA-v1-138M-virus is a pretrained base model for virus genomes.

## Contact

Raphaël Mourad: raphael.mourad@univ-tlse3.fr