zhihan1996
/

DNABERT-2-117M

zhihan1996 committed on Jul 6, 2023

Commit

5fd206e

·

1 Parent(s): 6041066

Create README.md

Files changed (1)

README.md +33 -0

	@@ -0,0 +1,33 @@

+---
+metrics:
+- matthews_correlation
+- f1
+tags:
+- biology
+- medical
+---
+DNABERT-2 is a transformer-based genome foundation model trained on multi-species genomes.
+To load the model from Hugging Face:
+```
+import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
+model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
+```
+To calculate the embedding of a DNA sequence:
+```
+dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
+inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
+hidden_states = model(inputs)[0]  # shape: [1, sequence_length, 768]
+# embedding with mean pooling over the token dimension
+embedding_mean = torch.mean(hidden_states[0], dim=0)
+print(embedding_mean.shape)  # expected: torch.Size([768])
+# embedding with max pooling over the token dimension
+embedding_max = torch.max(hidden_states[0], dim=0)[0]  # [0] selects the values, not the indices
+print(embedding_max.shape)  # expected: torch.Size([768])
+```
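
The pooling steps in the README reduce a `[sequence_length, 768]` matrix of token vectors to a single 768-dimensional vector. The sketch below spells out that arithmetic on a toy matrix in plain Python, so it runs without downloading the model. The helper names `mean_pool` and `max_pool` and the optional `mask` argument are hypothetical and not part of the README. The mask shows one way padded positions could be excluded when several sequences of different lengths are batched together, assuming the tokenizer returns an `attention_mask` as Hugging Face tokenizers typically do. The README's single-sequence example has no padding and does not need it.

```python
from typing import List, Optional, Sequence

Matrix = Sequence[Sequence[float]]  # [sequence_length, hidden_size]


def _kept_rows(hidden: Matrix, mask: Optional[Sequence[int]]) -> List[Sequence[float]]:
    """Return the token rows whose mask entry is non-zero (all rows if mask is None)."""
    if mask is None:
        mask = [1] * len(hidden)
    kept = [row for row, keep in zip(hidden, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    return kept


def mean_pool(hidden: Matrix, mask: Optional[Sequence[int]] = None) -> List[float]:
    """Average each hidden dimension over the kept tokens."""
    kept = _kept_rows(hidden, mask)
    return [sum(col) / len(kept) for col in zip(*kept)]


def max_pool(hidden: Matrix, mask: Optional[Sequence[int]] = None) -> List[float]:
    """Take the maximum of each hidden dimension over the kept tokens."""
    kept = _kept_rows(hidden, mask)
    return [max(col) for col in zip(*kept)]


# Toy "hidden states": 3 tokens, hidden size 4 (the real model uses 768).
toy_hidden = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 0.0, 1.0, 4.0],
    [9.0, 9.0, 9.0, 9.0],  # treated as a padding token in this example
]
toy_mask = [1, 1, 0]  # 1 = real token, 0 = padding

mean_vec = mean_pool(toy_hidden, toy_mask)
max_vec = max_pool(toy_hidden, toy_mask)
print(mean_vec)  # [2.0, 1.0, 2.0, 4.0]
print(max_vec)   # [3.0, 2.0, 3.0, 4.0]
```

Without the mask, the padding row would be pooled as well and would pull both results toward 9.0, which is why masking matters once inputs are padded.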