---

license: cc-by-nc-sa-4.0
tags:
- Helical
- RNA
- Biology
- Transformers
- Genomics
- Sequence
library_name: transformers
---

# Helix-mRNA-v0

Helix-mRNA is a hybrid state-space and transformer-based model. It combines the efficient sequence processing of Mamba2's state-space architecture with the contextual understanding of transformer attention mechanisms. This combination makes it particularly suitable for studying full-length transcripts, splice variants, and complex mRNA structural elements.

We tokenize mRNA sequences at single-nucleotide resolution by mapping each nucleotide (A, C, U, G) and the ambiguous base (N) to a unique integer. A further special character, E, is inserted into the sequence to mark the start of each codon. This fine-grained approach maximizes the model's ability to extract patterns from the sequences. Unlike coarser tokenization methods that group nucleotides together, such as k-mer based approaches, single-nucleotide resolution preserves the full sequential information of the mRNA molecule. No information is lost during preprocessing, so the downstream model can learn directly from the raw sequence composition.
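The snippet below is a minimal sketch of this encoding scheme, not the tokenizer shipped with the model. The function names and the integer ids in `VOCAB` are assumptions made for illustration; the actual ids used by Helix-mRNA may differ.

```python
# Hypothetical vocabulary: the real integer ids used by Helix-mRNA may differ.
VOCAB = {"E": 0, "A": 1, "C": 2, "U": 3, "G": 4, "N": 5}


def add_codon_markers(sequence: str, marker: str = "E") -> str:
    """Insert the marker before every codon (group of three nucleotides).

    A trailing group shorter than three nucleotides also gets a marker.
    """
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    return "".join(marker + codon for codon in codons)


def encode(marked_sequence: str) -> list[int]:
    """Map each character to its integer id, one token per character."""
    return [VOCAB[ch] for ch in marked_sequence]


marked = add_codon_markers("ACUAUG")
token_ids = encode(marked)
print(marked)     # EACUEAUG
print(token_ids)  # [0, 1, 2, 3, 0, 1, 3, 4]
```

Because every character becomes exactly one token, the original sequence can be recovered from the ids by inverting the mapping and dropping the `E` markers.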

<p align="center">
<img src="assets/results_graph.png" alt="bar_charts" width="750"/>
<figcaption align="center">Helix-mRNA benchmark comparison against Transformer HELM, Transformer XE and CodonBERT.</figcaption>
</p>

Read more about it in our [blog post](https://www.helical-ai.com/blog/helix-mrna-v0)!

# Helical<a name="helical"></a>

#### Install the package

Run the following to install the [Helical](https://github.com/helicalAI/helical) package via pip:
```console
pip install --upgrade helical
```

#### Generate Embeddings
```python
from helical import HelixmRNA, HelixmRNAConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

input_sequences = ["EACU"*20, "EAUG"*20, "EAUG"*20, "EACU"*20, "EAUU"*20]

helix_mrna_config = HelixmRNAConfig(batch_size=5, device=device, max_length=100)
helix_mrna = HelixmRNA(configurer=helix_mrna_config)

# prepare data for input to the model
processed_input_data = helix_mrna.process_data(input_sequences)

# generate the embeddings for the processed data
embeddings = helix_mrna.get_embeddings(processed_input_data)
```
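
A common next step is to reduce per-token embeddings to one vector per sequence. The sketch below assumes the embeddings form an array of shape `(n_sequences, n_tokens, hidden_dim)`; that shape is an assumption, not something stated above, so check the shape of your own `embeddings` first. A random stand-in array with hypothetical sizes is used so the snippet runs without the model.

```python
import numpy as np

# Stand-in for the array returned by get_embeddings.
# The (n_sequences, n_tokens, hidden_dim) layout and sizes are assumptions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 100, 8))

# Average over the token axis to get one vector per sequence.
sequence_vectors = embeddings.mean(axis=1)

# Cosine similarity between every pair of sequences.
norms = np.linalg.norm(sequence_vectors, axis=1, keepdims=True)
normalized = sequence_vectors / norms
similarity = normalized @ normalized.T

print(sequence_vectors.shape)  # (5, 8)
print(similarity.shape)        # (5, 5)
```

Mean pooling is only one option; if padded positions are included in the array, they may need to be masked out before averaging.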

#### Fine-Tuning
Classification fine-tuning example:
```python
from helical import HelixmRNAFineTuningModel, HelixmRNAConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

input_sequences = ["EACU"*20, "EAUG"*20, "EAUG"*20, "EACU"*20, "EAUU"*20]
labels = [0, 2, 2, 0, 1]

helixr_config = HelixmRNAConfig(batch_size=5, device=device, max_length=100)
helixr_fine_tune = HelixmRNAFineTuningModel(helix_mrna_config=helixr_config, fine_tuning_head="classification", output_size=3)

# prepare data for input to the model
train_dataset = helixr_fine_tune.process_data(input_sequences)

# fine-tune the model with the relevant training labels
helixr_fine_tune.train(train_dataset=train_dataset, train_labels=labels)

# get outputs from the fine-tuned model on a processed dataset
outputs = helixr_fine_tune.get_outputs(train_dataset)
```
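
To turn classification outputs into predicted labels, one option is to take the highest-scoring class per sequence. The sketch below assumes `outputs` holds one row of class scores per sequence, with shape `(n_sequences, output_size)`; this layout is an assumption, and the score values are made up so the snippet runs on its own.

```python
import numpy as np

# Stand-in for `outputs`: one row of class scores per sequence.
# The values and the (n_sequences, output_size) layout are hypothetical.
outputs = np.array([
    [2.1, 0.3, -1.0],
    [-0.5, 0.2, 1.7],
    [0.1, -0.4, 2.2],
    [1.5, 0.9, 0.0],
    [0.2, 1.1, 0.4],
])
labels = np.array([0, 2, 2, 0, 1])

# Pick the index of the largest score in each row as the predicted class.
predicted = outputs.argmax(axis=1)

# Fraction of sequences whose predicted class matches the label.
accuracy = (predicted == labels).mean()

print(predicted.tolist())  # [0, 2, 2, 0, 1]
print(accuracy)            # 1.0
```

Scoring the training set, as here, only shows the mechanics; a held-out set is needed for a meaningful accuracy estimate.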