TCRT5 model (pre-trained)
Model description
This is the pre-trained model used as the starting point for finetuning TCRT5. The finetuned model is a seq2seq model designed for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). It is built on the T5 architecture and exposed through the corresponding Hugging Face Transformers classes. It is released alongside the associated paper.
Intended uses & limitations
This model is released for seq2seq finetuning on custom datasets. It may be useful for either pMHC -> TCR (TCR design) or TCR -> pMHC (TCR de-orphanization) sequence generation. Additionally, although it has not been tested in this capacity, it can be finetuned on classification- or regression-style tasks involving sequence representations of the TCR (CDR3β) and pMHC (peptide-pseudosequence).
How to use
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("dkarthikeyan1/tcrt5_pre_tcrdb")
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_pre_tcrdb")

pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors="pt")

# Can be useful for classification/regression downstream tasks
enc_outputs = tcrt5.encoder(**encoded_pmhc)
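For downstream classification or regression, one simple option is to mean-pool the encoder's final hidden states over non-padding positions to obtain a fixed-length pMHC embedding. This is a minimal sketch for illustration (the pooling choice is an assumption, not part of the released pipeline):

import torch

# Mean-pool the final encoder hidden states over non-padding positions to get a
# fixed-length embedding of dimension d_model (256 for this model).
hidden = enc_outputs.last_hidden_state                                 # (1, seq_len, 256)
mask = encoded_pmhc["attention_mask"].unsqueeze(-1).to(hidden.dtype)   # (1, seq_len, 1)
pmhc_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)          # (1, 256)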
Limitations
As it stands, the model was jointly pre-trained on peptide-pseudosequence pairs and CDR3β sequences. As such, inputs consisting of the peptide alone, the CDR3α, or other parts of the TCR would be out-of-distribution (OOD).
Training data
TCRT5 was pre-trained with masked span reconstruction on a dataset built around ~14M CDR3 sequences from TCRdb as well as ~780k peptide-pseudosequence pairs taken from IEDB. To correct for the data imbalance, upsampling was used to bring the TCR:pMHC sequence ratio to 70:30.
Training procedure
Preprocessing
All amino acid sequences and V/J gene names were standardized using the tidytcells package (see here). MHC allele information was standardized using mhcgnomes (available here) before mapping allele information to the MHC pseudo-sequence as defined in NetMHCpan.
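As a rough illustration, the standardization might look like the sketch below. The exact function names and the example inputs ("TRBV28", "HLA-A*03:01") are assumptions based on current versions of these packages, not code taken from this model's pipeline:

# Illustrative only: APIs shown are assumptions based on the tidytcells and
# mhcgnomes documentation and may differ across package versions.
import tidytcells as tt
import mhcgnomes

cdr3 = tt.junction.standardize("CASSLGQGYEQYF")   # standardize a CDR3 junction sequence
v_gene = tt.tr.standardize("TRBV28")              # standardize a TCR V gene name
allele = mhcgnomes.parse("HLA-A*03:01")           # returns a parsed Allele object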
Pre-training
TCRT5 was pre-trained with masked language modeling (MLM), specifically span reconstruction similar to the original training objective of the T5 paper. For a given sequence, 15% of the tokens are masked using contiguous spans of random length between 1 and 3. This is done via the sentinel tokens introduced in the T5 paper. The entire masked sequence is then passed into the model, which is trained to reconstruct a concatenated sequence consisting of each sentinel token followed by the tokens it masked. This forces the model to learn richer k-mer dependencies within the masked sequences.
The collator masks a fraction 'mlm_probability' of the tokens, grouped into spans of up to 'max_span_length', according to the following algorithm (a toy sketch follows the worked example below):
* Randomly generate span lengths that add up to round(mlm_probability * seq_len) (ignoring pad tokens) for each sequence.
* Ensure that spans are not directly adjacent, so that merged spans never exceed max_span_length.
* Once the span masks are generated according to T5 conventions, mask the inputs and generate the targets.
Example:
* Input: CASSLGQGYEQYF
* Masked input: CASSLG[X]GY[Y]F
* Target: [X]Q[Y]EQY[Z]
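The following is a toy sketch of this span-corruption scheme. It is not the actual pre-training collator; span placement details and the set of sentinel tokens are assumptions for illustration:

import random

def t5_span_mask(seq, mlm_probability=0.15, max_span_length=3,
                 sentinels=("[X]", "[Y]", "[Z]")):
    # Toy illustration of T5-style span corruption on an amino-acid string.
    n_to_mask = max(1, round(mlm_probability * len(seq)))

    # Randomly draw span lengths (1..max_span_length) that add up to n_to_mask.
    lengths = []
    while sum(lengths) < n_to_mask:
        lengths.append(min(random.randint(1, max_span_length),
                           n_to_mask - sum(lengths)))
    assert len(lengths) < len(sentinels), "toy example only supports a few spans"

    # Place spans left to right, keeping at least one unmasked residue between them.
    spans, cursor = [], 0
    for i, length in enumerate(lengths):
        room_needed = sum(lengths[i:]) + (len(lengths) - 1 - i)
        start = random.randint(cursor, len(seq) - room_needed)
        spans.append((start, start + length))
        cursor = start + length + 1  # +1 so consecutive spans are never adjacent

    # Replace each span with a sentinel; the target lists each sentinel followed
    # by the residues it masked, terminated by one final sentinel.
    masked, target, prev_end = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        masked.append(seq[prev_end:start] + sentinel)
        target.append(sentinel + seq[start:end])
        prev_end = end
    masked.append(seq[prev_end:])
    target.append(sentinels[len(spans)])

    return "".join(masked), "".join(target)

print(t5_span_mask("CASSLGQGYEQYF"))
# e.g. ('CASSLGQ[X]YE[Y]YF', '[X]G[Y]Q[Z]')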
Hyperparameters:

Architecture:

| #Enc. | #Dec. | Vocab. Size | D_model | Num Attn. Heads | Dropout | D_ff |
|---|---|---|---|---|---|---|
| 10 | 10 | 128 | 256 | 16 | 0.1 | 1024 |

Training:

| Bsz. | LR | Steps | Weight Decay | Warmup |
|---|---|---|---|---|
| 512 | 3e-4 | 168k (~4 epochs) | 0.1 | 500 |
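For reference, the architecture hyperparameters above map onto a transformers T5Config roughly as follows. This is a reconstruction shown only for illustration; in practice, load the released checkpoint with from_pretrained rather than building the config by hand:

from transformers import T5Config

# Approximate reconstruction of the architecture table above; any parameter not
# listed in the table is left at its transformers default.
config = T5Config(
    vocab_size=128,
    d_model=256,
    d_ff=1024,
    num_layers=10,          # encoder layers
    num_decoder_layers=10,  # decoder layers
    num_heads=16,
    dropout_rate=0.1,
)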
Hardware
- Hardware Type: NVIDIA A100 80GB PCIe
- Hours used: 60
- Carbon Emitted: 6.48 kg CO2 eq.
Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
BibTeX entry and citation info
@article{dkarthikeyan2024tcrtranslate,
  title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences},
  author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn},
  journal={bioRxiv},
  year={2024},
}