---
license: cc-by-nc-sa-4.0
widget:
- text: ATGCTTTGTGCTGGCATGCCATGTCATGTTGCATCAGCATTTTCTTTATATTTTCTTTCTGATCTTTTCTGTGCTTCAAAACCTCATTCGTCTGTTTCCTTCTTTCCTACCAGTTATCCACAGACACACCCTATTAGAGTACTCCATGCTTGTTTATTTCTTTTGTCAAATAGAAGGGTCTTTTCTCCTCGCTTTAGTAGGGAATGTTGTCTTCCTCATTTGGGAAAAAAAAATTGTTCCTGCAGTTATGCCAGTCATGGGCTCTTTTTGATTGGTTGCATTGATATATTGTCTACCCCGTTTTCTGTAGGAATGATACATATTCCTGATCCTGAGCCTATTTGA
tags:
- DNA
- biology
- genomics
---
# Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All models have a comparable size, between 90 MB and 150 MB. A BPE tokenizer is used for tokenization, with a vocabulary of 8000 tokens.

**Developed by:** zhangtaolab

### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA large language models in plant genomes]()

### Architecture

The model is based on the Google Gemma model, with a modified tokenizer specific to DNA sequences.

This model is fine-tuned for predicting lncRNAs.

### How to use

Install the runtime library first:
```bash
pip install transformers
```

Here is a simple example for inference:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'plant-dnagemma-6mer-lncRNAs'
# load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

# inference
sequences = ['ATGGAACTCATGAAGACGTTAGATCTTCACAAGAGGATATTTTCCGAATTTAGTGATGAACAATCAAGAGTGTCATACACTGCAAAAATCTATCAAGAACAAATAAAAGCGGCAAAAGGGAGGTTGCCTGATAGTAGTGTAAAGCAATTAGGTGTCTGGCAACTTCATGTTTTCCTCAAAAGATGTGAAAAAGCACCCAACCAGGACAATACGACATCAGGAATTCTGTAA',
             'ATGGCTGATGAAGCTCAGGAGAAGGCTGAACATGATCGCATTTTCAAGCGCTTCGACTTGAACGGAGACGGCAGGATCTCTGCCGCAGAGCTAGGTGACTGCTTGAAGACCCTTGGTTCAGTCACCCCGGATGAGATCCAGCGTATGATGGCAGAGATTGATACTGATGGTGATGGATACATATCATATGAAGAATTCACAGATTTTGCCATGGCCAACCGTGGCCTAATGAAGGATGTGGCCAAGATATTCTAA']
# top_k=None returns the scores of all labels for each sequence
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
                trust_remote_code=True, top_k=None)
results = pipe(sequences)
print(results)
```

### Training data

We use GemmaForSequenceClassification to fine-tune the model.
The detailed training procedure can be found in our manuscript.

#### Hardware

The model was trained on an NVIDIA GTX1080Ti GPU (11 GB).
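
### Reading the inference output

With `top_k=None`, the `text-classification` pipeline from the inference example returns, for each input sequence, a list of `{'label': ..., 'score': ...}` entries covering every label. The sketch below shows one way to reduce that structure to a single best label per sequence. The helper name `top_predictions` and the label names `LABEL_0` / `LABEL_1` are placeholders chosen for illustration, and the scores are mock values, not real model output; the labels actually exposed by this model may be named differently.

```python
from typing import Dict, List, Union

# One entry of the pipeline output: a label name and its score.
Score = Dict[str, Union[str, float]]


def top_predictions(results: List[List[Score]]) -> List[Score]:
    """Return the highest-scoring entry for each input sequence."""
    return [max(per_sequence, key=lambda item: item["score"])
            for per_sequence in results]


# Mock data shaped like the pipeline output for two sequences.
# Label names and scores are hypothetical placeholders.
mock_results = [
    [{"label": "LABEL_0", "score": 0.12}, {"label": "LABEL_1", "score": 0.88}],
    [{"label": "LABEL_0", "score": 0.97}, {"label": "LABEL_1", "score": 0.03}],
]

for index, best in enumerate(top_predictions(mock_results)):
    print(f"sequence {index}: {best['label']} (score={best['score']:.2f})")
```

In practice, `mock_results` would be replaced by the `results` variable from the inference example, so that each printed line corresponds to one entry of `sequences`.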