monsoon-nlp/llama3-biotokenpretrain-kaniwa


monsoon-nlp committed on May 12, 2024

Commit

3bb7c4d

1 Parent(s): 8327a15

Update README.md

Files changed (1)

README.md +29 -15

@@ -1,35 +1,48 @@
 ---
 license: llama3
 library_name: peft
 tags:
 - trl
 - sft
 - unsloth
 - generated_from_trainer
 base_model: gradientai/Llama-3-8B-Instruct-262k
 model-index:
 - name: llama3-biotokenpretrain-kaniwa
   results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
 # llama3-biotokenpretrain-kaniwa
-This model is a fine-tuned version of [gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k) on the None dataset.
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
 ## Training procedure
@@ -47,14 +60,15 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_steps: 5
 - training_steps: 280
-### Training results
 ### Framework versions
 - PEFT 0.10.0
 - Transformers 4.40.2
 - Pytorch 2.2.1+cu121
 - Datasets 2.19.1
-- Tokenizers 0.19.1

 ---
 license: llama3
 library_name: peft
+language:
+- en
 tags:
 - trl
 - sft
 - unsloth
 - generated_from_trainer
+- dna
 base_model: gradientai/Llama-3-8B-Instruct-262k
 model-index:
 - name: llama3-biotokenpretrain-kaniwa
   results: []
 ---
 # llama3-biotokenpretrain-kaniwa
+This is a LoRA adapter.
+The base model is the longer-context LLaMA-3-8b-Instruct developed by Gradient and Crusoe: `gradientai/Llama-3-8B-Instruct-262k`
+The tokenizer has added "biotokens" ∎A, ∎C, ∎G, and ∎T.
+The dataset was 0.5% of BYU's 2019 kaniwa (*Chenopodium pallidicaule*) genome, from https://genomevolution.org/coge/GenomeInfo.pl?gid=53872
+The adapter was fine-tuned for 3 hours on an L4 GPU. The data was split into ~7k nucleotide snippets with an Alpaca-like message format.
+Training Notebook: https://colab.research.google.com/drive/1FKA3p_jnfRHYd-hqJdYmKn8MQpxec0t5?usp=sharing
+Sample message:
+```
+Write information about the nucleotide sequence.
+### Sequence:
+∎G∎C∎C∎T∎A∎T∎A∎G∎T∎G∎T∎G∎T∎A∎G...
+### Annotation:
+Information about location in the kaniwa chromosome: >lcl|Cp5
+```
+This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
 ## Training procedure
 - lr_scheduler_warmup_steps: 5
 - training_steps: 280
 ### Framework versions
 - PEFT 0.10.0
 - Transformers 4.40.2
 - Pytorch 2.2.1+cu121
 - Datasets 2.19.1
+- Tokenizers 0.19.1
+### Genome Citation
+Mangelson H, et al. The genome of *Chenopodium pallidicaule*: an emerging Andean super grain. Appl. Plant Sci. 2019;7:e11300. doi: 10.1002/aps3.11300
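
The updated README describes splitting the genome into snippets and wrapping each one in an Alpaca-like message that uses the ∎A, ∎C, ∎G, and ∎T biotokens. A minimal sketch of that formatting step follows, under stated assumptions: the function names `to_biotokens`, `split_snippets`, and `build_message` are hypothetical, the snippet size of 7000 is read from the "~7k" figure, and upper-casing the input and rejecting characters other than A, C, G, and T are guesses, not something the README states. The linked training notebook is the authoritative source for what was actually done.

```python
BIOTOKEN_PREFIX = "∎"  # marker character shown in the README's sample message
NUCLEOTIDES = set("ACGT")  # the four biotokens the README lists


def to_biotokens(sequence: str) -> str:
    """Prefix every nucleotide with the biotoken marker, e.g. 'GC' -> '∎G∎C'.

    Assumption: input is upper-cased first, and any character outside
    A/C/G/T is rejected. The README does not say how such characters
    were handled during training.
    """
    pieces = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        pieces.append(BIOTOKEN_PREFIX + base)
    return "".join(pieces)


def split_snippets(sequence: str, size: int = 7000) -> list[str]:
    """Cut a long sequence into consecutive snippets of at most `size` bases.

    Assumption: fixed-size, non-overlapping windows; the last one may be shorter.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def build_message(sequence: str, location: str) -> str:
    """Assemble one message in the layout of the README's sample."""
    return (
        "Write information about the nucleotide sequence.\n"
        "### Sequence:\n"
        f"{to_biotokens(sequence)}\n"
        "### Annotation:\n"
        f"Information about location in the kaniwa chromosome: {location}"
    )


if __name__ == "__main__":
    # Short stand-in sequence; real snippets would come from the genome FASTA.
    print(build_message("GCCTATAGTGTGTAG", ">lcl|Cp5"))
```

Keeping the biotoken mapping separate from the message template means the same mapping can be reused at inference time, when only the `### Sequence:` part is known and the model is asked to complete the annotation.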