monsoon-nlp committed on
Commit 3bb7c4d
1 Parent(s): 8327a15

Update README.md

Files changed (1)
1. README.md +29 -15
README.md CHANGED
@@ -1,35 +1,48 @@
  ---
  license: llama3
  library_name: peft
  tags:
  - trl
  - sft
  - unsloth
  - generated_from_trainer
  base_model: gradientai/Llama-3-8B-Instruct-262k
  model-index:
  - name: llama3-biotokenpretrain-kaniwa
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # llama3-biotokenpretrain-kaniwa

- This model is a fine-tuned version of [gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k) on the None dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

  ## Training procedure
 
@@ -47,14 +60,15 @@ The following hyperparameters were used during training:
  - lr_scheduler_warmup_steps: 5
  - training_steps: 280

- ### Training results
-
-
-
  ### Framework versions

  - PEFT 0.10.0
  - Transformers 4.40.2
  - Pytorch 2.2.1+cu121
  - Datasets 2.19.1
- - Tokenizers 0.19.1
 
  ---
  license: llama3
  library_name: peft
+ language:
+ - en
  tags:
  - trl
  - sft
  - unsloth
  - generated_from_trainer
+ - dna
  base_model: gradientai/Llama-3-8B-Instruct-262k
  model-index:
  - name: llama3-biotokenpretrain-kaniwa
    results: []
  ---

  # llama3-biotokenpretrain-kaniwa

+ This is a LoRA adapter.
+
+ The base model is the long-context Llama-3-8B-Instruct developed by Gradient and Crusoe: `gradientai/Llama-3-8B-Instruct-262k`.
+
+ The tokenizer has added "biotokens" ∎A, ∎C, ∎G, and ∎T, one per nucleotide base.
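As an illustrative sketch that is not part of this commit: extending the base tokenizer with such single-character biotokens and resizing the embeddings typically looks like the following. The four token strings come from the card above; the loading details are assumptions.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

base = "gradientai/Llama-3-8B-Instruct-262k"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One "biotoken" per nucleotide, so each base maps to a single token ID
# instead of being split by the default BPE merges.
num_added = tokenizer.add_tokens(["∎A", "∎C", "∎G", "∎T"])

# Grow the embedding matrix to cover the new vocabulary entries.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```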
+
+ The dataset was 0.5% of BYU's 2019 kaniwa (*Chenopodium pallidicaule*) genome, from https://genomevolution.org/coge/GenomeInfo.pl?gid=53872
+
+ The adapter was finetuned for 3 hours on an L4 GPU. The data was split into snippets of roughly 7,000 nucleotides, each wrapped in an Alpaca-like message format.
+
+ Training notebook: https://colab.research.google.com/drive/1FKA3p_jnfRHYd-hqJdYmKn8MQpxec0t5?usp=sharing
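The preprocessing itself is not shown in the diff; a hypothetical sketch of splitting a FASTA file into ~7,000-nucleotide snippets and rewriting each base as its ∎-prefixed biotoken might look like this. The file name, chunk size, and record fields are assumptions, not values from the commit.

```python
# Hypothetical preprocessing sketch: chunk a FASTA file into ~7,000-nucleotide
# snippets and rewrite each base as a ∎-prefixed biotoken.
SNIPPET_LEN = 7000

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def to_biotokens(sequence):
    """Rewrite A/C/G/T as ∎A/∎C/∎G/∎T; drop other characters (N, gaps)."""
    return "".join("∎" + base for base in sequence if base in "ACGT")

snippets = []
for header, sequence in read_fasta("kaniwa_sample.fasta"):
    for start in range(0, len(sequence), SNIPPET_LEN):
        piece = sequence[start:start + SNIPPET_LEN]
        snippets.append({
            # Mirrors the sample annotation shown below, e.g. ">lcl|Cp5".
            "annotation": "Information about location in the kaniwa chromosome: " + header.split()[0],
            "sequence": to_biotokens(piece),
        })
```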

+ Sample message:
+ ```
+ Write information about the nucleotide sequence.

+ ### Sequence:
+ ∎G∎C∎C∎T∎A∎T∎A∎G∎T∎G∎T∎G∎T∎A∎G...

+ ### Annotation:
+ Information about location in the kaniwa chromosome: >lcl|Cp5
+ ```
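A small helper that reproduces the message layout above could look like this; the template text is copied from the sample, while the function and field names are hypothetical.

```python
# Hypothetical formatting helper for the Alpaca-like message shown above.
PROMPT_TEMPLATE = """Write information about the nucleotide sequence.

### Sequence:
{sequence}

### Annotation:
{annotation}"""

def format_example(snippet):
    """Turn one {'sequence', 'annotation'} record into a single training string."""
    return PROMPT_TEMPLATE.format(**snippet)

print(format_example({
    "sequence": "∎G∎C∎C∎T∎A∎T∎A∎G",
    "annotation": "Information about location in the kaniwa chromosome: >lcl|Cp5",
}))
```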

+ This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
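Because the artifact is a PEFT adapter, a minimal inference sketch might look like the following. The adapter repo id, the assumption that the adapter repo ships the extended tokenizer, and the dtype/device choices are all assumptions rather than facts from the card.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "gradientai/Llama-3-8B-Instruct-262k"
adapter = "monsoon-nlp/llama3-biotokenpretrain-kaniwa"  # assumed repo id for this adapter

# Assumes the adapter repo includes the tokenizer with the added biotokens.
tokenizer = AutoTokenizer.from_pretrained(adapter)

model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto"  # dtype/device are assumptions
)
model.resize_token_embeddings(len(tokenizer))  # match the extended vocabulary
model = PeftModel.from_pretrained(model, adapter)

prompt = (
    "Write information about the nucleotide sequence.\n\n"
    "### Sequence:\n∎G∎C∎C∎T∎A∎T∎A∎G\n\n"
    "### Annotation:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```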

  ## Training procedure

  ...

  - lr_scheduler_warmup_steps: 5
  - training_steps: 280
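Only the last two hyperparameters appear in this hunk. A heavily hedged sketch of how they could slot into a TRL `SFTTrainer` run follows; every other value is a placeholder, the dataset/model objects are reused from the sketches above, and on newer TRL versions some arguments move to `SFTConfig`.

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# warmup_steps=5 and max_steps=280 are the values listed above; everything else
# here is a placeholder. `snippets`, `format_example`, `model`, and `tokenizer`
# come from the earlier sketches, and `model` is assumed to already be
# LoRA-wrapped (e.g. by Unsloth or peft.get_peft_model).
train_dataset = Dataset.from_list([{"text": format_example(s)} for s in snippets])

args = TrainingArguments(
    output_dir="llama3-biotokenpretrain-kaniwa-lora",
    per_device_train_batch_size=1,
    learning_rate=2e-4,
    warmup_steps=5,
    max_steps=280,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",   # on newer TRL versions this lives on SFTConfig
    max_seq_length=8192,
    args=args,
)
trainer.train()
```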

  ### Framework versions

  - PEFT 0.10.0
  - Transformers 4.40.2
  - Pytorch 2.2.1+cu121
  - Datasets 2.19.1
+ - Tokenizers 0.19.1
+
+
+ ### Genome Citation
+
+ Mangelson H, et al. The genome of *Chenopodium pallidicaule*: an emerging Andean super grain. Appl. Plant Sci. 2019;7:e11300. doi: 10.1002/aps3.11300