lucyknada committed
Commit 55bfee7 · verified · 1 Parent(s): af6eb18

Upload ./README.md with huggingface_hub

Files changed (1): README.md +197 -0
README.md ADDED
---
license: apache-2.0
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
language:
- en
base_model:
- arcee-ai/SuperNova-Medius
library_name: transformers
pipeline_tag: text-generation
---
### exl2 quant (measurement.json in main branch)
---
### check revisions for quants
---

<img src="card_img.png">
<small>Image by CalamitousFelicitousness</small>

---

# Qwen2.5-14B Sugarquill v1

A continued pretrain of SuperNova-Medius on assorted short-story data from the web. SuperNova already had nice prose, but diversifying it a bit certainly doesn't hurt.
It's also, at last, a storywriter model with enough context for something longer than a short story, which is nice.

It's a fair bit more temperamental than Gemma, but can be tamed with some sampling.
Instruction following also stayed rather strong, so it works for both RP and storywriting, whether in chat mode via back-and-forth co-writing or via raw completion.

Overall, I'd say it successfully transfers the essence of what I liked about Gemma Sugarquill. I will also make a Qwen version of Aletheia, but with a brand-new LoRA, based on a brand-new RP dataset that's in the making right now.

Model was trained by Auri.

---

**Training notes**

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. I've also normalized punctuation to ASCII on the train split, so mismatched quote marks should no longer be an issue. Whitespace was normalized as well, so double spaces after periods should be gone too.

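The exact normalization table used for the train split isn't published, so here is only a minimal sketch of the idea (the character map and the `normalize` helper are illustrative assumptions, not the actual preprocessing code):

```python
import re

# Hypothetical punctuation map -- illustrates the idea, not the actual table.
PUNCT_MAP = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> ASCII apostrophe
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> ASCII quote
    "\u2026": "...",                # horizontal ellipsis -> three dots
}

def normalize(text: str) -> str:
    for src, dst in PUNCT_MAP.items():
        text = text.replace(src, dst)
    # collapse runs of spaces, e.g. double spaces after a period
    return re.sub(r" {2,}", " ", text)

print(normalize("\u201cWell\u2026\u201d  he said."))
```
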
It was trained on a 5x3090 Ti workstation for 7.5 hours with rsLoRA. I switched back to Axolotl for this run, as LF just plain refused to run at all on this workstation. It's also a bf16 LoRA this time. Overall, training went much smoother than last time: I had attempted to train Qwen Sugarquill several times before, but the loss jumped around like crazy. An effective batch size of 40, rsLoRA, and the paged_ademamix_8bit optimizer seemingly solved this issue completely.

Thanks to Kearm for providing compute for this training run!

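For reference, the effective batch size of 40 falls straight out of the parallelism math (GPU count from the workstation description, the other two values from the Axolotl config in this card):

```python
# Effective batch size = GPUs x micro batch size x gradient accumulation steps.
num_gpus = 5                      # 5x3090 Ti workstation
micro_batch_size = 1              # micro_batch_size in the config
gradient_accumulation_steps = 8   # gradient_accumulation_steps in the config

effective_batch_size = num_gpus * micro_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 40
```
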
**Format**

The model responds to ChatML instruct formatting, exactly like its base model.

```
<|im_start|>system
{system message}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
```
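
The template above can be assembled by hand as a quick sanity check (a minimal sketch using plain string formatting; `build_chatml_prompt` is a hypothetical helper, and a real tokenizer's chat template should produce the same layout for ChatML models):

```python
# Minimal sketch: build a ChatML prompt string by hand, ending with an open
# assistant turn so the model continues from there.
def build_chatml_prompt(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("You are a co-writer.", "Continue the story.")
print(prompt)
```
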

**Recommended Samplers**

I found this configuration to be quite stable:

```
Temperature - 0.8
Min-P - 0.05
Top-A - 0.3
Repetition Penalty - 1.03
```

Feel free to toy around with samplers once you get a feel for the model. It seems to like Top-A and Smooth Sampling quite a bit.

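As one way to apply these values, here is a sketch of the recommended samplers expressed as a request payload for a KoboldCpp-style backend (the field names and the example prompt are assumptions; check your own backend's API docs, since e.g. text-generation-webui uses slightly different keys):

```python
import json

# Recommended sampler values from this card, as a generation request payload.
payload = {
    "prompt": "<|im_start|>user\nWrite a short story opening.<|im_end|>\n"
              "<|im_start|>assistant\n",
    "temperature": 0.8,   # Temperature
    "min_p": 0.05,        # Min-P
    "top_a": 0.3,         # Top-A
    "rep_pen": 1.03,      # Repetition Penalty
    "max_length": 512,    # illustrative value, not from the card
}
print(json.dumps(payload, indent=2))
```
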
**Training config**
<details><summary>See Axolotl config</summary>

axolotl version: `0.4.1`
```yaml
# Model
base_model: arcee-ai/SuperNova-Medius
strict: false

# Liger Kernels (optimization)
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

# Output and HuggingFace
output_dir: /home/kearm/axolotl/TQ-2.5-14B-Sugarquill
hub_model_id: allura-org/TQ-2.5-14B-Sugarquill-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

# WandB
wandb_project: huggingface
wandb_entity:
wandb_name: TQ-2.5-14B-Sugarquill-1

# Data
#chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: allura-org/sugarquill-10k
    type: completion

## Evaluation
val_set_size: 0.01
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128

# Technical aspects
sequence_len: 8192
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
special_tokens:

# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: false

# LoRA
peft_use_rslora: true
peft_use_dora: false # better but slower
adapter: lora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 32
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
# - embed_tokens
# - lm_head

#loraplus_lr_ratio: 8 # converges faster but tends to make the model unstable
#loraplus_lr_embedding:

# Training hyperparameters
# max_steps:
num_epochs: 2

# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0

## Learning Rate
warmup_ratio: 0.05
learning_rate: 0.00003
lr_scheduler: cosine
#lr_scheduler_kwargs:
#  min_lr: 0.0000024
optimizer: paged_ademamix_8bit # usually adamw_torch or paged_adamw_8bit

## Batch Size
gradient_accumulation_steps: 8 # higher effective batch size usually gives a stabler train; MBS also speeds it up
micro_batch_size: 1 # batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing: "unsloth"
gradient_checkpointing_kwargs:
  use_reentrant: true
local_rank:
deepspeed: /home/kearm/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi gpu # _bf16_cpuoffload_all
# fsdp:
#   - full_shard
#   - auto_wrap
# fsdp_config:
#   fsdp_limit_all_gathers: true
#   fsdp_sync_module_states: true
#   fsdp_offload_params: true
#   fsdp_use_orig_params: false
#   fsdp_cpu_ram_efficient_loading: true
#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
#   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
#   fsdp_state_dict_type: FULL_STATE_DICT
#   fsdp_sharding_strategy: FULL_SHARD

# Misc
early_stopping_patience:
debug:
```

</details>