Triangle104 committed on
Commit
8ac7fcc
1 Parent(s): 2511c2d

Update README.md

Files changed (1)
  1. README.md +167 -0
README.md CHANGED
@@ -17,6 +17,173 @@ tags:
17
  This model was converted to GGUF format from [`allura-org/TQ2.5-14B-Sugarquill-v1`](https://huggingface.co/allura-org/TQ2.5-14B-Sugarquill-v1) using llama.cpp via the ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
18
  Refer to the [original model card](https://huggingface.co/allura-org/TQ2.5-14B-Sugarquill-v1) for more details on the model.
19
 
20
+ ---
21
+ Model details:
22
+ -
23
+ Qwen2.5-14B Sugarquill v1
24
+
25
+ A continued pretrain of SuperNova-Medius on assorted short story data from the web. SuperNova already had nice prose, but diversifying it a bit definitely doesn't hurt. Also, it's finally a storywriter model with enough context for something more than a short story, which is nice too.
26
+
27
+ It's a fair bit more temperamental than Gemma, but can be tamed with some sampling. Instruction following also stayed rather strong, so it works for both RP and storywriting, both in chat mode via back-and-forth co-writing and on raw completion.
28
+
29
+ Overall, I'd say it successfully transfers the essence of what I liked about Gemma Sugarquill. I will also make a Qwen version of Aletheia, but with a brand new LoRA, based on a brand new RP dataset that's in the making right now.
30
+
31
+ Model was trained by Auri.
32
+
33
+ Training notes
34
+
35
+ This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. I've also normalized punctuation to ASCII on the train split, so mismatched quote marks should not be an issue anymore. I also normalized whitespace, so double spaces after a period should be gone as well.
36
+
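The normalization described above could look roughly like the following. This is a minimal sketch, not the author's actual preprocessing script: the character map `PUNCT_MAP` and the function name `normalize_text` are assumptions made for illustration, and the exact set of characters that was normalized is not stated in the model card.

```python
import re

# Hypothetical mapping of common typographic punctuation to ASCII.
# The real character set used for the train split is not documented.
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophes
    "\u2013": "-", "\u2014": "--",  # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
})


def normalize_text(text: str) -> str:
    """Map typographic punctuation to ASCII and collapse repeated spaces."""
    text = text.translate(PUNCT_MAP)
    # Collapse runs of spaces or tabs (e.g. double spaces after a period).
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "\u201cHello.  World\u2019s end\u2026\u201d"
print(normalize_text(sample))
```

Newlines are left alone here on purpose, so paragraph breaks in stories survive the cleanup.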
37
+ It was trained on a 5x3090Ti workstation for 7.5 hours with rsLoRA. I switched back to Axolotl for this run, as LF just plain refused to run at all on this workstation. Also, it's a bf16 LoRA this time. Overall, training went much smoother than last time. I had attempted to train Qwen Sugarquill several times before, but the loss jumped like crazy. An effective batch size of 40, rsLoRA and the paged_ademamix_8bit optimizer seem to have completely solved this issue.
38
+
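The effective batch size of 40 is consistent with the values in the training config in this README (`micro_batch_size: 1`, `gradient_accumulation_steps: 8`) multiplied across the 5 GPUs. The sketch below works through that arithmetic and also compares the standard LoRA scaling factor with the rank-stabilized one for `lora_r: 64` and `lora_alpha: 32`. It assumes the usual definitions (`alpha / r` for LoRA, `alpha / sqrt(r)` for rsLoRA) and that all 5 GPUs took part in data-parallel training. The tokens-per-step figure is only an upper bound that assumes every packed sequence is full.

```python
import math

# Values taken from the training notes and the Axolotl config in this README.
micro_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 5            # "5x3090Ti workstation"
sequence_len = 8192

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
# Upper bound on tokens per optimizer step with fully packed sequences.
max_tokens_per_step = effective_batch_size * sequence_len

lora_r = 64
lora_alpha = 32
standard_scale = lora_alpha / lora_r             # classic LoRA scaling
rslora_scale = lora_alpha / math.sqrt(lora_r)    # rank-stabilized scaling

print(f"effective batch size: {effective_batch_size}")
print(f"max tokens per step:  {max_tokens_per_step}")
print(f"LoRA scale:   {standard_scale}")
print(f"rsLoRA scale: {rslora_scale}")
```

With these numbers the rsLoRA factor is 8 times larger than the classic one, so the same `lora_alpha` gives the adapter a much stronger effective update at rank 64.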
39
+ Thanks to Kearm for providing compute for this training run!
40
+
41
+ Format
42
+
43
+ Model responds to ChatML instruct formatting, exactly like its base model.
44
+
45
+ <|im_start|>system
46
+ {system message}<|im_end|>
47
+ <|im_start|>user
48
+ {user message}<|im_end|>
49
+ <|im_start|>assistant
50
+ {response}<|im_end|>
51
+
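As an illustration of the template above, here is a small helper that assembles a ChatML prompt from a list of messages. The function name `format_chatml` and the `add_generation_prompt` flag are hypothetical names chosen for this sketch. In practice the tokenizer's own chat template would normally do this job.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the response.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a co-writer for short fiction."},
    {"role": "user", "content": "Continue the story from the last scene."},
])
print(prompt)
```

For raw completion, which the model card also mentions, no template is needed and the story text can be passed as-is.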
52
+ Recommended Samplers
53
+
54
+ I found this configuration to be quite stable:
55
+
56
+ Temperature - 0.8
57
+ Min-P - 0.05
58
+ Top-A - 0.3
59
+ Repetition Penalty - 1.03
60
+
61
+ Feel free to toy around with samplers after you get a feel for it. It seems to like Top-A and Smooth Sampling quite a bit.
62
+
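To give a feel for what these settings do, the sketch below applies Temperature, Min-P and Top-A to a toy set of logits. It assumes the commonly used definitions: Min-P drops tokens below `min_p * p_max`, and Top-A drops tokens below `top_a * p_max ** 2`. Both cutoffs are computed from the same distribution here. Real backends apply samplers in a configurable order and may differ in detail. Repetition penalty is left out, and the logits are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_probs(probs, min_p=0.05, top_a=0.3):
    """Zero out tokens below the Min-P and Top-A cutoffs, then renormalize."""
    p_max = max(probs)
    threshold = max(min_p * p_max, top_a * p_max ** 2)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


toy_logits = [5.0, 4.0, 2.0, 0.0, -3.0]  # hypothetical values
probs = softmax(toy_logits, temperature=0.8)
filtered = filter_probs(probs, min_p=0.05, top_a=0.3)
print([round(p, 4) for p in filtered])
```

Because the Top-A cutoff grows with the square of the top probability, it prunes hard when the model is confident and stays permissive when the distribution is flat.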
63
+ Training config
64
+ See Axolotl config
65
+
66
+ axolotl version: 0.4.1
67
+
68
+ # Model
69
+ base_model: arcee-ai/SuperNova-Medius
70
+ strict: false
71
+
72
+ # Liger Kernels (optimization)
73
+ plugins:
74
+ - axolotl.integrations.liger.LigerPlugin
75
+ liger_rope: true
76
+ liger_rms_norm: true
77
+ liger_swiglu: true
78
+ liger_fused_linear_cross_entropy: true
79
+
80
+ # Output and HuggingFace
81
+ output_dir: /home/kearm/axolotl/TQ-2.5-14B-Sugarquill
82
+ hub_model_id: allura-org/TQ-2.5-14B-Sugarquill-LoRA
83
+ hf_use_auth_token: true
84
+ hub_strategy: "all_checkpoints"
85
+
86
+ # WandB
87
+ wandb_project: huggingface
88
+ wandb_entity:
89
+ wandb_name: TQ-2.5-14B-Sugarquill-1
90
+
91
+ # Data
92
+ #chat_template: chatml
93
+ #train_on_inputs: false
94
+ group_by_length: false
95
+ datasets:
96
+   - path: allura-org/sugarquill-10k
97
+     type: completion
98
+
99
+ ## Evaluation
100
+ val_set_size: 0.01
101
+ evals_per_epoch: 4
102
+ eval_table_size:
103
+ eval_max_new_tokens: 128
104
+
105
+ # Technical aspects
106
+ sequence_len: 8192
107
+ save_safetensors: true
108
+ saves_per_epoch: 2
109
+ logging_steps: 1
110
+ special_tokens:
111
+
112
+ # Quantization
113
+ bf16: auto
114
+ fp16:
115
+ tf32: false
116
+ ## For LoRA
117
+ load_in_8bit: false
118
+ load_in_4bit: false
119
+
120
+ # LoRA
121
+ peft_use_rslora: true
122
+ peft_use_dora: false # better but slower
123
+ adapter: lora # lora or qlora
124
+ lora_model_dir:
125
+ lora_r: 64 # 64 is optimal for most trains on instruct
126
+ lora_alpha: 32
127
+ lora_dropout: 0.1
128
+ lora_target_linear: true
129
+ lora_fan_in_fan_out:
130
+ lora_target_modules:
131
+ # - embed_tokens
132
+ # - lm_head
133
+
134
+ #loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable
135
+ #loraplus_lr_embedding:
136
+
137
+ # Training hyperparameters
138
+ # max_steps:
139
+ num_epochs: 2
140
+
141
+ # Anti Overfit and Stability
142
+ weight_decay: 0.01
143
+ max_grad_norm: 1.0
144
+
145
+ ## Learning Rate
146
+ warmup_ratio: 0.05
147
+ learning_rate: 0.00003
148
+ lr_scheduler: cosine
149
+ #lr_scheduler_kwargs:
150
+ # min_lr: 0.0000024
151
+ optimizer: paged_ademamix_8bit # usually adamw_torch or paged_adamw_8bit
152
+
153
+ ## Batch Size
154
+ gradient_accumulation_steps: 8 # More effective batch size - stabler train, usually. MBS also speeds it up.
155
+ micro_batch_size: 1 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
156
+ eval_batch_size: 1
157
+
158
+ # Optimizations
159
+ pad_to_sequence_len: true
160
+ sample_packing: true
161
+ eval_sample_packing: false
162
+ flash_attention: true
163
+ xformers_attention:
164
+ gradient_checkpointing: "unsloth"
165
+ gradient_checkpointing_kwargs:
166
+   use_reentrant: true
167
+ local_rank:
168
+ deepspeed: /home/kearm/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi gpu # _bf16_cpuoffload_all
169
+ # fsdp:
170
+ # - full_shard
171
+ # - auto_wrap
172
+ # fsdp_config:
173
+ # fsdp_limit_all_gathers: true
174
+ # fsdp_sync_module_states: true
175
+ # fsdp_offload_params: true
176
+ # fsdp_use_orig_params: false
177
+ # fsdp_cpu_ram_efficient_loading: true
178
+ # fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
179
+ # fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
180
+ # fsdp_state_dict_type: FULL_STATE_DICT
181
+ # fsdp_sharding_strategy: FULL_SHARD
182
+ # Misc
183
+ early_stopping_patience:
184
+ debug:
185
+
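The learning-rate settings in the config above (`learning_rate: 0.00003`, `warmup_ratio: 0.05`, `lr_scheduler: cosine`, no `min_lr`) describe a linear warmup followed by cosine decay. The sketch below shows the general shape of such a schedule. The function name `lr_at_step` and the total of 100 steps are assumptions for illustration. The real step count of the run is not given, and the trainer's exact rounding may differ.

```python
import math


def lr_at_step(step, total_steps, peak_lr=3e-5, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 100  # hypothetical; the actual number of optimizer steps is not stated
for step in (0, 5, 25, 50, 75, 100):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```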
186
+ ---
187
  ## Use with llama.cpp
188
  Install llama.cpp through brew (works on Mac and Linux)
189