Text Generation
Transformers
GGUF
Inference Endpoints
conversational
aashish1904 committed on
Commit
5d665e2
1 Parent(s): 7ec153c

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +171 -0
README.md ADDED
---
license: gemma
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
base_model:
- UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
pipeline_tag: text-generation
library_name: transformers
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/G2-9B-Sugarquill-v0-GGUF
This is a quantized version of [allura-org/G2-9B-Sugarquill-v0](https://huggingface.co/allura-org/G2-9B-Sugarquill-v0), created using llama.cpp.

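To run one of these quants locally, llama-cpp-python together with huggingface_hub is one option. Below is a minimal sketch, assuming both packages are installed; the GGUF filename is illustrative only, so check this repository's file list for the quant you actually want.

```python
# Minimal sketch: download one quant and generate with llama-cpp-python.
# The filename below is an assumption -- substitute the quant you want from this repo.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="QuantFactory/G2-9B-Sugarquill-v0-GGUF",
    filename="G2-9B-Sugarquill-v0.Q4_K_M.gguf",  # hypothetical quant name
)

llm = Llama(model_path=model_path, n_ctx=8192, n_gpu_layers=-1)

# Gemma instruct formatting, as described in the original model card below.
# llama.cpp usually prepends <bos> during tokenization, so it is omitted here.
prompt = (
    "<start_of_turn>user\n"
    "Write a short story about a witch whose quill writes on its own.<end_of_turn>\n"
    "<start_of_turn>model\n"
)

out = llm(prompt, max_tokens=512, temperature=0.8, stop=["<end_of_turn>"])
print(out["choices"][0]["text"])
```
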
# Original Model Card

<img src="image_27.png" alt="A beautiful witch writing a book with a quill">
<sub>Image by CalamitousFelicitouness</sub>

---

# Gemma-2-9B Sugarquill v0

An experimental continued pretrain of Gemma-2-9B-It-SPPO-Iter3 on assorted short story data from the web.
I was trying to diversify Gemma's prose without completely destroying its smarts. I think I half-succeeded? This model could have used another epoch of training, but even this is already more creative and descriptive than its base model, without becoming too silly. It also doesn't seem to have degraded much in terms of core abilities.
Should be usable both for RP and raw completion storywriting.
I originally planned to use this in a merge, but I feel like this model is interesting enough to be released on its own as well.

Model was trained by Auri.

Dedicated to Cahvay, who has wanted a Gemma finetune from me for months now, and to La Rata, who loves storywriter models.

GGUFs by Prodeus: https://huggingface.co/allura-org/G2-9B-Sugarquill-v0-GGUF

**Training notes**

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. It was trained on an 8xH100 SXM node for 30 minutes with rsLoRA.
I got complete nonsense reported to my wandb during this run, and logging stopped altogether after step 13 for some reason. This seems to be directly related to Gemma, as my training setup worked flawlessly for Qwen.
Thanks to Kearm for helping with setting up LLaMA-Factory on that node, and to Featherless for providing it for EVA-Qwen2.5 (and this model, unknowingly lol) training.
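
As a rough sanity check of those numbers, here is a back-of-the-envelope sketch using the batch settings from the LLaMA-Factory config below; it assumes all 8 GPUs ran data-parallel, that packing fills every sample to the 8192-token cutoff, and it ignores the 10% validation split.

```python
# Back-of-the-envelope for the run above (assumptions noted in the text).
per_device_bs = 1      # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
num_gpus = 8           # 8xH100 SXM node
cutoff_len = 8192      # packed sequence length

effective_bs = per_device_bs * grad_accum * num_gpus  # 64 packed sequences per optimizer step
tokens_per_step = effective_bs * cutoff_len           # ~524k tokens per step
total_tokens = 2 * 18_700_000                         # 2 epochs over ~18.7M tokens
print(effective_bs, tokens_per_step, total_tokens // tokens_per_step)  # 64 524288 71
```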

**Format**

The model responds to Gemma instruct formatting, exactly like its base model.

```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn><eos>
```

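The same layout can be reproduced from the base model's chat template when working through Transformers. A minimal sketch, assuming the `transformers` package is installed and the base tokenizer still ships the stock Gemma-2 template:

```python
# Sketch: build a Gemma-format prompt via the tokenizer's chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3")
messages = [{"role": "user", "content": "Continue the story: the quill began writing on its own."}]

# tokenize=False returns the raw prompt string; add_generation_prompt opens the model turn.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <bos><start_of_turn>user
# Continue the story: the quill began writing on its own.<end_of_turn>
# <start_of_turn>model
```
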
**Training config**
<details><summary>See LLaMA-Factory config</summary>

```yaml
### Model
model_name_or_path: UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
#ref_model: # Reference model for RL (optional, for everything besides SimPO, which doesn't take it at all)
#ref_model_quantization_bit: 8 # 8 or 4

### Method
stage: pt # pt, sft, rm, ppo, kto, dpo (includes orpo and simpo)
do_train: true
finetuning_type: lora # full, freeze or lora
lora_target: all
#pref_beta: 0.1
#pref_loss: simpo # sigmoid (dpo), orpo, simpo, ipo, hinge

### Reward model
#reward_model: RLHFlow/ArmoRM-Llama3-8B-v0.1 # or sfairXC/FsfairX-Gemma2-RM-v0.1 or nvidia/Llama-3.1-Nemotron-70B-Reward-HF
#reward_model_type: full # full, lora, api
#reward_model_adapters: # Path to RM LoRA adapter(s) if using a LoRA RM
#reward_model_quantization_bit: 8 # 4 or 8

### Freeze
#freeze_trainable_layers: # The number of trainable layers for freeze (partial-parameter) fine-tuning. Positive number means n last layers to train, negative - n first layers to train
#freeze_trainable_modules: # Name(s) of trainable modules for freeze (partial-parameter) fine-tuning. Use commas to separate
#freeze_extra_modules: # Name(s) of modules apart from hidden layers to be set as trainable. Use commas to separate

### LoRA
#loraplus_lr_ratio: 8.0
#loraplus_lr_embedding:
use_dora: false
use_rslora: true
lora_rank: 64 # 64 is optimal for most trains on instruct, if training on base - use rslora or dora
lora_alpha: 32
lora_dropout: 0.05
#pissa_init: true
#pissa_iter: 16
#pissa_convert: true

### QLoRA
quantization_bit: 8 # 2,3,4,5,6,8 in HQQ, 4 or 8 in bnb
quantization_method: hqq # bitsandbytes or hqq

### DeepSpeed
deepspeed: examples/deepspeed/ds_z2_config.json # ds_z3_config.json or ds_z2_config.json which is required for HQQ on multigpu

### Dataset
dataset: sugarquill-10k # define in data/dataset_info.json
cutoff_len: 8192
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
#template: chatml

### Output
output_dir: saves/gemma/lora/sugarquill-1
logging_steps: 3
save_steps: 50
plot_loss: true
compute_accuracy: true
overwrite_output_dir: true

### Train
per_device_train_batch_size: 1 # Effective b/s == per-device b/s * grad accum steps * number of GPUs
gradient_accumulation_steps: 8
learning_rate: 3.0e-5
optim: paged_adamw_8bit # paged_adamw_8bit or adamw_torch usually
num_train_epochs: 2.0
lr_scheduler_type: cosine # cosine, constant or linear
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000
packing: true
max_grad_norm: 1.0

### Opts
flash_attn: fa2 # auto, disabled, sdpa, fa2 | Gemma will fallback to eager
enable_liger_kernel: true # Pretty much must have if it works
#use_unsloth: true # May not work with multigpu idk
#use_adam_mini: true # Comment optim if using this

### Eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.05

### Misc
include_num_input_tokens_seen: true
ddp_find_unused_parameters: false # Stupid thing tries to start distributed training otherwise
upcast_layernorm: true

### Inference for PPO
#max_new_tokens: 512
#temperature: 0.8
#top_k: 0
#top_p: 0.8

### Tracking
report_to: wandb # or tensorboard or mlflow | LOGIN BEFORE STARTING TRAIN OR ELSE IT WILL CRASH
run_name: G2-9B-Sugarquill-1

### Merge Adapter
#export_dir: models/G2-9B-Sugarquill
#export_size: 4
#export_device: gpu
#export_legacy_format: false

```

</details>