Update README.md
README.md
CHANGED
@@ -22,18 +22,19 @@ The training parameters are there not to ruin it - not make it better, so you do
**Some more notes:**
-13b can only go THAT far. There is no way you can create a 100% solid finetune on 13b. You will get close, but like with a child, sometimes it will spill a cup of milk in your lap. 33b is the way. Sadly, training 33b on home hardware with 24GB is basically useless because you really have to tone down the parameters, which, as I said before, basically ruins it. You need at least 48GB for 33b so you can crank it up.
-IMHO gradient accumulation will LOWER the quality if you can already fit a batch size of more than a few. There may be a sweet spot somewhere, but IDK. Sure, batch 1 and GA 32 will be better than batch 1 and GA 1, but that's not the point, that's a bandaid.
Edit: It could prevent overfitting though, and hence help with generalization. It depends on the goal and on how diverse the dataset is.
-size of dataset matters when you are finetuning on a base model, but matters less when finetuning on a well-finetuned model. In fact, sometimes less is better in that case, or you may be ruining a good previous finetuning.
-alpha = 2x rank seems like something that came from the old times when people had potato VRAM at most. I really don't feel like it makes much sense - it multiplies the weights and that's it (check the PEFT code). Making things louder also makes the noise louder.
-my favorite scheduler is warmup, hold for 1 epoch, then cosine down for the next 1-x epochs.
-rank is literally how many trainable parameters you get - you don't have to try to find some other meaning (style vs knowledge). It's like an image taken with 1Mpixel vs 16Mpixel. You always get the whole image, but on 1Mpixel the details are very mushy.
**Anything else?**
**Some more notes:**
+- 13b can only go THAT far. There is no way you can create a 100% solid finetune on 13b. You will get close, but like with a child, sometimes it will spill a cup of milk in your lap. 33b is the way. Sadly, training 33b on home hardware with 24GB is basically useless because you really have to tone down the parameters, which, as I said before, basically ruins it. You need at least 48GB for 33b so you can crank it up.
+- IMHO gradient accumulation will LOWER the quality if you can already fit a batch size of more than a few. There may be a sweet spot somewhere, but IDK. Sure, batch 1 and GA 32 will be better than batch 1 and GA 1, but that's not the point, that's a bandaid.
Edit: It could prevent overfitting though, and hence help with generalization. It depends on the goal and on how diverse the dataset is.
+- size of dataset matters when you are finetuning on a base model, but matters less when finetuning on a well-finetuned model. In fact, sometimes less is better in that case, or you may be ruining a good previous finetuning.
+- alpha = 2x rank seems like something that came from the old times when people had potato VRAM at most. I really don't feel like it makes much sense - it multiplies the weights and that's it (check the PEFT code). Making things louder also makes the noise louder.
+- my favorite scheduler is warmup, hold for 1 epoch, then cosine down for the next 1-x epochs.
+- rank is literally how many trainable parameters you get - you don't have to try to find some other meaning (style vs knowledge). It's like an image taken with 1Mpixel vs 16Mpixel. You always get the whole image, but on 1Mpixel the details are very mushy.
+the problem of course is - do you have enough diverse training data to fill those parameters with? If not, you'd be creating a very specific model that would have a hard time generalizing. Lowering the rank will help with generalization, but the mundane details will also be lost.
**Anything else?**
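
To make the notes on rank, alpha and the scheduler in this change a bit more concrete, here is a rough sketch in plain Python. It is not taken from the README or from any training script. All function names and the `min_factor` parameter are made up for illustration, the layer width 5120 is just an example number, and `alpha / rank` is assumed to be the default LoRA multiplier (which is what the "check the PEFT code" remark appears to point at).

```python
import math


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    LoRA trains two small matrices, A (rank x d_in) and B (d_out x rank),
    so the count grows linearly with rank.
    """
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update (assumed default: alpha / rank)."""
    return alpha / rank


def warmup_hold_cosine(step: int, warmup_steps: int, hold_steps: int,
                       total_steps: int, min_factor: float = 0.0) -> float:
    """Learning-rate multiplier: linear warmup, flat hold, then cosine decay.

    Returns a factor in [min_factor, 1.0] to multiply the base learning rate by.
    hold_steps would be the number of optimizer steps in one epoch if you want
    "hold for 1 epoch".
    """
    if step < warmup_steps:
        return step / warmup_steps  # linear ramp from 0 up to 1
    if step < warmup_steps + hold_steps:
        return 1.0  # hold at the full learning rate
    decay_steps = max(1, total_steps - warmup_steps - hold_steps)
    progress = min(1.0, (step - warmup_steps - hold_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_factor + (1.0 - min_factor) * cosine


if __name__ == "__main__":
    # Rank 8 vs rank 128 on one illustrative 5120 x 5120 layer.
    for r in (8, 128):
        print(r, lora_trainable_params(5120, 5120, r), lora_scale(2 * r, r))
    # 10 warmup steps, 100 hold steps (one hypothetical epoch), 310 steps total.
    for s in (0, 5, 10, 109, 210, 310):
        print(s, round(warmup_hold_cosine(s, 10, 100, 310), 3))
```

Two things this makes visible: with alpha = 2x rank the multiplier is 2 no matter what rank you pick, so it only makes the update louder, and going from rank 8 to rank 128 gives 16x the trainable parameters per layer. A function shaped like `warmup_hold_cosine` should be usable as the `lr_lambda` of PyTorch's `LambdaLR` (with the step counts bound first), but that wiring is not shown or tested here.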