70B finetunes?

by xxx777xxxASD

Would you make 70B finetunes? It looks like everyone these days is mostly focused on small 8~34B models or 123B ones D:

Right now I'm mostly using 70B Nautilus, Sunfall and OG Nemotron, actually. I wanted to merge a few 70Bs to achieve something new, but I had no disk space, and the online service I tried still hasn't finished merging after 20 days now... Anyway, big fan of your work, thanks for the effort and your amazing models!

Hey thanks! I know I've overlooked 70B for a while now, but 123B has just been fantastic and I'd argue it's worth the extra size. That said, I'm hoping Mistral will release a 70B Medium. I'm not a big fan of Qwen & Llama. Let me see what I can do right now.

I'd like to use 123B as well if not for the 36GB VRAM limitation; basically the only way I can run 123B is with IQ2_XXS quants and 16k 4-bit context. Welp, hoping bitnet becomes a thing one day :(
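(For reference, a quick back-of-the-envelope of why IQ2_XXS is about the ceiling at 36GB. The ~2.06 bits/weight figure is the usual llama.cpp estimate for IQ2_XXS, and the Mistral Large 2 shape numbers below, 88 layers, 8 KV heads, head dim 128, are assumptions pulled from its published config, so treat the totals as a ballpark rather than exact llama.cpp output.)

```python
# Rough VRAM estimate for a quantized model plus its KV cache.
# All figures are approximate; overheads for compute buffers are not included.

def weights_gib(params_b: float, bpw: float) -> float:
    """Size of the quantized weights in GiB, given billions of params and bits/weight."""
    return params_b * 1e9 * bpw / 8 / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int, ctx: int, bits: int) -> float:
    """K and V caches: one entry per layer / kv-head / head-dim / position."""
    return 2 * layers * kv_heads * head_dim * ctx * bits / 8 / 2**30

w = weights_gib(123, 2.06)                   # ~29.5 GiB of IQ2_XXS weights
kv = kv_cache_gib(88, 8, 128, 16 * 1024, 4)  # ~1.4 GiB of 4-bit KV cache at 16k context
print(f"weights ≈ {w:.1f} GiB, 16k 4-bit KV ≈ {kv:.1f} GiB, total ≈ {w + kv:.1f} GiB")
```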

Anyway, I know how much everyone hates L3.1 models for RP finetunes because of how overcooked they are. This isn't a recommendation at all, I'm a zero-IQ dummy at finetuning, but your Tunguska model reminded me of the L3 70B pruned down to 42B parameters, and it made me think about something:

> We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning.
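The selection step from that abstract is simple to prototype. Here's a minimal sketch assuming an HF transformers causal LM and a small calibration text; the angular-distance scoring follows the paper's idea, but the model ID, block size, and `calibration.txt` are placeholders, not anyone's actual pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-70B"   # placeholder; any decoder-only model works
N_PRUNE = 8                              # size of the contiguous block to drop

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

text = open("calibration.txt").read()    # a few thousand tokens of ordinary text
inputs = tok(text, return_tensors="pt", truncation=True, max_length=4096).to(model.device)

with torch.no_grad():
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, hidden]
    hs = model(**inputs, output_hidden_states=True).hidden_states

def angular_distance(a, b):
    # mean per-token angular distance between two hidden-state tensors
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1).clamp(-1, 1)
    return torch.acos(cos.float()).mean().item()

num_layers = len(hs) - 1
# score every contiguous block of N_PRUNE layers by how little it changes the residual stream
scores = {start: angular_distance(hs[start], hs[start + N_PRUNE])
          for start in range(num_layers - N_PRUNE + 1)}
best = min(scores, key=scores.get)
print(f"cheapest block to drop: layers {best}..{best + N_PRUNE - 1} "
      f"(mean angular distance {scores[best]:.4f})")
```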

Would it make things better if such layers weren't removed but just nulled (zeroed out)? Would that make L3.1 70B easier to finetune?
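For what "nulling" could look like in practice: since each Llama decoder layer only adds its attention and MLP outputs onto the residual stream, zeroing the two output projections (`o_proj` and `down_proj`) turns the block into a near-identity pass-through while keeping the depth and tensor layout intact. A hedged sketch with purely illustrative layer indices; whether this actually heals better than outright removal is exactly the open question here:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B", torch_dtype=torch.bfloat16
)

LAYERS_TO_NULL = range(40, 48)  # e.g. the block picked by a similarity scan like the one above

with torch.no_grad():
    for i in LAYERS_TO_NULL:
        layer = model.model.layers[i]
        layer.self_attn.o_proj.weight.zero_()  # attention branch now adds nothing to the residual
        layer.mlp.down_proj.weight.zero_()     # MLP branch now adds nothing to the residual

model.save_pretrained("llama-3.1-70b-nulled-40-47")
```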

@xxx777xxxASD What about a 100B?

Pal, how fast are you cooking? You're awesome! :D


Anyway, running 100B should be much easier for me; I should probably be able to run the IQ2_M quant, which is almost 3bpw. 90B IQ3_XXS might be cool as well, but I doubt it'll be stronger than 100B at one quant lower, maybe. Actually, are you planning to finetune these models further in order to "heal" them? Should a post-finetune IQ2_M 100B feel better than Q3_K_S 70B Nemotron?

I'm starting with 100B since it behaved like Largestral with barely any errors.

I'll revisit 90B, but my first attempt with it was riddled with errors on every first gen.

72B felt like it needed a massive healing effort, and I'll probably write it off as impossible.


Btw, thanks for the inspo.

Yeah, I feel like 72B is a great number, but not if it was pruned from a 123B that's almost twice its size. Still, everything's possible, thanks for your work :)
