Tiny models
Hey man, really in love with your stuff, but I gotta say guys like me are GPU poor. I have an idea that could strike a lot of happy mediums across the board.
Models like the Wolverines are cool, even the duller ones, but A3B models just seem to work better.
But there's a crossroads with context limits, and it gets hit a lot with model sizes bigger than 16B and contexts of 10K or more.
So here's my pitch: what if models like this Qwen3-4B had coder variants, as well as a minimum (A"N"B) activated amount, like MoEs normally do? BUT instead of four copies of the same coder, combine different variations, like: this model does planning really well, and this model does coding REALLY well.
Loading a 16B model isn't hard for me; I run 64GB RAM, but all on CPU.
Coders and guys like me need TINY models that think FAST or not at all ("reasoning is seriously a drag to watch").
So something like ((Qwen2.5-coder-3x7B-A3B-MAX coder-imatrix-gguf types, with Brainstorm 20x or something)).
Small, plenty of headroom, and tons of talent in the coding field. I looked and you've got A LOT of variation, but none tuned to the task. Why not just kill the weights not responsible for TALKING, thinking, and coding? A lot of the other stuff in these models is just useless benchmark shit that serves no purpose.
Would love to hear your thoughts and possibly collab, because, well, I'm working in AGI and this area of training and coding LLMs isn't really my forte.
Thank you in advance, man... appreciate it.
I hear you; at the moment making "MoEs" of smaller Qwens is not possible; it needs Mergekit updates (awaiting them...).
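(For reference, once support lands, the kind of mergekit-moe config that would drive a "planner expert + coder expert" build looks roughly like the sketch below. This is untested and the expert models named are just placeholders, not a recipe.)

```yaml
# Sketch of a mergekit-moe config - NOT a tested recipe; the expert models
# below are placeholders, and small Qwen3 MoE targets are not supported yet.
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
gate_mode: hidden          # route tokens by hidden-state similarity to the prompts below
dtype: bfloat16
experts:
  - source_model: Qwen/Qwen2.5-Coder-7B-Instruct   # "does coding REALLY well"
    positive_prompts:
      - "write a python function"
      - "fix this bug"
  - source_model: Qwen/Qwen2.5-7B-Instruct         # "does planning really well"
    positive_prompts:
      - "plan the architecture"
      - "break this task into steps"
# Built with: mergekit-moe config.yaml ./output-model
```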
However, these may be of interest to you: coders from 0.8B to 12B in 3 collections:
You may also want to see this (as Jan V1 is a coder too, and a very good one):
https://huggingface.co/DavidAU/Qwen3-Jan-v1-256k-ctx-6B-Brainstorm20x
What about narrowing the weights by removing non-active nodes, to get just the weights used for coding, English, and dev work...
Then Brainstorm the hell out of it, and a light post-train???
Run a LoRA-type pass with next to zero for the updates, just to see what gets activated and what doesn't, weight by weight, over a dataset of Python and dev work.
Invert and remove???
It could easily just be a small 3B instruct model, right???
Am I barking up the wrong tree here???
Thanks again, dude.
It is a lot easier to train / make a domain-specific model than to remove weights (reverse training?) as noted.
There might be a way to do this; might want to Google it.
It could create a real mess / damage the model.
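If you want to experiment anyway, the probing half of your idea is roughly the sketch below (untested; the model name, layer filter, prune fraction, and calibration prompts are all placeholders): run a coding dataset through the model, record which MLP units barely fire, then zero them out.

```python
# Rough sketch of "probe activations on a coding set, then prune the quiet weights".
# Untested - model name, layer filter and calibration prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

# 1) Hook the MLP linear layers and accumulate mean |activation| per output unit.
stats, hooks = {}, []

def make_hook(name):
    def hook(module, inputs, output):
        act = output.detach().abs().mean(dim=(0, 1))  # (batch, seq, feat) -> (feat,)
        stats[name] = stats.get(name, torch.zeros_like(act)) + act
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and "mlp" in name:
        hooks.append(module.register_forward_hook(make_hook(name)))

# 2) Run a small coding / dev calibration set through the model.
calibration = [
    "def fibonacci(n):",
    "Write a Python function that parses a JSON config file.",
    # ...a few hundred representative coding prompts in practice
]
with torch.no_grad():
    for text in calibration:
        model(tok(text, return_tensors="pt").input_ids)

for h in hooks:
    h.remove()

# 3) "Invert and remove": zero the least-active output units of each hooked layer.
#    Expect quality damage without a light post-train afterwards.
prune_fraction = 0.2
for name, module in model.named_modules():
    if name in stats:
        k = int(stats[name].numel() * prune_fraction)
        idx = torch.topk(stats[name], k, largest=False).indices
        with torch.no_grad():
            module.weight[idx, :] = 0.0
            if module.bias is not None:
                module.bias[idx] = 0.0
```

Note this is basically crude activation-magnitude pruning, and zeroed rows only shrink the file if you requantize afterwards; it's a way to see how much damage the idea does, not a finished pipeline.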