What other base models are you considering, if any?

by Downtown-Case - opened

As the title says.

There are some you should clearly avoid (like Qwen), but I think there are good candidates too, like ByteDance 36B (fantastic at long context compared to Llama) or GLM 4.5 Air (much easier to CPU-offload), that aren't as weak on world knowledge and storytelling prose as many recent releases.

Latitude org

Rest assured, we're always investigating new base models! The main challenge is usually getting the actual finetune process to work.

Non-Llama/Mistral architectures tend to be less well-supported in that regard.

ByteDance is basically Llama!

It's so close I might actually be able to 'llamafy' it myself, i.e. convert it to the LlamaForCausalLM architecture, as has been done for older models. I'll try. But it's probably too new to have support in Axolotl or whatever y'all use.

Latitude org

Excellent catch - Suppose I'll know what to test soon!

It does seem to be Llama, or at least token-identical in inference. I've uploaded a 'llamafied' version here (with the two base models under that org as well), so it should train exactly the same way you trained Llama 70B.

https://huggingface.co/llamafy/ByteDance-Seed_Seed-OSS-36B-Base-llamafied
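
For anyone who wants to reproduce the token-identical check, here's a minimal sketch, assuming both checkpoints load through transformers. The llamafied repo id is the one linked above; the original's id is my assumption, so swap in whatever you're actually testing:

```python
# Minimal sketch: verify the llamafied checkpoint is (near) token-identical
# to the original. Repo ids are assumptions; swap in whatever you're testing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ORIG = "ByteDance-Seed/Seed-OSS-36B-Base"                     # original arch (assumed id)
LLAMA = "llamafy/ByteDance-Seed_Seed-OSS-36B-Base-llamafied"  # LlamaForCausalLM

tok = AutoTokenizer.from_pretrained(LLAMA)
inputs = tok("The quick brown fox", return_tensors="pt")

logits = []
for repo in (ORIG, LLAMA):
    model = AutoModelForCausalLM.from_pretrained(
        repo,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,  # the original ships a custom architecture
    )
    with torch.no_grad():
        out = model(**{k: v.to(model.device) for k, v in inputs.items()})
    logits.append(out.logits.float().cpu())
    del model
    torch.cuda.empty_cache()

# A faithful conversion should agree to within bf16 rounding noise.
print("max |delta logit|:", (logits[0] - logits[1]).abs().max().item())
```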

I think it'd be good for a 'long context' version of your series. That's certainly a strength of this model.

I've also had success with a 50/50 merge of the instruct and base models. It writes well, more like a novel-style 'base model', and benches midway between the instruct and the base.
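
If anyone wants to reproduce that, a straight parameter average is enough. Here's a minimal sketch, assuming both checkpoints share the same (llamafied) architecture and tokenizer; the repo ids below are hypothetical placeholders, not the actual upload names:

```python
# Sketch of a 50/50 base+instruct weight average. Repo ids are hypothetical
# placeholders; both models must share the same architecture and tokenizer.
import torch
from transformers import AutoModelForCausalLM

BASE = "llamafy/Seed-OSS-36B-Base-llamafied"          # hypothetical id
INSTRUCT = "llamafy/Seed-OSS-36B-Instruct-llamafied"  # hypothetical id

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
inst = AutoModelForCausalLM.from_pretrained(INSTRUCT, torch_dtype=torch.bfloat16)

merged = base.state_dict()
for name, tensor in inst.state_dict().items():
    # Average in fp32 to avoid bf16 accumulation error, then cast back.
    merged[name] = ((merged[name].float() + tensor.float()) / 2).to(tensor.dtype)

base.load_state_dict(merged)
base.save_pretrained("seed-oss-36b-halfway-merge")
```

mergekit's linear merge method does the same thing with less RAM pressure, if you'd rather not hold two 36B models in memory at once.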

Latitude org

Oh, well done there! This will definitely go on my "to investigate with priority" list, thank you. :-)

Just as a late FYI: Nemotron Super 1.5 is a really good base. TheDrummer did fantastic work on it with their latest Valkyrie (49B-v2).
In my view that fine-tune beats most of the Llama 3.3 70B fine-tunes out there (plus it can think... in case you need more in your role-plays).

Maybe a base worth investigating?

(Disclaimer: I haven't tested this Nova fine-tune yet - just pitching in for the base)

@illioren

For STEM stuff I love Nemotron to death, but there are indications that Nemotron 49B is heavily overtuned and bad for creative writing, according to logprob tests:

https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0/discussions/14#6817e02a1ad5806e65445c5c

My own tests with Valkyrie V1 are… Well, it’s not dumb, but it still feels “deep fried” to me, and it collapses quickly at longer contexts.

I dunno about Valkyrie V2, but Nemotron 1.5 is basically 1.0 with tons of thinking training, so I’ve never picked it up to test. I don’t really prefer long thinking blocks for these big dense models (unless it’s for STEM type questions).


For Drummer-style "SillyTavern" RP in the 49B weight class, I'd recommend IceBlink over it. It's clever, fast, and creative: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B

But honestly GLM 4.5 base (or GLM 4.5 Air base) with raw completion formatting are my all-time favorites.
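
By "raw completion formatting" I just mean feeding the base model plain prose with no chat template and letting it continue. A minimal sketch, assuming the base checkpoint loads through a recent transformers (the repo id is my assumption):

```python
# Sketch of raw-completion prompting: no chat template, no roles, just text
# the base model continues. The repo id is an assumption; any base model works.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="zai-org/GLM-4.5-Air-Base",  # assumed repo id
    torch_dtype="auto",
    device_map="auto",
)

opening = "The lighthouse keeper had not spoken to another soul in three years."
out = pipe(opening, max_new_tokens=200, do_sample=True, temperature=0.8)
print(out[0]["generated_text"])
```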

It makes sense that Latitude would be hesitant to train these though, as MoEs are tricky. To quote IceBlink's page:

MoE's are brutal to train even with a small dataset like mine, so I took a different approach from usual. I used a very low LR in an effort to avoid having to apply DPO / KTO training afterwards.

I think there's likely a better config to be found, but experimentation with the model to find it is quite draining.
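
To make the "very low LR" idea concrete, here's roughly what it looks like as Hugging Face TrainingArguments. The numbers are my own illustrative assumptions, not zerofata's actual config:

```python
# Illustrative only: a conservative SFT setup in the spirit of "use a very
# low LR so you don't need DPO/KTO afterwards". All values are assumptions.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="moe-sft",
    learning_rate=2e-6,              # roughly 10x below a typical SFT rate
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # keep the effective batch reasonable
    bf16=True,
    logging_steps=10,
)
```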

It's not a base model, but it would be cool if someone made a fairly big model (24B to 32B) entirely focused on realistic conversations. Not roleplay, not kindness: just a story, a personality, text, and the weight of your actions. A model where characters are capable of insulting you, where situations feel real, where a wound isn't just "you die" but pain and suffering perfectly conveyed through text... you know what I mean?
