Wur doomed!

#14
by jukofyork - opened

Continuation of THE THREAD OF DOOM.

jukofyork pinned discussion

What do you and the others think of the distilled R1 models for writing?

The llama3 / qwen models SFT'd on R1 outputs? I only tried 2 of them.

R1 Qwen (32b) - Lacks knowledge of fiction (same as the official Qwen release), so its writing is no better.

R1 Llama3 - This is generally the worst of them (not just for writing). It'll generate the CoT and then write something completely different.

CoT traces won't let the model do anything out of distribution, so they're not very useful if the base model doesn't have a lot of it in its training data.

Yeah, I have tried the same two and felt the same way.

I also felt that any attempt to add an R1 distill to the merge recipe of an existing merge project made it worse...so far...

@gghfez @BigHuggyD that has been my experience as well, which is a shame, as I had a go with R1 on OpenRouter and was blown away.

What model comes anywhere close that's usable on a machine with 24 GB of VRAM and 32 GB of RAM, in your experience?

There's nothing like it for now. I'm running R1 slowly on my Threadripper:

prompt eval time =   14026.61 ms /   918 tokens (   15.28 ms per token,    65.45 tokens per second)
       eval time =  398806.12 ms /  1807 tokens (  220.70 ms per token,     4.53 tokens per second)
      total time =  412832.73 ms /  2725 tokens
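
If anyone wants to reproduce a CPU-only run like that, here's a rough sketch using llama-cpp-python (not necessarily what was used above - the GGUF filename and thread count are just placeholders):

```python
# Minimal CPU-only run via llama-cpp-python; verbose mode prints timing stats
# similar to the numbers above. The model path is a placeholder - R1 needs a
# heavily quantised GGUF to fit in system RAM at all.
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-r1.Q4_K_M.gguf",  # placeholder filename
    n_ctx=4096,                            # context window
    n_threads=32,                          # roughly match physical cores
    verbose=True,                          # print llama.cpp perf stats
)

out = llm("Write the opening paragraph of a noir short story.", max_tokens=512)
print(out["choices"][0]["text"])
```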

I tried training Wizard2 8x22b MoE on R1 data, but it doesn't really work well. It will plan ahead in think tags, e.g.:

I need to ensure the story maintains its gritty, realistic tone without becoming overly melodramatic. The characters' growth should be subtle but significant. Also, the ending should leave a sense of hope but not be too neat—their redemption is fragile, and the future is uncertain.

Let me outline the next few chapters:

Chapter 5: Nightmares and Trust
...

But it doesn't backtrack like R1 does. It just kind of agrees with itself and ends up writing how it usually would:

“I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead.

lol
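
For anyone who hasn't looked at the distill data: an R1-style SFT sample is basically just a chat turn where the assistant output carries the <think> trace before the final prose, roughly like this (field names are the usual chat convention, not necessarily the exact schema used above):

```python
# Sketch of an R1-style training sample: the assistant turn contains the
# <think> planning trace followed by the actual answer, so the model learns
# to emit both. Contents here are invented for illustration.
sample = {
    "messages": [
        {"role": "user", "content": "Continue the story from chapter 4."},
        {
            "role": "assistant",
            "content": (
                "<think>\n"
                "I need to keep the gritty, realistic tone without becoming "
                "melodramatic... Chapter 5: Nightmares and Trust...\n"
                "</think>\n"
                "Rain tapped against the corrugated roofing overhead as she..."
            ),
        },
    ]
}
```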

Ahhh, that's a shame :-(

"I don’t know what I want anymore,” she admitted, voice barely above a whisper as rain tapped against corrugated roofing overhead."

Oh god!

I'll have to keep an eye on this thread.

I did enjoy Ppoyaa/MythoNemo-L3.1-70B-v1.0

But my tastes are probably not as refined as others on this thread ;-)

@jukofyork It's better if you don't mention safety at all. It just does what you tell it.

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># System Preamble

It seems really good at picking up on what you give it as a "seed",

Yes that's one of the things I like about this model. It pretty much just does what you tell it to do.
I don't actually think it has a lot of baked-in censorship; it looks like it's been trained to do what it's told, and they rely on putting the safety instructions in the tokenizer.chat_template.

but if you just straight up prompt asking for "a story" it'll be Elara's all round.

Yeah, a lot more slop and now "Claude-isms" vs the R series.
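
If you want to see exactly what the stock template injects (and what overriding the system turn does), here's a quick sketch - using the Command-R repo linked further down, since I'd guess A behaves the same way:

```python
# Render the model's default chat template to see any baked-in "# System
# Preamble" / safety text, then compare against supplying your own system
# turn. Just an inspection sketch; model id is the Command-R repo linked below.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r-v01")

messages = [{"role": "user", "content": "Write a short story."}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

# Now with an explicit system turn of your own (no mention of safety at all):
messages = [
    {"role": "system", "content": "You are a fiction co-writer. Follow the user's lead."},
    {"role": "user", "content": "Write a short story."},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```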

@Downtown-Case

EDIT: this is for command R, but perhaps A inherited that too?

Did you ever find out what tool he used to produce that?

It’s just a bit brainwashed.

Literally?

https://github.com/turboderp-org/exllamav3/issues/34#issuecomment-2854186639

EDIT: this is for command R, but perhaps A inherited that too?

I think it could be due to the older models being stored in float16. My guess is they were actually trained using float32 and then they tried to scale all the weights and norms to make it fit in float16:

{
  "_name_or_path": "/home/ahmet_cohere_com/HF_Final_weight_tie",
  "logit_scale": 0.0625,
  "torch_dtype": "float16",
}

https://huggingface.co/CohereLabs/c4ai-command-r-v01/blob/main/config.json

I found that the 0.0625 logit scale was right on the boundary of causing an overflow, so I guess they tried several different values until they settled on this.

It's likely the weird bit in the plot from that issue is just something that overflowed and then had NaN converted to +/- the smallest/largest float16 value, and they never realised.
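
A quick numpy sketch of the failure mode (all values below are invented, just to illustrate how close a scale like 0.0625 can sit to the float16 limit and what the clean-up would do):

```python
# float16 tops out at ~65504, so anything the logit_scale doesn't pull back
# under that limit overflows on the float32 -> float16 cast; a later
# nan_to_num-style clean-up then clamps inf to +/- the largest finite float16
# (and NaN to 0), which would look exactly like the odd spike in that plot.
import numpy as np

print(np.finfo(np.float16).max)  # 65504.0

logits_fp32 = np.array([9.0e5, 1.1e6], dtype=np.float32)  # made-up values

for scale in (0.125, 0.0625, 0.03125):
    scaled = (logits_fp32 * scale).astype(np.float16)  # cast as the export would
    cleaned = np.nan_to_num(scaled)                    # inf -> +/-65504, NaN -> 0
    print(scale, scaled, cleaned)
```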

I'm actually gonna try uncensoring it next - I've updated all the datasets I've created (other than my private "books" dataset, which I can't share for obvious reasons):

https://huggingface.co/jukofyork/datasets#repos

It seems HF doesn't like my single massive JSON files though and they crash the dataset viewer:

Job manager crashed while running this job (missing heartbeats).

Error code:   JobManagerCrashedError

😞
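
If it's the conversion job choking on one huge JSON file, one thing worth trying is letting the datasets library re-shard it into parquet on upload - rough sketch, filename, repo id and shard size are just placeholders:

```python
# Re-shard a single huge JSON file into parquet chunks the dataset viewer can
# cope with. Filename, repo id and shard size are placeholders.
from datasets import load_dataset

ds = load_dataset("json", data_files="my-massive-dataset.json", split="train")
ds.push_to_hub("your-username/my-dataset", max_shard_size="200MB")
```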

I think we might have to make a new doom-thread soon - this one seems to have started bugging out and isn't showing in my "unread" notifications properly now ☹️

I'd appreciate that - this page is almost impossible for my browser to load at this point (for several weeks now, haha)

my browser

Firefox as well right?

isn't showing in my "unread" notifications

Okay, not just a me thing - I've seen it fully vanish from my inbox.

Error code: JobManagerCrashedError

I've been getting that too (randomly) for the past few weeks. I think it's overloaded, and it won't retry after the failure.

Firefox as well right?

No, Safari on macOS. I'm sure Firefox wouldn't be any better for me either.

jukofyork unpinned discussion
jukofyork changed discussion status to closed
