Models

#1
by concedo - opened

Model might be undercooked or maybe I'm using it wrong. Unlike v0.2, it's clear that the model retains a large amount of pretrain knowledge and formatting, and seems very enthusiastic to drop the correct format.

Also, the <|text_sep|> added token that is prevalent in 0.2 still seems to exist, but you don't use it? Model seems to expect <|space|> instead.

Decoding is different without the <|code_start|> and <|code_end|> boundary tokens. Why were they removed?

Hey, regarding "seems very enthusiastic to drop the correct format": this sounds unusual. I've just run a bunch of generation tests on q8, q6, and q4 quants (edit: the 500M one seems fine, and the 1B is also okay) using the exact prompt "hello my name is brenda." Not once did it fail to output the <|im_end|> token, even at high temperatures. So I'm not entirely sure why this is happening on your end. It could be something set up incorrectly, or perhaps a bug?

As for the token changes, yes, the prompt has been slightly modified from the previous version. <|text_sep|> was replaced with <|space|>, and <|code_start|> and <|code_end|> were removed. These tokens were kind of redundant because boundaries can be detected using the time and space tokens:

[time token] [audio tokens] [space token]\n

When decoding, you still extract just the audio tokens, ignoring everything else. If you want to chunk words during streaming, you can use the time token as the <|code_start|> and the space token as the <|code_end|>. This doesn’t make much of a difference functionally, and removing those extra tokens also slightly improves generation speed.
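
For anyone wiring this up, here is a minimal sketch of that decoding step on the generated text. It assumes audio tokens are written as <|N|> with N a plain integer and that each word ends with <|space|>. The function names and the time-token spelling in the usage note are made up for illustration and are not taken from the model's actual interface.

```python
import re

# Matches audio tokens such as <|123|>. Tokens like <|space|> or <|im_end|>
# do not match because their payload is not purely numeric.
AUDIO_TOKEN = re.compile(r"<\|(\d+)\|>")
SPACE_TOKEN = "<|space|>"  # word boundary in the new prompt format


def extract_audio_codes(text: str) -> list[int]:
    """Return every audio code in the text, ignoring all other tokens."""
    return [int(code) for code in AUDIO_TOKEN.findall(text)]


def split_word_chunks(text: str) -> list[list[int]]:
    """Group audio codes per word, using <|space|> as the end-of-word marker."""
    chunks = []
    for part in text.split(SPACE_TOKEN):
        codes = extract_audio_codes(part)
        if codes:  # skip trailing pieces that hold no audio tokens
            chunks.append(codes)
    return chunks
```

With a made-up time token spelling, `split_word_chunks("<|t_0.25|><|12|><|7|><|space|>\n<|t_0.10|><|4095|><|space|>\n<|im_end|>")` groups the codes as one list per word, which is enough to feed a streaming decoder word by word.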

One little noobtrap warning for others coming after me: although the 1B and 500M models share a lot of similar vocab, <|0|> has a token id of 151672 in the 500M model but 50307 in the 1B model! Why this is different I cannot say, but it will affect your CTS decoding, so make sure you adjust your offsets.

Also, there is an off-by-one issue, where the final code is <|4099|> vs <|4100|>, though I cannot tell if 4100 is actually used or just a dummy.

To the authors: recommend keeping vocab IDs standardized between future versions.

Thanks for the suggestion. The token ids differ because the two releases are built on different base models: the 500M model is based on Qwen 0.5B, while the 1B model is based on OLMo 1B, so their vocabularies are entirely different. To support the exact same token ids, we would need to extend the vocabulary by 100k tokens, which would not be practical or efficient. The audio codebook only goes up to 4096, so the extra tokens beyond that are dummies and are not used.
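
If you decode from raw token ids rather than text, a rough sketch of the offset handling could look like the following. The two offsets are the <|0|> ids reported above. The assumptions that audio token ids are contiguous (so <|N|> sits at offset + N) and that valid codes are 0 to 4095 are mine, and the names are hypothetical, so verify them against the tokenizer you actually load.

```python
# Token id of <|0|> for each model, as reported in this thread.
CODE_OFFSETS = {"500M": 151672, "1B": 50307}

# Assumption: the codebook holds 4096 entries (codes 0..4095), and anything
# above that, such as <|4099|>, is an unused dummy token.
CODEBOOK_SIZE = 4096


def token_ids_to_codes(token_ids, model):
    """Map raw token ids to audio codes, dropping non-audio and dummy tokens."""
    offset = CODE_OFFSETS[model]
    codes = []
    for token_id in token_ids:
        code = token_id - offset  # assumes <|N|> has id offset + N
        if 0 <= code < CODEBOOK_SIZE:
            codes.append(code)
    return codes
```

Picking the offset by model name keeps the rest of the decoding path identical for both models, so switching between the 500M and 1B releases only changes one lookup.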
