batched predictions with padding through the model don't seem to work correctly

#7
by karthikramen - opened
    input = tokenizer.apply_chat_template(
        [
          [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}],
          [{"role": "user", "content": prompt2}, {"role": "assistant", "content": response2}],
        ],
        return_tensors="pt",
        truncation=True,
        padding=True,
        tokenize=True,
        return_dict=True,  # return input_ids + attention_mask so **input works below
    ).to(model.device)

    with torch.no_grad():
      output = model(**input)

This doesn't work as intended -- I checked all the logic in the tokenizer and the config, and they seem correct, but I haven't dug into the custom modeling code yet.

You can tell something is off because if you pass in a single conversation instead of a padded batch, you get different results for the same example.
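
For reference, here's roughly how I'm reproducing the mismatch. This is just a sketch: it assumes `tokenizer` / `model` are the ArmoRM tokenizer and model already loaded, `prompt`/`response` and `prompt2`/`response2` are two example pairs, and that the per-example preference score can be read off the custom output's `score` field (as in the model card example).

    import torch

    def get_scores(conversations):
        # Tokenize one or more conversations with padding and run a single forward pass.
        batch = tokenizer.apply_chat_template(
            conversations,
            return_tensors="pt",
            padding=True,
            truncation=True,
            tokenize=True,
            return_dict=True,
        ).to(model.device)
        with torch.no_grad():
            return model(**batch).score.float().cpu()

    conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
    conv2 = [{"role": "user", "content": prompt2}, {"role": "assistant", "content": response2}]

    solo = get_scores([conv1])            # single sequence, no padding needed
    batched = get_scores([conv1, conv2])  # the shorter sequence gets padded

    # These should agree up to numerical noise, but they don't for me.
    print(solo[0].item(), batched[0].item())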

My guess is that before passing the hidden states into the final heads, self.rewards and self.gating, you need to apply the attention_mask so the hidden states at padding positions are filtered out.

here: https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1/blob/main/modeling_custom.py#L150 and a bit below
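
Something along these lines is what I mean -- not the actual code in modeling_custom.py, just a sketch of using the attention_mask to pick the last non-padding token per sequence (assumes right padding):

    import torch

    def last_real_token_hidden(hidden_states, attention_mask):
        # hidden_states: [batch, seq_len, hidden_dim]; attention_mask: [batch, seq_len]
        # Index of the last non-padding token in each sequence (right padding assumed).
        last_idx = attention_mask.sum(dim=-1) - 1
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        # Gather one hidden state per sequence -> [batch, hidden_dim]
        return hidden_states[batch_idx, last_idx]

That per-sequence vector is what I'd expect to go into the reward and gating heads, rather than anything computed at padded positions.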

Having dug in some more, the code that computes the sequence lengths and gating tokens looks correct, but the Llama backbone (transformer_outputs = model(...)[0]) produces fundamentally different values depending on whether the input is padded or not, even when the attention_mask is set correctly.
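
The sanity check I used looks roughly like this -- assuming the Llama backbone is reachable as `model.model` (which is how the custom class appears to wrap it; that attribute name is an assumption on my part) and right-padded batches:

    import torch

    conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
    conv2 = [{"role": "user", "content": prompt2}, {"role": "assistant", "content": response2}]

    def encode(convs):
        return tokenizer.apply_chat_template(
            convs, return_tensors="pt", padding=True, truncation=True,
            tokenize=True, return_dict=True,
        ).to(model.device)

    solo, padded = encode([conv1]), encode([conv1, conv2])

    with torch.no_grad():
        h_solo = model.model(**solo)[0]    # backbone hidden states: [1, seq, dim]
        h_pad = model.model(**padded)[0]   # [2, padded_seq, dim]

    # Compare the hidden state at conv1's last real token in both runs.
    last = solo["attention_mask"].sum() - 1
    print(torch.allclose(h_solo[0, last], h_pad[0, last], atol=1e-3))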

That takes me back to square one: either something is wrong with the tokenizer and/or the Llama-3 implementation :sad:

OK, I figured out my issue -- I was quantizing the model to int8 to speed up batching, and that throws off some of the model's internals. For now I've switched to fp4 quantization instead, which gives an added speed boost and produces metrics that match what I expect.
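
In case it's useful to someone else, this is roughly what the 4-bit (fp4) load looks like with bitsandbytes; the class name matches the model card's example, and the compute dtype is just the default I picked:

    import torch
    from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="fp4",             # fp4 instead of the int8 load that broke things
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        "RLHFlow/ArmoRM-Llama3-8B-v0.1",
        trust_remote_code=True,
        quantization_config=bnb_config,
        device_map="auto",
    )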

karthikramen changed discussion status to closed
RLHFlow org

Thanks for the information!
