Model Description

This model is fine-tuned on reward modeling data in two training stages: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). The resulting post-DPO model is optimized for reasoning and text generation tasks.
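For context, a generic sketch of a DPO stage using the trl library is shown below. This is not the authors' training code; the dataset id, hyperparameters, and argument names (which vary across trl versions) are assumptions for illustration only.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Base model listed on this card; in practice the SFT checkpoint would be loaded here.
base_id = "Qwen/Qwen2.5-14B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Hypothetical preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("your-org/preference-data", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="reasongenrm-dpo", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,  # older trl versions take tokenizer= instead
)
trainer.train()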

Conversations follow a three-turn format, with a dedicated reason turn between the user message and the assistant response:

chat_message = [
  {"role": "user", "content": ...},
  {"role": "reason", "content": ...},
  {"role": "assistant", "content": ...},
]
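
A minimal usage sketch with the Hugging Face transformers library is given below. The repository id is taken from this card, while the example prompt, generation settings, and the assumption that the bundled chat template knows how to render the reason turn are illustrative rather than documented behavior.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id from this card; all other settings are illustrative assumptions.
model_id = "jiulaikankan/Qwen2.5-14B-ReasonGenRM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat_message = [
    {"role": "user", "content": "Which of the two answers below is more helpful, and why?"},
]

# Assumes the repository's chat template accepts the roles shown above and
# appends a generation prompt so the model produces the reason/assistant turns.
prompt = tokenizer.apply_chat_template(
    chat_message, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))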

Intended Use

While this model is designed primarily for reward modeling tasks, it also adapts to general-purpose tasks and maintains reasonable correctness and reliability across a variety of applications.

Limitations

  • The model’s performance may vary depending on the domain and specificity of the input.
  • It may inherit biases present in the training data.

Code and Resources

The code and additional resources for this model are available on GitHub.

Model Details

  • Base model: Qwen/Qwen2.5-14B
  • Model size: 14.8B parameters
  • Tensor type: BF16 (Safetensors)