System Prompt
What is the system prompt for the distilled model?
I'm using the QwQ system prompt; it seems to work just fine:
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.
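For reference, here is a minimal sketch of how that system prompt can be passed in via the chat template with transformers. The model id, the user turn, and the sampling values are placeholders for illustration, not a recommended setup.

```python
# Minimal sketch: pass the system prompt through the chat template.
# The model id, user message, and sampling values here are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. "
                                  "You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": "What is 17 * 23?"},  # hypothetical user turn
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```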
I posted a quick example output with system/user prompts and benchmarks for a 3090 Ti FE running a bnb-4bit quant locally at r/LocalLLaMA.
I am still experimenting, as even with temperature in the suggested 0.5~0.8 range it can get stuck repeatedly second-guessing itself in a loop.
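For anyone wanting to try the same kind of local run, a rough sketch of a bitsandbytes 4-bit load is below. These are common NF4 defaults and an assumed setup, not necessarily the exact settings behind the numbers in that post.

```python
# Rough sketch of loading the 32B distill as a bnb-4bit quant on a single 24 GB GPU.
# NF4 / double-quant / bf16 compute are common defaults, assumed here, not the poster's exact config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```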
We have tested the following system prompt with a temperature of 0.7.
You are a helpful and harmless assistant. You should think step-by-step.
Here are the evaluation results.
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
@Wanfq wow, you guys are fast, I see you just released a merge, FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview, today?!
The benchmark numbers on your merge are looking good! I was using Sky-T1-32B up until today, when DeepSeek-R1-Distill-Qwen-32B landed.
Can't wait to try out your merge after the GGUFs land! Though it's been a busy day for @bartowski already... haha! 🎉
Cheers!
@Wanfq Well, those scores are significantly lower than DeepSeek's. I wonder if they included how they set up the test environment in the paper.
Have you tried "no system prompt"? Since DeepSeek-V3 also never got an official system prompt, maybe their new models perform best without any system prompt.
For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.
Model | AIME 2024 pass@1 | AIME 2024 cons@64 |
---|---|---|
DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 |
cons@64 means majority voting over 64 model calls, and pass@1 is the success rate of a single call; there is no inference-time search or sampling beyond that.
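To make the metrics concrete, here is a small sketch of how pass@1 and cons@64 can be computed from 64 sampled answers to one question. The answer-extraction step is benchmark-specific and assumed away here; the sample values are made up.

```python
# Sketch: pass@1 and cons@64 for one query, given 64 already-extracted answers.
# Answer extraction/normalization is benchmark-specific and omitted (assumed done upstream).
from collections import Counter

def pass_at_1(sampled_answers, gold):
    # pass@1 estimated as the fraction of samples that are correct.
    return sum(a == gold for a in sampled_answers) / len(sampled_answers)

def cons_at_k(sampled_answers, gold):
    # cons@k: take the most frequent (majority-vote) answer and check it.
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return majority == gold

samples = ["104"] * 50 + ["96"] * 14   # hypothetical 64 samples for one AIME problem
print(pass_at_1(samples, "104"))        # 0.78125
print(cons_at_k(samples, "104"))        # True
```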
Have you tried "no system prompt"? Since DeepSeek-V3 also never got an official system prompt, maybe their new models perform best without any system prompt.
I tried no system prompt in my early attempts. The results are close to those with "You are a helpful and harmless assistant. You should think step-by-step."
For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.
Model | AIME 2024 pass@1 | AIME 2024 cons@64 |
---|---|---|
DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 |
cons@64 means majority voting over 64 model calls, and pass@1 is the success rate of a single call; there is no inference-time search or sampling beyond that.
We use a temperature of 0.7 and a maximum length of 32768; the evaluation code is based on https://github.com/NovaSky-AI/SkyThought to calculate pass@1.
We have tested the following system prompt with a temperature of 0.7.
You are a helpful and harmless assistant. You should think step-by-step.
Here are the evaluation results.
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
In the readme: DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.
Maybe you can try setting the system prompt to "You are a helpful assistant."; it's the same as Qwen2.5.
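In message form, that would just mean swapping the system entry; a quick illustration of the prompt variants discussed in this thread (the user turn is hypothetical):

```python
# System-prompt variants discussed in this thread; pick one, or omit the entry entirely.
qwen_default = {"role": "system", "content": "You are a helpful assistant."}
step_by_step = {"role": "system",
                "content": "You are a helpful and harmless assistant. You should think step-by-step."}

messages = [qwen_default, {"role": "user", "content": "Solve: 2x + 3 = 11"}]  # hypothetical user turn
```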
@Wanfq
We have tested the following system prompt with a temperature of 0.7.
You are a helpful and harmless assistant. You should think step-by-step.
Here are the evaluation results.
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
The evaluation code is modified from SkyThought. In our evaluation, we set the temperature to 0.7 and max_tokens to 32768. We provide an example to reproduce our results in evaluation.
The system prompt for evaluation is set to:
You are a helpful and harmless assistant. You should think step-by-step.
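For context, a hedged sketch of generation with those settings is below. SkyThought runs on vLLM, but the exact wiring here (tensor-parallel size, prompt construction, placeholder problem text) is an assumption, not the actual SkyThought/FuseAI evaluation script.

```python
# Sketch of the described settings: temperature 0.7, max_tokens 32768, step-by-step system prompt.
# The vLLM wiring below is an assumption, not the actual SkyThought/FuseAI harness.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, tensor_parallel_size=4, max_model_len=32768)  # hypothetical GPU count

system_prompt = "You are a helpful and harmless assistant. You should think step-by-step."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "<benchmark problem goes here>"},  # placeholder
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling = SamplingParams(temperature=0.7, max_tokens=32768)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```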
We are currently attempting to reproduce the results reported in the DeepSeek-R1 paper by experimenting with different system prompts. We will update our findings once we have acquired the original system prompt used in their study.
The updated evaluation results are presented here:
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | 93.71 | 57.58 | 95.90 | 68.70 | 82.17 | 59.69 |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
The paper says they used a top-p of 0.95 and a temperature of 0.6 for their benchmarks.
How have people been setting Top K, Repeat Penalty, and Min P to get the best results?
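The paper only pins down temperature and top-p, so everything else is guesswork. As one hedged starting point with a GGUF quant via llama-cpp-python, the sketch below applies the paper's two values and leaves the remaining samplers effectively neutral; the file name and those neutral values are assumptions, not tested recommendations.

```python
# Hedged starting point: only temperature=0.6 and top_p=0.95 come from the paper.
# top_k / min_p / repeat_penalty below are neutral (disabled) values, assumed, not tuned.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",  # hypothetical file name
    n_ctx=32768,
    n_gpu_layers=-1,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful and harmless assistant. You should think step-by-step."},
        {"role": "user", "content": "Prove that the sum of two even numbers is even."},
    ],
    temperature=0.6,     # from the paper
    top_p=0.95,          # from the paper
    top_k=0,             # 0 disables top-k in llama.cpp (assumption, not a recommendation)
    min_p=0.0,           # disabled (assumption)
    repeat_penalty=1.0,  # no penalty (assumption)
    max_tokens=-1,       # generate until EOS or the context limit
)
print(out["choices"][0]["message"]["content"])
```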