System Prompt
What is the system prompt for the distilled model?
I'm using the QwQ system prompt; it seems to work just fine:
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.
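For reference, here is a minimal sketch of how that system prompt can be passed in via the chat template with transformers. The model id, the user turn, and the sampling values are placeholders for illustration, not a recommended setup.

```python
# Minimal sketch: pass the system prompt through the chat template.
# The model id, user message, and sampling values here are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. "
                                  "You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": "What is 17 * 23?"},  # hypothetical user turn
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```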
I posted a quick example output with system/user prompts and benchmarks for a 3090 Ti FE running a bnb-4bit quant locally at r/LocalLLaMA.
I am still experimenting, as even with temperature in the suggested 0.5~0.8 range it can get stuck repeatedly second-guessing itself in a loop.
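For anyone wanting to try the same kind of local run, a rough sketch of a bitsandbytes 4-bit load is below. These are common NF4 defaults and an assumed setup, not necessarily the exact settings behind the numbers in that post.

```python
# Rough sketch of loading the 32B distill as a bnb-4bit quant on a single 24 GB GPU.
# NF4 / double-quant / bf16 compute are common defaults, assumed here, not the poster's exact config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```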
We have tested the following system prompt with a temperature of 0.7.
You are a helpful and harmless assistant. You should think step-by-step.
Here are the evaluation results.
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
@Wanfq wow, you guys are fast, I see you just released a merge, FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview, today?!
The benchmark numbers on your merge are looking good! I was using Sky-T1-32B up until today, when DeepSeek-R1-Distill-Qwen-32B landed.
Can't wait to try out your merge after the GGUFs land! Though it's been a busy day for @bartowski already... haha! 🎉
Cheers!
@Wanfq Well, those scores are significantly lower than DeepSeek's. I wonder if they included how they set up the test environment in the paper.
Have you tried "no system prompt"? Since DeepSeek-V3 also never got an official system prompt, maybe their new models perform best without any system prompt.
For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.
Model | AIME 2024 pass@1 | AIME 2024 cons@64 |
---|---|---|
DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 |
cons@64 means majority voting over 64 model calls, and pass@1 is the success rate of a single call; there is no inference-time search or sampling beyond that.
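To make the metrics concrete, here is a small sketch of how pass@1 and cons@64 can be computed from 64 sampled answers to one question. The answer-extraction step is benchmark-specific and assumed away here; the sample values are made up.

```python
# Sketch: pass@1 and cons@64 for one query, given 64 already-extracted answers.
# Answer extraction/normalization is benchmark-specific and omitted (assumed done upstream).
from collections import Counter

def pass_at_1(sampled_answers, gold):
    # pass@1 estimated as the fraction of samples that are correct.
    return sum(a == gold for a in sampled_answers) / len(sampled_answers)

def cons_at_k(sampled_answers, gold):
    # cons@k: take the most frequent (majority-vote) answer and check it.
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return majority == gold

samples = ["104"] * 50 + ["96"] * 14   # hypothetical 64 samples for one AIME problem
print(pass_at_1(samples, "104"))        # 0.78125
print(cons_at_k(samples, "104"))        # True
```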
Have you tried "no system prompt"? Since DeepSeek-V3 also never got an official system prompt, maybe their new models perform best without any system prompt.
I tried no system prompt in my early attempts. The results are close to those with "You are a helpful and harmless assistant. You should think step-by-step."
For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.
Model | AIME 2024 pass@1 | AIME 2024 cons@64 |
---|---|---|
DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 |
cons@64 means majority voting over 64 model calls, and pass@1 is the success rate of a single call; there is no inference-time search or sampling beyond that.
We use a temperature of 0.7 and a maximum length of 32768; the evaluation code is based on https://github.com/NovaSky-AI/SkyThought to calculate pass@1.
We have tested the following system prompt with a temperature of 0.7.
You are a helpful and harmless assistant. You should think step-by-step.
Here are the evaluation results.
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
In the readme: DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.
Maybe you can try setting the system prompt to "You are a helpful assistant."; it's the same as Qwen2.5.
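In message form, that would just mean swapping the system entry; a quick illustration of the prompt variants discussed in this thread (the user turn is hypothetical):

```python
# System-prompt variants discussed in this thread; pick one, or omit the entry entirely.
qwen_default = {"role": "system", "content": "You are a helpful assistant."}
step_by_step = {"role": "system",
                "content": "You are a helpful and harmless assistant. You should think step-by-step."}

messages = [qwen_default, {"role": "user", "content": "Solve: 2x + 3 = 11"}]  # hypothetical user turn
```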
@Wanfq
We have tested the following system prompt with a temperature of 0.7.
You are a helpful and harmless assistant. You should think step-by-step.
Here are the evaluation results.
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
The evaluation code is modified from SkyThought. In our evaluation, we set the temperature to 0.7 and max_tokens to 32768. We provide an example to reproduce our results in evaluation.
The system prompt for evaluation is set to:
You are a helpful and harmless assistant. You should think step-by-step.
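For context, a hedged sketch of generation with those settings is below. SkyThought runs on vLLM, but the exact wiring here (tensor-parallel size, prompt construction, placeholder problem text) is an assumption, not the actual SkyThought/FuseAI evaluation script.

```python
# Sketch of the described settings: temperature 0.7, max_tokens 32768, step-by-step system prompt.
# The vLLM wiring below is an assumption, not the actual SkyThought/FuseAI harness.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, tensor_parallel_size=4, max_model_len=32768)  # hypothetical GPU count

system_prompt = "You are a helpful and harmless assistant. You should think step-by-step."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "<benchmark problem goes here>"},  # placeholder
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling = SamplingParams(temperature=0.7, max_tokens=32768)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```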
We are currently attempting to reproduce the results reported in the DeepSeek-R1 paper by experimenting with different system prompts. We will update our findings once we have acquired the original system prompt used in their study.
The updated evaluation results are presented here:
Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
---|---|---|---|---|---|---|---|---|
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | 93.71 | 57.58 | 95.90 | 68.70 | 82.17 | 59.69 |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
The paper says they used a top-p of 0.95 and a temperature of 0.6 for their benchmarks.
How have people been setting Top K, Repeat Penalty, and Min P to get the best results?
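The paper only pins down temperature and top-p, so everything else is guesswork. As one hedged starting point with a GGUF quant via llama-cpp-python, the sketch below applies the paper's two values and leaves the remaining samplers effectively neutral; the file name and those neutral values are assumptions, not tested recommendations.

```python
# Hedged starting point: only temperature=0.6 and top_p=0.95 come from the paper.
# top_k / min_p / repeat_penalty below are neutral (disabled) values, assumed, not tuned.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",  # hypothetical file name
    n_ctx=32768,
    n_gpu_layers=-1,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful and harmless assistant. You should think step-by-step."},
        {"role": "user", "content": "Prove that the sum of two even numbers is even."},
    ],
    temperature=0.6,     # from the paper
    top_p=0.95,          # from the paper
    top_k=0,             # 0 disables top-k in llama.cpp (assumption, not a recommendation)
    min_p=0.0,           # disabled (assumption)
    repeat_penalty=1.0,  # no penalty (assumption)
    max_tokens=-1,       # generate until EOS or the context limit
)
print(out["choices"][0]["message"]["content"])
```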