Releasing QwQ-LongCoT-130K

Community Article Published December 5, 2024

Figure 1: Just a cute picture generated with [Flux](https://huggingface.co/Shakker-Labs/FLUX.1-dev-LoRA-Logo-Design)

Today, I’m excited to release QwQ-LongCoT-130K, an SFT dataset designed for training O1-like large language models (LLMs). The dataset includes about 130k instances, each with a response generated using QwQ-32B-Preview. It is available under the Apache 2.0 license, so feel free to use it as you like.

Link to dataset: QwQ-LongCoT-130K

Dataset Construction

The challenging part of creating QwQ-LongCoT-130K was curating seed instructions that are truly worth longer chain-of-thought reasoning. Simply put, I didn’t want to generate lengthy responses, spanning thousands of tokens, for simple prompts like, “What color is the sky?” At the same time, I wanted the instructions to be free of licensing issues. Accordingly, I collected seed instructions using the following two methods.

First, I sourced data from the NuminaMath-CoT dataset, which contains 860K math questions with their corresponding answers and is licensed under Apache 2.0. To add diversity and include categories beyond math, I used the Magpie approach to extract questions from the QwQ-32B-Preview model. A common way to run Magpie is to feed the model a blank input, sometimes with just a user token, and let it generate a user query. With QwQ-32B-Preview, however, I observed that this often leads the model to refuse, frequently replying with something like: “I’d be able to assist better if you provided more details.” This approach also gives little or no control over the instructions generated. So, in my experiments I used the following template:

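import random

# "adjective" holds the difficulty word and "subject" the field,
# so the template below reads "This is one difficult Physics question."
adjective = random.choice(["difficult", "tough", "long", "challenging", "tricky", ...])
subject = random.choice(["Physics", "Chemistry", "Biology", ...])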

This is one {adjective} {subject} question. I'll first repeat the question word-by-word before I start to solve.
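
To make the template concrete, here is a minimal sketch of how such a prefix could be assembled before it is handed to the model for completion. The function name `build_magpie_prefix` and the shortened word lists are placeholders for illustration only; the full lists used for the dataset are not given in this post, and the generation call itself is left out.

```python
import random
from typing import Tuple

# Illustrative, shortened lists (assumption): the full lists are not given in the post.
ADJECTIVES = ["difficult", "tough", "long", "challenging", "tricky"]
SUBJECTS = ["Physics", "Chemistry", "Biology"]

TEMPLATE = (
    "This is one {adjective} {subject} question. "
    "I'll first repeat the question word-by-word before I start to solve."
)


def build_magpie_prefix(rng: random.Random) -> Tuple[str, str]:
    """Return (prefix, subject); the subject is returned so it can be logged."""
    adjective = rng.choice(ADJECTIVES)
    subject = rng.choice(SUBJECTS)
    prefix = TEMPLATE.format(adjective=adjective, subject=subject)
    return prefix, subject


# Usage: the model would be asked to continue from `prefix`.
prefix, subject = build_magpie_prefix(random.Random(0))
print(subject, "|", prefix)
```

Returning the subject alongside the prefix is a small design choice that makes it easy to store the category with every generated instruction.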

After collecting the seed instructions, I used QwQ-32B-Preview to generate one response for each instruction. Once the generation was complete, I applied simple rule-based filtering to remove responses containing phrases like "Sorry" or "As an AI model." I also filtered out instances with excessive repetition of sentences and attempted to exclude those containing Chinese characters—though some may still remain. In any case, there is still room for further refinement.
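
The filtering described above could look roughly like the sketch below. It is not the exact script used for the dataset: the two refusal phrases come from the text, but the sentence splitter, the repetition threshold (`max_repeats=3`), and the Unicode range used to detect Chinese characters are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Phrases quoted in the post; a real list would likely be longer.
REFUSAL_PHRASES = ("Sorry", "As an AI model")

# CJK Unified Ideographs block (assumption): covers common Chinese characters only,
# which is one reason some Chinese text could slip through.
CHINESE_RE = re.compile(r"[\u4e00-\u9fff]")

# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def has_refusal(text: str) -> bool:
    return any(phrase in text for phrase in REFUSAL_PHRASES)


def has_excessive_repetition(text: str, max_repeats: int = 3) -> bool:
    """True if any single sentence occurs more than max_repeats times (assumed threshold)."""
    sentences = [s.strip() for s in SENTENCE_SPLIT_RE.split(text) if s.strip()]
    if not sentences:
        return False
    return max(Counter(sentences).values()) > max_repeats


def has_chinese(text: str) -> bool:
    return CHINESE_RE.search(text) is not None


def keep_response(text: str) -> bool:
    """Keep a response only if it passes all three rule-based checks."""
    return not (has_refusal(text) or has_excessive_repetition(text) or has_chinese(text))
```

Plain substring matching on "Sorry" is deliberately blunt: it will also drop valid responses that happen to contain the word, which is the usual trade-off of simple rule-based filters.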

Dataset Analysis

The dataset consists of 90k samples from NuminaMath and about 43k generated via Magpie. In my first effort with Magpie, I accidentally forgot to log the subject used to generate each instruction, but the figure below shows the distribution for the ones I did log (oops). I'm planning to add more Magpie data if I find some more computing resources.


Figure 2: Dataset distribution

Below is a comparison of the length distribution of instances in the QwQ-LongCoT dataset, using the top_300k_longer_conversations subset from Magpie-Ultra as a baseline. For the readability of the plot, I excluded some outliers exceeding 20k characters from the QwQ-LongCoT dataset, although the longest sample had over 170k characters. From the plot, it is evident that QwQ-LongCoT generally contains longer instances.
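
As a rough sketch of the preprocessing behind such a plot, the snippet below measures length in characters and drops anything above the 20k cutoff before plotting. The helper name `lengths_for_plot` and the toy inputs are hypothetical, and it assumes each instance is available as a single string; the actual plotting code is not shown in this post.

```python
from statistics import median

MAX_PLOT_CHARS = 20_000  # cutoff mentioned above, applied for plotting only


def lengths_for_plot(texts, max_chars=MAX_PLOT_CHARS):
    """Return (kept character lengths, number of outliers dropped)."""
    lengths = [len(t) for t in texts]
    kept = [n for n in lengths if n <= max_chars]
    return kept, len(lengths) - len(kept)


# Toy data standing in for real instances.
toy = ["a" * 500, "b" * 3_000, "c" * 170_000]
kept, dropped = lengths_for_plot(toy)
print(kept, dropped, median(kept))
```

The outliers are only excluded from the plot in this sketch; they remain part of the dataset itself.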


Figure 3: Length comparison

Lessons learned from training with QwQ-LongCoT-130K

I initially tried training with the dataset in a simple SFT setting, only to find that it does not work well. My guess is that the thinking traces in QwQ-LongCoT include intentionally generated wrong statements that are later corrected, and that training on this "intentionally wrong" content is bad for the model, though I don’t have any evidence for this. I probably need a different approach, such as masking out the wrong traces during SFT or using RL. The current model is still too embarrassing to share, and I'm trying more training runs, so I hope to have a decent, shareable model soon.
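
For readers unfamiliar with the masking idea, here is a minimal sketch of what it could look like at the label level. This is untested as a training recipe and is not something I have run on this dataset: the function `mask_spans` is hypothetical, and the hard part, deciding which token spans count as "wrong", is assumed to be solved elsewhere. The only fixed fact is that -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so tokens labeled -100 contribute no loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_spans(token_ids, spans_to_mask):
    """Build SFT labels from token_ids, ignoring tokens in the given [start, end) spans.

    spans_to_mask is assumed to come from some external procedure that
    marks the wrong parts of a thinking trace; none is provided here.
    """
    labels = list(token_ids)
    for start, end in spans_to_mask:
        for i in range(max(start, 0), min(end, len(labels))):
            labels[i] = IGNORE_INDEX
    return labels


# Toy example: tokens 2..4 are treated as a wrong intermediate step.
print(mask_spans([11, 12, 13, 14, 15, 16], [(2, 5)]))
```

The masked tokens still appear in the input, so the model can condition on the mistake and learn the correction without being trained to produce the mistake itself.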

ETC

Big thanks to the Qwen Team and Project-Numina.
If you're interested in exploring the dataset further or collaborating with me, please feel free to reach out at: spthsrbwls123@yonsei.ac.kr.