McGill-NLP/longcot-8k-1.5b

TL;DR

Markovian Thinking for RL in reasoning LLMs: replace the trivial MDP where state = prompt + all past thinking tokens (quadratic compute) with a bounded, fixed-size state, yielding linear compute in thinking tokens and constant memory by design.
Delethink RL trains a model to “think” in fixed-size chunks with bounded state.
This 1.5B model uses an effective thinking budget of about 24K tokens while only requiring an 8K active context at any time via chunked rollouts and short carryovers.
Initialized from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, trained with the Delethink RL paradigm. See the paper for full details.
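
To make the quadratic-versus-linear claim concrete, the sketch below counts how many context tokens are attended to over a whole rollout under a deliberately simplified cost model (each new token attends to everything currently in context). The function names, the cost model, and the chunk and carryover sizes are illustrative assumptions, not figures or code from the paper.

```python
def longcot_attention_cost(prompt_len: int, thinking_tokens: int) -> int:
    """Total tokens attended to when the context grows with every thinking token."""
    # Token i (0-based) attends to the prompt plus the i tokens generated before it.
    return sum(prompt_len + i for i in range(thinking_tokens))


def chunked_attention_cost(prompt_len: int, thinking_tokens: int,
                           chunk_size: int, carryover: int) -> int:
    """Total tokens attended to when the visible thinking is cut back to a
    short carryover every time it reaches `chunk_size` tokens."""
    total = 0
    in_context = 0  # thinking tokens currently visible to the model
    for _ in range(thinking_tokens):
        if in_context >= chunk_size:
            in_context = carryover  # keep only the tail of the previous chunk
        total += prompt_len + in_context
        in_context += 1
    return total


if __name__ == "__main__":
    n = 24 * 1024  # assumed total thinking tokens
    grow = longcot_attention_cost(0, n)
    chunk = chunked_attention_cost(0, n, chunk_size=8192, carryover=4096)
    print(f"growing context: {grow}, chunked: {chunk}, ratio: {grow / chunk:.2f}")
```

In this model the growing-context cost is n(n-1)/2 for n thinking tokens, while the chunked cost is at most n times the chunk size, so it grows linearly once the chunk size is fixed.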

Model Summary

Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Objective: Reinforcement Learning using standard LongCoT, trained for 1000 steps.
Thinking budget: 8K tokens, using the entire context.
Intended use: Math/logic reasoning with step-by-step derivations; final answer typically formatted inside LaTeX \boxed{}.
Library compatibility: Works well with SGLang for chunked inference; also usable with Transformers for standard generation.

Intended Uses and Limitations

Intended uses:
- Long-form reasoning on math and related tasks.
- Bounded-context rollouts with repeated chunking and short carryovers.
Not intended for:
- Safety-sensitive applications without human oversight.
- Use cases requiring faithful, verifiable citations to external sources.
Limitations:
- May hallucinate, make arithmetic/algebraic mistakes, or produce inconsistent plans.
- The chunked rollout procedure is needed to realize Delethink’s efficiency advantages.
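
The chunked rollout with short carryovers mentioned above can be sketched as a plain loop. This is a minimal illustration, not the authors' implementation: `chunked_rollout`, `generate_fn`, and the default sizes are hypothetical names and values chosen for the example. With an 8K context, a 4K carryover, and at most 5 chunks, the loop can emit 8K + 4 × 4K = 24K thinking tokens while the model never sees more than the query plus 8K thinking tokens at once.

```python
from typing import Callable, List, Optional


def chunked_rollout(
    query_ids: List[int],
    generate_fn: Callable[[List[int], int], List[int]],
    context_size: int = 8192,    # assumed max thinking tokens visible at once
    carryover_size: int = 4096,  # assumed tail kept between chunks
    max_chunks: int = 5,         # assumed cap on the number of chunks
    eos_id: Optional[int] = None,
) -> List[int]:
    """Generate a long trace in chunks, carrying over only a short tail."""
    if not 0 <= carryover_size < context_size:
        raise ValueError("carryover_size must be in [0, context_size)")

    trace: List[int] = []  # every thinking token produced so far
    carry: List[int] = []  # tail of the previous chunk, re-fed as context
    for _ in range(max_chunks):
        budget = context_size - len(carry)
        # generate_fn(input_ids, max_new_tokens) is a stand-in for an engine call.
        new_ids = generate_fn(query_ids + carry, budget)
        trace.extend(new_ids)
        hit_eos = eos_id is not None and bool(new_ids) and new_ids[-1] == eos_id
        if hit_eos or len(new_ids) < budget:
            break  # the model finished before exhausting the chunk
        visible = carry + new_ids
        carry = visible[-carryover_size:] if carryover_size > 0 else []
    return trace
```

In practice `generate_fn` could wrap an engine call such as the Quickstart's `llm.generate`, passing `budget` as `max_new_tokens`; that wiring is left out here so the loop stays self-contained.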

Prompting

Use the model’s chat template and request a step-by-step solution with a final boxed answer:
- “Please reason step by step, and put your final answer within \boxed{}.”
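
Because the final answer is expected inside `\boxed{}`, downstream scripts usually need to pull it out of the generated text. The helper below is one possible way to do that; `extract_boxed` is a hypothetical name, and taking the last boxed expression is an assumption about how answers should be read, not something the model card specifies.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in `text`, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)  # matching close brace found
        chars.append(ch)
    return None  # braces never balanced, e.g. the output was truncated
```

Tracking brace depth rather than using a simple regular expression keeps nested LaTeX such as `\boxed{\frac{1}{2}}` intact.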

Quickstart (SGLang, chunked Delethink rollout)

import sglang as sgl

def main():
    llm = sgl.Engine(
        model_path="McGill-NLP/longcot-8k-1.5b",
        dtype="bfloat16",
        attention_backend="flashinfer",
        mem_fraction_static=0.8,
        log_level="WARNING",
    )

    prompt = (
        r"There exist real numbers $x$ and $y$, both greater than 1, such that "
        r"$\log_x\left(y^x\right)=\log_y\left(x^{4y}\right)=10$. Find $xy$."
        "\n\nPlease reason step by step, and put your final answer within \\boxed{}."
    )
    tok = llm.tokenizer_manager.tokenizer
    query_ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=True,
        add_generation_prompt=True,
    )

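    # Sampling settings follow the suggested generation settings.
    params = {"temperature": 0.6, "top_p": 1.0, "top_k": -1, "max_new_tokens": 8192}
    # Engine.generate returns a dict; the decoded completion is under "text".
    out = llm.generate(input_ids=query_ids, sampling_params=params)
    print(out["text"])
    llm.shutdown()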

if __name__ == "__main__":
    main()

Suggested generation settings

temperature: 0.6
top_p: 1.0
top_k: -1

Safety and Use

This model can produce incorrect or misleading reasoning steps and answers. Always verify results.
Do not deploy in high-stakes domains without human oversight.

Citation

@misc{Aghajohari2025:TheMarkovianThinker,
      title={The Markovian Thinker}, 
      author={Milad Aghajohari and Kamran Chitsaz and Amirhossein Kazemnejad and Sarath Chandar and Alessandro Sordoni and Aaron Courville and Siva Reddy},
      year={2025},
      eprint={2510.06557},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.06557}, 
}

Safetensors: model size 2B params, tensor type F32
