Instructions for using ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps")
model = AutoModelForCausalLM.from_pretrained("ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps
```
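The curl command above can also be issued from Python. The sketch below uses only the standard library; the helper name `build_chat_request` is ours, not part of vLLM, but the endpoint and payload shape follow the OpenAI-compatible API that `vllm serve` exposes. It requires a running server, so the request is wrapped in a try/except.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble the JSON body expected by the /v1/chat/completions route."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request(
    "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
    "What is the capital of France?",
)

try:
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The assistant reply lives at choices[0].message.content.
    print(body["choices"][0]["message"]["content"])
except OSError as exc:
    # Reaching here just means no vLLM server is listening locally.
    print(f"server not reachable: {exc}")
```

The same payload works against the SGLang server below by changing the port to 30000.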
- SGLang
How to use ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

- Docker Model Runner
How to use ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with Docker Model Runner:
```shell
docker model run hf.co/ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps
```
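The vLLM and SGLang servers above both return OpenAI-style chat-completion JSON. A minimal sketch of extracting the assistant reply, using an illustrative response (field values here are made up for demonstration, not real model output):

```python
import json

# Trimmed-down example of the JSON an OpenAI-compatible server returns.
sample_response = json.loads("""
{
  "object": "chat.completion",
  "model": "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "The capital of France is Paris."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 14, "completion_tokens": 8, "total_tokens": 22}
}
""")

# The reply text and stop reason live under choices[0].
answer = sample_response["choices"][0]["message"]["content"]
finish = sample_response["choices"][0]["finish_reason"]
print(answer)  # The capital of France is Paris.
print(finish)  # stop
```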
Model Card for ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps
This is a saved checkpoint from fine-tuning a Qwen/Qwen3-4B-Base model with the MaxRL objective ("Maximum Likelihood Reinforcement Learning"). Our work introduces MaxRL, a framework for optimizing maximum likelihood in RL settings.
Model Details
Model Description
This is the model card of a Qwen/Qwen3-4B-Base model fine-tuned using MaxRL.
- Fine-tuned from model: Qwen/Qwen3-4B-Base
Model Sources
- Repository: Official Code Release for the paper "Maximum Likelihood Reinforcement Learning"
- Paper: Maximum Likelihood Reinforcement Learning
- Project Website: Project Website
Training Details
Training Data
We train on the POLARIS-53K dataset to produce this checkpoint.
Training Procedure
Please use the provided training script, or the published codebase more generally, to reproduce this checkpoint. Hyperparameters and other details are given in the training script.
Due to computational constraints, we trained for 1000 steps and released the final checkpoint.
Hardware
This model has been finetuned using 32 NVIDIA H200 GPUs (4 nodes of 8xH200 GPUs).
Citation
BibTeX:
@misc{tajwar2026maximumlikelihoodreinforcementlearning,
title={Maximum Likelihood Reinforcement Learning},
author={Fahim Tajwar and Guanning Zeng and Yueer Zhou and Yuda Song and Daman Arora and Yiding Jiang and Jeff Schneider and Ruslan Salakhutdinov and Haiwen Feng and Andrea Zanette},
year={2026},
eprint={2602.02710},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.02710},
}