Instructions for using ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps")
model = AutoModelForCausalLM.from_pretrained("ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps
```
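The curl command above can also be issued from Python. The sketch below uses only the standard library; the helper name `build_chat_request` is ours, not part of vLLM, but the endpoint and payload shape follow the OpenAI-compatible API that `vllm serve` exposes. It requires a running server, so the request is wrapped in a try/except.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Assemble the JSON body expected by the /v1/chat/completions route."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request(
    "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
    "What is the capital of France?",
)

try:
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The assistant reply lives at choices[0].message.content.
    print(body["choices"][0]["message"]["content"])
except OSError as exc:
    # Reaching here just means no vLLM server is listening locally.
    print(f"server not reachable: {exc}")
```

The same payload works against the SGLang server below by changing the port to 30000.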
- SGLang
How to use ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

- Docker Model Runner
How to use ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps with Docker Model Runner:
```shell
docker model run hf.co/ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps
```
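The vLLM and SGLang servers above both return OpenAI-style chat-completion JSON. A minimal sketch of extracting the assistant reply, using an illustrative response (field values here are made up for demonstration, not real model output):

```python
import json

# Trimmed-down example of the JSON an OpenAI-compatible server returns.
sample_response = json.loads("""
{
  "object": "chat.completion",
  "model": "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "The capital of France is Paris."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 14, "completion_tokens": 8, "total_tokens": 22}
}
""")

# The reply text and stop reason live under choices[0].
answer = sample_response["choices"][0]["message"]["content"]
finish = sample_response["choices"][0]["finish_reason"]
print(answer)  # The capital of France is Paris.
print(finish)  # stop
```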
Model Card for ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps
This is a saved checkpoint from fine-tuning a Qwen/Qwen3-4B-Base model with the MaxRL objective ("Maximum Likelihood Reinforcement Learning"). Our work introduces MaxRL, a framework for optimizing maximum likelihood in RL settings.
Model Details
Model Description
This is the model card of a Qwen/Qwen3-4B-Base model fine-tuned using MaxRL.
- Fine-tuned from model: Qwen/Qwen3-4B-Base
Model Sources
- Repository: Official Code Release for the paper "Maximum Likelihood Reinforcement Learning"
- Paper: Maximum Likelihood Reinforcement Learning
- Project Website: Project Website
Training Details
Training Data
We train on the POLARIS-53K dataset to produce this checkpoint.
Training Procedure
Please use the provided training script, or the published codebase more generally, to reproduce this checkpoint. Hyperparameters and other details are given in the training script.
Due to computational constraints, we trained for 1000 steps and released the final checkpoint.
Hardware
This model has been finetuned using 32 NVIDIA H200 GPUs (4 nodes of 8xH200 GPUs).
Citation
BibTeX:
@misc{tajwar2026maximumlikelihoodreinforcementlearning,
title={Maximum Likelihood Reinforcement Learning},
author={Fahim Tajwar and Guanning Zeng and Yueer Zhou and Yuda Song and Daman Arora and Yiding Jiang and Jeff Schneider and Ruslan Salakhutdinov and Haiwen Feng and Andrea Zanette},
year={2026},
eprint={2602.02710},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.02710},
}