[Resolved] Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25
Hi Nemo team, thanks for this incredible model and the fully open-sourced data and training recipe. I've been trying to reproduce your evals using nemo-evaluator-launcher, but I'm getting numbers far below those reported:
| Benchmark | reproduced results | reported in Cascade 2 docs |
|---|---|---|
| AIME 2025 with tools (avg@8) | 88.3 | 98.6 |
| AIME 2026 with tools (avg@8) | 90.4 | 95.0 |
| HMMT Feb 2025 with tools (avg@8) | 81.3 | 94.6 |
Software/Hardware:
- GPU: 2xRTX Pro 6000 Blackwell
- Inference engine: SGLang v0.5.9 (latest)
- Evals library: Nemo Evaluator Launcher 0.2.4
Here is my config:
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: nel-results/cascade2_fp8
  mounts:
    evaluation:
      ./hf_cache: /root/.cache/huggingface
target:
  api_endpoint:
    model_id: nvidia/Nemotron-Cascade-2-30B-A3B
    url: http://<my-sglang-endpoint>/v1/chat/completions
    api_key_name: VAST_API_KEY
evaluation:
  env_vars:
    HF_TOKEN: host:HF_TOKEN
    HF_HOME: host:HF_HOME
  nemo_evaluator_config:
    config:
      params:
        parallelism: 16
        max_new_tokens: 131072
        temperature: 1.0
        top_p: 0.95
        request_timeout: 6000
        max_retries: 10
        extra:
          tokenizer_backend: huggingface
          tokenizer: nvidia/Nemotron-Cascade-2-30B-A3B
    target:
      api_endpoint:
        adapter_config:
          params_to_add: {"chat_template_kwargs": {"enable_thinking": true}, "skip_special_tokens": false}
          use_caching: true
          tracking_requests_stats: true
          log_failed_requests: true
          use_request_logging: true
          max_logged_requests: 10
          use_response_logging: true
          max_logged_responses: 10
  tasks:
    - name: nemo_skills.ns_aime2025
      nemo_evaluator_config:
        config:
          params:
            extra:
              use_sandbox: true
              num_repeats: 8
              args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
    - name: nemo_skills.ns_aime2026
      nemo_evaluator_config:
        config:
          params:
            extra:
              use_sandbox: true
              num_repeats: 8
              args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
    - name: nemo_skills.ns_hmmt_feb2025
      nemo_evaluator_config:
        config:
          params:
            extra:
              use_sandbox: true
              num_repeats: 8
              args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
I execute it like this:
VAST_API_KEY=<token> HF_TOKEN=<token> HF_HOME="~/.cache/huggingface" nemo-evaluator-launcher run --config eval_cfgs/eval_cascade2_bf16.yaml
The SGLang server is launched with the following params:
python -m sglang.launch_server \
--model nvidia/Nemotron-Cascade-2-30B-A3B \
--trust-remote-code \
--tool-call-parser qwen3_coder \
--reasoning-parser nano_v3
so good!
Hi @chankhavu ,
Thanks for your effort!
Here is my Nemo-Skills (https://github.com/NVIDIA-NeMo/Skills) Python script to reproduce the AIME25 number:
from nemo_skills.pipeline.cli import eval, wrap_arguments

cluster = "slurm"

eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=131072 "
        "++inference.temperature=1.0 "
        "++inference.top_p=0.95 "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    expname="debug",
    model="nvidia/Nemotron-Cascade-2-30B-A3B",
    server_type="vllm",
    server_container="vllm/vllm-openai:v0.14.1",
    server_gpus=1,
    num_chunks=1,
    with_sandbox=True,
    benchmarks="aime25:8",
    server_args="--mamba_ssm_cache_dtype float32 --no-enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    output_dir="<OUTPUT_DIR>",
)
# Results
aime25
| evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
|---|---|---|---|---|---|
| pass@1[avg-of-8] | 30 | 12582 | 827 | 98.75% ± 1.73% | 0.00% |
| majority@8 | 30 | 12582 | 827 | 100.00% | 0.00% |
| pass@8 | 30 | 12582 | 827 | 100.00% | 0.00% |
- Are you able to reproduce the number without tool use? This helps rule out a tool-use issue.
- Can you try a vLLM server? This helps rule out a server issue.
Thanks.
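For the no-tool ablation suggested above, a single request against the OpenAI-compatible endpoint can serve as a quick sanity check before rerunning a full benchmark. The sketch below reuses the sampling settings and the `chat_template_kwargs` field from the config earlier in the thread. The helper names `build_payload` and `ask`, the `max_tokens` value, and the example URL in the comments are illustrative assumptions, and it does not replace the evaluation harness.

```python
import json
import urllib.request

MODEL_ID = "nvidia/Nemotron-Cascade-2-30B-A3B"


def build_payload(question, enable_thinking=True, max_tokens=131072):
    """Build a chat-completions body that deliberately has no `tools` field."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": question}],
        # Sampling settings copied from the config posted in this thread.
        "temperature": 1.0,
        "top_p": 0.95,
        # Assumption: the server's context window is large enough for this budget.
        "max_tokens": max_tokens,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def ask(url, payload, api_key=None, timeout=6000):
    """POST the payload to the endpoint and return the decoded JSON response."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (hypothetical local endpoint, needs a running server):
# reply = ask("http://localhost:8000/v1/chat/completions",
#             build_payload("What is 17 * 23?"))
# print(reply["choices"][0]["message"]["content"])
```

Sending the same question to both servers with identical payloads keeps the inference engine as the only variable.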
Thanks for your quick response, @ychenNLP ! Indeed, switching to vLLM with your exact parameters works. Here are my results on AIME'25, using nemo-evaluator-launcher with the same YAML config as in my post above:
| evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
|---|---|---|---|---|---|
| pass@1[avg-of-8] | 30 | 11494 | 3330 | 99.17% ± 1.54% | 0.00% |
| majority@8 | 30 | 11494 | 3330 | 100.00% | 0.00% |
| pass@8 | 30 | 11494 | 3330 | 100.00% | 0.00% |
Differences with SGLang / default command from Nemotron-3-Nano vLLM cookbook:
- Added --mamba_ssm_cache_dtype float32: this might be the main reason; I will ablate this parameter when I have time later this evening.
- Removed --reasoning-parser nemotron_v3: I don't think this has anything to do with the perf increase.
My full vLLM command:
vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
--max-model-len 262144 \
--trust-remote-code \
--mamba_ssm_cache_dtype float32 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Great to hear!
For vLLM, --mamba_ssm_cache_dtype float32 is a crucial config for this model.
For SGLang, --mamba-ssm-dtype float32 might be important.
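As rough intuition for why the precision of a stored recurrent state can matter over long generations, here is a toy sketch. It is not the model's kernel and does not claim to reproduce the gap seen in this thread: it iterates a scalar recurrence `h <- decay * h + x` and rounds the stored state to float32 or to an emulated bfloat16 at every step. The recurrence, its constants, and the helper names are all illustrative assumptions.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bfloat16(x):
    """Emulate bfloat16 by rounding float32 bits to the top 16 bits.

    Uses round-to-nearest-even; only intended for finite values,
    which is all this toy produces.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run_recurrence(store, steps=5000, decay=0.999, x=1.0):
    """Iterate h <- store(decay * h + x), rounding the stored state each step."""
    h = 0.0
    for _ in range(steps):
        h = store(decay * h + x)
    return h


reference = run_recurrence(lambda v: v)  # state kept in float64
state_fp32 = run_recurrence(to_float32)
state_bf16 = run_recurrence(to_bfloat16)

print(f"float64 reference: {reference:.3f}")
print(f"float32 state:     {state_fp32:.3f} (abs error {abs(state_fp32 - reference):.3f})")
print(f"bfloat16 state:    {state_bf16:.3f} (abs error {abs(state_bf16 - reference):.3f})")
```

In this toy the bfloat16 state stops growing once the per-step update falls below half the spacing between representable values, while the float32 state tracks the reference closely. A real state-space layer has input-dependent dynamics, so treat this only as an illustration of the failure mode.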
Thanks a lot, @ychenNLP !
I was able to confirm that the selective quantization recipe of Nano 30b (from the Nemotron 3 Nano Technical Report) works perfectly for Cascade 2 as well:
| Benchmark | BF16 (reproduced) | FP8 | NVFP4 |
|---|---|---|---|
| AIME 2025 (avg@8) | 98.8 | 96.7 | 97.9 |
| AIME 2026 (avg@8) | 94.2 | 95.0 | 92.1 |
| HMMT Feb 2025 (avg@8) | 92.9 | 93.8 | 90.1 |
With 8 rollouts per problem, a deviation of about ±2 points across runs is expected. Within that noise, FP8 is equivalent to BF16, while NVFP4 is about 1-3 points below BF16 on all three benchmarks.
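The size of that run-to-run noise can be sanity-checked with a back-of-the-envelope binomial estimate. The sketch below assumes 30 problems (the `num_entries` value in the AIME'25 tables above), 8 samples each, and an illustrative accuracy of 0.95. It treats all samples as independent, which is a simplification, and the function name is hypothetical.

```python
import math


def avg_at_k_standard_error(accuracy, num_problems, k):
    """Binomial standard error of avg@k, treating all samples as independent."""
    n = num_problems * k
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Illustrative inputs: 30 problems, 8 rollouts each, accuracy near 95%.
se = avg_at_k_standard_error(accuracy=0.95, num_problems=30, k=8)
# The difference between two independent runs has sqrt(2) times the error.
diff_se = math.sqrt(2.0) * se

print(f"standard error of one avg@8 score: {100 * se:.2f} points")
print(f"standard error of a difference between two runs: {100 * diff_se:.2f} points")
```

Under these assumptions a single avg@8 score carries a standard error of roughly 1.4 points and the difference between two runs roughly 2 points, so gaps of that size between columns should not be over-interpreted.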
@chankhavu Thanks a lot for validating this and for sharing the follow-up results. It looks like the problem is resolved.
When you have a moment, could you please update the issue title accordingly and close it? Really appreciate it.
@chankhavu In case you want to reproduce the no tool use setting for IMO-AnswerBench:
https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B/discussions/24