[Resolved] Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25

by chankhavu - opened Mar 23

Discussion

chankhavu, Mar 23 (edited Mar 23)

Hi Nemo team, thanks for this incredible model and for fully open-sourcing the data and training recipe. I've been trying to reproduce your evals using nemo-evaluator-launcher, but I'm getting numbers far below those reported:

Benchmark                        | Reproduced results | Reported in Cascade 2 docs
AIME 2025 with tools (avg@8)     | 88.3               | 98.6
AIME 2026 with tools (avg@8)     | 90.4               | 95.0
HMMT Feb 2025 with tools (avg@8) | 81.3               | 94.6

Software/Hardware:

GPU: 2xRTX Pro 6000 Blackwell
Inference engine: SGLang v0.5.9 (latest)
Evals library: Nemo Evaluator Launcher 0.2.4

Here is my config:

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: nel-results/cascade2_fp8
  mounts:
    evaluation:
      ./hf_cache: /root/.cache/huggingface
target:
  api_endpoint:
    model_id: nvidia/Nemotron-Cascade-2-30B-A3B
    url: http://<my-sglang-endpoint>/v1/chat/completions
    api_key_name: VAST_API_KEY

evaluation:
  env_vars:
    HF_TOKEN: host:HF_TOKEN
    HF_HOME: host:HF_HOME
  nemo_evaluator_config:
    config:
      params:
        parallelism: 16
        max_new_tokens: 131072
        temperature: 1.0
        top_p: 0.95
        request_timeout: 6000
        max_retries: 10
        extra:
          tokenizer_backend: huggingface
          tokenizer: nvidia/Nemotron-Cascade-2-30B-A3B
    target:
      api_endpoint:
        adapter_config:
          params_to_add: {"chat_template_kwargs": {"enable_thinking": true}, "skip_special_tokens": false}
          use_caching: true
          tracking_requests_stats: true
          log_failed_requests: true
          use_request_logging: true
          max_logged_requests: 10
          use_response_logging: true
          max_logged_responses: 10

  tasks:
  - name: nemo_skills.ns_aime2025
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
  - name: nemo_skills.ns_aime2026
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
  - name: nemo_skills.ns_hmmt_feb2025
    nemo_evaluator_config:
      config:
        params:
          extra:
            use_sandbox: true
            num_repeats: 8
            args: "++inference.tokens_to_generate=null ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]"
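
To make the `params_to_add` setting above concrete, here is a minimal Python sketch of the request body the server might receive. It assumes the adapter merges those keys into the top level of every chat-completion request; that merge behaviour and the helper name `build_request` are illustrative assumptions, not taken from the nemo-evaluator source. The sampling values mirror the config.

```python
import json

# Extra keys copied from `params_to_add` in the config above.
PARAMS_TO_ADD = {
    "chat_template_kwargs": {"enable_thinking": True},
    "skip_special_tokens": False,
}


def build_request(prompt, extra=PARAMS_TO_ADD):
    """Hypothetical helper: build an OpenAI-style chat request and merge extra keys."""
    body = {
        "model": "nvidia/Nemotron-Cascade-2-30B-A3B",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }
    # Assumption: the extra keys land at the top level of the JSON body.
    body.update(extra)
    return body


if __name__ == "__main__":
    print(json.dumps(build_request("What is 2 + 2?"), indent=2))
```

Printing the body like this is a quick way to compare what the evaluator sends against what a hand-written curl request sends when the two give different scores.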

I execute it like this:

VAST_API_KEY=<token> HF_TOKEN=<token> HF_HOME="$HOME/.cache/huggingface" nemo-evaluator-launcher run --config eval_cfgs/eval_cascade2_bf16.yaml

The SGLang server is launched with the following params:

python -m sglang.launch_server \
    --model nvidia/Nemotron-Cascade-2-30B-A3B \
    --trust-remote-code \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nano_v3

iKetamine, Mar 23

so good!

ychenNLP (NVIDIA org), Mar 23 (edited Mar 23)

Hi @chankhavu,

Thanks for your effort!
Here is my Nemo-Skills (https://github.com/NVIDIA-NeMo/Skills) Python script to reproduce the AIME25 number:

from nemo_skills.pipeline.cli import eval, wrap_arguments

cluster = "slurm"

eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=131072 "
        "++inference.temperature=1.0 "
        "++inference.top_p=0.95 "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    expname="debug",
    model="nvidia/Nemotron-Cascade-2-30B-A3B",
    server_type='vllm',
    server_container='vllm/vllm-openai:v0.14.1',
    server_gpus=1,
    num_chunks=1,
    with_sandbox=True,
    benchmarks="aime25:8",  # benchmark name, then number of samples per problem
    server_args="--mamba_ssm_cache_dtype float32 --no-enable-prefix-caching --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    output_dir="<OUTPUT_DIR>"
)


# Results
---------------------------------------- aime25 ----------------------------------------
evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-8] | 30          | 12582      | 827         | 98.75% ± 1.73%   | 0.00%    
majority@8       | 30          | 12582      | 827         | 100.00%          | 0.00%    
pass@8           | 30          | 12582      | 827         | 100.00%          | 0.00%
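
For readers unfamiliar with the rows in this table, the following Python sketch shows one way the pass@1[avg-of-k] and pass@k figures can be derived from a per-problem, per-run correctness matrix. It is an illustration under stated assumptions, not the Nemo-Skills implementation: the function name `summarize` is hypothetical, and treating the "±" figure as the sample standard deviation across runs is an assumption. The majority@k row is omitted because it needs the raw answers rather than correctness flags.

```python
from statistics import mean, stdev


def summarize(correct):
    """correct[p][r] is True if run r solved problem p."""
    num_runs = len(correct[0])
    # Accuracy of each run over all problems, in percent.
    per_run = [
        100.0 * mean(1.0 if row[r] else 0.0 for row in correct)
        for r in range(num_runs)
    ]
    return {
        "pass@1[avg-of-k]": mean(per_run),
        # Assumption: the "±" figure is the sample std dev across runs.
        "std_across_runs": stdev(per_run),
        # pass@k: a problem counts if any of its k runs is correct.
        "pass@k": 100.0 * mean(1.0 if any(row) else 0.0 for row in correct),
    }


if __name__ == "__main__":
    # Toy data: 4 problems, 2 runs each (not real benchmark results).
    toy = [[True, True], [True, False], [True, True], [False, False]]
    print(summarize(toy))
```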

Are you able to reproduce the number without tool use? That would help isolate whether tool use is the issue.
Can you try a vLLM server? That would help isolate whether the server is the issue.

Thanks.

chankhavu, Mar 24

Thanks for your quick response, @ychenNLP! Indeed, switching to vLLM with your exact parameters works. Here are my results on AIME'25, using nemo-evaluator-launcher with the same YAML config as in my post above:

evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-8] | 30          | 11494      | 3330        | 99.17% ± 1.54%   | 0.00%
majority@8       | 30          | 11494      | 3330        | 100.00%          | 0.00%
pass@8           | 30          | 11494      | 3330        | 100.00%          | 0.00%

Differences from my SGLang setup and from the default command in the Nemotron-3-Nano vLLM cookbook:

Added --mamba_ssm_cache_dtype float32. This might be the main reason; I will ablate this parameter when I have time later this evening.
Removed --reasoning-parser nemotron_v3. I don't think this has anything to do with the improvement in scores.

My full vLLM command:

vllm serve nvidia/Nemotron-Cascade-2-30B-A3B \
  --max-model-len 262144 \
  --trust-remote-code \
  --mamba_ssm_cache_dtype float32 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

ychenNLP (NVIDIA org), Mar 24 (edited Mar 24)

Great to hear!
For vLLM, --mamba_ssm_cache_dtype float32 is a crucial setting for this model.
For SGLang, --mamba-ssm-dtype float32 might be important.
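
As a rough sketch of how the two suggestions above could be applied, the Python snippet below assembles launch commands for both engines with the float32 SSM cache flag appended. The helper `launch_command` is hypothetical, the flag spellings are copied from this thread, and the SGLang variant was not confirmed here, so check the flags against your installed version before relying on them.

```python
import shlex

MODEL = "nvidia/Nemotron-Cascade-2-30B-A3B"

# Flag spellings as quoted in this thread; verify against your installed version.
SSM_FP32_FLAGS = {
    "vllm": ["--mamba_ssm_cache_dtype", "float32"],
    "sglang": ["--mamba-ssm-dtype", "float32"],
}


def launch_command(engine):
    """Hypothetical helper: return the argv list for the chosen engine."""
    if engine == "vllm":
        base = ["vllm", "serve", MODEL, "--trust-remote-code",
                "--enable-auto-tool-choice", "--tool-call-parser", "qwen3_coder"]
    elif engine == "sglang":
        base = ["python", "-m", "sglang.launch_server", "--model", MODEL,
                "--trust-remote-code", "--tool-call-parser", "qwen3_coder",
                "--reasoning-parser", "nano_v3"]
    else:
        raise ValueError(f"unknown engine: {engine}")
    # Append the float32 SSM cache setting for the selected engine.
    return base + SSM_FP32_FLAGS[engine]


if __name__ == "__main__":
    for engine in ("vllm", "sglang"):
        print(shlex.join(launch_command(engine)))
```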

chankhavu, Mar 24 (edited Mar 24)

Thanks a lot, @ychenNLP !

I was able to confirm that the selective quantization recipe of Nano 30b (from the Nemotron 3 Nano Technical Report) works perfectly for Cascade 2 as well:

Benchmark             | BF16 (reproduced) | FP8  | NVFP4
AIME 2025 (avg@8)     | 98.8              | 96.7 | 97.9
AIME 2026 (avg@8)     | 94.2              | 95.0 | 92.1
HMMT Feb 2025 (avg@8) | 92.9              | 93.8 | 90.1

With 8 rollouts per problem, a deviation of about ±2% across runs is expected. FP8 is therefore roughly equivalent to BF16 (differences of -2.1, +0.8, and +0.9 points), while NVFP4 is consistently below BF16, by 0.9 to 2.8 points.
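
The gaps in the table above can be checked with a few lines of Python. This is only arithmetic on the posted scores; the function name `deltas_vs_bf16` is hypothetical.

```python
# avg@8 scores copied from the table above: (BF16, FP8, NVFP4).
SCORES = {
    "AIME 2025": (98.8, 96.7, 97.9),
    "AIME 2026": (94.2, 95.0, 92.1),
    "HMMT Feb 2025": (92.9, 93.8, 90.1),
}


def deltas_vs_bf16(scores=SCORES):
    """Return {benchmark: (FP8 - BF16, NVFP4 - BF16)} rounded to one decimal."""
    return {
        name: (round(fp8 - bf16, 1), round(nvfp4 - bf16, 1))
        for name, (bf16, fp8, nvfp4) in scores.items()
    }


if __name__ == "__main__":
    for name, (d_fp8, d_nvfp4) in deltas_vs_bf16().items():
        print(f"{name}: FP8 {d_fp8:+.1f}, NVFP4 {d_nvfp4:+.1f}")
```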

ychenNLP (NVIDIA org), Mar 24 (edited Mar 24)

@chankhavu Thanks a lot for validating this and for sharing the follow-up results. It looks like the problem is resolved.
When you have a moment, could you please update the issue title accordingly and close it? Really appreciate it.

chankhavu changed discussion title from "Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25" to "[Resolved] Unable to reproduce evals on AIME'25, AIME'26, HMMT Feb25" (Mar 24)

chankhavu changed discussion status to closed Mar 24

ychenNLP (NVIDIA org), Apr 3 (edited Apr 3)

@chankhavu In case you want to reproduce the no-tool-use setting for IMO-AnswerBench:
https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B/discussions/24
