Instructions for using tool-genesis/Tool-Genesis-Qwen3-8B-SFT with libraries and local apps.

Libraries

How to use tool-genesis/Tool-Genesis-Qwen3-8B-SFT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tool-genesis/Tool-Genesis-Qwen3-8B-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tool-genesis/Tool-Genesis-Qwen3-8B-SFT")
model = AutoModelForCausalLM.from_pretrained("tool-genesis/Tool-Genesis-Qwen3-8B-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use tool-genesis/Tool-Genesis-Qwen3-8B-SFT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tool-genesis/Tool-Genesis-Qwen3-8B-SFT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tool-genesis/Tool-Genesis-Qwen3-8B-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
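
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server started with `vllm serve` is listening on `localhost:8000`. `build_chat_request` and `chat` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "tool-genesis/Tool-Genesis-Qwen3-8B-SFT"


def build_chat_request(
    prompt: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send the prompt to a running server and return the assistant's reply."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the reply text. Passing `base_url="http://localhost:30000"` should work for a server on a different port.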

Use Docker

docker model run hf.co/tool-genesis/Tool-Genesis-Qwen3-8B-SFT

SGLang

How to use tool-genesis/Tool-Genesis-Qwen3-8B-SFT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tool-genesis/Tool-Genesis-Qwen3-8B-SFT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tool-genesis/Tool-Genesis-Qwen3-8B-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tool-genesis/Tool-Genesis-Qwen3-8B-SFT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tool-genesis/Tool-Genesis-Qwen3-8B-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use tool-genesis/Tool-Genesis-Qwen3-8B-SFT with Docker Model Runner:
```
docker model run hf.co/tool-genesis/Tool-Genesis-Qwen3-8B-SFT
```

Tool-Genesis-Qwen3-8B-SFT

A fine-tuned Qwen3-8B model for autonomous MCP (Model Context Protocol) tool server generation. Given a natural language scenario description, the model generates a complete, runnable MCP server with tool schemas and implementation code.

Model Details

Property	Value
Base model	Qwen/Qwen3-8B
Architecture	Qwen3ForCausalLM
Parameters	8B
Hidden size	4096
Layers	36
Attention heads	32
Context length	131,072 tokens
Training method	Full-parameter SFT
Training epochs	3
Training steps	117
Training loss	0.522
Training data	~2,500 samples

Training

The model was fine-tuned on curated MCP server generation examples from the Tool-Genesis benchmark. Each training sample consists of:

Input: A natural language scenario description specifying what the MCP server should do
Output: A complete Python MCP server implementation using the FastMCP framework
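
For orientation, one such input/output pair might look like the record built below. The chat-message layout, the `make_sample` helper, and the tiny unit-converter server are illustrative assumptions; this card does not publish the actual dataset schema. The server text uses the `FastMCP` class from the MCP Python SDK and is stored as a string, so the snippet runs without that package installed.

```python
import ast
import json

# Hypothetical target output: a minimal FastMCP server, kept as plain text.
SERVER_SOURCE = '''\
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("unit-converter")


@mcp.tool()
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


if __name__ == "__main__":
    mcp.run()
'''


def make_sample(scenario: str, server_source: str) -> dict:
    """Pair a scenario description with its target server code (assumed layout)."""
    ast.parse(server_source)  # raises SyntaxError if the target is not valid Python
    return {
        "messages": [
            {"role": "user", "content": scenario},
            {"role": "assistant", "content": server_source},
        ]
    }


sample = make_sample(
    "A unit conversion service that converts temperatures between scales.",
    SERVER_SOURCE,
)
print(json.dumps(sample, indent=2)[:300])
```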

Training Configuration

Epochs: 3
Total steps: 117 (39 steps/epoch)
Final training loss: 0.522
Training runtime: ~4.6 hours
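
These figures can be cross-checked with a little arithmetic. The sketch below derives the steps per epoch, a rough effective batch size, and the average time per step. The batch-size estimate is an inference, not a reported setting: it assumes each of the roughly 2,500 samples is seen once per epoch with no packing or dropped batches.

```python
total_steps = 117
epochs = 3
approx_samples = 2500   # "~2,500 samples" from the model details
runtime_hours = 4.6     # "~4.6 hours" from the training configuration

steps_per_epoch = total_steps // epochs

# Assumption: one pass over all samples per epoch, no packing or dropping.
implied_batch = approx_samples / steps_per_epoch

seconds_per_step = runtime_hours * 3600 / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"implied effective batch size: about {implied_batch:.0f}")
print(f"average time per step: about {seconds_per_step:.0f} s")
```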

Loss Curve

Step	Loss
1	0.763
10	0.690
20	0.641
39 (epoch 1)	0.539
60	0.434
78 (epoch 2)	0.436
100	0.420
117 (epoch 3)	0.522

Benchmark Results

Evaluated on the Tool-Genesis Benchmark (86 MCP servers, 4-level evaluation).

Direct Generation (single-call, no agent loop)

Model	L1 Compliance	L1 Launch	L2 Schema F1	L2 UT Soft
Qwen3-8B (base)	0.686	0.012	0.011	0.001
Qwen3-8B-SFT (ours)	0.826	0.047	0.046	0.017
Qwen3-235B	0.874	0.333	0.316	0.142
GPT-4.1	0.881	0.738	0.691	0.267
GPT-5.1	0.855	0.759	0.713	0.291

SFT gains over base Qwen3-8B (Direct), as absolute differences in percentage points:

L1 Compliance: +14.0 points (0.686 → 0.826)
L1 Launch: +3.5 points (0.012 → 0.047)
L2 Schema F1: +3.5 points (0.011 → 0.046)
L2 UT Soft: +1.6 points (0.001 → 0.017)
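
The gains listed above are absolute differences between scores, not relative improvements. The sketch below recomputes them from the Direct Generation table and also reports each SFT score as a multiple of the base score; the variable names are illustrative.

```python
base = {"L1 Compliance": 0.686, "L1 Launch": 0.012, "L2 Schema F1": 0.011, "L2 UT Soft": 0.001}
sft = {"L1 Compliance": 0.826, "L1 Launch": 0.047, "L2 Schema F1": 0.046, "L2 UT Soft": 0.017}

gains = {}
for metric in base:
    abs_points = (sft[metric] - base[metric]) * 100  # absolute gain, percentage points
    ratio = sft[metric] / base[metric]               # SFT score as a multiple of base
    gains[metric] = (round(abs_points, 1), round(ratio, 1))
    print(f"{metric}: +{abs_points:.1f} points, {ratio:.1f}x the base score")
```

The multiples for L1 Launch, L2 Schema F1, and L2 UT Soft look large mainly because the base scores are close to zero, so the absolute differences are the safer figures to compare.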

With Coder-Agent (multi-turn with sandbox)

Model	L1 Compliance	L1 Launch	L2 Schema F1	L2 UT Soft
Qwen3-8B (base, coder-agent)	0.776	0.694	0.653	0.246
Qwen3-235B (coder-agent)	0.868	0.971	0.914	0.459
GPT-4.1 (coder-agent)	0.884	0.756	0.691	0.288
GPT-5.1 (coder-agent)	0.906	0.941	0.877	0.426

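Note: The coder-agent strategy wraps generation in an iterative, sandbox-based coding loop. It raises L1 Launch and both L2 metrics sharply for base Qwen3-8B (L1 Launch 0.012 → 0.694), Qwen3-235B, and GPT-5.1, while GPT-4.1 changes only slightly (L1 Launch 0.738 → 0.756). The SFT model has not yet been evaluated with the coder-agent strategy.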
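
The benchmark's coder-agent implementation is not described in this card, so the loop below is only a generic sketch of the idea: generate code, check it in a separate process, and feed any errors back into the next attempt. `check_in_sandbox`, `coder_agent_loop`, and the `generate` callback are hypothetical names. The check here merely byte-compiles the candidate in a child process; a real harness would launch the generated server in an isolated environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def check_in_sandbox(source: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """Byte-compile candidate code in a child process; return (ok, error text).

    Stand-in only: a real sandbox would isolate the process and actually
    launch the generated server.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "server.py"
        path.write_text(source, encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, "-m", "py_compile", str(path)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return proc.returncode == 0, proc.stderr or proc.stdout


def coder_agent_loop(
    generate: Callable[[str, Optional[str]], str],
    scenario: str,
    max_turns: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for code, check it, and feed errors back until it passes or turns run out."""
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        source = generate(scenario, feedback)
        ok, errors = check_in_sandbox(source)
        if ok:
            return source, turn
        feedback = errors or "check failed with no error output"
    return None, max_turns
```

Here `generate(scenario, feedback)` would wrap a model call that includes the previous error text in the prompt; on the first turn `feedback` is `None`.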

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tool-genesis/Tool-Genesis-Qwen3-8B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = """You are a developer building MCP tool servers in Python.
Build a complete MCP server for the following scenario:

A weather information service that provides current weather data, 
forecasts, and weather alerts for any location worldwide.

Output only the Python source code using the FastMCP framework."""

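inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True makes the temperature setting take effect
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.2)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))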
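
The decoded text may need light cleanup before it can be saved as a server file. The helper below is a sketch under two assumptions not confirmed by this card: that the model may emit a Qwen3-style `<think>...</think>` reasoning block, and that it may wrap the code in a Markdown fence. `extract_python_source` is an illustrative name.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the example readable
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
CODE_FENCE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_source(completion: str) -> str:
    """Return the Python source from a completion, dropping reasoning and fences."""
    text = THINK_BLOCK.sub("", completion)
    match = CODE_FENCE.search(text)
    # Fall back to the whole text when the model emitted bare code.
    return (match.group(1) if match else text).strip() + "\n"
```

The result can then be written to a file such as `server.py` and run or tested separately.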

Evaluation Protocol

The Tool-Genesis benchmark evaluates generated MCP servers across four levels:

Level	What it tests
L1: Protocol Compliance	JSON format validity and server launch success
L2: Semantic Correctness	Tool schema matching (F1) and unit test pass rate
L3: Capability Boundary	No unauthorized capabilities or dangerous extra tools
L4: Task Utility	Downstream task completion using generated tools
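
To make the L2 schema-matching metric concrete, the sketch below computes precision, recall, and F1 over tool names only. This is a simplification chosen for illustration: the benchmark's actual matching rules (for example, how parameters and types are compared) are not given in this card, and `tool_name_f1` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def tool_name_f1(
    predicted: Iterable[str], reference: Iterable[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact tool-name matches."""
    pred, ref = set(predicted), set(reference)
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


predicted_tools = ["get_weather", "get_forecast", "delete_files"]
reference_tools = ["get_weather", "get_forecast", "get_alerts"]

precision, recall, f1 = tool_name_f1(predicted_tools, reference_tools)
# Tools present in the prediction but not the reference are the kind of
# extras an L3-style boundary check would flag.
extra_tools = sorted(set(predicted_tools) - set(reference_tools))
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} extra={extra_tools}")
```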

Citation

@misc{tool_genesis_2025,
  title={Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent},
  author={Xia, Bowei and Hu, Mengkang and Wang, Shijian and Jin, Jiarui and Jiao, Wenxiang and Lu, Yuan and Li, Kexin and Luo, Ping},
  year={2025},
  note={Project page: https://tool-genesis.github.io}
}

License

Apache 2.0

Safetensors

Model size: 8B params
Tensor type: BF16

Model tree for tool-genesis/Tool-Genesis-Qwen3-8B-SFT

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B