SmolLM3-3B-128K-GGUF

#1 opened by MLShare

Unfortunately it ends with:

Error: unable to load model: ....

Having the same issue; unable to load the .gguf file.

Update llama.cpp to a recent version.

llama-cli --version
version: 5873 (f5e96b36)
built with MSVC 19.43.34808.0 for x64

Run inference with llama-cli:
llama-cli -m "...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf" --jinja -p "Tell me something cool about space."

Why the --jinja flag... it breaks my inference app :-(

How do I use this model with Python llama.cpp?

This GGUF version of the model relies on a Jinja chat template embedded into it. You have to pass the --jinja flag to use its default template, or you can create your own template and use it with llama-cli.

e.g. with a custom Jinja template:

llama-cli ^
-m "...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf" ^
--jinja ^
--ctx-size 131072 ^
--interactive ^
--chat-template-file "...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\my_custom_template.jinja"
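
If you want to base a custom template on the one shipped inside the GGUF, you can dump the embedded template first. A minimal sketch with llama-cpp-python (same snapshot path as above; it only reads the metadata, no generation):

from llama_cpp import Llama

# A small context is enough, since we only read the metadata
llm = Llama(
    model_path=r"...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf",
    n_ctx=512,
    verbose=False
)

# GGUF files store the chat template under this metadata key
print(llm.metadata.get("tokenizer.chat_template", "<no embedded template>"))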

Sample Jinja template:

<my_custom_template.jinja>

<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 12 July 2025

## Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face. Respond clearly and concisely to user queries.
<|im_end|>
{% for message in messages %}
<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{% endfor %}
<|im_start|>assistant
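
Before pointing llama-cli at the file, you can sanity-check that the template renders what you expect. A quick sketch with jinja2 (assuming my_custom_template.jinja sits in the current directory):

from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("."))
template = env.get_template("my_custom_template.jinja")

# Render a tiny conversation and eyeball the resulting prompt
print(template.render(messages=[
    {"role": "user", "content": "Tell me something cool about space."}
]))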

@erickdp :

To load this model with Python llama.cpp:

Make sure you update llama-cpp-python to the latest version, as the SmolLM3 architecture is not supported in older versions:

pip install --upgrade llama-cpp-python --force-reinstall --no-cache-dir

pip show llama-cpp-python
Name: llama_cpp_python
Version: 0.3.14
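
You can also check the version from Python directly (recent releases expose __version__):

import llama_cpp
print(llama_cpp.__version__)  # expect 0.3.x or newer for SmolLM3 support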

Below is sample inference code to load the GGUF via Python llama.cpp:

from llama_cpp import Llama
from jinja2 import Environment, BaseLoader
from datetime import datetime

# The embedded SmolLM3 template calls strftime_now() for the current date;
# define it here so Jinja can resolve it
def strftime_now(fmt):
    return datetime.now().strftime(fmt)

# Path to GGUF file
model_path = r"..\.cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf"

# Load the model (sampling settings like temperature and repeat_penalty
# belong on the generation call, not the constructor, where they would
# be silently ignored)
llm = Llama(
    model_path=model_path,
    n_ctx=131072,
    verbose=False
)

chat_template = llm.metadata.get("tokenizer.chat_template")

if not chat_template:
    print("No embedded template found in GGUF. Using custom fallback.")
    chat_template = """{{ bos_token }}{% for message in messages %}
<|{{ message.role }}|>
{{ message.content }}{% endfor %}<|assistant|>
"""

# Use custom Jinja environment with strftime_now
jinja_env = Environment(loader=BaseLoader())
jinja_env.globals["strftime_now"] = strftime_now
template = jinja_env.from_string(chat_template)

messages = [{"role": "system", "content": "You are a helpful assistant."}]

print("πŸ€– Chat with SmolLM3 (type 'exit' to quit)\n")

while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ["exit", "quit"]:
        break

    messages.append({"role": "user", "content": user_input})

    # Render full chat history using the template;
    # add_generation_prompt=True asks the (embedded) template to append
    # the opening assistant turn so the model knows to answer
    prompt = template.render(
        messages=messages,
        add_generation_prompt=True,
        bos_token=llm.metadata.get("tokenizer.bos_token", ""),
        eos_token=llm.metadata.get("tokenizer.eos_token", "")
    )

    print("Assistant:", end=" ", flush=True)
    assistant_reply = ""
    for chunk in llm(prompt, max_tokens=512, stream=True, temperature=0.7, repeat_penalty=1.1,
                     stop=["<|im_end|>", "<|user|>", "<|end|>", "</s>"]):
        token = chunk["choices"][0]["text"]
        assistant_reply += token
        print(token, end="", flush=True)
    print()

    messages.append({"role": "assistant", "content": assistant_reply.strip()})

So does that mean it's just the wrong template in Ollama for SmolLM3?

It's not related to Ollama; the unsloth/SmolLM3-3B-128K-GGUF quantized version of the model relies on its embedded Jinja template. You can try other quantized models with Ollama as well, without Jinja.

MLShare changed discussion status to closed
