SmolLM3-3B-128K-GGUF

#1 opened by MLShare

Unfortunately it ends with:

Error: unable to load model: ....

Having the same issue; unable to load the .gguf file.

Update llama.cpp to a recent version.

llama-cli --version
version: 5873 (f5e96b36)
built with MSVC 19.43.34808.0 for x64

Run inference with llama-cli:
llama-cli -m "...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf" --jinja -p "Tell me something cool about space."

Why the --jinja flag... it breaks my inference app :-(

How do I use this model with Python llama.cpp?

This GGUF version of the model relies on a Jinja chat template embedded into it. You have to pass the --jinja flag to use its default template, or you can create your own template and use it with llama-cli.

e.g. with a custom Jinja template:

llama-cli ^
-m "...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf" ^
--jinja ^
--ctx-size 131072 ^
--interactive ^
--chat-template-file "...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\my_custom_template.jinja"
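
If you want to base a custom template on the one shipped inside the GGUF, you can dump the embedded template first. A minimal sketch with llama-cpp-python (same snapshot path as above; it only reads the metadata, no generation):

from llama_cpp import Llama

# A small context is enough, since we only read the metadata
llm = Llama(
    model_path=r"...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf",
    n_ctx=512,
    verbose=False
)

# GGUF files store the chat template under this metadata key
print(llm.metadata.get("tokenizer.chat_template", "<no embedded template>"))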

Sample Jinja template:

<my_custom_template.jinja>

<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 12 July 2025

## Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face. Respond clearly and concisely to user queries.
<|im_end|>
{% for message in messages %}
<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{% endfor %}
<|im_start|>assistant
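
Before pointing llama-cli at the file, you can sanity-check that the template renders what you expect. A quick sketch with jinja2 (assuming my_custom_template.jinja sits in the current directory):

from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("."))
template = env.get_template("my_custom_template.jinja")

# Render a tiny conversation and eyeball the resulting prompt
print(template.render(messages=[
    {"role": "user", "content": "Tell me something cool about space."}
]))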

@erickdp :

To load this model with Python llama.cpp:

Make sure you update llama-cpp-python to the latest version, as the SmolLM3 architecture is not supported in older versions:

pip install --upgrade llama-cpp-python --force-reinstall --no-cache-dir

pip show llama-cpp-python
Name: llama_cpp_python
Version: 0.3.14
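
You can also check the version from Python directly (recent releases expose __version__):

import llama_cpp
print(llama_cpp.__version__)  # expect 0.3.x or newer for SmolLM3 support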

Below is sample inference code to load the GGUF via Python llama.cpp:

from llama_cpp import Llama
from jinja2 import Environment, BaseLoader
from datetime import datetime

# The embedded SmolLM3 template calls strftime_now() for the current date;
# define it here so Jinja can resolve it
def strftime_now(fmt):
    return datetime.now().strftime(fmt)

# Path to GGUF file
model_path = r"..\.cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf"

# Load the model (sampling settings like temperature and repeat_penalty
# belong on the generation call, not the constructor, where they would
# be silently ignored)
llm = Llama(
    model_path=model_path,
    n_ctx=131072,
    verbose=False
)

chat_template = llm.metadata.get("tokenizer.chat_template")

if not chat_template:
    print("No embedded template found in GGUF. Using custom fallback.")
    chat_template = """{{ bos_token }}{% for message in messages %}
<|{{ message.role }}|>
{{ message.content }}{% endfor %}<|assistant|>
"""

# Use custom Jinja environment with strftime_now
jinja_env = Environment(loader=BaseLoader())
jinja_env.globals["strftime_now"] = strftime_now
template = jinja_env.from_string(chat_template)

messages = [{"role": "system", "content": "You are a helpful assistant."}]

print("πŸ€– Chat with SmolLM3 (type 'exit' to quit)\n")

while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ["exit", "quit"]:
        break

    messages.append({"role": "user", "content": user_input})

    # Render full chat history using the template;
    # add_generation_prompt=True asks the (embedded) template to append
    # the opening assistant turn so the model knows to answer
    prompt = template.render(
        messages=messages,
        add_generation_prompt=True,
        bos_token=llm.metadata.get("tokenizer.bos_token", ""),
        eos_token=llm.metadata.get("tokenizer.eos_token", "")
    )

    print("Assistant:", end=" ", flush=True)
    assistant_reply = ""
    for chunk in llm(prompt, max_tokens=512, stream=True, temperature=0.7, repeat_penalty=1.1,
                     stop=["<|im_end|>", "<|user|>", "<|end|>", "</s>"]):
        token = chunk["choices"][0]["text"]
        assistant_reply += token
        print(token, end="", flush=True)
    print()

    messages.append({"role": "assistant", "content": assistant_reply.strip()})

So does that mean it's just the wrong template in Ollama for SmolLM3?

It's not related to Ollama; the unsloth/SmolLM3-3B-128K-GGUF quantized version of the model relies on its embedded Jinja template. You can try other quantized models with Ollama as well, without Jinja.

MLShare changed discussion status to closed
