SmolLM3-3B-128K-GGUF
Unfortunately it ends with:
Error: unable to load model: ....
Having the same issue, unable to load the .gguf file.
Update llama.cpp to a recent version.
llama-cli --version
version: 5873 (f5e96b36)
built with MSVC 19.43.34808.0 for x64
Run inference with llama-cli:
llama-cli -m "...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf" --jinja -p "Tell me something cool about space."
Why the --jinja flag... it breaks my inference app :-(
How to use this model with python llama.cpp?
This GGUF version of the model has a Jinja chat template embedded in it. You have to use the --jinja flag to apply its default template, or you can create your own template and pass it to llama-cli.
E.g., with a custom Jinja template:
llama-cli ^
-m "...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf" ^
--jinja ^
--ctx-size 131072 ^
--interactive ^
--chat-template-file "...cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\my_custom_template.jinja"
Sample Jinja template:
<my_custom_template.jinja>
<|im_start|>system
## Metadata
Knowledge Cutoff Date: June 2025
Today Date: 12 July 2025
## Instructions
You are a helpful AI assistant named SmolLM, trained by Hugging Face. Respond clearly and concisely to user queries.
<|im_end|>
{% for message in messages %}
<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{% endfor %}
<|im_start|>assistant
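If you want to see what this template produces before passing it to llama-cli, you can render it locally with jinja2 (a minimal sketch; it assumes you saved the template above as my_custom_template.jinja in the current directory):

from jinja2 import Environment, BaseLoader

# Load the custom template from disk and render a short chat to preview the prompt string
template_text = open("my_custom_template.jinja", encoding="utf-8").read()
template = Environment(loader=BaseLoader()).from_string(template_text)

messages = [{"role": "user", "content": "Tell me something cool about space."}]
print(template.render(messages=messages))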
@erickdp :
To load this model with Python llama.cpp:
Make sure you update llama-cpp-python to the latest version, as the SmolLM3 architecture is not supported in older versions:
pip install --upgrade llama-cpp-python --force-reinstall --no-cache-dir
pip show llama-cpp-python
Name: llama_cpp_python
Version: 0.3.14
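You can also confirm the installed version from Python itself (just a sanity check):

import llama_cpp
print(llama_cpp.__version__)  # should match the pip show output above (0.3.14 or newer)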
Below is sample inference code to load the GGUF via Python llama.cpp:
from llama_cpp import Llama
from jinja2 import Environment, BaseLoader
from datetime import datetime
# strftime_now is expected by SmolLM3's embedded chat template (it inserts today's date);
# it is not a standard Jinja built-in, so define it here and expose it to the environment below
def strftime_now(fmt):
    return datetime.now().strftime(fmt)
# Path to GGUF file
model_path = r"..\.cache\huggingface\hub\models--unsloth--SmolLM3-3B-128K-GGUF\snapshots\3d9d3591996644952b74d2efb7b433450fd3da82\SmolLM3-3B-128K-Q4_K_M.gguf"
# Load the model
llm = Llama(
    model_path=model_path,
    n_ctx=131072,  # 128K context
    verbose=False
)
# Note: temperature and repeat_penalty are per-call sampling parameters
# (passed to llm(...) below), not Llama() constructor arguments.
chat_template = llm.metadata.get("tokenizer.chat_template")
if not chat_template:
print("No embedded template found in GGUF. Using custom fallback.")
chat_template = """{{ bos_token }}{% for message in messages %}
<|{{ message.role }}|>
{{ message.content }}{% endfor %}<|assistant|>
"""
# Use custom Jinja environment with strftime_now
jinja_env = Environment(loader=BaseLoader())
jinja_env.globals["strftime_now"] = strftime_now
template = jinja_env.from_string(chat_template)
messages = [{"role": "system", "content": "You are a helpful assistant."}]
print("π€ Chat with SmolLM3 (type 'exit' to quit)\n")
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ["exit", "quit"]:
        break
    messages.append({"role": "user", "content": user_input})

    # Render the full chat history using the template
    prompt = template.render(
        messages=messages,
        bos_token=llm.metadata.get("tokenizer.bos_token", ""),
        eos_token=llm.metadata.get("tokenizer.eos_token", ""),
        add_generation_prompt=True,  # most HF chat templates use this to append the assistant header
    )

    print("Assistant:", end=" ", flush=True)
    assistant_reply = ""
    # Sampling parameters (temperature, repeat_penalty) are passed on the call itself
    for chunk in llm(prompt, max_tokens=512, temperature=0.7, repeat_penalty=1.1,
                     stream=True, stop=["<|im_end|>", "<|user|>", "<|end|>", "</s>"]):
        token = chunk["choices"][0]["text"]
        assistant_reply += token
        print(token, end="", flush=True)
    print()
    messages.append({"role": "assistant", "content": assistant_reply.strip()})
So does that mean it's just the wrong template in Ollama for SmolLM3?
It's not related to Ollama. The unsloth/SmolLM3-3B-128K-GGUF quantized version of the model has a Jinja chat template embedded in it. You can try other quantized versions of the model with Ollama as well, without Jinja.
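If you want to verify that the template really is embedded in the GGUF, you can print it from the file's metadata (a minimal sketch; the path is a placeholder for your local snapshot):

from llama_cpp import Llama

# vocab_only=True reads the GGUF metadata and vocabulary without loading the full weights
llm = Llama(model_path="path/to/SmolLM3-3B-128K-Q4_K_M.gguf", vocab_only=True, verbose=False)
print(llm.metadata.get("tokenizer.chat_template", "no embedded chat template found"))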