🚀 Next 4B (s330)

Türkiye’s First Vision-Language Model — Efficient, Multimodal, and Reasoning-Focused

License: MIT · Language: English · Hugging Face


📖 Overview

Next 4B is a 4-billion-parameter multimodal Vision-Language Model (VLM) based on Gemma 3, fine-tuned to handle both text and images efficiently. It is Türkiye's first open-source vision-language model, designed for:

  • Understanding and generating text and image descriptions.
  • Efficient reasoning and context-aware multimodal outputs.
  • Turkish support with multilingual capabilities.
  • Low-resource deployment using 8-bit quantization for consumer-grade GPUs.

This model is ideal for researchers, developers, and organizations that need a high-performance multimodal AI capable of visual understanding, reasoning, and creative generation.
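As a concrete illustration of the low-resource path, 8-bit loading via the bitsandbytes integration in transformers might look like the sketch below. This is an illustrative sketch, not official setup instructions; see the usage section further down for the standard loading code.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# Sketch: load the 4B checkpoint in 8-bit so it fits on a consumer-grade GPU.
model = AutoModelForImageTextToText.from_pretrained(
    "Lamapi/next-4b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # requires bitsandbytes
    device_map="auto",  # place layers on the available GPU(s) automatically
)
```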


Our Next 1B and Next 4B models lead all of the tiny models on these benchmarks.

| Model | MMLU (5-shot) % | MMLU-Pro % | GSM8K % | MATH % |
|-------|-----------------|------------|---------|--------|
| Next 4B preview (version s325) | 84.6 | 66.9 | 82.7 | 70.5 |
| Next 1B (version t327) | 87.3 | 69.2 | 90.5 | 70.1 |
| Qwen 3 0.6B | 52.81 | 37.6 | 60.7 | 20.5 |
| Llama 3.2 1B | 49.3 | 44.4 | 11.9 | 30.6 |
| Kumru 7B (not verified) | 30.7 | 28.6 | 15.38 | 6.4 |

Our Next Z1 model also leads state-of-the-art models on some of these benchmarks.

| Model | MMLU (5-shot) % | MMLU-Pro % | GSM8K % | MATH % |
|-------|-----------------|------------|---------|--------|
| Next Z1 (version l294) | 97.3 | 94.2 | 97.7 | 93.2 |
| Next Z1 (version l294, no tools) | 94.7 | 90.1 | 94.5 | 88.7 |
| GPT-5 | 92.5 | 87.0 | 98.4 | 96.0 |
| Claude Opus 4.1 (Thinking) | ~92.0 | 87.8 | 84.7 | 95.4 |

🚀 Installation & Usage

Use with vision:

```python
from transformers import AutoTokenizer, AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch

model_id = "Lamapi/next-4b"

# Gemma 3-based multimodal checkpoints expose their vision tower through the
# image-text-to-text auto class; bfloat16 matches the published tensor type.
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)  # Handles image + text preprocessing.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Read the image
image = Image.open("image.jpg")

# Build the messages in chat format, mixing image and text content parts
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are Next-X1, a smart and concise AI assistant trained by Lamapi. Always respond in the user's language. Proudly made in Turkey."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Who is in this image?"},
        ],
    },
]

# Render the chat template, then preprocess text and image together
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate and decode the model's reply
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Example output:

```
Who is in this image?
The image shows Mustafa Kemal Atatürk, the founder and first President of the Republic of Turkey.
```

Use without vision:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Lamapi/next-4b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Chat messages
messages = [
    {"role": "system", "content": "You are Next-X1, a smart and concise AI assistant trained by Lamapi. Always respond in the user's language. Proudly made in Turkey."},
    {"role": "user", "content": "Hello, how are you?"},
]

# Render the chat template and tokenize
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")

# Generate and decode the model's reply
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Example output:

```
Hello, how are you?
I'm fine, thank you. How are you?
```
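For interactive use, tokens can be printed as they are generated instead of waiting for the full sequence. A minimal sketch using transformers' TextStreamer, reusing model, tokenizer, and inputs from the example above:

```python
from transformers import TextStreamer

# Prints decoded tokens to stdout as they arrive; skip_prompt hides the input text.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
```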

🎯 Goals

  1. Multimodal Intelligence: Understand and reason over images and text.
  2. Efficiency: Run on modest GPUs using 8-bit quantization.
  3. Accessibility: Open-source availability for research and applications.
  4. Cultural Relevance: Optimized for Turkish language and context while remaining multilingual.

✨ Key Features

| Feature | Description |
|---------|-------------|
| 🔋 Efficient Architecture | Optimized for low VRAM; supports 8-bit quantization on consumer GPUs. |
| 🖼️ Vision-Language Capable | Understands images, captions them, and performs visual-reasoning tasks. |
| 🇹🇷 Multilingual & Turkish-Ready | Handles complex Turkish text with high accuracy. |
| 🧠 Advanced Reasoning | Supports logical and analytical reasoning over both text and images. |
| 📊 Consistent & Reliable Outputs | Reproducible responses across multiple runs. |
| 🌍 Open Source | Transparent, community-driven, and research-friendly. |
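In practice, reproducibility depends on decoding settings. A minimal sketch (assuming the model, tokenizer, and inputs from the usage examples above) that makes outputs deterministic across runs:

```python
import torch

torch.manual_seed(0)  # Only relevant when sampling; greedy decoding below is already deterministic.

# Greedy decoding (do_sample=False) returns the same output for the same input on every run.
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```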

📐 Model Specifications

| Specification | Details |
|---------------|---------|
| Base Model | Gemma 3 |
| Parameter Count | 4 billion |
| Architecture | Transformer; causal LLM + vision encoder |
| Fine-Tuning Method | Instruction & multimodal supervised fine-tuning (SFT) on Turkish and multilingual datasets |
| Optimizations | Q8_0 quantization, plus F16/F32 precision variants for low- and high-VRAM setups |
| Modalities | Text & image |
| Use Cases | Image captioning, multimodal QA, text generation, reasoning, creative storytelling |
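The Q8_0/F16/F32 variants are GGUF-style files. Below is a minimal text-only loading sketch with llama-cpp-python; the repo id and filename glob are assumptions, so substitute the GGUF files actually published for the model:

```python
from llama_cpp import Llama

# Hypothetical repo/filename; point these at the real GGUF artifacts.
llm = Llama.from_pretrained(
    repo_id="Lamapi/next-4b",   # assumed location of the GGUF files
    filename="*q8_0.gguf",      # glob for the 8-bit quantized variant
    n_ctx=4096,                 # context window; adjust as needed
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Merhaba!"}],
    max_tokens=50,
)
print(result["choices"][0]["message"]["content"])
```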

📄 License

This project is licensed under the MIT License — free to use, modify, and distribute. Attribution is appreciated.


📞 Contact & Support


Next 4B — Türkiye’s first vision-language AI, combining multimodal understanding, reasoning, and efficiency.

Follow on HuggingFace
