# QwenAmann-4B-dse

A multimodal vision-language model specialized for multilingual technical document retrieval.

## Overview

QwenAmann-4B-dse is a 4B-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, preserving all of their information, including text, images, and layout, without requiring separate content extraction.

## Performance

### ENERGY Benchmark (racineai/Open-VLM-Retrieval-Leaderboard)

### Key Strengths

- Competitive performance: comparable to Jina Embeddings v4 while being fully open source under the Apache 2.0 license (Jina Embeddings v4 is governed by the Qwen Research License because it derives from Qwen2.5-VL-3B)
- Strong multilingual performance: stable scores across the 5 tested languages
- Multi-domain training: trained on 1.44M examples spanning 15+ technical domains
## Key Features

- Efficient Retrieval: Generates document and query embeddings for semantic similarity search
- Multimodal Understanding: Processes text, diagrams, charts, and tables in their original layout
- No Preprocessing Required: Works directly with document screenshots
## Installation

```bash
pip install transformers accelerate pillow torch qwen-vl-utils
```
## Usage Example

```python
from PIL import Image
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Load model and processor
model_path = "racineai/QwenAmann-4B-dse"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Configure image tokens (960 for Qwen3-VL)
num_image_tokens = 960
min_pixels = 1 * 32 * 32
max_pixels = num_image_tokens * 32 * 32

processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    # flash_attention_2 requires the flash-attn package; drop the argument to use the default attention
    attn_implementation="flash_attention_2" if torch.cuda.is_available() else None,
    torch_dtype=torch.bfloat16,
).to(device).eval()

# Configure left padding so the last token of every sequence is meaningful
processor.tokenizer.padding_side = "left"
model.padding_side = "left"

def get_embedding(last_hidden_state: torch.Tensor, dimension: int = 2560) -> torch.Tensor:
    """Extract and L2-normalize the embedding of the last token."""
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps[:, :dimension], p=2, dim=-1)
    return reps

# Encode a document image
document_image = Image.open("technical_document.jpg")
doc_messages = [{
    'role': 'user',
    'content': [
        {'type': 'image', 'image': document_image},
        {'type': 'text', 'text': 'What is shown in this image?'}
    ]
}]
doc_text = processor.apply_chat_template(
    doc_messages,
    tokenize=False,
    add_generation_prompt=True
) + "<|endoftext|>"
doc_image_inputs, doc_video_inputs = process_vision_info(doc_messages)
doc_inputs = processor(
    text=[doc_text],
    images=doc_image_inputs,
    videos=doc_video_inputs,
    padding='longest',
    return_tensors='pt'
).to(device)
cache_position = torch.arange(0, 1)
doc_inputs = model.prepare_inputs_for_generation(
    **doc_inputs,
    cache_position=cache_position,
    use_cache=False
)
with torch.no_grad():
    doc_outputs = model(**doc_inputs, return_dict=True, output_hidden_states=True)
    doc_embedding = get_embedding(doc_outputs.hidden_states[-1], dimension=2560)

# Encode a text query (a dummy 1x1 image keeps the input format consistent with documents)
query = "What are the specifications of this component?"
query_messages = [{
    'role': 'user',
    'content': [
        {'type': 'image', 'image': Image.new('RGB', (32, 32)),
         'resized_height': 1, 'resized_width': 1},
        {'type': 'text', 'text': f'Query: {query}'}
    ]
}]
query_text = processor.apply_chat_template(
    query_messages,
    tokenize=False,
    add_generation_prompt=True
) + "<|endoftext|>"
query_image_inputs, query_video_inputs = process_vision_info(query_messages)
query_inputs = processor(
    text=[query_text],
    images=query_image_inputs,
    videos=query_video_inputs,
    padding='longest',
    return_tensors='pt'
).to(device)
cache_position = torch.arange(0, 1)
query_inputs = model.prepare_inputs_for_generation(
    **query_inputs,
    cache_position=cache_position,
    use_cache=False
)
with torch.no_grad():
    query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True)
    query_embedding = get_embedding(query_outputs.hidden_states[-1], dimension=2560)

# Calculate similarity as the dot product of the normalized embeddings
similarity = torch.einsum("bd,cd->bc", query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
```
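In practice you often need embeddings for many document pages at once. The snippet below is a minimal sketch (not part of the original example; `encode_documents` is a hypothetical helper) that reuses the `processor`, `model`, `device`, and `get_embedding` objects defined above to encode a batch of screenshots in a single left-padded forward pass.

```python
from PIL import Image
import torch
from qwen_vl_utils import process_vision_info

def encode_documents(image_paths):
    """Encode several document screenshots in one left-padded batch.

    Assumes processor, model, device, and get_embedding() from the example above are in scope.
    """
    messages = [
        [{
            'role': 'user',
            'content': [
                {'type': 'image', 'image': Image.open(path)},
                {'type': 'text', 'text': 'What is shown in this image?'}
            ]
        }]
        for path in image_paths
    ]
    texts = [
        processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True) + "<|endoftext|>"
        for m in messages
    ]
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=texts,
        images=image_inputs,
        videos=video_inputs,
        padding='longest',
        return_tensors='pt'
    ).to(device)
    cache_position = torch.arange(0, len(texts))
    inputs = model.prepare_inputs_for_generation(**inputs, cache_position=cache_position, use_cache=False)
    with torch.no_grad():
        outputs = model(**inputs, return_dict=True, output_hidden_states=True)
        return get_embedding(outputs.hidden_states[-1], dimension=2560)  # shape: (num_docs, 2560)

# Hypothetical file names, for illustration only:
# doc_embeddings = encode_documents(["page_1.jpg", "page_2.jpg", "page_3.jpg"])
```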
## Applications

- Multilingual Technical Document Retrieval: Find relevant documents across multiple languages (see the ranking sketch after this list)
- International Technical Support Systems: Match user questions to relevant documentation regardless of language
- Engineering Knowledge Management: Index and search technical specifications, diagrams, and reports
- Multi-Domain Search: Retrieve documents across military, energy, quantum computing, nuclear, geotechnical, and other technical domains
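As a rough illustration of the retrieval scenarios above, the sketch below ranks a stack of precomputed, L2-normalized document embeddings against a query embedding using the same dot-product similarity as in the usage example. `rank_documents` is a hypothetical helper, not part of the model or its official examples.

```python
import torch

def rank_documents(query_embedding: torch.Tensor, doc_embeddings: torch.Tensor, top_k: int = 5):
    """Return the top_k documents most similar to the query.

    query_embedding: (1, dim); doc_embeddings: (num_docs, dim).
    Both are assumed L2-normalized, so the dot product equals cosine similarity.
    """
    scores = torch.einsum("bd,cd->bc", query_embedding, doc_embeddings).squeeze(0)
    return torch.topk(scores, k=min(top_k, scores.numel()))

# Reusing query_embedding and batched document embeddings from the snippets above:
# top = rank_documents(query_embedding, doc_embeddings)
# print(top.values, top.indices)
```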
## Training Methodology

QwenAmann-4B-dse was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This eliminates the need for content-extraction preprocessing while preserving all visual and textual information in the documents.

The model was fine-tuned on the OGC_MEGA_2 dataset, comprising 1.44M examples across 35+ languages, with a primary focus on 5 major European languages (English, French, German, Spanish, Italian). The dataset spans 15+ technical domains, including military, energy, quantum computing, nuclear, geotechnical engineering, and more.
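The exact training recipe is not detailed here, but DSE-style retrievers are typically fine-tuned with a contrastive objective over paired query and document-screenshot embeddings. The following is only a minimal sketch of such an InfoNCE-style loss with in-batch negatives, assuming L2-normalized embeddings as produced by `get_embedding` above; the temperature value is illustrative, not a hyperparameter of this model.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs: torch.Tensor,
                              doc_embs: torch.Tensor,
                              temperature: float = 0.02) -> torch.Tensor:
    """InfoNCE loss with in-batch negatives (illustrative, not this model's actual training code).

    query_embs, doc_embs: (batch, dim), L2-normalized; query_embs[i] is paired with doc_embs[i],
    and every other document in the batch acts as a negative for that query.
    """
    logits = query_embs @ doc_embs.T / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(query_embs.size(0), device=logits.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```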
## Authors

- Léo Appourchaux - Lead Developer at TW3 Partners
- Paul Lemaistre - GD at Racine.ai, Adjunct Professor at École Centrale d'Électronique
- Dataset Curators: Léo Appourchaux, Paul Lemaistre, Yumeng Ye, Mattéo KHAN, André-Louis Rochet
## License

This model is released under the Apache 2.0 license.
## Citation

```bibtex
@misc{qwenamann-4b-dse,
  author = {racine.ai},
  title = {QwenAmann-4B-dse: A Multimodal Vision-Language Model for Multilingual Document Retrieval},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/racineai/QwenAmann-4B-dse}
}
```
## Base Model

Qwen/Qwen3-VL-4B-Instruct