NuMarkdown-8B-reasoning on A100 40GB is extremely slow (even for 1 token)

#4 by Fedoration

Context:
– GPU: A100 40GB (bfloat16), device_map="auto"
– Libraries: transformers, Qwen2_5_VLForConditionalGeneration, AutoProcessor
– Input: a single PDF page at 300 dpi, passed as a PIL Image
– Goal: measure latency for generating 1 token

Code:
import torch
from pdf2image import convert_from_path
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "numind/NuMarkdown-8B-reasoning"

# Processor with the vision token budget (pixel limits per image)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    min_pixels=256*28*28, max_pixels=5000*28*28,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

%%time
# Render the first PDF page at 300 dpi as a PIL image
pdf_path = "../data/document.pdf"
images = convert_from_path(pdf_path, dpi=300)
img = images[0]

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_input = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)

with torch.no_grad():
    print("Start generating")
    model_output = model.generate(**model_input, temperature=0.7, max_new_tokens=1)

print("Finish generating")
result = processor.decode(model_output[0])
result

Logs/timings:
Start generating
Finish generating
CPU times: user 3.1 s, sys: 221 ms, total: 3.32 s
Wall time: 3.55 s

Issue:
Even with max_new_tokens=1, inference takes ~3.5 s on an A100 40GB; I expected sub-second latency for a single image and a single token. More importantly, generating an answer of about 512 characters takes roughly 2,048 seconds (about 34 minutes), which seems abnormally slow.

Am I doing something wrong in the current setup that would explain such slow generation, and what should be changed?

How big is your input sequence? I imagine it is likely huge.
Try lowering your dpi and min_pixels. If you want to use transformers, we highly recommend also using FlashAttention, as we do in the model card.
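
As a rough sketch of what that could look like with the same objects as in the question (the dpi value and pixel budgets below are illustrative, not the model card's exact settings, and FlashAttention requires the flash-attn package to be installed):

# Check how many tokens the prompt actually contains (most of them are image tokens)
print("input tokens:", model_input["input_ids"].shape[-1])

# Render at a lower dpi and cap the vision token budget
img = convert_from_path(pdf_path, dpi=150)[0]
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    min_pixels=256*28*28,
    max_pixels=1280*28*28,   # much smaller cap than 5000*28*28
)

# Reload the model with FlashAttention 2 enabled
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # needs flash-attn installed
    device_map="auto",
    trust_remote_code=True,
)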

The reasoning trace generally requires a lot of tokens to be generated, so if speed is a high priority we recommend vLLM instead of transformers; a sketch is shown below.
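
A minimal offline vLLM sketch, assuming a recent vLLM build with Qwen2.5-VL support; the prompt is built with the HF processor's chat template, and the sampling values are placeholders rather than the model card's recommended settings:

from pdf2image import convert_from_path
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_id = "numind/NuMarkdown-8B-reasoning"

# Build the same chat prompt as in the transformers example
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
messages = [{"role": "user", "content": [{"type": "image"}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Render the page and generate with vLLM, passing the image as multimodal data
img = convert_from_path("../data/document.pdf", dpi=150)[0]
llm = LLM(model=model_id, trust_remote_code=True)
sampling = SamplingParams(temperature=0.7, max_tokens=4096)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": img}},
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)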

Why doesn't it work well when I use vLLM? The output doesn't even contain the thinking tag.