No output / Repeated outputs when using Gemma 3 12B/27B on vLLM
I have hosted Gemma 3 27B and 12B on 4 L4 GPUs using vLLM, and I am trying to translate a few docs from English to Indic languages. However, I am either getting no output in the target language or getting repetitions in English. The vLLM serve command for these models is below. I tried sarvam-translate with the exact same settings and it just works out of the box.
I have tried tweaking the generation parameters and even tried smaller sentences, but it does not work. Am I missing something here?
This is my vLLM serve command:
vllm serve google/gemma-3-12b-it \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --port 8000 \
    --max-model-len 8192 \
    --enable-chunked-prefill \
    --gpu-memory-utilization 0.9
Vanilla client code that I have been trying:
```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use whichever model the server is serving.
models = client.models.list()
model = models.data[0].id

tgt_lang = 'Hindi'
input_txt = 'Be the change you wish to see in the world.'

messages = [
    {"role": "system", "content": f"Translate the text below to {tgt_lang}."},
    {"role": "user", "content": input_txt},
]
response = client.chat.completions.create(model=model, messages=messages, temperature=0.01)

output_text = response.choices[0].message.content
print("Input:", input_txt)
print("Translation:", output_text)
```
I have this problem.
Having the same issue. Hope someone from Google replies soon.
Me too. For image-to-text, it is fine.
The issue is gone after I switched to the latest image:
inferenceservice:
  predictor:
    containers:
      - name: kserve-container
        imageURL: vllm/vllm-openai:v0.10.0
        args:
          - --model=google/gemma-3-27b-it
          - --tokenizer=google/gemma-3-27b-it
          - --tensor-parallel-size=8
          - --gpu-memory-utilization=0.9
          - --max-model-len=8192
          - --trust-remote-code
          - --enforce-eager
Hi,
Apologies for the late reply. The core problem is that the model is likely not interpreting the prompt as a translation task. Your vllm serve command loads google/gemma-3-12b-it; although it is an instruction-tuned model, it may not respond well to a generic instruction like "Translate the text below."
The standard prompt format for Gemma 3 is as follows:
"< start_of_turn >user
[ your prompt here ]< end_of_turn >
< start_of_turn >model"
Suggested Fixes:
Use a more detailed prompt. Provide more context and specific instructions to the model. A zero-shot prompt may not be sufficient for the complex task of translation, especially for Indic languages where the model may have had less training data.
Try a few-shot prompt. Provide a few examples of English-to-Hindi translations in the prompt; showing the model the exact format and type of output you expect can significantly improve performance, and it is a common, effective technique for complex tasks like translation (see the sketch after this list).
Use a specific fine-tuned model. The google/gemma-3-12b-it model is a general instruction-tuned model. If you are doing a high volume of translations, consider using or fine-tuning a model specifically for this purpose. A fine-tuned model for Indic languages will likely perform better than a general-purpose model.
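As a concrete starting point, here is a minimal few-shot sketch that reuses the `client`, `model`, and `input_txt` from your snippet above. The two reference translations are ordinary Hindi phrases used as illustrative placeholders (swap in vetted examples from your own documents), and `max_tokens` is an assumed cap:

```python
# Few-shot translation sketch, reusing `client`, `model`, and `input_txt`
# from the earlier client snippet.
instruction = "Translate the following sentence to Hindi:"

# Illustrative placeholder examples; replace with vetted reference translations.
few_shot = [
    {"role": "user", "content": f"{instruction}\n\nGood morning."},
    {"role": "assistant", "content": "सुप्रभात।"},
    {"role": "user", "content": f"{instruction}\n\nWhere is the railway station?"},
    {"role": "assistant", "content": "रेलवे स्टेशन कहाँ है?"},
]

messages = few_shot + [
    {"role": "user", "content": f"{instruction}\n\n{input_txt}"},
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.01,
    max_tokens=256,  # assumed cap; raise it for longer passages
)
print(response.choices[0].message.content)
```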
Thanks.