Performance?

#6
by Budenloui - opened

Hello.

I want to use TranslateGemma for live translation, but the performance is very slow. I'm a Python beginner; I took the example script and extended it with Claude, so it may contain plenty of mistakes. Is my hardware too old, or is the code just not optimized? How can I speed up the translation? And what times do you get when translating the text below from English into German?

Environment
OS: Windows 11
GPU: NVIDIA Quadro RTX 6000 (2019, Turing architecture, 24 GB VRAM)
Python: 3.11
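
One idea I have read about but not benchmarked yet is loading the model 4-bit quantized, which should cut both VRAM use and the data moved per generated token. A minimal sketch, assuming bitsandbytes installs cleanly on this Windows setup (not verified here; model name taken from the tests below):

from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # float16 compute; Turing has no native bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "google/TranslateGemma-12B-it",
    device_map="cuda",
    quantization_config=bnb_config,
)
processor = AutoProcessor.from_pretrained("google/TranslateGemma-12B-it")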

################################ test_performance.py #####################
"""
Sequential Performance Comparison: TranslateGemma 4B vs 12B
Tests both models one after another and compares results
"""

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
import time

def benchmark_model(model_name, input_text):
"""Benchmark a single model with the given text"""
print("\n" + "=" * 80)
print(f"TESTING MODEL: {model_name}")
print("=" * 80)

# =============================================================================
# Step 1: Model Loading
# =============================================================================
print("\n[1/5] Model Loading...")
start = time.time()

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
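    # Untested ideas for speeding this up on my Turing card (not benchmarked here):
    # - pass attn_implementation="sdpa" to from_pretrained() above to force PyTorch's
    #   SDPA attention kernels (FlashAttention 2 needs Ampere or newer, so it is out);
    # - try dtype=torch.float16 instead of bfloat16, since Turing has no native bf16 support.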
    processor = AutoProcessor.from_pretrained(model_name)

    load_time = time.time() - start
    param_count = sum(p.numel() for p in model.parameters()) / 1e9

    print(f"✓ Model loaded in: {load_time:.2f}s")
    print(f"✓ Parameters: {param_count:.2f}B")

    # =============================================================================
    # Step 2: Input Preparation
    # =============================================================================
    print("\n[2/5] Input preparation...")

    messages = [{
        "role": "user",
        "content": [{
            "type": "text",
            "source_lang_code": "en",
            "target_lang_code": "de-DE",
            "text": input_text
        }]
    }]

    word_count = len(input_text.split())
    char_count = len(input_text.strip())
    print(f"✓ Input: {word_count} words, {char_count} chars")

    # =============================================================================
    # Step 3: Tokenization
    # =============================================================================
    print("\n[3/5] Tokenization...")
    start = time.time()

    try:
        formatted_text = processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = processor(text=formatted_text, return_tensors="pt").to("cuda")
    except Exception as e:
        print(f"  Chat template method failed: {e}")
        print("  Using fallback method...")
        # Fallback prompt; direction corrected to match the chat template (en -> de-DE)
        prompt = f"Translate from en to de-DE:\n{input_text}"
        inputs = processor(text=prompt, return_tensors="pt").to("cuda")

    input_length = inputs['input_ids'].shape[1]
    tokenize_time = time.time() - start

    print(f"✓ Tokenized in: {tokenize_time:.3f}s")
    print(f"✓ Input tokens: {input_length}")

    # =============================================================================
    # Step 4: Inference with Warmup
    # =============================================================================
    print("\n[4/5] Translation inference...")

    # Warmup
    print("  Warmup run...")
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=10, do_sample=False)

    # Main inference
    print("  Main inference running...")
    torch.cuda.reset_peak_memory_stats()
    start = time.time()

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=500,
            do_sample=False,
            use_cache=True,
            pad_token_id=processor.tokenizer.eos_token_id,
        )
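    # NOTE: in the runs below both models stop at exactly 500 output tokens, i.e. they
    # hit the max_new_tokens cap, so the translations are most likely cut off mid-sentence.
    # A larger cap (or splitting the input into chunks) would be needed for a complete
    # translation, and the tokens/s numbers include generating all the way to that cap.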

    inference_time = time.time() - start
    output_length = outputs.shape[1] - input_length
    tokens_per_sec = output_length / inference_time
    peak_memory = torch.cuda.max_memory_allocated() / 1024**3

    print(f"  ✓ Inference completed in: {inference_time:.2f}s")
    print(f"  ✓ Output tokens: {output_length}")
    print(f"  ✓ Speed: {tokens_per_sec:.2f} tokens/s")
    print(f"  ✓ Time per token: {(inference_time/output_length)*1000:.1f}ms")
    print(f"  ✓ Peak VRAM: {peak_memory:.2f} GB")

    # =============================================================================
    # Step 5: Decoding
    # =============================================================================
    print("\n[5/5] Decoding...")
    start = time.time()

    decoded = processor.batch_decode(
        outputs[:, input_length:],
        skip_special_tokens=True
    )[0]

    decode_time = time.time() - start
    output_words = len(decoded.split())
    output_chars = len(decoded.strip())

    print(f"✓ Decoded in: {decode_time:.3f}s")
    print(f"✓ Output: {output_words} words, {output_chars} chars")

    # =============================================================================
    # Calculate All Metrics
    # =============================================================================
    total_time = load_time + tokenize_time + inference_time + decode_time
    words_per_sec = output_words / inference_time
    ms_per_token = (inference_time / output_length) * 1000

    results = {
        'model_name': model_name,
        'model_short': model_name.split('/')[-1].replace('TranslateGemma-', '').replace('-it', ''),
        'param_count': param_count,
        'load_time': load_time,
        'tokenize_time': tokenize_time,
        'inference_time': inference_time,
        'decode_time': decode_time,
        'total_time': total_time,
        'input_words': word_count,
        'input_chars': char_count,
        'input_tokens': input_length,
        'output_words': output_words,
        'output_chars': output_chars,
        'output_tokens': output_length,
        'tokens_per_sec': tokens_per_sec,
        'words_per_sec': words_per_sec,
        'ms_per_token': ms_per_token,
        'peak_vram': peak_memory,
        'translation': decoded
    }

    # Print summary for this model
    print("\n" + "-" * 80)
    print(f"SUMMARY FOR {results['model_short']}")
    print("-" * 80)
    print(f"Total time:        {total_time:.2f}s")
    print(f"Inference time:    {inference_time:.2f}s")
    print(f"Speed:             {tokens_per_sec:.2f} tok/s, {words_per_sec:.2f} words/s")
    print(f"Peak VRAM:         {peak_memory:.2f} GB")

    # Cleanup
    del model
    torch.cuda.empty_cache()
    time.sleep(2)  # Brief pause before next model

    return results

# =============================================================================
# MAIN PROGRAM
# =============================================================================

print("=" * 80)
print("TRANSLATEGEMMA SEQUENTIAL COMPARISON: 4B vs 12B")
print("=" * 80)

# Test text - Full translation

input_text = """
But the tech adoption in Europe is a serious and very, very structural problem.
And what scares me the most is I haven't seen any political leader just stand up and say,
we have a serious and structural problem that we are going to fix. So that, then you get to
the development world. I would imagine, it also depends what you mean by the developing
world, I would imagine with not enough knowledge, you're just going to find pockets that
go very well and pockets that go very poorly. As a generalization, like again, if you go
back to this somewhat non-successful soliloquy I had about the underlying architecture,
one way to look at the unfairness of AI is it can test, meaning it loads errors on things.
So societies and organizations and companies that can bear that load have a huge advantage.
The problem is, if you can't, if you've been pretending you're bearing a load, you're not.
It collapses, and that's where you have to start. And so, if you go around and just say,
okay, what societies and microcultures are going to be load-bearing here, I think you would
find that partially developing world, certain communities that are going to do very well.
You do need a realistic assessment of the load-bearing. And there's a certain honesty that
is painful for all of us in this technology. Large language models, however, implemented in
software, you just cannot obfuscate what can bear the load and what can't. And then political
structures are built to do just that. Like, yeah, I can't fix anything, but I can give you some
bullshit line that you want to hear that's going to make you not care about how bad your life
is and how much worse it's going to be tomorrow.
"""

print(f"\nTest text: {len(input_text.split())} words\n")

# Test both models

models = [
    "google/TranslateGemma-4B-it",
    "google/TranslateGemma-12B-it"
]

results = []

for model_name in models:
    try:
        result = benchmark_model(model_name, input_text)
        results.append(result)
    except Exception as e:
        print(f"\n✗ Error testing {model_name}: {e}")
        import traceback
        traceback.print_exc()

# =============================================================================
# COMPARISON ANALYSIS
# =============================================================================

if len(results) == 2:
    model_4b = results[0]
    model_12b = results[1]

print("\n\n" + "=" * 80)
print("DETAILED COMPARISON: 4B vs 12B")
print("=" * 80)

# Phase-by-phase comparison
print("\n" + "-" * 80)
print("PHASE-BY-PHASE BREAKDOWN")
print("-" * 80)
print(f"\n{'Phase':<20} {'4B Model':<15} {'12B Model':<15} {'Difference':<15}")
print("-" * 80)
print(f"{'Model Loading':<20} {model_4b['load_time']:>12.2f}s   {model_12b['load_time']:>12.2f}s   {model_4b['load_time']-model_12b['load_time']:>+12.2f}s")
print(f"{'Tokenization':<20} {model_4b['tokenize_time']:>12.3f}s   {model_12b['tokenize_time']:>12.3f}s   {model_4b['tokenize_time']-model_12b['tokenize_time']:>+12.3f}s")
print(f"{'Inference':<20} {model_4b['inference_time']:>12.2f}s   {model_12b['inference_time']:>12.2f}s   {model_4b['inference_time']-model_12b['inference_time']:>+12.2f}s")
print(f"{'Decoding':<20} {model_4b['decode_time']:>12.3f}s   {model_12b['decode_time']:>12.3f}s   {model_4b['decode_time']-model_12b['decode_time']:>+12.3f}s")
print("-" * 80)
print(f"{'TOTAL':<20} {model_4b['total_time']:>12.2f}s   {model_12b['total_time']:>12.2f}s   {model_4b['total_time']-model_12b['total_time']:>+12.2f}s")

    # Performance metrics comparison
    print("\n" + "-" * 80)
    print("PERFORMANCE METRICS")
    print("-" * 80)

    speedup_tok = model_4b['tokens_per_sec'] / model_12b['tokens_per_sec']
    speedup_words = model_4b['words_per_sec'] / model_12b['words_per_sec']
    time_saved = ((model_12b['inference_time'] - model_4b['inference_time']) / model_12b['inference_time']) * 100
    vram_saved = ((model_12b['peak_vram'] - model_4b['peak_vram']) / model_12b['peak_vram']) * 100

    print(f"\n{'Metric':<25} {'4B Model':<18} {'12B Model':<18} {'4B Advantage':<15}")
    print("-" * 80)
    print(f"{'Parameters':<25} {model_4b['param_count']:>14.2f}B   {model_12b['param_count']:>14.2f}B   {model_12b['param_count']/model_4b['param_count']:>10.2f}x smaller")
    print(f"{'Inference Time':<25} {model_4b['inference_time']:>16.2f}s   {model_12b['inference_time']:>16.2f}s   {time_saved:>10.1f}% faster")
    print(f"{'Tokens/second':<25} {model_4b['tokens_per_sec']:>16.2f}   {model_12b['tokens_per_sec']:>16.2f}   {speedup_tok:>10.2f}x")
    print(f"{'Words/second':<25} {model_4b['words_per_sec']:>16.2f}   {model_12b['words_per_sec']:>16.2f}   {speedup_words:>10.2f}x")
    print(f"{'ms per Token':<25} {model_4b['ms_per_token']:>16.1f}   {model_12b['ms_per_token']:>16.1f}   {model_12b['ms_per_token']/model_4b['ms_per_token']:>10.2f}x faster")
    print(f"{'Peak VRAM':<25} {model_4b['peak_vram']:>14.2f}GB   {model_12b['peak_vram']:>14.2f}GB   {vram_saved:>10.1f}% less")

    # Token analysis
    print("\n" + "-" * 80)
    print("TOKEN ANALYSIS")
    print("-" * 80)
    print(f"\n{'Metric':<25} {'4B Model':<18} {'12B Model':<18} {'Difference':<15}")
    print("-" * 80)
    print(f"{'Input tokens':<25} {model_4b['input_tokens']:>16d}   {model_12b['input_tokens']:>16d}   {model_4b['input_tokens']-model_12b['input_tokens']:>+15d}")
    print(f"{'Output tokens':<25} {model_4b['output_tokens']:>16d}   {model_12b['output_tokens']:>16d}   {model_4b['output_tokens']-model_12b['output_tokens']:>+15d}")
    print(f"{'Output words':<25} {model_4b['output_words']:>16d}   {model_12b['output_words']:>16d}   {model_4b['output_words']-model_12b['output_words']:>+15d}")

    # Efficiency score
    print("\n" + "-" * 80)
    print("EFFICIENCY ANALYSIS")
    print("-" * 80)

    # Speed per parameter (tok/s per billion params)
    efficiency_4b = model_4b['tokens_per_sec'] / model_4b['param_count']
    efficiency_12b = model_12b['tokens_per_sec'] / model_12b['param_count']

    print(f"\nSpeed per billion parameters:")
    print(f"  4B Model:  {efficiency_4b:.3f} tok/s/B")
    print(f"  12B Model: {efficiency_12b:.3f} tok/s/B")
    print(f"  Efficiency ratio: {efficiency_4b/efficiency_12b:.2f}x")

    # Time projections
    print("\n" + "-" * 80)
    print("TIME PROJECTIONS FOR DIFFERENT TEXT LENGTHS")
    print("-" * 80)
    print(f"\n{'Text Length':<15} {'4B Model':<15} {'12B Model':<15} {'Time Saved':<15}")
    print("-" * 80)

    for words in [100, 500, 1000, 5000, 10000]:
        time_4b = words / model_4b['words_per_sec']
        time_12b = words / model_12b['words_per_sec']
        saved_pct = ((time_12b - time_4b) / time_12b) * 100

        # Format times nicely
        if time_4b < 60:
            time_4b_str = f"{time_4b:.1f}s"
        else:
            time_4b_str = f"{time_4b/60:.1f}min"

        if time_12b < 60:
            time_12b_str = f"{time_12b:.1f}s"
        else:
            time_12b_str = f"{time_12b/60:.1f}min"

        print(f"{words:>6} words    {time_4b_str:>13}   {time_12b_str:>13}   {saved_pct:>11.1f}%")

    # Translation quality sample
    print("\n" + "=" * 80)
    print("TRANSLATION SAMPLES (first 400 characters)")
    print("=" * 80)

    print(f"\n4B Model:")
    print("-" * 80)
    print(model_4b['translation'][:400] + ("..." if len(model_4b['translation']) > 400 else ""))

    print(f"\n12B Model:")
    print("-" * 80)
    print(model_12b['translation'][:400] + ("..." if len(model_12b['translation']) > 400 else ""))

    # Summary
    print("\n" + "=" * 80)
    print("PERFORMANCE SUMMARY")
    print("=" * 80)

    print(f"""

• 4B model is {speedup_tok:.2f}x faster ({time_saved:.1f}% time saved)
• 4B model uses {vram_saved:.1f}% less VRAM ({model_12b['peak_vram']-model_4b['peak_vram']:.2f} GB saved)
• 4B model is {efficiency_4b/efficiency_12b:.2f}x more efficient per parameter
• 12B model produces more detailed and nuanced translations
""")

else:
print("\n⚠ Could not complete comparison - one or both models failed to run")

# System info

print("=" * 80)
print("SYSTEM INFO")
print("=" * 80)
if torch.cuda.is_available():
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"CUDA: {torch.version.cuda}")
print(f"Total VRAM: {torch.cuda.get_device_properties(0).total_memory/1024**3:.1f} GB")

print("\n" + "=" * 80)
print("ANALYSIS COMPLETE")
print("=" * 80)

######################## output ##########################################################
(tg-env) PS C:\Translate> python test_performance.py

TRANSLATEGEMMA SEQUENTIAL COMPARISON: 4B vs 12B

Test text: 301 words

================================================================================
TESTING MODEL: google/TranslateGemma-4B-it

[1/5] Model Loading...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.74s/it]
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
✓ Model loaded in: 14.61s
✓ Parameters: 4.30B

[2/5] Input preparation...
✓ Input: 301 words, 1694 chars

[3/5] Tokenization...
✓ Tokenized in: 0.063s
✓ Input tokens: 490

[4/5] Translation inference...
Warmup run...
The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set TRANSFORMERS_VERBOSITY=info for more details.
Main inference running...
✓ Inference completed in: 49.61s
✓ Output tokens: 500
✓ Speed: 10.08 tokens/s
✓ Time per token: 99.2ms
✓ Peak VRAM: 8.18 GB

[5/5] Decoding...
✓ Decoded in: 0.000s
✓ Output: 279 words, 1879 chars


SUMMARY FOR 4B

Total time: 64.28s
Inference time: 49.61s
Speed: 10.08 tok/s, 5.62 words/s
Peak VRAM: 8.18 GB

================================================================================
TESTING MODEL: google/TranslateGemma-12B-it

[1/5] Model Loading...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.03s/it]
✓ Model loaded in: 24.69s
✓ Parameters: 12.19B

[2/5] Input preparation...
✓ Input: 301 words, 1694 chars

[3/5] Tokenization...
✓ Tokenized in: 0.005s
✓ Input tokens: 490

[4/5] Translation inference...
Warmup run...
Main inference running...
✓ Inference completed in: 222.32s
✓ Output tokens: 500
✓ Speed: 2.25 tokens/s
✓ Time per token: 444.6ms
✓ Peak VRAM: 23.15 GB

[5/5] Decoding...
✓ Decoded in: 0.000s
✓ Output: 290 words, 1910 chars


SUMMARY FOR 12B

Total time: 247.02s
Inference time: 222.32s
Speed: 2.25 tok/s, 1.30 words/s
Peak VRAM: 23.15 GB

================================================================================
DETAILED COMPARISON: 4B vs 12B


PHASE-BY-PHASE BREAKDOWN

Phase             4B Model     12B Model     Difference
Model Loading       14.61s       24.69s       -10.08s
Tokenization         0.063s       0.005s       +0.057s
Inference           49.61s      222.32s      -172.71s
Decoding             0.000s       0.000s       +0.000s
TOTAL               64.28s      247.02s      -182.74s


PERFORMANCE METRICS

Metric            4B Model     12B Model     4B Advantage
Parameters           4.30B       12.19B       2.83x smaller
Inference Time      49.61s      222.32s       77.7% faster
Tokens/second        10.08         2.25        4.48x
Words/second          5.62         1.30        4.31x
ms per Token          99.2        444.6        4.48x faster
Peak VRAM           8.18GB      23.15GB       64.6% less


TOKEN ANALYSIS

Metric            4B Model     12B Model     Difference
Input tokens           490          490         +0
Output tokens          500          500         +0
Output words           279          290        -11


EFFICIENCY ANALYSIS

Speed per billion parameters:
4B Model: 2.344 tok/s/B
12B Model: 0.185 tok/s/B
Efficiency ratio: 12.70x


TIME PROJECTIONS FOR DIFFERENT TEXT LENGTHS

Text Length     4B Model     12B Model     Time Saved
  100 words        17.8s        1.3min        76.8%
  500 words        1.5min       6.4min        76.8%
 1000 words        3.0min      12.8min        76.8%
 5000 words       14.8min      63.9min        76.8%
10000 words       29.6min     127.8min        76.8%

================================================================================
TRANSLATION SAMPLES (first 400 characters)

4B Model:

Aber die Technologieeinführung in Europa ist ein ernsthaftes und sehr strukturelles Problem.
Und was mich am meisten beunruhigt, ist, dass ich keine politische Führungskraft gesehen habe, die einfach nur sagt:
"Wir haben ein ernsthaftes und strukturelles Problem, das wir beheben werden." Also, dann kommt man zu der Welt der Entwicklung. Ich stelle mir vor, dass es auch davon abhängt, was man unter...

12B Model:

Die Einführung von Technologie in Europa ist ein ernstes und sehr tiefgreifendes Problem.
Und was mich am meisten beunruhigt, ist, dass ich noch keinen politischen Führer erlebt habe, der einfach gesagt hat:
"Wir haben ein ernstes und strukturelles Problem, das wir lösen werden." Wenn man dann in die Entwicklungsländer schaut, so denke ich, hängt es auch davon ab, was man unter "Entwicklungsländer...

================================================================================
PERFORMANCE SUMMARY

• 4B model is 4.48x faster (77.7% time saved)
• 4B model uses 64.6% less VRAM (14.97 GB saved)
• 4B model is 12.70x more efficient per parameter
• 12B model produces more detailed and nuanced translations

================================================================================
SYSTEM INFO

GPU: Quadro RTX 6000
CUDA: 12.6
Total VRAM: 24.0 GB

================================================================================
ANALYSIS COMPLETE

How long does it take to translate this example text from en to de-DE on your system?

RTX 6000 (Turing, 24GB)
Total time 4B : 64s
Total time 12B: 247s

input_text = """
But the tech adoption in Europe is a serious and very, very structural problem.
And what scares me the most is I haven't seen any political leader just stand up and say,
we have a serious and structural problem that we are going to fix. So that, then you get to
the development world. I would imagine, it also depends what you mean by the developing
world, I would imagine with not enough knowledge, you're just going to find pockets that
go very well and pockets that go very poorly. As a generalization, like again, if you go
back to this somewhat non-successful soliloquy I had about the underlying architecture,
one way to look at the unfairness of AI is it can test, meaning it loads errors on things.
So societies and organizations and companies that can bear that load have a huge advantage.
The problem is, if you can't, if you've been pretending you're bearing a load, you're not.
It collapses, and that's where you have to start. And so, if you go around and just say,
okay, what societies and microcultures are going to be load-bearing here, I think you would
find that partially developing world, certain communities that are going to do very well.
You do need a realistic assessment of the load-bearing. And there's a certain honesty that
is painful for all of us in this technology. Large language models, however, implemented in
software, you just cannot obfuscate what can bear the load and what can't. And then political
structures are built to do just that. Like, yeah, I can't fix anything, but I can give you some
bullshit line that you want to hear that's going to make you not care about how bad your life
is and how much worse it's going to be tomorrow.
"""

Google org

Hi @Budenloui
Thank you for sharing such detailed performance benchmarks for the Gemma translation models; your insights are incredibly valuable. I will share your findings and the latency concerns with our engineering team for a closer look.

I have the same question.

GPU: H100x1
CUDA: 12.9
Total VRAM: 96.0 GB

Total time 27B : 11s

input_text = """
Apart from the snow and the temperature, Greenland does not have much in common with the Swiss alps. But the fight for the future of the island looms over the gathering of world leaders and businesses at the World Economic Forum (WEF) this week.

Indeed, the timing of Donald Trump's extraordinary threat seems likely to have had this meeting in mind.

Trump loves Davos - which is beyond strange given the views of his base.

Last year, he beamed himself into the WEF from the White House, appearing before an audience of largely bewildered European executives just two days after his inauguration.

There was awkward shuffling as he mentioned his territorial ambitions for Canada and Greenland, and made an "offer you can't refuse" to those importing into his country. Build factories in the US or pay tariffs that will raise trillions: "Your prerogative."

He did it with a smile, however, apologised for not attending in person and promised he would be there this year.

And on Wednesday he will be here, pushing the Team USA message at a time of bewilderment in much of the rest of the world, especially in Europe.

Trump is due to speak at what will be the biggest Davos ever, driven by his presence and his policies, which have caused what one of the cerebral sessions of WEF might refer to using a snappy title like "The Great Global Disruption".

Trump is the disruptor-in-chief, certainly right now. He will be pursued by other world leaders and corporate bosses about his attempt to coerce Europe economically to sell Greenland.

The forum will be the centre of the world this week - and totally bizarre.

"A spirit of dialogue" is the official theme, and while there are certainly opportunities at an event like this for conversations not possible elsewhere, there is much in the US administration's approach that seems to be opposed to the call for global cooperation that is the essence of this place.
"""

Thanks for the feedback.
Have you used my code? It breaks down how long each step takes, so engine start-up is reported separately from the translation time itself.
How much of your 96 GB was used by the 27B model? I suspect most of your time went into initializing the engine and loading the model into VRAM.
Can you please repeat the test with my example text using the 4B and 12B models?
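
If you are not running my full script, the peak usage can be read the same way the script measures it: reset the counter before generate() and read it afterwards.

import torch

torch.cuda.reset_peak_memory_stats()
# ... run model.generate(...) for the 27B model here ...
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")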

Sign up or log in to comment