example code doesn't work at all

#2
by cloudyu - opened

Output is <pad> tokens only.
Prompt: Write me a poem about Machine Learning.

mlx 0.15.2
mlx-lm 0.15.0
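
(Installed versions can be listed with, e.g., pip show mlx mlx-lm.)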

MLX Community org

The example code should work fine:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)
prince-canuma changed discussion status to closed

Reproducible here:

% mlx_lm.generate --model "mlx-community/gemma-2-27b-it-8bit" --prompt "Hello"
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 31152.83it/s]
==========
Prompt: <bos><start_of_turn>user
Hello<end_of_turn>
<start_of_turn>model

<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
==========
Prompt: 0.538 tokens-per-sec
Generation: 1.840 tokens-per-sec

% python3 prince.py 
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 34820.64it/s]
==========
Prompt: hello
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
==========
Prompt: 0.124 tokens-per-sec
Generation: 2.043 tokens-per-sec

Yep, very bad experience.
It doesn't work, but someone still tells you it works.

The example code should work fine:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)

Did you really test the code?

Very bad experience.
It doesn't work, but someone still tells you it works.

I have previously noticed differences with mlx-vlm (and PaliGemma) vs. the official demo on HF as well, but I didn't have time to pursue this further. Perhaps there is an underlying MLX issue? I am using macOS 14.3 on M3 Max.

By contrast, the 9B-FP16 variant does work:

% mlx_lm.generate --model "mlx-community/gemma-2-9b-it-fp16" --prompt "Hello"
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 17614.90it/s]

==========
Prompt: <bos><start_of_turn>user
Hello<end_of_turn>
<start_of_turn>model

Hello! 👋

How can I help you today? 😊

==========
Prompt: 6.337 tokens-per-sec
Generation: 13.758 tokens-per-sec

MLX Community org

I'm sorry @cloudyu @ndurner,

It was an oversight on my part.

There is a tiny bug with the 27B version, and it should be fixed soon:
https://github.com/ml-explore/mlx-examples/pull/857

prince-canuma changed discussion status to open
MLX Community org

Fixed ✅

pip install -U mlx-lm
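
After upgrading, re-running the command from earlier in this thread should produce actual text instead of <pad> tokens (a quick sanity check, not a guarantee for every prompt):

mlx_lm.generate --model "mlx-community/gemma-2-27b-it-8bit" --prompt "Hello"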

prince-canuma changed discussion status to closed

This is an issue again. The output is all <pad> again as of version 0.19.1; it only works up to 0.19.0.
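
A possible stopgap, assuming 0.19.0 really is the last working release, is to pin mlx-lm to that version until the regression is fixed:

pip install "mlx-lm==0.19.0"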
