The generated output is strange
#11 · by tangpeng · opened
I use the following code to run the model, but I get strange output.
Is anyone else seeing this?
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "mistralai/Mixtral-8x7B-v0.1-GPTQ-8bit"  # gptq-8bit-128g-actorder_True branch
# To use a different branch, change revision
# For example: revision="gptq-4bit-128g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    offload_buffers=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "Write a story about llamas"
system_message = "You are a story writing assistant"  # note: currently unused in the prompt
prompt_template = f'''{prompt}
'''

print("\n\n*** Generate:")
inputs = tokenizer(prompt_template, return_tensors='pt')
inputs = {k: v.to('cuda') for k, v in inputs.items()}
output = model.generate(**inputs, max_length=50)
print(tokenizer.decode(output[0]))
Output:
/home/tp/miniconda3/envs/moe-infinity/lib/python3.9/site-packages/transformers/modeling_utils.py:4225: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/home/tp/miniconda3/envs/moe-infinity/lib/python3.9/site-packages/accelerate/utils/modeling.py:1363: UserWarning: Current model requires 32343407360 bytes of buffer for offloaded layers, which seems does not fit any GPU's remaining memory. If you are experiencing a OOM later, please consider using offload_buffers=True.
warnings.warn(
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
*** Generate:
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
<s> Write a story about llamas
<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
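For context, the accelerate warning above reports the buffer size needed for the offloaded layers; a quick back-of-the-envelope check against my card's 24 GB (figures taken straight from the warning and my GPU spec, nothing else assumed) confirms it cannot fit, which is why I set offload_buffers=True in the first place:

```python
# Buffer size reported by the accelerate warning vs. available GPU memory.
buffer_bytes = 32_343_407_360   # from the UserWarning in modeling.py above
gpu_bytes = 24 * 1024**3        # RTX 4090: 24 GiB

print(f"buffer needed: {buffer_bytes / 1024**3:.1f} GiB")  # buffer needed: 30.1 GiB
print(f"GPU memory:    {gpu_bytes / 1024**3:.1f} GiB")     # GPU memory:    24.0 GiB
print("fits on GPU:", buffer_bytes <= gpu_bytes)           # fits on GPU: False
```

So a large part of the model is necessarily offloaded to CPU (as the "meta device" warning also says); I suspect this offloading is related to the &lt;unk&gt; output, but I don't know how to confirm it.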
Environment:
NVIDIA RTX 4090, 24 GB memory
Package Version Editable project location
----------------------------- ----------- ---------------------------------------------
accelerate 0.29.3
aiohttp 3.9.5
aiosignal 1.3.1
alabaster 0.7.16
async-timeout 4.0.3
attrs 23.2.0
auto_gptq 0.7.1
Babel 2.14.0
certifi 2024.2.2
chardet 5.2.0
charset-normalizer 3.3.2
coloredlogs 15.0.1
datasets 2.19.0
dill 0.3.8
docutils 0.21.1
einops 0.7.0
filelock 3.13.4
flash-attn 2.5.7
frozenlist 1.4.1
fsspec 2024.3.1
gekko 1.1.1
hjson 3.1.0
huggingface-hub 0.22.2
humanfriendly 10.0
idna 3.7
imagesize 1.4.1
importlib_metadata 7.1.0
Jinja2 3.1.3
MarkupSafe 2.1.5
moe_infinity 0.0.1 /home/tp/edge_moe/baselines/MoE-Infinity-main
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
optimum 1.19.0
packaging 24.0
pandas 2.2.2
peft 0.10.0
pip 23.3.1
protobuf 5.26.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 12.0.0
pyarrow-hotfix 0.6
pydantic 1.10.12
Pygments 2.17.2
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.4.16
requests 2.31.0
rouge 1.0.1
safetensors 0.4.3
scipy 1.13.0
sentencepiece 0.2.0
setuptools 68.2.2
six 1.16.0
snowballstemmer 2.2.0
Sphinx 7.3.7
sphinxcontrib-applehelp 1.0.8
sphinxcontrib-devhelp 1.0.6
sphinxcontrib-htmlhelp 2.0.5
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.7
sphinxcontrib-serializinghtml 1.1.10
sympy 1.12
tokenizers 0.15.2
tomli 2.0.1
torch 2.2.2
tqdm 4.66.2
transformers 4.39.3
triton 2.2.0
typing_extensions 4.11.0
tzdata 2024.1
urllib3 2.2.1
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4
zipp 3.18.1
Thanks!