
How do I use AutoTokenizer to run "taide/Llama3-TAIDE-LX-8B-Chat-Alpha1-4bit" locally?

#3
by Stephy0215 - opened

When I run the following line locally:

tokenizer = AutoTokenizer.from_pretrained("taide/Llama3-TAIDE-LX-8B-Chat-Alpha1-4bit")

I get this error:

OSError: taide/Llama3-TAIDE-LX-8B-Chat-Alpha1-4bit does not appear to have a file named config.json. Checkout 'https://huggingface.co/taide/Llama3-TAIDE-LX-8B-Chat-Alpha1-4bit/main' for available files.

Is there another way to run this model locally? Thank you.

P.S. I am on a Mac with an Intel chip, so I cannot use LM Studio to run this model.
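Editor's note: the OSError above is what one would expect if the -4bit repository holds GGUF weights rather than a Transformers checkpoint, since AutoTokenizer.from_pretrained looks for config.json. The sketch below shows that check on a list of file names. It rests on that assumption, suggest_loader is an illustrative helper and not part of any library, and the file names in the usage lines are hypothetical rather than the repository's real listing.

```python
from typing import Iterable


def suggest_loader(filenames: Iterable[str]) -> str:
    """Guess which runtime fits a model repository from its file names."""
    names = [name.lower() for name in filenames]
    # A Transformers checkpoint ships a config.json next to its weights.
    has_config = any(name.split("/")[-1] == "config.json" for name in names)
    # GGUF weights are single files with a .gguf extension.
    has_gguf = any(name.endswith(".gguf") for name in names)
    if has_config:
        return "transformers"
    if has_gguf:
        return "gguf-runtime"
    return "unknown"


# Hypothetical listings, for illustration only.
print(suggest_loader(["README.md", "model-q4_k_m.gguf"]))
print(suggest_loader(["config.json", "tokenizer.json", "model.safetensors"]))
```

A repository that falls into the "gguf-runtime" case is normally loaded with a GGUF runtime such as llama.cpp, which can run on a CPU, rather than with AutoTokenizer or AutoModelForCausalLM.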

TAIDE org

Hello,

Please refer to: https://huggingface.co/taide/TAIDE-LX-7B-Chat-4bit/discussions/3
Set load_in_4bit=True when loading the model.

Best regards.

Got it!
Thank you very much. With that change it runs.

However, I got an error when running the following line of code:

code : model = AutoModelForCausalLM.from_pretrained('taide/Llama3-TAIDE-LX-8B-Chat-Alpha1', device_map="auto", load_in_4bit=True)

error : ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.

How can I resolve this? Thank you!

Note:
Here are the Python and package versions I am using, for your reference:
python 3.9
transformers==4.30
accelerate==0.30.0

TAIDE org

Hello,

You can load the model through BitsAndBytes.

Please refer to: https://huggingface.co/google/flan-ul2/discussions/8
(As that discussion explains, the error occurs because your GPU memory is not large enough to hold the model. You can either increase GPU memory or offload some of the modules to CPU memory.)

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from torch import bfloat16

# https://huggingface.co/docs/hub/security-tokens#user-access-tokens
my_token = "***********************************************************************"  # replace this with your own access token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=bfloat16)

# load model
model_name = "taide/Llama3-TAIDE-LX-8B-Chat-Alpha1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
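# load_in_4bit is already set in bnb_config; passing it again as a separate
# keyword can raise an error in newer transformers releases.
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=True,
                                             quantization_config=bnb_config,
                                             token=my_token)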
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# prepare prompt
question = "臺灣最高的建築物是?"
chat = [
    {"role": "user", "content": f"{question}"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# generate response
x = pipe(f"{prompt}", max_new_tokens=1024)
print(f"TAIDE: {x}")

Best regards.
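Editor's note: the remark about GPU memory can be made concrete with rough arithmetic. The sketch below estimates the memory needed for the weights alone at different precisions, assuming "8B" means about 8 billion parameters. It ignores activations, the KV cache, and quantization overhead, so real usage will be higher; weight_memory_gib is an illustrative helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bits_per_param / 8 / 2**30


PARAMS = 8e9  # assumption: "8B" is roughly 8 billion parameters

for label, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(PARAMS, bits):.1f} GiB")
```

Under these assumptions, 4-bit weights need a little under 4 GiB while fp32 weights need close to 30 GiB, which is why modules that spill over to the CPU in 32-bit are expensive.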

Hello, thank you very much for your help!
You mentioned that the GPU memory is not large enough to load the model, and that I can either increase GPU memory or offload some of the modules to CPU memory. Does this mean a GPU is required to run this model? I would like to run it on my computer's CPU.

After loading the model with BitsAndBytes, I still get the following error:

ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.

Hello,

  • You can run on the CPU only by changing:
device_map="cpu"
  • Try llm_int8_enable_fp32_cpu_offload:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=bfloat16)

Best regards.
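Editor's note: the ValueError also asks for a custom device_map when some modules are offloaded. The sketch below builds such a dictionary. It assumes the usual module names of the Llama architecture in transformers (model.embed_tokens, model.layers.N, model.norm, lm_head) and 32 decoder layers for the 8B model; both should be checked against the loaded model, since module names can differ between transformers versions. build_device_map is a hypothetical helper, not a library function.

```python
from typing import Dict, Union

Device = Union[int, str]


def build_device_map(num_layers: int = 32,
                     gpu_layers: int = 16,
                     gpu: Device = 0) -> Dict[str, Device]:
    """Place the first `gpu_layers` decoder layers on `gpu`, the rest on CPU."""
    if not 0 <= gpu_layers <= num_layers:
        raise ValueError("gpu_layers must be between 0 and num_layers")
    # Assumed module names for a Llama-style causal LM.
    device_map: Dict[str, Device] = {
        "model.embed_tokens": gpu if gpu_layers > 0 else "cpu",
    }
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = gpu if i < gpu_layers else "cpu"
    # Keep the final norm and output head with the last decoder layer.
    tail_device: Device = gpu if gpu_layers == num_layers else "cpu"
    device_map["model.norm"] = tail_device
    device_map["lm_head"] = tail_device
    return device_map


device_map = build_device_map(num_layers=32, gpu_layers=16)
print(device_map["model.layers.0"], device_map["model.layers.31"])
```

The resulting dictionary would be passed as device_map= to from_pretrained together with a quantization_config that enables llm_int8_enable_fp32_cpu_offload; how many layers fit on the GPU depends on the hardware.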

Hello! Thank you very much for your help.
However, the rest of the code ran into another problem.

code:
question = "臺灣最高的建築物是?"
chat = [
    {"role": "user", "content": f"{question}"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False)

error:
AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'apply_chat_template'

transformers version used: 4.30
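Editor's note: this AttributeError is consistent with chat templates having been introduced in transformers 4.34, which is newer than the 4.30 reported here. Below is a minimal sketch of a version guard using only the standard library. supports_chat_template and parse_version are illustrative helpers, and the simple parser ignores pre-release suffixes.

```python
from importlib import metadata
from typing import Optional, Tuple

# Assumption: apply_chat_template is available from transformers 4.34 onward.
MIN_CHAT_TEMPLATE = (4, 34)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '4.34.1' into (4, 34, 1); stop at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_chat_template(version: str) -> bool:
    """True if the given transformers version should have apply_chat_template."""
    return parse_version(version)[:2] >= MIN_CHAT_TEMPLATE


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


print(supports_chat_template("4.30"))
print(supports_chat_template("4.34.1"))
```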

I read in this post that upgrading transformers resolves this problem, but after upgrading transformers to 4.34.1, a different problem appeared.

code:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=bfloat16)

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             load_in_4bit=True,
                                             device_map="cpu",
                                             trust_remote_code=True,
                                             quantization_config=bnb_config)

error: ImportError: Using load_in_8bit=True requires Accelerate: pip install accelerate and the latest version of bitsandbytes pip install -i https://test.pypi.org/simple/ bitsandbytes or pip install bitsandbytes`

However, I am sure that accelerate and bitsandbytes are installed (accelerate==0.30.0, bitsandbytes==0.42.0), and I never set load_in_8bit=True.

I really cannot tell why this error appears.
I would appreciate your help again. Thank you!
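Editor's note: one possible explanation, not confirmed in this thread, is that some transformers releases raise this ImportError for both 8-bit and 4-bit loading whenever their bitsandbytes availability check fails, and that check can fail on a machine without a CUDA GPU even when the packages are installed. The sketch below reports what is importable and whether CUDA is visible. diagnose_quantization_env is an illustrative helper, and it only gathers facts; it does not reproduce the library's own check.

```python
import importlib.util
from typing import Dict


def diagnose_quantization_env() -> Dict[str, bool]:
    """Report which packages are importable and whether CUDA is visible."""
    report = {
        pkg: importlib.util.find_spec(pkg) is not None
        for pkg in ("torch", "accelerate", "bitsandbytes")
    }
    report["cuda"] = False
    if report["torch"]:
        try:
            import torch
            report["cuda"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken torch install is treated the same as "no CUDA".
            pass
    return report


print(diagnose_quantization_env())
```

If "cuda" comes back False, loading the model without a quantization_config (memory permitting) or running a GGUF file with a CPU runtime may be more practical than bitsandbytes quantization.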

TAIDE org

Hello,

Please refer to the solution mentioned here:
https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g/discussions/11

Best regards.
