GGML Version
Outstanding work! I just converted it to GGML, check it out if you're interested! https://huggingface.co/s3nh/LLaMA-2-7B-32K-GGML
@s3nh Will your converted model run easily on Colab's CPU?
@deepakkaura26 I think so! By default you get 2 vCPUs on Colab with 13 GB of RAM, which should be enough to run the GGML versions.
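If you want to double-check what your runtime actually provides, a quick sketch like this should do it (psutil should already be available on Colab; this is just an illustration, not part of the model code):

import os
import psutil  # usually preinstalled on Colab

# Report the logical CPU count and total RAM visible to the runtime.
print("vCPUs:", os.cpu_count())
print("RAM (GB):", round(psutil.virtual_memory().total / 1e9, 1))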
@mauriceweber Actually I tried it, but whether I choose CPU or GPU, my Colab crashed five times.
Which quantization did you try? I tried the 4-bit version on Colab and could run it without problems.
from ctransformers import AutoModelForCausalLM

# Load the 4-bit quantized GGML weights directly from the hub.
model_file = "LLaMA-2-7B-32K.ggmlv3.q4_0.bin"
model = AutoModelForCausalLM.from_pretrained("s3nh/LLaMA-2-7B-32K-GGML", model_type="llama", model_file=model_file)

# Generate a continuation of the prompt.
prompt = "Whales have been living in the oceans for millions of years "
print(model(prompt, max_new_tokens=128, temperature=0.9, top_p=0.7))
EDIT: updated the snippet to load the model directly from the hub.
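If I remember correctly, ctransformers can also stream the output token by token, which is handy on slow CPUs; a minimal sketch reusing the model and prompt from above:

# Print tokens as they are produced instead of waiting for the full completion.
for text in model(prompt, max_new_tokens=128, temperature=0.9, top_p=0.7, stream=True):
    print(text, end="", flush=True)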
@mauriceweber I have used the same example that is shown on the model's page:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the full (non-quantized) model weights in float16.
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K", trust_remote_code=True, torch_dtype=torch.float16)

# Encode a prompt, generate, and decode the result.
input_context = "Your text here"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
output = model.generate(input_ids, max_length=128, temperature=0.7)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
@mauriceweber I tried to run the code you showed and it gives me the following error:
HTTPError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
    260     try:
--> 261         response.raise_for_status()
    262     except HTTPError as e:

HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main

The above exception was the direct cause of the following exception:

RepositoryNotFoundError                   Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
    291                 " make sure you are authenticated."
    292             )
--> 293             raise RepositoryNotFoundError(message, response) from e
    294
    295         elif response.status_code == 400:

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-64caab34-5bd826d76686f26a76b02644;7f562443-2822-41e5-bcd0-37c62aef99f9)
Repository Not Found for url: https://huggingface.co/api/models/LLaMA-2-7B-32K.ggmlv3.q4_0.bin/revision/main.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.
@mauriceweber I have used the same example that is shown on the model's page (the transformers snippet above).
Here you are not using the quantized (GGML) model, which is why you are running out of memory (you need around 14 GB of RAM for the 7B model in float16).
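As a rough back-of-the-envelope estimate (weights only, ignoring activations and the KV cache; the ~4.5 bits per weight for q4_0 is approximate):

# Rough weight-memory estimate for a 7B-parameter model.
params = 7e9
print("float16:", params * 2 / 1e9, "GB")                         # ~14 GB
print("q4_0 (~4.5 bits/weight):", params * 4.5 / 8 / 1e9, "GB")   # ~4 GB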
@mauriceweber I tried to run the code you showed and it gives me the following error
This error is because the model has not been downloaded yet (I was assuming you already had it downloaded to Colab) -- I adjusted the code snippet above so that the model file gets pulled directly from the repo. You can check the other model versions here.
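If you would rather download the quantized file explicitly first (for example to keep it on a mounted drive), a sketch along these lines with huggingface_hub should also work -- the repo and file names below are the ones from the GGML repo above:

from huggingface_hub import hf_hub_download
from ctransformers import AutoModelForCausalLM

# Download the quantized weights once, then load them from the local path.
local_path = hf_hub_download(repo_id="s3nh/LLaMA-2-7B-32K-GGML", filename="LLaMA-2-7B-32K.ggmlv3.q4_0.bin")
model = AutoModelForCausalLM.from_pretrained(local_path, model_type="llama")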
Let us know how it goes! :)