Links for sample inference and database schema are dead
https://huggingface.co/defog/sqlcoder-34b-alpha#using-sqlcoder
The ones right under this heading.
@samvedya Just download the exl2 from here: https://huggingface.co/waldie/sqlcoder-34b-alpha-4bpw-h6-exl2
(Works for my 3090s)
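If you would rather script that download instead of grabbing the files by hand, here is a minimal sketch using huggingface_hub (the local_dir path is just an example):

from huggingface_hub import snapshot_download

# Fetch the 4bpw EXL2 quant; note that EXL2 weights need an exllamav2-based
# loader (e.g. text-generation-webui), not plain transformers
snapshot_download(
    repo_id="waldie/sqlcoder-34b-alpha-4bpw-h6-exl2",
    local_dir="models/sqlcoder-34b-alpha-4bpw-h6-exl2"
)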
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "defog/sqlcoder-34b-alpha"

# Quantize to 4-bit via bitsandbytes so the 34B model fits on a single 24 GB GPU
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
    use_cache=True,
    quantization_config=quantization_config
)
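That snippet only loads the model; a rough generation step to pair with it could look like the sketch below. The prompt string is just a placeholder (the schema + question prompt from the dead-linked sample inference would go there), and the decoding settings are assumptions, not the official ones.

# Placeholder: paste the schema + question prompt here
prompt = "..."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=False,  # greedy decoding keeps the generated SQL deterministic
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output[0], skip_special_tokens=True))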
This is a better way
@samvedya Just download the exl2 from here: https://huggingface.co/waldie/sqlcoder-34b-alpha-4bpw-h6-exl2
(Works for my 3090s)
Hey, for reference, what kind of t/s do you get for your prompts? (Or how long does it take to get an output from the model?)
How big is your schema, in terms of number of columns and tables?
(I'm asking about prompts whose outputs are around 4 lines long.)
Hi, all.
I ran the inference script on the sqlcoder-34b-alpha model, but no SQL result was returned. Any ideas, please?
Thanks a lot.
@samvedya : were you able to run the model with 8-bit quantization on an RTX 4090 (24 GB VRAM) with the above settings?
@samvedya : can you please share the code (or GitHub repo) and the versions of the libraries you used?
Also, I want to run the code in a Windows environment. Which environment did you use, and were there any other specific changes you made?
# Latest version of every library as of today
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# import sqlparse  # if you want to extract the SQL from the raw LLM output

model_id = "codellama/CodeLlama-34b-Instruct-hf"

# Load the model in 4-bit via bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)

prompt = ""  # put your schema + question prompt here
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to("cuda")

output = model.generate(
    inputs["input_ids"],
    max_new_tokens=128,
    do_sample=True,
    top_p=0.9,
    temperature=0.1,
    repetition_penalty=1.05
)

# The output tensor includes the echoed prompt tokens
output = output[0].to("cpu")
string_output = tokenizer.decode(output)
print(string_output)
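On the commented-out sqlparse import above: a small post-processing sketch for pulling the query out of the raw decoded text. It assumes the SQL is simply whatever follows the echoed prompt, which may need adjusting for your prompt format and any special tokens in the decoded string.

import sqlparse

# Roughly drop the echoed prompt; special tokens may shift the offset slightly
generated = string_output[len(prompt):]

# Pretty-print the (assumed) SQL for readability
print(sqlparse.format(generated, reindent=True, keyword_case="upper"))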
Hi @samvedya : I was able to run the code with 4-bit quantization on Windows, using a specific bitsandbytes build available here:
https://jllllll.github.io/bitsandbytes-windows-webui/bitsandbytes/
Can you let me know if you were able to run the model in 8-bit, and what config you used for it?
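For reference, an 8-bit attempt would just swap the flag in the same config, roughly as sketched below. This is untested here: a 34B model in 8-bit needs around 34 GB for the weights alone, so on a single 24 GB card it would likely have to offload layers to the CPU.

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True  # allow spilling layers to CPU if VRAM runs out
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)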