nlpulse/gpt-j-6b-english_quotes


egon-nlpulse committed on Jul 11, 2023

Commit 74852cc • 1 Parent(s): 440fc16

adjustments

Files changed (1)

README.md +57 -1


@@ -5,4 +5,60 @@ datasets:
 language:
 - en
 library_name: transformers
----
+---
+# 4-bit quantization: 4923 MiB (about 4.8 GiB) of GPU memory used during inference
+```
+$ nvidia-smi
++-----------------------------------------------------------------------------+
+| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
+|-------------------------------+----------------------+----------------------+
+| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+|                               |                      |               MIG M. |
+|===============================+======================+======================|
+|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
+| 37%   70C    P2   163W / 170W |   4923MiB / 12288MiB |     91%      Default |
+|                               |                      |                  N/A |
++-------------------------------+----------------------+----------------------+
+```
+```python
+import os
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
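+# Default checkpoint; set the "model_path" environment variable to override it.
+model_path = "nlpulse/gpt-j-6b-english_quotes"
+model_path = os.environ.get("model_path", model_path)
+# Tokenizer: reuse the end-of-sequence token for padding.
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+tokenizer.pad_token = tokenizer.eos_token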
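+# 4-bit quantization config (bitsandbytes)
+quant_config = BitsAndBytesConfig(
+    load_in_4bit=True,                      # store the weights in 4 bits
+    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
+    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
+    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matrix multiplications in bfloat16
+)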
+# Load the quantized model; device_map={"": 0} places every module on GPU 0.
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    quantization_config=quant_config,
+    device_map={"": 0},
+)
+# Inference: generate up to 20 new tokens for each prompt.
+device = "cuda"
+text_list = ["Ask not what your country", "Be the change that", "You only live once, but", "I'm selfish, impatient and"]
+for text in text_list:
+    inputs = tokenizer(text, return_tensors="pt").to(device)
+    # Pass pad_token_id explicitly so generate() does not warn about a missing pad token.
+    outputs = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
+    print('>> ', text, " => ", tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
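
To put the `nvidia-smi` figure in context, the arithmetic below gives a rough estimate of how much memory the weights alone would need at different precisions. It is a sketch under stated assumptions, not a measurement: `N_PARAMS` is a round 6 billion chosen for illustration, the 0.127 bits per weight for double quantization is the overhead reported in the QLoRA paper, and the estimate pretends every weight is quantized, which may not hold for embeddings or the output head.

```python
# Back-of-the-envelope weight-memory estimate; all inputs are assumptions.
N_PARAMS = 6.0e9  # assumed round parameter count, for illustration only

MIB = 1024 ** 2
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_param: float) -> float:
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_param / 8


# NF4 stores 4 bits per weight; double quantization is assumed to add
# about 0.127 bits per weight for the quantization constants.
estimates = {
    "fp32": weight_bytes(N_PARAMS, 32),
    "bf16": weight_bytes(N_PARAMS, 16),
    "nf4 + double quant": weight_bytes(N_PARAMS, 4 + 0.127),
}

for name, size in estimates.items():
    print(f"{name:>20}: {size / GIB:5.2f} GiB")

# Memory reported by nvidia-smi in the README, converted to other units.
observed_mib = 4923
observed_bytes = observed_mib * MIB
print(f"observed: {observed_bytes / GIB:.2f} GiB = {observed_bytes / 1e9:.2f} GB")
```

Under these assumptions the 4-bit weights come to a little under 3 GiB, so the remaining memory in the 4923 MiB reading would plausibly be taken by any modules left in higher precision, the CUDA context, activations, and the generation cache. Note that `nvidia-smi` reports total usage on the device, so other processes on the same GPU would also be included.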