megrisdal and reach-vb (HF staff) committed
Commit
76396a9
Parent: 2d74922

Update README.md (#34)


- Update README.md (b52ae94a65513191b373f1270cd50f428d883a62)
- Update README.md (2c69e4b0a46571172f86ef79bdd95357e326d26e)


Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +124 -25
README.md CHANGED
@@ -43,13 +43,37 @@ state of the art AI models and helping foster innovation for everyone.
 
 ### Usage
 
-Below we share some code snippets on how to get quickly started with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant for your usecase.
+Below we share some code snippets on how to get quickly started with running the model. First, install the Transformers library with:
+```sh
+pip install -U transformers
+```
 
+Then, copy the snippet from the section that is relevant for your usecase.
 
-#### Running the model on a single / multi GPU
-
-> [!IMPORTANT]
-> Given the model instabilities with SDPA/ FA2, by default, the model inference would utilise `eager` attention.
+#### Running with the `pipeline` API
+
+```python
+import torch
+from transformers import pipeline
+
+pipe = pipeline(
+    "text-generation",
+    model="google/gemma-2-27b-it",
+    model_kwargs={"torch_dtype": torch.bfloat16},
+    device="cuda",  # replace with "mps" to run on a Mac device
+)
+
+messages = [
+    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
+]
+
+outputs = pipe(messages, max_new_tokens=256)
+assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
+print(assistant_response)
+# Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas. I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world. So, what be yer pleasure, eh? 🦜
+```
+
+#### Running the model on a single / multi GPU
 
 ```python
 # pip install accelerate
@@ -60,13 +84,24 @@ tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
 model = AutoModelForCausalLM.from_pretrained(
     "google/gemma-2-27b-it",
     device_map="auto",
-    torch_dtype=torch.bfloat16
+    torch_dtype=torch.bfloat16,
 )
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
 
-outputs = model.generate(**input_ids)
+outputs = model.generate(**input_ids, max_new_tokens=32)
+print(tokenizer.decode(outputs[0]))
+```
+
+You can ensure the correct chat template is applied by using `tokenizer.apply_chat_template` as follows:
+```python
+messages = [
+    {"role": "user", "content": "Write me a poem about Machine Learning."},
+]
+input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")
+
+outputs = model.generate(**input_ids, max_new_tokens=256)
 print(tokenizer.decode(outputs[0]))
 ```
 
@@ -86,19 +121,32 @@ from transformers import AutoTokenizer, AutoModelForCausalLM
 tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
 model = AutoModelForCausalLM.from_pretrained(
     "google/gemma-2-27b-it",
-    device_map="auto"
+    device_map="auto",
 )
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
 
-outputs = model.generate(**input_ids)
+outputs = model.generate(**input_ids, max_new_tokens=32)
 print(tokenizer.decode(outputs[0]))
 ```
 
+#### Running the model through a CLI
+
+The [local-gemma](https://github.com/huggingface/local-gemma) repository contains a lightweight wrapper around Transformers
+for running Gemma 2 through a command line interface, or CLI. Follow the [installation instructions](https://github.com/huggingface/local-gemma#cli-usage)
+for getting started, then launch the CLI through the following command:
+
+```shell
+local-gemma --model 27b --preset speed
+```
+
 #### Quantized Versions through `bitsandbytes`
 
-* _Using 8-bit precision (int8)_
+<details>
+<summary>
+Using 8-bit precision (int8)
+</summary>
 
 ```python
 # pip install bitsandbytes accelerate
@@ -109,16 +157,21 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
 tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
 model = AutoModelForCausalLM.from_pretrained(
     "google/gemma-2-27b-it",
-    quantization_config=quantization_config)
+    quantization_config=quantization_config,
+)
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
 
-outputs = model.generate(**input_ids)
+outputs = model.generate(**input_ids, max_new_tokens=32)
 print(tokenizer.decode(outputs[0]))
 ```
+</details>
 
-* _Using 4-bit precision_
+<details>
+<summary>
+Using 4-bit precision
+</summary>
 
 ```python
 # pip install bitsandbytes accelerate
@@ -129,33 +182,79 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True)
 tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
 model = AutoModelForCausalLM.from_pretrained(
     "google/gemma-2-27b-it",
-    quantization_config=quantization_config)
+    quantization_config=quantization_config,
+)
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
 
-outputs = model.generate(**input_ids)
+outputs = model.generate(**input_ids, max_new_tokens=32)
 print(tokenizer.decode(outputs[0]))
 ```
+</details>
 
-#### Other optimizations
+#### Advanced Usage
 
-* _Flash Attention 2_
-
-> [!WARNING]
-> Gemma 2 is currently incompatible with Flash Attention/ SDPA, using it might result in unreliable generations. Use at your own risk.
-
-First make sure to install `flash-attn` in your environment `pip install flash-attn`
-
-```diff
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    torch_dtype=torch.float16,
-+ attn_implementation="flash_attention_2"
-).to(0)
+<details>
+<summary>
+Torch compile
+</summary>
+
+[Torch compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) is a method for speeding-up the
+inference of PyTorch modules. The Gemma-2 model can be run up to 6x faster by leveraging torch compile.
+
+Note that two warm-up steps are required before the full inference speed is realised:
+
+```python
+import os
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+from transformers import AutoTokenizer, Gemma2ForCausalLM
+from transformers.cache_utils import HybridCache
+import torch
+
+torch.set_float32_matmul_precision("high")
+
+# load the model + tokenizer
+tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
+model = Gemma2ForCausalLM.from_pretrained("google/gemma-2-27b-it", torch_dtype=torch.bfloat16)
+model.to("cuda")
+
+# apply the torch compile transformation
+model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
+
+# pre-process inputs
+input_text = "The theory of special relativity states "
+model_inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
+prompt_length = model_inputs.input_ids.shape[1]
+
+# set-up k/v cache
+past_key_values = HybridCache(
+    config=model.config,
+    max_batch_size=1,
+    max_cache_len=model.config.max_position_embeddings,
+    device=model.device,
+    dtype=model.dtype
+)
+
+# enable passing kv cache to generate
+model._supports_cache_class = True
+model.generation_config.cache_implementation = None
+
+# two warm-up steps
+for idx in range(2):
+    outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
+    past_key_values.reset()
+
+# fast run
+outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
+For more details, refer to the [Transformers documentation](https://huggingface.co/docs/transformers/main/en/llm_optims?static-kv=basic+usage%3A+generation_config).
+
+</details>
+
 ### Chat Template
 
  The instruction-tuned models use a chat template that must be adhered to for conversational use.
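For reference, a minimal sketch of rendering that template with `tokenizer.apply_chat_template` (the exact rendered string comes from the tokenizer's bundled template; the `<start_of_turn>` markers shown in the comment are indicative of the Gemma 2 format):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")

messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]

# Render the prompt as a string (instead of token IDs) to inspect the applied template.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Indicative output:
# <bos><start_of_turn>user
# Write me a poem about Machine Learning.<end_of_turn>
# <start_of_turn>model
```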