nm-testing/OLMoE-1B-7B-0924-Instruct-FP8

compressed-tensors

mgoin committed on Sep 20

Commit 8df500d • 1 Parent(s): 80fe67f

Create README.md

Files changed (1)

README.md +96 -0

	@@ -0,0 +1,96 @@

+```
+lm_eval --model vllm --model_args pretrained=/home/mgoin/code/llm-compressor/examples/quantizing_moe/OLMoE-1B-7B-0924-Instruct-FP8,tensor_parallel_size=1,trust_remote_code=True --tasks gsm8k --num_fewshot 5 --batch_size auto
+vllm (pretrained=/home/mgoin/code/llm-compressor/examples/quantizing_moe/OLMoE-1B-7B-0924-Instruct-FP8,tensor_parallel_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
+|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3510|±  |0.0131|
+|     |       |strict-match    |     5|exact_match|↑  |0.3389|±  |0.0130|
+```
+## Creation
+```python
+import torch
+from datasets import load_dataset
+from transformers import AutoTokenizer
+from llmcompressor.modifiers.quantization import QuantizationModifier
+from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+# Select a Mixture of Experts (MoE) model for quantization.
+MODEL_ID = "allenai/OLMoE-1B-7B-0924-Instruct"
+model = SparseAutoModelForCausalLM.from_pretrained(
+    MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True
+)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+# Select calibration dataset.
+# It's recommended to use more calibration samples for MoE models so that every expert is hit.
+DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+DATASET_SPLIT = "train_sft"
+NUM_CALIBRATION_SAMPLES = 2048
+MAX_SEQUENCE_LENGTH = 2048
+# Load dataset and preprocess.
+ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+def preprocess(example):
+    return {
+        "text": tokenizer.apply_chat_template(
+            example["messages"],
+            tokenize=False,
+        )
+    }
+ds = ds.map(preprocess)
+# Tokenize inputs.
+def tokenize(sample):
+    return tokenizer(
+        sample["text"],
+        padding=False,
+        max_length=MAX_SEQUENCE_LENGTH,
+        truncation=True,
+        add_special_tokens=False,
+    )
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+# Define an llmcompressor recipe for FP8 W8A8 quantization.
+# The MoE gate layers are sensitive to quantization, so they are added to the ignore
+# list and stay at their original precision, along with lm_head.
+recipe = [
+    QuantizationModifier(
+        targets="Linear",
+        scheme="FP8",
+        ignore=["lm_head", "re:.*mlp.gate$"],
+    ),
+]
+SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8"
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    save_compressed=True,
+    output_dir=SAVE_DIR,
+)
+print("========== SAMPLE GENERATION ==============")
+SAMPLE_INPUT = ["I love quantization because"]
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+inputs = tokenizer(SAMPLE_INPUT, return_tensors="pt", padding=True).to(model.device)
+output = model.generate(**inputs, max_length=50)
+text_output = tokenizer.batch_decode(output)
+print(text_output)
+```
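
The `ignore` list in the recipe mixes an exact module name (`lm_head`) with a `re:`-prefixed pattern (`re:.*mlp.gate$`). The sketch below illustrates which modules such a list would leave unquantized. It is not llmcompressor's own matching code: it assumes that a `re:` entry is treated as a Python regular expression matched from the start of the module name and that any other entry must equal the name exactly. The module names are hypothetical, written to resemble an OLMoE-style layout, and `is_ignored` is a helper invented for this illustration.

```python
import re

# Ignore entries copied from the recipe above.
IGNORE = ["lm_head", "re:.*mlp.gate$"]

# Hypothetical module names (assumption), resembling an OLMoE-style layout.
MODULE_NAMES = [
    "lm_head",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.mlp.experts.0.up_proj",
    "model.layers.0.mlp.experts.0.down_proj",
]


def is_ignored(name, ignore):
    """Return True if `name` matches any entry in `ignore`.

    Assumption: entries starting with "re:" are regexes matched from the
    start of the name; every other entry must equal the name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[len("re:"):], name):
                return True
        elif entry == name:
            return True
    return False


skipped = [n for n in MODULE_NAMES if is_ignored(n, IGNORE)]
quantized = [n for n in MODULE_NAMES if not is_ignored(n, IGNORE)]

print("kept at original precision:", skipped)
print("quantized to FP8:", quantized)
```

Under these assumptions the trailing `$` is what separates the router (`mlp.gate`) from each expert's `gate_proj`: the router is skipped, while the expert projections are still quantized. The unescaped dots in the pattern match any character, so `mlp\.gate` would be a stricter variant.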