dfurman / Llama-2-7B-Instruct-v0.1

Tags: Text Generation, PEFT, Safetensors, llama-2, Eval Results

dfurman committed on Sep 26, 2023

Commit b66dee6 (parent: 59e80b1)

Update README.md

Files changed (1)

README.md +72 -5

README.md CHANGED

@@ -15,7 +15,7 @@ base_model: meta-llama/Llama-2-7b-hf
 This instruction model was built via parameter-efficient QLoRA finetuning of [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the first 5k rows of [ehartford/dolphin](https://huggingface.co/datasets/ehartford/dolphin) and the first 5k rows of [garage-bAInd/Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). Finetuning was executed on 1x A100 (40 GB SXM) for roughly xx hours on the [Lambda Labs](https://cloud.lambdalabs.com/instances) platform.
-### Benchmark metrics
+## Benchmark metrics
 | Metric                | Value |
 |-----------------------|-------|
@@ -27,7 +27,7 @@ This instruction model was built via parameter-efficient QLoRA finetuning of [Ll
 We use state-of-the-art [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to run the benchmark tests above, using the same version as Hugging Face's [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
-### Helpful links
+## Helpful links
 * Model license: coming
 * Basic usage: coming
@@ -51,9 +51,76 @@ While great efforts have been taken to clean the pretraining data, it is possibl
 ## How to use
-coming
-### Runtime tests
+* [notebook](assets/basic_inference_llama_2_dolphin.ipynb)
+```python
+!pip install -q -U huggingface_hub peft transformers torch accelerate bitsandbytes
+```
+```python
+from huggingface_hub import notebook_login
+import torch
+from peft import PeftModel, PeftConfig
+from transformers import (
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    BitsAndBytesConfig,
+    pipeline,
+)
+notebook_login()
+```
+```python
+peft_model_id = "dfurman/llama-2-7b-instruct-peft"
+config = PeftConfig.from_pretrained(peft_model_id)
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+)
+model = AutoModelForCausalLM.from_pretrained(
+    config.base_model_name_or_path,
+    quantization_config=bnb_config,
+    use_auth_token=True,
+    device_map="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_fast=True)
+tokenizer.pad_token = tokenizer.eos_token
+model = PeftModel.from_pretrained(model, peft_model_id)
+format_template = "You are a helpful assistant. {query}\n"
+```
+```python
+# First, format the prompt
+query = "Tell me a recipe for vegan banana bread."
+prompt = format_template.format(query=query)
+# Inference can be done using model.generate
+print("\n\n*** Generate:")
+input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
+with torch.autocast("cuda", dtype=torch.bfloat16):
+    output = model.generate(
+        input_ids=input_ids,
+        max_new_tokens=512,
+        do_sample=True,
+        temperature=0.7,
+        return_dict_in_generate=True,
+        eos_token_id=tokenizer.eos_token_id,
+        pad_token_id=tokenizer.pad_token_id,
+        repetition_penalty=1.2,
+    )
+print(tokenizer.decode(output["sequences"][0], skip_special_tokens=True))
+```
+## Runtime tests
 coming
@@ -85,6 +152,6 @@ The following `bitsandbytes` quantization config was used during training:
 - bnb_4bit_use_double_quant: False
 - bnb_4bit_compute_dtype: bfloat16
-### Framework versions
+## Framework versions
 - PEFT 0.6.0.dev0
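The README's generation example decodes `output["sequences"][0]` in full. For a decoder-only model such as Llama-2, that sequence starts with the prompt tokens, so the printed text repeats the prompt before the answer. The sketch below is not part of the model card. It assumes the same prompt template as the README and introduces two hypothetical helpers, `build_prompt` and `new_token_ids`, which keep only the newly generated tokens before decoding.

```python
# Hypothetical helpers, not part of the model card. They assume the README's
# prompt template and a decoder-only model whose generate() output begins
# with the prompt tokens.
FORMAT_TEMPLATE = "You are a helpful assistant. {query}\n"


def build_prompt(query: str) -> str:
    """Fill the README's prompt template with a user query."""
    return FORMAT_TEMPLATE.format(query=query)


def new_token_ids(sequence, prompt_length: int):
    """Drop the echoed prompt tokens from one generated sequence.

    Works for Python lists and 1-D torch tensors alike, since both
    support len() and slicing.
    """
    if prompt_length > len(sequence):
        raise ValueError("prompt_length is longer than the generated sequence")
    return sequence[prompt_length:]


if __name__ == "__main__":
    print(repr(build_prompt("Tell me a recipe for vegan banana bread.")))
    # Toy ids standing in for a real generate() output: 3 prompt tokens, 2 new.
    print(new_token_ids([10, 11, 12, 20, 21], prompt_length=3))  # → [20, 21]
```

In the notebook, the final line could become `print(tokenizer.decode(new_token_ids(output["sequences"][0], input_ids.shape[-1]), skip_special_tokens=True))`. Slicing by token count avoids depending on the decoded prompt reproducing the original string character for character.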