---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---
# OpenAssistant QLoRA Adapter for Llama-2 7B
QLoRA adapter for the Llama-2 7B (`meta-llama/Llama-2-7b-hf`) model trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.
**This adapter was created for usage with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**
## Usage
First, install `adapters`:
```
pip install -U adapters
```
Now, the model and adapter can be loaded and activated like this:
```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "AdapterHub/llama2-7b-qlora-openassistant"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
adapters.init(model)
adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
### Inference
Inference can be done via the standard generation methods built into the Transformers library.
First, we add some helper code to prompt the model properly:
```python
from transformers import StoppingCriteria


# Stop if the model starts to generate "### Human:"
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence=(12968, 29901)):
        # Token IDs that signal the stop condition when they end the output
        self.eos_sequence = list(eos_sequence)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids


def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)
    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])
    # Skip the prompt when decoding
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    return decoded[:-10] if decoded.endswith("### Human:") else decoded
```
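The stopping check in the helper compares the last token IDs of each sequence in the batch against a fixed ID sequence. The following is a minimal sketch of the same comparison on plain Python lists, so it can be tried without loading a model. The function name `ends_with_stop_sequence` and all sample IDs other than `[12968, 29901]` are made up for illustration.

```python
def ends_with_stop_sequence(batch_ids, stop_sequence=(12968, 29901)):
    """Return True if any sequence in the batch ends with stop_sequence."""
    stop = list(stop_sequence)
    # Take the last len(stop) IDs of every sequence,
    # mirroring input_ids[:, -n:].tolist() on a tensor
    tails = [seq[-len(stop):] for seq in batch_ids]
    return stop in tails


# Hypothetical ID sequences for illustration
print(ends_with_stop_sequence([[1, 835, 12968, 29901]]))  # stop IDs at the end
print(ends_with_stop_sequence([[1, 12968, 29901, 835]]))  # stop IDs present, but not at the end
```

Because the membership test runs over all rows, the criterion fires as soon as any sequence in the batch ends with the stop IDs. That is sufficient for the single-prompt helper above, but batched generation would likely need a per-sequence check.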
Now, to prompt the model:
```python
prompt_model(model, "Please explain NLP in simple terms.")
```
### Weight merging
To decrease inference latency, the LoRA weights can be merged with the base model:
```python
model.merge_adapter(adapter_name)
```
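To illustrate why merging removes the extra computation, here is a small sketch of the standard LoRA formulation, in which the low-rank update is folded into the base weight as `W' = W + (alpha / r) * B @ A`. It uses toy matrices as nested lists and is not the Adapters library's implementation of `merge_adapter`; the helper names and all values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Fold the low-rank update into the base weight: W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 layer with a rank-1 update (hypothetical values)
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (r, d_in)
lora_b = [[0.5], [-0.5]]   # shape (d_out, r)
x = [[1.0], [1.0]]         # input as a column vector

merged = merge_lora(w, lora_a, lora_b, alpha=2, r=1)

# Unmerged forward pass: W x + scale * B (A x), which needs two extra matmuls
scale = 2 / 1
bax = matmul(lora_b, matmul(lora_a, x))
unmerged_out = [[wx[0] + scale * d[0]] for wx, d in zip(matmul(w, x), bax)]

# After merging, a single matmul gives the same result
print(matmul(merged, x) == unmerged_out)
```

The merged layer produces the same output with one matrix multiplication instead of three, which is where the latency saving comes from.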
## Architecture & Training
**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)**.
The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf):
- `r=64`, `alpha=16`
- LoRA modules added to the output, intermediate, and all self-attention (Q, K, V) linear layers
The adapter is trained similarly to the Guanaco models proposed in the paper:
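
As a rough sketch of what `r=64` and `alpha=16` imply under the standard LoRA formulation: the low-rank update is scaled by `alpha / r`, and each adapted linear layer gains `r * (d_in + d_out)` trainable parameters. The helper names below are hypothetical, and the example covers only a single square projection at Llama-2 7B's hidden size of 4096, not the full adapter.

```python
def lora_scaling(alpha, r):
    """Scaling factor applied to the low-rank update B @ A."""
    return alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


r, alpha = 64, 16
hidden = 4096  # hidden size of Llama-2 7B

print(lora_scaling(alpha, r))               # 0.25
print(lora_param_count(hidden, hidden, r))  # 524288 trainable parameters
print(hidden * hidden)                      # 16777216 frozen weights in the same projection
```

For one 4096 x 4096 projection, the LoRA matrices add about 3% as many parameters as the frozen weight they adapt.
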
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, LR: 2e-4, max steps: 1875
- Sequence length: 512
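
The hyperparameters above imply a simple training budget, sketched below. This assumes that 16 is the effective batch size (no gradient accumulation on top) and that every sequence is filled to the full length, so the token count is an upper bound. The `epochs` helper and the dataset size passed to it are hypothetical.

```python
batch_size = 16
max_steps = 1875
seq_len = 512

# Number of training sequences processed over the whole run
sequences_seen = batch_size * max_steps

# Upper bound on tokens processed, if every sequence were exactly seq_len long
max_tokens_seen = sequences_seen * seq_len


def epochs(dataset_size):
    """Approximate passes over a dataset with dataset_size training examples."""
    return sequences_seen / dataset_size


print(sequences_seen)    # 30000
print(max_tokens_seen)   # 15360000
print(epochs(10_000))    # 3.0, for a hypothetical dataset of 10,000 examples
```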