---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---
# OpenAssistant QLoRA Adapter for Llama-2 13B
QLoRA adapter for the Llama-2 13B (`meta-llama/Llama-2-13b-hf`) model trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.
**This adapter was created for use with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**
## Usage
First, install `adapters`:
```
pip install -U adapters
```
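Loading the base model in 4-bit as shown below also relies on `bitsandbytes` and, for `device_map="auto"`, on `accelerate`; if these are not installed yet:
```
pip install -U bitsandbytes accelerate
```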
Now, the model and adapter can be loaded and activated like this:
```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "meta-llama/Llama-2-13b-hf"
adapter_id = "AdapterHub/llama2-13b-qlora-openassistant"
# Load the base model with 4-bit NF4 quantization (the QLoRA setup)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
# Enable adapter support, then load and activate the QLoRA adapter
adapters.init(model)
adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
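As an optional sanity check, the Adapters library can print an overview of the loaded adapters and show which ones are active (assuming a reasonably recent `adapters` version):
```python
# Overview of all loaded adapters and their parameter counts
print(model.adapter_summary())
# Adapter composition that is currently active
print(model.active_adapters)
```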
### Inference
Inference can be done via the standard generation methods built into the Transformers library.
First, we add some helper code to format the prompt correctly and to stop generation once the model starts a new `### Human:` turn:
```python
from transformers import StoppingCriteria

# stop if the model starts to generate "### Human:"
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence=[12968, 29901]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids


def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)
    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])
    # skip the prompt when decoding
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    return decoded[:-10] if decoded.endswith("### Human:") else decoded
```
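The hard-coded `eos_sequence` corresponds to the Llama-2 token IDs of the `Human:` turn marker (per the comment above). To double-check it, or to adapt the stopping criterion to a different tokenizer, you can inspect the IDs yourself; note that the exact IDs depend on tokenizer settings such as leading whitespace:
```python
# Decode the hard-coded stop sequence back into text to see what it matches
print(tokenizer.decode([12968, 29901]))
# Inspect how the tokenizer splits the turn marker into token IDs
print(tokenizer.encode("Human:", add_special_tokens=False))
```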
Now, to prompt the model:
```python
prompt_model(model, "Please explain NLP in simple terms.")
```
### Weight merging
To decrease inference latency, the LoRA weights can be merged with the base model:
```python
model.merge_adapter(adapter_name)
```
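If you want to switch adapters or continue training later, the merge can be undone again (recent `adapters` versions provide `reset_adapter()` for this):
```python
# Unmerge the LoRA weights and restore the original base model weights
model.reset_adapter()
```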
## Architecture & Training
**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)**.
The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf):
- `r=64`, `alpha=16`
- LoRA modules added to the output, intermediate, and all (Q, K, V) self-attention linear layers
The adapter was trained similarly to the Guanaco models proposed in the paper:
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, LR: 2e-4, max steps: 1875
- Sequence length: 512
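For reference, the configuration above maps roughly to the following `LoRAConfig` in the Adapters library. This is only a sketch of the setup described above, not the exact training code (see the linked notebook for that); the adapter name is illustrative:
```python
from adapters import LoRAConfig

config = LoRAConfig(
    r=64,
    alpha=16,
    # LoRA on the Q, K and V self-attention projections ...
    selfattn_lora=True,
    attn_matrices=["q", "k", "v"],
    # ... as well as on the intermediate and output linear layers
    intermediate_lora=True,
    output_lora=True,
)
model.add_adapter("assistant_adapter", config=config)
model.train_adapter("assistant_adapter")  # freeze the base model, train only the adapter
```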