---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---

# OpenAssistant QLoRA Adapter for Llama-2 13B

QLoRA adapter for the Llama-2 13B (`meta-llama/Llama-2-13b-hf`) model trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.

**This adapter was created for usage with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**

## Usage

First, install `adapters`:

```
pip install -U adapters
```
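
Loading the model in 4-bit as shown below additionally relies on the `bitsandbytes` library (alongside a recent `transformers` release). If it is not installed yet, the following should typically suffice:

```
pip install -U bitsandbytes
```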

Now, the model and adapter can be loaded and activated like this:

```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"
adapter_id = "AdapterHub/llama2-13b-qlora-openassistant"

model = AutoModelForCausalLM.from_pretrained(
    model_id,    
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
adapters.init(model)

adapter_name = model.load_adapter(adapter_id, set_active=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```
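
To verify that the adapter was loaded and activated correctly, the adapters library provides introspection helpers. A minimal check, assuming `active_adapters` and `adapter_summary()` are available in the installed version:

```python
# show the currently active adapter composition (should list the loaded adapter)
print(model.active_adapters)

# tabular overview of all loaded adapters and their parameter counts
print(model.adapter_summary())
```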

### Inference

Inference can be done via the standard generation methods built into the Transformers library.
We first add some helper code to prompt the model in the expected conversation format:

```python
from transformers import StoppingCriteria

# stop if model starts to generate "### Human:"
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence = [12968, 29901]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:,-len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids

def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)
    
    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])

    # skip prompt when decoding
    return tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
```

Now, to prompt the model:

```python
prompt_model(model, "Please explain NLP in simple terms.")
```
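
Additional generation parameters accepted by Transformers' `generate` method (for example `max_new_tokens` or sampling settings) can be forwarded through a slightly extended helper. A sketch of such a variant, with illustrative parameter values only:

```python
def prompt_model_with_kwargs(model, text: str, **generate_kwargs):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)

    with torch.cuda.amp.autocast():
        # forward any standard generation kwargs to model.generate
        output_tokens = model.generate(
            **batch,
            stopping_criteria=[EosListStoppingCriteria()],
            **generate_kwargs,
        )

    return tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)

# example: sampled generation with an explicit length limit
prompt_model_with_kwargs(model, "Please explain NLP in simple terms.", max_new_tokens=256, do_sample=True, temperature=0.7)
```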

### Weight merging

To decrease inference latency, the LoRA weights can be merged with the base model:
```python
model.merge_adapter(adapter_name)
```
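
Merging modifies the base model weights in place. If the adapter should later be deactivated or swapped, the merge can be undone again; a minimal sketch, assuming the `reset_adapter()` method of the adapters library:

```python
# undo the merge and restore the unmerged base model weights
model.reset_adapter()
```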

## Architecture & Training

**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama2_Finetuning.ipynb)**.

The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf) (a configuration sketch follows the list below):
- `r=64`, `alpha=16`
- LoRA modules added to the output, intermediate, and all self-attention (Q, K, V) linear layers
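
This roughly corresponds to the following `LoRAConfig` in the adapters library. The sketch below is reconstructed from the bullet points above; the authoritative configuration is stored with the adapter itself and in the linked training notebook:

```python
from adapters import LoRAConfig

# approximate LoRA configuration as described above
config = LoRAConfig(
    r=64,
    alpha=16,
    attn_matrices=["q", "k", "v"],  # all self-attention projections
    selfattn_lora=True,
    intermediate_lora=True,
    output_lora=True,
)
```

A fresh adapter with such a configuration could then be added to an initialized model via `model.add_adapter("assistant_adapter", config=config)`.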

The adapter was trained similarly to the Guanaco models proposed in the paper (an illustrative training setup follows the list below):
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, LR: 2e-4, max steps: 1875
- Sequence length: 512
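
The hyperparameters above map roughly to the following Hugging Face `TrainingArguments`. This is an illustrative sketch only; the exact scheduler, warmup, and data preprocessing are defined in the linked notebook:

```python
from transformers import TrainingArguments

# illustrative hyperparameters mirroring the list above
training_args = TrainingArguments(
    output_dir="llama2-13b-qlora-openassistant",  # hypothetical output path
    per_device_train_batch_size=16,
    learning_rate=2e-4,
    max_steps=1875,
    bf16=True,  # matches the bfloat16 compute dtype used for 4-bit loading
    logging_steps=10,
)

# these arguments would be passed to adapters.AdapterTrainer together with the
# tokenized openassistant-guanaco dataset, as done in the linked notebook
```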