---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---

# OpenAssistant QLoRA Adapter for Llama-2 13B

QLoRA adapter for the Llama-2 13B model (`meta-llama/Llama-2-13b-hf`), trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.

**This adapter is intended for use with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**

## Usage

First, install `adapters`:

```
pip install -U adapters
```

Now, the model and adapter can be loaded and activated like this:

```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"
adapter_id = "AdapterHub/llama2-13b-qlora-openassistant"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
adapters.init(model)

adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```
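
As a rough guide to what the 4-bit settings above buy, the following sketch estimates the memory taken by the quantized weights alone. It assumes a nominal 13e9 parameters, a first-level block size of 64 and a second-level block size of 256 (the values described in the QLoRA paper); the function names are hypothetical, and the estimate ignores unquantized layers, activations and the KV cache.

```python
# Back-of-the-envelope weight memory for 4-bit quantization.
# Assumptions (not from this model card): 13e9 parameters, block sizes 64 and 256.

def bits_per_param(double_quant: bool, block_size: int = 64, block_size_2: int = 256) -> float:
    """Average storage per parameter: 4 weight bits plus quantization constants."""
    weight_bits = 4.0
    if double_quant:
        # 8-bit constant per block, plus one fp32 constant per group of blocks
        overhead = 8 / block_size + 32 / (block_size * block_size_2)
    else:
        # one fp32 constant per block
        overhead = 32 / block_size
    return weight_bits + overhead

def weight_memory_gb(n_params: float, double_quant: bool) -> float:
    """Memory of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param(double_quant) / 8 / 1e9

n_params = 13e9  # nominal size, an assumption
print(f"without double quantization: {weight_memory_gb(n_params, False):.2f} GB")
print(f"with double quantization:    {weight_memory_gb(n_params, True):.2f} GB")
```

Under these assumptions, `bnb_4bit_use_double_quant=True` saves a little under half a bit per parameter on average.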

### Inference

Inference can be done with the standard generation methods built into the Transformers library.
First, we add some helper code to prompt the model in the expected format:

```python
from transformers import StoppingCriteria

# stop if model starts to generate "### Human:"
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence = [12968, 29901]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:,-len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids

def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)
    
    with torch.autocast(device_type="cuda"):
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])

    # skip prompt when decoding
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    return decoded[:-10] if decoded.endswith("### Human:") else decoded
```
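
The suffix check inside `EosListStoppingCriteria` can be tried without a model. This is a minimal sketch using plain Python lists in place of tensors; `ends_with_stop` is a hypothetical helper, and every token ID other than the two defaults from the class above is made up.

```python
# Plain-list illustration of the check in EosListStoppingCriteria.__call__.
# Only 12968 and 29901 come from the class above; the other IDs are invented.
EOS_SEQUENCE = [12968, 29901]

def ends_with_stop(batch_ids, stop=EOS_SEQUENCE):
    # Mirror input_ids[:, -len(stop):].tolist(): keep the last len(stop) IDs per sequence
    last_ids = [seq[-len(stop):] for seq in batch_ids]
    # True if at least one sequence in the batch ends with the stop sequence
    return stop in last_ids

print(ends_with_stop([[1, 835, 12968, 29901]]))          # True
print(ends_with_stop([[1, 835, 12968, 5]]))              # False
print(ends_with_stop([[1, 2, 3, 4], [7, 12968, 29901]])) # True
```

As the last call shows, with a batch of several sequences the check fires as soon as any one of them ends with the stop sequence, so the helper is best suited to single-prompt generation.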

Now, to prompt the model:

```python
prompt_model(model, "Please explain NLP in simple terms.")
```

### Weight merging

To reduce inference latency, the LoRA weights can be merged into the base model weights:

```python
model.merge_adapter(adapter_name)
```
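
To see why merging does not change the model's outputs, here is a toy sketch of the standard LoRA formulation, where the merged weight is `W + (alpha / r) * B @ A`. It uses plain Python lists and made-up matrices; it illustrates the arithmetic only and is not the library's implementation.

```python
# Toy LoRA merge with invented values chosen so the arithmetic is exact.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(X, Y, scale):
    """Element-wise X + scale * Y."""
    return [[a + scale * b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

r, alpha = 2, 1    # toy values; this adapter uses r=64, alpha=16
scale = alpha / r  # 0.5

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen base weight, shape (d_out=2, d_in=3)
A = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # LoRA down-projection, shape (r, d_in)
B = [[2.0, 0.0], [0.0, 4.0]]            # LoRA up-projection, shape (d_out, r)
x = [[1.0], [1.0], [1.0]]               # input column vector

# Unmerged: base path plus the scaled low-rank path (two extra matmuls per call)
unmerged = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the low-rank update into W once, then use a single matmul
W_merged = add_scaled(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(unmerged)  # [[4.0], [5.0]]
print(merged)    # [[4.0], [5.0]]
```

Both paths produce the same output; the merged form simply avoids the extra low-rank matrix multiplications at inference time.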

## Architecture & Training

**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb).**

The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf):
- `r=64`, `alpha=16`
- LoRA modules added to the output, intermediate, and all self-attention (Q, K, V) linear layers
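
To put `r=64` and `alpha=16` in perspective, the sketch below computes the LoRA scaling factor and the number of parameters a single LoRA module adds to one square projection. It assumes the standard LoRA formulation (scaling `alpha / r`, two low-rank matrices per layer) and a hidden size of 5120 for Llama-2 13B; `lora_extra_params` is a hypothetical helper.

```python
# Scaling factor and per-layer parameter overhead for the configuration above.
# Assumption (not stated in this model card): hidden size of 5120.

def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA module: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

r, alpha = 64, 16
hidden = 5120  # assumed hidden size

scaling = alpha / r                              # factor applied to the low-rank update
added = lora_extra_params(hidden, hidden, r)     # extra parameters for one projection
frozen = hidden * hidden                         # parameters in the frozen projection

print(f"scaling factor: {scaling}")
print(f"added parameters per projection: {added:,}")
print(f"relative overhead: {added / frozen:.1%}")
```

Under these assumptions, each adapted attention projection gains only a few percent of its original parameter count, and those are the only weights updated during training.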

The adapter is trained similarly to the Guanaco models proposed in the paper:
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, LR: 2e-4, max steps: 1875
- Sequence length: 512
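
The hyperparameters above imply a fixed training budget, which the following sketch works out. The training-set size of 9,846 examples is an assumption about the dataset and is not stated in this card; the token count is an upper bound, since sequences shorter than 512 tokens contribute less.

```python
# Training budget implied by the listed hyperparameters.
batch_size = 16
max_steps = 1875
seq_len = 512

# Assumption: size of the training split of timdettmers/openassistant-guanaco
assumed_train_examples = 9846

sequences_seen = batch_size * max_steps          # sequences processed in total
max_tokens_seen = sequences_seen * seq_len       # upper bound on tokens processed
approx_epochs = sequences_seen / assumed_train_examples

print(f"sequences seen: {sequences_seen:,}")
print(f"tokens seen (upper bound): {max_tokens_seen:,}")
print(f"approximate epochs: {approx_epochs:.2f}")
```

If the assumed dataset size holds, 1875 steps correspond to roughly three passes over the training data.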