---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---

# OpenAssistant QLoRA Adapter for Llama-2 7B

QLoRA adapter for the Llama-2 7B (`meta-llama/Llama-2-7b-hf`) model trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.

**This adapter was created for usage with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**

## Usage

First, install `adapters`:

```bash
pip install -U adapters
```

Now, the model and adapter can be loaded and activated like this:

```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "AdapterHub/llama2-7b-qlora-openassistant"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
# Enable adapter support on the loaded (quantized) Transformers model.
adapters.init(model)

adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```
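
To verify that the adapter was loaded and set active, the `adapters` library can print a summary of all attached adapters (assuming the `adapter_summary()` helper available in recent `adapters` releases):

```python
# Overview of all attached adapters, their parameter counts and active state.
print(model.adapter_summary())
```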

### Inference

Inference can be done via the standard generation methods built into the Transformers library.
First, we add some helper code to prompt the model in the `### Human: ... ### Assistant:` format it was trained on:

```python
from transformers import StoppingCriteria

# Stop generation once the model starts a new "### Human:" turn.
# [12968, 29901] are the Llama token IDs covering the "Human:" part of the turn marker.
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence=[12968, 29901]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Check whether any sequence in the batch ends with the stop sequence.
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids

def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)

    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])

    # Skip the prompt tokens when decoding.
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    # Strip a trailing "### Human:" (10 characters) left over by the stopping criterion.
    return decoded[:-10] if decoded.endswith("### Human:") else decoded
```
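
The hard-coded stop sequence `[12968, 29901]` corresponds to the trailing `Human:` tokens of the `### Human:` turn marker under the Llama tokenizer. As a sketch, the IDs can also be derived from the tokenizer instead of hard-coded (verify the resulting IDs, since SentencePiece tokenization is context-sensitive):

```python
# Derive the stop-sequence IDs from the tokenizer instead of hard-coding them;
# encode the full turn marker and keep the trailing IDs covering "Human:".
stop_ids = tokenizer.encode("### Human:", add_special_tokens=False)[-2:]
stopping_criteria = [EosListStoppingCriteria(eos_sequence=stop_ids)]
```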

Now, to prompt the model:

```python
prompt_model(model, "Please explain NLP in simple terms.")
```

### Weight merging

To decrease inference latency, the LoRA weights can be merged with the base model:
```python
model.merge_adapter(adapter_name)
```
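
The merge can be undone again, e.g. to continue training the adapter; in the `adapters` library, this is done via `reset_adapter()` (assuming the current API):

```python
# Undo the weight merge, restoring the separate LoRA weights.
model.reset_adapter()
```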

## Architecture & Training

**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)**.

The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf); see the configuration sketch after this list:
- `r=64`, `alpha=16`
- LoRA modules added to the output, intermediate, and all self-attention (Q, K, V) linear layers
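
As an illustration, this roughly corresponds to the following `LoRAConfig` of the `adapters` library (a sketch; the linked notebook below contains the exact configuration used):

```python
from adapters import LoRAConfig

# LoRA on all self-attention projections (Q, K, V) plus the intermediate
# and output linear layers, with r=64 and alpha=16 as in the QLoRA paper.
config = LoRAConfig(
    selfattn_lora=True,
    intermediate_lora=True,
    output_lora=True,
    attn_matrices=["q", "k", "v"],
    r=64,
    alpha=16,
)
```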

The adapter was trained similarly to the Guanaco models proposed in the paper; a hyperparameter sketch follows this list:
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, LR: 2e-4, max steps: 1875
- Sequence length: 512
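
For orientation, the hyperparameters above translate into roughly the following `TrainingArguments` (a hypothetical sketch: the effective batch size of 16 may be realized via gradient accumulation, `output_dir` is a placeholder, and the linked notebook contains the exact setup):

```python
from transformers import TrainingArguments

# Hypothetical sketch of the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="./llama2-7b-qlora-openassistant",  # placeholder path
    per_device_train_batch_size=16,
    learning_rate=2e-4,
    max_steps=1875,
    bf16=True,
    logging_steps=10,
)
```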