---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---

# OpenAssistant QLoRA Adapter for Llama-2 13B

A QLoRA adapter for the Llama-2 13B model (`meta-llama/Llama-2-13b-hf`), trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.

**This adapter was created for use with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**

## Usage

First, install `adapters`:

```
pip install -U adapters
```
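
Loading the model in 4-bit as shown below additionally relies on the `bitsandbytes` and `accelerate` packages (for the quantized weights and for `device_map="auto"`, respectively). If they are not installed yet:

```
pip install -U bitsandbytes accelerate
```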

Now, the model and adapter can be loaded and activated like this:

```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"
adapter_id = "AdapterHub/llama2-13b-qlora-openassistant"

# Load the base model with 4-bit NF4 quantization (as used for QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
# Enable adapter support on the Transformers model
adapters.init(model)

# Download the adapter from the Hugging Face Hub and activate it
adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```
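
To verify that the adapter was loaded and activated, the model's adapter overview can be printed. This is a quick sanity check and not required for inference:

```python
# Show the adapters attached to the model, their parameter counts and status
print(model.adapter_summary())
```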

### Inference

Inference can be done via standard methods built into the Transformers library.
We add some helper code to properly prompt the model first:

```python
from transformers import StoppingCriteria

# stop if the model starts to generate "### Human:" (the next dialog turn)
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence=[12968, 29901]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids


def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)

    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])

    # skip the prompt when decoding
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    # strip a trailing "### Human:" if the stopping criterion fired
    return decoded[:-10] if decoded.endswith("### Human:") else decoded
```

Now, to prompt the model:

```python
prompt_model(model, "Please explain NLP in simple terms.")
```

### Weight merging

To decrease inference latency, the LoRA weights can be merged into the base model:
```python
model.merge_adapter(adapter_name)
```
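
Merging modifies the model weights in place. If the unmerged state is needed again later (for example to deactivate or switch adapters), the merge can be reverted; a minimal sketch, assuming the `reset_adapter()` method of current Adapters versions:

```python
# Undo the merge and restore the separate LoRA weights
model.reset_adapter()
```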

## Architecture & Training

**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)**.

The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf) (see the configuration sketch after this list):
- `r=64`, `alpha=16`
- LoRA modules added to the output, intermediate, and all self-attention (Q, K, V) linear layers
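
For reference, this architecture roughly corresponds to the following configuration in the Adapters library. This is a sketch reconstructed from the description above; the linked training notebook is the authoritative source:

```python
from adapters import LoRAConfig

# LoRA on all self-attention projections (Q, K, V) plus the intermediate
# and output linear layers, with rank 64 and scaling alpha 16
config = LoRAConfig(
    selfattn_lora=True,
    attn_matrices=["q", "k", "v"],
    intermediate_lora=True,
    output_lora=True,
    r=64,
    alpha=16,
)
```

During training, such a config is attached and set up for training with `model.add_adapter("assistant_adapter", config=config)` followed by `model.train_adapter("assistant_adapter")` (the adapter name here is chosen for illustration).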

The adapter was trained similarly to the Guanaco models proposed in the paper (a sketch of the training setup follows the list below):
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, LR: 2e-4, max steps: 1875
- Sequence length: 512
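
These hyperparameters translate into a standard Transformers/Adapters training loop roughly as follows. This is a minimal sketch, assuming a single training device, the hypothetical adapter name `assistant_adapter` from the configuration sketch above, and plain causal-LM loss over the full text; the linked notebook remains the authoritative training code:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, TrainingArguments
from adapters import AdapterTrainer

# Llama has no padding token by default; reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the single "text" column of the Guanaco dataset (sequence length 512)
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

# Freeze the base model and train only the adapter added above
# ("assistant_adapter" is the illustrative name from the config sketch)
model.train_adapter("assistant_adapter")

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=16,  # effective batch size 16; the notebook may use gradient accumulation
    learning_rate=2e-4,
    max_steps=1875,
    logging_steps=50,
    bf16=True,
)

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```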
|