	---
	tags:
	- llama
	- adapter-transformers
	- llama-2
	datasets:
	- timdettmers/openassistant-guanaco
	license: apache-2.0
	pipeline_tag: text-generation
	---

	# OpenAssistant QLoRA Adapter for Llama-2 7B

	QLoRA adapter for the Llama-2 7B (`meta-llama/Llama-2-7b-hf`) model trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.

	This adapter was created for use with the [Adapters](https://github.com/Adapter-Hub/adapters) library.

	## Usage

	First, install `adapters`:

	```
	pip install -U adapters
	```

	Now, the model and adapter can be loaded and activated like this:

	```python
	import adapters
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

	model_id = "meta-llama/Llama-2-7b-hf"
	adapter_id = "AdapterHub/llama2-7b-qlora-openassistant"

	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    device_map="auto",
	    quantization_config=BitsAndBytesConfig(
	        load_in_4bit=True,
	        bnb_4bit_quant_type="nf4",
	        bnb_4bit_use_double_quant=True,
	        bnb_4bit_compute_dtype=torch.bfloat16,
	    ),
	    torch_dtype=torch.bfloat16,
	)
	adapters.init(model)

	adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	```

	### Inference

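	Inference can be done via the standard generation methods built into the Transformers library.
	First, we add some helper code that wraps the input in the prompt format used during training and stops generation once the model begins a new `### Human:` turn: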

	```python
	from transformers import StoppingCriteria

	# stop if model starts to generate "### Human:"
	class EosListStoppingCriteria(StoppingCriteria):
	    def __init__(self, eos_sequence=[12968, 29901]):
	        self.eos_sequence = eos_sequence

	    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
	        # compare the trailing token IDs of each sequence with the stop sequence
	        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
	        return self.eos_sequence in last_ids

	def prompt_model(model, text: str):
	    # relies on the `tokenizer` loaded in the previous snippet
	    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
	    batch = batch.to(model.device)

	    with torch.cuda.amp.autocast():
	        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])

	    # skip prompt when decoding
	    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
	    # drop a trailing "### Human:" marker (10 characters) if generation stopped on it
	    return decoded[:-10] if decoded.endswith("### Human:") else decoded
	```

	Now, to prompt the model:

	```python
	prompt_model(model, "Please explain NLP in simple terms.")
	```
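The stopping logic in the helper code can be checked without loading the model. The following is a minimal sketch in plain Python that mirrors the two checks used there: the token-level suffix test from `EosListStoppingCriteria` and the trailing-marker cleanup from `prompt_model`. The function names `ends_with_stop` and `strip_stop_marker` and the example token IDs (other than `[12968, 29901]`) are hypothetical and only serve as illustration.

```python
HUMAN_STOP_IDS = [12968, 29901]  # stop sequence taken from the helper code above


def ends_with_stop(batch_ids, stop_ids=HUMAN_STOP_IDS):
    """Return True if any sequence in the batch ends with stop_ids."""
    tails = [seq[-len(stop_ids):] for seq in batch_ids]
    return stop_ids in tails


def strip_stop_marker(text, marker="### Human:"):
    """Drop a trailing marker, equivalent to the decoded[:-10] slice."""
    return text[: -len(marker)] if text.endswith(marker) else text


# Hypothetical token IDs, chosen only to exercise both branches.
running = [[1, 835, 4007, 22137]]
finished = [[1, 450, 1234, 12968, 29901]]

print(ends_with_stop(running))   # False
print(ends_with_stop(finished))  # True
print(repr(strip_stop_marker("NLP lets computers read text. ### Human:")))
```

Because the check tests whether the stop sequence matches the tail of any sequence in the batch, generation stops as soon as one sequence reaches the marker. This is fine for the single-prompt usage shown here but may need adjusting for batched generation.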

	### Weight merging

	To decrease inference latency, the LoRA weights can be merged with the base model:

	```python
	model.merge_adapter(adapter_name)
	```
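To illustrate why merging reduces latency, here is a toy sketch of the standard LoRA formulation, in which the low-rank update is scaled by `alpha / r` and folded into the base weight once. It uses plain Python lists and small made-up matrices, ignores quantization entirely, and is not the code `merge_adapter` runs internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


r, alpha = 2, 4      # toy values; this adapter uses r=64, alpha=16
scale = alpha / r    # LoRA scaling factor

W = [[1.0, 0.0, 2.0],   # frozen base weight, shape (out=2, in=3)
     [0.0, 1.0, 0.0]]
B = [[1.0, 0.0],        # LoRA up-projection, shape (out=2, r=2)
     [0.0, 1.0]]
A = [[0.5, 0.0, 0.0],   # LoRA down-projection, shape (r=2, in=3)
     [0.0, 0.5, 1.0]]
x = [1.0, 2.0, 3.0]

# Unmerged: base path plus scaled low-rank path (two extra products per layer).
base = matvec(W, x)
low_rank = matvec(B, matvec(A, x))
unmerged = [b + scale * l for b, l in zip(base, low_rank)]

# Merged: fold the scaled update into the weight once, then use a single product.
delta = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
merged = matvec(W_merged, x)

print(unmerged, merged)  # both are [8.0, 10.0]
```

After merging, each adapted layer performs one matrix product instead of three, while producing the same output.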

	## Architecture & Training

	Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb).

	The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf):
	- `r=64`, `alpha=16`
	- LoRA modules added in output, intermediate and all (Q, K, V) self-attention linear layers
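The settings listed above could be expressed with the `LoRAConfig` class of the Adapters library roughly as follows. This is a sketch rather than the exact training configuration: any option not listed above (for example dropout) is left at the library default, which may differ from the actual training run, and the adapter name `assistant_adapter` is a placeholder.

```python
from adapters import LoRAConfig

# Sketch of a LoRA configuration matching the values listed in this card.
config = LoRAConfig(
    r=64,
    alpha=16,
    selfattn_lora=True,             # LoRA in the self-attention layers
    intermediate_lora=True,         # LoRA in the intermediate feed-forward layer
    output_lora=True,               # LoRA in the output feed-forward layer
    attn_matrices=["q", "k", "v"],  # apply to query, key and value projections
)

# With a model prepared via adapters.init(model), the adapter could then be added
# and activated for training (placeholder name):
# model.add_adapter("assistant_adapter", config=config)
# model.train_adapter("assistant_adapter")
```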

	The adapter was trained similarly to the Guanaco models proposed in the paper:
	- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
	- Quantization: 4-bit QLoRA
	- Batch size: 16, LR: 2e-4, max steps: 1875
	- Sequence length: 512
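As a rough sanity check, the hyperparameters above imply the following amount of training data. The train split size of 9,846 examples is an assumption not stated in this card, and the calculation assumes that the batch size of 16 is the effective batch size per optimizer step.

```python
batch_size = 16
max_steps = 1875
seq_len = 512
train_examples = 9846  # assumption: approximate size of the dataset's train split

sequences_seen = batch_size * max_steps          # sequences processed in total
max_tokens_seen = sequences_seen * seq_len       # upper bound, if every sequence were full length
approx_epochs = sequences_seen / train_examples  # passes over the train split

print(sequences_seen)           # 30000
print(max_tokens_seen)          # 15360000
print(round(approx_epochs, 2))  # 3.05
```

Under these assumptions, training corresponds to roughly three passes over the dataset.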