calpt committed 2226409 (parent: 698e0cb): Update README.md

README.md (1 file changed, +78 −13):

---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---

# OpenAssistant QLoRA Adapter for Llama-2 7B

QLoRA adapter for the Llama-2 7B (`meta-llama/Llama-2-7b-hf`) model trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.

**This adapter was created for usage with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**

## Usage

First, install `adapters`:

```bash
pip install -U adapters
```

Now, the model and adapter can be loaded and activated like this:

```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "AdapterHub/llama2-7b-qlora-openassistant"

# load the base model with 4-bit NF4 quantization (QLoRA-style)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
# enable adapter support on the Transformers model
adapters.init(model)

adapter_name = model.load_adapter(adapter_id, set_active=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```
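
Optionally, the active adapter setup can be inspected as a quick sanity check. The snippet below assumes the Adapters library's standard `active_adapters` property and `adapter_summary()` helper:

```python
# the newly loaded adapter should show up as active
print(model.active_adapters)

# per-adapter parameter breakdown
print(model.adapter_summary())
```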

### Inference

Inference can be done via the standard generation methods built into the Transformers library.
First, we add some helper code to properly prompt the model:

```python
from transformers import StoppingCriteria

# stop if the model starts to generate "### Human:"
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence=[12968, 29901]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids


def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)

    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])

    # skip the prompt when decoding
    return tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
```

Now, to prompt the model:

```python
prompt_model(model, "Please explain NLP in simple terms.")
```
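
To compare responses against the plain base model, the adapter can be temporarily deactivated and re-activated. This sketch assumes the Adapters library's standard `set_active_adapters` method:

```python
# deactivate the adapter to get base-model behavior
model.set_active_adapters(None)
print(prompt_model(model, "Please explain NLP in simple terms."))

# re-activate the loaded adapter
model.set_active_adapters(adapter_name)
```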

### Weight merging

To decrease inference latency, the LoRA weights can be merged with the base model:

```python
model.merge_adapter(adapter_name)
```
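
If the unmerged setup is needed again later (e.g. to deactivate or swap the adapter), the merge can be undone. This assumes the Adapters library's standard `reset_adapter()` method:

```python
# undo the merge and restore the stand-alone LoRA weights
model.reset_adapter()
```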

## Architecture & Training

**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama2_Finetuning.ipynb)**.

The LoRA architecture closely follows the configuration described in the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf):
- `r=64`, `alpha=16`
- LoRA modules added to the output, intermediate, and all (Q, K, V) self-attention linear layers (see the configuration sketch below)
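
As an illustration, such a configuration could be expressed with the Adapters library roughly as follows. This is a hedged sketch; the exact settings used for training are defined in the linked notebook:

```python
from adapters import LoRAConfig

# approximate LoRA setup matching the description above:
# rank 64, alpha 16, applied to Q/K/V self-attention plus intermediate and output linear layers
config = LoRAConfig(
    r=64,
    alpha=16,
    selfattn_lora=True,
    attn_matrices=["q", "k", "v"],
    intermediate_lora=True,
    output_lora=True,
)

# when training an adapter from scratch, it would be added and activated for training like this:
# model.add_adapter("openassistant", config=config)
# model.train_adapter("openassistant")
```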

The adapter was trained similarly to the Guanaco models proposed in the paper (a schematic training setup is sketched after this list):
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, learning rate: 2e-4, max steps: 1875
- Sequence length: 512
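
For orientation, a schematic training setup matching these hyperparameters might look as follows. This is a hedged sketch, not the exact training code: the dataset preprocessing, `AdapterTrainer` usage, and argument values below are illustrative assumptions, and the authoritative setup is in the linked notebook:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, TrainingArguments
from adapters import AdapterTrainer

# openassistant-guanaco ships a single "text" column containing the full dialogue
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers define no pad token by default

def tokenize(batch):
    # truncate to the 512-token sequence length used for training
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="llama2-7b-qlora-openassistant",
    per_device_train_batch_size=16,
    learning_rate=2e-4,
    max_steps=1875,
    bf16=True,
    logging_steps=10,
)

# AdapterTrainer updates only the adapter weights, keeping the 4-bit base model frozen;
# it assumes the adapter has been set up for training via model.train_adapter(...)
trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```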