---
tags:
- llama
- adapter-transformers
- llama-2
datasets:
- timdettmers/openassistant-guanaco
license: apache-2.0
pipeline_tag: text-generation
---

# OpenAssistant Bottleneck QAdapter for Llama-2 7B

A sequential bottleneck QAdapter for the Llama-2 7B (`meta-llama/Llama-2-7b-hf`) model, trained for instruction tuning on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/) dataset.

**This adapter was created for use with the [Adapters](https://github.com/Adapter-Hub/adapters) library.**

## Usage

First, install `adapters`:

```
pip install -U adapters
```

Now, the model and adapter can be loaded and activated like this:

```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "AdapterHub/llama2-7b-qadapter-seq-openassistant"

# Load the base model with 4-bit NF4 quantization (QLoRA-style)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
# Enable adapter support for the plain Transformers model
adapters.init(model)

# Download the adapter from the Hub, attach it and set it as active
adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```
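
As an optional sanity check (not part of the original card), the adapter setup attached to the model can be inspected before running inference. This assumes the `adapter_summary()` helper and the `active_adapters` property of the Adapters library are available in your installed version:

```python
# Optional check: list the adapters attached to the model and the currently active setup.
# (Assumes these Adapters library helpers exist in your installed version.)
print(model.adapter_summary())
print(model.active_adapters)
```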

### Inference

Inference can be done via the standard generation methods built into the Transformers library.
We first add some helper code to properly prompt the model:

```python
from transformers import StoppingCriteria

# Stop generation once the model starts to produce a new "### Human:" turn
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence=[12968, 29901]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids


def prompt_model(model, text: str):
    # Wrap the user input in the "### Human: ... ### Assistant:" prompt format used during training
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)

    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])

    # Decode only the newly generated tokens (skip the prompt)
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    # Strip a trailing "### Human:" if the stopping criterion fired
    return decoded[:-10] if decoded.endswith("### Human:") else decoded
```

Now, to prompt the model:

```python
prompt_model(model, "Please explain NLP in simple terms.")
```
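
By default, `prompt_model` relies on the model's default generation settings. To bound the response length or enable sampling, standard Transformers generation arguments can be forwarded to `model.generate`; the values below are illustrative assumptions, not settings from the original card:

```python
# Variant of the generate call inside prompt_model with explicit generation settings
# (values are illustrative assumptions, not taken from the original card).
output_tokens = model.generate(
    **batch,
    max_new_tokens=256,   # cap the length of the generated answer
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
    stopping_criteria=[EosListStoppingCriteria()],
)
```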

## Architecture & Training

**Training was run with the code in [this notebook](https://github.com/adapter-hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb).**

The adapter uses the sequential bottleneck architecture described in [Houlsby et al. (2019)](https://arxiv.org/pdf/1902.00751.pdf), which is available in Adapters as `double_seq_bn`.

The adapter is trained similarly to the Guanaco models proposed in the QLoRA paper; a rough sketch of a comparable setup follows the list below:
- Dataset: [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
- Quantization: 4-bit QLoRA
- Batch size: 16, LR: 2e-4, max steps: 1875
- Sequence length: 512
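
For reference, here is a rough sketch of how a comparable bottleneck adapter could be attached to the base model for training with the Adapters library. This is not the original training script (see the linked notebook for that); the adapter name and setup shown are assumptions:

```python
# Rough sketch of a comparable setup (assumptions, not the original training script;
# see the linked notebook for the actual training code).
import adapters
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # 4-bit quantization as in the usage example above could be added here
)
adapters.init(model)

# "double_seq_bn": sequential bottleneck adapters after both the attention and
# feed-forward blocks of each layer (Houlsby et al., 2019)
model.add_adapter("assistant_adapter", config="double_seq_bn")

# Freeze the base model weights and train only the adapter parameters
model.train_adapter("assistant_adapter")
```

Training itself can then be run on the tokenized dataset with the hyperparameters listed above, for example via the `AdapterTrainer` class shipped with the Adapters library or the training loop in the linked notebook.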