---
license: apache-2.0
language:
- en
datasets:
- chargoddard/Open-Platypus-Chat
tags:
- axolotl
base_model: ai21labs/Jamba-v0.1
---

![image/webp](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/efmF8RtLLeKgQ9OwPqfD8.webp)

# Jambatypus-v0.1

This model is a QLoRA fine-tuned version of [ai21labs/Jamba-v0.1](https://huggingface.co/ai21labs/Jamba-v0.1) on the [chargoddard/Open-Platypus-Chat](https://huggingface.co/datasets/chargoddard/Open-Platypus-Chat) dataset.

It was trained on two A100 80 GB GPUs using my [LazyAxolotl - Jamba](https://colab.research.google.com/drive/1alsgwZFvLPPAwIgkAxeMKHQSJYfW7DeZ?usp=sharing) notebook.

This repo contains both the adapter and the merged model in FP16 precision.
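
If you have enough VRAM for the full-precision weights, a minimal sketch for loading the merged FP16 model directly (assuming the merged weights sit at the root of this repo, as stated above) looks like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative direct load of the merged FP16 weights from this repo
tokenizer = AutoTokenizer.from_pretrained("mlabonne/Jambatypus-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mlabonne/Jambatypus-v0.1",
    trust_remote_code=True,      # Jamba ships custom modeling code
    torch_dtype=torch.float16,
    device_map="auto",           # requires accelerate
)
```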

I recommend using the ChatML chat template with this model.
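
For reference, a ChatML-formatted prompt (the same layout the usage code below builds by hand) looks like this:

```
<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is a state space model?
<|im_end|>
<|im_start|>assistant
```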

[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.4.0`
```yaml

base_model: ai21labs/Jamba-v0.1
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: chargoddard/Open-Platypus-Chat
    type: sharegpt
chat_template: chatml
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false

use_wandb: true
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name: Jambatypus-v0.1
wandb_log_model:

adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

low_cpu_mem_usage: true
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 4
save_total_limit: 2
debug:
deepspeed:
weight_decay: 0.0
special_tokens:

```

</details><br>
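
To reproduce the run, you can save this config to a YAML file and launch it with axolotl's standard entry point, e.g. `accelerate launch -m axolotl.cli.train config.yaml` (the file name here is just an example).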

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 8
- total_train_batch_size: 16
- total_eval_batch_size: 2
- optimizer: Adam with betas=(0.9,0.95) and epsilon=1e-05
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 1
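
The effective batch size of 16 comes from micro_batch_size (1) × gradient_accumulation_steps (8) × num_devices (2).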

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.6274        | 0.01  | 1    | 1.0298          |
| 0.44          | 0.25  | 42   | 0.9770          |
| 0.4406        | 0.5   | 84   | 0.9653          |
| 0.4445        | 0.75  | 126  | 0.9645          |
| 0.4609        | 1.0   | 168  | 0.9641          |


### Framework versions

- PEFT 0.10.0
- Transformers 4.40.0.dev0
- Pytorch 2.1.2+cu118
- Datasets 2.18.0
- Tokenizers 0.15.0

## 💻 Usage

The following code creates a Gradio chat interface with Jambatypus.

```python
!pip install -qqq -U git+https://github.com/huggingface/transformers
!pip install -qqq mamba-ssm "causal-conv1d>=1.2.0"
!pip install -qqq accelerate bitsandbytes torch datasets peft gradio
!pip install -qqq flash-attn --no-build-isolation


import torch
import gradio as gr
from threading import Thread
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextIteratorStreamer

STOP_TOKEN = "<|im_end|>"

def predict(message, history, system_prompt, temperature, max_new_tokens, top_k, repetition_penalty, top_p):
    # Format the conversation with the ChatML template
    instruction = '<|im_start|>system\n' + system_prompt + '\n<|im_end|>\n'
    for human, assistant in history:
        instruction += '<|im_start|>user\n' + human + '\n<|im_end|>\n<|im_start|>assistant\n' + assistant + '\n<|im_end|>\n'
    instruction += '<|im_start|>user\n' + message + '\n<|im_end|>\n<|im_start|>assistant\n'
    
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    enc = tokenizer([instruction], return_tensors="pt", padding=True, truncation=True)
    input_ids, attention_mask = enc.input_ids, enc.attention_mask

    generate_kwargs = dict(
        {"input_ids": input_ids.to(device), "attention_mask": attention_mask.to(device)},
        streamer=streamer,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
        top_p=top_p
    )
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()
    outputs = []
    for new_token in streamer:
        if STOP_TOKEN in new_token:
            # Trim the stop token and the newline that precedes it in the template
            outputs.append(new_token[:-len(STOP_TOKEN) - 1])
            yield "".join(outputs)
            break
        outputs.append(new_token)
        yield "".join(outputs)


# Select the device and load the tokenizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")

# 8-bit quantization config (Mamba blocks are kept in higher precision)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"]
)
# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    quantization_config=quantization_config
)
# Attach the Jambatypus QLoRA adapter
model = PeftModel.from_pretrained(model, "mlabonne/Jambatypus-v0.1")

# Create Gradio interface
gr.ChatInterface(
    predict,
    title="Jambatypus",
    description="Chat with Jambatypus!",
    examples=[
        ["Can you solve the equation 2x + 3 = 11 for x?"],
        ["Write an epic poem about Ancient Rome."],
        ["Who was the first person to walk on the Moon?"],
        ["Use a list comprehension to create a list of squares for numbers from 1 to 10."],
        ["Recommend some popular science fiction books."],
        ["Can you write a short story about a time-traveling detective?"]
    ],
    additional_inputs_accordion=gr.Accordion(label="⚙️ Parameters", open=False),
    additional_inputs=[
        gr.Textbox("Perform the task to the best of your ability.", label="System prompt"),
        gr.Slider(0, 1, 0.8, label="Temperature"),
        gr.Slider(128, 4096, 1024, label="Max new tokens"),
        gr.Slider(1, 80, 40, label="Top K sampling"),
        gr.Slider(0, 2, 1.1, label="Repetition penalty"),
        gr.Slider(0, 1, 0.95, label="Top P sampling"),
    ],
    theme=gr.themes.Soft(primary_hue="green"),
).queue().launch(share=True)
```
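
Launching with `share=True` also prints a temporary public Gradio URL in addition to the local one; remove it if you only want to run the interface locally.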