---
datasets:
- homebrewltd/instruction-speech-whispervq-v2
language:
- en
license: apache-2.0
tags:
- sound language model
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/Ichigo-llama3.1-s-instruct-v0.4-GGUF
This is a quantized version of [homebrewltd/Ichigo-llama3.1-s-instruct-v0.4](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.4), created using llama.cpp.
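
If you want to try the GGUF weights locally, the snippet below is a minimal sketch (not part of the original card) using the llama-cpp-python bindings; the `.gguf` filename is a placeholder for whichever quantization level you actually download from this repo.

```python
# Minimal sketch, assuming llama-cpp-python is installed and a GGUF file from
# this repo has been downloaded locally. The filename below is hypothetical;
# substitute the quantization you fetched.
from llama_cpp import Llama

llm = Llama(
    model_path="Ichigo-llama3.1-s-instruct-v0.4.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=4096,  # matches the max sequence length used during fine-tuning
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, can you hear me?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```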

# Original Model Card

## Model Details

We have developed and released the [Ichigo-llama3s](https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405) family of models. This family natively understands audio and text input.

This model is a supervised fine-tuned (SFT) version of homebrewltd/Ichigo-llama3.1-s-base-v0.3, trained on over 1 billion tokens from the [Instruction Speech WhisperVQ v4](https://huggingface.co/datasets/homebrewltd/mixed-instruction-speech-whispervq-v4) dataset, which builds upon [Instruction Speech WhisperVQ v3](https://huggingface.co/datasets/homebrewltd/mixed-instruction-speech-whispervq-v3-full) by adding multi-turn speech conversations and noise rejection. As a result, the model is more robust to noisy environmental inputs and handles multi-turn conversations better, making it more reliable in real-world applications.

**Model developers** Homebrew Research.

**Input** Text and sound.

**Output** Text.

**Model Architecture** Llama-3.

**Language(s):** English.

## Intended Use

**Intended Use Cases** This family is primarily intended for research applications. This version aims to further improve the model's sound understanding capabilities.

**Out-of-scope** The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

## How to Get Started with the Model

Try this model using the [Google Colab Notebook](https://colab.research.google.com/drive/18IiwN0AzBZaox5o0iidXqWD1xKq11XbZ?usp=sharing).

First, we need to convert the audio file to sound tokens:

```python
import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
# RQBottleneckTransformer is provided by the WhisperSpeech package
from whisperspeech.vq_stoks import RQBottleneckTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ quantizer checkpoint if it is not already present
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )

vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)
vq_model.ensure_whisper(device)

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    # Load the audio and resample to the 16 kHz rate expected by WhisperVQ
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)

    # Encode the waveform into discrete sound codes
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()

    # Wrap the codes in the model's sound-token format
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'
```
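
As a quick sanity check (not from the original card), the helper can be called on a local recording; `sample.wav` below is a hypothetical file name.

```python
# Hypothetical usage: "sample.wav" stands in for any local speech recording.
sound_tokens = audio_to_sound_tokens("sample.wav")
print(sound_tokens[:60])  # prints something like '<|sound_start|><|sound_0012|>...'
```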

Then, we can run inference on the model just like any other LLM:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model_kwargs = {"device_map": "auto"}

    # Optional 4-bit / 8-bit quantization via bitsandbytes
    if use_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
            bnb_8bit_use_double_quant=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

# Usage
llm_path = "homebrewltd/Ichigo-llama3.1-s-instruct-v0.4"
pipe = setup_pipeline(llm_path, use_8bit=True)
```
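
Putting the two pieces together, a hedged end-to-end sketch (assuming the sound-token string can be passed directly as the user turn, as in earlier llama3-s examples) might look like this:

```python
# Hypothetical end-to-end flow: encode speech to sound tokens, then ask the model.
# The exact prompt format may differ; the Colab notebook is the reference.
sound_tokens = audio_to_sound_tokens("sample.wav")  # "sample.wav" is a placeholder
messages = [{"role": "user", "content": sound_tokens}]
print(generate_text(pipe, messages, max_new_tokens=128))
```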

## Training Process

**Training Metrics Image**: Below is a snapshot of the training loss curve.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/DmZOYY_-NQtNS610HXR8L.png)

**[MMLU](https://huggingface.co/datasets/cais/mmlu)**:

| Model | MMLU Score |
| --- | --- |
| llama3.1-instruct-8b | 69.40 |
| ichigo-llama3.1-s-v0.4 | **64.66** |
| ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
| ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
| ichigo-llama3.1-s-base-v0.3 | 42.11 |
| llama3.5-instruct-v0.2 | 50.27 |

**[AudioBench](https://arxiv.org/abs/2406.16020) Eval**:

| Model | [Open-hermes Instruction Audio](https://huggingface.co/datasets/AudioLLMs/openhermes_instruction_test) (GPT-4-O judge 0:5) | [Alpaca Instruction Audio](https://huggingface.co/datasets/AudioLLMs/alpaca_audio_test) (GPT-4-O judge 0:5) |
| --- | --- | --- |
| [Llama3.1-s-v2](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2) | 3.45 | 3.53 |
| [Ichigo-llama3.1-s v0.4](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.4) | **3.5** | **3.52** |
| [Ichigo-llama3.1-s v0.3-phase2-cp7000](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2) | 3.42 | 3.62 |
| [Ichigo-llama3.1-s v0.3-phase2-cplast](https://huggingface.co/jan-hq/llama3-s-instruct-v0.3-checkpoint-last) | 3.31 | 3.6 |
| [Ichigo-llama3.1-s v0.3-phase3](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-3) | 3.64 | 3.68 |
| [Qwen2-audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B) | 2.63 | 2.24 |

### Hardware

**GPU Configuration**: Cluster of 8x NVIDIA H100-SXM-80GB.

**GPU Usage**:
- **Continual Training**: 12 hours.

### Training Arguments

We use the [torchtune](https://github.com/pytorch/torchtune) library, which provides the latest FSDP2 training implementation.

| Parameter | Instruction Fine-tuning |
|----------------------------|-------------------------|
| **Epoch** | 1 |
| **Global batch size** | 256 |
| **Learning Rate** | 7e-5 |
| **Learning Scheduler** | Cosine with warmup |
| **Optimizer** | Adam torch fused |
| **Warmup Ratio** | 0.01 |
| **Weight Decay** | 0.005 |
| **Max Sequence Length** | 4096 |
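
For illustration only (this is not the actual torchtune recipe), the hyperparameters above map onto plain PyTorch roughly as follows; `model` and `num_training_steps` are placeholders.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)   # placeholder for the actual LLM
num_training_steps = 1000       # placeholder for total optimizer steps

# "Adam torch fused" with the learning rate and weight decay from the table
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=7e-5,
    weight_decay=0.005,
    fused=torch.cuda.is_available(),  # fused kernel needs CUDA; falls back on CPU
)

# Cosine schedule with a warmup ratio of 0.01
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.01 * num_training_steps),
    num_training_steps=num_training_steps,
)
```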

## Examples

1. Good example:

<details>
<summary>Click to toggle Example 1</summary>

```

```
</details>

<details>
<summary>Click to toggle Example 2</summary>

```

```
</details>

2. Misunderstanding example:

<details>
<summary>Click to toggle Example 3</summary>

```

```
</details>

3. Off-tracked example:

<details>
<summary>Click to toggle Example 4</summary>

```

```
</details>

## Citation Information

**BibTeX:**

```
@article{llama3s2024,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}
```

## Acknowledgement

- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**

- **[Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)**