This is [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) with a Llama-3 vocabulary.

The intended use is as a draft model for Llama-3-70B-Instruct. Llama3-8B-Instruct works for this purpose, but it's on
the heavier side for drafting.
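
For illustration, a vocabulary-matched draft like this can be dropped into Hugging Face transformers' assisted
generation along the lines of the sketch below (backends such as ExLlamaV2 also support draft models natively). The
prompt, dtypes and generation settings are placeholders, and the 70B target of course needs correspondingly large
hardware:

```python
# Minimal sketch: using a vocabulary-matched draft with transformers' assisted
# generation. Model IDs, dtypes and generation settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Meta-Llama-3-70B-Instruct"
draft_id = "turboderp/Qwama-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto", torch_dtype="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Explain speculative decoding in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(target.device)

# assistant_model enables assisted (speculative) generation; it requires the draft
# to share the target's vocabulary, which is exactly what the vocab swap provides.
out = target.generate(inputs, assistant_model=draft, max_new_tokens=200)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```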

The secondary purpose is to explore the feasibility of vocabulary swaps, either for adapting small models like
Qwen2-0.5B to produce drafts for other models, or for interoperability between dissimilar language models in general.
The conclusion in this regard is that the method works, but, since finetuning is required, it will be expensive for
larger models. It would be interesting to explore low-rank or quantized finetuning as an alternative.

## Procedure

The vocabulary was swapped by creating a new embedding layer (the original model uses tied embeddings, so the output
layer is the same) and initializing it as follows (a code sketch follows the list):

- every L3 token that is an exact match for a Qwen2 token is initialized with the corresponding embedding
- every L3 token that decodes and re-encodes to multiple Qwen2 tokens is initialized with the mean of those embeddings
- there are no L3 tokens that cannot be translated to one or more Qwen2 tokens (both vocabularies are complete)
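
Roughly, that initialization can be sketched as below. This is only an approximation of the procedure, not the exact
script used for this model: it assumes both tokenizers expose comparable byte-level token strings, and it glosses over
control-token handling and saving.

```python
# Rough sketch of the initialization above. Not the exact script used for this model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

qwen = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
l3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

old_emb = qwen.get_input_embeddings().weight.data   # [qwen_vocab_size, hidden]
qwen_vocab = qwen_tok.get_vocab()                    # token string -> Qwen2 id
l3_vocab_size = len(l3_tok)                          # includes the L3 control tokens

new_emb = torch.zeros((l3_vocab_size, old_emb.shape[1]), dtype=old_emb.dtype)

for l3_id in range(l3_vocab_size):
    token_str = l3_tok.convert_ids_to_tokens(l3_id)
    if token_str in qwen_vocab:
        # exact match: copy the corresponding Qwen2 embedding
        new_emb[l3_id] = old_emb[qwen_vocab[token_str]]
    else:
        # no exact match: decode to text, re-encode with the Qwen2 tokenizer, and
        # take the mean of those embeddings (every L3 token maps to one or more
        # Qwen2 tokens, so this never comes up empty)
        text = l3_tok.decode([l3_id])
        qwen_ids = qwen_tok.encode(text, add_special_tokens=False)
        new_emb[l3_id] = old_emb[qwen_ids].mean(dim=0)

# Swap in the new embeddings; the output layer follows because the weights are tied.
qwen.resize_token_embeddings(l3_vocab_size)
qwen.get_input_embeddings().weight.data.copy_(new_emb)
qwen.tie_weights()
```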

Swapping the vocabulary with the above method yields a mostly coherent but still very confused model. It especially
struggles with numbers, and of course the embeddings for the Llama-3 control tokens do not have the significance they
would in an instruct-tuned model.

This is remedied by subsequent finetuning, first on
[this 2.41 million row sample from Common Crawl](https://huggingface.co/datasets/agentlans/common-crawl-sample), and
subsequently for 3 epochs on about 25,000 instruct-formatted completions produced by Llama3-8B-Instruct, included
[here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct/blob/main/llama3-instruct-prompts.json) for reference.

I did try tuning just the tied embeddings, but this didn't achieve good results.
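
A minimal full-parameter setup for the first of those stages might look like the sketch below. The hyperparameters,
sequence length, split name and the dataset's `text` column are placeholder assumptions rather than the actual training
configuration; the second stage would repeat this for 3 epochs on the chat-formatted completions linked above.

```python
# Sketch of stage-one finetuning on the Common Crawl sample. Hyperparameters and
# the "text" column name are placeholders, not the settings actually used.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # L3 tokenizer
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("path/to/vocab-swapped-qwen2-0.5b")  # hypothetical local path

dataset = load_dataset("agentlans/common-crawl-sample", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM, no masking

args = TrainingArguments(
    output_dir="qwama-stage1",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator, tokenizer=tokenizer).train()
```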

## EXL2 Quants

EXL2 quants uploaded [here](https://huggingface.co/turboderp/Qwama-0.5B-Instruct-exl2).