Update README.md
README.md
CHANGED
@@ -41,30 +41,24 @@ Llama-2-Ko is an auto-regressive language model that uses an optimized transform
 
 **Vocab Expansion**
 
-- Use the same tokenization for English, but a shorter/merged tokenization for Korean.
-- Tokenize "안녕하세요, 오늘은 날씨가 좋네요."
-  - Llama-2:
-    ```
-    ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']
-    ```
-  - **Llama-2-Ko**:
-    ```
-    ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']
-    ```
-- Tokenize "Llama 2: Open Foundation and Fine-Tuned Chat Models"
-  - Llama-2:
-    ```
-    ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
-    ```
-  - **Llama-2-Ko**:
-    ```
-    ['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']
-    ```
+| Model Name | Vocabulary Size | Description |
+| --- | --- | --- |
+| Original Llama-2 | 32000 | Sentencepiece BPE |
+| **Expanded Llama-2-Ko** | 46336 | Sentencepiece BPE. Added Korean vocab and merges |
+
+**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**
+
+| Model | Tokens |
+| --- | --- |
+| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
+| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |
+
+**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**
+
+| Model | Tokens |
+| --- | --- |
+| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
+| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
 
 # **Model Benchmark**
 
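A note on the `<0xEB>`-style pieces in the Llama-2 rows: they come from SentencePiece byte-fallback, where a character missing from the 32000-token vocabulary is emitted as its raw UTF-8 bytes. The test sentence "안녕하세요, 오늘은 날씨가 좋네요." ("Hello, the weather is nice today.") therefore costs 29 pieces under the original vocab but only 7 after the Korean expansion. A minimal sketch of the fallback in plain Python (an illustration, not the actual SentencePiece implementation):

```python
def byte_fallback(ch: str) -> list[str]:
    """Represent an out-of-vocab character as SentencePiece-style
    byte tokens, one per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

# '녕' is absent from the original Llama-2 vocab, so it needs 3 byte tokens:
print(byte_fallback("녕"))  # ['<0xEB>', '<0x85>', '<0x95>']

# Token counts for the Korean test sentence, copied from the tables above:
llama2 = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오',
          '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>',
          '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']
llama2_ko = ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']
print(len(llama2), len(llama2_ko))  # 29 7
```

The identical token lists for the English sentence show the other half of the design: the added Korean vocab and merges leave English tokenization unchanged, so the expanded model stays compatible with English inputs while roughly quartering Korean sequence length.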