augmxnt
/

shisa-base-7b-v1

Text Generation

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

jondurbin commited on Nov 28, 2023

Commit

f667067

·

1 Parent(s): fc0e492

Update README.md

Files changed (1) hide show

README.md +2 -2

README.md CHANGED Viewed

@@ -26,7 +26,7 @@ Here we also compare `shisa-base-7b-v1` to other recently-released similar class
 ![7B llm-jp-eval Performance](https://huggingface.co/augmxnt/mistral-7b-ja-v0.1/resolve/main/llm-jp-eval.ja.png)
 ## Tokenizer
-As mentioned in the introduction, our tokenizer is an extended version of the Mistral 7B tokenizer, with a vocab size of  120073 and aligned to 128K. The remaining unused tokens are assigned as average-weighted `<|extra_{idx}|>` tokens.
 We use the "Fast" tokenizer, which should be the default for `AutoTokenizer`, but if you have problems, make sure to check `tokenizer.is_fast` or to initialize with `use_fast=True`.
@@ -101,7 +101,7 @@ Mistralのトークン化器を12万トークンまで拡張し、日本語の
 ![7B llm-jp-eval パフォーマンス]()
 ## トークン化器
-序文で触れたように、私たちのトークン化器はMistral 7Bトークン化器の拡張版で、語彙サイズは120073であり、128Kに合わせられています。残りの未使用トークンは、平均重み付けされた`<|extra_{idx}|>`トークンとして割り当てられています。
 私たちは「Fast」トークン化器を使用しており、これは`AutoTokenizer`のデフォルトであるべきですが、問題がある場合は`tokenizer.is_fast`をチェックするか、`use_fast=True`で初期化することを確認してください。

 ![7B llm-jp-eval Performance](https://huggingface.co/augmxnt/mistral-7b-ja-v0.1/resolve/main/llm-jp-eval.ja.png)
 ## Tokenizer
+As mentioned in the introduction, our tokenizer is an extended version of the Mistral 7B tokenizer, with a vocab size of  120073 and aligned to 120128 for better performance. The remaining unused tokens are assigned as average-weighted `<|extra_{idx}|>` tokens.
 We use the "Fast" tokenizer, which should be the default for `AutoTokenizer`, but if you have problems, make sure to check `tokenizer.is_fast` or to initialize with `use_fast=True`.
 ![7B llm-jp-eval パフォーマンス]()
 ## トークン化器
+序文で触れたように、私たちのトークン化器はMistral 7Bトークン化器の拡張版で、語彙サイズは120073であり、120128に合わせられています。残りの未使用トークンは、平均重み付けされた`<|extra_{idx}|>`トークンとして割り当てられています。
 私たちは「Fast」トークン化器を使用しており、これは`AutoTokenizer`のデフォルトであるべきですが、問題がある場合は`tokenizer.is_fast`をチェックするか、`use_fast=True`で初期化することを確認してください。