ken11/albert-base-japanese-v1


ken11 committed on Dec 21, 2021

Commit 0b768a8

1 Parent(s): 8012371

add TF

Files changed (2)

README.md +29 -0
tf_model.h5 +3 -0

README.md CHANGED

@@ -24,6 +24,7 @@ widget:
 ### Fill-Mask
 This model uses SentencePiece as its tokenizer.
 Out of the box, there is [an issue where an extra token gets mixed in](https://ken11.jp/blog/sentencepiece-tokenizer-bug) after the `[MASK]` token, so the model needs to be used as follows:
+#### for PyTorch
 ```py
 from transformers import (
     AlbertForMaskedLM, AlbertTokenizerFast
@@ -51,6 +52,34 @@ print(tokenizer.convert_ids_to_tokens(result.tolist()))
 # ['英語', '心理学', '数学', '医学', '日本語']
 ```
+#### for TensorFlow
+```py
+from transformers import (
+    TFAlbertForMaskedLM, AlbertTokenizerFast
+)
+import tensorflow as tf
+tokenizer = AlbertTokenizerFast.from_pretrained("ken11/albert-base-japanese-v1")
+model = TFAlbertForMaskedLM.from_pretrained("ken11/albert-base-japanese-v1")
+text = "大学で[MASK]の研究をしています"
+tokenized_text = tokenizer.tokenize(text)
+del tokenized_text[tokenized_text.index(tokenizer.mask_token) + 1]
+input_ids = [tokenizer.cls_token_id]
+input_ids.extend(tokenizer.convert_tokens_to_ids(tokenized_text))
+input_ids.append(tokenizer.sep_token_id)
+inputs = {"input_ids": [input_ids], "token_type_ids": [[0]*len(input_ids)], "attention_mask": [[1]*len(input_ids)]}
+batch = {k: tf.convert_to_tensor(v, dtype=tf.int32) for k, v in inputs.items()}
+output = model(**batch)[0]
+result = tf.math.top_k(output[0, input_ids.index(tokenizer.mask_token_id)], k=5)
+print(tokenizer.convert_ids_to_tokens(result.indices.numpy()))
+# ['英語', '心理学', '数学', '医学', '日本語']
+```
 ## Training Data
 The model was trained on:
 - [The full text of Japanese Wikipedia](https://ja.wikipedia.org/wiki/Wikipedia:%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89)
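
The workaround in the README snippets deletes the token that directly follows `[MASK]` before the input IDs are built. A minimal sketch of just that step, using plain Python lists so it runs without downloading the model: the helper name `drop_token_after_mask` and the sample token list are hypothetical (the list is not real tokenizer output), and unlike the README code this version skips the deletion when `[MASK]` is the last token.

```python
def drop_token_after_mask(tokens, mask_token="[MASK]"):
    """Return a copy of `tokens` without the item right after the first mask token.

    Hypothetical helper mirroring the README's
    `del tokenized_text[tokenized_text.index(tokenizer.mask_token) + 1]`.
    """
    cleaned = list(tokens)                # leave the caller's list untouched
    mask_pos = cleaned.index(mask_token)  # raises ValueError if no mask is present
    if mask_pos + 1 < len(cleaned):       # guard: the mask may be the final token
        del cleaned[mask_pos + 1]
    return cleaned


# Illustrative token list only; "▁" stands in for the stray piece after [MASK].
tokens = ["▁大学", "で", "[MASK]", "▁", "の", "研究"]
print(drop_token_after_mask(tokens))
```

With the real tokenizer, the cleaned list would then be passed to `tokenizer.convert_tokens_to_ids` and wrapped with the CLS and SEP token IDs, as in the README code.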

tf_model.h5 ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:675280c441280bcfc97f943903cdbf69daad6823151346f3f785a440c15e69f6
+size 65122008
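
The three added lines are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. A minimal sketch of how a downloaded copy could be checked against such a pointer follows; the function names `parse_lfs_pointer` and `matches_pointer` and the local path in the usage comment are assumptions for illustration, not part of this repository.

```python
import hashlib

# Pointer contents as recorded in this commit (diff "+" markers removed).
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:675280c441280bcfc97f943903cdbf69daad6823151346f3f785a440c15e69f6
size 65122008
"""


def parse_lfs_pointer(text):
    """Split the 'key value' lines of a pointer file into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer):
    """Check a local file's byte size and hash against a parsed pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large weight files are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
# Usage, assuming the weights were saved locally as "tf_model.h5" (hypothetical path):
# matches_pointer("tf_model.h5", pointer)
```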