# Amharic WordPiece Tokenizer
This repo contains a WordPiece tokenizer trained on the Amharic subset of the OSCAR and mC4 datasets. It uses the same tokenization algorithm as the BERT tokenizer, but was trained from scratch on an Amharic text corpus with a vocabulary size of 24576.
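A from-scratch WordPiece training run like the one described above can be sketched with `train_new_from_iterator` from `transformers`. The OSCAR config name, batch size, base checkpoint, and output path below are illustrative assumptions, not necessarily the exact setup used for this tokenizer:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative sketch only: the dataset config, batching, and output
# path are assumptions, not the exact setup used to train this tokenizer.
dataset = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

def batch_iterator(batch_size=1000):
    # Yield raw Amharic text in batches so training can stream the corpus
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Reuse BERT's tokenization pipeline, then learn a new Amharic
# WordPiece vocabulary of 24576 tokens from scratch
base = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=24576)
tokenizer.save_pretrained("bert-amharic-tokenizer")
```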
## How to use
You can load the tokenizer from the Hugging Face Hub as follows.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
tokenizer.tokenize("α¨αααα αα αα» ααα΅ αα΅ααα΅ α΅ααα΅α αααΈαα α αα°α¨αα α΅αα α αα± α αα αα£αͺα« ααα αα»α α₯α α¨αααααα΅ αα³α ααα’")
```
Output:
```
['α¨ααα', '##α ', '##αα', 'αα»', 'ααα΅', 'αα΅ααα΅', 'α΅ααα΅α', 'αααΈαα', 'α αα°α¨αα', 'α΅αα', 'α αα±', 'α αα', 'αα£αͺα«', 'ααα', 'αα»α', 'α₯α', 'α¨αααα', '##αα΅', 'αα³α', 'αα', 'α’']
```
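In the output, pieces prefixed with `##` are WordPiece continuations that attach to the preceding token. Beyond inspecting tokens, the tokenizer can also be called directly to produce model-ready input IDs; this is the standard `transformers` API, sketched below with an illustrative Amharic word ("ሰላም", i.e. "hello") that is not taken from the example above:

```python
# Standard transformers usage; the input here is just an illustrative word
encoding = tokenizer("ሰላም")
print(encoding["input_ids"])   # token IDs, wrapped in [CLS] ... [SEP]
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```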