Amharic BPE Tokenizer

This repo contains a byte-pair encoding (BPE) tokenizer trained on the Amharic subset of the OSCAR dataset. It uses the same configuration as the GPT-2 tokenizer but was trained from scratch on Amharic text, with a vocabulary size of 24,000.
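For context, a GPT-2-style byte-level BPE tokenizer can be retrained from scratch with the `train_new_from_iterator` helper from Transformers. The sketch below is only illustrative, not the exact script used to train this tokenizer; the OSCAR config name `unshuffled_deduplicated_am` and the `batch_iterator` helper are assumptions.

from datasets import load_dataset
from transformers import AutoTokenizer

# Amharic subset of OSCAR (config name assumed)
dataset = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

def batch_iterator(batch_size=1000):
    # Yield batches of raw Amharic text for the tokenizer trainer
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Start from GPT-2's tokenizer configuration and learn a new
# 24,000-token vocabulary from scratch on the Amharic corpus
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=24000)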

How to use

You can load the tokenizer from the Hugging Face Hub as follows.

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

# Tokenize an Amharic sentence
tokenizer("αŠ α‰£α‹­αŠ• α‹«αˆ‹α‹¨ α‹¨α•αˆŒαŠ• α‰²αŠ¬α‰΅ αŠ₯α‰½αˆˆα‹‹αˆˆα‹α’")