Overview

This is a slightly smaller model trained on the deduplicated Sinhala portion of the OSCAR dataset. Since Sinhala is a low-resource language, only a handful of models have been trained for it, so this model is a useful starting point for further pretraining and downstream tasks.

Model Specification

The model chosen for training is RoBERTa, with the following specification (a configuration sketch follows the list):

  1. vocab_size=52000
  2. max_position_embeddings=514
  3. num_attention_heads=12
  4. num_hidden_layers=6
  5. type_vocab_size=1
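
For reference, here is a minimal sketch of how an untrained model with this configuration could be instantiated; any parameter not listed above (for example hidden_size) is assumed to keep its RobertaConfig default:

from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration mirroring the specification above; unlisted values
# fall back to the RobertaConfig defaults.
config = RobertaConfig(
    vocab_size=52000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# A randomly initialized RoBERTa masked-language model with this shape
model = RobertaForMaskedLM(config=config)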

How to Use

You can use this model directly with a pipeline for masked language modeling:

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the pretrained model and its tokenizer from the Hugging Face Hub
model = AutoModelForMaskedLM.from_pretrained("keshan/SinhalaBERTo")
tokenizer = AutoTokenizer.from_pretrained("keshan/SinhalaBERTo")

# Build a fill-mask pipeline and predict the masked token
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

fill_mask("මම ගෙදර <mask>.")
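
The pipeline returns a list of candidate completions. A short sketch of inspecting the top predictions, using the fields returned by the transformers fill-mask pipeline:

# Each prediction contains the predicted token and its score
for prediction in fill_mask("මම ගෙදර <mask>."):
    print(prediction["token_str"], prediction["score"])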