# EsperBERTo Model Card

## Model Description
EsperBERTo is a RoBERTa-like model trained from scratch on Esperanto using a large corpus drawn from the OSCAR and Leipzig Corpora Collection. It is designed for masked language modeling and other text-based prediction tasks, and is well suited to understanding and generating Esperanto text.
## Datasets
- OSCAR Corpus (Esperanto): Extracted from Common Crawl dumps, filtered by language classification.
- Leipzig Corpora Collection (Esperanto): Includes texts from news, literature, and Wikipedia.
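For reference, a minimal sketch of pulling the OSCAR Esperanto split with the Hugging Face `datasets` library; the `oscar` dataset script and its `unshuffled_deduplicated_eo` config name are assumptions based on the public Hub listing, and recent `datasets` versions may additionally require `trust_remote_code=True`:

```python
from datasets import load_dataset

# Assumed config name for the Esperanto portion of OSCAR on the Hub
oscar_eo = load_dataset("oscar", "unshuffled_deduplicated_eo", split="train")
print(oscar_eo[0]["text"][:200])  # peek at the first document
```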
## Preprocessing
- Trained a byte-level Byte-Pair Encoding (BPE) tokenizer with a vocabulary size of 52,000 tokens, as sketched below.
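A minimal sketch of training such a tokenizer with the Hugging Face `tokenizers` library; the corpus file paths are hypothetical placeholders:

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical paths to the raw Esperanto corpus files
paths = ["oscar.eo.txt", "leipzig.eo.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("EsperBERTo")  # writes vocab.json and merges.txt
```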
## Hyperparameters
- Number of Epochs: 1
- Batch Size per GPU: 64
- Checkpoint Save Interval: every 10,000 steps
- Maximum Saved Checkpoints: 2
- Loss Calculation: Prediction loss only
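If reproducing the run with the `Trainer` API, these settings map directly onto Hugging Face `TrainingArguments`; a minimal sketch, with a hypothetical `output_dir`:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",       # hypothetical output path
    overwrite_output_dir=True,
    num_train_epochs=1,              # one epoch
    per_device_train_batch_size=64,  # batch size per GPU
    save_steps=10_000,               # checkpoint every 10,000 steps
    save_total_limit=2,              # keep at most two checkpoints
    prediction_loss_only=True,       # compute prediction loss only
)
```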
## Software and Libraries
- Library: Hugging Face Transformers (version not specified)
- Training Script: `run_language_modeling.py`

## How to Use
```python
from transformers import pipeline

# Load the fill-mask pipeline with the pretrained model and tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="SamJoshua/EsperBERTo",
    tokenizer="SamJoshua/EsperBERTo",
)

# Predict the masked token
fill_mask("Jen la komenco de bela <mask>.")
```
## Evaluation Results
The model has not yet been evaluated on a standardized test set. Future updates will include evaluation metrics such as perplexity and accuracy on a held-out validation set.
## Intended Uses & Limitations
**Intended Uses:** This model is intended for researchers, developers, and language enthusiasts who wish to explore Esperanto language processing for tasks such as text generation, sentiment analysis, and more.

**Limitations:**
- The model was trained for only one epoch due to computational constraints, which may limit its handling of more complex language structures.
- Because the model was trained on public web text, it may inadvertently learn and replicate social biases present in the training data.
Feel free to contribute by fine-tuning the model on specific tasks or extending its training with more data or epochs; a minimal starting point is sketched below. This model serves as a baseline for further research and development in Esperanto language modeling.
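For example, a sketch of loading the checkpoint as a sequence classifier for downstream fine-tuning; the two-label setup is a hypothetical sentiment-analysis configuration, not part of the released model:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reuse the pretrained encoder with a fresh classification head;
# num_labels=2 is a hypothetical sentiment-analysis example.
tokenizer = AutoTokenizer.from_pretrained("SamJoshua/EsperBERTo")
model = AutoModelForSequenceClassification.from_pretrained(
    "SamJoshua/EsperBERTo", num_labels=2
)
```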