# EsperBERTo Model Card

## Model Description

EsperBERTo is a RoBERTa-like model trained from scratch on Esperanto, using a large corpus drawn from OSCAR and the Leipzig Corpora Collection. It is trained for masked language modeling and related text-prediction tasks, making it a solid starting point for understanding and generating Esperanto text.

### Datasets

- **OSCAR Corpus (Esperanto)**: Extracted from Common Crawl dumps and filtered by language classification.
- **Leipzig Corpora Collection (Esperanto)**: Includes texts from news, literature, and Wikipedia.

### Preprocessing

- Trained a byte-level Byte-Pair Encoding (BPE) tokenizer with a vocabulary size of 52,000 tokens (see the sketch below).
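
As a rough illustration, a tokenizer with this configuration could be trained with the Hugging Face `tokenizers` library. The file paths, `min_frequency`, and special tokens below are assumptions in the usual RoBERTa style, not the exact settings used:

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical paths to the raw Esperanto training text.
paths = ["oscar.eo.txt", "leipzig.eo.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=52_000,  # matches the model card
    min_frequency=2,    # assumed; a common default
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt into the given directory.
tokenizer.save_model("EsperBERTo")
```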

### Hyperparameters

- **Number of Epochs**: 1
- **Batch Size per GPU**: 64
- **Checkpoint Save Interval**: every 10,000 steps
- **Maximum Saved Checkpoints**: 2
- **Loss Calculation**: prediction loss only
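
For reference, these settings map onto the Transformers `TrainingArguments` roughly as follows; this is a sketch assuming the `Trainer` API, with the output directory as a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",       # placeholder path
    num_train_epochs=1,              # Number of Epochs
    per_device_train_batch_size=64,  # Batch Size per GPU
    save_steps=10_000,               # Checkpoint Save Interval
    save_total_limit=2,              # Maximum Saved Checkpoints
    prediction_loss_only=True,       # Loss Calculation
)
```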

### Software and Libraries

- **Library**: Hugging Face [Transformers](https://github.com/huggingface/transformers) (exact version not recorded)
- **Training Script**: `run_language_modeling.py` from the Transformers examples

## How to Use

You can use the model directly with a fill-mask pipeline:

```python
from transformers import pipeline

# Load the fill-mask pipeline with the EsperBERTo model and tokenizer.
fill_mask = pipeline(
    "fill-mask",
    model="SamJoshua/EsperBERTo",
    tokenizer="SamJoshua/EsperBERTo",
)

# Ask the model to fill in the masked word.
fill_mask("Jen la komenco de bela <mask>.")
```
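
The call returns a list of candidate completions, each a dictionary with `score`, `token`, `token_str`, and `sequence` keys, ordered from most to least likely.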

## Evaluation Results

The model has not yet been evaluated on a standardized test set. Future updates will include evaluation metrics such as perplexity and accuracy on a held-out validation set.
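
In the meantime, one rough way to probe the model is pseudo-perplexity: mask each token in turn and score the true token under the model. The helper below is a hypothetical sketch, not an official evaluation:

```python
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SamJoshua/EsperBERTo")
model = AutoModelForMaskedLM.from_pretrained("SamJoshua/EsperBERTo")
model.eval()

def pseudo_perplexity(text: str) -> float:
    """Mask each token in turn and average the negative log-likelihood."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    positions = range(1, len(ids) - 1)  # skip <s> and </s>
    nll = 0.0
    for pos in positions:
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, pos], dim=-1)
        nll -= log_probs[ids[pos]].item()
    return math.exp(nll / len(positions))

print(pseudo_perplexity("Jen la komenco de bela tago."))
```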

## Intended Uses & Limitations

**Intended Uses**: This model is intended for researchers, developers, and language enthusiasts who wish to explore Esperanto language processing for tasks such as text generation, sentiment analysis, and more.

**Limitations**:

- The model was trained for only one epoch due to computational constraints, which may limit its grasp of more complex language structures.
- Because the model is trained on public web text, it may inadvertently learn and replicate social biases present in the training data.

Feel free to contribute by fine-tuning the model on specific tasks or extending its training with more data or epochs; a minimal fine-tuning sketch follows below. This model serves as a baseline for further research and development in Esperanto language modeling.
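
As a starting point, here is a hypothetical sketch of fine-tuning for binary text classification with the `Trainer` API. The toy examples, label scheme, and output path are placeholders, not a tested recipe:

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("SamJoshua/EsperBERTo")
model = AutoModelForSequenceClassification.from_pretrained(
    "SamJoshua/EsperBERTo", num_labels=2
)

# Toy labeled sentences; replace with a real Esperanto dataset.
texts = ["Mi amas ĉi tiun filmon.", "La servo estis terura."]
labels = [1, 0]  # 1 = positive, 0 = negative

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./esperberto-cls", num_train_epochs=1),
    train_dataset=ToyDataset(texts, labels),
)
trainer.train()
```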