# EsperBERTo Model Card

## Model Description
EsperBERTo is a RoBERTa-like model trained from scratch on Esperanto, using a large corpus drawn from OSCAR and the Leipzig Corpora Collection. It is designed for masked language modeling and other text-based prediction tasks, and is well suited to understanding and generating Esperanto text.

### Datasets
- **OSCAR Corpus (Esperanto)**: Extracted from Common Crawl dumps, filtered by language classification.
- **Leipzig Corpora Collection (Esperanto)**: Includes texts from news, literature, and Wikipedia.
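For illustration, the Esperanto portion of OSCAR is available on the Hugging Face Hub and can be loaded with the `datasets` library. The identifiers below (`oscar`, `unshuffled_deduplicated_eo`) and the output file name are assumptions about how such a corpus could be assembled, not a record of the exact pipeline used for this model.

```python
# Hypothetical corpus download; dataset identifiers and paths are assumptions,
# not the exact pipeline used to build EsperBERTo's training data.
from datasets import load_dataset

# Esperanto subset of OSCAR (deduplicated Common Crawl text).
# Newer `datasets` releases may need `trust_remote_code=True` or a different OSCAR mirror.
oscar_eo = load_dataset("oscar", "unshuffled_deduplicated_eo", split="train")

# Flatten the corpus into one plain-text file for tokenizer training.
with open("oscar_eo.txt", "w", encoding="utf-8") as f:
    for row in oscar_eo:
        f.write(row["text"].replace("\n", " ") + "\n")
```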

### Preprocessing
- Trained a byte-level Byte-Pair Encoding (BPE) tokenizer with a vocabulary size of 52,000 tokens.
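A minimal sketch of training such a tokenizer with the `tokenizers` library is shown below; the input file and the RoBERTa-style special tokens are illustrative assumptions, while the vocabulary size matches the 52,000 stated above.

```python
# Illustrative tokenizer training; the corpus file and output directory are assumptions.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Learn a byte-level BPE vocabulary of 52,000 tokens from the raw Esperanto text.
tokenizer.train(
    files=["oscar_eo.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, which RobertaTokenizerFast can load later.
tokenizer.save_model("EsperBERTo")
```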

### Hyperparameters
- **Number of Epochs**: 1
- **Batch Size per GPU**: 64
- **Checkpoint Save Interval**: every 10,000 training steps
- **Maximum Saved Checkpoints**: 2
- **Loss Calculation**: prediction loss only
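These settings map directly onto the `Trainer` API in `transformers`. The sketch below shows that mapping under assumed paths, model size, and masking probability; it is not the exact command used to train the released checkpoint.

```python
# Illustrative training setup mirroring the hyperparameters above.
# Output paths, model configuration, and the 15% masking rate are assumptions.
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("EsperBERTo")

config = RobertaConfig(vocab_size=52_000, max_position_embeddings=514)
model = RobertaForMaskedLM(config=config)

# Dynamic masking for the masked-language-modeling objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    num_train_epochs=1,              # Number of Epochs
    per_device_train_batch_size=64,  # Batch Size per GPU
    save_steps=10_000,               # Checkpoint Save Interval
    save_total_limit=2,              # Maximum Saved Checkpoints
    prediction_loss_only=True,       # Loss Calculation
)

# `train_dataset` would be the tokenized Esperanto corpus prepared earlier.
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     data_collator=data_collator,
#     train_dataset=train_dataset,
# )
# trainer.train()
```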

### Software and Libraries
- **Transformers Library**: [Hugging Face Transformers](https://github.com/huggingface/transformers)
- **Training Script**: `run_language_modeling.py`

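## How to Use

The published checkpoint can be used directly through the `fill-mask` pipeline: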
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="SamJoshua/EsperBERTo",
    tokenizer="SamJoshua/EsperBERTo"
)

fill_mask("Jen la komenco de bela <mask>.")
```

## Evaluation Results
The model has not yet been evaluated on a standardized test set. Future updates will include evaluation metrics such as perplexity and accuracy on a held-out validation set.

## Intended Uses & Limitations
**Intended Uses**: This model is intended for researchers, developers, and language enthusiasts who wish to explore Esperanto language processing for tasks like text generation, sentiment analysis, and more.
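As one example of a downstream use, the checkpoint can serve as the encoder behind a sequence-classification head; the sketch below is a starting point only, and the label count and any training data are placeholders rather than part of this release.

```python
# Hypothetical fine-tuning starting point; num_labels and the downstream dataset
# are placeholders, not artifacts shipped with this model.
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("SamJoshua/EsperBERTo")
model = RobertaForSequenceClassification.from_pretrained(
    "SamJoshua/EsperBERTo", num_labels=2
)

# From here, tokenize a labeled Esperanto dataset and fine-tune with Trainer,
# exactly as with any other RoBERTa checkpoint.
```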

**Limitations**: 
- The model was trained for only one epoch due to computational constraints, which may limit its understanding of more complex language structures.
- As the model is trained on public web text, it may inadvertently learn and replicate social biases present in the training data.

Feel free to contribute to the model by fine-tuning on specific tasks or extending its training with more data or epochs. This model serves as a baseline for further research and development in Esperanto language modeling.