Pablo committed
Commit e951a81 • 1 Parent(s): 605d4d7
Format changes in project description
app.py CHANGED
@@ -57,10 +57,13 @@ st.markdown(
 [Flax/Jax Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104)
 organised by HuggingFace.
 
-All models are variations of RoBERTa-base trained from scratch in Spanish.
-We
+All models are variations of RoBERTa-base trained from scratch in Spanish using the mc4 dataset.
+We reduced the dataset size to 50 million documents to keep training times shorter, and also to be able to bias training examples based on their perplexity.
+
 The idea is to favour examples with perplexities that are neither too small (short, repetitive texts) or too long (potentially poor quality).
-
+* **Random** sampling simply takes documents at random to reduce the dataset size.
+* **Gaussian** rejects documents with a higher probability for lower and larger perplexities, based on a Gaussian function.
+
 The first models have been trained (250.000 steps) on sequence length 128, and training for Gaussian changed to sequence length 512 for the last 25.000 training steps.
 """
 )
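The two sampling strategies named in the added description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's code: the function names, the `mean` and `std` parameters, the unnormalised Gaussian weight, and the made-up perplexity values are all hypothetical.

```python
import math
import random


def gaussian_keep_probability(perplexity, mean, std):
    """Hypothetical weight: 1.0 at the mean, decaying towards both tails."""
    z = (perplexity - mean) / std
    return math.exp(-0.5 * z * z)


def random_sample(docs, keep_fraction, rng):
    """Keep each document with the same probability, ignoring perplexity."""
    return [d for d in docs if rng.random() < keep_fraction]


def gaussian_sample(docs, mean, std, rng):
    """Keep a document with probability given by the Gaussian weight,
    so very low and very high perplexities are rejected more often."""
    return [
        d for d in docs
        if rng.random() < gaussian_keep_probability(d["perplexity"], mean, std)
    ]


rng = random.Random(0)
# Made-up perplexities, spread uniformly for illustration only.
docs = [{"id": i, "perplexity": rng.uniform(0.0, 1000.0)} for i in range(10_000)]

uniform_subset = random_sample(docs, keep_fraction=0.25, rng=rng)
gaussian_subset = gaussian_sample(docs, mean=500.0, std=100.0, rng=rng)
```

In this sketch both subsets shrink the dataset, but only the Gaussian one concentrates on mid-range perplexities; where the centre and width should sit for real data is not stated in the description and would have to be estimated from the corpus.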