Update README.md
hooking-dev committed · Commit 95cb6ff · 1 Parent(s): ff85c3e

README.md CHANGED
@@ -73,7 +73,7 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))

 ### Training Data

-The model was trained on the OSCAR Hebrew dataset, a large-scale, open corpus consisting of diverse text collected from the web, reflecting common usage of Hebrew in various contexts.
+The model was trained on the OSCAR Hebrew dataset, a large-scale, open corpus consisting of diverse text collected from the web, reflecting common usage of Hebrew in various contexts. For more details on the dataset, see the citations related to OSCAR below.

 ### Training Procedure

@@ -116,3 +116,21 @@ If you use this model in your research, please cite it as follows:
 year={2024},
 url={https://huggingface.co/hooking-dev/Hebrew_v1.0}
 }
+
+@article{2022arXiv221210440J,
+author = {{Jansen}, Tim and {Tong}, Yangling and {Zevallos}, Victoria and {Ortiz Suarez}, Pedro},
+title = "{Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data}",
+journal = {arXiv e-prints},
+year = 2022,
+month = dec,
+eid = {arXiv:2212.10440},
+pages = {arXiv:2212.10440},
+doi = {10.48550/arXiv.2212.10440},
+archivePrefix = {arXiv},
+eprint = {2212.10440},
+primaryClass = {cs.CL},
+adsurl = {https://ui.adsabs.harvard.edu/abs/2022arXiv221210440J},
+adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+}