nicholasKluge committed · Commit 055c256 · Parent(s): ccaa04e

Update README.md

README.md CHANGED
@@ -29,7 +29,7 @@ co2_eq_emissions:
 ---
 # ToxicityModel
 
-The
+The ToxicityModel is a fine-tuned version of [RoBERTa](https://huggingface.co/roberta-base) that can be used to score the toxicity of a sentence.
 
 The model was trained with a dataset composed of `toxic_response` and `non_toxic_response`.
 
@@ -52,9 +52,9 @@ This repository has the [source code](https://github.com/Nkluge-correa/Aira) use
 
 ⚠️ THE EXAMPLES BELOW CONTAIN TOXIC/OFFENSIVE LANGUAGE ⚠️
 
-The
+The ToxicityModel was trained as an auxiliary reward model for RLHF training (its logit outputs can be treated as penalizations/rewards). Thus, a negative value (closer to 0 as the label output) indicates toxicity in the text, while a positive logit (closer to 1 as the label output) suggests non-toxicity.
 
-Here's an example of how to use the
+Here's an example of how to use the ToxicityModel to score the toxicity of a text:
 
 ```python
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
@@ -138,4 +138,4 @@ Idiot, Dumbass, Moron, Stupid, Fool, Fuck Face. Score: -7.300
 
 ## License
 
-
+ToxicityModel is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.
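The second hunk above cuts off the README's usage example after its first import line. As a rough sketch of the usage the added text describes (read the model's logit for a text, where a negative score indicates toxicity and a positive score indicates non-toxicity), something like the snippet below would work; the model id `nicholasKluge/ToxicityModel`, the single-logit head, and the helper `toxicity_score` are assumptions for illustration, not lines taken from the commit.

```python
# Sketch only: the model id, the single-logit head, and the helper name
# are assumptions, not taken from the README shown in the diff above.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "nicholasKluge/ToxicityModel"  # assumed repository id
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
model.eval()

def toxicity_score(text: str) -> float:
    """Return the raw logit: negative suggests toxic text, positive suggests non-toxic."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()  # assumes a single-logit (reward-style) head

print(toxicity_score("Have a nice day."))                      # expect a positive score
print(toxicity_score("Idiot, Dumbass, Moron, Stupid, Fool."))  # expect a negative score
```

Reading the logit directly, rather than taking an argmax over classes, matches the "penalizations/rewards" framing in the added text, where the sign and magnitude of the score carry the information.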