gilramos committed
Commit 254a990
1 Parent(s): 190e59b

Update README.md

Files changed (1)
  1. README.md +23 -62
README.md CHANGED
@@ -17,59 +17,29 @@ pipeline_tag: text-classification
 
 ---
 
-# HateBERTimbau
+# HateBERTimbau-YouTube
 
-**HateBERTimbau** is a foundational large language model for European **Portuguese** from **Portugal**, focused on hate speech content.
-
-It is an **encoder** of the BERT family, based on the Transformer neural architecture and
-developed over the [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) model, retrained on a dataset of 229,103 tweets specifically focused on potential hate speech.
+**HateBERTimbau-YouTube** is a transformer-based encoder model for identifying hate speech in Portuguese social media text. It is a fine-tuned version of the [HateBERTimbau](https://huggingface.co/knowhate/HateBERTimbau) model, trained on a dataset of 23,912 YouTube comments specifically focused on hate speech.
 
 ## Model Description
 
 - **Developed by:** [kNOwHATE: kNOwing online HATE speech: knowledge + awareness = TacklingHate](https://knowhate.eu)
 - **Funded by:** [European Union](https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/cerv-2021-equal)
-- **Model type:** Transformer-based model retrained for Hate Speech in Portuguese social media text
+- **Model type:** Transformer-based text classification model fine-tuned for Hate Speech detection in Portuguese social media text
 - **Language:** Portuguese
-- **Retrained from model:** [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased)
-
-Several models were developed by fine-tuning the base HateBERTimbau for Hate Speech detection, presented in the table below:
-
-| HateBERTimbau's Family of Models |
-|---------------------------------------------------------------------------------------------------------|
-| [**HateBERTimbau YouTube**](https://huggingface.co/knowhate/HateBERTimbau-youtube) |
-| [**HateBERTimbau Twitter**](https://huggingface.co/knowhate/HateBERTimbau-twitter) |
-| [**HateBERTimbau YouTube+Twitter**](https://huggingface.co/knowhate/HateBERTimbau-yt-tt) |
+- **Fine-tuned from model:** [knowhate/HateBERTimbau](https://huggingface.co/knowhate/HateBERTimbau)
 
 # Uses
 
-You can use this model directly with a pipeline for masked language modeling:
+You can use this model directly with a pipeline for text classification:
 
 ```python
 from transformers import pipeline
-unmasker = pipeline('fill-mask', model='knowhate/HateBERTimbau')
-
-unmasker("Os [MASK] são todos uns animais, deviam voltar para a sua terra.")
-
-[{'score': 0.6771652698516846,
-  'token': 12714,
-  'token_str': 'africanos',
-  'sequence': 'Os africanos são todos uns animais, deviam voltar para a sua terra.'},
- {'score': 0.08679857850074768,
-  'token': 15389,
-  'token_str': 'homossexuais',
-  'sequence': 'Os homossexuais são todos uns animais, deviam voltar para a sua terra.'},
- {'score': 0.03806231543421745,
-  'token': 4966,
-  'token_str': 'portugueses',
-  'sequence': 'Os portugueses são todos uns animais, deviam voltar para a sua terra.'},
- {'score': 0.035253893584012985,
-  'token': 16773,
-  'token_str': 'Portugueses',
-  'sequence': 'Os Portugueses são todos uns animais, deviam voltar para a sua terra.'},
- {'score': 0.023521048948168755,
-  'token': 8618,
-  'token_str': 'brancos',
-  'sequence': 'Os brancos são todos uns animais, deviam voltar para a sua terra.'}]
+classifier = pipeline('text-classification', model='knowhate/HateBERTimbau-youtube')
+
+classifier("as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é deixar de ser humano 😂😂")
+
+[{'label': 'Hate Speech', 'score': 0.9228119850158691}]
 
 ```
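If you need the raw scores or want to post-process them yourself, the same checkpoint can be loaded without the pipeline wrapper. This is a minimal sketch rather than code from the card: the input string is a placeholder, and the label names are read from the checkpoint's `id2label` mapping.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau-youtube")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau-youtube")

# Placeholder input; any Portuguese social media text works here
inputs = tokenizer("um comentário de exemplo", return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1)
pred = probs.argmax(dim=-1).item()
print(model.config.id2label[pred], round(probs[0, pred].item(), 3))
```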
 
@@ -79,8 +49,8 @@ Or this model can be used by fine-tuning it for a specific task/dataset:
 from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
 from datasets import load_dataset
 
-tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau")
-model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau")
+tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau-youtube")
+model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau-youtube")
 dataset = load_dataset("knowhate/youtube-train")
 
 def tokenize_function(examples):
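The hunk above stops at the `def tokenize_function(examples):` header; the function body and the `Trainer` setup are unchanged lines that the diff elides before resuming at `trainer.train()`. A minimal sketch of that middle section, continuing the snippet above, assuming the dataset exposes hypothetical `text` and `label` columns plus a `train` split, and using the hyperparameters listed under Training Hyperparameters:

```python
def tokenize_function(examples):
    # "text" is an assumed column name in knowhate/youtube-train
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=350)

# Tokenize all splits in batches
tokenized = dataset.map(tokenize_function, batched=True)

# Values taken from the Training Hyperparameters section of this card
training_args = TrainingArguments(
    output_dir="hatebertimbau-youtube-ft",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],  # assumes a "train" split exists
)
trainer.train()
```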
@@ -104,42 +74,33 @@ trainer.train()
 
 ## Data
 
-229,103 tweets associated with offensive content were used to retrain the base model.
+23,912 YouTube comments associated with offensive content were used to fine-tune the base model.
 
 ## Training Hyperparameters
 
-- Batch Size: 4 samples
-- Epochs: 100
-- Learning Rate: 5e-5 with the Adam optimizer
-- Maximum Sequence Length: 512 tokens
+- Batch Size: 32
+- Epochs: 3
+- Learning Rate: 2e-5 with the Adam optimizer
+- Maximum Sequence Length: 350 tokens
 
 # Testing
 
 ## Data
 
-We used two different datasets for testing: one of YouTube comments, [knowhate/youtube-test](https://huggingface.co/datasets/knowhate/youtube-test), and another of tweets, [knowhate/twitter-test](https://huggingface.co/datasets/knowhate/twitter-test).
+The dataset used to test this model was [knowhate/youtube-test](https://huggingface.co/datasets/knowhate/youtube-test).
 
-## Hate Speech Classification Results (with no fine-tuning)
+## Results
 
 | Dataset | Precision | Recall | F1-score |
 |:----------------|:-----------|:----------|:-------------|
-| **YouTube** | 0.928 | 0.108 | **0.193** |
-| **Twitter** | 0.686 | 0.211 | **0.323** |
+| **YouTube** | 0.856 | 0.892 | **0.874** |
 
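For reference, scores like those in the table can be reproduced in outline by classifying the test split and comparing predictions with the gold labels. The sketch below uses scikit-learn; the split name, the `text`/`label` column names, and the convention that label 1 means hate speech are assumptions, not facts from the card.

```python
from datasets import load_dataset
from sklearn.metrics import precision_recall_fscore_support
from transformers import pipeline

# Split and column names are assumptions about knowhate/youtube-test
test = load_dataset("knowhate/youtube-test", split="train")
classifier = pipeline("text-classification", model="knowhate/HateBERTimbau-youtube")

# Map the pipeline's string labels back to binary predictions
preds = [1 if out["label"] == "Hate Speech" else 0
         for out in classifier(test["text"], truncation=True)]

precision, recall, f1, _ = precision_recall_fscore_support(test["label"], preds, average="binary")
print(f"Precision={precision:.3f} Recall={recall:.3f} F1={f1:.3f}")
```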
 # BibTeX Citation
 
+Currently in peer review.
+
 ``` latex
-@mastersthesis{Matos-Automatic-Hate-Speech-Detection-in-Portuguese-Social-Media-Text,
-  title = {{Automatic Hate Speech Detection in Portuguese Social Media Text}},
-  author = {Matos, Bernardo Cunha},
-  month = nov,
-  year = {2022},
-  abstract = {{Online Hate Speech (HS) has been growing dramatically on social media and its uncontrolled spread has motivated researchers to develop a diversity of methods for its automated detection. However, the detection of online HS in Portuguese still merits further research. To fill this gap, we explored different models that proved to be successful in the literature to address this task. In particular, we have explored models that use the BERT architecture. Beyond testing single-task models we also explored multitask models that use the information on other related categories to learn HS. To better capture the semantics of this type of text, we developed HateBERTimbau, a retrained version of BERTimbau more directed to social media language, including potential HS targeting African descent, Roma, and LGBTQI+ communities. The performed experiments were based on CO-HATE and FIGHT, corpora of social media messages posted by the Portuguese online community that were labelled regarding the presence of HS, among other categories.
-The results achieved show the importance of considering the annotators' agreement on the data used to develop HS detection models. Comparing different subsets of data used for the training of the models, it was shown that, in general, higher agreement on the data leads to better results.
-HateBERTimbau consistently outperformed BERTimbau on both datasets, confirming that further pre-training of BERTimbau was a successful strategy to obtain a language model more suitable for online HS detection in Portuguese.
-The implementation of target-specific models and multitask learning has shown potential in obtaining better results.}},
-  language = {eng},
-  copyright = {embargoed-access},
+
 }
 ```