gilramos committed on
Commit e6e1d10
1 Parent(s): e8c7c0f

Update README.md

Files changed (1)
  1. README.md +106 -121
README.md CHANGED
@@ -2,164 +2,149 @@
  license: cc
  language:
  - pt
- pipeline_tag: text-classification
  tags:
  - Hate Speech
  - kNOwHATE
  widget:
- - text: "as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é deixar de ser humano kkk"
  ---
- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
  license: cc
  language:
  - pt
  tags:
  - Hate Speech
  - kNOwHATE
  widget:
+ - text: >-
+     Os [MASK] são todos uns animais, deviam voltar para a sua terra.
  ---
+ ---
+ <img align="left" width="140" height="140" src="https://ilga-portugal.pt/files/uploads/2023/06/logo_HATE_cores_page-0001-1024x539.jpg">
+ <p style="text-align: center;">&nbsp;&nbsp;&nbsp;&nbsp;This is the model card for HateBERTimbau.
+ You may be interested in some of the other models from the <a href="https://huggingface.co/knowhate">kNOwHATE project</a>.
+ </p>
+
+ ---
+ # HateBERTimbau
+
+ **HateBERTimbau** is a foundation large language model for European **Portuguese** from **Portugal**, targeted at hate speech content.
+
+ It is an **encoder** of the BERT family, based on the Transformer neural architecture and
+ built on the [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) model, retrained on a dataset of 229,103 tweets specifically focused on potential hate speech.
+
+ ## Model Description
+
+ - **Developed by:** [kNOwHATE: kNOwing online HATE speech: knowledge + awareness = TacklingHate](https://knowhate.eu)
+ - **Funded by:** [European Union](https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/cerv-2021-equal)
+ - **Model type:** Transformer-based model retrained for hate speech in Portuguese social media text
+ - **Language:** Portuguese
+ - **Retrained from model:** [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased)
+
+ Several models were developed for hate speech detection by fine-tuning this base HateBERTimbau; they are listed in the table below:
+
+ | HateBERTimbau's Family of Models |
+ |---------------------------------------------------------------------------------------------------------|
+ | [**HateBERTimbau YouTube**](https://huggingface.co/knowhate/HateBERTimbau-youtube) |
+ | [**HateBERTimbau Twitter**](https://huggingface.co/knowhate/HateBERTimbau-twitter) |
+ | [**HateBERTimbau YouTube+Twitter**](https://huggingface.co/knowhate/HateBERTimbau-yt-tt) |
 
+ # Uses
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ from transformers import pipeline
+ unmasker = pipeline('fill-mask', model='knowhate/HateBERTimbau')
+
+ unmasker("Os [MASK] são todos uns animais, deviam voltar para a sua terra.")
+
+ [{'score': 0.6771652698516846,
+   'token': 12714,
+   'token_str': 'africanos',
+   'sequence': 'Os africanos são todos uns animais, deviam voltar para a sua terra.'},
+  {'score': 0.08679857850074768,
+   'token': 15389,
+   'token_str': 'homossexuais',
+   'sequence': 'Os homossexuais são todos uns animais, deviam voltar para a sua terra.'},
+  {'score': 0.03806231543421745,
+   'token': 4966,
+   'token_str': 'portugueses',
+   'sequence': 'Os portugueses são todos uns animais, deviam voltar para a sua terra.'},
+  {'score': 0.035253893584012985,
+   'token': 16773,
+   'token_str': 'Portugueses',
+   'sequence': 'Os Portugueses são todos uns animais, deviam voltar para a sua terra.'},
+  {'score': 0.023521048948168755,
+   'token': 8618,
+   'token_str': 'brancos',
+   'sequence': 'Os brancos são todos uns animais, deviam voltar para a sua terra.'}]
+ ```
+
+ Alternatively, the model can be fine-tuned for a specific task or dataset:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
+ from datasets import load_dataset
+
+ tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau")
+ model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau")
+ dataset = load_dataset("knowhate/youtube-train")
+
+ def tokenize_function(examples):
+     return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)
+
+ tokenized_datasets = dataset.map(tokenize_function, batched=True)
+
+ training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=tokenized_datasets["train"],
+     eval_dataset=tokenized_datasets["validation"],
+ )
+
+ trainer.train()
+ ```
+
+ # Training
+
+ ## Data
+
+ 229,103 tweets associated with offensive content were used to retrain the base model.
+
+ ## Training Hyperparameters
+
+ - Batch Size: 4 samples
+ - Epochs: 100
+ - Learning Rate: 5e-5 with Adam optimizer
+ - Maximum Sequence Length: 512 sentence pieces
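For intuition, the masked-language-modelling objective used in this retraining can be illustrated with a toy masking function. This is an editorial sketch only, not the project's training code: `mask_tokens` is a hypothetical helper, and real BERT masking (e.g. Hugging Face's `DataCollatorForLanguageModeling`) additionally keeps or randomizes some selected tokens (the 80/10/10 rule), which is omitted here.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """Toy MLM masking: replace ~mask_prob of tokens with [MASK].

    Returns (masked, labels): labels hold the original token at masked
    positions and None elsewhere, so only masked slots contribute to
    the training loss.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)       # model must recover this token
        else:
            masked.append(tok)
            labels.append(None)      # position ignored by the loss
    return masked, labels

sentence = "os comentários de ódio espalham-se depressa nas redes sociais"
masked, labels = mask_tokens(sentence.split())
print(" ".join(masked))
```

During retraining, the encoder sees the masked sequence and is optimised to predict the original tokens at the masked positions.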
+
+ # Testing
+
+ ## Data
+
+ We used two different datasets for testing: one of YouTube comments, available [here](https://huggingface.co/datasets/knowhate/youtube-test), and one of tweets, available [here](https://huggingface.co/datasets/knowhate/twitter-test).
+
+ ## Hate Speech Classification Results (with no fine-tuning)
+
+ | Dataset         | Precision | Recall | F1-score  |
+ |:----------------|:----------|:-------|:----------|
+ | **YouTube**     | 0.928     | 0.108  | **0.193** |
+ | **Twitter**     | 0.686     | 0.211  | **0.323** |
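As a quick sanity check, the F1-scores in the table are the harmonic mean of the corresponding precision and recall columns:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the table's F1 column from its precision/recall columns
print(round(f1_score(0.928, 0.108), 3))  # YouTube -> 0.193
print(round(f1_score(0.686, 0.211), 3))  # Twitter -> 0.323
```

The very low recall with high precision reflects that, without fine-tuning, the model flags few messages as hate speech, but those it flags are usually correct.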
+
+ # BibTeX Citation
+
+ ```latex
+ @mastersthesis{Matos-Automatic-Hate-Speech-Detection-in-Portuguese-Social-Media-Text,
+   title     = {{Automatic Hate Speech Detection in Portuguese Social Media Text}},
+   author    = {Matos, Bernardo Cunha},
+   month     = nov,
+   year      = {2022},
+   abstract  = {{Online Hate Speech (HS) has been growing dramatically on social media and its uncontrolled spread has motivated researchers to develop a diversity of methods for its automated detection. However, the detection of online HS in Portuguese still merits further research. To fill this gap, we explored different models that proved to be successful in the literature to address this task. In particular, we have explored models that use the BERT architecture. Beyond testing single-task models we also explored multitask models that use the information on other related categories to learn HS. To better capture the semantics of this type of texts, we developed HateBERTimbau, a retrained version of BERTimbau more directed to social media language including potential HS targeting African descent, Roma, and LGBTQI+ communities. The performed experiments were based on CO-HATE and FIGHT, corpora of social media messages posted by the Portuguese online community that were labelled regarding the presence of HS among other categories. The results achieved show the importance of considering the annotator's agreement on the data used to develop HS detection models. Comparing different subsets of data used for the training of the models it was shown that, in general, a higher agreement on the data leads to better results. HateBERTimbau consistently outperformed BERTimbau on both datasets, confirming that further pre-training of BERTimbau was a successful strategy to obtain a language model more suitable for online HS detection in Portuguese. The implementation of target-specific models and multitask learning has shown potential in obtaining better results.}},
+   language  = {eng},
+   copyright = {embargoed-access},
+ }
+ ```
+
+ # Acknowledgements
+
+ This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306).
+ However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the kNOwHATE Project.
+ Neither the European Union nor the kNOwHATE Project can be held responsible.