---
base_model: google-bert/bert-base-cased
library_name: transformers
license: apache-2.0
language:
- en
tags:
- propaganda
---

# Model Card for identrics/EN_propaganda_detector



## Model Details

- **Developed by:** [`Identrics`](https://identrics.ai/)
- **Language:** English
- **License:** apache-2.0
- **Finetuned from model:** [`google-bert/bert-base-cased`](https://huggingface.co/google-bert/bert-base-cased)
- **Context window:** 512 tokens

## Model Description

This model is a fine-tuned version of google-bert/bert-base-cased for propaganda detection. It is a binary classifier that determines whether propaganda is present in a given text.
It was created by [`Identrics`](https://identrics.ai/) as part of the WASPer project. The detailed taxonomy of the full pipeline can be found [here](https://github.com/Identrics/wasper/).


## Uses

The model is designed as a binary classifier that determines whether a traditional or social media comment contains propaganda.

### Example

First, install the direct dependencies:
```
pip install transformers torch accelerate
```

Then the model can be downloaded and used for inference:
```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model = AutoModelForSequenceClassification.from_pretrained("identrics/EN_propaganda_detector", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("identrics/EN_propaganda_detector")

# Tokenize the input, truncating to the model's 512-token context window.
tokens = tokenizer(
    "Our country is the most powerful country in the world!",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

# Forward pass; output.logits holds one raw score per class.
output = model(**tokens)
print(output.logits)
```
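
The logits are raw, unnormalized scores for the two classes. Continuing the example above, here is a minimal sketch of turning them into a hard label and class probabilities; it assumes index 0 means non-propaganda and index 1 means propaganda, which matches the binary setup described above but should be verified against `model.config.id2label`:

```py
import torch

# Normalize the logits into class probabilities and take the most likely class.
probs = torch.softmax(output.logits, dim=-1)
predicted_class = int(torch.argmax(probs, dim=-1))

# Assumed label order: 0 = no propaganda, 1 = propaganda (check model.config.id2label).
print(predicted_class, probs.tolist())
```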

## Training Details

The training dataset consists of a balanced collection of English examples of propaganda and non-propaganda content, sourced from a variety of traditional and social media platforms and manually annotated by domain experts. Additionally, the dataset is enriched with AI-generated samples.

The model achieved an F1 score of **0.807** during evaluation.
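
For context, the reported score is the standard binary F1. A minimal sketch of how such a score could be reproduced on a labelled held-out set, assuming gold labels and model predictions as 0/1 arrays (the WASPer evaluation split is not distributed with the model, so the arrays below are placeholders):

```py
from sklearn.metrics import f1_score

# Placeholder data: substitute real annotations and model outputs.
gold_labels = [0, 1, 1, 0, 1]   # ground-truth annotations (hypothetical)
predictions = [0, 1, 0, 0, 1]   # argmax of the model's logits (hypothetical)

print(f1_score(gold_labels, predictions))
```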

## Compute Infrastructure
The model was fine-tuned on **2× NVIDIA Tesla V100 32 GB GPUs**.

## Citation [this section is to be updated soon]

If you find our work useful, please consider citing WASPer:

```
@article{...2024wasper,
  title={WASPer: Propaganda Detection in Bulgarian and English}, 
  author={....},
  journal={arXiv preprint arXiv:...},
  year={2024}
}
```