---
base_model: INSAIT-Institute/BgGPT-7B-Instruct-v0.2
library_name: peft
license: apache-2.0
language:
- bg
tags:
- propaganda
---

# Model Card for identrics/wasper_propaganda_detection_bg




## Model Details

- **Developed by:** [`Identrics`](https://identrics.ai/)
- **Language:** Bulgarian
- **License:** apache-2.0
- **Finetuned from model:** [`INSAIT-Institute/BgGPT-7B-Instruct-v0.2`](https://huggingface.co/INSAIT-Institute/BgGPT-7B-Instruct-v0.2)
- **Context window:** 8192 tokens

## Model Description

This model is a fine-tuned version of BgGPT-7B-Instruct-v0.2 for propaganda detection. It is effectively a binary classifier that determines whether propaganda is present in the input text.
It was created by [`Identrics`](https://identrics.ai/) within the scope of the WASPer project. The detailed taxonomy of the full pipeline can be found [here](https://github.com/Identrics/wasper/).


## Uses

The model is designed as a binary classifier that determines whether a comment from traditional or social media contains propaganda.

### Example

First install direct dependencies:
```
pip install transformers torch accelerate
```

Then the model can be downloaded and used for inference:
```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer from the same repository (the one this card describes).
model_id = "identrics/wasper_propaganda_detection_bg"
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize a Bulgarian comment and run it through the classifier.
tokens = tokenizer("Газа евтин, американското ядрено гориво евтино, пълно с фотоволтаици а пък тока с 30% нагоре. Защо ?", return_tensors="pt")
output = model(**tokens)
print(output.logits)  # raw, unnormalized scores for the two classes
```
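
The printed logits are raw scores rather than probabilities. A minimal sketch of one way to turn them into a prediction is shown below. The helper `logits_to_prediction` and the sample scores are hypothetical and not part of the model's API, and this card does not state which class index stands for propaganda, so check `model.config.id2label` before interpreting the result.

```python
import math


def logits_to_prediction(logits):
    """Convert raw class scores into (predicted_index, probabilities) via softmax."""
    # Subtract the maximum score for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    # The predicted class is the index with the highest probability.
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs


# Hypothetical scores standing in for `output.logits[0].tolist()`.
predicted, probs = logits_to_prediction([-1.2, 2.3])
print(predicted, [round(p, 3) for p in probs])
```

With the real model, pass `output.logits[0].tolist()` to the helper instead of the hard-coded list.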


## Training Details
The training dataset for the model consists of a balanced collection of Bulgarian examples, including both propaganda and non-propaganda content. These examples were sourced from a variety of traditional media and social media platforms and manually annotated by domain experts. Additionally, the dataset is enriched with AI-generated samples.

The model achieved an F1 score of **0.836** during evaluation.
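
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows the standard binary formula. The counts are hypothetical and chosen only for illustration: they are not the evaluation data behind the reported score, and this card does not say which averaging scheme was used.

```python
def f1_score(tp, fp, fn):
    """Binary F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts, for illustration only.
print(round(f1_score(tp=80, fp=20, fn=10), 3))
```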

## Compute Infrastructure
This model was fine-tuned on **2x NVIDIA Tesla V100 32GB** GPUs.

## Citation [this section is to be updated soon]

If you find our work useful, please consider citing WASPer:

```
@article{...2024wasper,
  title={WASPer: Propaganda Detection in Bulgarian and English}, 
  author={....},
  journal={arXiv preprint arXiv:...},
  year={2024}
}
```