---
base_model: google-bert/bert-base-cased
library_name: peft
license: apache-2.0
language:
- en
tags:
- propaganda
---
# Model Card for identrics/EN_propaganda_detector
## Model Details
- **Developed by:** [`Identrics`](https://identrics.ai/)
- **Language:** English
- **License:** apache-2.0
- **Finetuned from model:** [`google-bert/bert-base-cased`](https://huggingface.co/google-bert/bert-base-cased)
- **Context window:** 512 tokens
## Model Description
This model is a fine-tuned version of google-bert/bert-base-cased for a propaganda detection task. It is effectively a binary classifier, determining whether propaganda is present in the input text.
This model was created by [`Identrics`](https://identrics.ai/) within the scope of the WASPer project.
## Uses
The model is intended to be used as a binary classifier that identifies whether propaganda is present in a string, such as a comment from a social media site.
### Example
First, install the required dependencies:
```bash
pip install transformers torch accelerate
```
Then the model can be downloaded and used for inference:
```py
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("identrics/EN_propaganda_detector", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("identrics/EN_propaganda_detector")

# Tokenize an example comment and run it through the model
tokens = tokenizer("Our country is the most powerful country in the world!", return_tensors="pt")
output = model(**tokens)

# Raw, unnormalized scores for the two classes
print(output.logits)
```
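The raw logits can then be turned into a class prediction. Below is a minimal follow-up sketch; the assumption that class index 1 corresponds to "propaganda" is illustrative only, so check `model.config.id2label` for the mapping actually stored with the model.

```py
import torch

# Convert logits to probabilities and pick the most likely class.
# NOTE: treating index 1 as "propaganda" is an assumption for illustration;
# consult model.config.id2label for the real label mapping.
probabilities = torch.softmax(output.logits, dim=-1)
predicted_class = int(probabilities.argmax(dim=-1))
print(f"propaganda detected: {predicted_class == 1} (p={probabilities[0, predicted_class]:.3f})")
```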
## Training Details
The training dataset for the model is a balanced set of 840 English examples containing both propaganda and non-propaganda content. These examples were collected from a variety of traditional media and social media sources, ensuring a diverse range of content. Additionally, the training dataset was enriched with AI-generated samples. The total distribution of the training data is shown in the table below:
The model was then tested on a smaller evaluation dataset, achieving an F1 score of 0.807.
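For readers who want to reproduce a comparable setup, the sketch below shows a standard Hugging Face `Trainer` fine-tuning loop for a binary sequence classifier on top of `google-bert/bert-base-cased`. It is an illustrative assumption, not the authors' exact configuration: the CSV file names, hyperparameters, and metric code are placeholders, and it additionally requires the `datasets` and `scikit-learn` packages on top of the dependencies listed above.

```py
# Illustrative fine-tuning sketch (not the exact configuration used for this model).
# Assumes a CSV with columns "text" and "label" (0 = no propaganda, 1 = propaganda).
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# "propaganda_train.csv" / "propaganda_eval.csv" are hypothetical file names.
dataset = load_dataset(
    "csv",
    data_files={"train": "propaganda_train.csv", "eval": "propaganda_eval.csv"},
)

def tokenize(batch):
    # Truncate to the 512-token context window mentioned above.
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}

training_args = TrainingArguments(
    output_dir="propaganda-detector",
    num_train_epochs=3,              # illustrative hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["eval"],
    tokenizer=tokenizer,             # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)

trainer.train()
```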
## Citation
If you find our work useful, please consider citing the WASPer project.
### Framework versions
- PEFT 0.11.1