---
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- FacebookAI/roberta-base
---
**Overview**

***<code>authdetect</code>*** is a classification model for detecting authoritarian discourse in political speeches, leveraging a novel approach to studying latent political concepts through language modeling. Rather than relying on predefined rules or rigid definitions of authoritarian discourse, the model operates on the premise that authoritarian leaders naturally exhibit such discourse in their speech patterns. Essentially, the model assumes that "authoritarians talk like authoritarians," allowing it to discern instances of authoritarian rhetoric in speech segments. Structured as a regression problem with weak-supervision logic, the model scores text segments by their association with either authoritarian or democratic discourse. By training on speeches from both authoritarian and democratic leaders, it learns to distinguish between these two forms of political rhetoric.

**Data**

The model is fine-tuned on top of the <code>roberta-base</code> model using 77 years of speech data from the UN General Assembly. The training design combines the transcripts of political speeches in English with a weak-supervision setup in which the training data are annotated with the V-Dem polyarchy index (i.e., polyarchic status) as the reference labels. The model is trained to predict the index value of a speech, linking the presented narratives to the quality of democracy of the speaker's country (rather than to the speaker individually). The quality of the corpus ensures robust temporal (1946–2022) and spatial (197 countries) representation, resulting in a well-balanced training dataset. Although the training data are domain-specific (the UN General Assembly), the model trained on the UNGD corpus appears to be robust across various sub-domains, demonstrating its capacity to scale well across regions and contexts. Rather than using whole speeches as input data for training, I use a sliding window of sentence trigrams, splitting the raw transcripts into uniform snippets of text that map the political language of world leaders. As the goal is to model the varying context of the ideas presented in the analyzed speeches rather than the context of the UN General Assembly debates, the main focus is on the particularities of the language of the reference groups (authoritarian/democratic leaders). The final dataset counts 1,062,286 sentence trigrams annotated with EDI scores inherited from the parent documents (μ = 0.430, 95% CI [0.429, 0.430]).
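The sliding-window logic described above can be sketched in a few lines. The function name and sample sentences here are illustrative only; the full inference example below defines its own equivalent helper, <code>create_ngram</code>:

```python
# Minimal sketch of the sliding-window trigram logic (illustrative names):
# a speech of n sentences yields n - 2 overlapping sentence trigrams.
def sentence_trigrams(sentences):
    return [" ".join(sentences[i:i + 3]) for i in range(len(sentences) - 2)]

speech = ["S1.", "S2.", "S3.", "S4.", "S5."]
trigrams = sentence_trigrams(speech)
print(len(trigrams))   # 3
print(trigrams[0])     # S1. S2. S3.
```

Each trigram inherits the EDI score of its parent speech, which is what turns document-level labels into segment-level weak supervision.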

**Usage**

The model is designed with accessibility in mind, allowing anyone to use it. The example below contains a simplified inference pipeline, aimed primarily at social scientists and their empirical research needs. In addition, the repository includes a Jupyter notebook and a sample corpus that can be downloaded, uploaded to Google Drive, and tested "in full" in Google Colab, free of charge. Similar analyses can be performed on any spreadsheet with just two columns: a document ID and the raw text. For users with fewer technical skills, there is also a video tutorial on how to start analyzing your data in a matter of minutes.
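A minimal sketch of that two-column workflow: expand each document into sentence trigrams, score them, and aggregate back to the document level. The column names (<code>doc_id</code>, <code>text</code>), the naive sentence splitter, and the mean-based aggregation are assumptions for illustration; the full pipeline below uses trankit for sentence splitting and the actual model for scoring.

```python
import pandas as pd

# Two-column input: document ID + raw text (column names are assumptions).
docs = pd.DataFrame({
    "doc_id": ["speech_01", "speech_02"],
    "text": [
        "First point. Second point. Third point. Fourth point.",
        "Opening claim. Supporting claim. Closing claim.",
    ],
})

# Expand each document into sentence trigrams, keeping the document ID.
# Naive sentence split for the sketch; the full pipeline uses trankit.
rows = []
for _, row in docs.iterrows():
    sents = [s for s in row["text"].split(". ") if s]
    for i in range(len(sents) - 2):
        rows.append({"doc_id": row["doc_id"],
                     "sent_trigram": " ".join(sents[i:i + 3])})
trigram_df = pd.DataFrame(rows)

# Here you would call model.predict(trigram_df["sent_trigram"].tolist());
# a constant stand-in keeps the sketch runnable without the model download.
trigram_df["predict"] = 0.5

# One possible document-level score: the mean over a document's trigrams.
doc_scores = trigram_df.groupby("doc_id")["predict"].mean()
print(doc_scores)
```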

For more details, evaluation tests, and discussion, please refer to the original paper (see details below).

**Full inference pipeline** (from raw text to sentence trigrams with authoritarian discourse indices)
```python
# install the required libraries if needed (run in a shell):
# pip install simpletransformers
# pip install trankit
# pip install sweetviz

# load all libraries
import simpletransformers.classification as cl
import trankit
import pandas as pd
import sweetviz as sv  # optional; not used in this minimal snippet

# sample text (excerpt from Donald Trump's inaugural address in 2017)
sample_text = 'We must protect our borders from the ravages of other countries making our products, stealing our companies, and destroying our jobs. Protection will lead to great prosperity and strength. I will fight for you with every breath in my body – and I will never, ever let you down. America will start winning again, winning like never before. We will bring back our jobs. We will bring back our borders. We will bring back our wealth. And we will bring back our dreams. We will build new roads, and highways, and bridges, and airports, and tunnels, and railways all across our wonderful nation. We will get our people off of welfare and back to work – rebuilding our country with American hands and American labor. We will follow two simple rules: Buy American and Hire American. We will seek friendship and goodwill with the nations of the world – but we do so with the understanding that it is the right of all nations to put their own interests first.'

# load the trankit pipeline with the English model; this pipeline uses a deep
# learning model for sentence tokenization (more precise than rule-based models)
p = trankit.Pipeline(lang='english', embedding='xlm-roberta-base', gpu=True, cache_dir='./cache')

# split the text into sentences
sentences_raw = pd.DataFrame.from_dict(p.ssplit(sample_text))

# normalize the nested output into a flat dataframe
sentences_norm = pd.json_normalize(sentences_raw['sentences'].tolist())

# helper function for creating sentence trigrams
def create_ngram(text):
    no_steps = len(text) - 2
    indexes = [list(range(x, x + 3)) for x in range(no_steps)]
    return [' '.join(text[i] for i in index) for index in indexes]

# create sentence trigrams
sentence_trigram = create_ngram(sentences_norm['text'].tolist())

# create a DataFrame with the sentence trigrams
sentence_df = pd.DataFrame({'sent_trigram': sentence_trigram})

# load the pretrained authdetect model from the Hugging Face Hub
model = cl.ClassificationModel("roberta", "mmochtak/authdetect")

# apply the model to the prepared sentence trigrams
prediction = model.predict(to_predict=sentence_df["sent_trigram"].tolist())

# add the predicted scores to the existing dataframe
sentence_df = sentence_df.assign(predict=prediction[1])

print(sentence_df)
```

**If you use the model, please cite:**
```
@article{mochtak_chasing_2024,
  title = {Chasing the {Authoritarian} {Specter}: {Detecting} {Authoritarian} {Discourse} with {Large} {Language} {Models}},
  volume = {forthcoming},
  abstract = {The paper introduces a deep-learning model fine-tuned for detecting authoritarian discourse in political speeches. Set up as a regression problem with weak supervision logic, the model is trained for the task of classification of segments of text for being/not being associated with authoritarian discourse. Rather than trying to define what an authoritarian discourse is, the model builds on the assumption that authoritarian leaders inherently define it. In other words, authoritarian leaders talk like authoritarians. When combined with the discourse defined by democratic leaders, the model learns the instances that are more often associated with authoritarians on the one hand and democrats on the other. The paper discusses several evaluation tests using the model and advocates for its usefulness in a broad range of research problems. It presents a new methodology for studying latent political concepts and positions it as an alternative to more traditional research strategies.},
  language = {en},
  journal = {European Journal of Political Research},
  author = {Mochtak, Michal},
  year = {2024},
}
```