---
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- FacebookAI/roberta-base
---
# Overview

**authdetect** is a classification model for detecting authoritarian discourse in political speeches, leveraging a novel approach to studying latent political concepts through language modeling. Rather than relying on predefined rules or rigid definitions of authoritarian discourse, the model operates on the premise that authoritarian leaders naturally exhibit such discourse in their speech patterns. In essence, the model assumes that "authoritarians talk like authoritarians," allowing it to discern instances of authoritarian rhetoric in speech segments. Structured as a regression problem with a weak-supervision logic, the model scores text segments according to their association with either authoritarian or democratic discourse. By training on speeches from both authoritarian and democratic leaders, it learns to distinguish between these two distinct forms of political rhetoric.
# Data
The model is fine-tuned on top of the roberta-base model using 77 years of speech data from the UN General Assembly. The training design combines English transcripts of political speeches with a weak-supervision setup under which the training data are annotated with the V-Dem polyarchy index (i.e., polyarchic status) as the reference labels. The model is trained to predict the index value of a speech, linking the presented narratives with the quality of democracy in the speaker's country (rather than with the speaker personally). The corpus offers robust temporal (1946–2022) and spatial (197 countries) coverage, resulting in a well-balanced training dataset. Although the training data are domain-specific (the UN General Assembly), the model trained on the UNGD corpus appears to be robust across various sub-domains, demonstrating its capacity to scale well across regions and contexts. Rather than using whole speeches as input data for training, I use a sliding window of sentence trigrams, splitting the raw transcripts into uniform snippets of text that map the political language of world leaders. As the goal is to model the varying context of the ideas presented in the analyzed speeches rather than the context of the UN General Assembly debates, the main focus is on the particularities of the language of the reference groups (authoritarian/democratic leaders). The final dataset comprises 1,062,286 sentence trigrams annotated with EDI scores inherited from the parent documents (μ = 0.430, 95% CI [0.429, 0.430]).
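The weak-supervision logic described above can be sketched in a few lines: each sentence trigram simply inherits the democracy score of the speech it was cut from. The document IDs, scores, and trigram strings below are invented for illustration and are not the actual training data or pipeline.

```python
import pandas as pd

# Illustrative sketch of the weak-supervision labeling: every sentence trigram
# inherits the V-Dem polyarchy (EDI) score of its parent speech.
# All values below are invented for demonstration.
speeches = pd.DataFrame({
    "doc_id": ["PRK_1987", "NOR_1987"],
    "edi": [0.08, 0.91],  # country-year polyarchy scores (invented values)
    "trigrams": [
        ["s1 s2 s3", "s2 s3 s4", "s3 s4 s5"],  # sliding-window sentence trigrams
        ["t1 t2 t3"],
    ],
})

# one training row per trigram, with the parent document's score as the label
train = (
    speeches.explode("trigrams")
    .rename(columns={"trigrams": "text", "edi": "labels"})
    .reset_index(drop=True)
)
print(train[["doc_id", "text", "labels"]])
```

Each row of `train` is then a (text, label) pair suitable for regression fine-tuning.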
# Usage
The model is designed with accessibility in mind, allowing anyone to use it. The example below contains a simplified inference pipeline, with a primary focus on social scientists and their empirical research needs. In addition, the repository includes a comprehensive walkthrough tutorial (authdetect/tutorial/) with an interactive Jupyter notebook and a sample corpus that can be downloaded, uploaded to Google Drive, and tested "in full" in Google Colab, free of charge. Similar analyses can be performed on any spreadsheet with just two columns: a document ID and the raw text. For users with fewer technical skills, there is also a video tutorial showing how to start analyzing your data in a matter of minutes: https://www.youtube.com/watch?v=CRy9uxMChoE.

For more details, evaluation tests, and discussion, please refer to the original paper (see citation below) and the official Zenodo repository (https://zenodo.org/records/13920400).
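As a minimal sketch of the two-column input format mentioned above (the file contents and column names `doc_id`/`text` are assumptions for illustration, not a requirement of the model):

```python
import io
import pandas as pd

# A stand-in for a user's spreadsheet: one document ID column and one raw-text
# column. In practice you would load your own file, e.g. pd.read_csv("corpus.csv").
csv_data = io.StringIO(
    "doc_id,text\n"
    "speech_001,First sentence. Second sentence. Third sentence.\n"
    "speech_002,Another speech with its own text.\n"
)
corpus = pd.read_csv(csv_data)

# sanity-check the expected layout before running the inference pipeline
assert list(corpus.columns) == ["doc_id", "text"]
print(corpus)
```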
## Simplified inference pipeline (from raw text to sentence trigrams with authoritarian discourse indices)
```bash
# install required libraries if needed
pip install simpletransformers
pip install trankit==1.1.1
```

```python
# load all libraries
import simpletransformers.classification as cl
import trankit
import pandas as pd

# sample text (excerpt from the UNGD 2024 speech delivered by Song Kim,
# Permanent Representative of the Democratic People's Republic of Korea at the UN)
sample_text = "Joining here are the member states of NATO, which is an outside force beyond the region and an exclusive military bloc. They are strengthening military cooperation with the U.S. and ROK, abusing the signboard of UN command, which should have been dismantled decades ago, in accordance with the UNGA resolution. They are storing up military confrontation still further by deploying warships and aircrafts in the hotspot region of the Korean Peninsula. Such being the case, they blame us for threatening them. and the peace and stability of the region and beyond with nuclear weapons. Then who had developed and used nuclear weapons against humanity for the first time in history? Who has introduced nuclear weapons into the Korean Peninsula in the last century and posed a nuclear threat to the DPRK over the century? Who on earth is talking unhesitatingly about the end of regime of a sovereign state and maintaining first use of nuclear weapons against the DPRK as its national policy? It is not that the DPRK's position of nuclear weapons makes the U.S. hostile towards us."

# load the trankit pipeline with the English model; this pipeline uses a deep
# learning model for sentence tokenization (much more precise than rule-based models)
p = trankit.Pipeline(lang='english', embedding='xlm-roberta-base', gpu=True, cache_dir='./cache')

# split the text into sentences
sentences_raw = pd.DataFrame.from_dict(p.ssplit(sample_text))

# normalize the nested output into a flat dataframe
sentences_norm = pd.json_normalize(sentences_raw['sentences'].tolist())

# helper function for creating sentence trigrams with a sliding window
def create_ngram(text):
    no_steps = len(text) - 2
    indexes = [list(range(x, x + 3)) for x in range(no_steps)]
    return [' '.join(text[i] for i in index) for index in indexes]

# create sentence trigrams
sentence_trigram = create_ngram(sentences_norm['text'].tolist())

# create a DataFrame with sentence trigrams
sentence_df = pd.DataFrame({'sent_trigram': sentence_trigram})

# load the pretrained authdetect model from the Hugging Face Hub
model = cl.ClassificationModel("roberta", "mmochtak/authdetect")

# apply the model to the prepared sentence trigrams
prediction = model.predict(to_predict=sentence_df["sent_trigram"].tolist())

# add scores to the existing dataframe
sentence_df = sentence_df.assign(predict=prediction[1])
print(sentence_df)
```
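Once trigram-level scores are available, a natural follow-up (a suggestion, not part of the pipeline above) is to aggregate them into a single document-level index, e.g. the mean score per document. The `predict` values below are invented stand-ins for actual model outputs.

```python
import pandas as pd

# Illustrative trigram-level output; the 'predict' values are invented
# stand-ins for the scores returned by model.predict in the pipeline above.
sentence_df = pd.DataFrame({
    "doc_id": ["speech_001"] * 3 + ["speech_002"] * 2,
    "sent_trigram": ["a b c", "b c d", "c d e", "x y z", "y z w"],
    "predict": [0.21, 0.35, 0.28, 0.64, 0.58],
})

# aggregate trigram scores into one discourse index per document
doc_scores = sentence_df.groupby("doc_id")["predict"].agg(["mean", "std", "count"])
print(doc_scores)
```

Reporting the standard deviation and trigram count alongside the mean helps flag documents whose score rests on only a handful of snippets.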
# Known biases and issues
This model, like all machine learning models, exhibits biases shaped by its training data and task-specific nuances. Trained primarily on speeches from the UN General Assembly, it has learned discourse patterns specific to that context, which may influence how it places leaders along the authoritarian-democratic spectrum. This limitation is compounded by a slight imbalance in the training data, which skews towards authoritarian discourse (mean = 0.430). Although no systematic bias was detected in testing, the model may occasionally lean towards assigning lower values in certain cases. Additionally, the model's classification may be sensitive to cultural or ideological markers, such as religious phrases commonly used by leaders from majority-Muslim countries, or ideological language like "comrades," which is often associated with authoritarian states. These biases can influence the model's predictions and may be more apparent with shorter texts or less structured data formats, such as tweets or informal statements. While the model performs best with longer texts, both qualitative and quantitative evaluation on any new format is highly recommended to ensure robust performance. Fine-tuning may be necessary to mitigate specific biases and enhance reliability across different applications.
If you use the model, please cite:
```bibtex
@article{mochtak_chasing_2024,
  title = {Chasing the authoritarian spectre: {Detecting} authoritarian discourse with large language models},
  issn = {1475-6765},
  shorttitle = {Chasing the authoritarian spectre},
  url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/1475-6765.12740},
  doi = {10.1111/1475-6765.12740},
  journal = {European Journal of Political Research},
  author = {Mochtak, Michal},
  year = {2024},
  keywords = {authoritarian discourse, deep learning, detecting authoritarianism, model, political discourse},
}
```