Detoxifying the Commons

Published October 31, 2024

We’re releasing the Toxic Commons Collection, a toolkit and pipeline for removing harmful content from training data, especially multilingual, historical, and OCR data. The release includes the Toxic Commons annotations, our classifier Celadon, the accompanying preprint, and the code used to produce them.

Large pre-training corpora often contain a lot of harmful content, including stereotypes about certain groups of people and offensive language. Reducing harmful content in these corpora is non-trivial and an active research area. The problems compound when we consider using public domain data. Earlier this year, we released a collection of cultural heritage texts as part of the largest public domain pre-training corpus. These datasets are composed of text from monographs and periodicals in several languages, most of them older texts from the 18th and 19th centuries.

We found that existing toxicity classifiers work very poorly on our data because our cultural heritage data are:

  • multilingual: data in Common Corpus primarily covers English, French, Dutch, Spanish, German and Italian, in addition to small amounts of data from other languages.
  • historical: what counts as acceptable language has changed drastically over the last few centuries. We need to remove forms of biased and harmful language that were more common in past centuries and that most classifiers were never trained to detect.
  • OCR data: the vast majority of the texts in Common Corpus come from digitized print sources. Depending on when and how they were digitized, some of the texts contain significant OCR errors, which makes it even harder for existing toxicity classifiers to work well on this data.

Our classifier, Celadon, is a modified DeBERTa-v3-small model (~140M parameters), which we trained from scratch on 600k annotated samples from Toxic Commons. The annotations were generated with Llama 3.1 8B; see the preprint for details. Celadon identifies five kinds of harmful content (a usage sketch follows the list below):

  • Race and origin-based bias: includes racism as well as bias against someone’s country or region of origin or immigration status, especially immigrant or refugee status.
  • Gender and sexuality-based bias: includes sexism and misogyny, homophobia, transphobia, and sexual harassment.
  • Religious bias: any bias or stereotype based on someone’s religion.
  • Ability bias: bias according to someone’s physical, mental, or intellectual ability or disability.
  • Violence and abuse: overly graphic descriptions of violence, threats of violence, or calls for or incitement of violence.
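As a rough illustration of how Celadon can be applied, here is a minimal scoring sketch using the transformers library. The model identifier, the need for trust_remote_code, and the output shape (one 0–3 severity score per dimension) are assumptions based on the description above rather than a verbatim excerpt from our code; check the collection and the preprint for the exact interface.

```python
# Minimal scoring sketch -- the model ID, output shape, and label order are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "PleIAs/celadon"  # assumed Hugging Face Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, trust_remote_code=True  # the multi-head variant may ship custom modeling code
)
model.eval()

DIMENSIONS = [
    "race_and_origin",
    "gender_and_sexuality",
    "religion",
    "ability",
    "violence_and_abuse",
]

text = "A passage from a digitized 19th-century periodical."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits  # assumed shape: (batch, 5 dimensions, 4 severity levels)

severity = logits.argmax(dim=-1).squeeze(0).tolist()
print(dict(zip(DIMENSIONS, severity)))  # e.g. {"race_and_origin": 0, ..., "violence_and_abuse": 1}
```

Running the same call in batches is what makes classifier-based annotation so much cheaper than LLM annotation, as discussed below.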

We used Celadon to identify toxic content in our pre-training data and then, depending on the toxicity level, either generated a content warning for the text or synthetically rewrote it to remove the harm. The reason we trained a classifier for this step, rather than using Llama 3.1 8B directly, is cost: annotating 100k samples takes 3.4 hours with Llama but only 5 minutes with Celadon (3.4 hours is roughly 204 minutes), making Celadon over 40x faster than LLM annotation.

Based on a text's toxicity scores across the five dimensions, we can handle it in a few ways. The most obvious treatment is to remove toxic content, especially content with very high toxicity scores. This would undoubtedly lead to less toxic model behavior, but it may also have unintended consequences. Filtering out harmful content has been shown to disproportionately remove texts by or about marginalized groups, such as LGBTQ+ people, and texts in minoritized dialects like African American English (Dodge et al., 2021; Zhou et al., 2021). This is a problem because public domain data is already scarce: removing too much data may leave a dataset too small for training a language model. Instead, we propose two treatments for content that is labeled as toxic (a routing sketch follows the list below):

  • Synthetic Content Warnings: based on the dimensions along which a text was classified as toxic, we use an LLM like Llama 3.1 to generate a specific content warning to appear before the text during training. This is an ideal approach for mildly toxic content. The intention is to develop a model that is able to reason about harmful content.
  • Synthetic Rewriting: for content that is extremely toxic, we use an LLM to rewrite the content in a way that is not toxic at all. This helps us retain a similar dataset size, while still removing the most harmful portions of the dataset. We note that in our datasets, only a very small portion of the data (~1%) got this treatment.
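To make the routing concrete, here is an illustrative sketch. The per-dimension 0–3 score format and the thresholds are assumptions chosen for the example rather than the exact cutoffs we use; the precise criteria are given in the “Pre-Training Data Curation” section of the paper.

```python
# Illustrative routing logic -- the thresholds and score format are assumptions,
# not the exact cutoffs used in the released pipeline.
from typing import Dict


def route_sample(scores: Dict[str, int]) -> str:
    """Decide how to treat a text given per-dimension severity scores (assumed 0-3)."""
    total = sum(scores.values())
    peak = max(scores.values())
    if total == 0:
        return "keep"               # nothing flagged: use the text as-is
    if peak == 3 or total >= 7:     # assumed cutoff for "extremely toxic"
        return "synthetic_rewrite"  # rewrite with an LLM, preserving dataset size
    return "content_warning"        # prepend an LLM-generated, dimension-specific warning


example = {
    "race_and_origin": 1,
    "gender_and_sexuality": 0,
    "religion": 0,
    "ability": 0,
    "violence_and_abuse": 1,
}
print(route_sample(example))  # -> "content_warning"
```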

For more details on what samples we would consider “toxic,” refer to the “Pre-Training Data Curation” section of our paper.

In future work we hope to provide further empirical evidence to demonstrate the efficacy of this procedure for reducing harmful model behaviors.

We release Toxic Commons in the spirit of open science, but it is also useful in its own right as an object of study. It can be used to study changes in harmful and toxic language over time, by genre (book or newspaper), and across nine languages. For NLP practitioners, Toxic Commons could also be a tool for developing safe and helpful model responses to toxic content. Beyond the specific tools we release here, we also demonstrate a pipeline that can be adapted for any dataset or application (a rough sketch follows below). We hope to show that there are efficient and effective ways to reduce harmful behaviors in language models through dataset curation, as a complement to existing procedures.
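For instance, adapting the pipeline to a new corpus could look something like the following sketch, which scores a Hugging Face dataset in batches and attaches a treatment decision to each record. The dataset name is a placeholder, score_batch is a stub standing in for the Celadon scoring shown earlier, and the routing logic repeats the assumed thresholds from the previous sketch.

```python
# Adaptation sketch: "your_org/your_corpus" is a placeholder and `score_batch`
# is a stub for real Celadon inference; thresholds are assumptions as above.
from datasets import load_dataset

DIMENSIONS = ["race_and_origin", "gender_and_sexuality", "religion", "ability", "violence_and_abuse"]


def score_batch(texts):
    # Stub: replace with batched Celadon scoring; here every text scores 0 on every dimension.
    return [{dim: 0 for dim in DIMENSIONS} for _ in texts]


def route_sample(scores):
    total, peak = sum(scores.values()), max(scores.values())
    if total == 0:
        return "keep"
    return "synthetic_rewrite" if peak == 3 or total >= 7 else "content_warning"


def annotate(batch):
    return {"treatment": [route_sample(s) for s in score_batch(batch["text"])]}


corpus = load_dataset("your_org/your_corpus", split="train")  # assumes a "text" column
corpus = corpus.map(annotate, batched=True, batch_size=64)
to_rewrite = corpus.filter(lambda row: row["treatment"] == "synthetic_rewrite")
```

From there, the texts flagged for warnings or rewriting can be passed to an LLM such as Llama 3.1, as described above.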

Code for creating the annotations and training the classifier is available on GitHub: https://github.com/Pleias/toxic-commons. Further details can be found in the paper.