The History of Automated Text Moderation

Published: November 18, 2023

This document describes five technologies for automated text moderation, each roughly corresponding to a historical phase.

As a working example we will use the detection of “toxic” comments. In practice many different definitions of “toxic” have been used in the industry, and there are a variety of related concepts, e.g. “hate speech” and “offensive language”.

(1) Keywords

The simplest technology is to hard-code a list of words which are considered “toxic”, e.g. a list of curse words. This can be implemented with regular expressions. This approach has obvious limits on accuracy and is hard to maintain; nevertheless, many platforms still keep a keyword block list for some sensitive terms.
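For illustration, here is a minimal sketch of the keyword approach in Python. The word list is a deliberately tame placeholder, not a real moderation list:

```python
import re

# Hypothetical blocklist; real lists are much larger and maintained by policy teams.
BLOCKLIST = ["idiot", "moron", "stupid"]

# Word boundaries (\b) avoid matching substrings inside other words.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, BLOCKLIST)) + r")\b",
    re.IGNORECASE,
)

def is_toxic_keyword(text: str) -> bool:
    """Return True if any blocklisted word appears in the text."""
    return PATTERN.search(text) is not None

print(is_toxic_keyword("you are an idiot"))   # True
print(is_toxic_keyword("what a lovely day"))  # False
```

Evasion is trivial (“1diot” slips through), which is one reason this approach was superseded.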

(2) Simple classifier (“Bag of words”)

We can collect a large set of human-labeled data on whether individual messages are toxic, and then predict toxicity from the appearance of individual words, e.g. using logistic regression or naive Bayes. These classifiers will find that certain words are highly predictive of toxicity. Simple classifiers often achieve reasonable accuracy, but they still produce many consequential false positives and false negatives, and they are easy to evade by rewording or misspelling text.
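A minimal sketch of a bag-of-words classifier using scikit-learn; the tiny labeled dataset here is invented and stands in for a large human-labeled corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented labels: 1 = toxic, 0 = not toxic.
texts  = ["you are an idiot", "have a great day", "what a moron", "thanks for the help"]
labels = [1, 0, 1, 0]

# Bag-of-words features fed into a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["you idiot"]))       # likely [1]
print(model.predict(["thanks so much"]))  # likely [0]
```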

  • 1961: Maron (1961) proposes the Naive Bayes classifier

(3) Embedding-based classifier (2013-2018)

These models have two stages:

  1. Pretrain: for each word calculate an embedding (a vector of numbers) which predicts its likelihood of co-occurring with other words. Pairs of words which are nearby in embedding-space typically have similar meanings.
  2. Train: train a model to predict toxicity of a comment using the embedding of the words in a message (e.g. the average embedding).

An advantage over simple classifiers is that these models require much less labeled data for equal performance, because the pre-training stage has already learned (crudely) the meanings of different words. Thus these models can identify words that are diagnostic of toxicity even if those words never appeared in the toxicity training set.
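A sketch of the two-stage approach, assuming pretrained word vectors are available. The vectors below are random placeholders so that the snippet runs; in practice they would be pretrained word2vec or GloVe vectors loaded from disk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder "pretrained" embeddings (random, just so the example runs).
DIM = 50
rng = np.random.default_rng(0)
vocab = ["you", "are", "an", "idiot", "have", "a", "great", "day", "what", "moron", "thanks"]
word_vectors = {w: rng.normal(size=DIM) for w in vocab}

def embed(text: str) -> np.ndarray:
    """Average the embeddings of the known words in a message."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# Train stage: predict toxicity from the averaged embedding.
texts  = ["you are an idiot", "have a great day", "what a moron", "thanks"]
labels = [1, 0, 1, 0]
clf = LogisticRegression().fit([embed(t) for t in texts], labels)

# With real pretrained vectors, unseen-but-similar words would score as toxic.
print(clf.predict([embed("you moron")]))
```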

However, embedding-based classifiers are still bad at edge cases, e.g. when a word is used inside a negation (“is an idiot” vs “is not an idiot”), or if a word is misspelled, or if harmless words are used to express a meaning that is toxic (“your brain is a bowl of jello”).

  • 2013: Word2Vec: a word embedding using a 2-layer neural network, (Mikolov et al. (2013))
  • 2014: GloVe: Global Vectors for Word Representation. They say “training is performed on aggregated global word-word co-occurrence statistics from a corpus” (Pennington, Socher, and Manning (2014)).
  • 2015: fastText: word embedding from FAIR. They released pre-trained models for 294 languages (Joulin et al. (2016))
  • 2017: Jigsaw Perspective Toxicity API v1 from Google. (I couldn’t find any authoritative documentation on the architecture of this classifier; I found one reference to it using GloVe embeddings.)

(4) LLM-based classifiers (2018-2023)

These models have three stages:

  1. Embedding: Compute an embedding of each token (a token roughly corresponds to a word).
  2. Pretrain: Train a deep neural net to predict a token from surrounding tokens (or prior tokens), using attention (i.e. not weighting all context words equally), on an enormous training set of text from books and the internet.
  3. Train: Train a model to predict toxicity from labeled data, using the top-level neurons in the net as features (see the sketch below).

Conceptually these are similar to embedding-based classifiers, but (1) they can represent the meaning of entire sentences instead of just individual words, and (2) they have more layers and so tend to have more sophisticated representations of meaning.
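A minimal sketch of stage 3, using a frozen pretrained BERT (via the Hugging Face transformers library) as a feature extractor; in practice the whole network is usually fine-tuned on the labeled data, and the tiny dataset here is invented for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Stages 1-2: a pretrained transformer provides contextual token embeddings.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str):
    """Use the top-layer [CLS] vector as a fixed-size representation of the message."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden[0, 0].numpy()  # the [CLS] token's vector

# Stage 3: train a small classifier on labeled examples.
texts  = ["you are an idiot", "have a great day", "what a moron", "thanks for the help"]
labels = [1, 0, 1, 0]
clf = LogisticRegression().fit([embed(t) for t in texts], labels)

# Because the representation is contextual, even toxic phrasings made of
# harmless words have a chance of being caught (given enough training data).
print(clf.predict([embed("your brain is a bowl of jello")]))
```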

  • 2017: Transformer architecture (Vaswani et al. (2017))
  • 2018: BERT, a transformer LLM that has been widely used as a base model for a variety of natural language tasks, including content moderation (Devlin et al. (2018))

(5) Zero-shot LLMs (2023-)

These models have three stages:

  1. Embedding: Compute the embedding of each token.
  2. Pretrain: Train a deep net to predict the next token from previous tokens, as above.
  3. Prompt: Directly ask the model whether a given message violates a given policy, e.g. “is the following sentence toxic? ___” (see the sketch below).

Notably, this method does not use any human-labeled data; it only needs to be told what type of text it is looking for. This is referred to as “zero-shot”, meaning it needs zero training examples. These models can also use “few-shot” learning, where a small number of examples are given in the prompt instead of the thousands of labeled examples that were ordinarily required.
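A sketch of the zero-shot approach, assuming an OpenAI-style chat API (Python SDK v1+ with an API key in the environment). The policy wording, model name, and prompt below are illustrative assumptions, not any provider’s official moderation prompt:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative policy text; a real deployment would use the platform's own policy.
POLICY = ("A comment is toxic if it is a rude, disrespectful, or unreasonable "
          "remark that is likely to make someone leave a discussion.")

def classify_zero_shot(message: str) -> str:
    """Ask the model directly whether the message violates the policy."""
    prompt = (
        f"Policy: {POLICY}\n\n"
        "Does the following comment violate the policy? "
        "Answer YES or NO, then give a one-sentence reason.\n\n"
        f"Comment: {message}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; any capable instruction-following model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(classify_zero_shot("your brain is a bowl of jello"))
```

Few-shot classification would simply append a handful of labeled example comments to the same prompt.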

This has big benefits: it allows policies to be refined very quickly, and the LLM can generate explanations of why it made a decision.

  • 2020: GPT-3: reasonable zero-shot performance (Brown et al. (2020))
  • 2022: ChatGPT released: very good zero-shot performance on many tasks.
  • 2023: OpenAI provides GPT-4-based content moderation tools (Weng, Goel, and Vallone (2023))
  • 2023: Startups providing LLM-based content moderation: SafetyKit, CheckStep, Hive, Cove.
  • 2023: Stanford CoPE: an open-source LLM for moderation.

Discussion

Q: Now that we can use LLMs for arbitrary labeling, will we change policies?

  • Proposals are coming out of Michael Bernstein’s lab, e.g. Jia et al. (2023), for using LLMs to substantially change how content is ranked.
  • Dave Wilner has argued that, because LLMs offer much greater flexibility, platforms will find it easier to write more complex policies and to update them more frequently.

Q: What do we know about the degree of accuracy across languages?

  • AI typically has a strong anglophone bias. Performance in non-English languages tends to degrade with distance from English; even European languages tend to be worse than English. However, many also noted that there is typically a large anglophone bias in human moderation as well.
  • Some literature shows that LLMs have good performance in languages with relatively little training data, e.g. Armengol-Estapé, Gibert Bonet, and Melero (2021).

Q: Will censorship change when using LLMs instead of humans?

  • Jeff noted that an advantage of human censors over machine censors is that humans might exercise their judgment and refuse to censor, while machines will not.

References

Armengol-Estapé, Jordi, Ona de Gibert Bonet, and Maite Melero. 2021. “On the Multilingual Capabilities of Very Large-Scale English Language Models.” https://arxiv.org/abs/2108.13349.
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” https://arxiv.org/abs/2005.14165.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805.
Jia, Chenyan, Michelle S. Lam, Minh Chau Mai, Jeff Hancock, and Michael S. Bernstein. 2023. “Embedding Democratic Values into Social Media AIs via Societal Objective Functions.” arXiv preprint arXiv:2307.13912.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. “Bag of Tricks for Efficient Text Classification.” https://arxiv.org/abs/1607.01759.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Empirical Methods in Natural Language Processing (EMNLP), 1532–43. http://www.aclweb.org/anthology/D14-1162.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.
Weng, Lilian, Vik Goel, and Andrea Vallone. 2023. “Using GPT-4 for Content Moderation.” https://openai.com/blog/using-gpt-4-for-content-moderation.