Data-Centric AI for Online Safety Research

Online Safety Research by The Alan Turing Institute’s public policy programme

We are aiming to create a step-change in the use of socially responsible AI for ensuring online safety.


Our research on data-centric AI for online safety sets out to achieve three objectives.


To build high-performing, robust, fair and explainable tools for detecting content that could inflict harm on people online.



To understand the limitations, biases and failings of technologies for detecting content that could inflict harm on people online.



To produce datasets and resources that can be used by the research community to understand and increase online safety.


We regularly publish our work in a range of academic journals. The abstracts are below.


Annotating abusive language is expensive, logistically complex and creates a risk of psychological harm. However, most machine learning research has prioritized maximizing effectiveness (i.e., F1 or accuracy score) rather than data efficiency (i.e., minimizing the amount of data that is annotated). In this paper, we use simulated experiments over two datasets at varying percentages of abuse to demonstrate that transformers-based active learning is a promising approach to substantially raise efficiency whilst still maintaining high effectiveness, especially when abusive content is a smaller percentage of the dataset. This approach requires a fraction of labeled data to reach performance equivalent to training over the full dataset.
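The active learning idea in the abstract above can be sketched as a loop that repeatedly asks annotators to label the items the current model is least sure about. This is a minimal illustration only: the abstract does not state which acquisition strategy the paper uses, so uncertainty sampling is an assumption here, and `oracle` (the human annotator) and `train_and_score` (retraining a model and scoring the pool) are hypothetical callables supplied by the caller.

```python
from typing import Callable, Dict, List, Sequence, Set


def select_most_uncertain(probs: Sequence[float], labelled: Set[int], k: int) -> List[int]:
    """Pick the k unlabelled items whose predicted abuse probability is closest to 0.5."""
    candidates = [i for i in range(len(probs)) if i not in labelled]
    # Sort by distance from 0.5; ties are broken by index so the result is deterministic.
    candidates.sort(key=lambda i: (abs(probs[i] - 0.5), i))
    return candidates[:k]


def active_learning_loop(
    pool: Sequence[str],
    oracle: Callable[[str], int],
    train_and_score: Callable[[Sequence[str], Dict[int, int]], Sequence[float]],
    seed_indices: Sequence[int],
    batch_size: int,
    rounds: int,
) -> Dict[int, int]:
    """Grow a labelled set by annotating the least certain items each round."""
    # Start from a small seed set labelled by the (hypothetical) annotator.
    labelled = {i: oracle(pool[i]) for i in seed_indices}
    for _ in range(rounds):
        # Retrain on the current labels and score every item in the pool.
        probs = train_and_score(pool, labelled)
        picked = select_most_uncertain(probs, set(labelled), batch_size)
        if not picked:
            break  # the whole pool has been labelled
        for i in picked:
            labelled[i] = oracle(pool[i])
    return labelled
```

The annotation budget is `len(seed_indices) + batch_size * rounds`, which is the quantity the data-efficiency argument is about: the aim is to keep it to a fraction of the pool size.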

Radical right actors routinely use social media to spread highly divisive, disruptive, and anti-democratic messages. Assessing and countering such content is crucial for ensuring that online spaces can be open, accessible, and constructive. However, previous work has paid little attention to understanding factors associated with radical right content that goes viral. We investigate this issue with a new dataset (the ‘ROT’ dataset) which provides insight into the content, engagement, and followership of a set of 35 radical right actors who are active in the UK. ROT contains over 50,000 original entries and over 40 million retweets, quotes, replies and mentions, as well as detailed information about followership. We use a multilevel model to assess engagement with tweets and show the importance of both actor- and content-level factors, including the number of followers each actor has, the toxicity of their content, the presence of media and explicit requests for retweets. We argue that it is crucial to account for the role of actors in radical right viral tweets, and therefore that moderation efforts should be made not only at the level of individual posts but also at the account level.

Labelled data is the foundation of most natural language processing tasks. However, labelling data is difficult and there often are diverse valid beliefs about what the correct data labels should be. So far, dataset creators have acknowledged annotator subjectivity, but rarely actively managed it in the annotation process. This has led to partly-subjective datasets that fail to serve a clear downstream use. To address this issue, we propose two contrasting paradigms for data annotation. The descriptive paradigm encourages annotator subjectivity, whereas the prescriptive paradigm discourages it. Descriptive annotation allows for the surveying and modelling of different beliefs, whereas prescriptive annotation enables the training of models that consistently apply one belief. We discuss benefits and challenges in implementing both paradigms, and argue that dataset creators should explicitly aim for one or the other to facilitate the intended use of their dataset. Lastly, we conduct an annotation experiment using hate speech data that illustrates the contrast between the two paradigms.

Textual data can pose a risk of serious harm. These harms can be categorised along three axes: (1) the harm type (e.g. misinformation, hate speech or racial stereotypes); (2) whether it is elicited as a feature of the research design from directly studying harmful content (e.g. training a hate speech classifier or auditing unfiltered large-scale datasets) versus spuriously invoked from working on unrelated problems (e.g. language generation or part of speech tagging) but with datasets that nonetheless contain harmful content; and (3) who it affects, from the humans (mis)represented in the data to those handling or labelling the data to readers and reviewers of publications produced from the data. How textual harms should be handled, presented, and discussed is an unsolved problem in NLP, but stopping work on content which poses a risk of harm is untenable. Accordingly, we provide practical advice and introduce HARMCHECK, a resource for reflecting on research into textual harms. We hope our work encourages ethical, responsible, and respectful research in the NLP community.

Online hate is a growing concern on many social media platforms, making them unwelcoming and unsafe. To combat this, technology companies are increasingly developing techniques to automatically identify and sanction hateful users. However, accurate detection of such users remains a challenge due to the contextual nature of speech, whose meaning depends on the social setting in which it is used. This contextual nature of speech has also led minoritized users, especially African–Americans, to be unfairly detected as ‘hateful’ by the very algorithms designed to protect them. To resolve this problem of inaccurate and unfair hate detection, research has focused on developing machine learning (ML) systems that better understand textual context. Incorporating social networks of hateful users has not received as much attention, despite social science research suggesting it provides rich contextual information. We present a system for more accurately and fairly detecting hateful users by incorporating social network information through geometric deep learning. Geometric deep learning is an ML technique that dynamically learns information-rich network representations. We make two main contributions: first, we demonstrate that adding network information with geometric deep learning produces a more accurate classifier compared with other techniques that either exclude network information entirely or incorporate it through manual feature engineering. Our best performing model achieves an AUC score of 90.8% on a previously released hateful user dataset. Second, we show that such information also leads to fairer outcomes: using the ‘predictive equality’ fairness criterion, we compare the false positive rates of our geometric learning algorithm to other ML techniques and find that our best-performing classifier has no false positives among a subset of African–American users. A neural network without network information has the largest number of false positives at 26, while a neural network incorporating manual network features has 13 false positives among African–American users. The system we present highlights the importance of effectively incorporating social network features in automated hateful user detection, raising new opportunities to improve how online hate is tackled.
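The ‘predictive equality’ criterion mentioned above compares false positive rates across groups of users. The sketch below shows one way such a comparison could be computed; it is not the paper's code, the function names are hypothetical, and labels are assumed to be 1 for ‘hateful’ and 0 for ‘not hateful’.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def fpr_by_group(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> Dict[Hashable, float]:
    """False positive rate per group: FP / (FP + TN), over truly non-hateful users."""
    false_pos: Dict[Hashable, int] = defaultdict(int)
    negatives: Dict[Hashable, int] = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:  # only true negatives and false positives enter the rate
            negatives[group] += 1
            if pred == 1:
                false_pos[group] += 1
    # Groups with no truly non-hateful users are omitted (rate undefined).
    return {g: false_pos[g] / n for g, n in negatives.items()}


def predictive_equality_gap(rates: Dict[Hashable, float]) -> float:
    """Largest difference in false positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Toy, made-up data for two groups "a" and "b".
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]
groups = ["a"] * 5 + ["b"] * 5
rates = fpr_by_group(y_true, y_pred, groups)
gap = predictive_equality_gap(rates)
```

A gap of zero would mean every group is wrongly flagged at the same rate; a classifier with no false positives in a group has a rate of zero for that group.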


Accurate detection and classification of online hate is a difficult task. Implicit hate is particularly challenging as such content tends to have unusual syntax, polysemic words, and fewer markers of prejudice (e.g., slurs). This problem is heightened with multimodal content, such as memes (combinations of text and images), as they are often harder to decipher than unimodal content (e.g., text alone). This paper evaluates the role of semantic and multimodal context for detecting implicit and explicit hate. We show that both text and visual enrichment improve model performance, with the multimodal model’s F1 score (0.771) exceeding those of the other models (0.544, 0.737, and 0.754). While the unimodal-text context-aware (transformer) model was the most accurate on the subtask of implicit hate detection, the multimodal model outperformed it overall because of a lower propensity towards false positives. We find that all models perform better on content with full annotator agreement and that multimodal models are best at classifying the content where annotators disagree. To conduct these investigations, we undertook high-quality annotation of a sample of 5,000 multimodal entries. Tweets were annotated for primary category, modality, and strategy. We make this corpus, along with the codebook, code, and final model, freely available. Full article

Online misogyny is a pernicious social problem that risks making online platforms toxic and unwelcoming to women. We present a new hierarchical taxonomy for online misogyny, as well as an expert-labelled dataset to enable automatic classification of misogynistic content. The dataset consists of 6,567 labels for Reddit posts and comments. As previous research has found that untrained crowdsourced annotators struggle with identifying misogyny, we hired and trained annotators and provided them with robust annotation guidelines. We report baseline classification performance on the binary classification task, achieving accuracy of 0.93 and F1 of 0.43. The codebook and datasets are made freely available for future researchers. Full article
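A high accuracy alongside a much lower F1, as reported above, is typical of a binary task where the positive class is rare. The worked example below uses a made-up confusion matrix (not the paper's results) to show how the two metrics can diverge.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all items classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts: 1,000 items, of which only 70 are truly positive.
tp, fn = 25, 45   # positives found / positives missed
fp, tn = 20, 910  # negatives wrongly flagged / negatives correctly passed

acc = accuracy(tp, fp, fn, tn)  # dominated by the 910 easy negatives
f1 = f1_score(tp, fp, fn)       # ignores true negatives entirely
```

With these counts accuracy is 0.935 while F1 is roughly 0.435, because accuracy rewards the many correctly ignored negatives and F1 only looks at how well the rare positive class is handled.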

Detecting online hate is a complex task, and low-performing detection models have harmful consequences when used for sensitive applications such as content moderation. Emoji-based hate is a key emerging challenge for online hate detection. We present HATEMOJICHECK, a test suite of 3,930 short-form statements that allows us to evaluate how detection models perform on hateful language expressed with emoji. Using the test suite, we expose weaknesses in existing hate detection models. To address these weaknesses, we create the HATEMOJITRAIN dataset using an innovative human-and-model-in-the-loop approach. Models trained on these 5,912 adversarial examples perform substantially better at detecting emoji-based hate, while retaining strong performance on text-only hate. Both HATEMOJICHECK and HATEMOJITRAIN are made publicly available. Full article

We present the results and main findings of the shared task at WOAH 5 on hateful memes detection. The task includes two subtasks relating to distinct challenges in the fine-grained detection of hateful memes: (1) the protected category attacked by the meme and (2) the attack type. Three teams submitted system description papers. This shared task builds on the hateful memes detection task created by Facebook AI Research in 2020. Full article

Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HATECHECK, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HATECHECK’s utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses. Full article
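The idea of functional testing described above can be sketched as scoring a model separately on each functionality rather than on one pooled test set. Everything in this sketch is illustrative: the two functionality names, the placeholder test cases and the keyword-based toy model are assumptions made for the example and are not taken from HATECHECK itself.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class FunctionalCase(NamedTuple):
    functionality: str  # which capability the case probes
    text: str           # templated input; [GROUP] is a placeholder
    expected: int       # 1 = hateful, 0 = not hateful


def accuracy_by_functionality(
    cases: List[FunctionalCase], model: Callable[[str], int]
) -> Dict[str, float]:
    """Report accuracy per functionality so specific weak points become visible."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.functionality] += 1
        hits[case.functionality] += int(model(case.text) == case.expected)
    return {name: hits[name] / totals[name] for name in totals}


def keyword_model(text: str) -> int:
    """Toy stand-in for a real classifier: flags any text containing 'hate'."""
    return int("hate" in text.lower())


CASES = [
    FunctionalCase("direct_statement", "I hate [GROUP].", 1),
    FunctionalCase("direct_statement", "I really hate [GROUP].", 1),
    FunctionalCase("negated_statement", "I do not hate [GROUP].", 0),
    FunctionalCase("negated_statement", "I would never hate [GROUP].", 0),
]

report = accuracy_by_functionality(CASES, keyword_model)
```

Here the toy model scores 1.0 on direct statements and 0.0 on negated ones, a failure that a single pooled accuracy of 0.5 would not localise.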

Online abuse can inflict harm on users and communities, making online spaces unsafe and toxic. Progress in automatically detecting and classifying abusive content is often held back by the lack of high quality and detailed datasets. We introduce a new dataset of primarily English Reddit entries which addresses several limitations of prior work. It (1) contains six conceptually distinct primary categories as well as secondary categories, (2) has labels annotated in the context of the conversation thread, (3) contains rationales and (4) uses an expert-driven group-adjudication process for high quality annotations. We report several baseline models to benchmark the work of future researchers. The annotated dataset, annotation guidelines, models and code are freely available. Full article


The Russian state-funded international broadcaster RT (formerly Russia Today) has attracted much attention as a purveyor of Russian propaganda. To date, most studies of RT have focused on its broadcast, website, and social media content, with little research on its audiences. Through a data-driven application of network science and other computational methods, we address this gap to provide insight into the demographics and interests of RT’s Twitter followers, as well as how they engage with RT. Building upon recent studies of Russian state-sponsored media, we report three main results. First, we find that most of RT’s Twitter followers only very rarely engage with its content and tend to be exposed to RT’s content alongside other mainstream news channels. This indicates that RT is not a central part of their online news media environment. Second, using probabilistic computational methods, we show that followers of RT are slightly more likely to be older and male than average Twitter users, and they are far more likely to be bots. Third, we identify thirty-five distinct audience segments, which vary in terms of their nationality, languages, and interests. This audience segmentation reveals the considerable heterogeneity of RT’s Twitter followers. Accordingly, we conclude that generalizations about RT’s audience based on analyses of RT’s media content, or on vocal minorities among its wider audiences, are unhelpful and limit our understanding of RT and its appeal to international audiences. Full article

Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on the creation of a dedicated website for cataloguing abusive language data. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets. Full article

Far-right actors are often purveyors of Islamophobic hate speech online, using social media to spread divisive and prejudiced messages which can stir up intergroup tensions and conflict. Hateful content can inflict harm on targeted victims, create a sense of fear amongst communities and stir up intergroup tensions and conflict. Accordingly, there is a pressing need to better understand at a granular level how Islamophobia manifests online and who produces it. We investigate the dynamics of Islamophobia amongst followers of a prominent UK far right political party on Twitter, the British National Party. Analysing a new data set of five million tweets, collected over a period of one year, using a machine learning classifier and latent Markov modelling, we identify seven types of Islamophobic far right actors, capturing qualitative, quantitative and temporal differences in their behaviour. Notably, we show that a small number of users are responsible for most of the Islamophobia that we observe. We then discuss the policy implications of this typology in the context of social media regulation. Full article

During COVID-19 concerns have heightened about the spread of aggressive and hateful language online, especially hostility directed against East Asia and East Asian people. We report on a new dataset and the creation of a machine learning classifier that categorizes social media posts from Twitter into four classes: Hostility against East Asia, Criticism of East Asia, Meta-discussions of East Asian prejudice, and a neutral class. The classifier achieves a macro-F1 score of 0.83. We then conduct an in-depth ground-up error analysis and show that the model struggles with edge cases and ambiguous content. We provide the 20,000 tweet training dataset (annotated by experienced analysts), which also contains several secondary categories and additional flags. We also provide the 40,000 original annotations (before adjudication), the full codebook, annotations for COVID-19 relevance and East Asian relevance and stance for 1,000 hashtags, and the final model. Full article

We investigate the use of machine learning classifiers for detecting online abuse in empirical research. We show that uncalibrated classifiers (i.e. where the ‘raw’ scores are used) align poorly with human evaluations. This limits their use for understanding the dynamics, patterns and prevalence of online abuse. We examine two widely used classifiers (created by Perspective and Davidson et al.) on a dataset of tweets directed against candidates in the UK’s 2017 general election. A Bayesian approach is presented to recalibrate the raw scores from the classifiers, using probabilistic programming and newly annotated data. We argue that interpretability evaluation and recalibration is integral to the application of abusive content classifiers. Full article


Online abusive content detection is an inherently difficult task. It has received considerable attention from academia, particularly within the computational linguistics community, and performance appears to have improved as the field has matured. However, considerable challenges and unaddressed frontiers remain, spanning technical, social and ethical dimensions. These issues constrain the performance, efficiency and generalizability of abusive content detection systems. In this article we delineate and clarify the main challenges and frontiers in the field, critically evaluate their implications and discuss potential solutions. We also highlight ways in which social scientific insights can advance research. We discuss the lack of support given to researchers working with abusive content and provide guidelines for ethical research. Full article

Islamophobic hate speech on social media is a growing concern in contemporary Western politics and society. It can inflict considerable harm on any victims who are targeted, create a sense of fear and exclusion amongst their communities, toxify public discourse and motivate other forms of extremist and hateful behavior. Accordingly, there is a pressing need for automated tools to detect and classify Islamophobic hate speech robustly and at scale, thereby enabling quantitative analyses of large textual datasets, such as those collected from social media. Previous research has mostly approached the automated detection of hate speech as a binary task. However, the varied nature of Islamophobia means that this is often inappropriate for both theoretically informed social science and effective monitoring of social media platforms. Drawing on in-depth conceptual work we build an automated software tool which distinguishes between non-Islamophobic, weak Islamophobic and strong Islamophobic content. Accuracy is 77.6% and balanced accuracy is 83%. Our tool enables future quantitative research into the drivers, spread, prevalence and effects of Islamophobic hate speech on social media. Full article
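The abstract above reports both accuracy and balanced accuracy for a three-class task. Assuming the usual definition of balanced accuracy as the mean of per-class recall, the sketch below shows how the two can differ when classes are unevenly sized; the labels and predictions are made up for illustration and are not the paper's data.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def plain_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Share of items whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so every class counts equally regardless of size."""
    correct: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example: six 'none' items, two 'weak' and two 'strong'.
y_true = ["none"] * 6 + ["weak"] * 2 + ["strong"] * 2
y_pred = ["none"] * 4 + ["weak"] * 2 + ["weak"] * 2 + ["strong"] * 2

acc = plain_accuracy(y_true, y_pred)     # 8 of 10 correct
bal = balanced_accuracy(y_true, y_pred)  # mean of recalls 4/6, 2/2 and 2/2
```

In this toy case the errors fall in the largest class, so balanced accuracy (about 0.89) is higher than plain accuracy (0.8), the same direction as the figures reported above.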