Enabling Online Safety
through rigorous academic research


As part of The Alan Turing Institute’s public policy programme we provide objective, evidence-driven insight into the technical, social, empirical and ethical aspects of online safety, supporting the work of policymakers and regulators, informing civic discourse and extending academic knowledge.
We are working to tackle online hate, harassment, extremism and mis/disinformation.

To achieve our goal of providing evidence-driven insight into all aspects of online safety, we have three core workstreams:


Data-Centric AI

Building cutting-edge tools and critically examining technologies to create a step-change in the use of AI for online safety.

Online Harms Observatory

Mapping the scope, prevalence, impact and motivations behind content and activity that could inflict harm on people online.

Policymaking for Online Safety

Working to understand the challenges involved in ensuring online safety, and supporting the creation of ethical and innovative solutions.

Where we are and where we are headed


This year we are producing academic research, creating dashboards, curating open source resources, writing policy reports and much more! Curious to find out how we are helping to tackle the challenges of online safety, or would you like to get involved? Watch our video introducing our ongoing projects and reach out to us at onlinesafety@turing.ac.uk.

Online Safety Research


Our most recent publications are listed below. Click to read the abstract, or see our full list of publications here.

Labelled data is the foundation of most natural language processing tasks. However, labelling data is difficult and there often are diverse valid beliefs about what the correct data labels should be. So far, dataset creators have acknowledged annotator subjectivity, but rarely actively managed it in the annotation process. This has led to partly-subjective datasets that fail to serve a clear downstream use. To address this issue, we propose two contrasting paradigms for data annotation. The descriptive paradigm encourages annotator subjectivity, whereas the prescriptive paradigm discourages it. Descriptive annotation allows for the surveying and modelling of different beliefs, whereas prescriptive annotation enables the training of models that consistently apply one belief. We discuss benefits and challenges in implementing both paradigms, and argue that dataset creators should explicitly aim for one or the other to facilitate the intended use of their dataset. Lastly, we conduct an annotation experiment using hate speech data that illustrates the contrast between the two paradigms.
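The contrast between the two paradigms is easiest to see at the label-aggregation stage. The sketch below is a minimal illustration in plain Python, not code from the paper: the annotations dictionary is invented, descriptive aggregation keeps the full distribution of annotator judgements, and prescriptive aggregation collapses each item to a single label, with a majority vote standing in here for strict guideline application.

from collections import Counter

# Hypothetical annotations: item id -> labels given by several annotators.
annotations = {
    "post_1": ["hateful", "hateful", "not_hateful"],
    "post_2": ["not_hateful", "not_hateful", "not_hateful"],
    "post_3": ["hateful", "not_hateful", "not_hateful"],
}

def descriptive_labels(annotations):
    """Keep every annotator judgement as a distribution over labels,
    so downstream models can learn from the disagreement itself."""
    return {
        item: {label: count / len(labels)
               for label, count in Counter(labels).items()}
        for item, labels in annotations.items()
    }

def prescriptive_labels(annotations):
    """Collapse each item to one label; a majority vote stands in for
    applying a single, strictly enforced annotation guideline."""
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in annotations.items()
    }

print(descriptive_labels(annotations))   # e.g. post_1 -> {hateful: 0.67, not_hateful: 0.33}
print(prescriptive_labels(annotations))  # e.g. post_1 -> hateful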

Textual data can pose a risk of serious harm. These harms can be categorised along three axes: (1) the harm type (e.g. misinformation, hate speech or racial stereotypes); (2) whether it is elicited as a feature of the research design from directly studying harmful content (e.g. training a hate speech classifier or auditing unfiltered large-scale datasets) versus spuriously invoked from working on unrelated problems (e.g. language generation or part of speech tagging) but with datasets that nonetheless contain harmful content; and (3) who it affects, from the humans (mis)represented in the data to those handling or labelling the data to readers and reviewers of publications produced from the data. It is an unsolved problem in NLP as to how textual harms should be handled, presented, and discussed; but, stopping work on content which poses a risk of harm is untenable. Accordingly, we provide practical advice and introduce HARMCHECK, a resource for reflecting on research into textual harms. We hope our work encourages ethical, responsible, and respectful research in the NLP community.

Online hate is a growing concern on many social media platforms, making them unwelcoming and unsafe. To combat this, technology companies are increasingly developing techniques to automatically identify and sanction hateful users. However, accurate detection of such users remains a challenge due to the contextual nature of speech, whose meaning depends on the social setting in which it is used. This contextual nature of speech has also led to minoritized users, especially African–Americans, to be unfairly detected as ‘hateful’ by the very algorithms designed to protect them. To resolve this problem of inaccurate and unfair hate detection, research has focused on developing machine learning (ML) systems that better understand textual context. Incorporating social networks of hateful users has not received as much attention, despite social science research suggesting it provides rich contextual information. We present a system for more accurately and fairly detecting hateful users by incorporating social network information through geometric deep learning. Geometric deep learning is a ML technique that dynamically learns information-rich network representations. We make two main contributions: first, we demonstrate that adding network information with geometric deep learning produces a more accurate classifier compared with other techniques that either exclude network information entirely or incorporate it through manual feature engineering. Our best performing model achieves an AUC score of 90.8% on a previously released hateful user dataset. Second, we show that such information also leads to fairer outcomes: using the ‘predictive equality’ fairness criteria, we compare the false positive rates of our geometric learning algorithm to other ML techniques and find that our best-performing classifier has no false positives among a subset of African–American users. A neural network without network information has the largest number of false positives at 26, while a neural network incorporating manual network features has 13 false positives among African–American users. The system we present highlights the importance of effectively incorporating social network features in automated hateful user detection, raising new opportunities to improve how online hate is tackled.
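For readers unfamiliar with the 'predictive equality' criterion mentioned above, it compares false positive rates, i.e. how often non-hateful users are wrongly flagged, across groups of users. The sketch below is an illustrative check in plain Python with made-up labels, predictions and group memberships; it is not the paper's code or data.

# Minimal sketch of a 'predictive equality' check: compare false positive
# rates across groups. All values below are invented for illustration.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]          # 1 = genuinely hateful user
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]          # 1 = classifier flags the user
group  = ["A", "A", "A", "B", "B", "B", "B", "B"]

def false_positive_rate(labels, preds):
    false_pos = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    negatives = sum(1 for t in labels if t == 0)
    return false_pos / negatives if negatives else 0.0

for g in sorted(set(group)):
    idx = [i for i, gi in enumerate(group) if gi == g]
    fpr = false_positive_rate([y_true[i] for i in idx],
                              [y_pred[i] for i in idx])
    print(f"group {g}: false positive rate = {fpr:.2f}")
# Predictive equality is satisfied when these rates are (close to) equal.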


Online Safety Resources


We aim to maximise the accessibility of our work. Below is a collection of tools and resources that can be used to monitor, understand and counter online hate.

Online Hate Research Hub

This ongoing project collates and organises resources for research and policymaking on online hate. These resources aim to cover all aspects of research, policymaking, the law and civil society activism to monitor, understand and counter online hate. Resources are focused on the UK, but include international work as well.

Catalogue of datasets annotated for Hate Speech

We have catalogued 50+ datasets annotated for hate speech, online abuse, and offensive language. They may be useful for, e.g., training a natural language processing system to detect such language. The catalogue includes datasets in 15 languages, including Arabic, Danish, English, French, Hindi-English, Indonesian and Turkish.
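As a rough illustration of that use case, any of the catalogued datasets can be loaded as a table of texts and labels and used to fit a baseline classifier. The scikit-learn sketch below assumes a hypothetical CSV file with "text" and "label" columns; it is a starting point for experimentation, not a recommended production pipeline.

# Baseline sketch: train a simple hate speech classifier on one of the
# catalogued datasets. The file name and column names are hypothetical;
# adapt them to the dataset you download.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("hate_speech_dataset.csv")            # hypothetical path
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"], test_size=0.2, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(min_df=2, ngram_range=(1, 2)),       # word + bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))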

Online Harms Observatory

The Online Harms Observatory is a new platform which will provide real-time insight into the scope, prevalence and dynamics of harmful online content. The observatory will help policymakers, regulators, security services and other stakeholders better understand the landscape of online harms.