Harmful content detection system

_{scroll ↓ to Resources}

Clarifying Requirements
Frame the Problem as an ML Task
Data preparation
Model development
Evaluation
Serving
Other talking points
Resources

Clarifying Requirements

Objective

We define our ML objective as accurately predicting harmful posts. The reason is that if we can accurately detect harmful posts, we can remove or demote them, leading to a safer platform.

Requirements

Harmful content: Posts that contain violence, nudity, self-harm, hate speech, etc.
Bad actors: Fake accounts, spam, phishing, other unsafe behaviors.

Misinformation is controversial and out of scope for this system design.

Content can be text, image, video or any combination of those.
The system should work with any language, but we can assume having a multi language model for that.
- Comments. author info, reactions
Around 500 million posts per day, of which around 10k have labels.
Users can report harmful content.
If the post is removed, we provide users an explanation or link to the rules.
Some threats must be real-time, such as violent content, Other threads can be handled via batch processing.

Frame the Problem as an ML Task

Inputs and Outputs

Choice: Fusing methods

To make accurate predictions, the system should consider all modalities. There are two fusing methods to combine heterogeneous data: late fusion and early fusion.

Late fusion

Process different input modalities independently using different models and then combine the predictions of each model for the final prediction.

Advantage:
- Each model can be trained, evaluated and improved independently.
Disadvantage:
- Separate training data for each modality is needed.
- Late fusion fails to detect cases where, each of the modalities contains unharmful content, unless combined. Think of memes with images and accompanying text.

Early fusion

The modalities are combined first and then the model makes a prediction.

Advantages:
- No need to collect separate data sets for each modality.
Disadvantages:
- More challenging to learn complex relationships between modalities.

The early fusion method is preferred, because we need to capture posts that may be harmful overall. And besides this, with such an immense amount of data, the model should be able to learn the patterns pretty well.

Choice: ML approach

Single binary classifier

After fusing all the features, supply it to a single model which will output a single number.
As the model doesn’t discriminate categories of harm, this violates one of the requirements.

Binary classifier per category

One model will detect violence, another nudity, the third one - hate, and so on. This option will comply with the requirement but it is time consuming and expensive to train so many models.

Multi-label classifier

This will work as single model will output probabilities of each harmful category, but it can be a suboptimal solution, because Fused features, pre-processed in the same way, are used for different category classifications. Allowing separate preprocessing pipelines may yield better results.

Multi-task classifier

Multi-task classification strives to create a model which will learn multiple tasks simultaneously, allowing it to learn similarities between tasks and avoid unnecessary computations when a certain input transformation is beneficial for multiple tasks. Therefore, it requires two layers - shared layers and task-specific layers.

Shared layers

A set of hidden layers that transform input features into new ones. These newly transformed features are used to make predictions for each of the harmful classes.

Task-specific layers

These are independent classification heads transforming features in a way that is optimal for predicting a specific harm category.

Multi-task classifier is chosen for this problem.

High-level ML system

All post modalities are fused together to create features. Engineered features Undergo transformation in shared layers and emerge as embeddings. Each of the classification heads in task specific layers looks at those transformed features and makes their own prediction. After that we have a probability value for each of the charm category. We can apply some sort of heuristic to produce a single number out of multiple probabilities.

Data preparation

Available data

Users

ID	Username	Age	Gender	City	Country	Email

Posts

Post ID	Author ID	On-device	Timestamp	Textual content	Images or videos	Links
1	1	73.93.220.240	1658469431	Today, I am starting my diet.	http: //cdn.mysite.com/u1.jpg	-
2	11	89.42.110.250	1658471428	The video amazed me! Please donate	http: //cdn.mysite.com/t3.mp4	gofundme.com/f/3u1njd32
3	4	39.55.180.020	1658489233	What is a good restaurant in the Bay area?	http: //cdn.mysite.com/t5.jpg	-

User-post interactions

User ID	Post ID	Interaction type	Interaction value	Timestamp
11	6	Impression	-	1658450539
4	20	Like	-	1658451341
11	7	Comment	This is disgusting	1658451365
4	20	Share	-	1658435948
11	7	Report	violence	1658451849

Feature Engineering

Each post contains several content types each with their own preprocessing and feature extraction steps

Textual content
Image or video
User reactions to the post
Author
Contextual information

Textual content

Pre-processing
- normalization
- tokenization
Vectorization: Creates a meaningful feature vector using either statistical (Bag-of-Words, TF-IDF) or ML-based methods.
- Statistical methods do not encode the semantics of a text, therefore, we must use a ML-based method.
- Language model such as BERT is a large model, hence producing embedding takes time, which is not available during online prediction. In addition, BERT was trained on English-only data, so it will not produce good results for other languages.
- DistilBERT is optimized in such a way that if two sentences have the same meaning, but are in two different languages, their embeddings still is very similar.

Image or video

Pre-processing
- decode, resize, normalize
- Feature extraction
  - Use a pre-trained model to convert unstructured data into a feature vector. Viable options are CLIP visual encoder, SimCLR. For videos - VideoMoCo.

User reactions to the post

Feature extraction
- The number of likes, shares, comments, and reports
  - Similar to textual content, obtain embeddings for each comment and average them.
  - Scale numeric values for better convergence

Author

Feature extraction
- Violation history
- Demographics
- Account information such as account age, number of followers and followings.
- Contextual information, such as time of the day and the device.

Joint feature engineering pipeline

Model development

Model selection

Neural network. Architecture is not detailed.

Model training

The model training happens in a standard way with forward propagation leading to prediction and measuring a loss function. then backward propagation to optimize the model parameters and reduce the loss function in the next iteration.

Constructing the dataset

data annotation
- manual labeling
  - ==slow and expensive, but good for Evaluation data to prioritize accuracy==
- natural labeling: Rely on users or reports to label posts automatically
  - fast and suitable for training data assuming enormous dataset sizes
    This way, the best model will pick-up such model weights from naturally labeled data. which will result in the smallest loss on manually labeled data

Choosing the loss function

In multitask training, each task is assigned a loss function based on its machine learning category. In this system, all heads are binary classifiers, so they all can use a standard classification loss such as cross-entropy. So loss values are computed for each task and the overall loss is the result of combining task specific losses.
A common challenge in training multi-modal systems is overfitting because one modality can dominate the learning process of all others.
- gradient blending
- focal loss

Hyperparameter tuning

Optuna, grid search

Evaluation

Offline metrics

PR-curve

Shows the trade-off between precision and recall of the model.

ROC-curve

Shows the trade-offs between the true positive rate (recall) and the false positive rate.

Online metrics

Prevalence

the ratio of harmful posts which we didn’t prevent and all posts on the platform
it treats harmful posts equally, but in practice, one harmful post with 100K views or impressions is more harmful than two posts with 10 views each. > use Harmful impressions==

Harmful impressions

Valid appeals

Percentage of posts that were deemed harmful, but appealed and reversed.

Proactive rate

Percentage of harmful posts found and deleted by the system before users report it.

User reports for harmful category

Serving

After a new post appears, the service predicts the probability of harm.
If the harm is detected with high confidence, Violation enforcement service removes the post right away. It also notifies the user why the post was removed.
If the harm is detected with low confidence, the demoting service temporarily demotes a post and decreases its chance to spread among users. The demoted post is Scheduled for manual review by humans. The review team manually reviews the post and labels it. Labeled posts are used for future training and model improvement.

Other talking points

Handle biases introduced by human labeling
Adapt the system to detect trending harmful classes (e.g., Covid-19, elections)
How to build a harmful content detection system that leverages temporal information such as users’ sequence of actions
How to effectively select post samples for human review
- Building and Scaling Human Review Systems
How to detect authentic and fake accounts
How to deal with borderline contents, i.e., types of content that are not prohibited by guidelines, but come close to the red lines drawn by those policies.
How to make the harmful content detection system efficient, so we can deploy it on-device
How to substitute Transformer-based architectures with linear Transformers to create a more efficient system
- How Facebook uses super-efficient AI models to detect hate speech

Resources

Links to this File

table file.inlinks, file.outlinks from [[]] and !outgoing([[]])  AND -"Changelog"

Notitia Restante 🌱

On this site

Harmful content detection system

Contents

Clarifying Requirements

Objective

Requirements

Frame the Problem as an ML Task

Inputs and Outputs

Choice: Fusing methods

Late fusion

Early fusion

Choice: ML approach

Single binary classifier

Binary classifier per category

Multi-label classifier

Multi-task classifier

Shared layers

Task-specific layers

High-level ML system

Data preparation

Available data

Users

Posts

User-post interactions

Feature Engineering

Textual content

Image or video

User reactions to the post

Author

Joint feature engineering pipeline

Model development

Model selection

Model training

Constructing the dataset

Choosing the loss function

Hyperparameter tuning

Evaluation

Offline metrics

PR-curve

ROC-curve

Online metrics

Prevalence

Harmful impressions

Valid appeals

Proactive rate

User reports for harmful category

Serving

Other talking points

Resources

Links to this File

Graph View

On this page

Backlinks