Deploying ML for Voice Safety

Our mission is to connect a billion people with optimism and civility, which will require us to help people feel truly together with one another. For 3D immersive worlds, much like in the physical world, few things are more authentic or powerful than the human voice in forging lasting friendships and connections. But how do we scale the immersiveness and richness of voice communication on Roblox while keeping our community safe and civil?

In this blog, we’ll share how we brought to life Real-time Safety, an end-to-end machine learning (ML) model—operating at a scale of millions of minutes of voice activity per day—that detects policy violations in voice communication more accurately than human moderation. The outputs from this system are fed into another model, which determines the appropriate consequences. The consequence model triggers notifications for people who have violated our policies, initially with warnings and then with more drastic actions if the behavior persists.

Building this end-to-end Real-time Safety system was an audacious goal, as we are one of the first in the industry to deliver multilingual, near real-time voice safety features to users. Voice classification depends on both audio style, including volume and tone, and content, including the words spoken. We are excited to share how we developed this system from essentially no prior automation—effectively zero labeled data and no models—going from zero to 60 for real-time voice safety.

And finally, we are excited to share our first open-source model, which is one of our voice safety models. In open sourcing this model and making it available for commercial use, we hope to provide an industry baseline for policy violation detection that can accelerate the development of newer ML models for voice safety. This open-source model is our first version, and we’ve since made significant improvements that we are currently testing.

Overcoming Data Scarcity

We began our ML efforts as many companies do—by assessing the quality of available data for training and evaluating our models. The ideal dataset would pair each voice utterance with a high-quality labeled safety categorization for that utterance. However, when we started, we had almost no large-scale human-labeled real-world data. To train a high-quality voice safety detection model using a supervised approach, we would have needed thousands of audio hours of labeled data for each language we supported, which would have taken years to gather and would have been prohibitively resource and time intensive.

Instead of relying on thousands of hours of hand-labeled data, we developed several more efficient methods:

  • Machine-labeled data for training. Instead of getting stuck on the pursuit of perfect hand-labeled data for training, we opted for a large volume of training data from machine labeling of voice utterances. Training on large amounts of machine-labeled data with weak supervision produced models that were robust to some noise in the labels. The keys to making this approach work were access to great open-source speech-to-text libraries and years of experience using ML to detect Community Standards violations in people’s textual communications. This machine labeling approach allowed us to label the volume of training data we needed for our models in weeks instead of years.

  • Human-labeled data for evaluation. Although high-quality yet imperfect machine-labeled data was good enough to train a highly performant model, we didn’t trust machine labels to perform the final validation of the resulting model. The next question, then, was where we could get enough human-labeled data for evaluation. Luckily, while it was impossible to gather enough human-labeled data for training in a timely way, it was possible to gather enough for evaluation of our model using our in-house moderators, who were already classifying abuse reports from people on Roblox to manually issue consequences. This allowed us to enjoy the best of both worlds: machine-labeled training data that was good and plentiful enough to produce a highly performant model, and human-labeled evaluation data that was much smaller in volume but more than enough to give us confidence that the model truly worked.

Another area where we faced data scarcity was policy violation categories with very low prevalence, such as references to drugs and alcohol or self-harm. To address this issue, we combined several low-prevalence categories into an “other” category. As a result, our eventual model could identify the categories of profanity, bullying, discrimination, dating, and “other.” To better understand these “other” categories, and to better protect our community and ensure safe and civil discourse on Roblox, we will continue monitoring them for more examples. Over time, the subcategories in “other” will also become named categories as the number of training examples in those subcategories reaches a critical mass.
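As a rough illustration, collapsing low-prevalence labels into a catch-all bucket can be a simple mapping step applied before training. The named categories below mirror the ones above, but the helper itself and the raw label names are hypothetical:

```python
# Hypothetical helper: fold low-prevalence policy categories into "other"
# before training. The named set mirrors the categories above; the raw
# label names and this exact mapping step are illustrative assumptions.
NAMED_CATEGORIES = {"profanity", "bullying", "discrimination", "dating"}

def collapse_rare_labels(labels: list[str]) -> list[str]:
    """Map any label outside the named set to the catch-all "other"."""
    return [label if label in NAMED_CATEGORIES else "other" for label in labels]

print(collapse_rare_labels(["profanity", "drugs_alcohol", "self_harm"]))
# ['profanity', 'other', 'other']
```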

Machine Labeling Pipeline for Training Data

We designed a fully automatic machine labeling pipeline for extracting high-quality labels from voice chat sequences. Our pipeline consists of three stages (a code sketch of how they fit together follows the list):

  1. Audio chunk splitting. The first stage of the pipeline involves splitting the audio into chunks, or shorter segments, wherever we detect periods of silence between sentences. This allows us to identify and label policy-violating content more efficiently.

  2. Audio transcription. The second stage of the pipeline consists of transcribing these audio chunks into text using an automatic speech recognition (ASR) model. We use publicly available open source ASR models.

  3. Text classification. The final stage of the pipeline involves classifying the transcribed text using our in-house text filter. This filter is designed to detect and block inappropriate content in text-based communications. We adapted the filter to work with the transcribed audio data, allowing us to label the audio chunks with policy-violation classes and keywords. The text filter is an ensemble model trained on human-labeled policy-violating text data comprising an extended DistilBERT model and regular expression rules.
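To make the flow concrete, here is a minimal sketch of how these three stages might be wired together. The silence thresholds, the choice of an open-source Whisper model for ASR, and the `classify_text` stub standing in for our in-house text filter are illustrative assumptions, not the exact production components.

```python
# Illustrative sketch of the three-stage machine labeling pipeline.
# Library choices (pydub, openai-whisper), thresholds, and the text
# filter stub are assumptions, not the exact production stack.
import tempfile

from pydub import AudioSegment
from pydub.silence import split_on_silence
import whisper

asr_model = whisper.load_model("base")  # publicly available open-source ASR model


def classify_text(text: str) -> dict:
    """Stand-in for the in-house text filter (an ensemble of an extended
    DistilBERT model and regular expression rules)."""
    raise NotImplementedError


def label_voice_chat(audio_path: str) -> list[dict]:
    # Stage 1: split the audio into chunks on periods of silence.
    audio = AudioSegment.from_file(audio_path)
    chunks = split_on_silence(audio, min_silence_len=700, silence_thresh=-40)

    labels = []
    for chunk in chunks:
        # Stage 2: transcribe each chunk with the ASR model.
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            chunk.export(tmp.name, format="wav")
            text = asr_model.transcribe(tmp.name)["text"]

        # Stage 3: classify the transcript to get policy-violation labels.
        labels.append(classify_text(text))
    return labels
```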

It’s important to note that this pipeline was used only for generating training data for our ultimate production model. You might wonder, however, why train a model at all if there’s already a pipeline that generates the labels we are looking for? The answer is efficiency: we need the same level of accuracy in far less time and with far fewer resources. At Roblox scale, invoking the ASR to transcribe all voice communications would be prohibitively slow and resource intensive. However, a compact ML model trained from this data, specifically designed to detect policy violations in voice communications without doing a full transcription, is equally accurate, yet significantly faster and can be used at Roblox scale.

Scaling the Machine Labeling Pipeline

As with most large AI initiatives, the mechanism for obtaining quality training data was itself a production ML system that needed to be created from scratch. For this project, we needed to develop our machine labeling pipeline as a first-class production system with 24/7 uptime and the ability to scale to thousands of concurrent CPU cores, or an equivalent number of GPUs. We implemented a training data cluster with thousands of CPU cores that automatically process the incoming audio streams in parallel to generate machine labels. This system had to run flawlessly to maintain maximal throughput; any mistakes or downtime could cost days or weeks of training data generation.

Below is a high-level overview of the architecture that supported the scale we needed to machine label tens of thousands of audio hours in a matter of just weeks. The key takeaway here was that investing in queues at key points in our processing allowed us to remove bottlenecks by horizontally scaling worker threads across many machines. These worker threads performed the audio chunk splitting, audio transcription, and text classification steps mentioned in the previous section.
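The snippet below is a minimal, single-machine sketch of that queue-and-worker pattern; in production the queues would be backed by a distributed message broker and the workers would run across many machines. The queue names and the `start_workers` helper are illustrative, not our actual infrastructure.

```python
# Minimal sketch of the queue-and-worker pattern: each stage pulls from
# one queue and pushes results to the next, so a bottleneck stage can be
# scaled by adding workers. Names are illustrative placeholders.
import queue
import threading

split_q: queue.Queue = queue.Queue()       # raw audio awaiting chunk splitting
transcribe_q: queue.Queue = queue.Queue()  # audio chunks awaiting ASR
classify_q: queue.Queue = queue.Queue()    # transcripts awaiting text classification


def worker(in_q: queue.Queue, out_q: queue.Ueue if False else queue.Queue, process) -> None:
    """Pull items from in_q, apply the stage's processing, push results to out_q."""
    while True:
        item = in_q.get()
        for result in process(item):
            out_q.put(result)
        in_q.task_done()


def start_workers(n: int, in_q: queue.Queue, out_q: queue.Queue, process) -> None:
    """Horizontal scaling: start n worker threads for one pipeline stage."""
    for _ in range(n):
        threading.Thread(target=worker, args=(in_q, out_q, process), daemon=True).start()
```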

ML Architecture

A central requirement for our model search was low latency, i.e., near real-time speeds for model inference, which led us to architectures that operate directly on raw audio and return a score. We use Transformer-based architectures, which work very well for sequence summarization and are very successful in the industry for natural language processing (NLP) and audio modeling. Our challenge was to find a sweet spot between model complexity and low-latency inference: the model needed to handle multiple languages and accents and be robust to background noise and varying audio quality, all while satisfying our product latency constraints.

Model Selection

An immediate design question was determining the size of the context window needed to train the Transformer models. We looked at the histogram of utterance lengths in voice chat data across several days of usage and determined that a 15-second window provided the right trade-off between latency and sufficient context for classification. We use “no-violation” as a category to detect the absence of policy violations. Given that a single audio clip can embody multiple types of violations, the task is inherently multilabel rather than a conventional multiclass classification problem. We fine-tuned the entire network, including the head layers for this task, with binary cross-entropy (BCE) loss.

Caption: Histogram of voice utterances from chat data, showing that 75 percent of utterances are less than 15 seconds.
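A minimal sketch of that multilabel setup follows: a 15-second audio window is encoded, pooled, and scored against every category with an independent sigmoid/BCE term, since a single clip can violate several policies at once. The encoder is a placeholder for a WavLM-style Transformer, and the hidden size is an assumption.

```python
# Sketch of a multilabel voice safety head over a 15-second window.
# The encoder is a stand-in for a WavLM-style Transformer that maps raw
# audio (batch, samples) to frame embeddings; dimensions are assumptions.
import torch
import torch.nn as nn

CATEGORIES = ["profanity", "bullying", "discrimination", "dating", "other", "no-violation"]

class VoiceSafetyClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder                        # raw audio -> (batch, frames, hidden_dim)
        self.head = nn.Linear(hidden_dim, len(CATEGORIES))

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        frames = self.encoder(audio)                  # frame embeddings for the 15 s window
        pooled = frames.mean(dim=1)                   # mean-pool over frames
        return self.head(pooled)                      # one logit per category

# Multilabel training: each category gets an independent sigmoid/BCE term.
loss_fn = nn.BCEWithLogitsLoss()
# logits = model(audio_batch); loss = loss_fn(logits, multi_hot_targets)
```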

We evaluated several popular open-source encoder models from the audio research community and narrowed down our choices to WavLM and Whisper. Our first experiment was to fine-tune the pretrained WavLM base+ with 2,300 hours of Roblox machine-labeled voice data and evaluate the classification results on two real-world evaluation datasets. We obtained very encouraging classification results (see Model Evaluation, below), but found that the latency was larger than our thresholds for production deployment. As a follow-up, we implemented a custom version of the WavLM architecture with fewer Transformer layers and trained an end-to-end model from scratch on 7,000 hours of Roblox machine-labeled voice data. This model produces robust classifications in conversational settings and was more compact than the fine-tuned model.

Our final model candidate used a student-teacher distillation setup, with a Whisper encoder as the teacher network and the WavLM end-to-end architecture as the student network. When we trained it on 4,000 hours of audio, we saw classification accuracies similar to the fine-tuned model but with a substantial improvement in latency and a reduced model size. The table below summarizes the model parameters for the three experiments described above. We continue to iterate on the data sampling strategies, evaluation strategies, and model hyperparameters as we extend the models for multilingual voice safety classification.



| Model | Dataset size | Model size | Inference latency per second of input | Real-time factor |
|---|---|---|---|---|
| Fine-tuned WavLM | 2,300 h | 96M parameters | 102 ms | 9.80 |
| End-to-end trained | 7,071 h | 52M parameters | 83 ms | 12.08 |
| Distilled | 4,080 h | 48M parameters | 50 ms | 19.95 |
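As a hedged sketch of the distilled setup in the last row above, the compact student can be trained against both the machine labels and the frozen teacher's soft predictions; the 50/50 loss weighting and the exact distillation targets here are assumptions.

```python
# Illustrative student-teacher distillation step. The Whisper-encoder
# teacher, the compact WavLM-style student, and the loss weighting are
# stand-ins; the exact targets and weights are assumptions.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Blend supervision from machine labels with the teacher's soft targets."""
    hard = bce(student_logits, labels)                         # machine-labeled targets
    soft = bce(student_logits, torch.sigmoid(teacher_logits))  # teacher's soft predictions
    return alpha * hard + (1 - alpha) * soft

# Per training batch (teacher frozen):
# with torch.no_grad():
#     teacher_logits = teacher(audio)
# loss = distillation_loss(student(audio), teacher_logits, labels)
```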



Model Optimization

We employed standard industry methods, including quantizing selected Transformer layers to achieve a more than 25 percent speedup without compromising quality. Switching the feature extraction stage to MFCC inputs combined with convolutional neural networks (CNNs) instead of only CNNs also resulted in greater than 40 percent speedups during inference. Additionally, introducing a voice activity detection (VAD) model as a preprocessing step significantly increased the robustness of the overall pipeline, especially for users with noisy microphones. VAD allowed us to filter out noise and apply our safety pipeline only when we detect human speech in the audio, which reduced the overall volume of inference by approximately 10 percent and provided higher-quality inputs to our system.
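For the quantization step, one standard approach is post-training dynamic quantization of the Transformer's linear layers; the PyTorch snippet below shows that pattern, with a toy stand-in for the trained classifier.

```python
# Sketch of post-training dynamic quantization of linear layers in
# PyTorch; the Sequential model below is a toy stand-in for the trained
# voice safety classifier, not the actual architecture.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 6))

quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # quantize only the linear (projection) layers
    dtype=torch.qint8,  # int8 weights for faster CPU inference
)
```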

Model Evaluation

Although we used many different datasets and metrics for evaluation, we can share how our voice classifier performed on an English-language dataset with high policy violation prevalence (such as what we would find in voice abuse reports from users). This dataset was 100 percent human labeled by our moderators. When we combined all violation types (profanity, bullying, dating, etc.) into a single binary category, we observed a PR-AUC (area under precision-recall curve) score of over 0.95, as shown below. This means that on this evaluation dataset, the classifier can typically catch a great majority of violations without falsely flagging too many non-violations.
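For reference, that binary PR-AUC corresponds to average precision computed over the collapsed any-violation vs. no-violation labels, along the lines of the snippet below; the label and score arrays are placeholders.

```python
# Sketch of the binary PR-AUC (average precision) computation on the
# human-labeled evaluation set; the arrays below are placeholder values.
from sklearn.metrics import average_precision_score

y_true = [1, 0, 1, 1, 0]                   # human label: any violation vs. none
y_score = [0.92, 0.10, 0.85, 0.40, 0.22]   # classifier's violation score

print(f"PR-AUC: {average_precision_score(y_true, y_score):.3f}")
```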

The strong evaluation results above, however, do not necessarily translate directly across all use cases. For example, in the case of our notifications about policy-violating speech, the classifier is evaluating all Roblox voice chats and finding a lower prevalence of violations, and there is a greater chance of false-positive results. In the case of voice abuse reports, the classifier is evaluating only speech that has been flagged for potential violations, so the prevalence is higher. Still, the results above were encouraging enough for us to initiate experiments with the classifier in production (at conservative thresholds) to notify users about their policy-violating language. The results of these experiments greatly exceeded our expectations.

What’s Next?

By leveraging our own CPU infrastructure and carefully designing the pipeline for large scale, we were able to successfully deploy this model at Roblox scale. During peak hours, the model is successfully serving over 2,000 requests per second (the majority of which contain no violations). We have also observed a significant reduction in policy-violating behavior on the platform due to the use of the model for notifying people about policy-violating language. In particular, from our initial rollout, we are seeing a 15.3 percent reduction in severe-level voice abuse reports and an 11.4 percent decrease in violations per minute of speech.

We are extending our models with multilingual training data, which allows us to deploy a single classification model across the platform to handle several languages as well as language mixing. We are also exploring new multitask architectures that identify select keywords alongside the classification objective, without resorting to full ASR. Detecting these keywords in addition to violation labels improves the quality of the classification and provides an opportunity to give people context when issuing consequences.

The research described here was a joint effort across many teams at Roblox. This was a great display of our core value of respecting the community and a great collaboration across multiple disciplines.