Launching More Languages for Our Open-Source Voice Safety Model

  • We’re updating our open-source voice safety classifier, increasing its parameter count from 94.6 million to 120.2 million and expanding it to seven additional languages.

  • Since the first version of the classifier, we’ve improved accuracy to 59.1% recall on English-language voice chat data at a 1% false-positive rate, a 92% improvement over the previous release’s 30.9% recall.
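
For concreteness, here is a minimal sketch of how recall at a fixed false-positive rate can be measured. The helper function and the synthetic scores are illustrative only, not part of the released model's tooling:

```python
import numpy as np

def recall_at_fpr(scores_pos, scores_neg, target_fpr=0.01):
    """Recall on violating clips at a fixed false-positive rate.

    scores_pos: classifier scores for known-violating audio clips
    scores_neg: classifier scores for known-benign audio clips
    (illustrative helper, not from the released model's tooling)
    """
    # Choose the threshold that lets through target_fpr of benign clips.
    threshold = np.quantile(scores_neg, 1.0 - target_fpr)
    return float(np.mean(scores_pos > threshold))

# Synthetic example: benign scores spread over [0, 1), three violating clips.
benign = np.arange(100) / 100.0
violating = np.array([0.5, 0.99, 1.0])
print(recall_at_fpr(violating, benign))  # 2 of the 3 clips exceed the threshold
```

Holding the false-positive rate fixed is what makes recall numbers from different model versions directly comparable.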

Promoting safety and civility has always been foundational to everything we do at Roblox. We’ve spent nearly two decades building strong safety systems, and we’re continually growing and evolving them as new technology becomes available. In 2024, we shipped more than 40 safety improvements, including a revamp to our Parental Controls, which we’re updating again today. We also launched one of the industry’s first-ever open-source voice safety classifiers, which has been downloaded more than 23,000 times. Today, we’re releasing an updated version, which is even more accurate and works in more languages.

Many of the safety systems that help protect our users, including this classifier, are powered by AI models. We open-source some of these because we know that sharing AI safety advances benefits our entire industry. That’s also why we recently joined ROOST—a new nonprofit dedicated to tackling important areas in digital safety by promoting open-source safety tools—as a founding partner.

Given the volume of content and interactions occurring on our platform every day around the world, AI is essential to keeping users safe. We’re confident that the models we’ve built are helping support our needs. In the fourth quarter of 2024, for example, Roblox users uploaded 300 billion pieces of content. Just 0.01% of those videos, audio clips, text messages, voice chats, avatars, and 3D experiences were detected as violating our policies. And nearly all of that policy-violating content was automatically prescreened and removed before users ever saw it.

We’ve updated the open-source version of our voice safety classifier to make it more accurate and to help us moderate content across more languages. The new model:

  • Detects violations in seven additional languages—Spanish, German, French, Portuguese, Italian, Korean, and Japanese—thanks to training on multilingual data.

  • Has an overall recall of 59.1% at low false-positive rates, a 92% improvement over the previous release’s 30.9% recall.

  • Is optimized for serving at scale, handling up to 8,300 requests (the majority of which contain no violations) per second at peak.

Since the release of the first model, abuse reports per hour of speech have dropped by more than 50% among U.S. users. The model has also helped us moderate millions of minutes of voice chat per day more accurately than human moderators. We never stop advancing our safety systems, and we’ll continue to update the open-source version as well.

Efficient Multilingual Voice Safety Classifier

Our initial open-source voice safety classifier was based on a WavLM base+ model, fine-tuned with machine-labeled English-language voice chat audio samples. The encouraging results of this end-to-end approach led to further experiments with a customized architecture. We used knowledge distillation to balance the model’s complexity and accuracy, which is appealing for large-scale inference serving. Our new classifier uses these fundamental building blocks, and scales up and extends the work in data usage and architecture refinements.
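
Knowledge distillation here means training a smaller student model to match a larger teacher's softened output distribution. A minimal sketch of the standard distillation loss follows; the temperature value and logits are illustrative, not the production objective:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T  # temperature-softened logits
    z = z - z.max(axis=-1, keepdims=True)    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

The `T * T` factor keeps gradient magnitudes comparable across temperatures, following the original distillation formulation; the loss is zero exactly when the student reproduces the teacher's distribution.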

By training on multilingual data, our single classifier model can seamlessly operate on any of our top eight supported languages. And our training improvements mean that the model is both more accurate and 20% to 30% faster to run in a typical inference scenario than the first version.

The new voice safety classifier is still based on the WavLM architecture, but its layer configuration deviates from both the previous release and the WavLM pretrained models. In particular, we added a convolutional layer that reduces the internal time resolution of the transformer layers. In total, the new architecture has 120.2 million parameters, an increase of 27% over the previous version’s 94.6 million. Despite this increase, the new model consumes 20% to 30% less compute time when used with 4- to 15-second input segments, because it compresses the input signal into a shorter representation than before.
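
The effect of that extra downsampling layer can be pictured with a toy strided pooling step standing in for the learned convolution. Halving the time axis shrinks the sequence the transformer layers attend over, and self-attention cost grows with the square of sequence length, which is how a larger model can still run faster:

```python
import numpy as np

def downsample_time(x, stride=2, kernel=3):
    """Strided 1-D average over time, a toy stand-in for the added conv layer.

    x: (time, channels) feature sequence headed into the transformer stack.
    """
    T, C = x.shape
    out_T = (T - kernel) // stride + 1
    out = np.empty((out_T, C))
    for t in range(out_T):
        out[t] = x[t * stride : t * stride + kernel].mean(axis=0)
    return out

# A 400-frame sequence becomes 199 frames: attention runs on roughly half the length.
feats = np.random.randn(400, 8)
print(downsample_time(feats).shape)  # (199, 8)
```

In the real model the downsampling is a learned convolution rather than a fixed average, but the sequence-length arithmetic is the same.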

Utilizing a Variety of Labeling Strategies

Supervised training of an end-to-end model requires curated pairs of audio and class labels. We made significant improvements to our data pipeline that ensured a steady stream of labeled data. The foundation of the training material is a large machine-labeled dataset of more than 100,000 hours of speech spanning the supported languages. We automatically transcribed the speech and ran it through our in-house text-based toxicity classifier, which uses the same policy and toxicity categories. The data collection samples abusive content with a higher probability than benign speech to better capture edge cases and less common policy violations.
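
The oversampling step might look something like the following sketch; the weight value, field names, and clip records are hypothetical:

```python
import random

def sample_clips(clips, abuse_weight=5.0, k=1000, seed=0):
    """Draw a training batch, oversampling machine-flagged abusive clips.

    Each clip is a dict with a hypothetical "flagged" field set by running
    the text-based toxicity classifier on the clip's transcript.
    """
    weights = [abuse_weight if clip["flagged"] else 1.0 for clip in clips]
    return random.Random(seed).choices(clips, weights=weights, k=k)

# Only 10% of clips are flagged, but they make up far more of the sampled batch.
clips = [{"id": i, "flagged": i < 10} for i in range(100)]
batch = sample_clips(clips, abuse_weight=9.0)
print(sum(c["flagged"] for c in batch) / len(batch))  # roughly 0.5
```

Skewing the sample toward flagged content gives the rarer violation categories enough representation in training without having to label vastly more benign speech.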

Labels based on speech transcripts and text-based classification can’t fully capture the nuances observed in voice chat content. So we used human-labeled data to fine-tune the model from the previous training stage. While the classification task is the same, this second stage refines the decision boundaries and sharpens the model’s responsiveness to expressions specific to voice chat. This is a form of curriculum learning that lets us get the maximum benefit from the valuable human-labeled examples.

One challenge with end-to-end model training is that the target labels can become obsolete if the labeling policy changes over time. So as we refine our acceptable voice policy, we need special handling for data that uses older labeling standards. For this, we utilized a multitask approach that allows the model to learn from datasets that don’t match the current voice chat policy. This involves dedicating a separate classification head for the old policy, allowing the model trunk to learn from the old dataset without affecting targeted labels or the primary head.
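
One way to picture the multitask setup: a shared trunk feeds two classification heads, and each example's loss is routed to the head matching the policy under which it was labeled, so old-policy data still shapes the shared representation without contaminating the primary head. A minimal sketch with hypothetical shapes and names:

```python
import numpy as np

def multitask_loss(trunk_feats, labels, policies, w_new, w_old):
    """Binary cross-entropy, routed per example to the matching policy head.

    trunk_feats: (batch, dim) shared-trunk outputs
    policies: "new" or "old" per example; old-policy examples train only the
    old head (and, through the trunk, the shared representation).
    (hypothetical shapes and names, for illustration)
    """
    def bce(logit, y):
        p = 1.0 / (1.0 + np.exp(-logit))  # sigmoid
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    losses = [
        bce(float(f @ (w_new if pol == "new" else w_old)), y)
        for f, y, pol in zip(trunk_feats, labels, policies)
    ]
    return float(np.mean(losses))
```

At inference time only the primary (current-policy) head is used; the old head exists solely so that older labeled data can keep contributing gradient signal to the trunk.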

A Calibrated Model for Easier Deployment

Using the classification model requires choosing an operating point and matching the classifier’s sensitivity to the task requirements. To simplify deployment, we calibrated the model outputs, tuned for voice chat moderation. We estimated piecewise-linear transformations from a held-out dataset, separately for each output head and supported language. These transformations were applied during model distillation, ensuring that the final model was natively calibrated and eliminating the need for post-processing during inference.
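
Per-head, per-language piecewise-linear calibration can be pictured as a small table of knot points applied with linear interpolation. The head name and knot values below are invented for illustration, and recall that in the shipped model the mapping is folded in during distillation rather than applied as a separate inference step:

```python
import numpy as np

# Hypothetical calibration knots, one curve per (output head, language),
# estimated from a held-out dataset: raw scores -> calibrated scores.
CALIBRATION = {
    ("profanity", "en"): (np.array([0.0, 0.2, 0.8, 1.0]),
                          np.array([0.0, 0.05, 0.9, 1.0])),
}

def calibrate(raw_score, head, lang):
    """Apply the piecewise-linear map for one head/language pair."""
    x, y = CALIBRATION[(head, lang)]
    return float(np.interp(raw_score, x, y))

print(calibrate(0.5, "profanity", "en"))  # ~0.475
```

After calibration, a score of 0.9 means roughly the same thing regardless of head or language, so a single deployment-wide threshold can serve as the operating point.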

We are excited to share this new open-source model with the community and look forward to sharing future updates as we have them.