Open Sourcing Roblox Sentinel: Our Approach to Preemptive Risk Detection

Using AI to Help Detect Abnormal Chat Patterns Early

  • Every day, more than 100 million users of all ages have a safe and positive experience on Roblox.
  • We strive to make our systems as safe as possible by default, especially for our youngest users. We do this through extremely conservative policies and by leveraging AI to filter inappropriate chat messages we detect, including personally identifiable information (outside of Trusted Connections). We proactively moderate content and do not allow the sharing of real-world images in chat.
  • Of course, no system is perfect, and one of the biggest challenges in the industry is detecting critical harms like potential child endangerment. A series of friendly chats and supportive messages might take on a different meaning over the course of a longer conversational history, especially when those exchanges happen between users of different age groups.
  • We have developed Roblox Sentinel, an AI system built on contrastive learning that helps us detect early signals of potential child endangerment, such as grooming, allowing us to investigate even sooner and, when relevant, report to law enforcement.
  • In the first half of 2025, Sentinel helped our team to submit approximately 1,200 reports of potential attempts at child exploitation to the National Center for Missing and Exploited Children. This includes attempts to circumvent our filtering mechanisms and other safeguards.
  • We are excited to open source Roblox Sentinel, and we are actively seeking community engagement, which we hope will help build a safer internet.

Spending time with friends and competing with other players is a central component of Roblox, and communication is at the heart of those activities. In fact, every day, more than 111 million users come to Roblox, where the community sends an average of 6.1 billion chat messages and generates 1.1 million hours of voice communications in dozens of languages. This communication mirrors the real world—the vast majority is everyday chat, from casual conversations to discussing gameplay, but a small number of bad actors seek to circumvent our systems and possibly attempt to cause harm.

Last month, we shared our vision for age-based communication. We strive to make our systems as safe as possible by default, especially for our youngest users. For example, we do not allow user-to-user image or video sharing via chat. Our systems, while not perfect, are continuously improving and are designed to proactively block personally identifiable information (like phone numbers and usernames), and chat between non-age-verified users is strongly filtered (and not allowed for users under 13). Roblox is one of the largest platforms to require facial age estimation before users can chat more freely with the people they know. Our goal is to lead the world in safety for online gaming, and we’re committed to open sourcing key safety technology.

Today, we’re releasing our latest open-source model, Sentinel, an AI system to help detect interactions that could potentially lead to child endangerment. Long before something becomes explicit, Sentinel enables us to detect and investigate subtle patterns early, and when relevant, report to law enforcement.

Sentinel has been running on Roblox since late 2024 and is the latest addition to our open-source safety toolkit. In the first half of 2025, 35% of the cases we detected came through this proactive approach, often before an abuse report could be filed. When combined with our other moderation systems, Sentinel expands the arsenal of tools we have to detect and act on these potentially serious violations.

Understanding the Challenge

Child endangerment is a challenge across the industry, making new technologies and open collaboration incredibly valuable. Online grooming, the systematic building of trust and emotional connection with the ultimate goal of exploitation, is by nature a subtle and gradual process. These interactions are rare and often start as a series of friendly chats, supportive messages, and shared interests. Messages that initially appear innocuous can take on a different meaning over the course of a longer conversation history. Bad actors often use subtle, indirect, or coded language, purposely making patterns hard to detect, even for human reviewers. Our detection systems therefore continually evolve to keep pace with new ways bad actors attempt to evade them. On top of this, training data for grooming is scarce, making it difficult to train machine learning systems.

Proactive Impact and Operational Insights

Sentinel is currently running in production at scale. In the first half of 2025, its proactive capabilities have helped our team submit approximately 1,200 reports to the National Center for Missing and Exploited Children. While we will always have room to improve, Sentinel’s early detection capabilities are already helping us identify and investigate potential bad actors earlier in the process, when messages are still subtle and before they are surfaced by user-submitted abuse reports.

Human experts are essential for investigating and intervening in the cases Sentinel detects. Trained analysts, typically former CIA or FBI agents and other experts, review cases that Sentinel flags as potentially violative. The decisions made by these analysts create a feedback loop that enables us to continuously refine and update the examples, indexes, and training sets. This human-in-the-loop process is essential to help Sentinel adapt to and keep pace with new and evolving patterns and methods of bad actors working to evade our detection.

Sentinel is an important part of Roblox’s larger layered safety system, which blends innovative AI tools and thousands of human experts. As of today, it’s also part of our Roblox open-source safety toolkit. We believe fostering a safer digital world is a shared responsibility. By open sourcing safety systems like Sentinel, sharing our approaches, and becoming founding members of organizations such as Robust Open Online Safety Tools (ROOST) and the Tech Coalition’s Lantern project, we hope to contribute to the collective advancement of online safety practices and the online communities that rely on them.

Our longer-term vision for Sentinel extends beyond conversation. The principles of using embeddings and contrastive measuring are highly adaptable. We’re actively exploring and developing capabilities to apply these techniques to a wider range of user interactions, moving toward multimodal understanding—across text, image, video, and more. By analyzing these signals together, we hope to work toward a more holistic, robust understanding of user behavior so we can better identify potential safety risks that single-modality systems could miss.

Inside the Tech: How Sentinel Powers Preemptive Detection

To help our moderation system act swiftly, before intent to harm turns into action, Sentinel needs to run the full analysis pipeline in near real time, at massive scale, across more than 6 billion chat messages every day. Sentinel continuously captures text chat in one-minute snapshots. Messages are automatically analyzed by ML models with the sole purpose of identifying potential harms, such as grooming or child endangerment. We then aggregate this information over time, surfacing concerning cases and patterns for human analysts to assess and investigate.
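To make that flow concrete, here is a minimal Python sketch of the snapshot-and-aggregate idea. The helpers (`minute_bucket`, `score_snapshot`) are illustrative placeholders, not the production Sentinel pipeline.

```python
from collections import defaultdict
from datetime import datetime

# Minimal sketch of the snapshot-and-aggregate flow described above.
# `score_snapshot` stands in for the ML analysis step; it is a placeholder,
# not part of any released Sentinel API.

def minute_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to its one-minute snapshot window."""
    return ts.replace(second=0, microsecond=0)

def group_into_snapshots(messages):
    """messages: iterable of (user_id, timestamp, text) tuples."""
    snapshots = defaultdict(list)
    for user_id, ts, text in messages:
        snapshots[(user_id, minute_bucket(ts))].append(text)
    return snapshots

def aggregate_risk(snapshots, score_snapshot):
    """Accumulate per-user risk scores across snapshot windows."""
    per_user_scores = defaultdict(list)
    for (user_id, _window), texts in snapshots.items():
        per_user_scores[user_id].append(score_snapshot(texts))
    return per_user_scores
```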

Unlike tools that rely on static rules and labeled examples, Sentinel uses self-supervised training to learn how to spot—and generalize—communication patterns as they occur. This allows Sentinel to identify new and evolving threats.

The team achieved this by developing two indexes. One, the positive index, is made up of communication from users who engage in safe, benign messaging. The other, the negative index, is composed of communications that were removed because we determined they were child-endangerment policy violations. This contrastive approach helps the system generalize and spot evolving threats even when they don’t precisely match previously detected communication patterns in the index. One of Sentinel’s key advantages is that it does not require a large number of exemplars to function, which is particularly important given the low prevalence of negative exemplars. Our current production system operates with just 13,000 exemplars in the negative index while still successfully identifying potential harm.

To build the positive index, we use a curated sample of chat history from users with no history of safety-related Community Standard violations and consistent, long-term positive engagement on Roblox. By using this curated sample of Roblox chat history, rather than generic text datasets, we were able to help Sentinel learn new slang and Roblox-specific language patterns and styles. This helps the system make more accurate comparisons, reducing false positives and allowing it to better differentiate between typical Roblox communication and violative communication.

The negative index is built from conversations reviewed by our human moderators, where we’ve found clear evidence of child endangerment policy violations (which we’ve already taken action on). When a user’s interactions show sustained, concerning activity, we label specific snippets of those conversations as examples of harmful communication. Those labeled segments are transformed into embedding vectors and added to the negative index. With this training, Sentinel learns to go beyond flagging certain words or phrases; it learns from the contextual patterns and progressions that real intent-to-harm conversations follow. Because of this, the system can recognize harmful communications that our other AI moderation systems may not, even when they appear subtle.
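Conceptually, each index is simply a collection of embedding vectors that new messages can be compared against. The sketch below illustrates that structure in Python; `embed` is a placeholder for a text encoder and is not the actual Sentinel model.

```python
import numpy as np

# Sketch of the two-index structure described above: each index is a matrix
# of L2-normalized embedding vectors built from labeled chat snippets.

def normalize(vectors: np.ndarray) -> np.ndarray:
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def build_index(snippets, embed) -> np.ndarray:
    """Embed labeled chat snippets and stack them into an index matrix."""
    return normalize(np.stack([embed(s) for s in snippets]))

# positive_index = build_index(benign_snippets, embed)     # curated safe chat
# negative_index = build_index(violating_snippets, embed)  # ~13K exemplars in production
```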

For example, a simple message such as “Hey, how are you?” would match the positive index because the language is benign. A message like “Where are you from?” would match the negative index because it resembles patterns seen in potential grooming conversations. The system compares new messages with these indexes, and if it sees a user asking “Where are you from?” it may start gathering more information to see whether the conversation continues down the negative path. While one message wouldn’t be enough to flag for human review, a continued pattern would be.

Contrastive Measuring

This contrastive measuring approach is inspired by SimCLR, a self-supervised learning framework that uses contrastive learning to train image representation models without labeled data. We’ve adapted this technique to work with text and voice data, enabling Sentinel to understand what a user says and how it conforms with or diverges from known patterns. This works in three stages: interaction scoring, pattern tracking, and taking action.
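For readers unfamiliar with SimCLR, its core objective is the NT-Xent (normalized temperature-scaled cross-entropy) loss, which pulls embeddings of related examples together and pushes unrelated ones apart. The NumPy sketch below shows that generic objective; it illustrates the underlying idea rather than Sentinel’s actual training code.

```python
import numpy as np

def nt_xent_loss(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent loss from SimCLR. z_a[i] and z_b[i] are embeddings of two
    'views' of the same example; all other rows in the batch act as negatives."""
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # work in cosine space
    sim = z @ z.T / temperature                             # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                          # ignore self-pairs
    n = len(z_a)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # i <-> i+n are positives
    # Row-wise log-softmax, then take the log-probability of each positive pair.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (np.log(np.exp(sim - m).sum(axis=1, keepdims=True)) + m)
    return float(-log_prob[np.arange(2 * n), targets].mean())
```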

Measuring Individual Interactions: Each message is converted into an embedding, a vector that captures the message’s semantic and communication features. Sentinel compares this embedding against the positive and negative indexes. Using cosine similarity, the system then measures which index the interaction is closer to.

If the interaction is more aligned with the harmful patterns in the negative index, it receives a higher risk indicator. Messages that don’t meaningfully align with either safe or harmful communication patterns are filtered out, so the system can focus only on the interactions that carry a potential signal. This can help reduce false positives and improve the accuracy of measuring interactions over time.
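The sketch below illustrates this scoring-and-filtering step: compare a message embedding against both indexes with cosine similarity and discard interactions that carry no meaningful signal. The neutrality threshold shown is an arbitrary illustrative value, not a Sentinel constant.

```python
import numpy as np

# Illustrative scoring step, assuming both indexes hold row-normalized
# embeddings (as in the earlier sketch).

def risk_indicator(message_vec, positive_index, negative_index, neutral_threshold=0.35):
    v = message_vec / np.linalg.norm(message_vec)
    pos_sim = float((positive_index @ v).max())   # closest benign exemplar
    neg_sim = float((negative_index @ v).max())   # closest violating exemplar
    if max(pos_sim, neg_sim) < neutral_threshold:
        return None                               # no meaningful signal; filter out
    return neg_sim - pos_sim                      # > 0 means closer to harmful patterns
```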

Tracking Patterns With Skewness, Not Just Averages: Bad actors often mask their intent by mixing it in with harmless content. If we simply averaged a user’s measurements over time, the negative messages we want to detect could become lost in the noise. Instead, Sentinel looks at the distribution of measurements over time and measures statistical skewness, a way of detecting whether rare, high-risk messages are pulling the risk profile upward.

This helps us detect early signs of escalation toward dangerous communication, even if most interactions seem benign. When we analyze skew, we also correct for volume. Highly active users might look riskier simply because their communication shows a larger absolute number of matches. By emphasizing statistical skew rather than overall volume, we can avoid false positives involving chatty but compliant users. With this approach, Sentinel is not just scalable but also more precise, capable of processing vast communication flows to find the rare but critical signals that help us detect intent to harm.
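As an illustration, sample skewness can be computed directly from a user’s per-interaction risk scores; because it is a shape statistic rather than a count, it does not inflate with sheer message volume. This is the generic formula, not Sentinel’s exact aggregation logic.

```python
import numpy as np

def risk_skewness(scores) -> float:
    """Sample skewness of a user's per-interaction risk scores. A strong
    positive skew indicates a few unusually high-risk messages pulling the
    tail upward, even when the average looks benign."""
    x = np.asarray(scores, dtype=float)
    if len(x) < 3 or x.std() == 0:
        return 0.0
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# Because skewness describes the shape of the distribution, it does not grow
# with message volume the way a raw count of negative-index matches would,
# which helps avoid penalizing chatty but compliant users.
```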

Going From Signal to Action: As more interactions are measured, the system builds a dynamic risk profile. When a user’s pattern shows strong alignment with intent-to-harm communication, or a skew moving in that direction, Sentinel triggers a flag for deeper review and investigation.
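Tying the pieces together, a simplified escalation rule might look like the following, reusing the `risk_skewness` helper from the previous sketch. The thresholds are hypothetical; the actual criteria for routing a case to human review are not public.

```python
# Hypothetical escalation rule combining the sketches above; assumes
# `risk_skewness` from the previous example is in scope.

def should_flag_for_review(scores, skew_threshold=2.0, sustained_threshold=0.25, min_signals=5):
    signals = [s for s in scores if s is not None]        # drop filtered-out interactions
    if len(signals) < min_signals:
        return False                                      # not enough evidence yet
    skewed = risk_skewness(signals) > skew_threshold      # rare but extreme spikes
    sustained = sum(s > 0 for s in signals) / len(signals) > sustained_threshold
    return skewed or sustained                            # either pattern warrants human review
```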