Open Sourcing Roblox PII Classifier: Our Approach to AI PII Detection in Chat
Using Context to Improve Recall
Every day, users generate an average of 6.1 billion chat messages on Roblox. We use robust moderation systems, set age-based restrictions, and provide parental controls to help keep communication safe and civil. The vast majority of messages on the platform are everyday conversations, like two friends discussing gameplay strategy, but in a small percentage of messages, users attempt to share personally identifiable information (PII) that could be sensitive. PII takes many forms, and users share it for many innocuous reasons: A user might share their username from another platform to coordinate gameplay or a phone number to build a budding friendship. However, there are rare cases where bad actors seek PII to lure users away from Roblox to other platforms where there could be a higher risk of real-world harm. In practice, these differences in intent are difficult to discern, which is why we have strict policies against sharing or seeking PII. We use multiple tools to block all detected PII in chat by default, and we loosen restrictions only for users who are 18 or older and users 13 to 17 who have verified each other as Trusted Connections.
PII detection is an industry-wide technical challenge. Industry-standard detection tools can be bypassed and lack the ability to adapt to emerging language patterns. While no tool is perfect, we developed an AI model, Roblox PII Classifier, to account for the evolving nature of language and use context to detect situations where users are trying to bypass filters to ask for or share PII.
We’re excited to announce that today we’re open sourcing PII Classifier alongside the other tools in our open-source safety toolkit. Since implementing PII Classifier in late 2024, we’ve seen rapid and continuing improvement in recall, with performance surpassing other available models. The version of our PII model that we're open sourcing today achieves 98% recall on potential PII conversations in English text on Roblox[1]. The model has also achieved an F1 score of 94% on our production data, outperforming other state-of-the-art safety models, like LlamaGuard v3 8B (28%) and Piiranha NER (14%).
The Challenges
Detecting PII effectively at scale boils down to three main challenges:
- Adversarial patterns: Users are creative and continuously find new ways to bypass filters. An effective system must adapt as language evolves and new patterns emerge.
- Training and evaluation: In order to build the most effective model, we must also create effective training datasets and measurement methods. Since the model must account for emerging patterns, current production data isn’t sufficient for training.
- Performance: Serving such a model at scale requires thoughtful architecture and optimization decisions to prevent negative impact to the user experience.
Adversarial Patterns
Existing PII detection solutions mainly rely on named-entity recognition (NER): token-level detection of certain types of nouns, like social media handles, phone numbers, and addresses. But detection of nouns is only part of the challenge. Savvy bad actors intentionally alter their language to bypass NER detection (e.g., by using Alpha, Bravo, and Charlie to represent A, B, and C, or by referencing a platform without explicitly naming it). It’s possible for a bad actor to signal their intention to connect on another platform without ever sharing the sensitive information a NER filter would catch. The task for PII Classifier is not just to detect and obfuscate explicit PII text shared on Roblox, but also to understand the context of communication and stop bad actors from engaging in PII-related conversations in the first place.
Here are some representative bypassing patterns using a hypothetical social platform, StarTalk:
Character-level manipulation
- "do u have like 5tärtālk u wanna call? i made an acc like xouple days ao"
- "ggrr i hate it tags What's your name That's S And T"
Implicit references to popular social media
- "again whats ur rats ppa Reverse"
- "hey you mind chck my name on yellow sun app. let's chat there?"
Language and slang terms evolve over time, and bad actors are continually searching for new ways to evade filters. PII Classifier’s strength lies in its ability to adapt to new language patterns and workarounds as they emerge. When we detect real-world adversarial patterns, we incorporate them back into the model to help train it on an ongoing basis.
Training and Evaluation
To train the model initially, we manually reviewed and labeled PII-related data. That gave us a starting point, but it wouldn’t allow us to quickly scale and capture an expansive variety of scenarios. Rather than trying to manually comb through every term and permutation found in billions of chat messages per day and apply the appropriate label, we built and tested data samplers to select relevant samples for training. Our goal was to exclude innocuous conversations and focus on conversations that contained PII-related data to reduce the possibility of human labeling errors and cover more ground. Two samplers have proved to be most effective (both are sketched after this list):
- Uncertainty sampling using model score outputs: This sampler selected samples that didn’t evoke a strong positive or negative signal, allowing us to further refine ambiguous cases.
- Samples from consecutive PII blocks: This sampler selected samples from users who had been flagged in some conversations but not in consecutive conversations. These follow-up conversations were more likely to contain atypical language that had bypassed the current PII filter. In practice, this could look like a user failing to bypass the system and trying again until they found a clever loophole.
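Here is a compact sketch of both samplers. The score band, `score_fn`, and session structure are illustrative stand-ins, not our production values:

```python
def uncertainty_sample(chats, score_fn, low=0.35, high=0.65):
    """Keep snippets whose model score falls in the ambiguous band.

    Snippets scoring near the decision boundary carry the most labeling
    value; the band edges here are placeholders.
    """
    return [c for c in chats if low <= score_fn(c) <= high]


def consecutive_block_sample(sessions):
    """Keep messages a user sent after an earlier PII block.

    `sessions` maps user -> ordered list of (text, was_blocked) pairs.
    Messages that sailed through *after* a block are the ones most likely
    to contain a successful bypass.
    """
    candidates = []
    for msgs in sessions.values():
        blocked_earlier = False
        for text, was_blocked in msgs:
            if blocked_earlier and not was_blocked:
                candidates.append(text)
            blocked_earlier = blocked_earlier or was_blocked
    return candidates
```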
This combination of data sampling and human labeling on current production data provided a strong foundation for training the model, but since our goal was to account for emerging patterns, we needed a way to train on data that didn’t yet exist in our samples.
AI-generated synthetic data
Relying solely on current sampled data could introduce biases and limit the model’s ability to adapt as new communication patterns evolve. For example, the most common PII requests on Roblox are for popular social media platform handles. A model trained only on production data could develop a bias toward the most common requests and underperform on rarer ones, such as lesser-known social media platforms, email addresses, and phone numbers. User communication also tends to converge on popular vocabulary and language patterns. A model trained only on production data could become biased toward common language patterns and fail to identify violations expressed in atypical or emergent ways.
To counter these and other biases, we designed an AI data-generation pipeline that targets any weaknesses inherited from the initial training dataset. First, we generated prompts using a combination of variables, including context, PII type, user persona, language, and example chat lines. Then, we generated new chat lines based on these prompts and folded them into the model’s training data.
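A rough sketch of the prompt-assembly step follows. Every variable pool and the prompt template are hypothetical; the production pipeline draws from much larger, curated pools:

```python
import itertools
import random

# Hypothetical variable pools; each axis targets a known training-data gap.
CONTEXTS  = ["trading items", "planning a group raid"]
PII_TYPES = ["phone number", "email address", "handle on a niche platform"]
PERSONAS  = ["casual teen", "overly friendly stranger"]
LANGUAGES = ["English", "Spanish"]

PROMPT = (
    "Write 5 chat lines in {lang} where a {persona}, while {context}, "
    "tries to ask for or share a {pii} without stating it explicitly. "
    "Style reference: {example}"
)

def build_prompts(example_lines, k=100):
    """Cross the variable pools and render up to k generation prompts."""
    combos = list(itertools.product(CONTEXTS, PII_TYPES, PERSONAS, LANGUAGES))
    random.shuffle(combos)
    return [
        PROMPT.format(context=c, pii=p, persona=u, lang=l,
                      example=random.choice(example_lines))
        for c, p, u, l in combos[:k]
    ]

# Each prompt is sent to an LLM; the generated chat lines are then
# labeled and added to the training set.
prompts = build_prompts(["whats ur digits lol"], k=3)
```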
Human and AI red teaming
We employed both human and AI red teaming (where teams simulate adversarial attacks to test a system’s defenses) during development to test the model’s effectiveness and refine training. We invited moderators to experiment with different methods of asking for and sharing PII and prompted LLMs to augment these methods in various ways, then added any samples the model missed to its training dataset. AI red teaming helped us quickly test many variations and cover methods that moderators may not have covered. For example:
Original: the password is xxxx
AI augmented: THE PAAS WURD IS xxxx
Original: Bella my phone number is 346
AI augmented: Bella my numb3r is actually threefour6
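As a toy illustration of the character-level rewrites above, here are two rule-based augmenters. In practice we prompt an LLM for far more varied rewrites; these fixed rules are only a sketch:

```python
# Map letters to look-alike digits and spell out digits as words,
# echoing the "numb3r" and "threefour6" patterns above.
LEET = str.maketrans({"a": "4", "e": "3", "o": "0", "s": "5"})
DIGIT_WORDS = {str(i): w for i, w in
               enumerate("zero one two three four five six seven eight nine".split())}

def leetify(text: str) -> str:
    return text.lower().translate(LEET)

def spell_digits(text: str) -> str:
    return "".join(DIGIT_WORDS.get(ch, ch) for ch in text)

print(leetify("the password is xxxx"))   # th3 p455w0rd i5 xxxx
print(spell_digits("my number is 346"))  # my number is threefoursix
```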
Red teaming helped us better understand gaps in our current training data and adapt our synthetic data to close them. It also allowed us to measure differences between model iterations, which becomes increasingly difficult as two versions of a model start to saturate the evaluation set. We served multiple versions of the model under the red-teaming tool to directly compare bypass rates in the same environment and determine which model was statistically more effective.
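One common way to make that call is a two-proportion test on bypass counts from the shared red-teaming environment. The sketch below uses invented counts and a textbook z-statistic; it illustrates the idea rather than our exact methodology:

```python
from math import sqrt

def two_proportion_z(bypasses_a, trials_a, bypasses_b, trials_b):
    """z-statistic for H0: both model versions have equal bypass rates."""
    p_a, p_b = bypasses_a / trials_a, bypasses_b / trials_b
    p = (bypasses_a + bypasses_b) / (trials_a + trials_b)  # pooled rate
    se = sqrt(p * (1 - p) * (1 / trials_a + 1 / trials_b))
    return (p_a - p_b) / se

# Invented counts: model A bypassed 41/1000 attempts, model B 72/1000.
z = two_proportion_z(41, 1000, 72, 1000)
print(z)  # |z| > 1.96 -> the difference is significant at the 5% level
```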
Performance
With an average of 6.1 billion chat messages exchanged per day, PII Classifier receives a peak of over 200,000 queries per second on Roblox. We handle this volume at a P90 latency under 100 ms. To balance serving cost and quality, we chose an encoder-only architecture and fine-tuned our model from XLM-RoBERTa-Large[2]. We run the tokenizer and pre- and post-processing as separate services on CPUs and serve the pure transformer architecture on GPUs to lower costs. We also use dynamic batching on Triton servers to increase throughput.
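A rough sketch of that CPU/GPU split using the Hugging Face transformers API follows. The base checkpoint name and classification head here are placeholders (the open-sourced model is fine-tuned from XLM-RoBERTa-Large), and in production Triton handles the batching:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Tokenization stays in a CPU service; only the dense transformer
# forward pass runs on the GPU.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2
).half().to("cuda").eval()

def score(snippets):
    """Return a PII score per chat snippet (placeholder pipeline)."""
    batch = tokenizer(snippets, padding=True, truncation=True,
                      return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**batch).logits
    return torch.softmax(logits.float(), dim=-1)[:, 1].tolist()
```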
Benchmarking on Public and Internal Datasets
We benchmarked PII Classifier against other state-of-the-art models using our own production data and public datasets. Most public PII datasets focus on the PII text itself rather than on the surrounding text that could signal intention, so nothing perfectly aligned with our platform requirements for benchmarking. We nevertheless wanted to see how our model stacked up to current detection solutions using popular PII datasets, like The Learning Agency Lab's PII Data Detection Dataset[3] on Kaggle.
We used F1 scores because the LLMs in the comparison provide only a single (recall, precision) operating point. For models that output classification scores, we reported the optimal F1 score on the test set. Note that our model takes a snippet of user chat lines as input and outputs a PII score, which we threshold to make a binary decision on the chat lines. For a fair comparison, we split the public dataset by sentence and labeled each sentence positive if it contained any positive NER PII tokens.
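For score-emitting models, “optimal F1” means the best F1 over all decision thresholds. A minimal version using scikit-learn, with `labels` and `scores` as placeholders for a benchmark’s sentence-level ground truth and a model’s PII scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def optimal_f1(labels, scores):
    """Best F1 over all thresholds, as reported in the table below."""
    precision, recall, _ = precision_recall_curve(labels, scores)
    # Guard against 0/0 at degenerate thresholds.
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-9, None)
    return f1.max()
```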
| F1 score | PII V1.1 | LlamaGuard-v3 1B | LlamaGuard-v3 8B | LlamaGuard-v4 12B | NemoGuard 8B | Piiranha NER |
| --- | --- | --- | --- | --- | --- | --- |
| Kaggle PII dataset | 45.48% | 5.90% | 5.46% | 3.72% | 3.26% | 33.20% |
| Roblox Eval English | 94.34% | 3.17% | 27.73% | 26.55% | 26.29% | 13.88% |
In our benchmarks, our model dramatically outperformed other open source models on both The Learning Agency Lab's public dataset and our internal production data, which includes more than 47,000 diverse, real-world samples on Roblox. The focus on incorporating broader conversational context and continually adapting to the fluid nature of language has proved to be an effective approach to detecting more conversations where a user intends to ask for or share PII.
PII Classifier is just one of the many innovative systems we use to promote safety and civility on Roblox. The ability to detect when a conversation veers toward a request for PII means we can capture cryptic requests that may otherwise bypass detection. While no system is perfect, the results from our first year in production are already promising, and we’re excited to share the tool with the open-source community alongside the other tools in our open-source safety toolkit.
1. The 98% recall is measured on a Roblox internal test set at a 1% false-positive rate. The dataset is collected from production data, and each sample is reviewed and labeled by multiple safety experts.
2. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
3. Holmes, L., Crossley, S. A., Sikka, H., and Morris, W. 2023. PIILO: An open-source system for personally identifiable information labeling and obfuscation. Information and Learning Sciences, 124(9/10), 266-284.