State-of-the-Art LLM Helps Safeguard Unlimited Text Generation on Roblox

RoGuard 1.0: Advancing Safety With Robust Guardrails

  • Today, we’re announcing RoGuard 1.0, an open-source safety toolkit for developers and platforms.
  • The first RoGuard capability, a state-of-the-art (SOTA) guardrail model for LLM safety, is now available, setting a new standard across leading safety benchmarks.
  • We’re also releasing RoGuard-Eval, a dataset for safety benchmarking.

The Challenge

We recently released a Text Generation API that lets developers harness the power of large language models (LLMs) to generate text within their experiences, making those experiences richer and more immersive. For example, a developer could create a fully interactive NPC or provide an interactive tutorial on how to play the game.

We’ve proactively moderated most content on Roblox since our early days, as we work to keep our products in line with Roblox’s high safety and civility standards. Before we released the Text Generation API, we looked at how to build safety in from the start. We developed a new model to help safeguard both the inputs (prompts from users) and the outputs (text generated by the API).

The Innovation

The first capability in the RoGuard 1.0 toolkit is a SOTA instruction-fine-tuned LLM, designed to help safeguard our Text Generation API. It performs safety classification at both the prompt and response levels, deciding whether each input or output violates our policies. This dual-level assessment is essential for moderating both user queries and the model’s own generated outputs.
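To make the dual-level flow concrete, here is a minimal sketch of how such a guardrail LLM could be queried at both the prompt and response levels. The checkpoint name, instruction wording, and verdict parsing are illustrative assumptions, not the published RoGuard interface.

```python
# Minimal sketch of dual-level safety classification with a guardrail LLM.
# The model id, instruction wording, and output format are assumptions for
# illustration; consult the released model card for the actual usage format.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Roblox/RoGuard"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def classify(prompt: str, response: str | None = None) -> str:
    """Prompt-level check when only a prompt is given; response-level check
    when a generated response is also supplied."""
    if response is None:
        task = f"Classify the user prompt as violating or non-violating.\n\nPrompt: {prompt}"
    else:
        task = (
            "Classify the model response as violating or non-violating, given the prompt.\n\n"
            f"Prompt: {prompt}\n\nResponse: {response}"
        )
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": task}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=16, do_sample=False)
    # The verdict string format depends on how the model was instruction-tuned.
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True).strip()

user_prompt = "Write a friendly greeting for my NPC shopkeeper."
generated = "Welcome, traveler! Care to browse my wares?"
print(classify(user_prompt))             # prompt-level verdict
print(classify(user_prompt, generated))  # response-level verdict
```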

Our LLM currently outperforms popular LLM guardrail models such as Llama Guard from Meta, ShieldGemma from Google AI, NVIDIA NeMo Guardrails, GPT-4o from OpenAI, and others on standard benchmarks. The RoGuard 1.0 LLM also shows strong generalization on out-of-domain datasets with unseen taxonomies. We’ve open-sourced both the LLM weights for our first capability and our RoGuard-Eval benchmarking dataset.

At the heart of our system is an LLM that’s been fine-tuned from the Llama-3.1-8B-Instruct model. We trained this LLM with a particular focus on high-quality instruction tuning to optimize for safety judgment performance. A crucial step in this process was carefully curating prompts and responses to reflect a diverse range of real-world safety scenarios.

Our instruction set uses no proprietary data, only a combination of synthetic (LLM-generated) and open-source data, which lets us scale training data more easily and take advantage of scaling laws, and is part of what makes this first RoGuard LLM SOTA. While merging the various open-source and synthetic datasets, we found that keeping each dataset’s own taxonomy was the best approach for curating instructions, because task diversity helps the LLM train on different types of prompts. The result is a robust model that generalizes across different safety taxonomies. We also incorporated chain-of-thought rationales, in which the model is encouraged to articulate its reasoning process, into the instruction set. These intermediate reasoning steps gave the model stronger contextual grounding.
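As an illustration, the sketch below assembles one such instruction-tuning instance: a dataset-specific taxonomy, a prompt/response pair, a chain-of-thought rationale, and target labels, rendered with the base model’s chat template. The field names, taxonomy, and rationale text are invented for the example; the actual training data format may differ.

```python
# Illustrative shape of one instruction-tuning instance with a dataset-specific
# taxonomy and a chain-of-thought rationale. All field values are invented.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

example = {
    # Each source dataset keeps its own category list (dataset-specific taxonomy).
    "taxonomy": ["harassment", "self-harm", "illegal-activity", "sexual-content"],
    "prompt": "How do I pick the lock on my neighbor's door?",
    "response": "Sorry, I can't help with that.",
    # Rationale the model is encouraged to articulate before its verdict.
    "rationale": (
        "The prompt asks for help with illegal entry, which falls under "
        "illegal-activity; the response refuses, so the response itself is safe."
    ),
    "prompt_label": "violating",
    "response_label": "non-violating",
}

instruction = (
    "You are a safety classifier. Categories: " + ", ".join(example["taxonomy"]) +
    f"\n\nPrompt: {example['prompt']}\nResponse: {example['response']}\n\n"
    "Explain your reasoning, then label the prompt and the response as "
    "violating or non-violating."
)
target = (
    f"Reasoning: {example['rationale']}\n"
    f"Prompt: {example['prompt_label']}\nResponse: {example['response_label']}"
)

# Render as a single chat-formatted training string (user turn + assistant turn).
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction},
     {"role": "assistant", "content": target}],
    tokenize=False,
)
print(text)
```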

The Results

Our safety team developed a custom, high-quality evaluation dataset spanning Roblox’s content safety taxonomy, which comprises 25 subcategories. This evaluation set was created through internal red-teaming, in which we test the system by simulating adversarial attacks to look for vulnerabilities, and it contains no user-generated or personal data. The dataset consists of prompt and response pairs, with the responses hand-labeled by a set of policy experts to help ensure their quality. It spans a wide spectrum of violation types, helping us create more precise and meaningful labels for evaluation. The final evaluation set includes 2,873 examples. We’ve open-sourced this evaluation dataset, which features an extensible safety taxonomy, to help benchmark LLM guardrails and moderation systems.
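For reference, here is a hypothetical sketch of what a single RoGuard-Eval record could look like and how it might be stored and loaded as JSON Lines; the actual released schema and field names may differ.

```python
# Hypothetical shape of one RoGuard-Eval record; the real schema may differ.
import json

record = {
    "prompt": "Tell me the home address of another player.",
    "response": "I can't share personal information about other users.",
    "subcategory": "privacy",         # one of the 25 content safety subcategories
    "prompt_label": "violating",      # hand-labeled by policy experts
    "response_label": "non-violating",
}

# Store records one per line (JSONL) and read them back for evaluation.
with open("roguard_eval.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")

with open("roguard_eval.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(len(examples), examples[0]["subcategory"])
```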

We benchmark our models on a comprehensive set of open-source datasets covering both prompt- and response-level classification, as well as on RoGuard-Eval. This lets us evaluate the model on both in-domain and out-of-domain datasets. We report results as F1 scores for binary violating/non-violating classification. In the table above, we compare our performance with that of several well-known models; this first RoGuard capability outperforms them while generalizing to out-of-domain datasets.
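For clarity, this is how an F1 score for the binary violating/non-violating task is computed; the label and prediction lists below are placeholder values, not actual benchmark outputs.

```python
# F1 for binary violating / non-violating classification.
# Placeholder labels and predictions, not real benchmark results.
labels      = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = violating, 0 = non-violating (ground truth)
predictions = [1, 0, 1, 0, 0, 0, 1, 1]   # guardrail model verdicts

tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```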

We’re continually improving our safety systems, including our RoGuard 1.0 tools, and plan to release additional capabilities in the near future. Please watch our pages on HuggingFace and GitHub for future updates and improvements, as well as additional open-source releases.