Running AI Inference at Scale in the Hybrid Cloud
- The areas where Roblox uses AI, and generative AI in particular, have grown rapidly over the past few years.
- We are in the final stretch of a three-phase process to build and optimize the infrastructure necessary to support this level of AI tooling.
- We are sharing the steps we’ve taken to build a hybrid cloud infrastructure capable of supporting ML inference at a massive scale.
At last week’s RDC, we announced our latest AI incubation project: to develop a multimodal 3D foundational model to power generative creation. Powering AI for an always-on, immersive 3D global platform used by millions of people requires a massive amount of computational power. In early 2023, we supported fewer than 50 machine learning (ML) inference pipelines. Today, our infrastructure supports approximately 250 of these pipelines. We maintain tens of thousands of CPUs and more than a thousand GPUs across two data centers and our hybrid cloud infrastructure to run all of these models. And we’re not done yet.
We’ve previously shared how we’re thinking about generative AI for our creators, how we use AI to keep people on Roblox safe, and how AI translations help people around the world communicate. But those are only a few examples: With approximately 250 models in production, virtually every interaction on Roblox has some form of AI powering it. When a person first comes to Roblox and is looking at which experience to join, AI is at work through our recommendation and search systems. And when that person chooses an experience and hits the play button, our matchmaking algorithm identifies the best server to join.
Millions of creators already have access to the power of our generative AI tools. With Assistant, they can use simple prompts to generate scripts and actions to help accelerate experience creation. With our Texture and Material Generator tools, they can quickly change and iterate on the look and style of objects. And we are now entering the era of 4D generative AI with the recent launch of Avatar Auto Setup, which simplifies the process of creating an avatar, saving creators hours of work. As of August 2024, approximately 8% of the UGC avatar bodies published on Roblox were produced using Avatar Auto Setup.
We are now entering the final stretch of a three-phase process that has been in motion for several years. This journey began in late 2021. At that time, the lack of a unified Roblox AI platform led engineering teams to construct their own mini platforms and select disparate frameworks. We saw the teams behind critical components, including our avatar Marketplace, homepage, and search, each building their own custom feature engineering pipelines. Rather than leveraging a centralized feature store, teams were piecing together ad hoc solutions. Furthermore, each team was burdened with developing its own optimizations and tackling inference scaling challenges independently, without the support of a core platform. This fragmented approach highlighted the urgent need for a cohesive, centralized platform to streamline our processes and enhance efficiency across the board.
Phase One: Building a Strong Foundation for ML
We adopted Kubeflow early on to take advantage of its packaging of core building blocks for ML, including notebooks, pipelines, offline experimentation, and model serving. A feature store was still necessary, so we adopted a third-party solution to start. To make ML more accessible for engineers at Roblox, we developed roblox-ml, a Python library that further reduced the complexities of getting a model deployed to production.
We used Jupyter notebooks to provide a development environment optimized for model iteration, with servers configured for necessary data access and GPU resources. Scaling a training job or running it regularly to retrain a model generally required us to write a pipeline. Our roblox-ml library enabled engineers to easily convert notebook code into Kubeflow pipelines by snapshotting the runtime environment and source code without needing to build Docker images, and by selecting compute resources with priorities, setting up notifications, and handling authentication.
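To give a sense of the developer experience this enables, here is a minimal, hypothetical sketch of converting notebook code into a scheduled training pipeline with a roblox-ml-style library. The module, decorator, and parameter names below are illustrative assumptions, not the actual roblox-ml API.

```python
# Hypothetical sketch only: the roblox-ml interface shown here is assumed for
# illustration and is not the library's real API.
from robloxml import pipeline, step  # assumed import path

@step(cpus=4, gpus=1, priority="batch")
def train(dataset_path: str) -> str:
    """Train a model on a dataset snapshot and return a model URI."""
    model_uri = f"models://personalization/{dataset_path.rstrip('/').split('/')[-1]}"
    # ... training code lifted directly from the notebook ...
    return model_uri

@pipeline(schedule="0 3 * * *", notify="ml-team@example.com")
def daily_retrain():
    # The library snapshots the runtime environment and source code,
    # so the engineer never builds a Docker image by hand.
    return train("s3://example-bucket/datasets/latest")

if __name__ == "__main__":
    daily_retrain.submit()  # compiles the steps into a Kubeflow pipeline and submits it
```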
Models are only effective if they have the right features at the right time. Our feature store simplified the process for defining new features, while promoting the sharing of more than 900 features across over 100 feature services. This allowed teams to create and deploy new models more quickly as our collection of features grew.
Once our ML pipelines platform was functional and stable, we saw an increased demand for online inference support—with personalization, search, and Marketplace leading the charge. While we recommend batch inference as a waypoint to mature ML operations, we developed our model registry and serving platform to support real-time inference. With our model registry, Roblox engineers can use roblox-ml to upload and download models, which are tagged and automatically versioned to facilitate traceability, rollbacks, and A/B testing. As an example, our personalization models are trained and deployed daily, and we are often running approximately 20 A/B tests in parallel. For our serving platform, we used KServe with Triton Inference Server as the underlying model serving runtime because of its strong performance, as well as its support for multiple ML frameworks using both GPUs and CPUs.
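As a rough illustration of how a registry and a serving runtime fit together, the sketch below registers a versioned model and then queries a served model over HTTP. The registry call is hypothetical, and the request body loosely follows the KServe v2 inference protocol that Triton speaks; the tensor name and endpoint host are placeholders.

```python
# Sketch only: the registry call is hypothetical, and the request loosely follows
# the KServe v2 inference protocol; tensor names and hosts are placeholders.
import requests

# Hypothetical registry upload; models are tagged and versioned automatically:
# version = registry.upload("personalization-ranker", "model.onnx", tags={"exp": "ab-42"})

def infer(host: str, model_name: str, features: list[float]) -> dict:
    """Send a single-row inference request to a KServe/Triton-style endpoint."""
    payload = {
        "inputs": [{
            "name": "input__0",              # tensor name is model-specific
            "shape": [1, len(features)],
            "datatype": "FP32",
            "data": features,
        }]
    }
    resp = requests.post(f"{host}/v2/models/{model_name}/infer", json=payload, timeout=1.0)
    resp.raise_for_status()
    return resp.json()
```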
Whether operating in batch or online, models at Roblox go through extensive testing before release. This includes offline experiments, shadow testing, and A/B testing. After being released, models are continually monitored to ensure that they are performing as expected both operationally (e.g., inference latency) and in terms of accuracy. As part of our commitment to safety and civility, human moderators also evaluate any reported disagreements in inferences, which helps ensure that we get critical decisions correct and helps improve the training dataset for our models.
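Shadow testing, mentioned above, can be pictured with the generic sketch below: a candidate model receives a copy of live traffic and its outputs are logged for offline comparison, while only the production model's response is returned. This is an illustration of the pattern, not our actual implementation.

```python
# Generic shadow-testing pattern; the model objects and logging are illustrative.
import logging

logger = logging.getLogger("shadow_test")

def serve(request, prod_model, shadow_model):
    """Return the production prediction while scoring the candidate on the side."""
    prod_pred = prod_model.predict(request)
    try:
        shadow_pred = shadow_model.predict(request)
        # Both predictions are logged so offline jobs can compare quality and latency.
        logger.info("prod=%s shadow=%s", prod_pred, shadow_pred)
    except Exception:
        # A failing candidate must never affect the user-facing response.
        logger.exception("shadow model failed")
    return prod_pred
```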
Phase Two: Preparing to Scale Inference
In early 2023, we saw enormous potential for generative AI to accelerate creation on Roblox. To take full advantage of that potential, we spent much of 2023 optimizing the performance and efficiency of our ML training and inference infrastructure; these optimizations have, for example, significantly reduced the compute cost of creating CLIP embeddings. First, we expanded our distributed training systems to enable training on large datasets and running models with billions of parameters across multiple worker nodes.
As we began to build out a distributed workflow, we realized that our existing setup for offline inference would not support the rate of growth we were seeing over the long term. Our initial setup was designed for real-time inference, in which the input and output data are sequential. While it worked well for our early efforts, it did not easily support task parallelism or multistage processing, nor was it resource-efficient enough to support the scale we now required. In addition, engineers were required to write their own data chunking and error-handling logic, which became increasingly time-consuming as our inference needs scaled.
To address these challenges, we added support for Ray, an open-source compute framework that makes it easy to scale batch inference workloads. By building out a Ray-based distributed task pipeline for batch inference, we were able to optimize resource utilization, enable multistage processing, and provide robust task parallelism and greater fault tolerance. In addition, the Ray Data library allows engineers to define a pipeline with streaming execution in just a few lines, which helps improve developer velocity and efficiency. We’ve seen tremendous efficiency gains so far using Ray for batch inference.
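For a sense of how concise such a pipeline can be, here is a minimal Ray Data batch-inference sketch using a stateful callable class with map_batches. The dataset paths, stand-in model, and resource settings are placeholders, and exact argument names can vary across Ray versions.

```python
# Minimal Ray Data batch-inference sketch; paths, the model, and resource
# settings are placeholders, and argument names may differ by Ray version.
import numpy as np
import ray

class Embedder:
    """Stateful worker: the model is loaded once per actor, not once per batch."""
    def __init__(self):
        self.model = lambda texts: np.zeros((len(texts), 512), dtype=np.float32)  # stand-in

    def __call__(self, batch: dict) -> dict:
        batch["embedding"] = list(self.model(batch["text"]))
        return batch

ray.init()
ds = ray.data.read_parquet("s3://example-bucket/assets/")                # streaming read
ds = ds.map_batches(Embedder, batch_size=256, concurrency=8, num_gpus=1)
ds.write_parquet("s3://example-bucket/embeddings/")                      # streaming write
```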
As our inference needs continued to grow, we moved all of our CPU inference to our own data centers, which gave us more direct control over latency and privacy settings. We process approximately 1 billion personalization requests daily for our 79.5 million daily active users (as of June 30, 2024). Many systems cache results to save inference costs, but because many users visit the Roblox homepage multiple times a day, caching would have led to outdated recommendations. Moving this workload to our own data centers has helped us maintain efficiency without compromising the user experience, and it has also enabled us to better optimize where inference is run and to distribute workloads to reduce the compute resources required.
As we continued to scale, we realized the need for a custom feature store solution that could support high throughput, low latency, and cost-efficiency, while also enabling rapid iterations for various services. Existing third-party solutions did not meet these requirements, so we developed our own custom feature store, built on top of the open-source project Feast. Our feature store provided a custom domain-specific language for defining transformations for both batch and streaming features. Flink was adopted as the stream processing engine for enabling real-time features, which were critical for models that needed to incorporate the freshest information possible. On the opposite end of the spectrum were features that needed to be derived from processing a vast number of 3D assets in batch by rerunning the Roblox game engine in a distributed environment. Our feature store now ingests approximately 30 billion records and serves approximately 70 billion records per day with a P99 latency of 50ms—and it supports more than 100 feature services.
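As a rough sketch of what defining features on top of Feast can look like, the snippet below declares an entity and a feature view; the entity, source path, and feature names are invented for illustration, and our custom transformation DSL and Flink-powered streaming features are not shown.

```python
# Illustrative Feast definitions; the entity, source, and feature names are
# invented, and details vary by Feast version.
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

user = Entity(name="user_id", join_keys=["user_id"])

engagement_source = FileSource(
    path="s3://example-bucket/features/user_engagement.parquet",
    timestamp_field="event_timestamp",
)

user_engagement = FeatureView(
    name="user_engagement",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="sessions_7d", dtype=Int64),
        Field(name="avg_session_minutes_7d", dtype=Float32),
    ],
    source=engagement_source,
)
```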
The use of embeddings by models has also grown rapidly, driven by the growing demand for semantic understanding, whether through NLP, computer vision, or recommendation systems. This motivated us to build out a vector database to efficiently store and retrieve vectors as high-dimensional points. The vector database has enabled fast nearest neighbor lookups to power capabilities such as multimodal search and content violation detection.
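To make the nearest-neighbor lookup concrete, here is a small sketch that uses FAISS purely as a stand-in index, since the engine behind our vector database is not specified here; the embeddings are random placeholders.

```python
# FAISS used only as a stand-in to illustrate nearest-neighbor search over
# embeddings; the vectors are random placeholders.
import faiss
import numpy as np

dim = 512
corpus = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(corpus)                 # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(dim)
index.add(corpus)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)      # top-10 most similar items
print(ids[0], scores[0])
```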
As more teams began utilizing ML models, we wanted to find efficiencies of scale and help engineers succeed more quickly, so we established our own ground truth team. This team helps engineers design their own dataset production pipeline, train and validate data using human evaluators, and deliver high-quality data. This has helped us standardize the process of building a data pipeline and validating datasets, as well as the format in which data is delivered, tracked, and monitored.
Phase Three: Operationalizing Massive Inference
With the launch of Roblox Assistant, we’ve seen the number of tokens processed increase to 1.5 billion per week. We also released new features, including real-time AI chat translation and our voice safety model (now open sourced), that significantly increased the demand for inference capacity. We embarked on two core projects to supercharge AI application development: our ML gateway, and a large language model operations (LLMOps) platform based around the vLLM project. Together, these two projects will be foundational for the next generation of ML at Roblox.
We built our unified ML gateway to centralize access to all large models, both open source and internally developed, across a variety of environments, including CPUs and GPUs in the cloud and on premises. Our goal was to create an efficient, streamlined system for managing AI resources across the company. On the back end, the gateway provides a common API, user-friendly configuration options, and efficient resource sharing across all the models we have deployed.
The gateway improves the resilience of our inference services by providing centralized throttling by token count for generative AI workloads and latency-aware load balancing between regions. It also enhances security by centralizing API key management, enables comprehensive usage tracking and the potential implementation of entitlements, and integrates with monitoring tools for improved observability. All of these features help us optimize usage of large models, reduce costs, and provide valuable insights for engineers across Roblox.
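One simplified way to picture token-count throttling in such a gateway is a per-API-key token bucket, as in the sketch below. This is a generic illustration with invented limits, not the gateway's actual implementation.

```python
# Generic per-API-key token-bucket throttle illustrating token-count-based rate
# limiting; the limits and key handling are invented for this example.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, tokens_per_minute: int):
        self.rate = tokens_per_minute / 60.0          # refill rate, tokens per second
        self.capacity = float(tokens_per_minute)
        self.level = float(tokens_per_minute)
        self.updated = time.monotonic()

    def allow(self, requested_tokens: int) -> bool:
        now = time.monotonic()
        self.level = min(self.capacity, self.level + (now - self.updated) * self.rate)
        self.updated = now
        if requested_tokens <= self.level:
            self.level -= requested_tokens
            return True
        return False

buckets = defaultdict(lambda: TokenBucket(tokens_per_minute=60_000))

def admit(api_key: str, prompt_tokens: int) -> bool:
    """Return True if this request fits within the caller's per-minute token budget."""
    return buckets[api_key].allow(prompt_tokens)
```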
In addition, we have adopted vLLM as our primary inference engine for LLMs, leveraging vLLM’s high-performance capabilities to power AI applications across Roblox. Since moving to vLLM, we’ve seen an almost 2x improvement in both latency and throughput, and we currently serve approximately 4 billion tokens per week.
Our choice of vLLM aligns with our commitment to leveraging open-source and cutting-edge technologies that can scale efficiently to meet the demands of our vast user base and diverse array of experiences. Roblox is an active contributor to the open-source vLLM project, spearheading the development of multimodal support for vLLM, which enables the engine to handle not just text but also images and potentially other types of data in the future. We have also implemented speculative decoding techniques to further improve inference performance, allowing for faster, more efficient processing of language tasks.
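For reference, the basic vLLM offline-inference API looks roughly like the sketch below. The model name is a placeholder, and the speculative-decoding settings are left as a commented assumption because the exact engine arguments depend on the vLLM version.

```python
# Minimal vLLM usage sketch; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    # Speculative decoding is configured through engine arguments; exact names
    # vary by vLLM version, so treat the following as an assumption:
    # speculative_model="example/draft-model", num_speculative_tokens=5,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Write a short greeting for a new Roblox creator."], params)
print(outputs[0].outputs[0].text)
```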
With our ML gateway and vLLM, we can efficiently support the hundreds of ML pipelines in use across Roblox, and we can continue to scale inference as demand for AI-powered features grows. And we are nowhere near done with this work. We have big plans for the future of AI at Roblox. We’re working on new AI-powered tools to make creation more efficient for both novice and expert creators. As always, we are working on ways to improve the performance and efficiency of our infrastructure to better support the AI tools that we, and our creators, use on a daily basis.
Our Commitment to Open Source
We have come this far on the shoulders of several successful open-source projects; much of our technical stack is built on the open-source technologies mentioned above.
We are committed to being a strong partner in the open-source AI community and contributing some of our own open-source technology. We recently announced our first open-source model, our voice safety classifier, and we are currently working on our ML gateway, with the hope of making that open source as well. We believe that the future of AI should include openness and transparency, and we’re excited to be an active member of this community.