Roblox’s Path to 2 Trillion Analytics Events a Day
Building a Scalable Analytics Ingestion Infrastructure
Every day, an average of 97.8 million users* visit Roblox to communicate, create, and play together. Together, these interactions generate 2 petabytes of analytics events data. Thanks to a new scalable ingestion system, we recently reached a major milestone: Our system is now processing more than 2 trillion events per day. This system enables the personalization, safety, and economy algorithms that power the Roblox platform.
Previously, a cloud queue service ingested Roblox-generated analytics data into a single logical table, called events_hourly. It was partitioned by date, hour, and arbitrarily defined tags such as web, mobile, or friendService. Our data scientists and engineers relied on batch jobs scheduled to extract specific events into dedicated tables. Creating and sending new analytic events required no upfront schema. Engineers controlled their own table schema downstream at the extract, transform, and load (ETL) pipeline stage.
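For a sense of what this looked like in practice, an extraction job of that era might have resembled the sketch below, using PySpark purely as a stand-in for whatever batch engine ran these jobs; the table layout, partition columns, payload structure, and event tag are assumptions based on the description above, not Roblox's actual schema:

```python
# Illustrative sketch of a batch extraction job under the old setup: scan the
# shared events_hourly table for one tag and hour, re-parse the untyped JSON
# payload, and load the result into a dedicated table. Table names, columns,
# and the payload layout are assumptions, not Roblox's actual schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("extract_friend_requests").getOrCreate()

extracted = (
    spark.table("events_hourly")
    .where(
        (F.col("ds") == "2024-01-01")
        & (F.col("hr") == "00")
        & (F.col("tag") == "friendService")
    )
    # Every consumer had to re-parse the free-form payload downstream.
    .withColumn("requester_id", F.get_json_object("payload", "$.requesterId"))
    .withColumn("target_id", F.get_json_object("payload", "$.targetId"))
    .select("ds", "hr", "requester_id", "target_id")
)
extracted.write.mode("overwrite").saveAsTable("friend_requests_hourly")
```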
This setup was flexible and enabled engineers to move quickly, but it presented challenges.
- As the volume of events grew, interacting with 2 trillion rows partitioned only by date, hour, and tags became increasingly inefficient.
- A six-hour end-of-day delay for the events_hourly table and a 24-hour delay for events_daily created periods in which data pipelines were blocked.
- Managing dataset-level permissions, tiering, retention, and alerting became more complex.
- Event documentation, history, and ownership were missing, resulting in poor data usability and traceability.
- The ingestion infrastructure, built on the cloud queue service, incurred the cost of pushing roughly 23 Gbps of event data into the cloud.
We recognized an opportunity to support Roblox's continued growth and modernize the analytics ingestion pipeline. The event ingestion pipeline is a large system spanning multiple teams: it covers the Roblox app and other microservices that produce analytics events, as well as the backend services that collect those events and transform them into data lake tables. Given our large surface area and available resources, we focused on the biggest pain points: eliminating an inefficient batch process and controlling the computational cost of serving analytics events.
Eliminating Expensive Event Extraction
Data analytics previously relied on extracting data from a single logical table via many batch pipelines. This was necessary to run large, performant queries—but it also slowed down processing. Using the ingestion backend service to route these events to dedicated tables eliminates the batch extraction pipelines by giving analytic events a schema and defining a destination table beforehand.
We chose Protobuf (proto) as the schema language for analytics events at Roblox. This was a natural choice, since proto and gRPC are our preferred frameworks for building services. Additionally, proto offers great support for defining custom options, which we leverage to collect additional metadata such as ownership, retention, productivity software channels, and event schema.
After choosing our schema language, we examined what happens when a schema is updated and which updates should be allowed. To support the largest number of downstream consumers of a published schema, the data team adopted the backward transitive compatibility mode described in the Schema Registry documentation. With this approach, adding a field and soft-deleting a field are both allowed, enabling schema changes without requiring coordination with downstream consumers.
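To make this concrete, here is a minimal sketch of what a schematized analytics event might look like; the custom options, package, message, and field names are illustrative assumptions rather than Roblox's actual definitions:

```proto
syntax = "proto3";

package analytics.events;

// Hypothetical import that defines the custom options used below.
import "roblox/analytics/options.proto";

message PlayerSessionStarted {
  // Illustrative custom options carrying event-level metadata.
  option (roblox.analytics.owner) = "engagement-data-team";
  option (roblox.analytics.retention_days) = 180;

  // Field 3 was soft-deleted: its number and name are reserved so they
  // can never be reused with a different meaning.
  reserved 3;
  reserved "legacy_device_id";

  string user_id = 1;
  string platform = 2;
  // Newly added field; backward transitive compatibility allows this
  // without coordinating with downstream consumers.
  string session_source = 4;
}
```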
In the example above, we can add a field or soft-delete one (by reserving its field number and name) simply by updating the proto file.
Schemas offer many benefits, but requiring them up front adds friction. Data scientists and engineers need to move quickly and iterate without obstacles. To support this, we introduced a centralized schema repository and built a suite of tools to make authoring schemas as automated and streamlined as possible.
For example, we built a custom proto linter to validate that each schema has the required metadata and conforms to Roblox conventions. We also built a proto plug-in to translate an event schema to Hive data definition language so that the corresponding Hive table stays in sync whenever a schema is created or updated. All these tools are integrated into a CI/CD pipeline and run automatically when a pull request is created. This allows engineers to catch schema issues early and verify events in test Hive tables before their schemas are merged. As a result, deploying a schema to production is as simple as merging the pull request.
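As a rough illustration of that translation step, the sketch below maps a flat list of proto fields to a Hive CREATE TABLE statement. The type mapping, table naming, and partition columns are assumptions, and a real plug-in would walk the compiled proto descriptors rather than a hand-written field list:

```python
# Simplified sketch of a proto-to-Hive DDL translation for a flat event
# schema. The type mapping, partition columns, and field list are
# illustrative assumptions, not Roblox's actual conventions.

PROTO_TO_HIVE = {
    "string": "STRING",
    "int32": "INT",
    "int64": "BIGINT",
    "bool": "BOOLEAN",
    "float": "FLOAT",
    "double": "DOUBLE",
}

def to_hive_ddl(event_name: str, fields: list[tuple[str, str]]) -> str:
    """Render a CREATE TABLE statement for a schematized event."""
    columns = ",\n  ".join(
        f"{name} {PROTO_TO_HIVE[proto_type]}" for name, proto_type in fields
    )
    return (
        f"CREATE TABLE IF NOT EXISTS analytics.{event_name} (\n"
        f"  {columns}\n"
        f") PARTITIONED BY (ds STRING, hr STRING)\n"
        f"STORED AS PARQUET"
    )

print(to_hive_ddl(
    "player_session_started",
    [("user_id", "string"), ("platform", "string"), ("session_source", "string")],
))
```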
With a streamlined developer experience in place, we examined where in the ingestion pipeline an event should be schematized and converted to proto. Asking event producers to adopt and send serialized proto bytes would be a significant change spanning multiple teams. To address pain points and deliver value incrementally, we decoupled the schematization effort from the event producers by updating the ingestion backend service to convert incoming events to proto. Now, converted events are collected into Parquet files, uploaded to distributed storage, and registered as individual Hive tables.
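A minimal sketch of that conversion step, assuming the PlayerSessionStarted schema from the earlier sketch has been compiled with protoc (the generated module name is hypothetical):

```python
# Sketch of the ingestion backend's schematization step: a raw JSON-style
# event is parsed into its generated proto class before being buffered for
# Parquet writing. The generated module below is hypothetical; it would come
# from compiling the PlayerSessionStarted schema with protoc.
from google.protobuf.json_format import ParseDict

from player_session_started_pb2 import PlayerSessionStarted  # hypothetical

def schematize(raw_event: dict) -> bytes:
    msg = PlayerSessionStarted()
    # Reject unknown fields so schema drift is caught at ingestion time
    # rather than discovered later in the data lake.
    ParseDict(raw_event, msg, ignore_unknown_fields=False)
    return msg.SerializeToString()

payload = schematize({
    "userId": "123", "platform": "mobile", "sessionSource": "home-feed",
})
```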
Real-Time Event Ingestion With Roblox’s Data Centers
Next, we focused on the costs of serving analytics events. Previously, the ingestion backend was built on cloud infrastructure: analytics events were sent to a queue service, which buffered them and then stored them in durable cloud storage for downstream processing and analysis. While the cloud queue service simplified our service and allowed for transparent scaling, it was hard for other streaming jobs to consume from and more expensive to operate. To address this, we explored bringing the ingestion service into Roblox's data centers.
Our internal storage team had built queue-as-a-service (QaaS), based on an open-source distributed event streaming platform. QaaS is a great replacement for the cloud queue in analytics event ingestion because events are tailed in first-in, first-out order and deleted after a short retention period. At Roblox, we create a dedicated topic for each schematized event and use partition count to scale for large event streams. The data team also built a dedicated service to consume from QaaS, build Parquet files, and upload the files to durable cloud storage.
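For illustration, a stripped-down version of such a consumer might look like the sketch below, assuming QaaS exposes a Kafka-compatible API; the client library, endpoint, topic name, and batch size are all assumptions:

```python
# Minimal sketch of a QaaS-to-Parquet consumer, assuming a Kafka-compatible
# API. Endpoint, topic name, batch size, and output path are assumptions.
import pyarrow as pa
import pyarrow.parquet as pq
from confluent_kafka import Consumer

from player_session_started_pb2 import PlayerSessionStarted  # hypothetical

BATCH_SIZE = 100_000  # flush a Parquet file once this many events are buffered

consumer = Consumer({
    "bootstrap.servers": "qaas.internal:9092",  # hypothetical endpoint
    "group.id": "player-session-started-ingest",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["analytics.player_session_started"])

buffer = []
file_index = 0
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = PlayerSessionStarted()
    event.ParseFromString(msg.value())
    buffer.append({
        "user_id": event.user_id,
        "platform": event.platform,
        "session_source": event.session_source,
    })
    if len(buffer) >= BATCH_SIZE:
        table = pa.Table.from_pylist(buffer)
        # In production, this file would be uploaded to durable cloud storage
        # and its partition registered in the Hive metastore.
        pq.write_table(table, f"player_session_started_{file_index:06d}.parquet")
        consumer.commit()
        buffer.clear()
        file_index += 1
```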
With QaaS in place and a dedicated service for building and storing Parquet files, the data team performed shadow writes for six months to validate both data correctness and scalability. Finally, after extensive data completeness and integrity checks, we successfully migrated analytics event ingestion off our old cloud queue service. This was a major milestone: we removed the cloud resource cost from the ingestion path and significantly reduced the latency between an event firing and landing in our data lake. We previously had a service-level agreement of three hours, which we often missed; today, we're consistently hitting an average of 15 minutes.
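To give a flavor of those checks, a basic completeness check can be as simple as comparing hourly row counts between the legacy and new pipelines; the sketch below is illustrative and far simpler than the real validation:

```python
# Illustrative completeness check for shadow writes: compare hourly row
# counts between the legacy and new pipelines and flag hours that diverge
# beyond a tolerance. Inputs and threshold are assumptions for illustration.

def divergent_hours(legacy_counts: dict[str, int],
                    new_counts: dict[str, int],
                    tolerance: float = 0.001) -> list[str]:
    """Return hours whose relative row-count difference exceeds the tolerance."""
    flagged = []
    for hour, legacy in sorted(legacy_counts.items()):
        new = new_counts.get(hour, 0)
        if legacy and abs(new - legacy) / legacy > tolerance:
            flagged.append(hour)
    return flagged

print(divergent_hours(
    {"2024-01-01T00": 1_000_000, "2024-01-01T01": 1_200_000},
    {"2024-01-01T00": 1_000_000, "2024-01-01T01": 1_180_000},
))
```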
Progress and Future Work
With a modernized ingestion infrastructure, we’re able to process more events at better unit economics. This enables us to ingest and manage more than 2 trillion analytics events per day, which was unimaginable three years ago. Our QaaS-based ingestion infrastructure serves as a foundation for further improvements, such as streaming-as-a-service.
Streaming-as-a-service lets engineers write real-time event processing pipelines against schematized events consumed from QaaS, powering safety and real-time recommendation features. We also launched change data capture with the same schematization framework and QaaS ingestion, largely eliminating full database dumps. From real-time analytics and event streaming to unlocking new use cases, our work continues as we innovate and build smarter, faster, more cost-efficient data systems at scale.
We would like to thank Paul Mou for his valuable contributions to this work.
* As of the three months ended March 31, 2025.