The Infrastructure Supporting Record-Breaking Experiences
Reaching New Heights Every Weekend on Roblox
Roblox’s ability to scale and support tens of millions of users playing together across millions of unique experiences isn’t the result of a single innovation. It’s the sum of a broad culture of innovation and a thousand small things done well across the company. This is how we’ve built the infrastructure currently supporting record-breaking traffic across many Roblox experiences. One of those experiences, Grow a Garden, recently earned the Guinness World Records® title for the most concurrently played video game, with 21.6 million users playing simultaneously. And along the way, the Roblox platform has continued to set new peak concurrency records, as it has for nearly two decades, most recently exceeding 30 million concurrent players.
Roblox faces unique challenges in building and maintaining infrastructure for millions of creator-built experiences, including Dress to Impress, Adopt Me, and Dead Rails, and solving them requires inventive engineering. The platform absorbs dozens of experience updates every hour and supports more than 30 million concurrent users on infrastructure that scales through unexpected traffic spikes. It must withstand thundering herds in which more than 21 million users join a single experience simultaneously, triggered by update code written by independent creators. Roblox engineers solve these problems by challenging conventional wisdom, guided by our four core values.
Infrastructure at Roblox
Roblox engineers manage 24 edge data centers around the world, which run the game servers. When a user joins an experience, they’re matched to the nearest data center, and to the most appropriate instance within it, to minimize latency. We also manage two much larger core data centers, which run the centralized services the edge data centers depend on: the website, recommendation algorithms, safety filters, the virtual economy, and the publishing platform. A global private network interconnects all the edge data centers with the core data centers, with the edge data centers also serving as a firewall that protects the services running in the core.
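To make that matching concrete, here’s a minimal sketch in Go of the kind of selection logic described above. The struct, field names, and tie-breaking rule are illustrative assumptions, not Roblox’s actual implementation:

```go
// Hypothetical edge selection: pick the data center with the lowest
// measured round-trip time to the user that still has open capacity.
package main

import "fmt"

type EdgeDC struct {
	Name      string
	RTTMillis float64 // measured latency from the user's region (assumed input)
	FreeSlots int     // open player slots across instances
}

// pickEdge returns the lowest-latency data center that still has capacity.
func pickEdge(dcs []EdgeDC) (EdgeDC, error) {
	var best *EdgeDC
	for i := range dcs {
		dc := &dcs[i]
		if dc.FreeSlots == 0 {
			continue // full data centers are skipped, whatever their latency
		}
		if best == nil || dc.RTTMillis < best.RTTMillis {
			best = dc
		}
	}
	if best == nil {
		return EdgeDC{}, fmt.Errorf("no edge capacity available")
	}
	return *best, nil
}

func main() {
	dcs := []EdgeDC{
		{"edge-fra", 18.2, 120},
		{"edge-ams", 14.9, 0}, // lowest latency, but full
		{"edge-lhr", 16.4, 45},
	}
	dc, _ := pickEdge(dcs)
	fmt.Println("join via", dc.Name) // join via edge-lhr
}
```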
Take the Long View: Proactive Capacity Prediction
In an ideal world, our creators should never have to think about capacity; the infrastructure should be invisible to them, working behind the scenes. When a creator publishes an experience to Roblox, our job is to provide the capacity it needs, no matter how many players show up. In the early days, we planned capacity once a year for the year or two ahead. But in recent years, breakout experiences like Dress to Impress, Fisch, Dead Rails, and Grow a Garden have led us to rethink our capacity-planning framework.
In line with our value of taking the long view, we now predict capacity needs up to two years in advance, balancing user demand against efficient server utilization. The planning cycle covers data center acquisition, server hardware refreshes, and physical networking, with new data centers, like the one in Brazil, planned years ahead. The networking team also maintains “dark” capacity, spare links provisioned but left unused, to ensure continuous operation despite issues like network cable cuts.
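As a toy illustration of that long-view math (not Roblox’s actual planning model), the sketch below projects peak concurrency two years out and converts it into a server order. Only the 30.6 million starting point comes from this post; the growth rate, headroom, and per-server density are invented:

```go
// Illustrative two-year capacity projection with assumed parameters.
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		currentPeakCCU   = 30_600_000 // June 2025 peak, from this post
		annualGrowth     = 0.35       // assumed compound growth rate
		headroom         = 0.40       // assumed buffer for breakout hits
		playersPerServer = 700        // assumed density per game server
		yearsAhead       = 2.0        // hardware acquisition lead time
	)
	projected := float64(currentPeakCCU) * math.Pow(1+annualGrowth, yearsAhead)
	withHeadroom := projected * (1 + headroom)
	servers := int(math.Ceil(withHeadroom / playersPerServer))
	fmt.Printf("plan for ~%.0f CCU -> ~%d game servers\n", withHeadroom, servers)
}
```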
The capacity Roblox has today is based on predictions made two years ago, when we could not have foreseen experiences going from unknown to wildly popular within weeks. Hits like Dress to Impress and Grow a Garden, which helped more than double Roblox's peak concurrent player count from 13.9 million in April to 30.6 million in June 2025, didn't exist when those predictions were made. In March 2025, for example, Dead Rails spiked to 1 million concurrent users, consuming all available CPU capacity.
Learning from these popularity spikes, we’ve moved to a more agile planning cycle. To consistently support record player counts on Roblox, engineering follows a rigorous weekly rhythm of planning, testing, and capacity adjustments. Monday is dedicated to incident reviews, followed by capacity planning on Tuesday. Thursday focuses on reviewing capacity for any large updates our creators have told us to expect, and on Friday we provision additional cloud resources to ensure the platform is prepared for peak weekend usage. Chaos testing runs continually throughout the week. And through all of it, we keep releasing entirely new features; we never freeze continuous deployment for our engineers.
Respect the Community: Effortless Capacity for Creators
Throttling is a widely accepted concept in computer science. It is also one of its most misused and misunderstood levers. When new engineers join Roblox, their first proposals often start with, “If we could just tell our creators to tweak this config or slow down their events…” Veteran Roblox engineers then gently explain our value of respecting the community: we don’t tell our creators what to do.
For example, most gaming systems have a simple answer for matchmaking when millions of players click play simultaneously: throttle the joins, make players wait, or skip the matchmaking algorithm and send them to random servers. At Roblox, we do the opposite. We redesigned our entire matchmaking system for thundering herds of players. At peak, this system evaluates up to 4 billion possible join combinations per second. Years ago, we set an objective of 10 million joins in 10 seconds, and we continue to iterate toward that goal.
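A minimal sketch of the idea, with invented weights: rather than queueing joins, every joinable instance is scored on latency and fill, and the herd is placed immediately. The 80% fill target and the scoring function are assumptions for illustration, not the production algorithm:

```go
// Score join candidates during a thundering herd instead of throttling.
package main

import (
	"fmt"
	"math"
)

type Instance struct {
	ID        string
	RTTMillis float64
	Players   int
	Capacity  int
}

// score: lower is better. Favors low latency and instances near an assumed
// 80% fill target, keeping servers lively without overfilling them.
func score(inst Instance) float64 {
	fill := float64(inst.Players) / float64(inst.Capacity)
	return inst.RTTMillis + 100*math.Abs(fill-0.8)
}

// bestInstance scans every joinable instance and returns the best score.
func bestInstance(insts []Instance) (Instance, bool) {
	var best Instance
	found := false
	for _, inst := range insts {
		if inst.Players >= inst.Capacity {
			continue // full instances are never candidates
		}
		if !found || score(inst) < score(best) {
			best, found = inst, true
		}
	}
	return best, found
}

func main() {
	insts := []Instance{
		{"i-1", 22, 40, 50}, // 80% full, nearby
		{"i-2", 15, 49, 50}, // closest, but nearly full
		{"i-3", 30, 5, 50},  // nearly empty
	}
	if best, ok := bestInstance(insts); ok {
		fmt.Println("join", best.ID) // join i-1
	}
}
```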
To avoid throttling when capacity runs short, we’re experimenting with cloud bursting as part of our transition to a cellular infrastructure, which allows dynamic, compute-efficient scaling. This architecture handles peak demand by matching users to cells in both on-premise and cloud edge data centers. We’re working toward fully automated bring-up and tear-down of cloud-based edge data centers, abstracted entirely away from the matchmaking algorithm.
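Here’s a rough sketch of that cell abstraction, under assumed names: the matchmaker sees a uniform list of cells, while a separate, hypothetical controller decides when on-premise utilization justifies bursting to the cloud:

```go
// Cell abstraction sketch: matchmaking never learns whether a cell is
// on-premise or a burst cloud cell.
package main

import "fmt"

type Cell struct {
	Name     string
	IsCloud  bool
	Used     int
	Capacity int
}

// needCloudBurst reports whether on-prem utilization exceeds the threshold,
// signaling the (hypothetical) controller to provision cloud cells.
func needCloudBurst(cells []Cell, threshold float64) bool {
	used, capTotal := 0, 0
	for _, c := range cells {
		if c.IsCloud {
			continue // only on-prem load drives the burst decision here
		}
		used += c.Used
		capTotal += c.Capacity
	}
	return capTotal > 0 && float64(used)/float64(capTotal) > threshold
}

func main() {
	cells := []Cell{
		{"onprem-a", false, 9200, 10000},
		{"onprem-b", false, 9700, 10000},
	}
	if needCloudBurst(cells, 0.9) {
		cells = append(cells, Cell{"cloud-x", true, 0, 10000})
		fmt.Println("burst cell provisioned; matchmaker simply sees", len(cells), "cells")
	}
}
```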
Another example is our text-filter system, which handles 250,000 requests per second at peak. That’s large-model inference on 250,000 token streams per second, with constantly expanding context windows. And with more than 300 AI inference pipelines running in production, Roblox service owners invest considerable time finding the ideal mix of inference profiles across GPUs and CPUs. Even under peak load, Roblox engineers respect the community by prioritizing creator freedom and user safety.
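As one illustration of profile mixing (the rule and the numbers are assumptions, not our production policy), a router might prefer the GPU pool until it saturates and spill the remainder to CPU replicas:

```go
// Hypothetical GPU/CPU inference routing by in-flight request count.
package main

import "fmt"

type Pool struct {
	Name     string
	InFlight int
	Limit    int
}

// route prefers the GPU pool while it has headroom, falls back to CPU,
// and returns nil when both pools are at their limits.
func route(gpu, cpu *Pool) *Pool {
	if gpu.InFlight < gpu.Limit {
		return gpu
	}
	if cpu.InFlight < cpu.Limit {
		return cpu
	}
	return nil
}

func main() {
	gpu := &Pool{"gpu-filter", 512, 512} // saturated
	cpu := &Pool{"cpu-filter", 90, 400}
	if p := route(gpu, cpu); p != nil {
		p.InFlight++
		fmt.Println("routed to", p.Name) // routed to cpu-filter
	}
}
```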
Get Stuff Done: System Stressing for Resilience
With all this planning, we build up the capacity and algorithms to support the most exciting updates from creators. But we also need to be sure these systems hold up under even the largest peaks or a single-service outage. Data gathered from peak usage across more than 1,600 microservices helps us identify which services to stress test further.
True to our value of getting stuff done, every day we take a few of these services and constrain their capacity in production. We observe how they behave, then fix any weaknesses before the weekend. We call this “test actual capacity on” (TACO) Tuesdays. Our reliability team also runs continuous capacity correctness (C3): each engineering team uses a C3 dashboard to predict and manage its services’ CPU capacity, so service owners can continuously learn from the last peak and adjust capacity up or down for the next one. We’ve also launched a system that traces call patterns in the core Roblox engine for new releases, which helps ensure we’re better prepared when an update ships.
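To give a flavor of what a C3-style check might compute (with invented numbers and a deliberately simplified model), the sketch below compares provisioned CPU against the last observed peak plus an assumed growth factor and safety margin:

```go
// Simplified capacity-correctness check with made-up parameters.
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		lastPeakCores    = 4200.0 // cores consumed at the last weekend peak (invented)
		growthFactor     = 1.15   // assumed peak-over-peak growth
		safetyMargin     = 1.25   // assumed buffer above the prediction
		provisionedCores = 5000.0 // what the service currently has (invented)
	)
	needed := lastPeakCores * growthFactor * safetyMargin
	delta := needed - provisionedCores
	if delta > 0 {
		fmt.Printf("scale up by %d cores before the weekend\n", int(math.Ceil(delta)))
	} else {
		fmt.Printf("headroom of %d cores; consider scaling down\n", int(-delta))
	}
}
```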
Even with all this preparation, we still occasionally hit scenarios where unpredictable traffic patterns could let a single service or product flow bring the platform down. For example, our analytics pipeline, which handles 2 trillion events, could see 30% more traffic because of a popular update. This is where resiliency mechanisms such as adaptive concurrency control (ACC), circuit breakers, and retry shedding kick in to protect the platform. This year, we also built a chaos-testing platform that strengthens our infrastructure’s resiliency and scalability by injecting faults, exhausting resources, and randomly terminating processes in production.
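For readers unfamiliar with these mechanisms, here’s a bare-bones circuit breaker in Go. Our real ACC and retry-shedding logic is far more involved; this sketch only shows the core idea of shedding calls once a downstream service looks unhealthy:

```go
// Minimal circuit breaker: after enough consecutive failures, open the
// breaker and shed calls until a cooldown elapses.
package main

import (
	"errors"
	"fmt"
	"time"
)

type Breaker struct {
	failures  int
	maxFails  int
	openUntil time.Time
	cooldown  time.Duration
}

var ErrOpen = errors.New("circuit open: call shed")

func (b *Breaker) Call(fn func() error) error {
	if time.Now().Before(b.openUntil) {
		return ErrOpen // shed the call instead of piling onto a sick service
	}
	if err := fn(); err != nil {
		b.failures++
		if b.failures >= b.maxFails {
			b.openUntil = time.Now().Add(b.cooldown) // trip the breaker
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // any success resets the streak
	return nil
}

func main() {
	b := &Breaker{maxFails: 3, cooldown: 30 * time.Second}
	flaky := func() error { return errors.New("overloaded") }
	for i := 0; i < 5; i++ {
		fmt.Println(i, b.Call(flaky)) // calls 3 and 4 are shed
	}
}
```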
Take Responsibility: Bringing All Hands on Deck
We spend all week testing and preparing for these big weekend updates, but when the weekend arrives, there’s still work to do. Ahead of weekend updates, Roblox engineers collaborate to monitor upcoming changes and predict the remaining headroom, provisioning additional cloud resources as needed to accommodate millions of extra players via virtual edge data centers.
On Friday, we decide whether we need to add extra capacity from the cloud. This gives our hybrid cloud team clear direction to bring up enough extra capacity to accommodate millions of additional players. Our 24 physical edge data centers are always running, but after all the testing, we might conclude that we need more. There’s no way to rack and stack servers in 12 hours, so we work with our cloud partners to stand up multiple virtual edge data centers. We test them on Friday, and then we’re ready for the weekend.
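That Friday decision reduces to arithmetic along the lines of this sketch, in which every number is made up for illustration: if the predicted weekend peak exceeds what the physical edge data centers can absorb, the shortfall determines how many virtual edge data centers to stand up and test:

```go
// Illustrative weekend burst sizing with invented figures.
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		predictedPeakCCU = 33_000_000 // assumed forecast for the weekend
		physicalCapacity = 31_000_000 // assumed ceiling of the 24 edge DCs
		ccuPerVirtualDC  = 1_500_000  // assumed capacity of one cloud edge DC
	)
	shortfall := float64(predictedPeakCCU - physicalCapacity)
	if shortfall <= 0 {
		fmt.Println("physical capacity suffices; no burst needed")
		return
	}
	n := int(math.Ceil(shortfall / ccuPerVirtualDC))
	fmt.Printf("bring up and test %d virtual edge data centers\n", n)
}
```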
In the true spirit of taking responsibility, everyone, including our highest-level executives, takes on-call rotations, even on weekends. The surge of millions of users on Saturday can trigger hundreds of alerts. Teams resolve them preemptively, which frees us to handle the real challenges during a big update or a platform-wide all-time high.
As Leonardo da Vinci is often credited with saying, “Learning never exhausts the mind.” Each peak has inspired us to learn and invent new techniques to make our infrastructure more dependable and invisible. Our creators publish or update, and through the magic of invisible infrastructure, tens of millions of users start enjoying an entirely new experience almost immediately. We are eternally grateful to our creators and users for challenging us to push the boundaries of computer science.