How The Infrastructure Group Drives The Future of Everything We Do at Roblox

  • Our Infrastructure Group designs, builds, and operates the underlying storage, compute, networking, security, and engineering productivity systems powering the global Roblox platform.

  • Those systems operate at scale—supporting 77.7 million* daily active users, exabytes of content delivered, and more than 250 million concurrent connections, all across more than 135,000 servers.

  • Everything the group does is meant to maximize the reliability and efficiency of our systems and to help our engineers be as productive as possible.

Every second of every day, product engineers at Roblox can use more than 2,000 services running on our global internal cloud infrastructure. Our platform supports millions of reads and writes, handles terabytes of throughput, and processes tens of millions of HTTP requests. When our 77.7 million* daily active users come to Roblox, they do so across more than 250 million concurrent connections.

All of that is the scale of technological systems at Roblox, and the domain of our Infrastructure Group. Known as Infra, they design, build, and operate our company’s storage, compute, networking, security, and engineering productivity systems, as well as our data centers. Infra’s goal is to offer scalable, dependable, and easy-to-use systems. Above all else, the group prizes three key metrics:

  • Availability—the reliability of our systems

  • Cost-to-serve—the efficiency of our systems

  • Productivity—how productive they make the Roblox engineers building on top of the infrastructure

As the Infra group leader, vice president of engineering Max Ross, puts it, “everything we do aims to advance one or some combination of those three things—availability, cost-to-serve, and productivity.”

Solving Novel Problems Every Day

When more than a million users join a hit Roblox experience after a large update, a phenomenon known as a “thundering herd,” our creators can rest easy. That’s because Infra’s job is ensuring Roblox product engineers can build a platform offering our users the best and most stable experience. And doing that means that the Infra team gets to tackle complex systems and solve novel problems every day.

Why? Because we’re not connecting tens of millions of DAUs to a centralized transaction-processing location, a widely understood problem. Instead, we’re connecting them to each other in real time, globally. All told, Infra’s thousands of services run on more than 135,000 servers in two core data centers, numerous edge data centers around the world, and some public cloud providers.

Availability—The Reliability of Our Systems

A major factor in the success of our business is user time spent on Roblox, and we know there’s a direct line between reliable infrastructure and a user staying longer.

We want 99.99 percent user uptime every month, which means our systems could disrupt no more than 0.01 percent of engagement hours. And our product engineers expect our internal cloud infrastructure to function at least as well as any public cloud. “Our infrastructure should run as smoothly as possible,” Technical Director Danny Yuan says, “so that other engineers can build products that will delight our users.”
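To make the 99.99 percent target concrete, here is a minimal sketch of the arithmetic in Python. The function name, the basis-point parameter, and the figure of 1,000,000,000 monthly engagement hours are hypothetical placeholders chosen for illustration, not Roblox data.

```python
def allowed_disrupted_hours(total_engagement_hours: float,
                            uptime_basis_points: int = 9999) -> float:
    """Engagement hours that may be disrupted while still meeting the target.

    The target is given in basis points (9999 = 99.99 percent) so the
    budget fraction is computed from integers rather than from 1 - 0.9999.
    """
    if not 0 <= uptime_basis_points <= 10_000:
        raise ValueError("uptime_basis_points must be between 0 and 10000")
    budget_basis_points = 10_000 - uptime_basis_points
    return total_engagement_hours * budget_basis_points / 10_000


# Hypothetical month with one billion engagement hours.
print(allowed_disrupted_hours(1_000_000_000))  # 100000.0
```

Under that assumed volume, a 99.99 percent target leaves a budget of 100,000 disrupted engagement hours for the month.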

One way we’re pursuing this is by bringing observability and network connectivity closer to the applications powering Roblox experiences. We’re deploying Envoy proxy sidecars next to every service instance and experimenting with eBPF to observe the underlying state of connections between proxies and external services. This helps us understand and, crucially, reduce packet drops, explains Technical Director Rob Cameron.

The Halloween Outage

A lot of Infra’s reliability efforts stem from what we learned during our cascading, 73-hour-long outage in 2021. That outage became a defining moment in our approach to building resilient infrastructure and in how we plan for the short and long term. “It was a ‘stop the presses’ moment,” Ross says, “the only thing we should be thinking about until we could assure everyone at Roblox it would never happen again.”

From a single Infra monolith to 34 cells

Cost to Serve (Efficiency)

These days, technology companies rarely build their own cloud infrastructure since public cloud providers offer essential tools like networking, fleet management, and so on.

But at our scale and with our decentralized nature, it’s more cost-effective for us to maintain a private cloud. We’re always identifying and overcoming the challenges that come from maintaining complex systems like this ourselves.

In order to ensure we get our desired cost savings, we must be thoughtful about system design. Our global private cloud demands close attention to efficiency so we can invest more in supporting our community of creators and users.

We're striving to make it easy for product engineers to build features that can run efficiently at scale. At the same time, we're inventing streamlined production tooling that allows a small team to operate large-scale infrastructure. “People outside of Infra may not always be aware,” says Technical Director Michael Wolf, “that we’re dramatically reinventing almost every part of our infrastructure.”

That means evolving from a bare-metal configuration to a Linux-based, containerized architecture with a common control plane across both core and edge data centers. As a result, Roblox engineers will be able to use an enormous new repository of open-source software tools. And it will be easier to run multiple workloads concurrently on the same machines.

“We’re not afraid to tackle big challenges,” says Technical Director Andy Wilcox, alluding to Infra’s recent transition to new telemetry, compute, and deployment stacks. “These are foundational things we’ve been able to tackle as an engineering organization with an appetite for taking them on.”

It won’t happen overnight. It will take years since we can’t just reboot Roblox—our machines must stay up and running. That calls for a manual process of rewriting software and adapting to new tools. “It’s like changing the tires on the car,” Wolf says, “while you’re driving down the highway.”

Productivity

Every day, our engineering team efficiently tackles big problems at scale and gets as much out of our systems as we can.

To do that, we regularly collect quantitative and qualitative data on our engineers’ productivity. That data helps us identify bottlenecks that can be addressed with third-party solutions or our own bespoke tools.

An example is a dashboard we released in March to address engineers’ pain points with our code review process. The tool helps engineers keep track of PRs that require their review based on numerous criteria we define. It also unifies code review tasks and allows scheduling notifications. Since the wide adoption of this dashboard, our P75 PR-to-merge time is down by 30 percent.
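For readers unfamiliar with the metric, P75 is the 75th percentile: three quarters of PRs merge at least this quickly. The sketch below shows one way such a figure could be computed in Python. The sample durations are made up for illustration, the helper names are hypothetical, and the "inclusive" percentile method is an assumption, since the exact definition Roblox uses is not stated here.

```python
from statistics import quantiles


def p75(hours_to_merge):
    """75th percentile of PR-to-merge durations, in hours."""
    if len(hours_to_merge) < 2:
        raise ValueError("need at least two data points")
    # n=4 yields the three quartile cut points; index 2 is the 75th percentile.
    return quantiles(hours_to_merge, n=4, method="inclusive")[2]


def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) * 100 / before


# Made-up samples, not real Roblox measurements.
before_dashboard = [10, 20, 30, 40, 50]
after_dashboard = [7, 14, 21, 28, 35]

drop = percent_reduction(p75(before_dashboard), p75(after_dashboard))
print(f"P75 before: {p75(before_dashboard)} h, after: {p75(after_dashboard)} h")
print(f"Reduction: {drop:.0f} percent")
```

Tracking a high percentile rather than the mean keeps the focus on the slower reviews that engineers actually feel, without letting a handful of extreme outliers dominate the number.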

And of course, the ultimate engineering productivity feature is stable, scalable infrastructure to build on, so we’re always making long-term investments in our low-level systems.

This reflects two core Roblox values: Taking the long view and Getting things done. As a customer-centric infrastructure group, we’re pragmatic about making our customers more successful and productive. If they need something we don’t offer as a managed service, we can think about integrating vendor solutions alongside our internal tools in our private cloud.

But while short-term “keeping the lights on” solutions can often be tempting, they must be balanced with forward-looking engineering.

The reward is that maximizing productivity benefits the company while enabling us to get projects done to meet business goals.

A Culture That Encourages Exploration

As we aim to connect a billion people with civility and safety, there will always be major technical challenges to tackle. We’ve solved many and learned much already. But we have our eyes on an even more scalable infrastructure, even as we work to decrease our systems’ complexity.

Those dueling goals will present countless new lessons for years to come, especially as we take on increasing AI workloads. And we know for sure that achieving our goals means Infra’s systems need to evolve significantly over time.

For Infra engineers, every project is a potential point of transformation for the company, and everyone’s work is important. “Infra is an organization where people can do great work that really matters to Roblox and our users,” Wolf says, “and where nothing is really off-limits.”

Customer-focused mindset

Ultimately, our work is to help other Roblox engineers be more effective, today and in the future, with a mandate to quickly learn lessons and deliver solutions derived from them.

We’re facing that challenge head-on. “I want to make sure we’re delivering value to Roblox today, this quarter and this year,” Ross says. “I also want to make sure we’re building a foundation that will have us in a good spot for the next 5 to 10 years.”

* For the three months ended March 31, 2024