Sr. Site Reliability Engineer for Storage Platform
San Mateo, CA
Every day, tens of millions of people from around the world come to Roblox to play, learn, work, and socialize in immersive digital experiences created by the community.
Our vision is to build a platform that enables shared experiences among billions of users. This is what’s known as the metaverse: a persistent space where anyone can do just about anything they can imagine, from anywhere in the world and on any device. The breadth of opportunities, and the evolving demands of this first-of-its-kind platform, ensure that your avenues for growth are always expanding and flexible.
Join us and you’ll usher in a new category of human interaction while solving exceptional challenges that you won’t find anywhere else.
As a Sr. Site Reliability Engineer for Storage Platform, you’ll support Roblox’s storage platform by designing, maintaining and operating our large-scale KV store, caching, Kafka and object storage infrastructure while contributing to our internal Infrastructure-as-a-Service offerings.
- Experience designing and operating large-scale distributed systems handling billions of real-time requests per second. Deep knowledge of one or more of the following technologies is a plus: caching (Redis), Kafka, distributed databases (CockroachDB), OLAP, object storage systems
- Experience with system configuration management and familiarity with automation tools like Chef and Terraform
- Experience building deployment pipelines on top of container orchestrators like Kubernetes or Nomad and service discovery systems like Consul
- Experience with programming languages like Python or Go
- Experience with telemetry stacks like Grafana, Prometheus, Alertmanager and Kibana
- Experience with Linux systems and shells
- BS degree (or equivalent professional experience) in Computer Science, with at least 5 years of hands-on experience
- Take a leading role in designing and implementing our internal Infra-as-a-Service offerings on top of a container orchestrator platform
- Provide primary operational support and engineering for multiple large distributed software applications/services
- Build automation and frameworks to manage platform infrastructure and services and to handle software and hardware faults
- Measure and optimize system availability, reliability and performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Improve service SLAs and the end-to-end rollout time of our suite of software solutions
- Excellent medical, dental, and vision coverage
- A rewarding 401k program
- Flexible vacation policy
- Free catered lunches five times a week and several fully stocked kitchens with unlimited snacks
- Onsite fitness center and fitness program credit
- Annual Caltrain Go Pass
- A Roblox Admin badge for your avatar