CoreWeave is the AI Hyperscaler™, delivering a cloud platform of cutting-edge services powering the next wave of AI. Our technology provides enterprises and leading AI labs with the most performant, efficient, and resilient solutions for accelerated computing. Since 2017, CoreWeave has operated a growing footprint of data centers covering every region of the US and across Europe. CoreWeave was ranked as one of the TIME100 most influential companies of 2024.
As an industry leader, we thrive in an environment where adaptability and resilience are key. Our culture offers career-defining opportunities for those who excel amid change and challenge. If you enjoy a dynamic environment, like solving complex problems, and are eager to make a significant impact, CoreWeave is the place for you. Join us, and be part of a team solving some of the most exciting challenges in the industry.
CoreWeave powers the creation and delivery of the intelligence that drives innovation.
About the role:
The Fleet Monitoring & Analysis Team contributes to the automated provisioning and management of CoreWeave’s ever-expanding fleet of hardware nodes and node types by continually improving node and environmental monitoring and observability. Playing a central role in CoreWeave’s growth strategy, this team is a critical piece of our cohesive, zero-touch, and high-reliability fleet management engine.
We seek an Engineer to join the Fleet Monitoring & Analysis team to help us build, run, and refine our metrics, alerts, visualizations, and data-driven insights. This individual will join a team of mixed-skill engineers focused on elevating the art of managing high-performance hardware at scale. As a team member, you would have the opportunity to:
• Design and implement solutions for large-scale server observability to continually improve the stability of CoreWeave’s global hardware fleet.
• Adapt, extend, and implement open-source solutions to augment the depth and breadth of our visibility into our operating environment.
• Generate and maintain custom reports, alarms, and visualizations to help teams understand and respond to our growth and changes.
• Create test plans, deployment automation, dashboards, alerts, and insights into our fleet operations, and participate in the Fleet Engineering Developers’ on-call rotation.
• Grow, change, invest in your teammates, be invested in, share your ideas, listen to others, be curious, have fun, and, above all, be yourself.
Wondering if you’re a good fit? We believe in investing in our people, and we value candidates who can bring their own diverse experiences to our teams, even if you aren’t a 100% skill or experience match. Here are some qualities we’ve found compatible with our team. If a portion of this resonates with you, we’d love to talk.
• You have 2 or more years of experience in software or infrastructure engineering.
• You have experience with automation and orchestration workflows, and you are knowledgeable about server hardware, its components, and the technologies and strategies used to manage physical infrastructure at scale.
• You have experience implementing metrics collection and alerting on standard platforms.
• You believe in the value of automation and will champion practices that drive reliability and prioritize the CoreWeave customer experience.
• Applicants must have work authorization that does not require sponsorship from the company now or in the future.