Title: Cloud Reliability Engineering Manager (remote)
Work authorization: any (candidate must be authorized to work in the US)
Must-have skills: 10+ years of experience managing remote Cloud SRE teams across AWS, Azure, and GCP; 5+ years of experience with infrastructure design and deployment using cloud PaaS and IaaS offerings; 5+ years of cloud operations experience with automation; 5+ years of experience with cloud solutions (Azure, AWS, GCP), Azure AD, Azure WVD, ARM templates, Kubernetes, containers, Terraform, Azure DevOps, Python, and React; AWS Certified DevOps or Google Cloud Engineer certification; cruise experience; DevOps practices, CI/CD pipelines, and Agile methodologies; Cloud Native and Serverless technologies; Zero Trust solutions; Azure Administrator / Azure DevOps certification
Requirements:
• IT experience (10+ years);
• Experience managing Cloud SRE teams across AWS, Azure, and GCP (5+ years);
• Experience with infrastructure design and deployment using cloud PaaS and IaaS offerings (5+ years);
• Configuration/management experience with Cloud networking technologies (5+ years);
• Experience in cloud operations with automation (5+ years);
• Experience with cloud solutions (Azure, AWS, GCP), Azure AD, Azure WVD, ARM templates, Kubernetes, Containers, Terraform, Azure DevOps, Python, React (5+ years);
• Experience managing people (5+ years);
• Experience with cloud architecture and operations, and cloud security fundamentals;
• Experience with DevOps practices, CI/CD pipelines, and Agile methodologies;
• Experience with Cloud Native and Serverless technologies;
• Solid network foundation with experience deploying Zero Trust solutions;
• Ability to learn new technologies and development languages at a fast pace;
• Certifications: AWS Certified DevOps, Azure Administrator / Azure DevOps, Google Cloud Engineer;
• Bachelor’s degree in computer science / systems engineering.
Responsibilities include but are not limited to the following:
• Develop, implement, and optimize the SRE strategy across various cloud platforms (AWS, Azure, GCP);
• Lead and manage the SRE team, fostering an environment of continuous learning and growth;
• Solve hard technology problems (availability, scalability, and performance in the cloud) and prevent outages of critical business applications;
• Drive the incident management process and oversee post-incident reviews;
• Monitor system performance and implement reliability improvements;
• Define and establish SLOs, SLIs, and error budgets;
• Coordinate with cross-functional teams to ensure system performance aligns with operational goals;
• Communicate system performance and incident status to stakeholders;
• Manage the implementation of Cloud Management Tools and Cloud Security Tools;
• Oversee governance, operationalization, and cost management of cloud platforms;
• Less than 25% shore-based travel;
• May be asked to work a different shift.