American Express | Bengaluru | High pay | GCC | Great Place to Work
Manager, Site Reliability Engineering leads and mentors Site Reliability Engineering (SRE) teams, fostering a culture of continuous improvement and inclusivity, while collaborating across the organization to enhance system resilience, scalability, and alignment with business objectives.
Responsibilities
- Manages and leads a team of Site Reliability Engineering colleagues, enabling a culture of continuous learning, growth opportunities, and inclusivity for all individual colleagues and teams
- Provides leadership, guidance, and coaching to Site Reliability Engineering teams, supporting training and development of best practices in software development, resiliency, and non-functional system requirements
- Recruits and develops a high-performing team, recognizing and rewarding achievements, and creating an environment that motivates and energizes colleagues to achieve business objectives
- Oversees and facilitates collaboration with Software Engineering teams to design and implement features that improve system resilience, scalability, and performance, ensuring optimal functionality
- Collaborates with executives, product managers, and other stakeholders to ensure SRE principles are embedded throughout the organization
- Leads comprehensive chaos engineering experiments and resiliency tests, driving the analysis of outcomes and the implementation of improvements that enhance system robustness and recovery capabilities
- Plans regular drills and conducts strategic planning to ensure the organization is prepared for, and can swiftly recover from, complex and unexpected disruptions
- Collaborates and co-creates effectively with teams in product and the business to align technology initiatives with business objectives
Qualifications
Education Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, Engineering, and/or comparable experience; advanced degree preferred
- Knowledge of the modern observability stack (e.g., Splunk, Elasticsearch, Prometheus, Grafana)
- Knowledge of containerization technologies (e.g., Kubernetes, Docker) and microservices architecture
- Knowledge of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms
- Knowledge of cloud-based Site Reliability Engineering (SRE) practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud
Work Experience:
- Experience in software development, or technology operations, with a focus on Site Reliability Engineering
- Experience in Linux/Unix systems, object-oriented programming languages (e.g., Java), scripting languages (e.g., Python, Bash), and cloud platforms (e.g., AWS, Azure, GCP)
Licenses and Certifications:
- Advanced certification in Site Reliability Engineering (SRE) or a related field is a plus