Platform Engineer – OpenShift + AI/ML SRE | 4+ years

Listed 3 Jul 2026

Bengaluru | Top pay | GCC | Great Place to Work

Meet the Team

You will play a pivotal role on the team responsible for designing and developing the next generation of scalable Kubernetes infrastructure and machine learning platforms that support both traditional ML and state-of-the-art Large Language Models (LLMs). This is a position for expert engineers: you will lead the technical direction and ensure the performance, reliability, and scalability of AI systems while collaborating closely with data scientists, researchers, and other engineering teams.

Your Impact

The ideal candidate will have strong hands-on expertise in Red Hat OpenShift, proficiency in Golang and/or Python, and a passion for delivering highly reliable, scalable, and secure infrastructure. The ideal candidate will also have hands-on experience with AI technologies such as Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU frameworks.

Core Responsibilities

Design, deploy, administer, and optimize highly available Red Hat OpenShift platforms.
Implement and drive Site Reliability Engineering (SRE) practices to ensure platform reliability, scalability, and operational excellence.
Develop automation tools, operators, and platform services using Golang and/or Python.
Manage cluster lifecycle activities including upgrades, patching, capacity planning, and performance tuning.
Build and maintain CI/CD pipelines and Infrastructure as Code (IaC) solutions.
Implement and maintain observability solutions including logging, metrics, tracing, and alerting.
Monitor platform health and proactively identify and resolve reliability and performance issues.
Resolve production incidents, perform root cause analysis (RCA), and drive preventive actions.
Collaborate closely with application and DevOps teams to improve deployment processes and platform adoption.
Ensure platform security, compliance, and adherence to organizational standards and procedures.
Participate in a 16×5 on-call support rotation, providing timely response and resolution for production incidents and ensuring service availability.
Continuously evaluate and adopt emerging technologies to enhance platform capabilities and operational efficiency.
Collaborate with global cross-functional teams across regions to support platform initiatives, drive operational excellence, and ensure seamless delivery of services and solutions.
Support the GPU-as-a-Service platform offering and provide client support for hosting GPU-powered AI/ML workloads.

Minimum Qualifications / Requirements

4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or related roles.
Strong hands-on experience with Red Hat OpenShift administration, operations, and troubleshooting.
Proficiency in Golang and/or Python for automation and platform engineering.
Experience with container technologies such as Docker and container runtimes.
Strong understanding of Linux systems, networking, and distributed systems concepts.
Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or equivalent.
Experience with CI/CD tools such as Jenkins, GitLab CI, ArgoCD, Tekton, or similar.
Proven experience with observability tools such as Prometheus, Grafana, ELK, Loki, Jaeger, and OpenTelemetry.
Strong troubleshooting, debugging, and incident management capabilities.
Hands-on experience with AI/ML platforms, Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU architectures.
Experience with AI frameworks such as LangChain or LlamaIndex, or with vector databases.
Ability to support and participate in 16×5 on-call rotations for critical production environments.

Preferred Qualifications / Requirements

Familiarity with public cloud platforms (AWS, Azure, or GCP).
Familiarity with GitOps methodologies and tools.
Experience with service mesh technologies such as Istio.
Knowledge of container and platform security standards.
Reliability-first and automation-driven attitude.
Strong analytical and problem-solving skills.
Ability to work effectively in a fast-paced production environment.
Excellent communication and partnership skills.
Ownership, accountability, and a customer-focused approach.

Why Cisco?

At Cisco, we’re revolutionizing how data and infrastructure connect and protect organizations in the AI era – and beyond. We’ve been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds. These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint.

Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you’ll see that the opportunities to grow and build are limitless. We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere.

We are Cisco, and our power starts with you.