Key Responsibilities

  • Lead end-to-end infrastructure automation and platform reliability initiatives in high-performance environments.
  • Design, implement, and manage CI/CD pipelines, IaC frameworks, and container orchestration systems.
  • Apply SRE principles to ensure platform resilience, high availability, and continuous improvement.
  • Develop and implement strategies for monitoring, observability, and incident response using open-source tools.
  • Mentor and guide a small engineering team, fostering collaboration and technical excellence.
  • Ensure systems adhere to security, compliance, scalability, and cost optimization best practices.
  • Collaborate across product, development, and DevOps teams to define architectural standards and promote automation-first practices.

Required Skills

  • Proven hands-on expertise with IaC, cloud platforms, CI/CD pipelines, containerization, orchestration, and SRE principles.
  • Strong experience with IaC tools such as Ansible, Terraform, CloudFormation, or Pulumi.
  • Deep understanding of resource management frameworks such as Kubernetes, Apache Mesos, or YARN.
  • Proficiency in Linux administration, with experience in monitoring, logging, and observability using Prometheus, Grafana, and the ELK Stack.
  • Programming proficiency in Python, Java, or Golang, with strong architectural and system design skills focused on scalability and resilience.
  • Practical knowledge of multi-cloud and hybrid-cloud architectures.

Preferred Skills

  • Experience in network and infrastructure operations engineering.
  • Understanding of network protocols (TCP/IP, UDP, HTTP/HTTPS, DNS, BGP, OSPF, VXLAN, IPsec, etc.).
  • Familiarity with network security and automation, including zero-trust frameworks, TLS/SSL, and modern automation protocols such as gNMI/gRPC and RESTCONF.
  • Experience with Agile methodologies (Scrum/Kanban) and with SRE and delivery performance measures (SLIs, SLOs, MTTR, deployment frequency).
  • Strong Python scripting expertise for network automation (API integrations, structured data, parsing, error handling, packaging).
  • Proven hands-on experience with Terraform and Ansible in production environments.
  • Practical experience with NETCONF and YANG for model-driven network automation.
  • Strong expertise in Jinja templating for configuration generation and standardization.

Soft Skills

  • Strong leadership and mentoring abilities.
  • Excellent problem-solving and analytical thinking.
  • Effective communication across technical and non-technical teams.
  • Ability to thrive in fast-paced, evolving technology environments.
  • Collaborative and automation-driven mindset.

Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
  • 5–10 years of experience in Infrastructure, DevOps, or SRE roles within product-focused organizations.
  • Hands-on experience with major cloud platforms (AWS, Azure, or GCP).

Preferred Certifications

  • DevOps or SRE certification.
  • Kubernetes Certification (CKA / CKAD).
  • Network or Security Certifications (CCNA, CompTIA, or equivalent).