Cloud Engineer (Senior Level)

XIAOMI TECHNOLOGIES SINGAPORE PTE. LTD.

Job Responsibilities

Maintain high availability of production systems by focusing on resilient cloud architecture, fast detection and mitigation of incidents, postmortem-driven improvements, and automated routine checks for risks.
Leverage AI assistants and coding agents to reduce toil and improve incident diagnosis, infrastructure automation, knowledge management, and operational efficiency.
Participate in on-call rotations and respond to production incidents to ensure service continuity. Compensatory time off will be provided in accordance with company policy.
Perform server administration and cloud operations, including troubleshooting, monitoring, and capacity management of Linux cloud virtual servers and Kubernetes workloads.
Analyze distributed system issues, including performance and cost-efficiency bottlenecks, and drive improvements in reliability, performance, scalability, and cost optimization.
Define and monitor service level objectives (SLOs) and continuously improve system resilience and availability.
Perform proof-of-concept evaluations, including setup, testing, and production validation of cloud solutions and services before mass adoption.
Deploy and configure cloud solutions in production environments, including resource provisioning, configuration, monitoring, and operational support across cloud platforms.

Job Requirements

Bachelor’s degree in Computer Science, Information Technology, Programming & Systems Analysis, Science (Computer Studies), or a related field.
Alternatively, a minimum of 3–5 years of relevant experience in Site Reliability Engineering, Cloud Engineering, DevOps, or related roles.
Proficiency in at least one programming or scripting language such as Python, Go, or Bash.
Experience with cloud platforms such as Alibaba Cloud, AWS, Azure, or equivalent; experience in multi-cloud or hybrid cloud environments is preferred.
Strong understanding of Linux systems (kernel, memory, process, etc.), networking (TCP/IP, DNS, TLS), load balancing, high availability architecture, and observability platforms such as Prometheus, Grafana, Loki and the ELK stack.
Expert knowledge and hands-on experience in incident handling, especially in Kubernetes and container environments.
Hands-on experience with nginx, including configuration, troubleshooting, and performance tuning, is preferred.
Experience in deploying and operating large-scale production distributed systems, including server administration, microservice architecture, cloud load balancing (L4/L7), IP routing, reverse proxy architecture, and cloud support.
Experience automating cloud operations and infrastructure, including scripting and CI/CD processes, is preferred.
Experience using AI-assisted engineering tools and coding agents to improve automation, incident response, troubleshooting, and operational efficiency is an advantage.
A strong team player with good communication skills, responsible, self-driven, and highly motivated.
Ability to communicate in Mandarin and English to support coordination and collaboration with Mandarin-speaking stakeholders, teams, and business partners across regional markets.