Cloud Engineer (Senior Level)
XIAOMI TECHNOLOGIES SINGAPORE PTE. LTD.
- Maintain high availability of production systems through resilient cloud architecture, fast incident detection and mitigation, postmortem-driven improvements, and automated routine risk checks.
- Leverage AI assistants and coding agents to reduce toil and improve incident diagnosis, infrastructure automation, knowledge management, and operational efficiency.
- Participate in on-call rotations and respond to production incidents to ensure service continuity. Compensatory time off will be provided in accordance with company policy.
- Perform server administration and cloud operations, including troubleshooting, monitoring, and capacity management of Linux cloud virtual servers and Kubernetes workloads.
- Analyze distributed system issues, including performance and cost-efficiency bottlenecks, and drive improvements in reliability, performance, scalability, and cost optimization.
- Define and monitor service level objectives (SLOs) and continuously improve system resilience and availability.
- Perform proof-of-concept evaluations, including setup, testing, and production validation of cloud solutions and services before mass adoption.
- Deploy and configure cloud solutions in production environments, including resource provisioning, configuration, monitoring, and operational support across cloud platforms.
- Bachelor’s degree in Computer Science, Information Technology, Programming & Systems Analysis, Science (Computer Studies), or a related field.
- Alternatively, a minimum of 3–5 years of relevant experience in Site Reliability Engineering, Cloud Engineering, DevOps, or related roles.
- Proficiency in at least one programming or scripting language such as Python, Go, or Bash.
- Experience with cloud platforms such as Alibaba Cloud, AWS, Azure, or equivalent; experience in multi-cloud or hybrid cloud environments is preferred.
- Strong understanding of Linux systems (kernel, memory, process, etc.), networking (TCP/IP, DNS, TLS), load balancing, high availability architecture, and observability platforms such as Prometheus, Grafana, Loki and the ELK stack.
- Expert knowledge and hands-on experience in incident handling, especially in Kubernetes and container environments.
- Hands-on experience with nginx preferred, including configuration, troubleshooting, and performance tuning.
- Experience in deploying and operating large-scale production distributed systems, including server administration, microservice architecture, cloud load balancing (L4/L7), IP routing, reverse proxy architecture, and cloud support.
- Experience automating cloud operations and infrastructure, including scripting and CI/CD processes, is preferred.
- Experience using AI-assisted engineering tools and coding agents to improve automation, incident response, troubleshooting, and operational efficiency is an advantage.
- A strong team player who is responsible, self-driven, and highly motivated, with good communication skills.
- Ability to communicate in Mandarin and English to support coordination and collaboration with Mandarin-speaking stakeholders, teams, and business partners across regional markets.