AI Platform Engineer
KUOK (SINGAPORE) LIMITED
About the Role
We are seeking a passionate AI Platform Engineer to build and own the infrastructure layer that every AI use case in Kuok Group runs on: the LLM gateway, the deployment platform, CI/CD pipelines, model serving, observability, cost controls, and the eval pipeline infrastructure, end to end. This role reports to the Principal AI Architect.
This is a T-shaped role: broad cloud and DevOps foundations, with deep specialism in LLM infrastructure. The ideal candidate is equally comfortable provisioning environments and managing release pipelines as they are configuring a model gateway, wiring up LangSmith traces, and building an eval harness.
Working closely with the Head, AI Platform on architecture direction and with the LLM Ops / MLOps Engineer on the observability and eval layer, this person will be the backbone of the platform that Applied AI Engineers depend on to ship confidently and at pace.
Key Responsibilities
Deployment Platform & CI/CD
- Design, build, and maintain CI/CD pipelines for all AI use cases — from code commit through staging to production, with automated release gates and rollback capability
- Own environment provisioning and infra-as-code (Terraform or equivalent) — staging, UAT, and production environments should be reproducible, version-controlled, and auditable
- Manage the deployment platform end to end: release scheduling, environment promotion, incident response, and post-deployment validation
- Champion good deployment hygiene: automated pipelines, version-controlled configuration, and documented environment differences as standard practice
LLM Gateway & Model Serving
- Build and operate the LLM gateway layer (LiteLLM or equivalent) — API access controls, rate limiting, model routing, and failover across Azure-backed endpoints
- Manage model serving configuration: endpoint management, load balancing, latency SLOs, and model switching without disrupting live use cases
- Own secrets and access management for all model API credentials and service accounts across environments
- Maintain a prompt and model version registry so that every production use case can be traced to a specific model version and prompt configuration
Observability, Cost & Controls
- Instrument all deployed use cases with LLM observability tooling (LangSmith or equivalent), with traces, latency, token counts, and error rates as standard
- Build and maintain cost telemetry dashboards: per-use-case token consumption, compute spend, and alerting on cost anomalies
- Implement and maintain token budget controls and rate limits across BUs — keeping cost visible and predictable is a shared responsibility that starts at the platform layer
- Own general platform monitoring and reliability: uptime, alerting, on-call runbooks, and incident response for platform-layer issues
Eval Pipeline Infrastructure
- Build the infrastructure layer for LLM evaluation pipelines — test harnesses, regression runners, and LLM-as-judge scaffolding used by Applied AI Engineers per use case
- Work with the LLM Ops / MLOps Engineer on eval pipeline design
- Ensure eval pipeline runs are logged, versioned, and traceable — eval results should be reproducible
- Support evals as a consistent deployment gate — working with the team to ensure every use case has a passing eval run on the current model version before moving to production
Standards & Collaboration
- Maintain platform documentation — architecture diagrams, runbooks, environment specs, and onboarding guides — so institutional knowledge is shared and accessible across the team
- Work within the Head, AI Platform's engineering standards: all platform changes go through code review before deployment
- Support the QA / Dev Engineers (Applied AI cluster) on integration and regression testing where it touches the platform layer
- Proactively surface platform-layer risks and capacity constraints to the Head, AI Platform
Requirements
Must-Have
- Solid cloud and DevOps engineering foundations: you have built and operated CI/CD pipelines, managed environments with IaC, and handled production deployments and rollbacks on at least one major cloud platform (Azure, AWS, or GCP). You are comfortable working across Linux and Windows Server and familiar with core networking concepts: VPC/VNet, DNS, firewalls, and load balancers
- Hands-on experience with LLM infrastructure: you have configured and operated a model gateway or API proxy layer, managed multi-model routing, and dealt with rate limits and failover in a live environment
- LLM observability experience — you have instrumented production AI systems with tracing and monitoring tooling and used the data to diagnose issues
- Cost telemetry and token controls — you understand how LLM API costs are structured and have built or operated dashboards and controls to keep spend visible and bounded
- Strong Python skills and comfort with the full LLM deployment tooling ecosystem; equally at home in application code and infrastructure configuration
- Strong appreciation for documentation and configuration management — environments as code, clear runbooks, and written context that helps the team move faster together
Strong Advantage
- Experience with eval pipeline infrastructure: test harness design, regression frameworks, LLM-as-judge scaffolding, or automated output quality checks
- Security and access management experience in an AI context: IAM, RBAC, secrets management, API credential rotation, encryption at rest and in transit, and least-privilege access design for model-serving environments
- Familiarity with MLOps practices: model versioning, A/B traffic splitting, canary deployments for model updates
- Experience supporting engineering teams as a platform provider — you understand that your internal customers are the engineers shipping use cases, and you design for their velocity as well as for reliability
- Exposure to enterprise multi-tenant environments: managing shared infrastructure across multiple teams or business units with different access and cost boundaries; familiarity with virtualisation platforms (VMware, Hyper-V, or Nutanix) is a plus