AI Platform Engineer

KUOK (SINGAPORE) LIMITED

About the Role

We are seeking a passionate AI Platform Engineer to build and own, end to end, the infrastructure layer that every AI use case in Kuok Group runs on: the LLM gateway, the deployment platform, CI/CD pipelines, model serving, observability, cost controls, and the eval pipeline infrastructure. This role reports to the Principal AI Architect.

This is a T-shaped role: broad cloud and DevOps foundations, with deep specialism in LLM infrastructure. The ideal candidate is as comfortable provisioning environments and managing release pipelines as they are configuring a model gateway, wiring up LangSmith traces, and building an eval harness.

Working closely with the Head, AI Platform on architecture direction and with the LLM Ops / MLOps Engineer on the observability and eval layer, this person will be the backbone of the platform that Applied AI Engineers depend on to ship confidently and at pace.

Key Responsibilities

Deployment Platform & CI/CD

Design, build, and maintain CI/CD pipelines for all AI use cases — from code commit through staging to production, with automated release gates and rollback capability
Own environment provisioning and infra-as-code (Terraform or equivalent) — staging, UAT, and production environments should be reproducible, version-controlled, and auditable
Manage the deployment platform end to end: release scheduling, environment promotion, incident response, and post-deployment validation
Champion good deployment hygiene: automated pipelines, version-controlled configuration, and documented environment differences as standard practice

LLM Gateway & Model Serving

Build and operate the LLM gateway layer (LiteLLM or equivalent) — API access controls, rate limiting, model routing, and failover across Azure-backed endpoints
Manage model serving configuration: endpoint management, load balancing, latency SLOs, and model switching without disrupting live use cases
Own secrets and access management for all model API credentials and service accounts across environments
Maintain a prompt and model version registry so that every production use case can be traced to a specific model version and prompt configuration

Observability, Cost & Controls

Instrument all deployed use cases with LLM observability tooling (LangSmith or equivalent), capturing traces, latency, token counts, and error rates as standard
Build and maintain cost telemetry dashboards: per-use-case token consumption, compute spend, and alerting on cost anomalies
Implement and maintain token budget controls and rate limits across BUs — keeping cost visible and predictable is a shared responsibility that starts at the platform layer
Own general platform monitoring and reliability: uptime, alerting, on-call runbooks, and incident response for platform-layer issues

Eval Pipeline Infrastructure

Build the infrastructure layer for LLM evaluation pipelines — test harnesses, regression runners, and LLM-as-judge scaffolding used by Applied AI Engineers per use case
Work with the LLM Ops / MLOps Engineer on eval pipeline design
Ensure eval pipeline runs are logged, versioned, and traceable — eval results should be reproducible
Support evals as a consistent deployment gate — working with the team to ensure every use case has a passing eval run on the current model version before moving to production

Standards & Collaboration

Maintain platform documentation — architecture diagrams, runbooks, environment specs, and onboarding guides — so institutional knowledge is shared and accessible across the team
Work within the Head, AI Platform's engineering standards: all platform changes go through code review before deployment
Support the QA / Dev Engineers (Applied AI cluster) on integration and regression testing where it touches the platform layer
Proactively surface platform-layer risks and capacity constraints to the Head, AI Platform

Requirements

Must-Have

Solid cloud and DevOps engineering foundations: you have built and operated CI/CD pipelines, managed environments with IaC, and handled production deployments and rollbacks on at least one major cloud platform (Azure, AWS, or GCP); you are comfortable working across Linux and Windows Server, and familiar with core networking concepts such as VPC/VNET, DNS, firewalls, and load balancers
Hands-on experience with LLM infrastructure: you have configured and operated a model gateway or API proxy layer, managed multi-model routing, and dealt with rate limits and failover in a live environment
LLM observability experience — you have instrumented production AI systems with tracing and monitoring tooling and used the data to diagnose issues
Cost telemetry and token controls — you understand how LLM API costs are structured and have built or operated dashboards and controls to keep spend visible and bounded
Strong Python skills and comfort with the full LLM deployment tooling ecosystem; you are equally at home in application code and infrastructure configuration
Strong appreciation for documentation and configuration management — environments as code, clear runbooks, and written context that helps the team move faster together

Strong Advantage

Experience with eval pipeline infrastructure: test harness design, regression frameworks, LLM-as-judge scaffolding, or automated output quality checks
Security and access management experience in an AI context: IAM, RBAC, secrets management, API credential rotation, encryption at rest and in transit, and least-privilege access design for model-serving environments
Familiarity with MLOps practices: model versioning, A/B traffic splitting, canary deployments for model updates
Experience supporting engineering teams as a platform provider — you understand that your internal customers are the engineers shipping use cases, and you design for their velocity as well as for reliability
Exposure to enterprise multi-tenant environments: managing shared infrastructure across multiple teams or business units with different access and cost boundaries; familiarity with virtualisation platforms (VMware, Hyper-V, or Nutanix) is a plus