AI Platform Engineer

KUOK (SINGAPORE) LIMITED

About the Role

We are seeking a passionate AI Platform Engineer to build and own, end to end, the infrastructure layer that every AI use case in Kuok Group runs on: the LLM gateway, the deployment platform, CI/CD pipelines, model serving, observability, cost controls, and the eval pipeline infrastructure. This role reports to the Principal AI Architect.

This is a T-shaped role: broad cloud and DevOps foundations, with deep specialism in LLM infrastructure. The ideal candidate is as comfortable provisioning environments and managing release pipelines as configuring a model gateway, wiring up LangSmith traces, and building an eval harness.

Working closely with the Head, AI Platform on architecture direction and with the LLM Ops / MLOps Engineer on the observability and eval layer, this person will be the backbone of the platform that Applied AI Engineers depend on to ship confidently and at pace.

Key Responsibilities

Deployment Platform & CI/CD

  • Design, build, and maintain CI/CD pipelines for all AI use cases — from code commit through staging to production, with automated release gates and rollback capability
  • Own environment provisioning and infra-as-code (Terraform or equivalent) — staging, UAT, and production environments should be reproducible, version-controlled, and auditable
  • Manage the deployment platform end to end: release scheduling, environment promotion, incident response, and post-deployment validation
  • Champion good deployment hygiene: automated pipelines, version-controlled configuration, and documented environment differences as standard practice

LLM Gateway & Model Serving

  • Build and operate the LLM gateway layer (LiteLLM or equivalent) — API access controls, rate limiting, model routing, and failover across Azure-backed endpoints
  • Manage model serving configuration: endpoint management, load balancing, latency SLOs, and model switching without disrupting live use cases
  • Own secrets and access management for all model API credentials and service accounts across environments
  • Maintain a prompt and model version registry so that every production use case can be traced to a specific model version and prompt configuration

Observability, Cost & Controls

  • Instrument all deployed use cases with LLM observability tooling (LangSmith or equivalent): traces, latency, token counts, and error rates as standard
  • Build and maintain cost telemetry dashboards: per-use-case token consumption, compute spend, and alerting on cost anomalies
  • Implement and maintain token budget controls and rate limits across BUs — keeping cost visible and predictable is a shared responsibility that starts at the platform layer
  • Own general platform monitoring and reliability: uptime, alerting, on-call runbooks, and incident response for platform-layer issues

Eval Pipeline Infrastructure

  • Build the infrastructure layer for LLM evaluation pipelines — test harnesses, regression runners, and LLM-as-judge scaffolding used by Applied AI Engineers per use case
  • Work with the LLM Ops / MLOps Engineer on eval pipeline design
  • Ensure eval pipeline runs are logged, versioned, and traceable — eval results should be reproducible
  • Support evals as a consistent deployment gate — working with the team to ensure every use case has a passing eval run on the current model version before moving to production

Standards & Collaboration

  • Maintain platform documentation — architecture diagrams, runbooks, environment specs, and onboarding guides — so institutional knowledge is shared and accessible across the team
  • Work within the Head, AI Platform's engineering standards: all platform changes go through code review before deployment
  • Support the QA / Dev Engineers (Applied AI cluster) on integration and regression testing where it touches the platform layer
  • Proactively surface platform-layer risks and capacity constraints to the Head, AI Platform

Requirements

Must-Have

  • Solid cloud and DevOps engineering foundations: you have built and operated CI/CD pipelines, managed environments with IaC, and handled production deployments and rollbacks on at least one major cloud platform (Azure, AWS, or GCP); you are comfortable working across Linux and Windows Server, and familiar with core networking concepts such as VPC/VNet, DNS, firewalls, and load balancers
  • Hands-on experience with LLM infrastructure: you have configured and operated a model gateway or API proxy layer, managed multi-model routing, and dealt with rate limits and failover in a live environment
  • LLM observability experience — you have instrumented production AI systems with tracing and monitoring tooling and used the data to diagnose issues
  • Cost telemetry and token controls — you understand how LLM API costs are structured and have built or operated dashboards and controls to keep spend visible and bounded
  • Strong Python skills and comfort with the full LLM deployment tooling ecosystem; you are equally at home in application code and infrastructure configuration
  • Strong appreciation for documentation and configuration management — environments as code, clear runbooks, and written context that helps the team move faster together

Strong Advantage

  • Experience with eval pipeline infrastructure: test harness design, regression frameworks, LLM-as-judge scaffolding, or automated output quality checks
  • Security and access management experience in an AI context: IAM, RBAC, secrets management, API credential rotation, encryption at rest and in transit, and least-privilege access design for model-serving environments
  • Familiarity with MLOps practices: model versioning, A/B traffic splitting, canary deployments for model updates
  • Experience supporting engineering teams as a platform provider — you understand that your internal customers are the engineers shipping use cases, and you design for their velocity as well as for reliability
  • Exposure to enterprise multi-tenant environments: managing shared infrastructure across multiple teams or business units with different access and cost boundaries; familiarity with virtualisation platforms (VMware, Hyper-V, or Nutanix) is a plus
