AI / High-Performance Computing Engineer

DSO National Laboratories

In this role, you will:

  • Participate in the full lifecycle of HPC cluster operations, from system bring-up and bring-down, through workload characterisation and optimisation, to the rollout of new AI and software services.
  • Design and operate a GPU orchestration layer with high availability and utilisation for AI training, inference, and other scientific workloads.
  • Partner with other DSO engineers to design standards, automate operations, and translate research code into performant workloads on distributed systems.
  • Maintain and upgrade hardware infrastructure, distributed storage, high-speed networking, and supporting IT infrastructure.

Requirements:

  • Degree in Computer Science & Engineering / Software Engineering / Artificial Intelligence or a related field.
  • Minimum of 2 years' experience in IT infrastructure or a related field. More experienced candidates may be considered for a senior role.
  • Strong proficiency in Linux environments, computer architecture, and Python / Bash scripting for tooling and automation.
  • Working proficiency in Kubernetes container orchestration and infrastructure provisioning/management software (e.g. Ansible, Terraform) for fleet automation.
  • Experience with NVIDIA GPUs, GitOps, Infra CI/CD, networking protocols, and other AI infrastructure technologies will be advantageous.
  • Strong written and verbal communication skills to lead vendor and cross-functional engagements and/or performance analysis and troubleshooting initiatives.

SKILLS

PARALLEL COMPUTING

DISTRIBUTED SYSTEMS

QUANTUM COMPUTING

JOB ID

1107405

EXPERIENCE

0 ~ 3 years