AI / High-Performance Computing Engineer
DSO National Laboratories
In this role, you will:
- Participate in the full lifecycle of HPC cluster operations, from system bring-up and bring-down, through workload characterisation and optimisation, to the rollout of new AI and software services.
- Design and operate a GPU orchestration layer with high availability and utilisation for AI training, inference and other scientific workloads.
- Partner with other DSO engineers to design standards, automate operations, and translate research code into performant workloads on distributed systems.
- Maintain hardware infrastructure, distributed storage, high-speed networking and supporting IT infrastructure, and support maintenance and upgrades.
Requirements:
- Degree in Computer Science & Engineering / Software Engineering / Artificial Intelligence or any other related field.
- Minimum of 2 years' experience in IT infrastructure or a related field. More experienced candidates may be considered for a senior role.
- Strong proficiency in Linux environments, computer architecture, and Python / Bash scripting for tooling and automation.
- Working proficiency in Kubernetes container orchestration and infrastructure provisioning/management software (e.g. Ansible, Terraform) for fleet automation.
- Experience with NVIDIA GPUs, GitOps, infrastructure CI/CD, networking protocols, and other AI infrastructure technologies will be advantageous.
- Strong written and verbal communication skills to lead vendor and cross-functional engagements and/or performance analysis and troubleshooting initiatives.
SKILLS
PARALLEL COMPUTING
DISTRIBUTED SYSTEMS
QUANTUM COMPUTING
JOB ID
1107405
EXPERIENCE
0 ~ 3 years