Senior Production Engineer
KAISHI PARTNERS PTE. LTD.

A well-funded global crypto custodian firm is seeking a Senior Production Engineer to join its team in Singapore.
Role Summary
As a Senior Production Engineer, you will be responsible for the reliability, scalability, and observability of our platform. This role combines software engineering expertise with deep operational ownership. You will define and implement the strategies that keep our systems resilient, performant, and secure, and you will represent system health and risks in product roadmap discussions.
You will be both a hands-on technical leader and a strategic contributor, working across Engineering and Product to build a world-class digital assets infrastructure. This will involve developing our existing product suite and integrating with a variety of third-party services to form a digital asset trading ecosystem.
Key Responsibilities
- Own production stability, including site reliability engineering, infrastructure, observability, and incident response.
- Provide thought leadership and architectural design input to the Engineering team on system resilience and scale.
- Drive a culture of proactive reliability, incident learning, and continuous improvement.
- Own and evolve our monitoring, alerting, and incident management frameworks.
- Lead root cause analysis, postmortems, and resilience improvements.
- Implement and maintain SLIs, SLOs, and error budgets in collaboration with platform and product teams.
- Develop strategies for system scaling (horizontal and vertical), performance tuning, and capacity planning on Azure.
- Lead engineering team efforts in disaster recovery and failover planning.
- Design and implement tools and automation to support self-healing, auto-scaling, and rapid recovery systems.
- Contribute hands-on to the backend codebase (Java/Spring) to improve runtime performance, observability, and fault tolerance.
- Represent platform stability, risk, and incident trends in Product Prioritisation and Planning meetings, and advocate for technical debt reduction, reliability features, and production-readiness across teams.
Required Skills and Experience
- 8+ years in software engineering and/or production infrastructure roles.
- Strong coding skills in Java and experience with the Spring ecosystem.
- Deep hands-on experience with cloud services, ideally Azure, including AKS, Azure Monitor, Application Insights, and Key Vault.
- Expertise in observability tooling (e.g., Graylog, Prometheus, Grafana, ELK, OpenTelemetry).
- Proven experience running mission-critical, high-uptime services, ideally in fintech, crypto, or other transactional environments.
- Solid understanding of distributed systems, microservices architecture, and container orchestration (Kubernetes and Docker).
- Experience with Infrastructure as Code tools (Terraform, Bicep, or similar).
- Experience integrating disparate systems to ensure clean interfaces and to manage capacity planning across the estate.
Preferred Qualifications
- Familiarity with blockchain systems, digital asset custody, or crypto exchange platforms.
- Strong skills in RDBMS performance tuning, ideally on MS SQL Server.
- Experience with regulated environments (e.g., financial compliance, GDPR, ISO 27001).
- Strong understanding of SLAs/SLOs/SLIs and the principles of site reliability engineering.
- Exposure to chaos engineering, performance testing, and auto-remediation strategies.
- Good degree in a STEM subject.