Senior Site Reliability Engineer, Vice President

Goldman Sachs

Senior Site Reliability Engineer (SRE) – Incident Management, Escalation, and Automation

Overview

We are seeking a seasoned Site Reliability Engineer who excels at incident response and management with a strong emphasis on escalation discipline and crisp, audience-appropriate communications.

You will partner closely with front-office trading desks, engineering teams, fellow SREs, and Application Business Operations (ABO) to enhance desk readiness, reduce manual workload through strategic automation and AI, and raise the bar on observability, capacity, and change quality across globally distributed systems. This role includes stewardship of cross-region handoffs, governance of error budgets, and the establishment of clear SRE KPIs to demonstrate value and drive continuous improvement.

Key Responsibilities

Incident Command, Escalation, and Communications
- Act as Incident Commander for high-severity events, ensuring timely escalation, resolver engagement, and transparent communications to technical and business stakeholders.
- Maintain consistent status updates, incident timelines, and customer/leadership communications; improve comms templates and runbooks for clarity and speed.
- Drive post-incident reviews with a blameless, learning-first approach; produce actionable remediation items, owners, and due dates.
Cross-Region Handoffs and Desk Readiness
- Own the cross-region handoff procedure to ensure emerging issues are surfaced globally, with explicit ownership, clear next steps, and desk-readiness checklists.
- Ensure shift notes, incident context, and risk hot-spots are consistently captured, discoverable, and actioned.
ABO Partnership and Workload Reduction
- Partner closely with ABO to identify incident/issue trends and patterns; quantify impact and prioritize engineering fixes that remove manual workarounds.
- Provide visibility into ABO workload; escalate when prioritization is needed for engineering solutions that reduce toil.
Strategic Automation and AI
- Apply engineering tenets to automate repetitive tasks, codify remediations, and implement self-healing mechanisms; evaluate and responsibly adopt AI to improve triage, runbook execution, and anomaly detection.
- Track toil reduction and time saved; feed back into prioritization and capacity planning.
Observability, Monitoring, and Alert Quality
- Collaborate with developers to improve instrumentation, SLIs, dashboards, and actionable alerts aligned to firmwide standards and globally consistent tooling.
- Improve alert signal-to-noise ratio through better thresholds, aggregation, deduplication, and suppression; validate alert-to-action mapping with runbooks and ownership.
- Expand tracing, logging, and metrics coverage to speed detection, triage, and root cause isolation.
SLOs, Error Budgets, and Reliability Governance
- Define and steward SLOs and SLIs across services; implement and manage error budgets with clear policies influencing release velocity and risk acceptance.
- Facilitate data-driven tradeoffs between feature delivery and reliability; regularly review budget burn with product and engineering.
Capacity Engineering and Scalability
- Drive capacity engineering standards; partner with teams on forecasting, scaling strategies, and reporting (leading indicators, saturation, headroom).
- Work with developers to automate capacity tests, limit management, and scaling actions; ensure predictable behavior under load and graceful degradation.
Change Quality and ORR Gatekeeping
- Oversee change quality across environments; reduce change-related incidents through pre-deployment checks, progressive delivery, and canaries.
- Serve as ORR (Operational Readiness Review) gatekeeper, validating observability, runbooks, on-call readiness, rollback plans, and dependencies before go-live.
Documentation, Runbooks, and Training
- Review and improve documentation freshness, clarity, and completeness; identify and automate runbook steps with high repeatability.
- Train developers on SRE fundamentals: SLOs/SLIs, error budgets, incident roles, on-call hygiene, and production-readiness best practices.
KPIs and Reporting
- Establish, track, and publish SRE KPIs and OKRs to evidence value, including MTTD, MTTA, MTTR, incident frequency and severity distribution, change failure rate, error budget burn, alert quality, toil reduction, and capacity headroom.
- Produce regular executive-ready reports and partner dashboards; highlight trends, risks, and the impact of reliability investments.

Qualifications

Minimum of 5 years in SRE, production operations, or reliability-focused engineering supporting high-availability, customer-facing or trading/front-office systems.
Proven experience as Incident Commander with measurable improvements in escalation timeliness, communications quality, and MTTR.
Strong foundations in Linux, networking (DNS, HTTP, TLS, routing), distributed systems, and public cloud (AWS/Azure/GCP).
Hands-on with observability stacks (e.g., Prometheus, Grafana, OpenTelemetry, ELK), incident tooling (e.g., PagerDuty, Opsgenie), and collaboration platforms (e.g., Slack/Teams).
Proficiency with infrastructure-as-code and automation (e.g., Terraform, CloudFormation, Ansible) and at least one modern programming language (Go, Python).
Experience implementing SLO/SLI/error budgets, capacity planning, progressive delivery (feature flags, canary, blue/green), and chaos/game days.
Excellent written and verbal communication; able to translate complex technical contexts into concise updates for executives and business stakeholders.
Comfortable working across time zones with strong ownership of cross-region handoffs and follow-through.

Preferred Experience

Front-office/trading or similarly latency- and availability-sensitive environments; close partnership with business operations (ABO) or site operations teams.
Kubernetes-based microservices, service meshes, multi-region architectures, and global standards harmonization.
Building AI-assisted operations (alert enrichment, anomaly detection, runbook copilots) with measurable toil reduction.
Operating status pages and customer-facing incident communications.
Implementing ITIL-aligned processes adapted to SRE practices; ORR frameworks and governance.

Success Metrics

Faster detection and resolution: lower MTTD, MTTA, MTTR for incidents.
Higher alert quality: reduced volume, higher precision, clear actionability.
Reduced change failure rate; increased success of progressive rollouts.
Measurable toil reduction for ABO and engineering through automation and AI.
Improved capacity predictability through documented headroom and fewer saturation events.
Documentation freshness and runbook automation coverage.
Positive stakeholder feedback on handoffs, communications, and incident leadership.

ABOUT GOLDMAN SACHS

At Goldman Sachs, we commit our people, capital and ideas to help our clients, shareholders and the communities we serve to grow. Founded in 1869, we are a leading global investment banking, securities and investment management firm. Headquartered in New York, we maintain offices around the world.

We believe who you are makes you better at what you do. We're committed to fostering and advancing diversity and inclusion in our own workplace and beyond by ensuring every individual within our firm has a number of opportunities to grow professionally and personally, from our training and development opportunities and firmwide networks to benefits, wellness and personal finance offerings and mindfulness programs. Learn more about our culture, benefits, and people at GS.com/careers.

We’re committed to finding reasonable accommodations for candidates with special needs or disabilities during our recruiting process. Learn more: https://www.goldmansachs.com/careers/footer/disability-statement.html

Goldman Sachs is an equal opportunity employer and does not discriminate on the basis of race, color, religion, sex, national origin, age, veteran status, disability, or any other characteristic protected by applicable law.

We Offer Best-In-Class Benefits

Healthcare & Medical Insurance

We offer a wide range of health and welfare programs that vary depending on office location. These generally include medical, dental, short-term disability, long-term disability, life, accidental death, labor accident and business travel accident insurance.

Holiday & Vacation Policies

We offer competitive vacation policies based on employee level and office location. We promote time off from work to recharge by providing generous vacation entitlements and a minimum of three weeks expected vacation usage each year.

Financial Wellness & Retirement

We assist employees in saving and planning for retirement, offer financial support for higher education, and provide a number of benefits to help employees prepare for the unexpected. We offer live financial education and content on a variety of topics to address the spectrum of employees’ priorities.

Health Services

We offer a medical advocacy service for employees and family members facing critical health situations, and counseling and referral services through the Employee Assistance Program (EAP). We provide Global Medical, Security and Travel Assistance and a Workplace Ergonomics Program. We also offer state-of-the-art on-site health centers in certain offices.

Fitness

To encourage employees to live a healthy and active lifestyle, some of our offices feature on-site fitness centers. For eligible employees we typically reimburse fees paid for a fitness club membership or activity (up to a pre-approved amount).

Child Care & Family Care

We offer on-site child care centers that provide full-time and emergency back-up care, as well as mother and baby rooms and homework rooms. In every office, we provide advice and counseling services, expectant parent resources and transitional programs for parents returning from parental leave. Adoption, surrogacy, egg donation and egg retrieval stipends are also available.

Benefits at Goldman Sachs

Read more about the full suite of class-leading benefits our firm has to offer.

Opportunity Overview

CORPORATE TITLE: Vice President

OFFICE LOCATION(S): Singapore

JOB FUNCTION: Software Engineering

DIVISION: Asset & Wealth Management