Observability & Incidents

You cannot fix what you cannot see. Every outage that lasts longer than it should shares a root cause: the team lacked visibility into what the system was doing. Logs existed but were unstructured and unsearchable. Metrics existed but nobody had set thresholds. Alerts existed but fired so often they were ignored. Observability is not a tool you install; it is a discipline you build into every service from day one. And when something breaks, the incident response process must be automatic, not improvised.

Why this matters

When a production system goes down at 2am or a client reports data they cannot access, the speed and quality of your response defines your team's credibility. S&P's value of Care means we treat our clients' systems with the same urgency we would treat our own. A 30-minute outage with clear communication is recoverable. A 4-hour outage with silence is a relationship-ending event.

Observability is also where Teamwork becomes operational. No single person should be the only one who knows how to diagnose a production issue. Structured logs, dashboards, runbooks, and a defined incident process ensure that anyone on the team can step in, understand the situation, and act, regardless of who built the feature or who is on call.

The standard

The three pillars

Observability rests on three complementary signals. Each answers a different question about your system. All three are required for production services.

Logs answer "what happened?" They are the narrative record of events: requests processed, errors thrown, jobs executed, state changes applied. Without structured logs, debugging means grepping through noise.

Metrics answer "how is the system performing?" They are numeric measurements over time: request latency, error rate, CPU usage, queue depth, active connections. Without metrics, you cannot set alerts, cannot spot trends, and cannot answer "is this normal?"

Traces answer "where did the time go?" They follow a single request across service boundaries: from the API gateway through the backend to the database and back. Without traces, debugging latency in a multi-service system is guesswork.

A system with logs but no metrics cannot alert. A system with metrics but no logs cannot explain why an alert fired. A system with both but no traces cannot pinpoint which service in the chain is the bottleneck. You need all three.

Structured logging

Every S&P service must produce structured JSON logs. Unstructured text logs (console.log("user created")) are not searchable, not parseable, and not useful at scale. Structured logs are machine-readable and human-debuggable.

Mandatory log fields:

Field	Purpose	Example
`timestamp`	When the event occurred (ISO 8601, UTC)	`"2026-05-27T14:32:01.123Z"`
`level`	Severity (error, warn, info, debug)	`"error"`
`message`	Human-readable event description	`"Failed to process payment"`
`service`	Which service produced the log	`"api-gateway"`
`correlationId`	Request-scoped ID for tracing across logs	`"req_01JX..."`
`context`	Additional structured data	`{ "userId": "abc", "orderId": "xyz" }`
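
As an illustration of the mandatory fields, here is a minimal sketch of a typed log entry serialised as one JSON line. The `LogEntry` type and the `formatLogLine` helper are hypothetical names used only for this example; in a real service Pino produces the line for you.

```typescript
// Severity levels from the mandatory-fields table.
type LogLevel = "error" | "warn" | "info" | "debug";

// One structured log entry carrying the six mandatory fields.
interface LogEntry {
  timestamp: string; // ISO 8601, UTC
  level: LogLevel;
  message: string;
  service: string;
  correlationId: string;
  context: Record<string, unknown>;
}

// Hypothetical helper (not part of Pino): builds an entry and returns it as a JSON line.
function formatLogLine(
  level: LogLevel,
  message: string,
  service: string,
  correlationId: string,
  context: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  const entry: LogEntry = {
    timestamp: now.toISOString(), // toISOString() always emits UTC with a "Z" suffix
    level,
    message,
    service,
    correlationId,
    context,
  };
  return JSON.stringify(entry);
}

// Example values are placeholders.
const line = formatLogLine(
  "error",
  "Failed to process payment",
  "api-gateway",
  "req_example",
  { userId: "abc", orderId: "xyz" },
  new Date("2026-05-27T14:32:01.123Z"),
);
console.log(line);
```

Typing the entry makes a missing field a compile-time error rather than something discovered during an incident.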

NestJS logging standard. Use Pino via nestjs-pino. Pino produces JSON by default, is significantly faster than Winston, and integrates with GCP Cloud Logging without transformation. For the main.ts and PinoModule wiring (serializers, redaction, correlation-ID generation), see Backend Reference: Structured logging with Pino.

Correlation IDs are non-negotiable. Every incoming request must be assigned a correlation ID (use the x-correlation-id header if present, generate one otherwise). This ID propagates through every log entry, every downstream service call, and every database query for that request. When a user reports an error, the correlation ID lets you reconstruct the entire request lifecycle across all services in seconds.
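
The "reuse the header if present, generate otherwise" rule can be sketched as below. This is a framework-agnostic illustration, not the Backend Reference wiring: the function names, the `req_` prefix, and the use of a UUID as the generated value are assumptions made for the example.

```typescript
import { randomUUID } from "node:crypto";

const CORRELATION_HEADER = "x-correlation-id";

// Node lower-cases incoming header names; a value may be a string, a list, or absent.
type IncomingHeaders = Record<string, string | string[] | undefined>;

// Reuse the caller's correlation ID when one was sent, otherwise generate a new one.
function resolveCorrelationId(headers: IncomingHeaders): string {
  const raw = headers[CORRELATION_HEADER];
  const incoming = Array.isArray(raw) ? raw[0] : raw;
  if (incoming !== undefined && incoming.trim() !== "") {
    return incoming.trim();
  }
  return `req_${randomUUID()}`; // assumed ID format, for illustration only
}

// Attach the ID to the headers of a downstream call so the next service reuses it.
function withCorrelationId(
  correlationId: string,
  outgoing: Record<string, string> = {},
): Record<string, string> {
  return { ...outgoing, [CORRELATION_HEADER]: correlationId };
}

const reused = resolveCorrelationId({ "x-correlation-id": "req_abc" });
const generated = resolveCorrelationId({});
console.log(reused, generated, withCorrelationId(reused));
```

Propagation only works if every outgoing call goes through something like `withCorrelationId`, which is why this normally lives in shared HTTP client configuration rather than in individual handlers.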

Log levels (use them consistently):

Level	When to use	Example
`error`	Something failed that should not have. Requires attention.	Database connection lost, payment processing failed, unhandled exception
`warn`	Something unexpected happened but the system recovered.	Retry succeeded after timeout, deprecated API endpoint called, rate limit approaching
`info`	Normal operational events worth recording.	Request completed, job finished, user authenticated, deployment started
`debug`	Detailed diagnostic information for development.	Query parameters, cache hit/miss, internal state transitions

What never goes in logs: Passwords, tokens, API keys, credit card numbers, personal health information, or any data classified as sensitive PII. Use the redact option in Pino to strip sensitive headers automatically. For request/response bodies, log the shape (field names) but not the values of sensitive fields.
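
To make "log the shape but not the values" concrete, here is a small sketch that keeps field names and replaces sensitive values before a body is logged. The key list and the `redactSensitive` name are hypothetical; in a real service, prefer Pino's built-in redact option and treat this only as an illustration of the idea.

```typescript
// Hypothetical deny-list of field names, compared case-insensitively.
const SENSITIVE_KEYS = new Set([
  "password",
  "token",
  "apikey",
  "authorization",
  "cardnumber",
]);

// Recursively copies a value, replacing the values of sensitive fields.
// Field names survive, so the log still shows the shape of the payload.
function redactSensitive(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(redactSensitive);
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? "[REDACTED]"
        : redactSensitive(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

const safeBody = redactSensitive({
  user: "abc",
  password: "hunter2",
  payment: { cardNumber: "4111111111111111", amount: 42 },
});
console.log(JSON.stringify(safeBody));
// {"user":"abc","password":"[REDACTED]","payment":{"cardNumber":"[REDACTED]","amount":42}}
```

A deny-list only catches the names you thought of, so new sensitive fields still need a review step when they are introduced.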

Reading and debugging with logs

Producing structured logs is half the job. Reading them quickly, under pressure, during an incident is the other half, and it is a skill worth practising before you need it. The structure mandated above is what makes the techniques below possible.

Start from the symptom and work backward. Your entry point is an error, an alert, or a user report. The fastest path from symptom to cause is the correlation ID: take it from the failing request (the alert payload, the error response, or by looking up the user's recent activity) and filter every service's logs to that single ID. That reconstructs the entire request lifecycle in one view.

Filter before you read. A raw log stream is unreadable. Narrow it: time window first (when did this start?), then service, then level. "This correlationId, this five-minute window, level warn and above" turns a firehose into a story you can follow.

Read the first error, not the last. In a cascade, the last error is usually a symptom (a downstream timeout, a null returned by an already-failed call). The first error in the time-ordered slice is usually the cause. Scroll up to where things began to diverge from the normal info flow, not down to where the system finally gave up.
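
The two techniques above (narrow the stream, then read the first error) can be sketched in code. This assumes log entries have already been exported as objects with the mandatory fields; `sliceLogs`, `firstError`, and the sample entries are hypothetical and stand in for whatever query your log platform provides.

```typescript
type Level = "debug" | "info" | "warn" | "error";
const LEVEL_RANK: Record<Level, number> = { debug: 0, info: 1, warn: 2, error: 3 };

interface Entry {
  timestamp: string; // ISO 8601, UTC
  level: Level;
  service: string;
  correlationId: string;
  message: string;
}

interface LogFilter {
  correlationId: string;
  from: string; // window start, ISO 8601
  to: string; // window end, ISO 8601
  minLevel: Level;
}

// Keep one correlation ID, one time window, one minimum level; return in time order.
function sliceLogs(entries: Entry[], filter: LogFilter): Entry[] {
  const from = Date.parse(filter.from);
  const to = Date.parse(filter.to);
  return entries
    .filter((e) => e.correlationId === filter.correlationId)
    .filter((e) => Date.parse(e.timestamp) >= from && Date.parse(e.timestamp) <= to)
    .filter((e) => LEVEL_RANK[e.level] >= LEVEL_RANK[filter.minLevel])
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}

// The first error in the time-ordered slice is the usual starting point.
function firstError(slice: Entry[]): Entry | undefined {
  return slice.find((e) => e.level === "error");
}

// Made-up sample data for illustration.
const sample: Entry[] = [
  { timestamp: "2026-05-27T14:32:03.000Z", level: "error", service: "api-gateway", correlationId: "req_1", message: "Upstream timeout" },
  { timestamp: "2026-05-27T14:32:01.000Z", level: "info", service: "api-gateway", correlationId: "req_1", message: "Request received" },
  { timestamp: "2026-05-27T14:32:01.200Z", level: "error", service: "orders-service", correlationId: "req_1", message: "Database connection lost" },
  { timestamp: "2026-05-27T14:32:02.000Z", level: "error", service: "api-gateway", correlationId: "req_2", message: "Unrelated failure" },
  { timestamp: "2026-05-27T14:40:00.000Z", level: "error", service: "api-gateway", correlationId: "req_1", message: "Outside the window" },
];

const slice = sliceLogs(sample, {
  correlationId: "req_1",
  from: "2026-05-27T14:30:00.000Z",
  to: "2026-05-27T14:35:00.000Z",
  minLevel: "warn",
});
console.log(firstError(slice)?.message); // Database connection lost
```

In the sample, the gateway timeout is the last error and the symptom; the lost database connection is the first error and the likely cause.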

Reading a stack trace: find your own code. The top frames are often deep inside NestJS, a library, or Node internals. Scan downward to the first frame that points at a file in your own codebase. That line is almost always where the fix lives. The exception type and message tell you what broke; that frame tells you where.
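
The "scan down to your own code" step can be expressed as a small function. It is a sketch under two assumptions: frames from dependencies contain `node_modules`, and Node internals contain `node:`. The sample stack trace is invented for the example.

```typescript
// Returns the first stack frame that is neither a dependency nor a Node internal.
function firstOwnFrame(stack: string): string | undefined {
  return stack
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("at ")) // skip the "Error: message" header line
    .find((line) => !line.includes("node_modules") && !line.includes("node:"));
}

// Hypothetical stack trace: library frames on top, own code in the middle.
const sampleStack = [
  "Error: connection refused",
  "    at Connection.open (/app/node_modules/some-lib/lib/connection.js:10:5)",
  "    at Pool.acquire (/app/node_modules/some-lib/lib/pool.js:88:12)",
  "    at OrdersService.create (/app/src/orders/orders.service.ts:42:17)",
  "    at OrdersController.create (/app/src/orders/orders.controller.ts:18:31)",
  "    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)",
].join("\n");

console.log(firstOwnFrame(sampleStack));
// at OrdersService.create (/app/src/orders/orders.service.ts:42:17)
```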

Use levels as triage signals. error means act. A rising warn rate means investigate the trend before it becomes an error. info traces the normal flow, so you can see exactly where execution diverged. If you are trying to debug a production issue and the info logs you need are not there, that is a logging gap to fix in the code, not a reason to reach for console.log.

Correlate across services. For anything spanning more than one service, the correlation ID ties the frontend request, the API, and every downstream call into one ordered timeline. It is the same ID distributed tracing relies on (covered below): logs give you the detail at each step, traces give you the shape of the whole.

Learn your platform's query syntax. Whatever log aggregation platform the project uses, filtering by field, level, and time window is the core skill. Knowing it cold is the difference between a two-minute diagnosis and a two-hour one when production is down.

Anti-patterns to refuse:

console.log debugging in production. It is unstructured, unleveled, and unsearchable, and it ships noise to everyone reading the stream. If you need more visibility, add a structured log line at the right level.
Logging and never reading. A log line that has never helped anyone diagnose anything is cost without benefit, and it dilutes the lines that matter. Log what you would actually want to see during an incident.
info-level noise. Over-logging at info makes the stream unreadable and trains people to ignore it, the logging equivalent of alert fatigue. Record the events worth recording, not every step of execution.

The bar to hold yourself to: from logs alone, you can reconstruct what happened to a single request, in order, across every service it touched, without attaching a debugger or asking the user to reproduce it. If you cannot, the gap is in what you log. Fix that before the next incident, not during it.

Metrics collection

Metrics tell you whether the system is healthy before users tell you it is not. Every production service must expose the following baseline metrics.

The four golden signals (from Google SRE):

Signal	What it measures	How to collect
Latency	Time to serve a request	Histogram of response times by endpoint and status code
Traffic	Demand on the system	Request count per second by endpoint
Errors	Rate of failed requests	Count of 5xx responses, unhandled exceptions, failed health checks
Saturation	How close to capacity	CPU usage, memory usage, database connection pool utilization, queue depth

Implementation: Use the Prometheus client via @willsoto/nestjs-prometheus to expose metrics on a /metrics endpoint. Use the cloud provider's managed metrics service (GCP Cloud Monitoring, AWS CloudWatch) to scrape and store them. For custom business metrics (orders processed, jobs completed), create explicit counters and histograms; these are often more useful for incident diagnosis than infrastructure metrics.

Dashboards: Every production project must have a dashboard showing the four golden signals. Use Grafana, GCP Cloud Monitoring dashboards, or Datadog; the tool matters less than the existence of the dashboard. Link it in the project's Confluence space and review it at sprint demos.

Dashboard minimum requirements: request rate and error rate over time (1h, 6h, 24h, 7d), latency percentiles (p50, p95, p99) per endpoint, database connection pool usage, CPU and memory utilization per service, and queue depth or active connections where applicable.
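
For readers unsure what p50, p95, and p99 mean, here is a minimal sketch that computes them from raw latency samples using the nearest-rank method. This is for intuition only: a metrics backend normally estimates percentiles from histogram buckets rather than raw samples, and the sample values below are made up.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile of an empty sample set is undefined");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  const sorted = [...samples].sort((a, b) => a - b); // copy so the input is not mutated
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1]!;
}

// Ten made-up response times in milliseconds, one of them a slow outlier.
const latenciesMs = [120, 80, 95, 300, 110, 105, 90, 2000, 100, 115];

console.log(percentile(latenciesMs, 50)); // 105
console.log(percentile(latenciesMs, 95)); // 2000
```

The median looks healthy at 105 ms while p95 exposes the 2-second outlier, which is why the dashboard requirement asks for percentiles rather than an average.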

Distributed tracing

When a user request traverses multiple services (API gateway, backend service, database, third-party API, cache), a single log entry per service is not enough to understand latency or failure. Distributed tracing connects these disparate log entries into a single trace.

Use OpenTelemetry. It is the industry standard, vendor-neutral, and integrates with every major observability backend (GCP Cloud Trace, Jaeger, Datadog, Grafana Tempo). The OpenTelemetry SDK for Node.js provides auto-instrumentation for HTTP, Express, NestJS, PostgreSQL (pg), and Redis. Initialize the SDK before the NestJS app starts using @opentelemetry/sdk-node with getNodeAutoInstrumentations().

Connecting traces to logs: Include the trace ID in every log entry via @opentelemetry/instrumentation-pino. When you see an error in the logs, you can jump directly to the full trace showing every service hop, database query, and external call that request made.

Alerting strategy

Alerts are the mechanism that turns observability into action. A well-designed alerting strategy wakes the right person for the right reason. A poorly designed one either misses real incidents or cries wolf until the team ignores it entirely.

Severity levels for alerts:

Severity	Definition	Response time	Notification channel
P1: Critical	Service is down or data loss is occurring. Users are blocked.	Immediate (< 15 min)	Phone call / PagerDuty + `#[project-name]-alerts` + stakeholder notification
P2: High	Service is degraded. Some users are affected. Core functionality is impaired.	Within 1 hour	`#[project-name]-alerts` + on-call engineer notification
P3: Medium	Non-critical issue. Performance degraded but users can work around it.	Within business hours	`#[project-name]-alerts`
P4: Low	Anomaly detected. No immediate user impact but worth investigating.	Next sprint	Jira ticket created automatically
P5: Trivial	Cosmetic or informational. No user impact and no investigation needed.	Backlog (no SLA)	Jira ticket (backlog), no alert fired

Alerting rules:

Alert on symptoms, not causes. "Error rate exceeds 5%" is a symptom. "Database CPU is high" is a cause. Alert on the symptom; investigate the cause.
Every alert must link to a runbook. An alert without a runbook is a puzzle dropped on someone at 2am.
Set thresholds based on baselines. If your p99 latency is normally 200ms, an alert at 1000ms stays silent until latency has already degraded 5x. Alert at 2-3x the normal baseline.
Use alert grouping and deduplication. Five alerts about the same database being down should produce one notification, not five.
Review alert volume monthly. If an alert fires weekly without requiring action, fix the underlying issue or adjust the threshold. Alert fatigue is the biggest threat to effective incident response.
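
The grouping rule can be illustrated with a short sketch that collapses alerts about the same resource into one notification. Grouping by a `resource` field is an assumption made for the example; alerting tools usually let you choose the grouping labels, and the type and function names here are hypothetical.

```typescript
type AlertSeverity = "P1" | "P2" | "P3" | "P4" | "P5";

interface Alert {
  name: string;
  resource: string; // what the alert is about, e.g. a database instance
  severity: AlertSeverity;
}

interface GroupedNotification {
  resource: string;
  severity: AlertSeverity; // most severe alert in the group
  alertNames: string[];
}

// Collapse alerts that share a resource into a single notification.
function groupAlerts(alerts: Alert[]): GroupedNotification[] {
  const groups = new Map<string, GroupedNotification>();
  for (const alert of alerts) {
    const existing = groups.get(alert.resource);
    if (existing === undefined) {
      groups.set(alert.resource, {
        resource: alert.resource,
        severity: alert.severity,
        alertNames: [alert.name],
      });
      continue;
    }
    existing.alertNames.push(alert.name);
    // "P1" sorts before "P2" as a string, so the smaller string is the more severe level.
    if (alert.severity < existing.severity) {
      existing.severity = alert.severity;
    }
  }
  return Array.from(groups.values());
}

// Made-up alerts: five about one database, one about a cache.
const firing: Alert[] = [
  { name: "Connection pool exhausted", resource: "orders-db", severity: "P2" },
  { name: "Health check failure", resource: "orders-db", severity: "P1" },
  { name: "Query latency high", resource: "orders-db", severity: "P2" },
  { name: "Replication lag", resource: "orders-db", severity: "P3" },
  { name: "Error rate (5xx)", resource: "orders-db", severity: "P1" },
  { name: "Memory usage", resource: "cache", severity: "P2" },
];

const notifications = groupAlerts(firing);
console.log(notifications.length); // 2
```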

Essential alerts for every S&P project:

What to alert on	Threshold (adjust to baseline)	Severity
Error rate (5xx)	> 5% of requests for 5 min	P1
Health check failure	3 consecutive failures	P1
Response latency p95	> 3x normal baseline for 10 min	P2
Database connection pool	> 80% utilization for 5 min	P2
Disk/storage usage	> 85% capacity	P2
Certificate expiry	< 14 days	P3
Dependency health check	External API unreachable for 5 min	P2
Queue depth	Growing consistently for 15 min	P3
Memory usage	> 90% for 10 min	P2
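
Most thresholds in the table combine a level with a duration ("> 5% for 5 min"). The sketch below shows one way to evaluate such a condition over per-minute buckets, so a single bad minute does not page anyone. The bucket shape and function name are assumptions; in practice the alerting backend evaluates this for you.

```typescript
// Request counts for one minute of traffic.
interface MinuteBucket {
  total: number;
  errors5xx: number;
}

// True only when every one of the last `minutes` buckets exceeds the threshold.
function errorRateSustained(
  buckets: MinuteBucket[],
  threshold: number, // fraction, e.g. 0.05 for 5%
  minutes: number,
): boolean {
  if (minutes <= 0) {
    throw new RangeError("minutes must be positive");
  }
  if (buckets.length < minutes) {
    return false; // not enough data to claim the condition held for the whole window
  }
  return buckets
    .slice(-minutes)
    .every((b) => b.total > 0 && b.errors5xx / b.total > threshold);
}

// Made-up data: five consecutive minutes at 6% errors.
const sustained: MinuteBucket[] = [
  { total: 1000, errors5xx: 60 },
  { total: 1000, errors5xx: 60 },
  { total: 1000, errors5xx: 60 },
  { total: 1000, errors5xx: 60 },
  { total: 1000, errors5xx: 60 },
];
// Same window, but one minute recovered to 2%.
const spiky: MinuteBucket[] = sustained.map((b, i) =>
  i === 2 ? { total: 1000, errors5xx: 20 } : b,
);

console.log(errorRateSustained(sustained, 0.05, 5)); // true
console.log(errorRateSustained(spiky, 0.05, 5)); // false
```

Treating minutes with no traffic as "not exceeding" is a design choice in this sketch; a service that receives no requests at all is better caught by the health check alert.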

Health checks and synthetic monitoring

Health check endpoints are mandatory for every production service. They serve two purposes: infrastructure-level liveness probes (Cloud Run, Kubernetes) and application-level readiness checks.

Implement two endpoints:

/health/live: Returns 200 if the process is running. No dependency checks. Used by the infrastructure to determine if the container should be restarted.
/health/ready: Returns 200 if the service can handle requests. Checks database connectivity, cache availability, and critical third-party dependencies. Used by the load balancer to route traffic.

For the NestJS HealthController implementation (@nestjs/terminus liveness and readiness checks), see Backend Reference: Health check endpoints.
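
To show the difference between the two endpoints without any framework, here is a minimal sketch of the logic behind them. It is not the @nestjs/terminus implementation: the names are hypothetical, and returning 503 when a dependency fails is an assumption (the standard above only requires 200 on success).

```typescript
// A named dependency probe that resolves when healthy and rejects when not.
interface DependencyCheck {
  name: string;
  check: () => Promise<void>;
}

interface HealthResult {
  statusCode: 200 | 503;
  failed: string[]; // names of dependencies that failed their check
}

// Liveness: the process is running and able to answer. No dependency checks.
function checkLive(): HealthResult {
  return { statusCode: 200, failed: [] };
}

// Readiness: every critical dependency must pass before traffic is routed here.
async function checkReady(deps: DependencyCheck[]): Promise<HealthResult> {
  const outcomes = await Promise.all(
    deps.map(async (dep): Promise<string | null> => {
      try {
        await dep.check();
        return null;
      } catch {
        return dep.name;
      }
    }),
  );
  const failed = outcomes.filter((name): name is string => name !== null);
  return { statusCode: failed.length === 0 ? 200 : 503, failed };
}

// Stand-in probes: the database answers, the cache does not.
const exampleDeps: DependencyCheck[] = [
  { name: "database", check: async () => {} },
  {
    name: "cache",
    check: async () => {
      throw new Error("connection refused");
    },
  },
];

checkReady(exampleDeps).then((result) => {
  console.log(result.statusCode, result.failed);
});
```

Keeping dependency checks out of the liveness endpoint matters: if liveness failed whenever the database was down, the platform would restart healthy containers during a database outage and make recovery slower.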

Synthetic monitoring runs automated checks against production endpoints every 1-5 minutes from external locations. This detects outages before users report them and catches issues that internal health checks miss (DNS failures, CDN problems, regional connectivity). Use GCP Uptime Checks, AWS CloudWatch Synthetics, or a third-party service (Checkly, Better Uptime). At minimum, monitor the application's main URL, the /health/ready endpoint, and critical API endpoints (authentication, core business flows).

Incident response process

This section is prescriptive. Follow it exactly. The cost of improvising during an incident is too high: miscommunication, delayed response, and repeated mistakes. The process below is the standard for every S&P project with production users.

Severity classification

When an incident is detected, classify it immediately. Do not spend time debating severity: pick the closest match and adjust later if needed. It is always better to overclassify and downgrade than to underclassify and scramble.

Severity	Definition	Examples	Target resolution
P1	Complete outage or data loss. All users affected. Business operations halted.	Production database down, authentication service unreachable, data corruption detected, payment processing completely failed	1 hour
P2	Major degradation. Core functionality impaired for a significant portion of users.	API latency 10x normal, intermittent 500 errors on critical endpoints, background jobs not processing, search completely broken	4 hours
P3	Partial degradation. Non-critical functionality affected. Users can work around the issue.	PDF export failing, notification emails delayed, analytics dashboard not updating, non-critical third-party integration down	24 hours
P4	Minor issue. Cosmetic or low-impact. No significant user impact.	Incorrect error message text, minor UI rendering issue in one browser, slow but functional non-critical endpoint	Next sprint
P5	Trivial. Cosmetic or informational only. No user impact and no workaround needed.	Typo in internal tooling, deprecated dependency warning in logs, minor copy fix	Backlog (no SLA)

Escalation rule: If a P2 is not resolved within 2 hours, escalate to P1 procedures. If a P3 is not resolved within 8 hours and user complaints are increasing, escalate to P2.
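
The escalation rule is mechanical enough to write down as code, which can be useful if you want a bot to nudge the incident thread. A minimal sketch, with hypothetical names, that treats "not resolved within N hours" as "still open at N hours":

```typescript
type Severity = "P1" | "P2" | "P3" | "P4" | "P5";

interface OpenIncident {
  severity: Severity;
  hoursOpen: number;
  complaintsIncreasing: boolean;
}

// Returns the severity whose procedures should apply right now.
function escalatedSeverity(incident: OpenIncident): Severity {
  if (incident.severity === "P2" && incident.hoursOpen >= 2) {
    return "P1"; // P2 unresolved after 2 hours: follow P1 procedures
  }
  if (
    incident.severity === "P3" &&
    incident.hoursOpen >= 8 &&
    incident.complaintsIncreasing
  ) {
    return "P2"; // P3 unresolved after 8 hours with rising complaints
  }
  return incident.severity;
}

console.log(escalatedSeverity({ severity: "P2", hoursOpen: 2.5, complaintsIncreasing: false })); // P1
console.log(escalatedSeverity({ severity: "P3", hoursOpen: 9, complaintsIncreasing: false })); // P3
```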

Detection and triage

Incidents are detected through three channels: (1) automated alerts firing on thresholds (the preferred detection method); (2) user reports via Slack, email, or support channels (triage these immediately; they are often more severe than they appear); (3) internal discovery during development, deployment, or routine monitoring.

Triage checklist (do this within the first 5 minutes):

Confirm the incident is real (not a false alarm, not a monitoring gap).
Assess scope: how many users are affected? Which functionality?
Assign severity using the classification table above.
Determine if a recent deployment is the likely cause. If yes, prepare to roll back.
Open an incident thread in the project's #[project-name]-intern channel (or #engineering if the issue is org-wide).

Incident commander role

For P1 and P2 incidents, an Incident Commander (IC) must be designated within 15 minutes. The IC is not necessarily the most senior engineer; they are the person who coordinates the response. The IC does not debug; they direct.

IC responsibilities: Own the incident until resolved or explicitly handed off. Assign roles (who debugs, who communicates, who prepares rollback). Make decisions when the team is stuck: "we are rolling back in 5 minutes unless someone has a fix" is an IC decision. Manage the communication cadence (see below). Track the timeline for the post-incident review. Declare the incident resolved and confirm with stakeholders.

Who becomes IC: The on-call engineer is the default IC for P1/P2. If they are better suited to debugging (they know the affected code), they should hand IC to another available engineer. For P3/P4/P5, the engineer who triaged the incident handles it directly; no formal IC is needed.

Communication protocol

This is the part teams most often get wrong. Silence during an incident is interpreted as incompetence, regardless of how hard the team is working on a fix. Structured communication is mandatory.

P1 communication cadence:

Time	Action	Channel
0-5 min	Post initial incident notice	`#[project-name]-intern` channel
0-15 min	Notify project tech lead and CTO	Direct Slack message or phone call
Every 30 min	Post status update (what we know, what we're doing, next update time)	`#[project-name]-intern` + `#[project-name]-client` if the client is affected
On resolution	Post resolution notice with brief summary	`#[project-name]-intern` + email to stakeholders
Within 48h	Conduct post-incident review	Scheduled meeting

P2 communication cadence:

Time	Action	Channel
0-15 min	Post initial incident notice	`#[project-name]-intern` channel
0-30 min	Notify project tech lead	Direct Slack message
Every 60 min	Post status update	`#[project-name]-intern`
On resolution	Post resolution notice	`#[project-name]-intern`
Within 1 week	Conduct post-incident review	Async document or meeting

P3/P4: Post in the project's #[project-name]-intern channel. No formal cadence required. Document resolution for the team.

Org-wide issues (a CVE in a shared dependency, a cloud-provider outage, anything spanning multiple projects): announce in #engineering so every team sees it, on top of the affected project channels.

Incident notification template (first message):

INCIDENT: [P level]
Service: [affected service/project]
Impact: [what users are experiencing]
Status: Investigating
Incident Commander: [name]
Next update: [time, e.g., "in 30 minutes"]

Status update template:

UPDATE ([P level]) [service]
Current status: [Investigating / Identified / Fixing / Monitoring]
What we know: [1-2 sentences on root cause or current hypothesis]
What we are doing: [current action]
Next update: [time]

Resolution template:

RESOLVED ([P level]) [service]
Duration: [start time to resolution time]
Impact: [summary of user impact]
Root cause: [1-2 sentences]
Resolution: [what fixed it]
Follow-up: Post-incident review scheduled for [date]
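
If status updates are posted by a bot or a script, the status update template and the cadence tables can be combined as in the sketch below. The function and type names are hypothetical, and printing the next update as an absolute UTC time is an assumption; the template itself only asks for a time.

```typescript
// Update intervals from the P1 and P2 communication cadence tables.
const UPDATE_INTERVAL_MINUTES = { P1: 30, P2: 60 } as const;

type IncidentStatus = "Investigating" | "Identified" | "Fixing" | "Monitoring";

interface StatusUpdate {
  severity: "P1" | "P2";
  service: string;
  status: IncidentStatus;
  whatWeKnow: string;
  whatWeAreDoing: string;
  postedAt: Date;
}

// Fills in the status update template, computing the next update time from the cadence.
function renderStatusUpdate(update: StatusUpdate): string {
  const intervalMs = UPDATE_INTERVAL_MINUTES[update.severity] * 60 * 1000;
  const next = new Date(update.postedAt.getTime() + intervalMs);
  const nextHhMm = next.toISOString().slice(11, 16); // "HH:MM" portion of the UTC timestamp
  return [
    `UPDATE (${update.severity}) ${update.service}`,
    `Current status: ${update.status}`,
    `What we know: ${update.whatWeKnow}`,
    `What we are doing: ${update.whatWeAreDoing}`,
    `Next update: ${nextHhMm} UTC`,
  ].join("\n");
}

// Placeholder incident details.
const message = renderStatusUpdate({
  severity: "P1",
  service: "orders-api",
  status: "Identified",
  whatWeKnow: "Connection pool exhausted after the latest deployment.",
  whatWeAreDoing: "Rolling back to the previous release.",
  postedAt: new Date("2026-05-27T14:00:00.000Z"),
});
console.log(message);
```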

Resolution workflow

Once the incident is triaged and the IC is assigned, follow this resolution sequence.

Step 1: Assess recent changes. Most incidents are caused by recent deployments. Check: was there a deployment in the last 2 hours? Were any infrastructure changes made? Did a third-party dependency go down? If a deployment is the likely cause and the fix is not immediately obvious, roll back first, investigate second. A rollback that restores service in 5 minutes is better than a fix that takes 45 minutes.

Step 2: Gather diagnostic data. Pull logs filtered by the affected time window and correlation IDs. Check the dashboard: which golden signal went red first? Check traces for the affected endpoints. Check database metrics (connection pool, slow queries, locks). Check external dependency status pages.

Step 3: Identify and apply the fix. For code defects, deploy through the normal pipeline (or use the hotfix procedure for P1 if the pipeline is too slow). For infrastructure issues, apply the fix directly (scale up, restart, failover). For third-party outages, implement the fallback (circuit breaker, degraded mode) or communicate the dependency to stakeholders.

Step 4: Verify resolution. Confirm the fix via the same signals that detected the incident. Confirm with the reporter or affected users. Monitor for 30 minutes after resolution to ensure the fix holds.

Step 5: Declare resolved. The IC posts the resolution message and schedules the post-incident review.

Post-incident review (blameless postmortem)

Every P1 and P2 incident must have a post-incident review. P3 incidents get a review if the team decides the incident revealed a systemic issue worth examining.

The review is blameless. The goal is to understand what happened and what to change, not who to blame. People make mistakes because systems allow them to. Fix the system.

Post-incident review template:

# Post-Incident Review: [Incident title]

| Field | Value |
|-------|-------|
| **Date of incident** | YYYY-MM-DD |
| **Duration** | [start time to end time, total duration] |
| **Severity** | P1 / P2 / P3 |
| **Incident Commander** | [name] |
| **Author** | [name] |
| **Review date** | YYYY-MM-DD |

## Summary

[2-3 sentences: what happened, who was affected, how it was resolved.]

## Timeline

| Time (UTC) | Event |
|------------|-------|
| HH:MM | [First signal: alert fired / user report received] |
| HH:MM | [Incident declared, IC assigned] |
| HH:MM | [Key investigation step or discovery] |
| HH:MM | [Fix applied / rollback executed] |
| HH:MM | [Service confirmed restored] |
| HH:MM | [Incident declared resolved] |

## Root cause

[What caused the incident? Be specific. "The database ran out of connections
because the connection pool was sized for 20 connections but the service was
scaled to 8 replicas, each opening 20 connections, exceeding the database
limit of 100."]

## Impact

- Users affected: [number or percentage]
- Functionality affected: [what was broken]
- Data impact: [any data loss or corruption: state explicitly if none]
- Duration of user-facing impact: [time]

## What went well

- [Things that worked during the response]

## What went wrong

- [Things that failed or slowed down the response]

## Action items

| Action | Owner | Due date | Jira ticket |
|--------|-------|----------|-------------|
| [Specific remediation action] | [name] | [date] | [ticket ID] |
| [Process improvement] | [name] | [date] | [ticket ID] |
| [Monitoring improvement] | [name] | [date] | [ticket ID] |

Post-incident review rules: Hold the review within 48 hours of a P1 and within 1 week of a P2; memories fade fast. Every action item gets a Jira ticket with an owner and a due date; action items without tickets do not get done. Share the review document with the full engineering team, not just the people involved. Store reviews in the project's Confluence space or in a dedicated #post-incidents Confluence space accessible to all engineers.

On-call expectations

On-call is not optional for projects with production users. The specific rotation depends on team size and project criticality, but the expectations below apply universally.

On-call responsibilities

The on-call engineer is the first responder for production issues during their rotation. You must be reachable within 15 minutes (phone, Slack, PagerDuty), have a working laptop with VPN and deployment access, and triage incoming alerts and user reports. For P1/P2, you become the IC or hand off to someone better positioned. Handle P3/P4 during business hours; you do not wake up for a P4 at 3am. Document anything you fix or discover in the team's channel.

On-call rotation

Rotate weekly: longer rotations cause burnout, shorter ones cause context-switching overhead. The rotation must be documented and visible (shared Google Calendar, PagerDuty schedule, or Slack status). For small teams (2-3 engineers), alternate between primary on-call and backup. The backup is contacted if the primary is unreachable within 15 minutes. On-call handoff happens at a consistent time (e.g., Monday 10am) with a summary posted in the team channel: what happened, any ongoing issues, any alerts that need attention.
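
A weekly rotation with a primary and a backup is simple to compute, which helps when the schedule lives in a script rather than a paging tool. In this sketch the backup is the engineer who is primary the following week; that pairing, the function name, and the engineer names are assumptions for illustration.

```typescript
const MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000;

interface OnCallPair {
  primary: string;
  backup: string;
}

// Who is on call at `at`, given a roster and the time of the first handoff.
function onCallFor(engineers: string[], rotationStart: Date, at: Date): OnCallPair {
  if (engineers.length < 2) {
    throw new Error("a primary and a backup require at least two engineers");
  }
  const elapsedMs = at.getTime() - rotationStart.getTime();
  if (elapsedMs < 0) {
    throw new RangeError("the requested time is before the rotation started");
  }
  const week = Math.floor(elapsedMs / MS_PER_WEEK); // completed weeks since the first handoff
  const primaryIndex = week % engineers.length;
  const backupIndex = (primaryIndex + 1) % engineers.length;
  return { primary: engineers[primaryIndex]!, backup: engineers[backupIndex]! };
}

// Hypothetical roster; handoff every Monday at 10:00 UTC, starting 5 January 2026.
const roster = ["alex", "bea", "chris"];
const firstHandoff = new Date("2026-01-05T10:00:00.000Z");

console.log(onCallFor(roster, firstHandoff, new Date("2026-01-14T09:00:00.000Z")));
// second week: primary "bea", backup "chris"
```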

On-call health

On-call should not be punishing. If the on-call engineer is woken up more than twice in a week, the system has reliability problems that need engineering attention (not just a more resilient on-call engineer). Track on-call interrupt volume and prioritise reliability work when it trends up. Compensate on-call work appropriately: out-of-hours incident response is real work. After a severe P1 incident, the on-call engineer should take compensatory time.

Runbooks

A runbook is a step-by-step guide for diagnosing and resolving a specific operational issue. Every alert must link to a runbook. A runbook that does not exist is an engineer reading logs at 2am trying to figure out what to do from scratch.

When to create a runbook

Create a runbook when you set up a new alert, when you resolve an incident with non-obvious resolution steps, when a process requires more than 3 steps to execute (deployment, database migration, credential rotation), or when you are the only person who knows how to fix something.

Runbook template

# Runbook: [Alert name or issue description]

## When this triggers
[What alert fires, or what symptoms indicate this issue.]

## Impact
[What users experience when this happens.]

## Diagnosis steps
1. [Check specific dashboard/metric]
2. [Run specific query or command]
3. [Look for specific log pattern]

## Resolution steps
1. [Step-by-step fix with exact commands]
2. [Expected outcome after each step]
3. [How to verify the fix worked]

## Escalation
[Who to contact if the steps above don't resolve the issue.]

## History
[Previous occurrences and any context from post-incident reviews.]

Runbook hygiene

Store runbooks in the project's Confluence space, linked from the alert configuration. Review them quarterly; a wrong runbook is worse than no runbook. When a runbook is used during an incident, update it afterward with anything that was missing or unclear. Runbooks should be executable by any engineer on the team, not just the author.

Error budgets and SLOs

Service Level Objectives (SLOs) define "good enough" in measurable terms. Error budgets are the math that connects SLOs to engineering decisions.

Defining SLOs

An SLO is a target for a specific metric over a specific time window. The most common SLOs:

SLO type	Example	Measurement
Availability	99.9% of requests return a non-5xx response over 30 days	`(successful requests / total requests) * 100`
Latency	95% of requests complete within 500ms over 30 days	p95 latency histogram
Correctness	99.99% of data processing jobs complete without error over 30 days	`(successful jobs / total jobs) * 100`

How to set SLOs: Start with what your users actually need, not what sounds impressive. A 99.99% availability SLO requires redundancy, automated failover, and operational maturity that most projects do not have. For most S&P projects, 99.9% availability (roughly 8.8 hours of downtime per year) is a reasonable starting point. SLOs must be measurable with your existing observability tooling; an SLO you cannot measure is a wish, not an objective. Review SLOs quarterly and tighten them as the system matures.

Error budgets

The error budget is the complement of the SLO. A 99.9% availability SLO means you have a 0.1% error budget: roughly 43 minutes of downtime per month.
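
The arithmetic behind that figure is worth seeing once. A minimal sketch, assuming a 30-day month and a 365-day year (the function name is illustrative):

```typescript
// Allowed downtime, in minutes, for an availability SLO over a window of whole days.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  if (sloPercent < 0 || sloPercent > 100) {
    throw new RangeError("sloPercent must be between 0 and 100");
  }
  const budgetFraction = (100 - sloPercent) / 100; // 99.9% SLO leaves a 0.1% budget
  return budgetFraction * windowDays * 24 * 60;
}

// 0.1% of 30 days (43,200 minutes) is 43.2 minutes.
console.log(errorBudgetMinutes(99.9, 30).toFixed(1)); // "43.2"

// 0.1% of 365 days is 525.6 minutes, or 8.76 hours.
console.log((errorBudgetMinutes(99.9, 365) / 60).toFixed(2)); // "8.76"

// Tightening to 99.99% cuts the monthly budget to about 4.3 minutes.
console.log(errorBudgetMinutes(99.99, 30).toFixed(2)); // "4.32"
```

Each extra "nine" divides the budget by ten, which is why a 99.99% target is an operational commitment rather than a small adjustment.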

How to use error budgets: When the budget is healthy, the team has room to ship features and take calculated risks with deployments. When the budget is depleted, the team shifts focus to reliability work (fixing flaky tests, improving monitoring, hardening infrastructure). Error budgets make the trade-off between velocity and reliability explicit. Instead of arguing about whether to ship a risky change, look at the budget. This is not about punishing the team for outages; it is about making informed decisions with real data.
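
"Look at the budget" can be made concrete for a request-based availability SLO. The sketch below computes how much budget is left in the current window from request counts; the names and the example numbers are illustrative, and what the team does at a given level of remaining budget is a decision for the team, not something this code encodes.

```typescript
interface BudgetStatus {
  allowedFailures: number; // failures the SLO tolerates for this traffic volume
  remainingFraction: number; // 1 = untouched, 0 = fully spent, negative = overspent
  exhausted: boolean;
}

// Error budget consumption for a request-based availability SLO.
function budgetStatus(
  sloPercent: number,
  totalRequests: number,
  failedRequests: number,
): BudgetStatus {
  const allowedFailures = (totalRequests * (100 - sloPercent)) / 100;
  let remainingFraction: number;
  if (allowedFailures > 0) {
    remainingFraction = 1 - failedRequests / allowedFailures;
  } else {
    // No budget at all (no traffic, or a 100% SLO): any failure overspends it.
    remainingFraction = failedRequests > 0 ? 0 : 1;
  }
  return { allowedFailures, remainingFraction, exhausted: remainingFraction <= 0 };
}

// 1,000,000 requests at 99.9% tolerate about 1,000 failures.
const healthy = budgetStatus(99.9, 1_000_000, 400); // about 60% of the budget left
const depleted = budgetStatus(99.9, 1_000_000, 1_200); // about 20% over budget

console.log(healthy.remainingFraction.toFixed(2), healthy.exhausted); // "0.60" false
console.log(depleted.remainingFraction.toFixed(2), depleted.exhausted); // "-0.20" true
```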

Critical thinking

Observability is an investment, not overhead. Teams that skip structured logging and metrics because "we'll add it later" always regret it: "later" arrives when production is on fire and there is no data to diagnose the problem. The cost of adding observability on day one is small. The cost of adding it during a P1 is immense.
Not every project needs the same depth. An internal tool with 10 users does not need the same alerting strategy as a client-facing platform with 50,000 users. Scale your observability investment to the risk profile. But structured logging and basic health checks are the minimum for any production service, no exceptions.
Alert fatigue is a system design problem, not a discipline problem. If your team ignores alerts, the solution is not to tell them to pay more attention. The solution is to fix the alerts: reduce false positives, tune thresholds to baselines, group related alerts, and delete alerts that never result in action.
Post-incident reviews are only valuable if action items get done. A blameless postmortem that produces 10 action items is useless if none of them are completed. Limit action items to 3-5 high-impact changes, assign them as sprint work, and track them to completion.
SLOs are a conversation tool, not a contract. The primary value of an SLO is that it creates a shared language between engineering, product, and the client about what "reliable" means. Use SLOs to have better conversations about trade-offs, not to create blame when they are missed.
On-call rotation must be sustainable. If on-call is miserable, the best engineers will leave or refuse to participate. Invest in reliability to make on-call quiet, provide compensatory time, and treat chronic on-call pain as a bug in the system, not a cost of doing business.

Checklist

For every production service

For every project with production users

Incident severity levels (P1-P5) are defined and understood by the team
The incident communication protocol is documented and accessible
The project's #[project-name]-alerts channel exists, and the team agrees which channel coordinates incidents (#[project-name]-intern, or #engineering for org-wide issues)
On-call rotation is defined, documented, and visible
The on-call engineer has deployment and rollback access
Post-incident review template is available in the project's Confluence space
Runbooks exist for common failure scenarios
SLOs are defined for the project's core user-facing functionality
The incident response process from Security is integrated with this process

After every incident

Post-incident review conducted within the required timeframe (48h for P1, 1 week for P2)
Root cause identified and documented
Action items created as Jira tickets with owners and due dates
Runbooks updated based on incident learnings
Alerts and thresholds reviewed and adjusted if needed
Post-incident review shared with the engineering team

AI tips

Diagnose log patterns. Paste structured log entries from an incident window and ask AI to identify the sequence of events, spot the earliest error signal, and suggest likely root causes. AI is effective at pattern matching across large log volumes.
Draft post-incident reviews. Provide the incident timeline and Slack messages from the incident channel. Ask AI to structure these into the post-incident review template. Review for accuracy. AI organizes information well but may misinterpret causation.
Generate runbooks from incident resolutions. After fixing a production issue, describe the diagnosis and fix steps and ask AI to produce a runbook following the template. This captures knowledge while it is fresh.
Tune alert thresholds. Export metric data for an alert that fires too often and ask AI to analyze the distribution and suggest a threshold that reduces false positives while catching real incidents.
Build dashboard queries. Describe what you want to monitor and which metrics backend you use (Prometheus, CloudWatch, GCP Monitoring). Ask AI to generate the PromQL or MQL query; these are syntactically tricky and AI handles the syntax well.
Analyse incident trends. Feed multiple post-incident reviews to AI and ask it to identify recurring themes: common root causes, services that fail most often, time-of-day patterns. This informs where to invest in reliability work.

Resources

S&P internal:

Security incident first-response: Security-specific incident containment and evidence preservation
Architecture & System Design: System context and container diagrams that inform observability design
Engineering Principles: "Write it down" and decision-making principles that underpin incident documentation

Observability tools:

OpenTelemetry: Vendor-neutral observability framework (traces, metrics, logs)
Pino / nestjs-pino: Structured JSON logging for NestJS
Prometheus client for Node.js: Metrics exposition library
Grafana: Dashboarding and visualization
GCP Cloud Logging / Cloud Monitoring / Cloud Trace: Managed observability for GCP

Incident management references:

Google SRE Book (Monitoring Distributed Systems): The four golden signals and alert design
Google SRE Book (Being On-Call): On-call expectations and sustainability
Google SRE Book (Postmortem Culture): Blameless postmortem practices
PagerDuty Incident Response Guide: Comprehensive incident response framework
Atlassian Incident Management Handbook: Incident commander role and communication

General references:

Microsoft Code-with-Engineering Playbook (Observability): Observability practices and patterns
12-Factor App (Logs): Treat logs as event streams
SLO Workbook (Google): Practical guide to implementing SLOs
