Observability & Incidents
You cannot fix what you cannot see. Every outage that lasts longer than it should shares a root cause: the team lacked visibility into what the system was doing. Logs existed but were unstructured and unsearchable. Metrics existed but nobody had set thresholds. Alerts existed but fired so often they were ignored. Observability is not a tool you install; it is a discipline you build into every service from day one. And when something breaks, the incident response process must be automatic, not improvised.
Why this matters
When a production system goes down at 2am or a client reports data they cannot access, the speed and quality of your response defines your team's credibility. S&P's value of Care means we treat our clients' systems with the same urgency we would treat our own. A 30-minute outage with clear communication is recoverable. A 4-hour outage with silence is a relationship-ending event.
Observability is also where Teamwork becomes operational. No single person should be the only one who knows how to diagnose a production issue. Structured logs, dashboards, runbooks, and a defined incident process ensure that anyone on the team can step in, understand the situation, and act, regardless of who built the feature or who is on call.
The standard
The three pillars
Observability rests on three complementary signals. Each answers a different question about your system. All three are required for production services.
Logs answer "what happened?" They are the narrative record of events: requests processed, errors thrown, jobs executed, state changes applied. Without structured logs, debugging is grep through noise.
Metrics answer "how is the system performing?" They are numeric measurements over time: request latency, error rate, CPU usage, queue depth, active connections. Without metrics, you cannot set alerts, cannot spot trends, and cannot answer "is this normal?"
Traces answer "where did the time go?" They follow a single request across service boundaries: from the API gateway through the backend to the database and back. Without traces, debugging latency in a multi-service system is guesswork.
A system with logs but no metrics cannot alert. A system with metrics but no logs cannot explain why an alert fired. A system with both but no traces cannot pinpoint which service in the chain is the bottleneck. You need all three.
Structured logging
Every S&P service must produce structured JSON logs. Unstructured text logs (console.log("user created")) are not searchable, not parseable, and not useful at scale. Structured logs are machine-readable and human-debuggable.
Mandatory log fields:
| Field | Purpose | Example |
|---|---|---|
| timestamp | When the event occurred (ISO 8601, UTC) | "2026-05-27T14:32:01.123Z" |
| level | Severity (error, warn, info, debug) | "error" |
| message | Human-readable event description | "Failed to process payment" |
| service | Which service produced the log | "api-gateway" |
| correlationId | Request-scoped ID for tracing across logs | "req_01JX..." |
| context | Additional structured data | { "userId": "abc", "orderId": "xyz" } |
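To make the field table concrete, here is a minimal sketch of one log entry expressed as a TypeScript type and serialized to a single JSON line. The `buildLogEntry` helper and the placeholder correlation ID are illustrative assumptions, not part of the S&P standard; in a real service the logger library produces this shape for you.

```typescript
// Severity levels from the logging standard.
type LogLevel = 'error' | 'warn' | 'info' | 'debug';

// The mandatory fields from the table above.
interface LogEntry {
  timestamp: string; // ISO 8601, UTC
  level: LogLevel;
  message: string;
  service: string;
  correlationId: string;
  context: Record<string, unknown>;
}

// Hypothetical helper: builds one entry. `now` is injectable so the output is reproducible.
function buildLogEntry(
  level: LogLevel,
  message: string,
  service: string,
  correlationId: string,
  context: Record<string, unknown> = {},
  now: Date = new Date(),
): LogEntry {
  return { timestamp: now.toISOString(), level, message, service, correlationId, context };
}

const entry = buildLogEntry(
  'error',
  'Failed to process payment',
  'api-gateway',
  'req_example', // placeholder ID for illustration
  { userId: 'abc', orderId: 'xyz' },
  new Date(Date.UTC(2026, 4, 27, 14, 32, 1, 123)), // month index 4 is May
);

// One JSON object per line keeps the log machine-parseable.
const line = JSON.stringify(entry);
console.log(line);
```

Because every entry is a flat JSON object with the same keys, a log backend can filter on `service`, `level`, or `correlationId` without any text parsing.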
NestJS logging standard:
Use Pino via nestjs-pino as the logger. Pino produces JSON by default, is significantly faster than Winston, and integrates with GCP Cloud Logging without transformation.
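// main.ts: configure structured logging
import { NestFactory } from '@nestjs/core';
import { Logger } from 'nestjs-pino';
import { AppModule } from './app.module';

async function bootstrap() {
  // bufferLogs holds startup logs until the Pino logger is attached
  const app = await NestFactory.create(AppModule, { bufferLogs: true });
  app.useLogger(app.get(Logger));
  await app.listen(process.env.PORT ?? 3000);
}
bootstrap();

// app.module.ts: add this to the imports array of AppModule.
// nestjs-pino exports the module as LoggerModule.
// Requires: import { LoggerModule } from 'nestjs-pino';
// generateULID is assumed to be a project helper (for example, a wrapper around the ulid package).
LoggerModule.forRoot({
  pinoHttp: {
    level: process.env.LOG_LEVEL || 'info',
    // Pretty-print locally; emit raw JSON everywhere else
    transport: process.env.NODE_ENV === 'development'
      ? { target: 'pino-pretty' }
      : undefined,
    // Reuse the caller's correlation ID when present, otherwise generate one
    genReqId: (req) => req.headers['x-correlation-id'] || generateULID(),
    serializers: {
      req: (req) => ({
        id: req.id, // keep the correlation ID on every request log
        method: req.method,
        url: req.url,
        userAgent: req.headers['user-agent'],
      }),
      res: (res) => ({ statusCode: res.statusCode }),
    },
    redact: ['req.headers.authorization', 'req.headers.cookie'],
  },
});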
Correlation IDs are non-negotiable. Every incoming request must be assigned a correlation ID (use the x-correlation-id header if present, generate one otherwise). This ID propagates through every log entry, every downstream service call, and every database query for that request. When a user reports an error, the correlation ID lets you reconstruct the entire request lifecycle across all services in seconds.
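The propagation rule can be sketched with Node's built-in `AsyncLocalStorage`, which keeps a value available to every asynchronous call made while handling one request. This is a simplified illustration under stated assumptions: `withCorrelationId` and `outboundHeaders` are hypothetical helper names, and `randomUUID` stands in for whatever ID generator the project uses.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Request-scoped store: code running inside `run` can read the current correlation ID.
const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

// Reuse the caller's ID when present, otherwise mint a new one.
function resolveCorrelationId(headers: Record<string, string | undefined>): string {
  return headers['x-correlation-id'] || randomUUID();
}

// Runs a request handler with the correlation ID bound to its async context.
function withCorrelationId<T>(
  headers: Record<string, string | undefined>,
  handler: () => T,
): T {
  return requestContext.run({ correlationId: resolveCorrelationId(headers) }, handler);
}

// Headers to attach to any downstream call so the next service logs the same ID.
function outboundHeaders(): Record<string, string> {
  const ctx = requestContext.getStore();
  return ctx ? { 'x-correlation-id': ctx.correlationId } : {};
}

const propagated = withCorrelationId({ 'x-correlation-id': 'req_example' }, () => outboundHeaders());
const generated = withCorrelationId({}, () => outboundHeaders());
console.log(propagated, generated);
```

The same store can feed the logger, so that each log line written during the request carries the ID without passing it through every function signature.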
Log levels: use them consistently:
| Level | When to use | Example |
|---|---|---|
| error | Something failed that should not have. Requires attention. | Database connection lost, payment processing failed, unhandled exception |
| warn | Something unexpected happened but the system recovered. | Retry succeeded after timeout, deprecated API endpoint called, rate limit approaching |
| info | Normal operational events worth recording. | Request completed, job finished, user authenticated, deployment started |
| debug | Detailed diagnostic information for development. | Query parameters, cache hit/miss, internal state transitions |
What never goes in logs: Passwords, tokens, API keys, credit card numbers, personal health information, or any data classified as sensitive PII. Use the redact option in Pino to strip sensitive headers automatically. For request/response bodies, log the shape (field names) but not the values of sensitive fields.
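One way to log the shape of a body without its sensitive values is a small recursive masking function, sketched below. The key list is an illustrative assumption; the real list should come from the project's data classification, and `redactForLog` is a hypothetical helper name.

```typescript
// Illustrative deny-list (lower-cased); the real list depends on the project's data classification.
const SENSITIVE_KEYS = new Set(['password', 'token', 'apikey', 'authorization', 'cardnumber']);

// Recursively copies a value, masking sensitive fields while keeping the field names.
function redactForLog(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redactForLog);
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redactForLog(inner);
    }
    return out;
  }
  return value;
}

const safeBody = redactForLog({
  username: 'demo',
  password: 'hunter2',
  payment: { cardNumber: '4111111111111111', currency: 'EUR' },
});

console.log(JSON.stringify(safeBody));
// {"username":"demo","password":"[REDACTED]","payment":{"cardNumber":"[REDACTED]","currency":"EUR"}}
```

The field names survive, so an engineer can still see which fields a request carried, but the values never reach the log store.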
Metrics collection
Metrics tell you whether the system is healthy before users tell you it is not. Every production service must expose the following baseline metrics.
The four golden signals (from Google SRE):
| Signal | What it measures | How to collect |
|---|---|---|
| Latency | Time to serve a request | Histogram of response times by endpoint and status code |
| Traffic | Demand on the system | Request count per second by endpoint |
| Errors | Rate of failed requests | Count of 5xx responses, unhandled exceptions, failed health checks |
| Saturation | How close to capacity | CPU usage, memory usage, database connection pool utilization, queue depth |
Implementation: Use the Prometheus client (prom-client) via @willsoto/nestjs-prometheus to expose metrics on a /metrics endpoint. Use the cloud provider's managed metrics service (GCP Cloud Monitoring, AWS CloudWatch) to scrape and store them. For custom business metrics (orders processed, jobs completed), create explicit counters and histograms. These are often more useful for incident diagnosis than infrastructure metrics.
Dashboards: Every production project must have a dashboard showing the four golden signals. Use Grafana, GCP Cloud Monitoring dashboards, or Datadog. The tool matters less than the existence of the dashboard. Link it in the project's Confluence space and review it at sprint demos.
Dashboard minimum requirements: request rate and error rate over time (1h, 6h, 24h, 7d), latency percentiles (p50, p95, p99) per endpoint, database connection pool usage, CPU and memory utilization per service, and queue depth or active connections where applicable.
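The reason dashboards show percentiles rather than averages is easiest to see with numbers. The sketch below computes nearest-rank percentiles over a made-up set of response times. In practice the metrics backend estimates percentiles from histogram buckets, so treat this as an illustration of the idea, not of how Prometheus computes them.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up response times in milliseconds for one endpoint; one request was very slow.
const latenciesMs = [120, 95, 110, 105, 2400, 130, 98, 101, 115, 99];

const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
const mean = latenciesMs.reduce((sum, v) => sum + v, 0) / latenciesMs.length;

console.log({ p50, p95, mean }); // { p50: 105, p95: 2400, mean: 337.3 }
```

The mean (337.3 ms) describes no real request: the typical one took about 105 ms and the slow one took 2400 ms. The p50 and p95 lines on a dashboard show both facts separately.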
Distributed tracing
When a user request traverses multiple services (API gateway, backend service, database, third-party API, cache), a single log entry per service is not enough to understand latency or failure. Distributed tracing connects these disparate log entries into a single trace.
Use OpenTelemetry. It is the industry standard, vendor-neutral, and integrates with every major observability backend (GCP Cloud Trace, Jaeger, Datadog, Grafana Tempo). The OpenTelemetry SDK for Node.js provides auto-instrumentation for HTTP, Express, NestJS, PostgreSQL (pg), and Redis. Initialize the SDK before the NestJS app starts using @opentelemetry/sdk-node with getNodeAutoInstrumentations().
Connecting traces to logs: Include the trace ID in every log entry via @opentelemetry/instrumentation-pino. When you see an error in the logs, you can jump directly to the full trace showing every service hop, database query, and external call that request made.
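To show what "include the trace ID in every log entry" amounts to, the sketch below parses a W3C Trace Context `traceparent` header, which is the default propagation format in OpenTelemetry, and merges the identifiers into a log context. The instrumentation package does this automatically; `parseTraceparent` and `withTrace` are hypothetical helpers written only for illustration, and the `trace_id` / `span_id` field names are assumed to match what the Pino instrumentation emits.

```typescript
interface TraceContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// Parses a `traceparent` header of the form version-traceid-parentid-flags (lowercase hex).
function parseTraceparent(header: string): TraceContext | null {
  const match = /^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null; // all-zero IDs are invalid
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Merges trace identifiers into a log context so log search and trace search share a key.
function withTrace(
  context: Record<string, unknown>,
  header: string | undefined,
): Record<string, unknown> {
  const trace = header ? parseTraceparent(header) : null;
  return trace ? { ...context, trace_id: trace.traceId, span_id: trace.spanId } : context;
}

const enriched = withTrace(
  { orderId: 'xyz' },
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
console.log(enriched);
```

With the trace ID on the log line, an engineer can copy it from an error entry into the tracing backend and open the matching trace.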
Alerting strategy
Alerts are the mechanism that turns observability into action. A well-designed alerting strategy wakes the right person for the right reason. A poorly designed one either misses real incidents or cries wolf until the team ignores it entirely.
Severity levels for alerts:
| Severity | Definition | Response time | Notification channel |
|---|---|---|---|
| P1: Critical | Service is down or data loss is occurring. Users are blocked. | Immediate (< 15 min) | Phone call / PagerDuty + Slack #incidents + stakeholder notification |
| P2: High | Service is degraded. Some users are affected. Core functionality is impaired. | Within 1 hour | Slack #incidents + on-call engineer notification |
| P3: Medium | Non-critical issue. Performance degraded but users can work around it. | Within business hours | Slack #alerts channel |
| P4: Low | Anomaly detected. No immediate user impact but worth investigating. | Next sprint | Jira ticket created automatically |
Alerting rules:
- Alert on symptoms, not causes. "Error rate exceeds 5%" is a symptom. "Database CPU is high" is a cause. Alert on the symptom; investigate the cause.
- Every alert must link to a runbook. An alert without a runbook is a puzzle dropped on someone at 2am.
- Set thresholds based on baselines. If your p99 latency is normally 200ms, an alert at 1000ms stays silent until latency has already degraded 5x. Alert at 2-3x the normal baseline.
- Use alert grouping and deduplication. Five alerts about the same database being down should produce one notification, not five.
- Review alert volume monthly. If an alert fires weekly without requiring action, fix the underlying issue or adjust the threshold. Alert fatigue is the biggest threat to effective incident response.
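The grouping rule above can be sketched as a function that collapses alerts sharing a resource into one notification. Real alert managers group on configurable label sets; grouping on a single `resource` field, and the type and function names used here, are simplifying assumptions.

```typescript
interface Alert {
  name: string;     // e.g. "ConnectionRefused"
  resource: string; // e.g. "orders-db"
  firedAt: number;  // epoch milliseconds
}

interface AlertNotification {
  key: string;
  count: number;
  firstFiredAt: number;
}

// Collapses alerts that share a resource into one notification per resource.
function groupAlerts(alerts: Alert[]): AlertNotification[] {
  const groups = new Map<string, AlertNotification>();
  for (const alert of alerts) {
    const existing = groups.get(alert.resource);
    if (existing) {
      existing.count += 1;
      existing.firstFiredAt = Math.min(existing.firstFiredAt, alert.firedAt);
    } else {
      groups.set(alert.resource, { key: alert.resource, count: 1, firstFiredAt: alert.firedAt });
    }
  }
  return Array.from(groups.values());
}

// Five alerts about the same database plus one unrelated alert.
const firing: Alert[] = [
  { name: 'ConnectionRefused', resource: 'orders-db', firedAt: 1000 },
  { name: 'QueryTimeout', resource: 'orders-db', firedAt: 1200 },
  { name: 'PoolExhausted', resource: 'orders-db', firedAt: 1300 },
  { name: 'ReplicaLag', resource: 'orders-db', firedAt: 1400 },
  { name: 'HealthCheckFailed', resource: 'orders-db', firedAt: 1500 },
  { name: 'CertificateExpiring', resource: 'api-gateway', firedAt: 1100 },
];

const notifications = groupAlerts(firing);
console.log(notifications.length); // 2
```

The on-call engineer receives two notifications instead of six, and the database notification carries the count so the scale of the problem is still visible.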
Essential alerts for every S&P project:
| What to alert on | Threshold (adjust to baseline) | Severity |
|---|---|---|
| Error rate (5xx) | > 5% of requests for 5 min | P1 |
| Health check failure | 3 consecutive failures | P1 |
| Response latency p95 | > 3x normal baseline for 10 min | P2 |
| Database connection pool | > 80% utilization for 5 min | P2 |
| Disk/storage usage | > 85% capacity | P2 |
| Certificate expiry | < 14 days | P3 |
| Dependency health check | External API unreachable for 5 min | P2 |
| Queue depth | Growing consistently for 15 min | P3 |
| Memory usage | > 90% for 10 min | P2 |
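Thresholds of the form "> 5% of requests for 5 min" combine a level with a duration. One reasonable reading, sketched below, is that the condition must hold in every one of the last N one-minute samples before the alert fires, which keeps a single bad minute from paging anyone. The sample shape and the function name are assumptions for illustration.

```typescript
// One per-minute sample of request outcomes.
interface MinuteSample {
  total: number;
  errors5xx: number;
}

// Fires only when every one of the last `windowMinutes` samples breaches the threshold.
function errorRateBreached(
  samples: MinuteSample[],
  thresholdPct: number,
  windowMinutes: number,
): boolean {
  if (samples.length < windowMinutes) return false; // not enough data yet
  return samples
    .slice(-windowMinutes)
    .every((s) => s.total > 0 && (s.errors5xx / s.total) * 100 > thresholdPct);
}

// Five consecutive minutes at a 6% error rate: sustained breach.
const sustained: MinuteSample[] = Array.from({ length: 5 }, () => ({ total: 1000, errors5xx: 60 }));

// Four healthy minutes followed by one very bad minute: a spike, not a sustained breach.
const spike: MinuteSample[] = [
  { total: 1000, errors5xx: 10 },
  { total: 1000, errors5xx: 10 },
  { total: 1000, errors5xx: 10 },
  { total: 1000, errors5xx: 10 },
  { total: 1000, errors5xx: 600 },
];

console.log(errorRateBreached(sustained, 5, 5)); // true
console.log(errorRateBreached(spike, 5, 5));     // false
```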
Health checks and synthetic monitoring
Health check endpoints are mandatory for every production service. They serve two purposes: infrastructure-level liveness probes (Cloud Run, Kubernetes) and application-level readiness checks.
Implement two endpoints:
- /health/live: Returns 200 if the process is running. No dependency checks. Used by the infrastructure to determine if the container should be restarted.
- /health/ready: Returns 200 if the service can handle requests. Checks database connectivity, cache availability, and critical third-party dependencies. Used by the load balancer to route traffic.
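// health.controller.ts
import { Controller, Get } from '@nestjs/common';
import { HealthCheck, HealthCheckService, TypeOrmHealthIndicator } from '@nestjs/terminus';
// RedisHealthIndicator is not part of @nestjs/terminus; this assumes a project-specific
// indicator that exposes pingCheck(). The import path is illustrative.
import { RedisHealthIndicator } from './redis.health';

@Controller('health')
export class HealthController {
  constructor(
    private health: HealthCheckService,
    private db: TypeOrmHealthIndicator,
    private redis: RedisHealthIndicator,
  ) {}

  // Liveness: the process is up. No dependency checks.
  @Get('live')
  liveness() {
    return { status: 'ok' };
  }

  // Readiness: the service can reach the dependencies it needs to serve traffic.
  @Get('ready')
  @HealthCheck()
  readiness() {
    return this.health.check([
      () => this.db.pingCheck('database'),
      () => this.redis.pingCheck('cache'),
    ]);
  }
}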
Synthetic monitoring runs automated checks against production endpoints every 1-5 minutes from external locations. This detects outages before users report them and catches issues that internal health checks miss (DNS failures, CDN problems, regional connectivity). Use GCP Uptime Checks, AWS CloudWatch Synthetics, or a third-party service (Checkly, Better Uptime). At minimum, monitor the application's main URL, the /health/ready endpoint, and critical API endpoints (authentication, core business flows).
Incident response process
This section is prescriptive. Follow it exactly. The cost of improvising during an incident is too high: miscommunication, delayed response, and repeated mistakes. The process below is the standard for every S&P project with production users.
Severity classification
When an incident is detected, classify it immediately. Do not spend time debating severity: pick the closest match and adjust later if needed. It is always better to overclassify and downgrade than to underclassify and scramble.
| Severity | Definition | Examples | Target resolution |
|---|---|---|---|
| SEV1 | Complete outage or data loss. All users affected. Business operations halted. | Production database down, authentication service unreachable, data corruption detected, payment processing completely failed | 1 hour |
| SEV2 | Major degradation. Core functionality impaired for a significant portion of users. | API latency 10x normal, intermittent 500 errors on critical endpoints, background jobs not processing, search completely broken | 4 hours |
| SEV3 | Partial degradation. Non-critical functionality affected. Users can work around the issue. | PDF export failing, notification emails delayed, analytics dashboard not updating, non-critical third-party integration down | 24 hours |
| SEV4 | Minor issue. Cosmetic or low-impact. No significant user impact. | Incorrect error message text, minor UI rendering issue in one browser, slow but functional non-critical endpoint | Next sprint |
Escalation rule: If a SEV2 is not resolved within 2 hours, escalate to SEV1 procedures. If a SEV3 is not resolved within 8 hours and user complaints are increasing, escalate to SEV2.
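The escalation rule is simple enough to encode directly, which is useful if incident tooling tracks elapsed time. The sketch below is a literal translation of the two conditions; the function name and the way complaint trends are passed in as a boolean are assumptions.

```typescript
type Severity = 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';

// Returns the severity an open incident should be handled at, given how long it has been open.
function effectiveSeverity(
  current: Severity,
  hoursOpen: number,
  complaintsIncreasing: boolean,
): Severity {
  if (current === 'SEV2' && hoursOpen >= 2) return 'SEV1';
  if (current === 'SEV3' && hoursOpen >= 8 && complaintsIncreasing) return 'SEV2';
  return current;
}

console.log(effectiveSeverity('SEV2', 2.5, false)); // SEV1
console.log(effectiveSeverity('SEV3', 9, true));    // SEV2
console.log(effectiveSeverity('SEV3', 9, false));   // SEV3
```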
Detection and triage
Incidents are detected through three channels: (1) automated alerts firing on thresholds (the preferred detection method); (2) user reports via Slack, email, or support channels (triage these immediately, as they are often more severe than they appear); (3) internal discovery during development, deployment, or routine monitoring.
Triage checklist (do this within the first 5 minutes):
- Confirm the incident is real (not a false alarm, not a monitoring gap).
- Assess scope: how many users are affected? Which functionality?
- Assign severity using the classification table above.
- Determine if a recent deployment is the likely cause. If yes, prepare to rollback.
- Open the incident Slack channel or thread.
Incident commander role
For SEV1 and SEV2 incidents, an Incident Commander (IC) must be designated within 15 minutes. The IC is not necessarily the most senior engineer; they are the person who coordinates the response. The IC does not debug; they direct.
IC responsibilities: Own the incident until resolved or explicitly handed off. Assign roles (who debugs, who communicates, who prepares rollback). Make decisions when the team is stuck: "we are rolling back in 5 minutes unless someone has a fix" is an IC decision. Manage the communication cadence (see below). Track the timeline for the post-incident review. Declare the incident resolved and confirm with stakeholders.
Who becomes IC: The on-call engineer is the default IC for SEV1/SEV2. If they are better suited to debugging (they know the affected code), they should hand IC to another available engineer. For SEV3/SEV4, the engineer who triaged the incident handles it directly; no formal IC is needed.
Communication protocol
This is the part teams most often get wrong. Silence during an incident is interpreted as incompetence, regardless of how hard the team is working on a fix. Structured communication is mandatory.
SEV1 communication cadence:
| Time | Action | Channel |
|---|---|---|
| 0-5 min | Post initial incident notice | Slack #incidents channel |
| 0-15 min | Notify project tech lead and CTO | Direct Slack message or phone call |
| Every 30 min | Post status update (what we know, what we're doing, next update time) | Slack #incidents + client-facing channel if applicable |
| On resolution | Post resolution notice with brief summary | Slack #incidents + email to stakeholders |
| Within 48h | Conduct post-incident review | Scheduled meeting |
SEV2 communication cadence:
| Time | Action | Channel |
|---|---|---|
| 0-15 min | Post initial incident notice | Slack #incidents channel |
| 0-30 min | Notify project tech lead | Direct Slack message |
| Every 60 min | Post status update | Slack #incidents |
| On resolution | Post resolution notice | Slack #incidents |
| Within 1 week | Conduct post-incident review | Async document or meeting |
SEV3/SEV4: Post in the project's Slack channel. No formal cadence required. Document resolution for the team.
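The update cadence is easy to miss while debugging, so it can help to compute the next deadline rather than remember it. A minimal sketch, assuming the intervals from the two cadence tables and a hypothetical `nextUpdateDue` helper:

```typescript
// Status-update intervals in minutes from the cadence tables. SEV3/SEV4 have no formal cadence.
const UPDATE_INTERVAL_MIN: Record<string, number | undefined> = { SEV1: 30, SEV2: 60 };

// Returns when the next status update is due, or null when no cadence applies.
function nextUpdateDue(severity: string, lastUpdate: Date): Date | null {
  const interval = UPDATE_INTERVAL_MIN[severity];
  if (interval === undefined) return null;
  return new Date(lastUpdate.getTime() + interval * 60 * 1000);
}

const lastPosted = new Date('2026-05-27T02:00:00.000Z');
console.log(nextUpdateDue('SEV1', lastPosted)?.toISOString()); // 2026-05-27T02:30:00.000Z
console.log(nextUpdateDue('SEV2', lastPosted)?.toISOString()); // 2026-05-27T03:00:00.000Z
console.log(nextUpdateDue('SEV3', lastPosted));                // null
```

The value returned here is what goes in the "Next update" line of each status message.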
Incident notification template (first message):
INCIDENT: [SEV level]
Service: [affected service/project]
Impact: [what users are experiencing]
Status: Investigating
Incident Commander: [name]
Next update: [time, e.g., "in 30 minutes"]
Status update template:
UPDATE ([SEV level]) [service]
Current status: [Investigating / Identified / Fixing / Monitoring]
What we know: [1-2 sentences on root cause or current hypothesis]
What we are doing: [current action]
Next update: [time]
Resolution template:
RESOLVED ([SEV level]) [service]
Duration: [start time to resolution time]
Impact: [summary of user impact]
Root cause: [1-2 sentences]
Resolution: [what fixed it]
Follow-up: Post-incident review scheduled for [date]
Resolution workflow
Once the incident is triaged and the IC is assigned, follow this resolution sequence.
Step 1: Assess recent changes. Most incidents are caused by recent deployments. Check: was there a deployment in the last 2 hours? Were any infrastructure changes made? Did a third-party dependency go down? If a deployment is the likely cause and the fix is not immediately obvious, rollback first, investigate second. A rollback that restores service in 5 minutes is better than a fix that takes 45 minutes.
Step 2: Gather diagnostic data. Pull logs filtered by the affected time window and correlation IDs. Check the dashboard: which golden signal went red first? Check traces for the affected endpoints. Check database metrics (connection pool, slow queries, locks). Check external dependency status pages.
Step 3: Identify and apply the fix. For code defects, deploy through the normal pipeline (or use the hotfix procedure for SEV1 if the pipeline is too slow). For infrastructure issues, apply the fix directly (scale up, restart, failover). For third-party outages, implement the fallback (circuit breaker, degraded mode) or communicate the dependency to stakeholders.
Step 4: Verify resolution. Confirm the fix via the same signals that detected the incident. Confirm with the reporter or affected users. Monitor for 30 minutes after resolution to ensure the fix holds.
Step 5: Declare resolved. The IC posts the resolution message and schedules the post-incident review.
Post-incident review (blameless postmortem)
Every SEV1 and SEV2 incident must have a post-incident review. SEV3 incidents get a review if the team decides the incident revealed a systemic issue worth examining.
The review is blameless. The goal is to understand what happened and what to change, not who to blame. People make mistakes because systems allow them to. Fix the system.
Post-incident review template:
# Post-Incident Review: [Incident title]
| Field | Value |
|-------|-------|
| **Date of incident** | YYYY-MM-DD |
| **Duration** | [start time to end time, total duration] |
| **Severity** | SEV1 / SEV2 / SEV3 |
| **Incident Commander** | [name] |
| **Author** | [name] |
| **Review date** | YYYY-MM-DD |
## Summary
[2-3 sentences: what happened, who was affected, how it was resolved.]
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | [First signal: alert fired / user report received] |
| HH:MM | [Incident declared, IC assigned] |
| HH:MM | [Key investigation step or discovery] |
| HH:MM | [Fix applied / rollback executed] |
| HH:MM | [Service confirmed restored] |
| HH:MM | [Incident declared resolved] |
## Root cause
[What caused the incident? Be specific. "The database ran out of connections
because the connection pool was sized for 20 connections but the service was
scaled to 8 replicas, each opening 20 connections, exceeding the database
limit of 100."]
## Impact
- Users affected: [number or percentage]
- Functionality affected: [what was broken]
- Data impact: [any data loss or corruption; state explicitly if none]
- Duration of user-facing impact: [time]
## What went well
- [Things that worked during the response]
## What went wrong
- [Things that failed or slowed down the response]
## Action items
| Action | Owner | Due date | Jira ticket |
|--------|-------|----------|-------------|
| [Specific remediation action] | [name] | [date] | [ticket ID] |
| [Process improvement] | [name] | [date] | [ticket ID] |
| [Monitoring improvement] | [name] | [date] | [ticket ID] |
Post-incident review rules: Hold the review within 48 hours of a SEV1 and within 1 week of a SEV2; memories fade fast. Every action item gets a Jira ticket with an owner and a due date; action items without tickets do not get done. Share the review document with the full engineering team, not just the people involved. Store reviews in the project's Confluence space or in a dedicated #post-incidents Confluence space accessible to all engineers.
On-call expectations
On-call is not optional for projects with production users. The specific rotation depends on team size and project criticality, but the expectations below apply universally.
On-call responsibilities
The on-call engineer is the first responder for production issues during their rotation. You must be reachable within 15 minutes (phone, Slack, PagerDuty), have a working laptop with VPN and deployment access, and triage incoming alerts and user reports. For SEV1/SEV2, you become the IC or hand off to someone better positioned. Handle SEV3/SEV4 during business hours; you do not wake up for a P4 at 3am. Document anything you fix or discover in the team's channel.
On-call rotation
Rotate weekly: longer rotations cause burnout, shorter ones cause context-switching overhead. The rotation must be documented and visible (shared Google Calendar, PagerDuty schedule, or Slack status). For small teams (2-3 engineers), alternate between primary on-call and backup. The backup is contacted if the primary is unreachable within 15 minutes. On-call handoff happens at a consistent time (e.g., Monday 10am) with a summary posted in the team channel: what happened, any ongoing issues, any alerts that need attention.
On-call health
On-call should not be punishing. If the on-call engineer is woken up more than twice in a week, the system has reliability problems that need engineering attention, not just a more resilient on-call engineer. Track on-call interrupt volume and prioritise reliability work when it trends up. Compensate on-call work appropriately: out-of-hours incident response is real work. After a severe SEV1 incident, the on-call engineer should take compensatory time.
Runbooks
A runbook is a step-by-step guide for diagnosing and resolving a specific operational issue. Every alert must link to a runbook. A runbook that does not exist is an engineer reading logs at 2am trying to figure out what to do from scratch.
When to create a runbook
Create a runbook when you set up a new alert, when you resolve an incident with non-obvious resolution steps, when a process requires more than 3 steps to execute (deployment, database migration, credential rotation), or when you are the only person who knows how to fix something.
Runbook template
# Runbook: [Alert name or issue description]
## When this triggers
[What alert fires, or what symptoms indicate this issue.]
## Impact
[What users experience when this happens.]
## Diagnosis steps
1. [Check specific dashboard/metric]
2. [Run specific query or command]
3. [Look for specific log pattern]
## Resolution steps
1. [Step-by-step fix with exact commands]
2. [Expected outcome after each step]
3. [How to verify the fix worked]
## Escalation
[Who to contact if the steps above don't resolve the issue.]
## History
[Previous occurrences and any context from post-incident reviews.]
Runbook hygiene
Store runbooks in the project's Confluence space, linked from the alert configuration. Review them quarterly; a wrong runbook is worse than no runbook. When a runbook is used during an incident, update it afterward with anything that was missing or unclear. Runbooks should be executable by any engineer on the team, not just the author.
Error budgets and SLOs
Service Level Objectives (SLOs) define "good enough" in measurable terms. Error budgets are the math that connects SLOs to engineering decisions.
Defining SLOs
An SLO is a target for a specific metric over a specific time window. The most common SLOs:
| SLO type | Example | Measurement |
|---|---|---|
| Availability | 99.9% of requests return a non-5xx response over 30 days | (successful requests / total requests) * 100 |
| Latency | 95% of requests complete within 500ms over 30 days | p95 latency histogram |
| Correctness | 99.99% of data processing jobs complete without error over 30 days | (successful jobs / total jobs) * 100 |
How to set SLOs: Start with what your users actually need, not what sounds impressive. A 99.99% availability SLO requires redundancy, automated failover, and operational maturity that most projects do not have. For most S&P projects, 99.9% availability (about 8.8 hours of downtime per year) is a reasonable starting point. SLOs must be measurable with your existing observability tooling; an SLO you cannot measure is a wish, not an objective. Review SLOs quarterly and tighten them as the system matures.
Error budgets
The error budget is the complement of the SLO: 100% minus the target. A 99.9% availability SLO means you have a 0.1% error budget, roughly 43 minutes of downtime in a 30-day month.
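The arithmetic behind these figures is worth having in a reusable form, since it changes with both the target and the window. A small sketch, with `errorBudgetMinutes` as a hypothetical helper name:

```typescript
// Allowed downtime in minutes for an availability SLO over a window of `windowDays`.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const budgetFraction = (100 - sloPercent) / 100;
  return budgetFraction * windowDays * 24 * 60;
}

// 99.9% over 30 days: 0.001 * 43,200 minutes.
console.log(errorBudgetMinutes(99.9, 30).toFixed(1)); // "43.2"

// 99.9% over 365 days, expressed in hours.
console.log((errorBudgetMinutes(99.9, 365) / 60).toFixed(2)); // "8.76"

// 99.99% over 30 days leaves an order of magnitude less room.
console.log(errorBudgetMinutes(99.99, 30).toFixed(2)); // "4.32"
```

Moving from 99.9% to 99.99% cuts the monthly budget from about 43 minutes to about 4 minutes, which is why the stricter target needs automated failover rather than a human response.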
How to use error budgets: When the budget is healthy, the team has room to ship features and take calculated risks with deployments. When the budget is depleted, the team shifts focus to reliability work (fixing flaky tests, improving monitoring, hardening infrastructure). Error budgets make the trade-off between velocity and reliability explicit. Instead of arguing about whether to ship a risky change, look at the budget. This is not about punishing the team for outages; it is about making informed decisions with real data.
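"Look at the budget" can be made mechanical for a request-based availability SLO: count how many failures the SLO allows in the window and compare with how many occurred. The sketch below assumes that request counts are available from the metrics backend. The 100% cut-off and the two mode labels are illustrative assumptions, not S&P policy.

```typescript
interface BudgetStatus {
  allowedFailures: number;
  consumedPct: number;
  mode: 'ship features' | 'reliability work';
}

// Compares failed requests in the window against the number the SLO allows.
function budgetStatus(
  sloPercent: number,
  totalRequests: number,
  failedRequests: number,
): BudgetStatus {
  const allowedFailures = (totalRequests * (100 - sloPercent)) / 100;
  const consumedPct = allowedFailures === 0 ? 100 : (failedRequests / allowedFailures) * 100;
  return {
    allowedFailures,
    consumedPct,
    mode: consumedPct >= 100 ? 'reliability work' : 'ship features',
  };
}

// 2,000,000 requests at a 99.9% SLO allow about 2,000 failures in the window.
const healthy = budgetStatus(99.9, 2_000_000, 500);   // about 25% of the budget used
const depleted = budgetStatus(99.9, 2_000_000, 2500); // about 125% of the budget used

console.log(healthy.mode, Math.round(healthy.consumedPct));
console.log(depleted.mode, Math.round(depleted.consumedPct));
```

A team might also choose to slow down before the budget is fully spent, for example at 75%; that threshold is a policy decision rather than something the arithmetic dictates.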
Critical thinking
- Observability is an investment, not overhead. Teams that skip structured logging and metrics because "we'll add it later" always regret it: "later" arrives when production is on fire and there is no data to diagnose the problem. The cost of adding observability on day one is small. The cost of adding it during a SEV1 is immense.
- Not every project needs the same depth. An internal tool with 10 users does not need the same alerting strategy as a client-facing platform with 50,000 users. Scale your observability investment to the risk profile. But structured logging and basic health checks are the minimum for any production service: no exceptions.
- Alert fatigue is a system design problem, not a discipline problem. If your team ignores alerts, the solution is not to tell them to pay more attention. The solution is to fix the alerts: reduce false positives, tune thresholds to baselines, group related alerts, and delete alerts that never result in action.
- Post-incident reviews are only valuable if action items get done. A blameless postmortem that produces 10 action items is useless if none of them are completed. Limit action items to 3-5 high-impact changes, assign them as sprint work, and track them to completion.
- SLOs are a conversation tool, not a contract. The primary value of an SLO is that it creates a shared language between engineering, product, and the client about what "reliable" means. Use SLOs to have better conversations about trade-offs, not to create blame when they are missed.
- On-call rotation must be sustainable. If on-call is miserable, the best engineers will leave or refuse to participate. Invest in reliability to make on-call quiet, provide compensatory time, and treat chronic on-call pain as a bug in the system, not a cost of doing business.
Checklist
For every production service
- Structured JSON logging is configured (Pino in NestJS) with correlation IDs
- Log levels are used consistently (error, warn, info, debug)
- Sensitive data is redacted from logs (tokens, passwords, PII)
- The four golden signals are collected as metrics (latency, traffic, errors, saturation)
- A dashboard exists showing the golden signals with appropriate time ranges
- Health check endpoints are implemented (/health/live and /health/ready)
- Alerts are configured for error rate, latency, health check failures, and resource saturation
- Every alert links to a runbook
- Synthetic monitoring checks production endpoints on a schedule
For every project with production users
- Incident severity levels (SEV1-SEV4) are defined and understood by the team
- The incident communication protocol is documented and accessible
- A Slack #incidents channel (or equivalent) exists
- On-call rotation is defined, documented, and visible
- The on-call engineer has deployment and rollback access
- Post-incident review template is available in the project's Confluence space
- Runbooks exist for common failure scenarios
- SLOs are defined for the project's core user-facing functionality
- The incident response process from Security is integrated with this process
After every incident
- Post-incident review conducted within the required timeframe (48h for SEV1, 1 week for SEV2)
- Root cause identified and documented
- Action items created as Jira tickets with owners and due dates
- Runbooks updated based on incident learnings
- Alerts and thresholds reviewed and adjusted if needed
- Post-incident review shared with the engineering team
AI tips
- Diagnose log patterns. Paste structured log entries from an incident window and ask AI to identify the sequence of events, spot the earliest error signal, and suggest likely root causes. AI is effective at pattern matching across large log volumes.
- Draft post-incident reviews. Provide the incident timeline and Slack messages from the incident channel. Ask AI to structure these into the post-incident review template. Review for accuracy. AI organizes information well but may misinterpret causation.
- Generate runbooks from incident resolutions. After fixing a production issue, describe the diagnosis and fix steps and ask AI to produce a runbook following the template. This captures knowledge while it is fresh.
- Tune alert thresholds. Export metric data for an alert that fires too often and ask AI to analyze the distribution and suggest a threshold that reduces false positives while catching real incidents.
- Build dashboard queries. Describe what you want to monitor and which metrics backend you use (Prometheus, CloudWatch, GCP Monitoring). Ask AI to generate the PromQL or MQL query; these are syntactically tricky and AI handles the syntax well.
- Analyse incident trends. Feed multiple post-incident reviews to AI and ask it to identify recurring themes: common root causes, services that fail most often, time-of-day patterns. This informs where to invest in reliability work.
Resources
S&P internal:
- Security incident first-response: Security-specific incident containment and evidence preservation
- Architecture & System Design: System context and container diagrams that inform observability design
- Engineering Principles: "Write it down" and decision-making principles that underpin incident documentation
Observability tools:
- OpenTelemetry: Vendor-neutral observability framework (traces, metrics, logs)
- Pino / nestjs-pino: Structured JSON logging for NestJS
- Prometheus client for Node.js: Metrics exposition library
- Grafana: Dashboarding and visualization
- GCP Cloud Logging / Cloud Monitoring / Cloud Trace: Managed observability for GCP
Incident management references:
- Google SRE Book (Monitoring Distributed Systems): The four golden signals and alert design
- Google SRE Book (Being On-Call): On-call expectations and sustainability
- Google SRE Book (Postmortem Culture): Blameless postmortem practices
- PagerDuty Incident Response Guide: Comprehensive incident response framework
- Atlassian Incident Management Handbook: Incident commander role and communication
General references:
- Microsoft Code-with-Engineering Playbook (Observability): Observability practices and patterns
- 12-Factor App (Logs): Treat logs as event streams
- SLO Workbook (Google): Practical guide to implementing SLOs