
Building an Observability Stack

Implement structured logging, metrics, and alerting to reduce incident response time and improve system reliability.

observability, monitoring, reliability, SRE

The full NCT chain

Narrative

“When something breaks in production, we're flying blind. Our logging is inconsistent, we have no centralized metrics, and alerts are either too noisy or too late. Last month's P1 incident took 6 hours to diagnose because we couldn't correlate logs across services. Our MTTR is 4x the industry benchmark. Customers are losing trust — two enterprise clients escalated to their account executives about reliability. If we build a proper observability stack, we cut incident diagnosis time, reduce MTTR, and restore customer confidence.”

Commitment 1

Implement structured logging across all services with request tracing and correlation IDs

Tasks

Define structured logging format and required fields
Add correlation ID propagation across service boundaries
Migrate existing log statements to structured format
Set up centralized log aggregation and search
Verify log correlation works across 3 critical request paths
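
As a rough illustration of the first two tasks, the sketch below emits JSON log lines with a fixed set of fields and carries a correlation ID through a request. It uses only the Python standard library. The field names, the `X-Correlation-ID` header, the `checkout` service name, and the helpers `build_logger` and `handle_request` are assumptions made for the example, not part of the plan above.

```python
import contextvars
import io
import json
import logging
import uuid

# Correlation ID for the current request context (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            # "service" is supplied per call through logging's `extra` argument.
            "service": getattr(record, "service", "unknown"),
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        })


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, headers):
    """Reuse the caller's correlation ID, or mint one, and log with it."""
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request received", extra={"service": "checkout"})
        # Headers to forward on any downstream call (assumed header name).
        return {"X-Correlation-ID": cid}
    finally:
        correlation_id.reset(token)
```

Calling `handle_request(build_logger("demo", io.StringIO()), {"X-Correlation-ID": "abc123"})` writes one JSON line tagged with `abc123` and returns the header to pass downstream, so every service on the request path can be searched by the same ID.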

Commitment 2

Ship a metrics dashboard covering the four golden signals (latency, traffic, errors, saturation) for all production services

Tasks

Instrument all services with metrics collection
Build dashboards for latency, traffic, errors, and saturation
Set up service-level objectives (SLOs) for each service
Create runbook templates linked from dashboards
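
To make the four golden signals concrete, here is a minimal in-memory sketch that derives latency, traffic, errors, and saturation from recorded requests over one window. A real deployment would more likely use a metrics library and a time-series backend. The `GoldenSignals` class, the use of p95 for latency, and the definition of saturation as peak in-flight requests divided by an assumed capacity are all illustrative choices.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenSignals:
    """Accumulates request samples for one reporting window (hypothetical helper)."""

    window_seconds: float
    capacity: int  # concurrent requests the service is sized for (assumption)
    latencies_ms: list = field(default_factory=list)
    errors: int = 0
    in_flight_peak: int = 0

    def record(self, latency_ms, ok, in_flight):
        """Record one finished request and the concurrency seen at that time."""
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1
        self.in_flight_peak = max(self.in_flight_peak, in_flight)

    def snapshot(self):
        """Return the four signals for the window."""
        n = len(self.latencies_ms)
        ordered = sorted(self.latencies_ms)
        # Nearest-rank p95, using integer ceiling division to avoid float drift.
        rank = (95 * n + 99) // 100
        p95 = ordered[rank - 1] if n else 0.0
        return {
            "latency_p95_ms": p95,
            "traffic_rps": n / self.window_seconds,
            "error_rate": self.errors / n if n else 0.0,
            "saturation": self.in_flight_peak / self.capacity,
        }
```

Each key in the snapshot maps to one dashboard panel, and the same values can serve as the indicators behind per-service SLOs.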

Commitment 3

Configure actionable alerting that pages the on-call engineer within 2 minutes of a customer-impacting issue

Tasks

Define alert severity levels and escalation paths
Implement alerts based on SLO burn rates
Set up on-call rotation and PagerDuty integration
Reduce alert noise by consolidating duplicate alerts
Run game day to validate alerting end-to-end
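
One way to read "alerts based on SLO burn rates" is sketched below. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), and a page fires only when both a long and a short window exceed a threshold, so brief blips and already-recovered incidents do not page. The function names, the choice of windows, and the default threshold of 14.4 are assumptions for illustration and should be tuned per service.

```python
def burn_rate(errors, total, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (errors / total) / budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    Each window is an (errors, total_requests) pair, for example the last
    hour and the last five minutes (assumed window sizes).
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

With a 99.9% target, `should_page((200, 10000), (30, 1000), 0.999)` pages because both windows burn well above the threshold, while `should_page((200, 10000), (1, 1000), 0.999)` does not, because the short window shows the problem has already subsided.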

When to use this

Context

Use this NCT when incident response is slow, on-call engineers lack visibility, and customer trust is eroding due to reliability issues. It is especially relevant when you're moving upmarket to enterprise customers who expect high availability.

Analysis

Why this NCT works

The Narrative grounds observability work in customer impact and business risk, not just engineering best practices. The Commitments cover logs, metrics, and alerts, echoing the three pillars of observability (logs, metrics, and traces), with tracing folded into the logging Commitment through correlation IDs. Each is scoped to a specific, measurable outcome. The 'game day' task in the third Commitment ensures the system works before the next real incident.