Infrastructure & Platform

Building an Observability Stack

Implement structured logging, metrics, and alerting to reduce incident response time and improve system reliability.

observability, monitoring, reliability, SRE

The full NCT chain

Narrative

When something breaks in production, we're flying blind. Our logging is inconsistent, we have no centralized metrics, and alerts are either too noisy or too late. Last month's P1 incident took 6 hours to diagnose because we couldn't correlate logs across services. Our mean time to recovery (MTTR) is 4x the industry benchmark. Customers are losing trust: two enterprise clients escalated to their account executives about reliability. If we build a proper observability stack, we can cut incident diagnosis time, reduce MTTR, and restore customer confidence.

Commitment 1

Implement structured logging across all services with request tracing and correlation IDs

Tasks
  • Define structured logging format and required fields
  • Add correlation ID propagation across service boundaries
  • Migrate existing log statements to structured format
  • Set up centralized log aggregation and search
  • Verify log correlation works across 3 critical request paths
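As a rough illustration of the first two tasks, here is a minimal Python sketch using only the standard library. The field names, the X-Correlation-ID header, and the helper names (build_logger, start_request, outgoing_headers) are assumptions made for this example, not a prescribed format; a real service would adapt them to its own framework and logging library.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

# Assumed header name; use whatever your services agree on.
CORRELATION_HEADER = "X-Correlation-ID"


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": record.name,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        })


def build_logger(service, stream):
    """Create a logger that writes structured JSON lines to `stream`."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def start_request(incoming_headers):
    """Reuse the caller's ID if present, otherwise mint a new one.

    Simplified: a real HTTP layer would match header names case-insensitively.
    """
    cid = incoming_headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to attach to downstream calls so the ID crosses the boundary."""
    return {CORRELATION_HEADER: correlation_id.get()}


# Demo: handle one request in a hypothetical "checkout" service.
stream = io.StringIO()
log = build_logger("checkout", stream)
start_request({CORRELATION_HEADER: "abc123"})
log.info("payment authorized")
entry = json.loads(stream.getvalue())
print(entry["correlation_id"], outgoing_headers())
```

Because every log line carries the same correlation_id and every downstream call forwards it, a search for one ID in the log aggregator should return the full path of a request across services, which is what the verification task checks.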
Commitment 2

Ship a metrics dashboard covering the four golden signals (latency, traffic, errors, saturation) for all production services

Tasks
  • Instrument all services with metrics collection
  • Build dashboards for latency, traffic, errors, and saturation
  • Set up service-level objectives (SLOs) for each service
  • Create runbook templates linked from dashboards
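To make the four golden signals concrete, the sketch below computes them from a small batch of request samples. The Request record, the nearest-rank percentile helper, and the use of CPU as the saturation measure are assumptions chosen for illustration; in practice these numbers would come from your metrics system rather than hand-rolled code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_s, cpu_used, cpu_limit):
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request sample")
    durations = [r.duration_ms for r in requests]
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": cpu_used / cpu_limit,
    }


# Demo: ten requests observed over a 5-second window, one of them failing.
samples = [Request(d, 200) for d in (10, 20, 30, 40, 50, 60, 70, 80, 90)]
samples.append(Request(1000, 503))
signals = golden_signals(samples, window_s=5, cpu_used=1.5, cpu_limit=2.0)
print(signals)
```

The demo shows why a dashboard should plot a high percentile alongside the median: the single slow, failing request barely moves p50 but dominates p99.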
Commitment 3

Configure actionable alerting that pages the on-call engineer within 2 minutes of a customer-impacting issue

Tasks
  • Define alert severity levels and escalation paths
  • Implement alerts based on SLO burn rates
  • Set up on-call rotation and PagerDuty integration
  • Reduce alert noise by consolidating duplicate alerts
  • Run game day to validate alerting end-to-end
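The burn-rate task can be sketched as follows. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a value of 1 means the budget is being spent exactly on schedule. The two-window check and the 14.4 threshold follow a commonly cited starting point for a 30-day SLO; the window contents, the threshold, and the function names are assumptions to tune for your own services.

```python
def burn_rate(errors, total, slo_target):
    """How many times faster than 'exactly on budget' errors are arriving.

    slo_target must be below 1.0 (for example 0.999 for 99.9%).
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    Each window is an (errors, total) pair. The long window shows the
    problem is significant; the short window shows it is still happening,
    which helps avoid paging for an issue that has already recovered.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


# Demo with a 99.9% SLO: (errors, total) over a long and a short window.
SLO = 0.999
ongoing = should_page((2000, 100_000), (150, 8000), SLO)   # still burning
recovered = should_page((2000, 100_000), (5, 8000), SLO)   # short window calm
print(ongoing, recovered)
```

Alerting on burn rate rather than raw error counts ties each page to the SLOs defined in Commitment 2, which is one way to cut noise without missing customer-impacting issues.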

When to use this

Context

Use this NCT when incident response is slow, on-call engineers lack visibility, and customer trust is eroding due to reliability issues. Especially relevant when you're moving upmarket to enterprise customers who expect high availability.

Analysis

Why this NCT works

The Narrative grounds observability work in customer impact and business risk, not just engineering best practices. The Commitments cover logging with request tracing, metrics, and alerting, and each is scoped to a specific, measurable outcome. The 'game day' task in the third Commitment checks that the alerting works end-to-end before the next real incident.
