Building an Observability Stack
Implement structured logging, metrics, and alerting to reduce incident response time and improve system reliability.
The full NCT chain
“When something breaks in production, we're flying blind. Our logging is inconsistent, we have no centralized metrics, and alerts are either too noisy or too late. Last month's P1 incident took 6 hours to diagnose because we couldn't correlate logs across services. Our MTTR is 4x the industry benchmark. Customers are losing trust — two enterprise clients escalated to their account executives about reliability. If we build a proper observability stack, we cut incident diagnosis time, reduce MTTR, and restore customer confidence.”
Implement structured logging across all services with request tracing and correlation IDs
- Define structured logging format and required fields
- Add correlation ID propagation across service boundaries
- Migrate existing log statements to structured format
- Set up centralized log aggregation and search
- Verify log correlation works across 3 critical request paths
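The logging tasks above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a prescribed implementation: the field names, the `X-Correlation-ID` header, and the helper names (`start_request`, `outgoing_headers`, `build_logger`) are all assumptions chosen for the example.

```python
import contextvars
import json
import logging
import uuid

# Assumed header name; the plan above does not specify one.
CORRELATION_HEADER = "X-Correlation-ID"

# Holds the correlation ID for the request currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)


def start_request(incoming_headers):
    """Reuse the caller's correlation ID if present, otherwise create one."""
    cid = incoming_headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to attach to downstream calls so the ID crosses service boundaries."""
    return {CORRELATION_HEADER: _correlation_id.get()}


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": self.service,
            "correlation_id": _correlation_id.get(),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service, stream=None):
    """Create a logger that emits structured JSON lines (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    start_request({})  # no incoming ID, so a new one is generated
    log.info("order placed")
```

Because every service writes the same `correlation_id` field, a single search in the log aggregator can follow one request across service boundaries, which is what the verification task checks.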
Ship a metrics dashboard covering the four golden signals (latency, traffic, errors, saturation) for all production services
- Instrument all services with metrics collection
- Build dashboards for latency, traffic, errors, and saturation
- Set up service-level objectives (SLOs) for each service
- Create runbook templates linked from dashboards
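To make the four golden signals concrete, here is a rough sketch of how each could be computed from one window of request data. The `Request` record, the function names, and the choice of p99 latency and an in-flight/capacity ratio for saturation are assumptions for illustration; a real stack would usually get these from its metrics system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, kept at 1 or above.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one observation window as the four golden signals."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        # Latency: how long requests take (tail, not average).
        "latency_p99_ms": percentile(durations, 99) if durations else None,
        # Traffic: demand on the service, in requests per second.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: share of requests that failed server-side.
        "error_rate": errors / len(requests) if requests else 0.0,
        # Saturation: how full the service is (assumed: in-flight vs. capacity).
        "saturation": in_flight / capacity,
    }


if __name__ == "__main__":
    sample = [Request(12.0, 200), Request(480.0, 500), Request(15.0, 200)]
    print(golden_signals(sample, window_seconds=60, in_flight=8, capacity=50))
```

Each key in the returned dictionary maps to one dashboard panel, and the error rate and latency values are natural inputs for the per-service SLOs.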
Configure actionable alerting that pages the on-call engineer within 2 minutes of a customer-impacting issue
- Define alert severity levels and escalation paths
- Implement alerts based on SLO burn rates
- Set up on-call rotation and PagerDuty integration
- Reduce alert noise by consolidating duplicate alerts
- Run game day to validate alerting end-to-end
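The burn-rate task can be illustrated with a small calculation. A burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so 1.0 means the budget is being spent exactly on schedule. The sketch below pages only when both a long and a short window are burning fast, which is one common way to cut noise. The function names are hypothetical, and the default threshold of 14.4 is an assumed value (the rate at which 2% of a 30-day budget is spent in one hour); tune it to your own SLOs.

```python
def burn_rate(errors, total, slo_target):
    """Error rate relative to the error budget; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (errors / total) / budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows exceed the threshold.

    Each window is an (errors, total) pair. The long window shows the
    problem is significant; the short window shows it is still happening.
    """
    long_burn = burn_rate(*long_window, slo_target)
    short_burn = burn_rate(*short_window, slo_target)
    return long_burn >= threshold and short_burn >= threshold


if __name__ == "__main__":
    # Assumed example: 99.9% availability SLO, counts from 1 h and 5 min windows.
    print(should_page((200, 10_000), (30, 1_000), slo_target=0.999))
```

Requiring the short window to agree means the alert clears soon after the errors stop, so on-call engineers are not paged for an incident that has already recovered.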
When to use this
Context
Use this NCT when incident response is slow, on-call engineers lack visibility, and customer trust is eroding due to reliability issues. It is especially relevant when you're moving upmarket to enterprise customers who expect high availability.
Analysis
Why this NCT works
The Narrative grounds observability work in customer impact and business risk rather than in engineering best practices alone. The Commitments cover logging (with request tracing), metrics, and alerting, a close variant of the conventional three pillars of observability (logs, metrics, and traces), and each is scoped to a specific, measurable outcome. The 'game day' task in the third Commitment verifies that alerting works end-to-end before the next real incident.
Related examples
Enterprise API Rate Limiting
Redesign API rate limiting to serve enterprise customers without impacting the reliability of the platform.
Product & Engineering
Reducing CI Pipeline Time
Cut CI build times from 18 minutes to under 5 minutes to restore developer flow and increase shipping frequency.
Infrastructure & Platform
Achieving SOC 2 Compliance
Complete SOC 2 Type II certification to unblock enterprise sales and reduce security questionnaire burden.