At 3:17 a.m. on a Tuesday, my phone buzzed with the alert that would reshape the way I think about API design.
Our customer-facing API had stopped responding. Not slowly degrading; it was completely dead. Three downstream services went with it. By the time I got to my laptop, customer support tickets were flooding in.
The root cause? A single database replica had gone down, and our API had no fallback. One failure cascaded into total unavailability. I spent the next four hours manually rerouting traffic while our customers waited.
That night cost us $14,000 in service-level agreement (SLA) credits and a lot of trust. But it taught me something I now apply to every API I build: Every design decision should pass what I call “The 3 a.m. Test.”
The 3 a.m. Test
The test is simple: When this system breaks at 3 a.m., will the on-call engineer be able to diagnose and fix it quickly?
This single question has eliminated a surprising number of “clever” design choices from my architectures:
- Clever error codes that require a documentation lookup? Fail.
- Implicit state that depends on previous requests? Fail.
- Cascading failures that take down unrelated features? Fail.
After that incident, I rebuilt our API infrastructure from the ground up. Over the next three years, handling 50 million daily requests, I developed five principles that transformed our reliability from 99.2% to 99.95% and let me sleep through the night.
Principle 1: Design for Partial Failure
Six months after the first incident, we had another outage. This time, a downstream payment processor went unresponsive. Our API dutifully waited for responses that never came, and request threads piled up until we crashed.
I realized we’d solved one problem but created another. We needed systems that degraded gracefully instead of failing catastrophically.
Here’s what we built:
class ResilientServiceClient:
    def __init__(self, primary_url, fallback_url):
        self.primary = primary_url
        self.fallback = fallback_url
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=30
        )

    async def fetch(self, request):
        # Try primary with circuit breaker protection
        if self.circuit_breaker.is_closed():
            try:
                response = await self.call_with_timeout(
                    self.primary, request, timeout_ms=500
                )
                self.circuit_breaker.record_success()
                return response
            except (TimeoutError, ConnectionError):
                self.circuit_breaker.record_failure()

        # Fall back to secondary
        try:
            return await self.call_with_timeout(
                self.fallback, request, timeout_ms=1000
            )
        except Exception:
            # Return degraded response rather than error
            return self.degraded_response(request)
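The CircuitBreaker used above isn't shown in the snippet. Here's a minimal sketch of what it could look like, assuming a simple count-and-cooldown design; the failure_threshold and recovery_timeout parameters mirror the constructor call above, and everything else is illustrative rather than our exact implementation:

import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after N consecutive failures,
    allows traffic again once a cooldown window has passed."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at = None

    def is_closed(self):
        # Closed (traffic allowed) unless we recently tripped open
        if self.opened_at is None:
            return True
        # Allow a trial request once the recovery window has elapsed
        return time.monotonic() - self.opened_at >= self.recovery_timeout

    def record_success(self):
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.monotonic()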
The key insight: A degraded response is almost always better than an error. Users can work with stale data or reduced functionality. They can’t work with a 500 error.
After implementing this pattern across our services, we stopped having cascading failures. When the payment processor went down again (it did, three more times that year), our API returned cached pricing and queued transactions for later processing. Customers barely noticed.
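What a degraded response looks like depends on the endpoint. As a hedged illustration of the cached-pricing-plus-queued-transactions behavior described above (the cache, retry_queue and Response objects here are assumptions, not our actual code):

    def degraded_response(self, request):
        # Serve the last known good data if we have it
        cached = self.cache.get(request.cache_key)
        if cached is not None:
            return Response(
                data=cached,
                headers={'X-Degraded': 'true', 'X-Data-Source': 'cache'}
            )
        # For writes, accept the request and queue it for later processing
        if request.is_mutation:
            self.retry_queue.enqueue(request)
            return Response(status=202, data={'status': 'queued'})
        # Nothing cached and nothing to queue: fail explicitly but clearly
        return Response(status=503, data={'error': 'dependency_unavailable'})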
Principle 2: Make Idempotency Non-Negotiable
This lesson came from a $27,000 mistake.
A mobile client had a bug that caused it to retry failed requests aggressively. One of those requests was a payment. The retry logic didn’t include idempotency keys. You can guess what happened next.
A single customer got charged 23 times for the same order. By the time we noticed, we’d processed duplicate charges across hundreds of accounts. The refunds, the customer service hours, and the engineering time to fix it cost $27,000.
Now, every mutating endpoint requires an idempotency key. No exceptions.
class IdempotentEndpoint:
    def __init__(self):
        self.idempotency_store = RedisStore(ttl_hours=24)

    async def handle_request(self, request, idempotency_key):
        # Check if we've already processed this request
        existing = await self.idempotency_store.get(idempotency_key)
        if existing:
            # Return the cached response — don't re-execute
            return Response(
                data=existing['response'],
                headers={'X-Idempotent-Replay': 'true'}
            )

        # Process the request
        result = await self.execute_operation(request)

        # Cache for future retries
        await self.idempotency_store.set(
            idempotency_key,
            {'response': result, 'timestamp': now()}
        )

        return Response(data=result)
We also started rejecting requests without idempotency keys for any POST, PUT or DELETE operation. Some client developers complained initially. Then they thanked us when their retry bugs didn’t cause data corruption.
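Enforcing that rule is a small piece of middleware. A minimal sketch, assuming a framework-agnostic handler chain; the Idempotency-Key header name and the error body shown here are illustrative:

MUTATING_METHODS = {'POST', 'PUT', 'DELETE'}

async def require_idempotency_key(request, next_handler):
    # Only mutating operations need a key; reads are naturally idempotent
    if request.method in MUTATING_METHODS:
        key = request.headers.get('Idempotency-Key')
        if not key:
            return Response(
                status=400,
                data={
                    'error': 'missing_idempotency_key',
                    'detail': 'POST, PUT and DELETE require an Idempotency-Key header'
                }
            )
    return await next_handler(request)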
Principle 3: Version in the URL, Not the Header
I learned this one by watching a junior engineer debug an issue for six hours.
We’d been versioning our API through a custom header: X-API-Version: 2. It seemed clean. Kept the URLs tidy.
But when something went wrong, our logs showed the URL and response code — not the headers. The engineer was looking at logs for /users/123 and couldn’t figure out why the behavior was different between two clients. Six hours later, he finally thought to check the version header.
We moved versioning to the URL path that week:
/v1/users/123
/v2/users/123
Now version information shows up in:
- Every log entry
- Every trace
- Every error report
- Every monitoring dashboard
The debugging time savings alone justified the migration. But we also established versioning rules that prevented future pain:
- Breaking changes require a new version
- Additive changes (new optional fields) don’t require a new version
- We support at least two versions simultaneously
- 12-month deprecation notice before sunsetting any version
When we do deprecate a version, clients get a Deprecation header warning them for months before we actually turn it off.
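One way that warning could be attached is a small response hook keyed on the version prefix. This is a sketch under assumptions: the sunset date, the hook wiring, and the successor link are all made up for illustration.

DEPRECATED_VERSIONS = {
    # version prefix -> planned shutdown date (illustrative value)
    '/v1/': 'Sat, 01 Nov 2025 00:00:00 GMT',
}

def add_deprecation_headers(request, response):
    for prefix, sunset in DEPRECATED_VERSIONS.items():
        if request.path.startswith(prefix):
            # Deprecation signals the version is on its way out;
            # Sunset gives clients the date it will stop working
            response.headers['Deprecation'] = 'true'
            response.headers['Sunset'] = sunset
            response.headers['Link'] = '</v2/docs>; rel="successor-version"'
    return response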
Principle 4: Rate Limit Before You Need To
We almost learned this lesson the hard way.
A partner company integrated with our API. Their team’s implementation had a bug: When they got a timeout, they’d retry immediately. Infinitely. With exponential parallelism.
At 2 p.m. on a Thursday, their system started sending 50,000 requests per second. We didn’t have rate limiting. We’d always planned to add it “when we needed it.”
We needed it.
Fortunately, our load balancer had basic protection that kicked in and started dropping requests. But legitimate traffic got dropped too. For 47 minutes, our API was essentially a lottery — maybe your request would get through, maybe it wouldn’t.
The next week, we implemented tiered rate limiting:
class TieredRateLimiter:
    def __init__(self):
        self.limiters = {
            'per_client': TokenBucket(rate=100, burst=200),
            'per_endpoint': TokenBucket(rate=1000, burst=2000),
            'global': TokenBucket(rate=10000, burst=15000)
        }

    async def check_limit(self, client_id, endpoint):
        # Check all tiers, return the first failure
        for tier_name, limiter in self.limiters.items():
            key = client_id if tier_name == 'per_client' else endpoint
            result = await limiter.check(key)
            if not result.allowed:
                return RateLimitResponse(
                    allowed=False,
                    retry_after=result.retry_after,
                    limit_type=tier_name
                )
        return RateLimitResponse(allowed=True)
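The TokenBucket above isn't shown in the snippet. A minimal sketch of one possible per-key implementation; the check interface returning allowed/retry_after matches the usage above, while the rest is an assumption:

import time
from dataclasses import dataclass

@dataclass
class BucketResult:
    allowed: bool
    retry_after: float = 0.0

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate        # tokens refilled per second
        self.burst = burst      # maximum bucket size
        self.buckets = {}       # key -> (tokens, last_refill_timestamp)

    async def check(self, key):
        now = time.monotonic()
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return BucketResult(allowed=True)
        # Not enough tokens: tell the caller when the next one arrives
        self.buckets[key] = (tokens, now)
        return BucketResult(allowed=False, retry_after=(1 - tokens) / self.rate)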
The key details that made this actually useful (see the sketch after this list):
- Always return Retry-After headers so clients know when to try again.
- Include X-RateLimit-Remaining so clients can see their budget.
- Use different limits for different client tiers (partners get more than anonymous users).
- Separate limits per endpoint (the search endpoint can handle more than the payments endpoint).
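Here's a rough sketch of how those headers could be attached when a request is throttled. The header names match the list above; the handler wiring and the remaining-budget attribute are assumptions:

async def rate_limited_handler(request, limiter, inner_handler):
    result = await limiter.check_limit(request.client_id, request.path)
    if not result.allowed:
        # 429 with explicit guidance on when to retry
        return Response(
            status=429,
            data={'error': 'rate_limited', 'limit_type': result.limit_type},
            headers={'Retry-After': str(max(1, int(result.retry_after)))}
        )
    response = await inner_handler(request)
    # X-RateLimit-Remaining would come from the limiter's remaining-token
    # count for this client; shown here as an assumed attribute
    response.headers['X-RateLimit-Remaining'] = str(getattr(result, 'remaining', 0))
    return response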
That partner’s bug happened again six months later. This time, their requests got rate-limited, our other clients were unaffected, and I didn’t even find out until I checked the metrics the next morning.
Principle 5: If You Can’t See It, You Can’t Fix It
The scariest outages aren’t the ones where everything breaks. They’re the ones where something is subtly wrong and you don’t notice for days.
We had an issue where 3% of requests were failing with a specific error code. Not enough to trigger our availability alerts (we’d set those at 5%). Not enough for customers to flood support. But enough that hundreds of users per day were having a bad experience.
It took us two weeks to notice. Two weeks of a broken experience for real users.
After that, we built observability into every endpoint:
class ObservableEndpoint:
    async def handle(self, request):
        trace_id = self.tracer.start_trace()
        start_time = time.time()

        try:
            response = await self.process(request)

            # Record success metrics
            duration_ms = (time.time() - start_time) * 1000
            self.metrics.histogram('request_duration_ms', duration_ms, {
                'endpoint': request.path,
                'status': response.status
            })
            self.metrics.increment('requests_total', {
                'endpoint': request.path,
                'status': response.status
            })

            return response
        except Exception as e:
            # Record the failure with context
            self.metrics.increment('requests_errors', {
                'endpoint': request.path,
                'error_type': type(e).__name__
            })
            self.logger.error('request_failed', {
                'trace_id': trace_id,
                'error': str(e)
            })
            raise
Our minimum observability requirements now:
- Request count by endpoint, status code and client
- Latency percentiles (p50, p95, p99) by endpoint
- Error rate by endpoint and error type
- Distributed tracing across service boundaries
- Alerts at a 1% error rate, not 5%
The 3% failure issue? With our new observability, we would have caught it in minutes, not weeks.
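Catching a 3% failure rate quickly mostly comes down to alerting on the error ratio per endpoint instead of overall availability. A minimal sketch, assuming the requests_total and requests_errors counters emitted above are aggregated into a short window before the check runs; the window shape and threshold wiring are assumptions:

def check_error_rate(metrics_window, threshold=0.01):
    """metrics_window maps endpoint -> {'total': int, 'errors': int},
    aggregated over a recent window (say, the last five minutes)."""
    alerts = []
    for endpoint, counts in metrics_window.items():
        total = counts.get('total', 0)
        if total == 0:
            continue
        error_rate = counts.get('errors', 0) / total
        # Alert at 1%, well below the 5% threshold that hid the 3% failure
        if error_rate >= threshold:
            alerts.append((endpoint, error_rate))
    return alerts

In practice this check would live in the monitoring system rather than application code; the point is the lower threshold, not where it runs.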
The Results
After three years of applying these principles across our API infrastructure:
That last metric is the one I care about most. I went from being woken up twice a week to once every two months.
What I’d Tell My Past Self
If I could go back to before that first 3 a.m. call, I’d tell myself:
- Build for failure from day one. Every external call will eventually fail. Every database will eventually go down. Design for it before it happens, not after.
- Make the safe thing the easy thing. Requiring idempotency keys feels like friction until it saves you from a $27,000 mistake. Rate limiting feels unnecessary until a partner’s bug tries to take you down.
- Invest in observability early. You can’t fix what you can’t see. The cost of good monitoring is nothing compared to the cost of not knowing your system is broken.
- Boring is good. The clever solution that’s hard to debug at 3 a.m. isn’t clever. Version in the URL. Return clear error messages. Make the obvious choice.
APIs don’t survive by accident. They survive by design; specifically, by designing for the moment when everything goes wrong.
Now, when my phone buzzes at 3 a.m., it’s usually just spam. And that’s exactly how I like it.