roke Production Before Demo Day (10 Minutes Left)

Ten minutes before Demo Day, I took down production with a single command.

Not “degraded performance.” Not “a minor incident.”
Hard-down. The kind where the status page doesn’t even have time to update before Slack explodes.

I stared at my screen like it was going to apologize.

It didn’t.

The Setup: A Last-Minute Deploy Before Demo Day

The demo was simple: a clean customer journey, a fast checkout, a couple of analytics events firing, done.
We’d been rehearsing it all week.

But that morning, I noticed something “harmless”:

A config value still pointed to the old storage bucket
A feature flag name didn’t match the docs
One environment variable was missing in the production dashboard

Nothing dramatic. The kind of cleanup you do when you’re feeling responsible.

So I did what every confident engineer does right before a high-stakes demo:

I hot-fixed production.

No full pipeline. No canary. No staged rollout.
Just a quick deploy to “avoid embarrassment.”

I told myself: This is safer than showing broken behavior in the demo.

That sentence will age like milk.

The Click Heard Around the Company

I merged the change.
The CI passed.
The deploy started.

Then the monitoring dashboard turned into a crime scene.

CPU spiked.
Latency shot up.
Error rate went vertical.

And then: 500s everywhere.

At first, I assumed it was transient—warm-up, caches, something normal.
Then I opened logs and saw the line I’ll never forget:

“Cannot connect to database.”

Which is funny, because nothing in my change was “about the database.”

But production doesn’t care what you intended.

The Real Cause: How I Took Down Production

Here’s what actually happened:

That “minor config cleanup” changed a shared environment variable used by two services:

the app API (the thing I was thinking about)
the background worker (the thing I forgot existed)

The worker booted with the new config, failed to connect, entered a crash loop, and hammered the system with retries.
The queue backed up.
The API started timing out waiting on dependent calls.

Within minutes, the system was basically DDoS’ing itself—from the inside.

And because retries were “resilient,” it failed loudly and expensively.

Resilience without limits is just self-harm with extra steps.

The Worst Part: Everyone Was Watching

Demo Day wasn’t only internal. We had stakeholders. A partner team. A couple of investors.

My calendar reminder popped up:

“Demo starts in 10 minutes.”

My Slack popped up:

“Are we down?”
“Checkout is failing.”
“Is this related to the deployment?”

It’s amazing how fast you can type when your soul leaves your body.

The Recovery: Roll Back, Then Breathe

I did the only sane thing:

Rollback immediately (no debugging, no hero work)
Disable the worker to stop the retry storm
Scale down anything looping
Verify DB connectivity and queue health
Replay failed jobs after stabilization

We got production back in time for the demo.

Barely.

The demo happened.
It worked.

But I didn’t hear a single word of it.

All I could hear was the command I ran.

The Lesson: Preventing Production Incidents

Here’s what I learned: the problem wasn’t my code.

The problem was the process I skipped.

If you want to prevent this kind of incident, build guardrails that save you from yourself—especially when you’re stressed.

1) Freeze Production Before Big Events

Set a rule: no production changes X hours before demos/releases unless it’s a critical incident fix.

Most “quick fixes” aren’t critical. They’re anxiety.

2) Treat Config Like Code

Configuration is production code wearing a hoodie.

version it
review it
test it
deploy it via pipeline

If a change can break your system, it deserves the same discipline as source code.

3) Canary Deploys (Even for “Tiny” Changes)

Deploy to a small slice of traffic first.

If metrics spike, you catch it early—before “entire company” becomes your QA team.

4) Put Hard Limits on Retries

Retries need guardrails:

exponential backoff
jitter
max retry count
circuit breakers

Infinite retries don’t increase reliability. They increase blast radius.

5) One-Click Rollback Must Always Work

Rollback shouldn’t be “a plan.” It should be a reflex.

If rollback is risky or slow, you’ll hesitate.
Hesitation is how small incidents become big ones.

The Post-Demo Rule I Keep to This Day

After that day, I adopted one rule:

If it’s 10 minutes before a demo, production is not the place for bravery.

Because production doesn’t reward confidence.
It rewards discipline.

And if you ever break prod right before Demo Day…

Welcome to the club.

Just make sure you’re the person who builds the guardrails afterward.