
Most engineering failures are not surprises

One of the more irritating things about engineering is how often people talk about failure like it dropped out of the sky.

As if the outage, the missed handoff, the broken deploy, the mystery config drift, the undocumented dependency, the impossible-to-reproduce weirdness, the PM theatre, the “why is this suddenly my problem?” moment — as if all of that just materialised out of nowhere like an act of God.

It usually didn’t.

Most engineering failure modes are not surprises. They are neglected truths with a timestamp.

The signs are nearly always there:
- ownership is vague
- the standards are half-written or selectively applied
- the process relies on one person remembering how things work
- nobody wants to be the bad guy and force a proper boundary
- people confuse “we got away with it” with “this is sound”
- the system has known weak points that everybody steps around until eventually somebody steps on one

Then when it does go wrong, everyone acts like the problem is the latest symptom rather than the structure that made the symptom inevitable.

Failure is usually cumulative

A lot of failures are not single mistakes. They are layered compromises.

One weird naming convention no one cleaned up.
One secret stored somewhere stupid because it was “temporary”.
One pipeline that only Dave understands.
One shared folder with something sensitive in it because apparently that seemed fine at the time.
One undocumented exception.
One bit of tribal knowledge standing in for an actual standard.
One role boundary left fuzzy because “we’ll work it out as we go”.

Each one seems survivable on its own. And often it is. That is the trap.

The danger is not that any one compromise instantly kills the system. The danger is that enough of them accumulate and suddenly the system is no longer robust, it is just lucky.

There is a difference.

Heroics are not a control framework

A lot of engineering organisations quietly run on heroics and then act surprised when things are brittle.

If your delivery model depends on:
- the one engineer who knows the weird thing
- the senior who jumps in and sorts it when it goes sideways
- the person who remembers the hidden manual step
- the bloke everyone Slacks when the pipeline starts behaving like it has a head injury

then you do not have a robust system. You have a reliance loop with good PR.

Heroics are useful in emergencies. They are not meant to be the operating model.

The more often a team says “it’s fine, X can sort that”, the more likely it is that what they really mean is “we have not designed this properly and are currently borrowing resilience from a person”.

That debt gets called in eventually.

Ambiguity is one of the most expensive failure modes

People massively underrate ambiguity as a source of cost.

They think failure comes from lack of tooling, lack of process, lack of skill, lack of time. Sometimes it does. But a lot of the time the real problem is much less dramatic:

nobody is actually clear on who owns what, what “done” means, where the boundary is, what the standard is, or what should happen when reality deviates from the happy path.

That sort of ambiguity is murder.

Not dramatic murder. Slow murder. Administrative murder. The kind where five capable people can spend an entire week stepping around each other because the system never made the decision explicit.

And then someone writes a postmortem about communication.

No. The problem was structure.

“Temporary” is where a lot of bad architecture comes from

There should be a museum for things that were only meant to be temporary.

Temporary secrets.
Temporary access.
Temporary exceptions.
Temporary manual steps.
Temporary bypasses.
Temporary bits of infrastructure no one was ever going to love but everyone was willing to tolerate for now.

The issue with temporary is not that it exists. Sometimes a temporary measure is completely rational.

The issue is that most organisations have no real mechanism for turning temporary into either:
- permanent and properly governed
- or removed

So it just sits there, gathers dependencies, becomes socially normal, and eventually gets defended like it was a deliberate design decision all along.

That is how nonsense hardens into architecture.
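The mechanism does not have to be heavy. A minimal sketch of the idea, in Python — every "temporary" measure gets an owner and a review date at the moment it is introduced, and anything past its date has to be either properly governed or removed. All names and dates here are hypothetical:

```python
from datetime import date

# Hypothetical registry of "temporary" measures. The rule is simple:
# nothing gets to be temporary without an owner and an expiry attached.
TEMPORARY_MEASURES = [
    {"name": "manual deploy step for service-x", "owner": "dave",
     "review_by": date(2024, 3, 1)},
    {"name": "secret in the shared folder", "owner": "ops",
     "review_by": date(2025, 9, 30)},
]

def overdue(measures, today):
    """Return measures whose review date has passed. Each of these must
    now be made permanent and governed, or deleted -- no third option."""
    return [m for m in measures if m["review_by"] < today]

for m in overdue(TEMPORARY_MEASURES, date(2025, 1, 1)):
    print(f'OVERDUE: {m["name"]} (owner: {m["owner"]})')
```

Ten lines of list comprehension is obviously not the point. The point is that an expiry date forces a decision, and a decision is exactly what "for now" is designed to avoid.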

Failure often starts where responsibility gets blurry

One of the classic failure modes in engineering is the zone where something is clearly important but not clearly owned.

Not owned enough for someone to improve it.
Not unowned enough for someone to escalate it.
Just vague enough that everyone assumes somebody else probably has it.

That is where loads of operational rubbish lives:
- handoff gaps
- stale documentation
- pipeline weirdness
- access patterns no one is comfortable with
- exceptions no one wants to formally bless
- platform tasks that are “kind of shared”

Shared ownership is often just a polite way of saying nobody can force the issue.

And if nobody can force the issue, the system decays by default.

Postmortems often stop too early

A lot of postmortems are basically symptom autopsies.

They identify:
- what happened
- when it happened
- who was involved
- which exact thing broke

Fine. Useful as far as it goes.

But they often stop right before the important question, which is:

what made this sort of failure easy to produce in the first place?

That is the bit people avoid, because the answer is often structural, political, or culturally inconvenient.

Maybe the ownership model is weak.
Maybe the standards are too soft.
Maybe the process is theatre.
Maybe a team is carrying responsibilities it does not have the authority to enforce.
Maybe the organisation keeps treating delivery friction as an interpersonal issue instead of a systems issue.
Maybe the architecture is not bad in theory, but impossible in practice because it assumes a level of discipline the organisation simply does not have.

That is where the real answer usually is.

Engineering culture has a bad habit of romanticising tolerance for nonsense

There is a particular kind of engineer pride that says:
- I can work around this
- I can fix it on the fly
- I can hold the whole thing together
- I can compensate for the system being stupid

And to be fair, sometimes that is useful.

But there is a dark side to it: high-competence people can accidentally hide system failure by being too good at absorbing it.

They become shock absorbers for bad structure.

The organisation then misreads the situation and concludes things are basically okay, because the work keeps getting done.

Meanwhile, the competent people get more tired, more annoyed, and more central to everything. Which makes the system even weaker, not stronger.

A system that only works because strong people quietly carry weak structure is not healthy. It is being subsidised.

What I think actually helps

Most of the time, the fix is less glamorous than people want.

Not more slogans.
Not more maturity-model chat.
Not more decorative governance.
Not another six-layer process for asking permission to move a button.

Usually it is things like:
- make ownership explicit
- make boundaries enforceable
- remove hidden manual steps
- document the path people actually use, not the one you wish they used
- turn tribal knowledge into standards
- reduce reliance on specific people
- design the boring path properly so people stop needing heroic exceptions

In other words: make the system easier to use correctly than incorrectly.
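"Make ownership explicit" can be as mundane as writing it down where tooling enforces it. A sketch using GitHub's CODEOWNERS format — the paths and team names are invented for illustration:

```
# Hypothetical CODEOWNERS file: each path has a named owner,
# so review requests route automatically and "kind of shared"
# stops being an answer.
/infra/            @platform-team
/pipelines/        @data-eng
/docs/runbooks/    @sre-oncall
*                  @eng-leads   # fallback: nothing is unowned
```

The file itself changes nothing about behaviour. What changes things is that "who owns this?" now has a machine-readable answer someone can be held to.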

A lot of engineering pain comes from asking human discipline to compensate for structural laziness. That is backwards.

Final thought

Most engineering failure modes are not mysterious.

They are visible, recurring, and broadly predictable. The problem is not usually that people could never have seen them. The problem is that systems let them sit around long enough to become normal.

And once a bad pattern becomes normal, people stop treating it like risk and start treating it like background texture.

That is usually the moment you should worry.

Because when failure finally shows up, it rarely arrives as a surprise.

It arrives as the bill for things everyone already knew were wrong.