
It should really say "overcomplicated systems tend to break more frequently".

It is not simplicity that makes for less downtime, it is unnecessary complication that does the opposite.

I spend time complicating my applications a little to make sure there is no downtime, which is pretty important when one of the largest banks on Earth would grind to a halt along with your application.

The simplest solutions typically would not be able to ensure no-downtime operation. I need code so that I can do rolling upgrades, and I need code so that my application can partition work and rebalance it reliably as the cluster map changes.
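The comment doesn't say how that partitioning is done; one common technique for this kind of rebalancing is consistent hashing, where adding or removing a node moves only a small fraction of the keys. A minimal sketch (all names here are illustrative, not from the original):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map work items to nodes so that adding or removing a node
    only reassigns a small fraction of the keys."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points on the ring to even out load.
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        # A key belongs to the first ring entry at or after its hash,
        # wrapping around at the end.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]
```

With this, when the cluster map shrinks from three nodes to two, only the keys owned by the departed node change hands; everything else stays put.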

The problem starts when you overdo it. Maybe you are expecting too much in terms of guarantees. Or you want a simple guarantee but then you duct-tape a huge and complicated clustering solution to it. Now you have a lot of problems. Your team doesn't know how it works. Your team doesn't know how it fails. It is not easy to tell whether you have integrated it the right way. And so on.

The goal, as usual, should be to "keep it simple, but not simpler than necessary".

For example, the approach we chose was to get by with as few guarantees as possible, implemented as simply as possible.

We decided on immutable data. Objects are saved as documents, each holding the object's entire state after every change. This costs a lot in space and processing, but you know what? It is fine. I work for a bank, after all. The one thing that costs more than space and processing power is downtime, and that is what we are trying to focus on. We know how to deal with duplicate data.
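The scheme described above, writing a whole new document per change rather than updating in place, can be sketched roughly like this (the class and its API are illustrative assumptions, not the bank's actual code):

```python
import copy
import time

class DocumentLog:
    """Append-only store: every change writes a new immutable document
    holding the object's *entire* state; nothing is updated in place."""

    def __init__(self):
        self._docs = []  # list of (object_id, version, timestamp, state)

    def save(self, object_id, state):
        # Version is just a count of prior documents for this object.
        version = 1 + sum(1 for d in self._docs if d[0] == object_id)
        self._docs.append((object_id, version, time.time(), copy.deepcopy(state)))
        return version

    def latest(self, object_id):
        # Scan backwards: the most recent document wins.
        for oid, version, ts, state in reversed(self._docs):
            if oid == object_id:
                return copy.deepcopy(state)
        raise KeyError(object_id)
```

Because nothing is ever overwritten, a retried write at worst appends a duplicate document with the same state, which is exactly the "we know how to deal with duplicate data" trade-off: extra storage in exchange for a simpler recovery story.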

So another rule of thumb: try to find compromises in your application, use them to replace hard problems with easier problems.

Making an application super reliable is a hard problem. Adding more storage space and memory is a (relatively) easy one. If you can solve a hard problem by replacing it with an easier one, you are winning.



Indeed. Ships contain redundancy. Should this complication be removed to reduce downtime?

Over-engineering does not mean "making the thing worse". It does if your engineers don't know what they're doing; if they do, then the added complexity increases reliability.


Regarding "no downtime", there are very few applications where that's actually a good goal to have at all. Quite often, simple systems can provide you high availability where your downtime is a few seconds or minutes at most (during larger maintenance operations), and in many cases you can hide those blips by simply retrying (with proper backoff). There aren't many systems where you actually need to guarantee "no downtime".

What I think matters most of the time, instead of zero downtime, is predictable behaviour when failures occur, so that your system doesn't end up in an undefined state where you don't know whether you can recover without data loss.


There are more systems that can't tolerate downtime than you think. That is mostly because you treat them as the fabric of everything around you and only notice when they fail.

Mobile networks? Power delivery (basically all utilities)? Broadband internet? Factories? Payment systems? Air traffic control? Any internet services at all?

One might think that only Google or Facebook need to maintain high availability, but basically any internet service needs to do that or face the possibility of losing clients.

A smaller company may have fewer clients, but downtime still affects its clients, and in turn the company, in the same way. Losing 10% of your clients after a snafu is as painful when you have 1 million clients as when you have 100.

It is also not about having absolutely no downtime; that usually cannot be guaranteed. It is about having less downtime.


All those systems have hours, even days, of cumulative downtime per year, planned and unplanned. The sky does not fall. When you are at "three nines", the other 0.1% (about 8.8 hours a year) is the downtime you're tolerating.

Power outages of a few minutes to a few hours are utterly normal. Power outages of up to a few days due to summer heat and winter storms are part of the rhythm of life, depending on where you live. Facilities that really care about power continuity have batteries and generators (although these aren't perfect either; we once lost a datacenter to transfer switch maintenance).

Broadband is notoriously flaky, to the point that cable technicians' vague arrival windows are a meme. Serious businesses get several independent connections. Even consumers can now fall back to tethering their phones.

Credit card authorization gets skipped during downtime. Actual payment settlement occurs in nightly batches, which humans have many hours to shepherd and patch. FedWire keeps banker's hours. Stock markets suspend trading when necessary.

Stopping the line is a normal part of the lifecycle of a manufacturing process: when something goes wrong, when there's going to be an upgrade, even for regular scheduled maintenance.

Most internet services have some downtime.


Of the things you listed, the telephone network is the only one where I don't recall experiencing downtime, probably because it degrades gracefully to reduced functionality when something fails. Can't say much about air traffic control, but I imagine they have failures too and just have backup protocols in place when primary systems fail. A payment system is probably the closest to a computer system where you really don't want to drop any incoming requests, though after the initial payment event has been recorded, the behind-the-scenes processing can tolerate quite a lot of delay in the worst case.

Networks certainly fail all the time, and power delivery issues aren't uncommon either; the downtime just tends to be localized and if you depend on your internet connection or power, you have backup links and UPS systems that reduce or avoid the impact of downtime.

I never said you shouldn't strive for high availability, especially if you're moving vast amounts of customer traffic. I specifically argued against "no downtime", because people often seem to think that if your system doesn't have five nines of uptime it's unsuitable for handling "real" traffic, and that achieving enough availability somehow requires a highly complex system.



