
It should really say "overcomplicated systems tend to break more frequently".

It is not simplicity that makes for less downtime, it is unnecessary complication that does the opposite.

I spend time complicating my applications a little to make sure there is no downtime, which is pretty important when one of the largest banks on Earth would grind to a halt along with your application.

The simplest solutions typically would not be able to ensure no-downtime operation. I need code so that I can do rolling upgrades, and I need code so that my application can partition work and rebalance it reliably as the cluster map changes.
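The comment doesn't say how that partitioning is done; one common technique for this kind of rebalancing is consistent hashing, where adding or removing a node moves only a small fraction of the keys. A minimal sketch (all names here are illustrative, not from the original):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map work items to nodes so that adding or removing a node
    only reassigns a small fraction of the keys."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points on the ring to even out load.
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        # A key belongs to the first ring entry at or after its hash,
        # wrapping around at the end.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]
```

With this, when the cluster map shrinks from three nodes to two, only the keys owned by the departed node change hands; everything else stays put.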

The problem starts when you overdo it. Maybe you are expecting too much in terms of guarantees. Or you want a simple guarantee but then you duct-tape a huge and complicated clustering solution to it. Now you have a lot of problems. Your team doesn't know how it works. Your team doesn't know how it fails. It is not easy to tell whether you have integrated it the right way. And so on.

The goal, as usual, should be to "keep it simple, but not simpler than necessary".

For example, the approach we chose was to get by with as few guarantees as possible, implemented as simply as possible.

We decided on immutable data. Objects are saved as documents, each holding the object's entire state after every change. This costs a lot in space and processing, but you know what? It is fine. I work for a bank, after all. The one thing that costs more than space and processing power is downtime, and that is what we are trying to focus on. We know how to deal with duplicate data.
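The scheme described above, writing a whole new document per change rather than updating in place, can be sketched roughly like this (the class and its API are illustrative assumptions, not the bank's actual code):

```python
import copy
import time

class DocumentLog:
    """Append-only store: every change writes a new immutable document
    holding the object's *entire* state; nothing is updated in place."""

    def __init__(self):
        self._docs = []  # list of (object_id, version, timestamp, state)

    def save(self, object_id, state):
        # Version is just a count of prior documents for this object.
        version = 1 + sum(1 for d in self._docs if d[0] == object_id)
        self._docs.append((object_id, version, time.time(), copy.deepcopy(state)))
        return version

    def latest(self, object_id):
        # Scan backwards: the most recent document wins.
        for oid, version, ts, state in reversed(self._docs):
            if oid == object_id:
                return copy.deepcopy(state)
        raise KeyError(object_id)
```

Because nothing is ever overwritten, a retried write at worst appends a duplicate document with the same state, which is exactly the "we know how to deal with duplicate data" trade-off: extra storage in exchange for a simpler recovery story.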

So another rule of thumb: try to find compromises in your application, use them to replace hard problems with easier problems.

Making an application super reliable is a hard problem. Adding more storage space and memory is a (relatively) easy one. If you can solve a hard problem by replacing it with an easier one, you are winning.



Indeed. Ships contain redundancy. Should this complication be removed to reduce downtime?

Over-engineering does not mean "making the thing worse". It does if your engineers don't know what they're doing; if they do, then the added complexity increases reliability.


Regarding "no downtime", there are very few applications where that's actually a good goal to have at all. Quite often, simple systems can provide you high availability where your downtime is a few seconds or minutes at most (during larger maintenance operations), and in many cases you can hide those blips by simply retrying (with proper backoff). There aren't many systems where you actually need to guarantee "no downtime".

What I think matters most of the time, instead of zero downtime, is predictable behaviour when failures occur, so that your system doesn't end up in an undefined state where you don't know whether you can recover without data loss.


There are more systems that can't tolerate downtime than you think. That is mostly because you treat them as the fabric of everything around you and only notice when they fail.

Mobile networks? Power delivery (basically all utilities)? Broadband internet? Factories? Payment systems? Air traffic control? Any internet services at all?

One might think that only Google or Facebook need to maintain high availability, but basically any internet service needs to do that or face the possibility of losing clients.

A smaller company may have fewer clients, but downtime still affects its clients, and in turn the company, in the same way. Losing 10% of your clients after a snafu is as painful when you have 1 million clients as when you have 100.

It is also not about having absolutely no downtime; that usually cannot be guaranteed. It is about having less downtime.


All those systems have hours, even days, of cumulative downtime per year, planned and unplanned. The sky does not fall. When you are at "three nines", the other 0.1% (about 8.8 hours a year) is the downtime you're tolerating.

Power outages of a few minutes to a few hours are utterly normal. Power outages of up to a few days due to summer heat and winter storms are part of the rhythm of life, depending on where you live. Facilities that really care about power continuity have batteries and generators (although these aren't perfect either; we once lost a datacenter to transfer switch maintenance).

Broadband is notoriously flaky, to the point that cable technicians' vague arrival windows are a meme. Serious businesses get several independent connections. Even consumers can now fall back to tethering their phones.

Credit card authorization gets skipped during downtime. Actual payment settlement occurs in nightly batches, which humans have many hours to shepherd and patch. FedWire keeps banker's hours. Stock markets suspend trading when necessary.

Stopping the line is a normal part of the lifecycle of a manufacturing process: when something goes wrong, when there's going to be an upgrade, even for regular scheduled maintenance.

Most internet services have some downtime.


Of the things you listed, the telephone network is the only one where I don't recall experiencing downtime, probably because it degrades gracefully to reduced functionality when something fails. Can't say much about air traffic control, but I imagine they have failures too and just have backup protocols in place when primary systems fail. A payment system is probably the closest to a computer system where you really don't want to drop any incoming requests, though after the initial payment event has been recorded, the behind-the-scenes processing can tolerate quite a lot of delay in the worst case.

Networks certainly fail all the time, and power delivery issues aren't uncommon either; the downtime just tends to be localized and if you depend on your internet connection or power, you have backup links and UPS systems that reduce or avoid the impact of downtime.

I never said you shouldn't strive for high availability, especially if you're moving vast amounts of customer traffic. I specifically argued against "no downtime", because people often seem to think that if your system doesn't have five nines of uptime it's unsuitable for handling "real" traffic, and that achieving enough availability somehow requires a highly complex system.



