Hacker News
Simple Systems Have Less Downtime (2020) (gkogan.co)
303 points by gk1 on Aug 4, 2021 | 210 comments


Previous discussion from a year ago, when it was first submitted:

https://news.ycombinator.com/item?id=22471355


“Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.” – Edsger W. Dijkstra


I think every talk on complexity is meaningless without taking into account the difference between accidental and essential complexity.

There are problem domains that implicitly require a certain minimum of complexity. You can’t write an insanely simple program that will render vector fonts, simply because the problem inherently carries a fixed amount of complexity. Anything more (accidental complexity) is of course bad, and “clean” code or whatever should try to minimize its amount, but it is harmful in my view if we keep believing that the essential part could be made simpler. One really bad example is this list: http://harmful.cat-v.org/software/


>> I think every talk on complexity is meaningless, without taking into account the difference between accidental and essential complexity.

Yeah. This is both true and pointless.

Nobody codes things too complex on purpose, at least not normal people. So it's not the difference in types of complexity that matters, it's the difference in our ability to understand accidental and essential complexity.

I find that I do a really poor job at this, and I'm the first to stand on the soapbox and rail against systems being too complex. Just like programmers naturally introduce bugs into code without realizing it, programmers naturally introduce complexity into code without realizing it.

We may theoretically be able to talk about the differences in type, or how to manage each, but our real problem begins with the conceptual models of various types of problems inside our brains, and this happens long before any code is ever written.

In my personal practice I've come up with a few gimmicks that help me refactor my preconceptions. It is very difficult, however, for people to adopt practices that continue to inform them that they make lots of mistakes. Everybody wants to code as if they're conquering the boss-level at the end of a game. We sell coding, tooling, frameworks, and practices to other coders under the assumption that they're going to have a blast, not that they're going to continuously be reminded that they're screwing things up.

So yes, I agree, but without moving past that statement into something more actionable, it's more of a truism than a starting point.


I've worked with a lot of people who'd never thought about the difference between accidental and essential complexity.

This resulted in them coding things too complex unintentionally, so while not on-purpose, it was absolutely a useful conversation to have. It's a good starting point.

I don't have a good single rule for a "second step," though. It's going to depend on the details of your project, by and large, though I think some principles to try to strive for include composition, encapsulation of details, and so on - nothing new or startling, but just adding the "is this essential or accidental" lens to the decision-making process.


Yup. They don't think about it, but the mind is a funny thing. If you ask them why they've done something a certain way (without bringing up the topic of complexity), folks usually have some good reasons for the things they do.

However, if you bring up the topic, then take a look at some code or architecture? Then suddenly we're talking about all the ways we might have done it better/less-complex if things had been different.

Most every coder I know understands the topic, and most are even willing to go on at length about how important it is, including me! (grin). It's the actual application where things fall apart.

>> I don't have a good single rule for a "second step," though.

I do. It was bugging me so I spent a couple of years coming up with one. Seems to work great for me. YMMV. I do not believe it is as context-dependent as most in our industry seem to think. (It's very much problem dependent, though. It's just vastly more related to the business problem than the technical considerations. [Discussion goes here about tech-related problems not related to the solution foisted on the solutions team])


> Nobody codes things too complex on-purpose, at least not normal people. So it's not the difference in types of complexity, it's the difference in our ability to understand accidental and essential complexity.

It's not so simple — I've seen people create over-architected monsters for many reasons: because they thought everyone coded the way some consultant's magazine article/talk/book claimed; because they were trying to be taken seriously and didn't think they could push back; because they had talked their way into a job they weren't qualified for, and tossing around buzzwords was a way to obscure the fact that they had no idea what they were doing; because the client insisted that all real projects must be run the way the consultants in expensive suits assured them everyone serious runs them (ignoring massive differences in scope, resources, or long-term success); or because the institutional incentives made it easier to claim their project would handle everything anyone had ever thought might be useful than to work with the actual users to build multiple smaller, simpler tools.

One way to think about that is that it's neither accidental nor essential complexity but environmental. Fixing that can be quite hard, but it's often worth the political capital if you have it, because the alternatives might not be survivable.


> accidental and essential complexity

I feel like I am turning into a bot for posting the Out of the Tar Pit paper:

http://curtclifton.net/papers/MoseleyMarks06a.pdf

This changed my understanding of computer science and of our product virtually overnight. We are using a hybrid model of Functional Relational Programming (see section 9 in the paper). This is in production right now and it's clearly the right answer for managing complexity in our product.

If you think "eww, I have to learn some FRP language" - no. You just need to tack SQLite onto whatever preferred language you use today and model all your business logic as SQL queries over properly-normalized tables. If you can achieve this with 6NF, you have an infinitely-extensible domain model. If you don't know where to start, 3NF is the safest place. Most humans tend to think in terms of 3NF when referring to business entities.
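For what it's worth, here is a minimal sketch of the idea: business logic expressed as SQL queries over normalized tables, with SQLite tacked onto a host language (Python here). The schema and the "big spender" rule are made-up illustrations, not anything from the parent.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total_cents INTEGER NOT NULL
);
""")
db.execute("INSERT INTO customers VALUES (1, 'Acme')")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, 1, 500), (2, 1, 1500)])

# The business rule lives in a query, not in application code.
row = db.execute("""
    SELECT c.name, SUM(o.total_cents) AS spend
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
    HAVING SUM(o.total_cents) >= 1000
""").fetchone()
print(row)  # ('Acme', 2000)
```

Changing the rule then means changing a query (or a view) against a common schema, rather than changing application code.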


The problem is that performance and normalization do not (always) go well together.

Let's say you have billions of rows of event data you want to perform summary counts for by a few different key columns.

Doing this up front, as the events are ingested, allows much more efficient querying against an already-grouped table than having to group over your billions of events in each SELECT query.

I'm not saying don't normalize. But normalizing creates its own problems, too, that you may need to think about.
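A hedged sketch of that up-front aggregation, with SQLite standing in for the event store (the table names are invented for the example): the rollup is maintained at ingest time, so reads never have to scan the raw events.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE events (key TEXT);  -- raw, append-only
CREATE TABLE event_counts (key TEXT PRIMARY KEY, n INTEGER NOT NULL);
""")

def ingest(key):
    # Maintain the summary as each event arrives, instead of
    # GROUP BY-ing over billions of rows in every SELECT.
    db.execute("INSERT INTO events VALUES (?)", (key,))
    db.execute("INSERT OR IGNORE INTO event_counts VALUES (?, 0)", (key,))
    db.execute("UPDATE event_counts SET n = n + 1 WHERE key = ?", (key,))

for k in ["a", "b", "a"]:
    ingest(k)

counts = db.execute("SELECT key, n FROM event_counts ORDER BY key").fetchall()
print(counts)  # [('a', 2), ('b', 1)]
```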


> The problem is that performance and normalization do not go well together.

> Let's say you have billions of rows of event data you want to perform summary counts for by a few different key columns.

Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

Such a terrible tradition in our industry of focusing on outlier cases, or on what FB/Google/... might need, when making technical decisions.


>Guess what? 99% of people on here don't have billions of rows of event data they need to regularly aggregate. Problem solved.

Having built a real-time analytics solution a few years ago (because we wanted one we controlled for our startup), it doesn't take much to get to billions of rows of event data on the modern web.


At last $dayjob we were a speck compared to even the smallest "webscale" consumer B2C app and we gathered 800mil tracing events a week.


Yup. How many variations of analytics startups/companies exist out there? How many customers do they each have with how many events happening per customer?

Analytics on large-ish data is definitely a common challenge. And sure, CRUD is even more common. But knowing what your options are for each challenge makes sense.


This relates: https://news.ycombinator.com/item?id=28047618

If the giants are generating exabytes per year, it seems reasonable to expect that your smaller application can end up with a few terabytes in the same time.


OK well I'm telling you based on my experience and it wasn't at FB or Google. :) Do most people have these problems, no you're right. Should everyone ignore them and be unaware of options when they do have the problem? Probably not?


I had those problems several times within the last few years - having to aggregate and detect various signals from billions of data points - and keep the system flexible enough that different business people could “try out” various ideas and play with the results.

All sorts of different patterns and architectures have to be brought together to make sense of it.

But the technique described above still has its place within such a system. For example - you boil billions of data points down to a handful (several thousand) facts, such as events that occurred or anomalies. Then those several thousand facts can be analysed standalone (as described above) without going back to the source. Etc.


Sqlite has triggers, and Sqlite has indices, and Sqlite lets you write custom functions in your host language.

While I'm not necessarily arguing in favour of using Sqlite for this, it most certainly can do this up front during ingestion.

So yes, sometimes you do end up sacrificing normalisation, but often that means keeping a normalised form, and having code that can re-generate summary data from the normalised form.
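As a sketch of that point (the schema here is assumed for illustration): a trigger keeps the summary current during ingestion, while the normalised events table remains the source of truth from which the summary can always be regenerated.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE events (key TEXT, value INTEGER);  -- normalised source of truth
CREATE TABLE totals (key TEXT PRIMARY KEY, total INTEGER NOT NULL);
CREATE TRIGGER events_rollup AFTER INSERT ON events BEGIN
    INSERT OR IGNORE INTO totals VALUES (NEW.key, 0);
    UPDATE totals SET total = total + NEW.value WHERE key = NEW.key;
END;
""")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [("a", 2), ("a", 3), ("b", 1)])

# The cached summary agrees with what a full re-aggregation yields:
cached = db.execute("SELECT key, total FROM totals ORDER BY key").fetchall()
rebuilt = db.execute(
    "SELECT key, SUM(value) FROM events GROUP BY key ORDER BY key").fetchall()
print(cached == rebuilt)  # True
```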


The question to ask is whether building a cache of normalized data will be more efficient than addressing the complexities of non-normalized data, eg. duplication, renaming, integrity problems, etc.


I've seen the downside of this. And it's having to have an on-call staff of DBAs hand-grooming fragile databases and running deduplication processes that sometimes get so far behind that it's a mathematical impossibility to catch up with new data coming in. (Never mind what happens to your maintenance schedule when you have entire teams focused on "fixing this problem" for weeks at a time.)

If it were up to me (and it almost never is, because I'm not a DB expert), I would ALWAYS normalize as much as possible when designing a database. I really can't even wrap my brain around why large (and old) databases ever get into this state. But I've seen it at two different employers and it's very painful (and costly) to deal with. My take was that these databases were probably originally set up by people who had no idea what they were doing, and ended up locking the company into a shitty implementation that kept the company crippled 15 years later. But that's just me.


Totally. All I mean to say is that "everything should always be normalized" doesn't necessarily make sense. You need to consider your situation.


Premature optimization is the root of all evil. It’s unlikely normalizing data is going to be an actual problem you can’t solve via caching.


I hate how this quote has been perverted. “Premature optimization” was meant to apply to people making their programs an unreadable/unmaintainable mess to save a couple of CPU instructions. Figuring out how to organize your data is not premature optimization, it’s literally one of the first things you have to tackle when starting a project, because trying to change it down the line is going to be miserable, especially if you can’t afford any downtime.


Yes, and I think the parent was pointing out that denormalization is often a premature optimization that makes the data schema an unreadable/unmaintainable mess to save a few CPU instructions. The normalized schema is answering the question of "how to organize your data". Denormalizing that schema is an attempt at optimization.


> The problem is that performance and normalization do not (always) go well together.

Absolutely agree. That is why we use in-memory SQLite projections of business state that are scoped per-user-session so that row counts never exceed double digits in any given table.

For us, adding indexes to these special database instances would actually make things go slower.
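A rough sketch of what such a per-session projection might look like (the "main store" and schema here are hypothetical, not the poster's actual setup): each session copies one user's slice into its own in-memory database, so every table stays tiny.

```python
import sqlite3

main = sqlite3.connect(":memory:")  # stands in for the full business-state store
main.execute("CREATE TABLE orders (user_id INTEGER, item TEXT)")
main.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "hat"), (2, "mug"), (1, "pen")])

def open_session(user_id):
    # Project just this user's rows into a private in-memory database;
    # queries against it never touch more than a handful of rows.
    session = sqlite3.connect(":memory:")
    session.execute("CREATE TABLE orders (item TEXT)")
    rows = main.execute("SELECT item FROM orders WHERE user_id = ?", (user_id,))
    session.executemany("INSERT INTO orders VALUES (?)", rows)
    return session

s = open_session(1)
items = s.execute("SELECT item FROM orders ORDER BY item").fetchall()
print(items)  # [('hat',), ('pen',)]
```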


Use a columnar database for this. Use materialized views. Don’t do this on the prod DB for the CRUD app. Simple :)


That sounds good!


I'm curious, can you recommend an open source example that implements these ideas? My first reaction when I read

> You just need to tack SQLite onto whatever preferred language you use today and model all your business logic as SQL queries over properly-normalized tables.

was to think "well that can't work for _all_ my business logic". But I'd like to see how this idea works in practice before I jump to that conclusion.


I want to second this.

My first thought was the same as yours (no way that works for everything).

It seems like a neat idea. But I can't imagine that anyone has ever done this for anything non-trivial. I want some sort of compelling argument that this works at all before I'm willing to accept this as anything else than a pipe dream.


I am curious what edge cases (i.e. "non-trivial" things) you have in mind that this would not work for.

For us, the more complex the business gets, the more justified this path becomes. Our business fails to scale if we have to write custom code for every customer. Writing custom SQL per customer (against a common schema) is far more tenable.

Are you thinking of some specific logical determination that would be infeasible in SQL vs in code? Or, is this more about webscale arguments?


It's the same sort of disbelief that I would have towards being introduced to a skyscraper made out of cardboard. At first I would think, "oh, this is just like concept art." ~No, we actually built it.~ "Err, so, like nobody can actually go in there." ~It supports people just fine.~ "Um, for what, 20 minutes." ~Well, Nick has lived in it for 20 years now.~

Finally, I'm not going to venture in (let alone actually live in it myself) without someone sitting me down and carefully explaining the structural integrity principles (and other logistics) that make the cardboard skyscraper possible.

Hearing that business-rules-in-SQL works for some unknown entity doesn't really give any information about its viability for anything else. Maybe the initial modeling is really hard, but all your customers have near-identical concerns, so you only pay that cost once. Maybe having multiple customers is actually the key, because if one customer's needs violate some assumption that makes the whole thing work, you can tell them to go someplace else. Maybe your customers only have five business-logic rules apiece that are all well spec'ed.

What's the domain? The industry? What problems are actually being solved? Can we have an example (even a contrived one)? Is there an open source example? How about a mathematical proof or even principle that argues for why this can be expected to always (or even nearly always) work? How about a youtube video with someone describing the approach?

EDIT: How about this. Show me a partial parser using this approach (grammar rules can be considered business logic, right?). Just parse function declarations in C or something. I'll be able to project such an example onto what the full solution would look like, to see if I believe it could actually work in general.


Your cardboard skyscraper analogy is fantastic. It does a good job capturing the tone of my disbelief.

I'm disappointed to see there wasn't a response because I was pretty willing to entertain the notion of business-logic-as-normalized-tables. But I needed, like, any evidence.


Going to high levels of normalization also makes the code harder to understand. It's very extensible, but every system that I have dealt with that had high levels of normalization was a huge pain to figure out.


I wrote a piece about that some time ago. The vast majority of the complexity in my professional career now is extracurricular complexity, and that wasn't true a decade ago. It's build processes, it's CI, it's containers, it's a dozen arbitrarily different frameworks, microservices, dependency management, it's all of that stuff.

The trope I see all the time is enterprise tooling in small projects. Small teams basically moonlighting as DevOps 50% of their time, significantly reducing their capacity to solve the actual domain problems.

The domain complexity is something you can't remove, but complexity you've chosen to introduce through tooling should be thought about hard, and the less time you spend thinking about stuff outside the problem you're actually trying to solve, the better.


But I really enjoy painting bike sheds and shaving yaks!

YAGNI is so hard for people to really grok.

Get the revenue model right and suddenly every feature is cheap; get it wrong and every decision is infinitely expensive.


>without taking into account the difference between accidental and essential complexity.

This is acutely summarized by Larry Wall's "Waterbed Theory of Complexity", which is basically Einstein's "Make things as simple as they have to be, but no simpler." I.e., complexity is like water -- not very compressible. If you compress one part of the bed, it will bulge out elsewhere, because it's all connected.

Many discussions on complexity do make this distinction.


"Essential" is always relative to requirements. If the requirements include interacting with a poorly-designed, buggy, complicated piece of other software, then yeah, you have essential complexity. Enlightenment is when you realize that you can keep zooming out and dumping what seem like essential requirements but are really just BS that follows from interacting with constantly-changing crapball software stacks.


Nah, that’s a cop out. There’s nothing enlightening about shrugging off complexity.

If a system is hard to interface with, the complexity is still accidental, it is just outside your own system and maybe out of your control.

Even if you’re a “middleware” company that connects multiple shitty systems together, you’re still adding zero value to anything but your own pocket. The complexity is still there.

We still have the right to call a spade a spade. In the past I had to say “no” to managers about interfacing with shitty systems. Sometimes you CAN make a difference.


Aren't we kind of agreeing, given your last two sentences?


I'd say "essential" refers to the value statement.

The translation of that into requirements is by itself one of the largest sources of accidental complexity.

And even the value statement sometimes is wrong and leads to unnecessary complexity too.


> The translation of that into requirements is by itself one of the largest sources of accidental complexity.

Good point!


Actually I think Dijkstra's comment still applies. A lot of the time the essential complexity is due to design. That design may have said essential complexity because the design is poor, when a different design would have less essential complexity. But as Dijkstra says, the design with more essential complexity often sells better.


You can't say "more essential complexity" because that violates the definition. It's the complexity required by the task and that is constant and independent of the design.


That page is a joke, right? That last alternative really makes it seem that way. If you want simplicity, why would you use sed to get the first lines of a file? head is a much simpler program with a single job, while sed has its own text-transformation language.


I totally agree, but I think today’s systems have more incidental than essential complexity.

The sentiment behind that list is okay, but it does demonstrate a lot of ignorance. The biggest joke is simple tables or file systems as a SQL alternative. If you do that for anything that needs more than just a map, you will eventually end up with a badly implemented, buggy, slow relational database.


Truest comment I’ve read so far.

If simplicity is always the right solution, then why do so many programming languages have dedicated parsing libraries for .csv, which has to be the world's simplest data format?


CSV isn’t so much “simple” as “underspecified”. Sure, Comma separated values, but what happens when a value contains a comma? It emerges that people normally use quoted strings ‘"a,b"’ for that, but then what do you do when you need to include quote characters in your values? Etc. etc.

The basic rule for CSV is: don’t. Or at least use a library which emits RFC 4180-compatible results. If you need to parse some so-called “CSV” non-RFC-compatible monstrosity, do whatever you need to parse it, but don’t have any illusions of it being in any way “standard”.
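For illustration, Python's standard csv module follows the RFC 4180 conventions described above, so commas and quotes inside values round-trip safely:

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['a,b', 'say "hi"'])
# Embedded commas force quoting; embedded quotes are doubled per RFC 4180.
print(repr(buf.getvalue()))  # '"a,b","say ""hi"""\r\n'

row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['a,b', 'say "hi"']
```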


This is the problem with “artificial simplicity” – people make systems too simple, so the abstractions leak heavily, and by the time you want to get anything done you have a massive pile of leaked complexity to wade through.


> keep believing that it could be made simpler. One really bad example is this list: http://harmful.cat-v.org/software/

That's indeed a pretty bad example, regrettably of a website often shared in programming circles. The author simply listed his beliefs about stuff he doesn't like vs stuff he likes, and never bothered to justify them. In many places the list is obsolete, too.


I don't understand how they could classify CSV in the "less harmful" column -- its rules are insanely complex.

Also, UTF-32 seems way simpler conceptually than UTF-8, so I don't understand how UTF-8 ended up "less harmful" and UTF-32 ended up "harmful".


Would turning that font rendering into a web service an accidental complexity, or an essential one?


Accidental, of course. There is never a need to introduce latency in text delivery, simply a long line of unmitigated wants.


That reduces a number of companies, e.g. Slack, to an “accidental complexity” then.


Are you saying that you think Slack's primary challenge is font rendering?


No, it’s providing a web service to do something that would work (and used to work) much better without it.

Their primary challenge is making money, and that part probably is better as a web service, but it makes things worse from a technical point of view. Just like with the hypothetical FaaS - worse technically, but you can have ads.


I'll bite, what exactly worked much better than Slack without the web?


Pretty much anything, starting from IRC - it offered more functionality, from logging to scripting, but the general idea is the same.


essential complexity == simplicity. Make it as simple as it can be but not simpler.

To me it’s sort of implied that it’s unneeded complexity when we talk about it as something that should not be there and it’s making our life hard.


Most technical people don't get paid to solve problems, they get paid to work on problems. That paycheck is an hourly/weekly/monthly paycheck.


I remember one dev kept talking about simplicity, even made a session about it and then just lumped business logic, data access and transfer layer mapping into a single API file...


What's the problem?


Only works for small projects, doesn't scale in terms of maintenance.

Same sort of issue makes people use microservices not because of their true advantages but simply to enforce boundaries between features because so many people lack discipline to make a well-structured (distributed) monolith.


Without knowing the problem it’s hard to say if they were right or wrong to do that.


I just know that "divide and conquer" works and impure functions with side effects don't.
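A toy illustration of the contrast (the names are mine, purely for the example): the impure version hides state, so call order silently matters; the pure version takes all inputs explicitly and composes.

```python
from functools import reduce

total = 0

def add_impure(x):
    global total        # hidden state: the result depends on call history
    total += x
    return total

def add_pure(acc, x):   # every input explicit; trivial to test and compose
    return acc + x

result = reduce(add_pure, [1, 2, 3], 0)
print(result)  # 6
```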


I mean, your statement assumes a definition of "work". Reusability and maintainability are features, and for certain scopes they may be overkill. I think this is where the really good software developers shine: you have to know the rules before you break them, for sure, but pretending there aren't opportunity costs in software development is a big problem. Shipping anything gets you feedback, and that feedback can drastically change what you decide to build.

This is coupled with organizational dynamics where to get resources you have to prove value. Proving value is a lot harder than scaling/rewriting a software system, because if you need to scale a software system, you have rewards and money attached which are good business drivers. If you need to prove business value, no one cares (and you won't get resources) until you do.

It's a fine line to walk for sure.


I am a developer, not a manager; getting resources is not my problem, but maintaining bad code is. At the end of the day, business does not understand technical debt, so it's my duty to manage it transparently and well in advance. Perhaps start-ups that need to capture market share or something are a different story.


"complexity sells better" AWS sales team :)


He also has the quote: "Simplicity is prerequisite for reliability."


Dunno about that one. Our most reliable integrations have a lot of quite complex error handling in order to be highly reliable.


> And to make matters worse: complexity sells better

Well said, Dijkstra. Some of my favorite algorithms are beautifully simple, mostly in the way of being naturally recursive. My compiler course in university involved a lot of recursion, but everything just flowed. Sometimes I'd mindlessly write some code, just assume I'd recurse down the AST, and everything just worked.

Obviously simplicity isn't always achievable in large software, but hopefully there are many simple pieces involved that come together in a clean and concise way.


Similar quote from Steve Jobs (1998)

"That's been one of my mantras - focus and simplicity. Simple can be harder than complex: you have to work hard to get your thinking clean to make it simple. But it's worth it in the end because once you get there, you can move mountains."


> [...] and education to appreciate it.

Also true with personal finance:

* https://canadiancouchpotato.com/2016/01/25/why-simple-is-sti...


Seems like Dijkstra was seeking popularity too.


Not building what doesn't need to exist won't get you promoted. You don't get promoted by avoiding entire classes of problems with a simple, reliable system. You get promoted by pulling heroics to build some absurdly complex cloud Kubernetes bullshit, and then pulling more heroics to fix the endless stream of production issues that will result from your overengineered nightmare.


"Complexity kills" is an old mantra a lot of us fossils are familiar with, but the new kids still need to learn this lesson the hard way.

What makes things worse is the ability to stitch together twenty different AWS services together because - well - you can.

Solving hard problems in simple and maintainable ways, I would argue, IS the job. But what I see is engineers solving problems they don't even have, for THIS IS THE WAY. Also, there is a FAANG guy on the team that claims that's how they did it over there.


The cargo cult mentality is the thing that I've seen throughout my career. FAANG does it this way therefore we must do it this way (I've also heard it parroted at the senior leadership and executive levels more than once!). IMO, this is the driving force behind juniors and intermediates who pull too much complexity into their solutions.


Yes, and I've commented about this again and again. The push for microservices, K8S, Docker. It's all the same thing. It all comes from Google. Or occasionally Facebook. We can thank them for many of the bad ideas of the past 20 years. Open floorplan offices.

K8S could only have come from Google. It's a reflection of their corporate structure. Conway's Law and all that. Much like microservices. It's a reflection of their internal chaos. And yet every company that isn't Google tries to emulate them and fails. Then they wonder why microservices suck so much. It's not that microservices suck. It's that Google sucks. Google sucks so much that microservices are pretty much the only way they can operate as an entity.


I was going to type out something about the lack of cohesive technical vision being a massive part of the problem but Conway's Law is far too succinct.


http://web.mit.edu/nelsonr/www/Repenning%3DSterman_CMR_su01_...

I've posted this here before, though got no comments. But yes, firefighter/arsonists get more promotions and kudos than careful thinkers who don't bother setting fires to fight.

EDIT:

One past discussion here (2015): https://news.ycombinator.com/item?id=8940820


It is very bad. I spent the last year or so rebuilding a system around a fundamental architecture error that it would likely have been fighting against for decades. My reward? The refactor was so difficult to do that I ended up getting dinged on timeline, and promotions got delayed. The incentive structures at big companies are out of whack. Instead of the original developer, who was promoted to staff engineer, getting dinged for making the mistake, I got dinged for not fixing it fast enough, and my career slowed down as a result.

The refactor was incredibly difficult and launched without issue, and I am probably back on track for a promotion, but I have started to make noise about the incentives being askew.


> I have started to make noise about the incentives being askew

Choosing the hard path again I see.


Haha, yep. I just think it's fundamentally unjust for the developers doing the (easiest) green field development to get promoted while making significant architecture mistakes, then punishing later developers for pointing out those mistakes and getting stuck with fixing them. Until someone makes noise, this is just going to keep happening.


I've had this exact same experience. Half-baked, unmaintainable features are launched. The developers get promoted and leave the team. The team is left to clean up and pay the consequences until development speed slows to a crawl.

The team isn't able to get promoted because they're busy doing boring, undervalued work. The engineers responsible for the mess ultimately end up ahead.

If your goal is to get promoted then clearly the best path is to take shortcuts and not act in the long-term interest of the team. You'll get promoted for it. If your goal is to build a product that won't be burdened in operations/maintenance then you have to move slowly, correctly, and you'll be setting yourself back.

This was my experience at a FANG.


Precisely. If I were to have neglected to point out the issue, or refused to lead the fix, I would probably have been promoted by now. Instead I chose the hard path and got punished for it. However, the team will benefit far into the future because they will no longer be fighting a schema that was just wrong. The only cost they had to pay was my career velocity.


It's a big red flag when a dev makes something really cool that is (according to them) 90% done, then hands it off to someone else to finish. It's a sign that they:

1. Can't finish what they started

2. Are going to rewrite stuff unnecessarily

3. Are willing to throw their coworkers under the bus


I will say, in this case it is not what happened. The previous system launched successfully with an incorrect schema. The platform grew and when it came time to add more features, some of them were either impossible to do because of the incorrect schema, or they would have been so complex to implement that the system would have fallen apart. Needless to say, the schema needed to be corrected, but the wrong schema was already baked in all across the platform.

So it's not like the main developer handed something off that was unfinished. They handed something off that was finished but wrong. And because of that we had to replace the engine while in flight, so to speak.


And companies wonder why loyalty is hard to find...

Far easier to jump ship at opportune moments for fast tracking a career than fighting dysfunctional internal incentive structures.


Is k8s really considered that complex? Yes, you need a lot of tooling to go from code to orchestration. But when you are commoditizing software to run on a desktop, on prem, in the cloud, across many customers and topologies, reproducible deployments are way simpler than hand-rolled whatevers.


From what I understand, the labour/complexity in K8S lies not with the deployment, but with managing said deployment over the long term.

This is a real problem for system architects/security professionals that need to maintain a bird's eye view of the system - a job that gets increasingly complex with every additional microservice once you hit critical mass.

This is not to say that K8S does not work - it does, but for certain problems. The cargo-cult mentality, however, threatens to introduce complexity in systems where there is need for none.


Not exactly. But k8s enables a lot of complexity that is usually underestimated.


Kubernetes itself is not the problem. It's the DIY PaaS setups that people build on top of it with YAML files, bash scripts, Jenkins jobs and webhooks.


People need to stop trying to get promoted then.

I'm very happy building simple and easy to maintain systems without flashy features. I couldn't care less about a promotion.


I swear 50% of the appeal of microservices is feeling really productive because you have invented so many more things that need gluing together.


Failing upward and job security from complex systems is a strange thing. It's never done outright. Like, I've never had a conversation with another engineer where we conspire or they tell me they're going to or have built a complex system just for this purpose. It's only observed amusingly after the fact.


Technical employees who frequently act like firefighters (especially those that draw attention to themselves when doing so, i.e., the martyr types) are toxic and should be removed ASAP.


No more firefighters, no more fires! QED.


I'm an architect and I keep things simple too. The problem I have is people around me expect more complex/intricate solutions. They think simple is not sophisticated and savvy. They think the competition is ahead of our 'old, simple solutions'. I have to continually justify and explain that KISS is always the right approach.


Well, it's a trade-off, isn't it?

Complexity can result from features which pay for themselves. Complexity is not automatically always wrong.

At scale, complexity solves problems that you might not otherwise realize exist. Some of these problems are intrinsically complex, they can't be solved with a simpler solution. Not without "simple" organically growing into a much worse mess than the dominant complex solution that was designed by acknowledging complexity from the start.

But what I want to respond to is your trying to explain that "KISS is always the right approach". Rules of thumb like these are cheap.

They cause people to apply the same patterns regardless of the situation. Dispensing folk wisdom indiscriminately only serves to stop all thought and analysis.

Zealous rule-based engineering is the death rattle of good systems design, and it makes me ill at ease.


Right, but because the ‘default’ inclination is towards ever increasing complexity, I still believe the mental exercise of attempting to simplify is almost always worthwhile.

Meaning to say, I find KISS almost universally applicable, there are not many situations where you wouldn’t want to try and simplify (if you can).


People don’t always agree on what simplicity is. To some, older proven technology is simpler, even if it requires more time, hustle and code to use. To others, the same problem solved with a newer, more intricate tech, but solved quicker and leaner, is the simpler solved problem.


I smiled when I heard a hedge fund was using a "data engineering" solution that many today would consider outdated - I guess I'd invest with them.


If they are just looking to sell things to make money without regard to quality, then they're probably right. People underestimate how large of a market that is.


I gave a talk at CU Boulder[1] in 2019 on complexity in cloud infrastructure and the design of rsync.net.

What I wrote, and then said several times was that:

"Simple systems fail in boring ways. Complex systems fail in fascinating ways."

I drew on Chernobyl, Air France Flight 447, etc. as examples. It's a topic that fascinates me and I had hoped to give several more such talks - but alas, the pandemic.

In my opinion, the best discussion of this topic is the book _Normal Accidents_ by Charles Perrow[2].

[1] https://www.colorado.edu/libraries/2019/11/08/cyberinfrastru...

[2] https://en.wikipedia.org/wiki/Normal_Accidents


I'd love to see/read that talk if you happen to youtube it or write it up.


No, it was neither recorded nor transcribed.

Some of my observations on complexity were touched upon in this interview, however:

https://console.dev/interviews/rsync-john-kozubik/

Specifically:

"rsync.net has no firewalls and no routers. In each location we connect to our IP provider with a dumb, unmanaged switch.

This might seem odd, but consider: if an rsync.net storage array is a FreeBSD system running only OpenSSH, what would the firewall be ? It would be another FreeBSD system with only port 22 open. That would introduce more failure modes, fragility and complexity without gaining any security."

... and I recommend these console.dev interviews in general - they're usually very interesting.


When people face a new problem they intuitively try to find something they can add as a solution, not thinking about what they can remove to prevent the problem from occurring in the first place. This becomes apparent in the way marketing works. You sell products by promising they will solve a problem. You don't sell things by telling people what they really need to hear, that they just need less of everything. Simplicity is unfortunately not a product you can sell.


Part of this seems to be that people are afraid of removing parts of a system. It's easier to add stuff because you "know" you aren't removing some important bit, so it's "safer", even though the new stuff is likely to interact with the old stuff in unexpected ways and break things anyway.


This is where the strangler pattern can turn around and strangle you.

At my last job, we hired a new team to transition our stack to a new architecture. They built it, and SOME components of the legacy stack DID get replaced and deprecated. Some did not. Which is why, across our two stacks, we used no less than 4 totally different, separate database implementations, 3 different NoSQL cache implementations, 2 different types of kubernetes clusters (RKE and EKS), and 3 totally separate monitoring and alerting systems. (Just a few examples, but the whole problem is way too big to list out here.)

The worst problem was that the staff who designed the legacy systems had either left, or didn't remember how they built it or why they made all the choices they did. So the ONLY effective means of knowing whether a legacy component could be replaced was to simply shut it down and wait to see if anybody complained. (We called it a "scream test".) In some cases, customers would simply stop paying, and nobody (but finance) would notice for MONTHS. We'd still have people logging in from non-paying customers. Or in other cases, the customers had totally shut down their own operations, so there was nobody to pay, and nobody to call us and say "hey, we're not doing business with you anymore". Engineering never wrote a formal way of validating the whole system, so when features would break, the only way we'd know is if a customer noticed (assuming they were still using the feature) and called Support.

Now, this only happened in the 2 years I had been there. In the previous 15 years, I imagine this had happened several times (aborted modernizations/migrations).

In all that time, nobody had been allowed to sit and plan out how the whole system could be simplified, so that the whole organization could even understand what they were deploying and paying AWS for.


IME, that fear is the consequence of insufficient testing and specifications. But more the former. If your tests (unit, integration, regression) aren't sufficient then removing and changing existing code is like doing surgery in the dark.

Good test suites alleviate fear and allow people to make changes (alterations, additions, deletions) without fear, greatly increasing the speed of development and delivery.



Corporate America rewards higher budgets and adding new anything; in all practicality, Corporate America rewards complexity.

There's no glory in simplicity.


Because then you can employ more of your lawyer buddies to consult on the complexity.


Part of the problem is counterparties. Usually, large parts of "the system" are owned by someone not-you. This differentiates who can do the work (or more typically, halt the work) in each case.

Add something - you

Remove something - you + other owners + all other users

Which explains why system migrations only succeed when championed by a VP+.


The other side of the coin is that complexity creates jobs, because you need individuals to guide you through it.

Lawyers' and bureaucrats' careers are built on the complexity of the criminal justice system and government administration, respectively.


> "It seems that perfection is attained not when there is nothing more to add, but when there is nothing more to remove." - Antoine de Saint Exupéry


Did you just watch this tour of the Starbase with Elon Musk?

He spends about 5 minutes talking about how removing things is the most essential part of a good design: https://youtu.be/t705r8ICkRw?t=833


A pattern I see with some companies -- usually, but not always, companies with little turnover, which leads to internal stagnation -- is the foundational belief that THEY ARE VERY SPECIAL and need things done a CERTAIN WEIRD WAY that is not supported by any commercial offering.

Often, this leads to homegrown solutions that are never at feature-parity with COTS options, but have the benefit of costing more.

If they buy something instead, they festoon it with a bunch of customization and external, parallel tools that must be kept in sync at the data and configuration level for everything to "work".

What they NEVER do is re-examine their assumptions about how they must work, or interrogate why they can't go the same route as the bulk of the market on these points and thus have a simpler, more maintainable, less frustrating system in place.

It's not quite this bad, but it's ALMOST like a company insisting that "well, here, for historical reasons, 2 and 2 are 5, not 4, so we have to do . . . . "

It's exhausting, but OTOH, well, to the extent that my employer is involved, we tell them our recommendations but ultimately bill by the hour.


It should really say "overcomplicated systems tend to break more frequently".

It is not simplicity that makes for less downtime, it is unnecessary complication that does the opposite.

I spend time complicating my applications a little bit to make sure there is no downtime, something pretty important when one of the largest banks on Earth would stop along with your application.

The simplest solutions would typically not be able to ensure zero-downtime operation. I need code so that I can do rolling upgrades, and I need code so that my application can partition work and rebalance it reliably as the cluster map changes.

The problem starts when you start overdoing it. Maybe you are expecting too much in terms of guarantees. Or you want a simple guarantee but then you duct-tape to it a huge and complicated clustering solution. Now you have a lot of problems. Your team doesn't know how it works. Your team doesn't know how it fails. It is not easy to tell if you have integrated with it the right way. And so on.

The goal, as usual, should be to "keep it simple, but not simpler than is necessary".

For example, the approach we have chosen was to get by with as few guarantees as possible, implemented as simply as possible.

We decided on immutable data. We decided on objects being saved as documents, each with the entire state of the object after each change. This costs a lot in space and processing needs, but you know what? It is fine. I work for a bank, after all. The one thing that costs more than space and processing power is downtime, and that's what we are trying to focus on. We know how to deal with duplicate data.

So another rule of thumb: try to find compromises in your application, use them to replace hard problems with easier problems.

Making an application super reliable is a hard problem. Adding more storage space and memory is a (relatively) easy one. If you can solve a hard problem by replacing it with an easier one, you are winning.
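The snapshot-per-change idea described above can be sketched roughly like this (a toy in-memory store; all names and the document layout are illustrative, not the commenter's actual system):

```python
import json
import time
import uuid

class SnapshotStore:
    """Append-only store: every change writes the object's FULL state.

    Duplicated data is the price; recovery and reasoning stay trivial,
    because any single document is the complete state at that moment."""

    def __init__(self):
        self._docs = []  # stand-in for a document database

    def save(self, obj_id, state):
        doc = {
            "doc_id": str(uuid.uuid4()),
            "obj_id": obj_id,
            "saved_at": time.time(),
            # JSON round-trip: cheap deep copy so callers can't mutate history
            "state": json.loads(json.dumps(state)),
        }
        self._docs.append(doc)
        return doc["doc_id"]

    def latest(self, obj_id):
        """Latest full state; no replaying of deltas needed."""
        for doc in reversed(self._docs):
            if doc["obj_id"] == obj_id:
                return doc["state"]
        return None

store = SnapshotStore()
store.save("acct-1", {"balance": 100})
store.save("acct-1", {"balance": 75, "last_tx": "debit 25"})
print(store.latest("acct-1"))  # the full state, not a diff to apply
```

Reads are a single lookup and a crashed writer can never leave a half-applied delta behind; the only cost is storing redundant copies.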


Indeed. Ships contain redundancy. Should this complication be removed to reduce downtime?

Over-engineering does not mean "making the thing worse". It does if your engineers don't know what they're doing. If they do then the added complexity increases reliability.


Regarding "no downtime", there are very few applications where that's actually a good goal to have at all. Quite often, simple systems can provide you high availability where your downtime is a few seconds or minutes at most (during larger maintenance operations), and in many cases you can hide those blips by simply retrying (with proper backoff). There aren't many systems where you actually need to guarantee "no downtime".

What I think is important most of the time instead of zero downtime is predictable behaviour when failures occur, so that your system doesn't end up in an undefined state where you don't know if you can recover without data loss.
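A minimal sketch of the retry-with-backoff idea mentioned above (the "full jitter" strategy and all parameters are illustrative assumptions, not a prescription):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry fn on connection errors, sleeping with exponential backoff
    plus jitter between attempts, to ride out short maintenance blips."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the failure
            # full jitter: sleep a random amount up to the exponential cap
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))

attempts = {"n": 0}

def flaky():
    """Pretend endpoint that fails twice (a short blip), then recovers."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("maintenance blip")
    return "ok"

print(call_with_backoff(flaky))  # prints: ok  (after two retried failures)
```

A caller wrapped like this never notices a blip shorter than its retry budget, which is exactly how a simple system can hide a few seconds of maintenance downtime.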


There are more systems that can't tolerate downtime than you think. And this is mostly because you treat them as the fabric of everything around you and only notice when they fail.

Mobile networks? Power delivery (basically all utilities)? Broadband internet? Factories? Payment systems? Air traffic control? Any internet services at all?

One can think that only Google or Facebook need to maintain high availability, but basically any internet service needs to do that or face the possibility of losing clients.

A smaller company may have fewer clients, but the downtime still affects their clients, and in turn them, the same way. Losing 10% of clients after a snafu is as painful when you have 1 million clients as when you have 100 clients.

It is also not about having absolutely no downtime -- this usually cannot be guaranteed. It is about having less downtime.


All those systems have hours, even days of cumulative downtime per year. Planned and unplanned. The sky does not fall. When you are "three nines" - the other 0.1% is the downtime you're tolerating.
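The "nines" arithmetic works out roughly like this:

```python
HOURS_PER_YEAR = 365.25 * 24  # ~8766 hours

# Availability target -> yearly downtime budget
for label, availability in [("two nines (99%)", 0.99),
                            ("three nines (99.9%)", 0.999),
                            ("four nines (99.99%)", 0.9999),
                            ("five nines (99.999%)", 0.99999)]:
    budget_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{label}: {budget_hours:8.2f} hours of downtime per year")
```

Three nines still tolerates nearly nine hours of downtime a year, which is why most of the systems listed above can survive their occasional outages.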

Power outages of a few minutes to a few hours are utterly normal. Power outages of up to a few days due to summer heat and winter storms are part of the rhythm of life, depending on where you live. Facilities that really care about power continuity have batteries and generators (although these aren't perfect either; we once lost a datacenter to transfer switch maintenance).

Broadband is notoriously flaky, to the point that cable technicians' vague arrival windows are a meme. Serious businesses get several independent connections. Even consumers can now fall back to tethering their phones.

Credit card authorization gets skipped during downtime. Actual payment settlement occurs in nightly batches, which humans have many hours to shepherd and patch. FedWire keeps banker's hours. Stock markets suspend trading when necessary.

Stopping the line is a normal part of the lifecycle of a manufacturing process: when something goes wrong, when there's going to be an upgrade, even for regular scheduled maintenance.

Most internet services have some downtime.


Of what you listed, only the telephone network is one where I don't recall experiencing downtime; probably because it degrades gracefully to reduced functionality if something fails. Can't say much about air traffic control, but I imagine they have failures too and just have backup protocols in place when primary systems fail. A payment system is probably the closest to a computer system where you really don't want to drop any incoming requests, though after the initial payment event has been recorded, the behind-the-scenes processing can tolerate quite a lot of delay in the worst case.

Networks certainly fail all the time, and power delivery issues aren't uncommon either; the downtime just tends to be localized and if you depend on your internet connection or power, you have backup links and UPS systems that reduce or avoid the impact of downtime.

I never said you shouldn't strive for high availability, especially if you're moving vast amounts of customer traffic. I specifically made the argument against "no downtime", because people often seem to think that if your system doesn't have five nines of uptime it's unsuitable for handling "real" traffic, and that to achieve sufficiently high availability you somehow need a highly complex system.


One of my hobbies is tracking down the origin of concepts and phrases. One such phrase is "complexity is the enemy", which I'd first encountered through the Jargon File, though it appears elsewhere, e.g., http://www.neugierig.org/software/blog/2011/04/complexity.ht...

Tracing that through Google's Ngram Viewer, I found the extended form, "complexity is the enemy of reliability", in a short item in a 1958 issue of The Economist (Jan 18, 1958, volume 186 https://books.google.com/books?id=aDsiAQAAMAAJ&q=%22complexi...).

That in turn was referencing the UK's Annual Report of the Chief Inspector of Factories for the Year 1956, which had been published in late 1958. Among the factors assessed was factory and equipment reliability, which scaled precisely inversely with the number of components in equipment.

https://www.worldcat.org/title/annual-report-of-hm-chief-ins...

Software isn't quite the same as mechanical equipment where individual components may wear, but each software component does exist and interact in relation to others. And eventually that complexity comes home to bite.

A full definition of "simplicity", as with "complexity", lies somewhat in the eye of the beholder. But extraneous components are a problem.

(I cannot find either reference online, presently, though I've copies of both documents. I'd tracked the Economist issue through Google Books, and received a copy of the report through a friend. I'm told that the UK generally has excellent availability of government publications, though none of https://www.gov.uk, the National Archives, nor the British Library seem to turn up the Factories report presently.)


> I'm told that the UK generally has excellent availability of government publications, though none of https://www.gov.uk, the National Archives, nor the British Library seem to turn up the Factories report presently.

Since this is from 1956, its copyrights have already expired (or been waived, as the case may be, see https://lists.wikimedia.org/pipermail/wikipedia-l/2005-May/0...), so scanning it is fair game, but if the British government (probably the National Archives) scanned the document, it would add another 25 years* to the digital version (unless released under OGL).

* At least in Britain, in the US simple photographic copies are not considered as derivative works and therefore not in copyright.


I need to check with the friend who'd said it was available.

I turned up a copy through a friendly online librarian (as with the Economist article). Given the age of both publications and their rather minimal commercial value, the hoop-jumping is annoying.

That said, the general availability (if not legality) of information online has absolutely exploded over the past decade or so.

Absence of copyright in published works in the US is a specific and legislated exception to exclusive rights in copyright, and applies only to the US Federal government. Not states, counties, cities, or other governmental units. And not foreign governments either. 17 USC 105(a)

https://www.law.cornell.edu/uscode/text/17/105


Ah, okay. I forgot that bit, I've removed the erroneous section.


> A full definition of "simplicity", as with "complexity", lies somewhat in the eye of the beholder.

Perhaps "somewhat", but not entirely or arguably even mostly:

http://curtclifton.net/papers/MoseleyMarks06a.pdf


I'd extended a concept of Charles Perrow's a few months back on determinants of simplicity vs. complexity in systems, generally. See: https://joindiaspora.com/posts/97208f300fc4013901a3002590d8e...

(I think I may have swapped the ranges on "Threshold sensitivity", which seem more reasonably to be "low/high" than "high/low".)


Simplicity is simple to define for simple problems, but complex to define for complex ones.


Simplicity is a shape that composes or has symmetry in that dimension.

The greatest act of simplification I did was to reduce a problem from a continuous one: someone wanted to filter on distance, and I moved it to a set of three buckets.

This discounts all the things argued out of existence or replaced with a sql query.

Which reminds me, the longer you can keep something relational the larger your chance at reducing complexity, because you can project down to a lower dimensional space to solve specific problems. 80% of work is in choosing that right lower space and making the projection. Once there, all problems are flat.
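The distance-bucketing idea above might look something like this in a relational setting (the thresholds, labels, and schema are made up for illustration):

```python
import sqlite3

# Replace a continuous "filter within N km" feature with three coarse buckets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listing (id INTEGER, distance_km REAL)")
conn.executemany("INSERT INTO listing VALUES (?, ?)",
                 [(1, 0.4), (2, 3.2), (3, 7.9), (4, 42.0)])

# The whole "projection" is one CASE expression; every downstream query
# then works on a flat, three-valued column instead of continuous math.
rows = conn.execute("""
    SELECT id,
           CASE WHEN distance_km < 1  THEN 'nearby'
                WHEN distance_km < 10 THEN 'in town'
                ELSE 'far' END AS bucket
    FROM listing
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'nearby'), (2, 'in town'), (3, 'in town'), (4, 'far')]
```

Once the continuous dimension is collapsed, filtering, grouping, and indexing all become ordinary relational operations on a small enum.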


Is simplicity a fitness function?

Are fitness functions universal?


I like your hobby. I agree fewer parts is part of the definition of simplicity. I'd suggest "predictability" fits somewhere in the definition as well, in the sense that when you make a change to the system you can predict the outcome or side-effects; also, a person reasoning from first principles (or fewer assumptions) will have an easier time predicting the design of the system.


Predictability to me seems more a consequence of than a contributor to simplicity. Though there are very simple systems (e.g., three-body problem, double-pendulum) which exhibit highly unpredictable behaviours.


When applying this to software and you have a complex system there is a lot to learn from Joe Armstrong's excellent paper:

https://erlang.org/download/armstrong_thesis_2003.pdf

As well as from the way the Erlang language and runtime are organized.


I disagree with the notion that a container ship is a simple system because 13 people can man it. It just means that the complexity is hidden in deeper layers.


Yes! In the future, a whole car factory will have only one employee. His job: feed the dog. The dog’s: keep humans away from the machines.

Just as a ship with dozens of people got replaced by 13: the complexity was encapsulated in simpler interfaces.

Diesel motors: just change this lever. Up is fast, down is slow. But it took a century to refine this absurdly complex device until this simple interface became possible.

I think the point here is: don’t build your own diesel motors. It’s unlikely you’ll beat a good model, and your interface to it will probably be worse. Remember: people take years learning from their own mistakes to hide the complexity under simpler interface layers.

Unless your company IS building diesel motors.


So what happens to your simple encapsulated system when the diesel motor breaks down, or one of your hydraulic lines breaks? Does your simple encapsulated system then also say "we have no downtime, because we have a simple system"?


You leave port with 10 replacements for every component of the encapsulated device. Or just a full replacement for it.

Hidden complexity in the case of the diesel motor was also made possible because of known failure modes (how many years until all the low-hanging fruit got solved, and how many decades until the not-so-easy problems did too?) and standardization of its components (how many centuries just to understand and standardize the universal joint https://en.wikipedia.org/wiki/Universal_joint?).


I'm pretty sure that is the exact point the parent post is trying to make!


But the question is, is the 13-crew ship still simpler than one that requires people doing all of those jobs? I.e., is maintaining the machinery less complicated than maintaining the equivalent number of people that would be required to do the same jobs? Arguably yes.


I'm doubtful. You might be able to get to harbor more easily if something breaks down along the way, but actually getting the ship fixed afterwards might take longer, because the apparent simplicity is built using highly complex systems. You see the same thing in modern cars, which are often a real pain to fix, because it's all computers and complex parts, rather than one axle going from the steering wheel directly to the wheels.


> because the apparent simplicity is built using highly complex systems.

More complex than humans, with wives, kids, need for downtime, hazard pay, insurance requirements, mental and physical limitations, and so on? A lot of the complexity of humans is invisible to us because we're just used to it.


People don't go out of their way to build complex systems.

What happens is that it starts simple and becomes complex as more features are added often with limitations attached e.g. time, money etc.

So yes this guy migrated from Marketo to Hubspot and it was simple. But the idea it will simply remain that way is laughable.


While what you’re saying is, of course, fair, the other side is also true: people often choose too complex a technology “because it’s future-proof!” when they need just a simple system. There are tons of examples: CQRS over simple databases, Hadoop while a single server suffices, or even people choosing JIRA over a simple trello board.

As such, the obvious answer is “it depends”, and making the right trade-off is rarely properly captured in a single rule of thumb such as “always choose simple”.


> CQRS over simple databases, Hadoop while a single server suffices, or even people choosing JIRA over a simple trello board

I've heard this from many engineers over the years.

But what I've often found is that they were never involved in the decision making process and so aren't aware of all of the business requirements. Once they are made aware usually they agree with the choice.

Common example being the business requirement to have high availability.


It seems to me that the concepts of high availability, scalability and resilience are things that are just not taught or emphasized at most average companies. As Ops, this is a daily battle of education and prod-readiness checklists.

It is even the simplest things that are ignored such as:

  - what happens to your db query when you have 1 million records?
  - what happens when the cache goes away?
  - do you even use a cache?
  - why are you treating your cache as a DB?
  - can we run 2 copies of your service?
  - can your service scale up and down?
  - does your service have state?
Kubernetes has been a boon for us in this department. Since pods can be moved/deleted at a moment's notice, we have had to make our software resilient and stateless. Yes, kubernetes is complex at first, but it pays off hugely when you are no longer getting paged because 'server X hung and needs to be rebooted'.
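One concrete piece of that resilience is handling the orchestrator's shutdown signal gracefully. A rough sketch (illustrative only; Kubernetes sends SIGTERM before deleting a pod, giving the process a grace period to drain):

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """On SIGTERM, flip a flag and let the loop drain instead of dying."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def drain(queue):
    """Process items until asked to stop. Unfinished work stays in the
    external queue rather than in process memory, so another pod can
    pick it up after rescheduling - this is what 'stateless' buys you."""
    done = []
    while queue and not shutting_down:
        done.append(queue.pop(0) * 2)  # stand-in for real work
    return done

print(drain([1, 2, 3]))               # [2, 4, 6] - normal run to completion
handle_sigterm(signal.SIGTERM, None)  # simulate the orchestrator's signal
print(drain([4, 5, 6]))               # [] - new work is refused while draining
```

The key design choice is that the handler only sets a flag; all real work happens in the main loop, so shutdown is just the loop noticing the flag and returning.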


> a single server suffices

A genuine question that I wonder about: How does a single server protect you against someone in your DC pulling a plug by mistake or your cloud provider being unreliable? If it’s a database it seems like a huge potential for going out of business.

I guess if you are running single-server databases you better have a good backup strategy anyway, but is that a risk people and businesses just accept?


If you have a single server, a pulled plug (or indeed any HW failure which cannot be fixed quickly) is a very rare event (unless you use cheap desktop hardware instead of a decent server with ECC RAM, hot-swap HDDs, redundant PSUs, etc.).

If, instead of using this single server (with manual fail-over in case it fails), you start to create a complex HA cluster from multiple servers without enough people to properly design, test and maintain it, you can end up with a less reliable system.


I think what you are missing is that simple databases in a well run data-centre are surprisingly reliable.

The tradeoffs involved in avoiding simple failures like "the cleaner unplugged the DB server" introduce complexities which may themselves cause outages that are much harder to solve.


In short: yes.

The recent fire in a French DC saw serious disruption across all kinds of industries in many European countries. I know a sales engineer in a hardware manufacturing place, who was back to pen and paper until the IT provider figured out restores. But it wasn't that bad, as both the clients and competitors were decimated too.


> How does a single server protect you against someone in your DC pulling a plug

That’s why you have redundant power supplies with separate power cables.


> CQRS over simple databases,

I've never understood the hype about CQRS. To me it looks like a nightmare for a tradeoff that you "may" only need at some point in time. Particularly given GDPR, HIPAA, PCI and other compliance frameworks that require the "right to be forgotten". All books about CQRS talk and talk about the wonders of it, and when it comes to that detail they just limit themselves to saying "yeah, you should consider it"... but the reality is that the implications are huge.


That's not always true. Developers like to, well, develop, and so they like to implement all of these things irrespective of whether those things are actually needed.


That happens way more often than people think. Developers do indeed like to develop things.

We frequently run into customers who want to implement some feature or system, which really should just be handled by their load balancer, database system or some standard Unix tool.

I currently work with a customer who has an... interesting approach to api-gateways. We already ripped out 75%, because it either does NOTHING or has been replaced by rules in ha-proxy, which we use anyway. The last bit of code can just be integrated directly into the single remaining service behind the gateway.

The same client is running Redis, which seems innocent enough. Fairly simple, easy to manage, but it only holds ONE item: a 150kb JSON document... That can just go in the memory of the single process which needs it and which is responsible for refreshing the document anyway. Worst case is that it takes 60 seconds more to restart the service.


I've yet to see something in the wild that starts simple. Usually it starts too complicated and just gets worse over time.


You can start from a complex system by using hyped (popular) technologies X, Y, Z. They may be good, but not necessarily in every single case. People like complex tools because they look good on a CV. And when everyone around uses them, it feels like you are doing things wrong if you have a simpler solution.

Just add K8s to the mix and you are starting from a relatively complex system.


> People like complex tools because they look good on CV

Is this really a thing? I have yet to see a single case of it.

OTOH what I found more often than I could count is the belief that complex systems or tools are necessary because the problem is complex (it isn't) and the company's situation and requirements are so special (they aren't) -- not intentionally but simply due to lack of time and lack of contemplation of the problems and available solutions.


Well, I don't think anyone would openly say that they prefer shiny X to boring Y because X would look good on a CV. It's just an impression I'm getting sometimes.

But what I've seen multiple times: someone is enthusiastic about a new technology and advocates for using it in a project at $job even when a simpler option, equally or better suited to the task, is available.

It may sound offensive, but my impression is that many developers are like children: they like to play with new toys and quickly get bored with old ones. And in the software industry they have the opportunity to choose new toys and be paid for playing with them.

Also, I've seen many discussions on social networks where people say that nowadays it is hard to find a good job if you don't have X, Y, Z on your CV, so they may feel pressured to try these new tools even if they aren't enthusiastic about them.


I think anything that is growing and new has a tendency to become complex. Simplicity is a constant balance. You have to allow and endure the complexity to get anywhere new, and then re-integrate it by creating order. Such is coding and such is life, haha.


I fell for the everything-must-be-a-microservice / distributed-across-as-many-servers-as-possible trap. Even though I've read so many warnings about it here on HN and knew upfront I might have to rewind everything.

The setup:

- Distributed file system using GlusterFS

- DNS load balancing using Amazon Route 53

- PostgreSQL HA clusters using Patroni

- A WireGuard mesh topology between all instances.

Even though it was much fun setting this all up, the ballooning complexity of all this outpaced the benefits and it felt more fragile than where I came from. I decided to just scale vertically and keep it simple.

Most of my refactoring these days is coming up with simpler, more manageable solutions.


> Even though it was so much fun

That’s the root of quite a bit of evil in software. A closely related motive is a desire to show off.

Play is good for learning but not for production systems. You’ll regret it later when you are up at 4am on a Sunday morning troubleshooting some Byzantine stack.

The antidote is the realization that simplicity is harder than complexity. Simple but highly effective systems are the ones that should inspire awe and admiration. Complexity is a sign of an immature design or a lack of high level conceptual thinking. A system should be only complex enough to capture essential complexity (problem domain requirements) and no more.

One more thing… there is a ton of submarine marketing in our industry that encourages complexity because it’s profitable for vendors and cloud providers.

The design you outline leads to more cloud resource consumption, more lock in to cloud platforms, and eventually a need for lots of service mesh, config management, and orchestration products that if the project grows will eventually start costing money.


> That’s the root of quite a bit of evil in software. A closely related motive is a desire to show off.

Yes, but it's also driven by keeping oneself employable.

"I built a simple system to do X" is not a winning line in an interview. GP's complex system is what people say they built.


I don't think that this sort of selfish thinking is dominant. The explanation is much simpler: many in our field see complexity as a virtue, not an enemy.


That's not how you frame it. You don't say simple, you say efficient. "I built a highly efficient system to do X that required only $$$ in monthly cloud spend and handled NN operations per second..."

You can also point out how fast you built it, since simpler systems often take less time.


The more I think about it, the more it seems to me 'simplicity' is just another name for 'encapsulated/isolated/modularised complexity'.


This is only true if all the complexity is necessary and intrinsic to the problem. Simplicity means there is as little incidental (unnecessary) complexity as possible.


Simplicity is a name for isolated complexity only if there is as little unnecessary complexity as possible? You can only contain intentional complexity. I think you are agreeing with me.


I think you're underestimating the amount of inessential complexity in most software and systems designs. There is a lot that can be trimmed without touching the required complexity.


Did I say anything contrary to what you just stated at any point? I literally said simplicity is an illusion, just a name for tamed complexity.


The author generalized "fewer features lead to less downtime" to "simple systems have less downtime".

A simple system such as a hand-written web server is very likely to crash. It very likely cannot serve many users if we don't make it serve requests concurrently. Worse, one tiny exception in one request could bring the whole system down: it stops serving any request to any user.

A relatively complex system with a self-recovery mechanism added goes a long way compared to the former. The Erlang implementation is much more complex than an "Erlang without processes" would be. Kubernetes is complex, and it definitely gives us less downtime than the simple scripts we wrote before we used Kubernetes.

The real world is chaotic. If a system needs to strive for less downtime, it needs to have some features that mimic biological creatures, instead of being idealistically simple.
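That trade-off can be illustrated without Erlang; here is a toy supervisor sketch (all names hypothetical) that restarts a crashed worker with backoff — a little extra complexity that directly buys uptime:

```python
import time

def supervise(worker, max_restarts=5, backoff_seconds=0.1):
    """Run `worker` and restart it whenever it raises, Erlang-supervisor style."""
    restarts = 0
    while restarts <= max_restarts:
        try:
            worker()
            return                     # clean exit: nothing to recover from
        except Exception as exc:
            restarts += 1
            print(f"worker crashed ({exc!r}), restart {restarts}/{max_restarts}")
            time.sleep(backoff_seconds * restarts)   # linear backoff
    raise RuntimeError("worker kept crashing; escalating to a human")
```

A single exception no longer takes the whole service down; a persistent failure is escalated instead of looping forever.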


> A simple system such as a hand-written web server is very likely to crash.

> Kubernetes is complex, and it definitely gives us less downtime than simple scripts we have written before we use Kubernetes

There is a point in between a 100-LoC hand-written web server and a K8s cluster. E.g. if you need to serve static files and the load is small to moderate (say <=1 Gbps), you can do this in two ways:

1. Create a K8s cluster where HTTP traffic will be dynamically proxied to a container running on some set of nodes reading files from GlusterFS cluster.

2. Install nginx on a pair of physical servers (with disk configuration depending on the load: from HDD mirror to SSD RAID 10) and something like CARP for HA IPs shared between them.

The 2nd option is much simpler and in most cases will be more reliable: in theory the K8s option can have higher availability, but it requires much more effort not to screw something up and is much harder to troubleshoot when something goes wrong.


My single Intel NUC in my basement on residential cable serves cloud services at higher reliability than Office 365.

Sure, it doesn't do it for millions of people, but it's drastically less failure-prone, despite the lack of resilient design.

What boggles the mind is people are often sold the cloud even though their org isn't serving millions of users, and could just as easily operate as a box in a closet.


> What boggles the mind is people are often sold the cloud even though their org isn't serving millions of users, and could just as easily operate as a box in a closet.

Capex vs. opex may play a part here. Also, companies in general seem to have transitioned to using services. 20/30 years ago companies had cleaners on staff. Now everyone uses a cleaning service.


I have read about a dozen and a half articles about capex vs. opex, and it still makes absolutely no sense to me why companies prefer paying over double for opex what they'd pay for equivalent capex over the same lifecycle.

Either corporate accounting is some mystical art that makes money appear where there isn't any if certain practices are followed, or there's some collective mass delusion that opex is just better? I don't know; it's a concept that truly baffles me.


For a startup it makes perfect sense to avoid high CAPEX: it is not worth investing in your own infrastructure that will have a positive ROI in 3 years if you are not sure that your startup will still be alive in 3 years.

Why big established companies prefer high OPEX to lower CAPEX is less clear.


Indeed, I fully understand why a startup goes for the cloud: either it scales fast enough that the dynamic scaling of the cloud is key, or it fails and assets become pointless anyway. But even then, once a startup hits a certain level of stability, it makes sense to invest in your own infrastructure to minimize expenses.


I think a better point would be "simple systems are quicker to fix" or, more concisely, "avoid unnecessary complexity". In some ways, complexity allows flaws to get in, but that does not mean it's unnecessary.

Take, for example, website hosting. You can put your full site on a single server, but then you're prone to the machine going down, the disk having errors, etc. A more complex system, like Kubernetes, will be far harder to troubleshoot, but it will not go down just because one machine died.

Or, as an engineering example, take planes: they're a lot more complex than ships. Fixing them when something is wrong can take long. But that complexity is needed so that when something fails, you get to the airport instead of going through a rapid unscheduled disassembly.


Until recently even the huge airliners had physical cables going from the cockpit controls all the way to the wing and tail control surfaces. Precisely because more complex systems were less reliable.

Also, even with fly by wire there’s sometimes (or often, or always, no idea) a backup, eg a direct electric / hydraulic link.


Sounds like the 737 MAX...


This is sort of like the old "Water is wet; news at 11." thing.

But simple is not easy. In my experience, it often comes after complexity.

My general approach to simplification is to first get it working, even in a complicated fashion, then start removing stuff until it stops working. If I can't get it working after the latest simplification, I'm forced to add the last thing back.

But I can usually get it going, which may sometimes require a rearchitecting.

One of the happiest times for me, when I'm writing code, is when I get to toss out a whole bunch of painstaking work that I did.

I'm going through that now. I'm working on an app that has necessary complexity, and I'm finding ways to toss out code all over the place.


"First, make it correct. Then, make it beautiful. Then, if you need to, make it performant. Because 9 times out of 10, making it beautiful also makes it performant enough."

- Joe Armstrong


So true. I am currently working with a system with tons of "moving parts"; once it's up it's neat, but there are always at least 3 things broken every time I install it. The person who created it is the de facto expert, and everyone else just relies on them to fix it or extend the trickier bits.

It does leave one wondering if the niftiness is worth the pain, or if there is a way of getting the good bits without the complexity.


Sure, but you need a path there. How do you get from complex to simple? That's really hard. It's relatively easier to keep things simple over time.

By default Money does not know or care about conscientious design. It wants functionality and it wants it now, future be damned. This is not irrational. Usually there is no future.

The problems start with your unlikely success, and yes, now you have money to throw at it, but ironically the people best suited to wrestle complexity to the ground are not super motivated by money. They are motivated by a peculiar kind of beauty that is, by hypothesis, missing in the successful, complex system, and so they will not want to contribute, money or no.


I'm very confused by the OP.

There's legitimately nothing "simple" about the massive container ship system described in the OP. It's a complicated semi-autonomous cyber-physical system with multiple complex failure modes possible in the software, electrical, hydraulic, and/or mechanical domains. The reason said system has both the superficial appearance of simplicity to its operators and a high probability of low downtime is that its very complex systems and subsystems were fastidiously designed and built using tools, processes, and practices which helped manage and understand all that complexity. Not because it is "simple".


In my experience complexity is more a factor of poor governance, management, focus, and purpose than the actual technology itself.


Would somebody agree that a single MySQL/PostgreSQL server has more downtime than a replicated cluster? How often do high-availability mechanisms cause downtime themselves?


To add a point: "simple" can also mean "standardized".

Self-hosted or cloud-hosted Kubernetes is a highly complex system. But using it to power your internal tooling (e.g. Jira, Confluence, GitLab, Jenkins, whatever) over a classic manually-installed-on-VMs deployment has the advantage that you can drop in anyone with Kubernetes experience to manage your workloads.


Kubernetes is a complex solution, but it solves a very complex problem. It could have been made slightly simpler (and slightly less flexible too), but it's a really good solution IMO. The structure is mostly straightforward and intuitive (at least to me as a developer).


I would go one level up and ask "do we actually need Jira, Confluence, GitLab, Jenkins, whatever on premise?". The question then is not "K8s or manual way", but "Do we actually need K8s?"


Many companies in the EU actually blanket ban any storage of their sensitive information in the cloud, especially the Atlassian cloud after the Australian espionage law change.


That's a good point! Then we raise the question one level up: "does the current law regarding storage of sensitive information make sense?" It's a never ending story for those who like to delve in these sort of things.


For all what it's worth: yes. The larger the cloud, the larger the motivation for enemies to attack it. And in this case, the enemy is China (and, for European companies, also the US/FVEY alliance) - all of which have already been caught multiple times doing industrial espionage.

Core company secrets belong on premises of the company owning the secret, not in a cloud where it is fair game for security services and criminals of all kinds and nations.


Yes, that’s right. But then how long can you maintain simplicity as your startup grows?

That’s very hard to implement



This reads like a sales pitch. And the idea of letting a sales team sort and organize leads is impractical.

I absolutely agree with the premise - but some of the anecdotes are laughable.


I'm going to make a controversial claim for the sake of argument.

We can and will debate what programs are "simpler" all day, it's all subjective... except for one metric: line count. Line count is objective.

Take pause before rejecting any change that uses fewer lines of code.

It takes an awfully good abstraction to beat simply having less code.


It seems like it should be objective, but it isn't, because you get people writing ridiculous "clever" one-liners that are difficult to parse and understand. That clever one-liner could be rewritten to do the exact same thing and be much clearer and more readable, but it would take, say, 8 lines instead of 1.

Then there's the subjectiveness of what you actually consider a "line".


Fair. "Less code" is probably a better metric, but still has the same problem you describe.

Still, some argue that an 8-line for loop is better than a 1-line map or something, but a map is more constrained in what it can do than a free-style for loop.

And 8 vs 1 lines in some corner of the program is less important than whether or not we make it a coding standard to write and use AbstractBeanThingyamabobs for all our stuff, etc.


and then there's "syntactic sugar"


Kolmogorov complexity is close to what you're getting at, but it is about the total string length of the program, which is more useful than line count when line count can be gamed. That is, a shorter program is "simpler" than a longer program. In quotes because it's not necessarily true (see code golfing: the language itself may become very complex to permit such a short program, and the requisite knowledge and competency then increase the total complexity needed to achieve, or even understand, the simpler result). A similar technique that doesn't discount meaningful variable names might use syntactic token count. That way:

  printf("Hello, World!\n");
and:

  Put("Hello, World!\n");
can be treated as the same complexity (the difference is 3 characters, and they are otherwise equivalent).
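A sketch of that token-count idea using Python's stdlib tokenizer (so it measures Python-like snippets rather than C; layout-only tokens are excluded):

```python
import io
import tokenize

def token_count(source: str) -> int:
    """Count syntactic tokens, ignoring layout-only tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in skip)

# Different identifier lengths, same token count:
assert token_count('printf("Hello, World!")') == token_count('Put("Hello, World!")')
```

By this measure the two calls above are equally complex, while a "clever" one-liner that packs many tokens onto one line is not rewarded the way raw line count would reward it.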

There are also analysis methods (names escaping me, and my Google-fu is weak today) that look at loops, procedure calls, dependency graphs, and other things to attempt to discern the complexity of a program using objective metrics (these should be treated as guidelines, not rules).

https://en.wikipedia.org/wiki/Kolmogorov_complexity


> There are also analysis methods (names escaping me, and Google fu is weak today) that look at loops, procedure calls, dependency graphs, and other things to attempt to discern (should be treated as guidelines and not rules) the complexity of a program using objective metrics.

Cyclomatic complexity https://en.wikipedia.org/wiki/Cyclomatic_complexity might be what you're thinking about.
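As a rough sketch (a simplification of the graph-based definition that ignores `and`/`or` short-circuits, which the full McCabe metric also counts), cyclomatic complexity can be estimated as decision points + 1:

```python
import ast

# Constructs treated as decision points in this simplified estimate.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: branching nodes + 1."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))
```

A straight-line function scores 1; each `if`, loop, or exception handler adds one more independent path to test.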


That was it, thank you.


>the language itself may become very complex to permit such a short program

Hit the nail on the head. We can't fight thermodynamics, we can only cheat by drawing lines and pumping entropy one way. "Simplicity" is obtained by hiding complexity away. Modern container ships are reliably run by 13 people because of a huge system of builders and maintainers for everything from diesel engines to navigation systems.


Familiarity is a rug we get to sweep the complexity under.

If people know a language, like regexes (or APL as an extreme case), they can hide a lot of complexity.


"Complexity is the enemy of availability."


Arguments are convincing


Or, put the other way round: smart systems have more downtime.

In short: smart is stupid.


“I didn't have time to write a short letter, so I wrote a long one instead.”

― Blaise Pascal (often misattributed to Mark Twain)


Yes, let's replace these computers and smartphones with pen, paper and dumbphones.


If you were to do that, those "systems" would definitely have much less downtime.

It might not be as efficient or convenient, but that wasn't the question. If the efficiency or convenience of two competing systems is sufficient, the less complex one will have fewer parts to fail, and cause less downtime.

So for example for a "who touched this cabinet last" sheet, pen and paper would indeed often be better than smartphones.



