Cloud Native Software Engineering

bob1029 · on July 5, 2023

I really like Figure 1. This reminds me of a particular meme wherein the extremes are ideal.

Right now, we are in the left-hand side. Trying to go straight to the right-hand side. We have zero interest in a middle ground approach.

Spivak · on July 5, 2023

And the all serverless approach isn't even possible in general, they aren't drop-in replacements for one another, and god help you they aren't trivial. Real production workloads architected around Lambda get really complex really fast.

bob1029 · on July 5, 2023

> the all serverless approach isn't even possible in general

It absolutely is. You can build an entire webapp that runs completely on top of HTTP trigger functions. We are doing this right now and it's a fantastic experience. We arent even using SPA crap anymore. Final, server-rendered HTML coming directly out of the functions. Traditional HTTP form submissions, etc. The only JS exists to facilitate certain UX and multimedia integrations.

If the FaaS solutions in the cloud are feeling complicated, you are 100% using them wrong. The entire point is to take all the traditional infrastructure complexity and dumpster it onto the cloud vendor. Trying to maintain control over all the tiny details is where you go off rails. You have to learn to let go if you want to understand why these ideas are so compelling. Combining actual cloud-native (aka 'locked-in') serverless with containerization ideologies is like mixing water & oil.

In my experience, you basically have to start a new company from zero if you want to explore these ideas correctly. The philosophies of different technology persons in the org will almost certainly ensure none of it goes correctly. There isn't a lot of room for an ego on the deck of a properly-architected FaaS app. So, back to the original point - maybe it isn't possible in general. But, not for the reasons you seem to indicate.

> aren't drop-in replacements for one another

This is like complaining that the concrete foundation of a structure cannot be easily substituted with another not-so-exact approximation that you found down the street. If you want proper serverless, you have to pick a vendor you trust and stick with them. Anything else is complexity circus.

BackBlast · on July 5, 2023

> If you want proper serverless, you have to pick a vendor you trust and stick with them.

Some of us have to consider platform risk in our designs and not just ignore it. This is what makes lock-in an issue in the first place. Some times, well, most of the time it isn't an issue. But when it is, it's a huge potentially-company-destroying issue.

Different companies have different risk profiles, and it may be perfectly fine for many to just ignore this risk while others, quite reasonably, have to consider it.

bob1029 · on July 5, 2023

> Some of us have to consider platform risk in our designs and not just ignore it.

Has anyone considered pushing back on the higher-order consequences of chasing a multi-platform solution?

Do you truly know how much this objective is going to cost you in terms of complexity, et. al.?

I have absolutely zero issues convincing our clients (banks/finance) that one vendor is an advantage.

Spivak · on July 5, 2023

> aren't drop-in replacements for one another

I mean that for all problem domains where the former is a solution the latter isn't, sorry if that wasn't clear.

Where Lambda bit us is when you need to have compute for longer than 15 minutes which happens when processing large documents. You can split the individual operations on whatever your atomic unit of thing is (eg pages) but the split/merge steps are what kills you. The moment you have to step out of all Lambda the simplicity breaks down and making serverfull/serveless work together is more complex than just going all serverfull.

The other painfully annoying thing is when you need state and storage (ie websockets) but maybe that's better now. I know for sure that EFS is still painful.

RicDan · on July 5, 2023

> managed kubernetes > engineering effort medium

eh, about that…

kitd · on July 5, 2023

K8s is complicated if you go digging around in the guts. But using a managed offering is barely more complicated than docker compose.

bottled_poe · on July 5, 2023

What does this comment mean?

scrozart · on July 5, 2023

What about that?

_sojh · on July 5, 2023

My guess: commenter will soon claim that docker containers and yaml manifests are too complicated. That’s what I tend to see when vague comments like this are left.

Now, I saw a straw man somewhere…

Is there a learning curve to adopt k8s? Of course! What tech doesn’t have some sort of learning curve?

If you are an application developer, there’s very little of the underlying systems that you need to know about or be concerned with. Build a container image that runs your process and exposes a port. Some trivial yaml can launch that image and attach storage and network. Everything else is done for you.

Sure, things get more complicated if you have complicated deployment needs or try to run workloads in k8s in a non-standard way, but that’s true of any system.

Before container orchestration systems there was just as much complexity. You had to ship the code to remote machines, maybe with config management, but probably ssh and a git pull. You had to signal a restart, to some process monitor, init.d, systemctl, or something like eye or god. You had to check if the service was available. You had to expose ports and route traffic. You had to handle crashes. You had to reboot. You had to monitor resource usage. You had to secure the network. You had to bubble up events to be visible to operators, you had to handle phased rollouts. You had to be able to rollback failures.

The list goes on…

There’s just as much complexity deploying services at scale with or without k8s. Without k8s though (or nomad to some extent), all that complexity is scattered across various ad-hoc systems that are OS dependent whereas k8s wraps everything and exposes a common and documented api, and an orchestrator that gathers and emits easily digestible events every step of the way. It’s a framework in a sense. The same way you can jump into a rails code base and know generally where to find things, you can ship your container images to any k8s cluster and should be able to get the same results with little to no drift between each cluster.

Does everyone need k8s? Most don’t need it. But, I’ll argue that if you’re deploying services in a highly available and fault tolerant way, you’re going to be just as complex, if not more, without k8s.

rajamaka · on July 5, 2023

But we don't live in the time before container orchestration, we live in the time of managed services that autoscale and deploy our apps in several lines of yaml.

I think when people make the suggestion k8s is overly complicated for their needs they are referring to that comparison.

_sojh · on July 5, 2023

The choices aren’t k8s or managed cloud offerings.

The choices are k8s, managed cloud offerings, bare vms, or bare metal.

Managed cloud offerings are expensive. Plenty of shops are still provisioning virtual machines from a minimal OS image, or renting cheap metal that can get them much further for a fraction of the cost.

Plenty of people operate in a pre-container orchestration world today and that isn’t going away.

throwawaaarrgh · on July 5, 2023

I see there's no discussion of one of the largest flaws in cloud computing designs today: mutable system architecture & operation. Almost every cloud computing system today is solely operable using mutable state driven by API calls. There is no practical, portable way to design a system such that you can ensure its operation is what you first intended.

Any API call at any time can introduce a problem, or the system itself can simply get into a bad state, and there is no way to simply return back to the original good state. One has to jump through a lot of hoops to to to figure out how to fix the state, live, like breaking a plate and then quickly trying to glue it back together. When the state of the art's best practice should be to grab another perfect plate from the shelf.

To be sure, some cloud systems so have immutable features. Container and virtual machine technology depends on it. Yet the systems that control those containers and VMs have no immutable properties with which to build and operate the larger system-of-systems.

If you design a system architecture today, you cannot build the cloud infrastructure, get it to a working state, and then make an immutable snapshot of how the system works, and restore it later if the system changes or becomes unstable. Again, you can snapshot a VM, or a container; but you cannot snapshot the system that runs those VMs and containers, nor the system that depends on those, etc. If you make a policy change to an S3 bucket, you can not simply restore the policy (and all other aspects of a bucket) to where it was before. Almost no cloud systems have this property, because they are just a mess of APIs and mutable state.

This all becomes apparent when you operate large scale systems using tools like Terraform. Terraform is a Configuration Management tool, designed to modify the system's mutable state to return it to some desired state. It's the same as using Puppet or Ansible to modify a running VM, to "fix" the VM in place. This is a laborious and error-prone process which requires a great deal of investment in the Configuration Management - essentially writing programs just to tell the system where each file should be.

But the best practice for VMs is not to "fix" its state. It is to simply deploy a snapshot of a VM, with the exact properties of that VM that it should have, and deleting the old VM. This brings a number of desirable results to operations, and completely does away with the need to program a Configuration Management system to "fix" the VM.

All cloud systems should support immutable operations. But virtually none of them do. This is a tremendous disadvantage and gap in the cloud computing ecosystem. It will cause increased instability and require dedicated jobs just to deal with the Configuration Management problem. This creates a problem for developers, too, who will stay away from learning cloud architecture due to the complexity involved.

hiatus · on July 5, 2023

> It's the same as using Puppet or Ansible to modify a running VM, to "fix" the VM in place. This is a laborious and error-prone process which requires a great deal of investment in the Configuration Management - essentially writing programs just to tell the system where each file should be.

This doesn't track with my experience. Tools like terraform can be used in such a way that, instead of running an apply to update a resource in place, a change to the machine configuration would result in a deletion/new instance being spun up. Some operations are mutations, like updating the name of a security group for instance. But updating a host's configuration should be done by updating user data and creating a new instance, though it's true terraform offers the ability to perform the update either way.

_ktx2 · on July 5, 2023

> If you design a system architecture today, you cannot build the cloud infrastructure, get it to a working state, and then make an immutable snapshot of how the system works, and restore it later if the system changes or becomes unstable

You can. Here's what you'd need:

- a snapshot of the code (eg: git commit SHA)

- a snapshot of the DB schema that correlates to the code (eg: database using linear migrations checked into the code)

- a snapshot of the DB state when upgrades or downgrades take place

- immutable (remove & replace) infrastructure checked into the same git SHA

This requires writing infrastructure, DB schema, and some features a certain way. The problem isn't a technical limitation, it's that you'd have to know a good amount in at least three different domains to set this up, much less run and operate this system. It raises the bar for knowledge well beyond treating each of these systems as isolated systems. That's why most teams don't do it. It's far easier and more realistic for most teams to say" let's fail forward".

klysm · on July 5, 2023

Immutable infra is usually provided by some stateful tool that maps state changes to API calls

_ktx2 · on July 5, 2023

That's not true. Immutable infrastructure state should not be dependent on previous state because there is no state. It's remove and replace only.

If there's a piece of infrastructure that cannot be regarded as immutable and this is mimicking immutability behind an API (like a CDN) then it's not immutable. Understanding this and segregating that infrastructure away from totally immutable components (like servers, functions, load balancers, etc) is key.

The CDN mutability probably can be flushed out by some processes like etags but the most important problems are solved, just hard.

klysm · on July 5, 2023

What tools are you using to accomplish this? Can’t be terraform

_ktx2 · on July 5, 2023

As I said, you need to know how that infrastructure is built. Terraform is just fine. If you call to spin up a VM with a disk that's been imaged to a standard, that is immutable. Functions can also be made immutable by programming in a way that takes into account potential shared state.

klysm · on July 5, 2023

Terraform works by tracking mutable state though which was my point

_ktx2 · on July 5, 2023

Ah, state is relative in this context. A VM lacks state when you determine the starting state. A function lacks state depending on how you code it and the contracts you have. A CDN is always stateful. Sometimes the Terraform state overlaps with an entities internal state but most of the time not. "Immutable infrastructure" state refers to internal state, not Terraform state.

klysm · on July 6, 2023

Ah I see what you mean and I agree it’s possible to achieve immutable infrastructure in that sense.

c_crank · on July 5, 2023

Oh look, another post proclaiming mutability as the source of all nasty bugs and demanding an immutable solution. You know, you'd think that if immutability really was a golden elixir of correct coding, that Scala would have taken over the business world by now instead of shriveling on the vine.

namaria · on July 9, 2023

Oh look another comment pointing out that as long as companies can throw bodies at problems, creating drudgery in the name of profitability, there's no problem to be seen. The fact that good tools exist but are not mainstream is proof positive surely.

nunez · on July 5, 2023

I would say that's not being able to rollback to a last-known-good snapshot-of-the-world a feature, not a bug.

A major premise of cloud computing is that everything is disposable.

Consequently, this is meant to encourage consumers to express as much of their system's configuration with code and automation as possible.

For example: the change to the S3 bucket policy that you mentioned would, theoretically, be done by changing the policy doc in Terraform and reapplying the Terraform configuration.

Modification to those policies would, theoretically, only be allowed by anyone with the "AdministratorAccess" role, which would only be provided to the IAM user running Terraform (via temporary STS tokens) or admins through some sort of a "oh-shit" process.

pxc · on July 5, 2023

> A major premise of cloud computing is that everything is disposable.

That norm is itself a way of coping with the hopeless state of state management, a lesson learned from running imperative/convergent configuration management systems on VMs running mutable operating systems a generation before 'cloud native' was a thing.

That whole 'best practice' is a workaround for shaky foundations.

namaria · on July 9, 2023

It seems to me every time someone promises to make it easy to do something complex, they've just created a more efficient way to sweep problems under the carpet.

ivix · on July 5, 2023

I'm a little confused by this comment. It feels like you are complaining about the problems that are solved by configuration management and infrastructure as code, yet you seem to reject it as too complicated.

All the problems you raise are solved by infrastructure as code.

throwawaaarrgh · on July 6, 2023

No they are not.

Infrastructure as Code is literally is nothing more than keeping a config file in Git and running a script to apply the config to a REST API (or whatever). Terraform stands in as an overcomplicated script. Infrastructure as Code does not determine if what is being managed is immutable or not.

Configuration Management is a program designed to attempt to change mutable state so that it once again resembles what you want.

Immutable Infrastructure is immutable. You cannot change the state. Hence configuration management is pointless and unnecessary.