
I wish they mentioned how long a full deployment takes Meta using this method; that seems like an important detail to omit.

> So, if you’d rather not have downtime with your servers, data centers, and clouds, follow Meta’s example and use live patching. You’ll be glad you did.

Maybe if you're working at Meta's scale it makes sense... But I think most well designed services and applications should be able to get by just fine with a full reboot of any single server. I can't really fathom the complexity of managing millions of servers though.



> Maybe if you're working at Meta's scale it makes sense... But I think most well designed services and applications should be able to get by just fine with a full reboot of any single server.

I feel like this should be the opposite... I don't work at Meta scale, but I do work for a CDN with 10s of thousands of servers, and everything we do is based on the idea that some machines will always be going down, because some hardware is going to fail every day just from probability. You have to design everything for failure.

Given that, it shouldn't be hard to take servers out of production for patching and updates.

In other words, a hyperscaler is going to have less incentive to minimize down time than smaller shops.


I don't follow. A reboot is downtime. Of course your architecture must allow for downtime when it happens, but it's lost money either way: your hardware is not doing any useful work while rebooting. So the more computers you have, the more money is lost. At small scale that's not significant, but at large scale it might be, so there's more incentive to reduce downtime.


A reboot, a software deployment (kernel upgrade), server replacement, etc. are all the same process. That simplifies things dramatically. You can micro-optimize the 30s it takes to reboot a server, or you can simplify a runbook to have one process for any “deployment”. Different scenarios require different things but for most “web scale” things that need to be overprovisioned anyway, I’d take the simpler process.


These servers don't take 30s to reboot. Some servers take many minutes. It's a lot.


Worse, some just don't come back without manual intervention. Power supplies don't last forever and might run fine while the machine is on, but after a reboot... boom, gone.


I'd prefer kexec to kpatch, then


Spin new servers up before you take the old ones down. Effectively zero loss of time for that service.


Sounds like something to fix rather than to paper over?


Isn't it more significant at smaller scale? That is, if you have fewer computers serving requests, the downtime of a single system will be more pronounced (as opposed to rebooting one machine out of 20 in a rack).


If it isn’t an emergency patch, we do all our maintenance at low traffic times (e.g. the middle of the night local time for the data center). Your capacity planning is based on peak traffic, so you can afford to have more machines out during low traffic times.
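A toy illustration of that headroom (all numbers here are assumed for illustration, not from the comment above; Python):

    # Capacity is sized for peak, so at the nightly trough much of the
    # fleet is spare. Assumed numbers for illustration only.
    peak_util = 0.80            # target fleet utilization at peak traffic
    trough_ratio = 0.40         # night traffic as a fraction of peak (assumed)
    trough_util = peak_util * trough_ratio   # 0.32 of capacity busy at night
    drainable = 1 - trough_util              # fraction you could take offline
    print(f"{drainable:.0%} of the fleet is idle enough to patch at the trough")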


Yes, you need to overprovision the servers a little bit.

But you get a much simpler process.

Process ain't free either.


> You have to design everything for failure. Given that, it shouldn't be hard to take servers out of production for patching and updates.

There's a big, big difference, especially at the scale of hundreds of thousands / millions of servers, between designing such that your architecture can suffer 1% of servers being offline and 10% of servers being offline. If you have 1 million servers, even if you could take 1% of them offline at once (i.e. 10,000 servers), if it takes 5 minutes to reboot, you then need to wait 5 minutes * 100 one-percent-buckets = 500 minutes, or 8.3 hours to do a full patch. When you have critical security updates (like Heartbleed) you simply cannot have unpatched servers exposed to the Internet for that much time. And that's not including the amount of time it takes to actually send reboot/patch commands to 10,000 servers.
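A back-of-the-envelope sketch of that math (bucket size and reboot time are the assumed inputs; Python):

    # Wall-clock time for a rolling reboot, one bucket of servers at a time.
    # Fleet size cancels out: only the bucket fraction and per-server
    # reboot time matter for the total duration.
    def rollout_hours(bucket_fraction, reboot_minutes):
        waves = 1 / bucket_fraction        # e.g. 1% buckets -> 100 waves
        return waves * reboot_minutes / 60

    print(rollout_hours(0.01, 5))   # 1% at a time, 5 min reboots -> ~8.3 hours
    print(rollout_hours(0.10, 5))   # 10% at a time -> ~0.8 hours, more risk per wave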

The larger the bucket, the more likely a bad patch is noticed by the public, and the more likely that an ordinary traffic spike (which your extra capacity is ordinarily there for) will overwhelm your servers (since that extra capacity is being used to handle the patch rollout and rebooting). Sure, you can plan and add even more capacity to compensate, which makes it take even longer to roll out a patch, and now Finance is knocking at your door wondering if you really need all these servers and whether you could decommission some of them to save money.

It's a fundamentally difficult problem at hyperscaler scale.


I think AWS does pretty well with that philosophy at hyperscale.

AWS has millions of servers in a single AZ.


If you've seen any of Werner Vogels's talks, you will notice that Amazon and AWS feel the same way. Any scalable service should be able to withstand the loss of some of its components and keep operating.


Edit: ignore the below numbers, I got hours and minutes confused.

If each server spends 7.5 minutes each month rebooting, that is 1% of all your computers wasted. If you have 10,000 servers, that's worth 100 servers. If you have 1 million servers, that's worth 10,000 servers. If each server costs $10,000, that's $100 million of compute capacity. You can see how that amount of lost compute capacity can start to justify spending engineering time on driving down the amount of time servers spend rebooting.


There’s 1,440 minutes per day. In a month with 30 days that’s 43,200 minutes per month.

7.5min/43,200min = 0.00017361

Where are you getting 1% of all your computers wasted per month???


Sorry I got hours and minutes confused and overstated the benefits.

So the correct numbers would be 7.5 minutes of downtime divided by 43,200 minutes times 1,000,000 servers. That’s 173 servers wasted. That is probably still enough servers wasted to devote some engineering time to increasing utilization.
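As a quick check of that arithmetic (Python, using the numbers above):

    downtime_min = 7.5                 # reboot downtime per server per month
    month_min = 30 * 24 * 60           # 43,200 minutes in a 30-day month
    fleet = 1_000_000
    wasted = downtime_min / month_min * fleet
    print(int(wasted))                 # -> 173 server-equivalents lost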


Capacity is measured by peak usage, but most data centers experience a daily traffic cycle with low times at night. We just patch/upgrade our servers at the local low time, when traffic is much lower and a lot of your machines are idle.


https://youtube.com/watch?v=ILTqn1EYIXQ is the original talk, which says it takes 4 days to deploy a KLP (kernel live patch) to the whole fleet.


There's more in-depth information on that subject here:

https://www.usenix.org/conference/osdi23/presentation/grubic


[flagged]


Ads are just what pays the bills. The cluster is also used for all sorts of useful things, like mass communication, especially during emergencies, support groups, and providing information to help manage a pandemic.

Software engineers aren't nuclear physicists or machinists. I'm not sure how you want to use them to build nuclear reactors.


People become software engineers instead of choosing a different profession because there is so much money in ads.


And lawyers only choose that profession to ambulance chase, and doctors only choose that profession to do boob jobs. People get into software development for more than just money, y'know?


It’s a very reductive take. Meta is helping a lot of people get in touch and keep in touch for free (my family is all over the place and we use WhatsApp extensively to communicate). There’s clearly a demand for this, and doing it for the price of a couple of ads doesn’t seem sad to me.


Did Meta create WhatsApp? Or did they just buy it to sell ads and track its users?


No, the users are the products. It’s not free.


You’re not paying them out of your own pocket. I’ll happily use a better word for that if you have one.


Ass, gas, or grass; nobody rides for free. In this case you're paying with your ass. You're turning tricks for Zuck.

Meta pimps your mind out to anybody who wants to take a crack at you, to manipulate your political views or extract money out of you. In return, they give you purses and pay for you to get your nails did.


It’s not free.


Ads act as a recommendation engine, helping people find solutions to their problems sometimes before they ever realize there is a solution.


Ads are intended to convince people to spend money. In a perfect world that might mean actually being useful solutions, but I have zero faith that that's even a passing concern in reality.


Yeah, and casinos offer people some good clean fun. Definitely no incentives here to ”suggest” things to people that they don’t need or benefit from having, right?


I don't know what ads you are getting, but most of them are either trash dropshipping products, gambling products (often bad mobile games), or other worthless stuff.

There are some platforms with relatively "high" quality ads, but Facebook is not one of them.


lol come on now. It’s to make Facebook a bunch of money, not to tell the future.

I use Facebook’s products and they’ve never helped me or anyone I know in the way you describe.



