
I wish they mentioned how long a full deployment takes Meta using this method; that seems like an important detail to omit.

> So, if you’d rather not have downtime with your servers, data centers, and clouds, follow Meta’s example and use live patching. You’ll be glad you did.

Maybe if you're working at Meta's scale it makes sense... But I think most well designed services and applications should be able to get by just fine with a full reboot of any single server. I can't really fathom the complexity of managing millions of servers though.



> Maybe if you're working at Meta's scale it makes sense... But I think most well designed services and applications should be able to get by just fine with a full reboot of any single server.

I feel like this should be the opposite... I don't work at Meta scale, but I do work for a CDN with 10s of thousands of servers, and everything we do is based on the idea that some machines will always be going down, because some hardware is going to fail every day just from probability. You have to design everything for failure.

Given that, it shouldn't be hard to take servers out of production for patching and updates.

In other words, a hyperscaler is going to have less incentive to minimize down time than smaller shops.


I don't follow. A reboot is downtime. Of course your architecture must allow for downtime when it happens, but it's lost money either way: your hardware is not doing any useful work while rebooting. So the more computers you have, the more money is lost. At small scale that's not significant, but at large scale it might be, so there's more incentive to reduce downtime.


A reboot, a software deployment (kernel upgrade), server replacement, etc. are all the same process. That simplifies things dramatically. You can micro-optimize the 30s it takes to reboot a server, or you can simplify a runbook to have one process for any “deployment”. Different scenarios require different things but for most “web scale” things that need to be overprovisioned anyway, I’d take the simpler process.


These servers don't take 30s to reboot. Some servers take many minutes. It's a lot.


Worse, some just don't come back without manual intervention. Power supplies don't last forever and might run fine while the machine is on, but after a reboot... boom, gone.


I'd prefer kexec to kpatch, then


Spin new servers up before you take the old ones down. Effectively zero loss of time for that service.


Sounds like something to fix rather than to paper over?


Isn't it more significant at smaller scale? That is, if you have fewer computers serving requests, the downtime of a single system will be more pronounced (as opposed to rebooting one machine out of 20 in a rack).


If it isn’t an emergency patch, we do all our maintenance at low traffic times (e.g. the middle of the night local time for the data center). Your capacity planning is based on peak traffic, so you can afford to have more machines out during low traffic times.
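A toy illustration of that headroom (all numbers here are assumed for illustration, not from the comment above; Python):

    # Capacity is sized for peak, so at the nightly trough much of the
    # fleet is spare. Assumed numbers for illustration only.
    peak_util = 0.80            # target fleet utilization at peak traffic
    trough_ratio = 0.40         # night traffic as a fraction of peak (assumed)
    trough_util = peak_util * trough_ratio   # 0.32 of capacity busy at night
    drainable = 1 - trough_util              # fraction you could take offline
    print(f"{drainable:.0%} of the fleet is idle enough to patch at the trough")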


Yes, you need to overprovision the servers a little bit.

But you get a much simpler process.

Process ain't free either.


> You have to design everything for failure. Given that, it shouldn't be hard to take servers out of production for patching and updates.

There's a big, big difference, especially at the scale of hundreds of thousands / millions of servers, between designing such that your architecture can suffer 1% of servers being offline and 10% of servers being offline. If you have 1 million servers, even if you could take 1% of them offline at once (i.e. 10,000 servers), if it takes 5 minutes to reboot, you then need to wait 5 minutes * 100 one-percent-buckets = 500 minutes, or 8.3 hours to do a full patch. When you have critical security updates (like Heartbleed) you simply cannot have unpatched servers exposed to the Internet for that much time. And that's not including the amount of time it takes to actually send reboot/patch commands to 10,000 servers.
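A back-of-the-envelope sketch of that math (bucket size and reboot time are the assumed inputs; Python):

    # Wall-clock time for a rolling reboot, one bucket of servers at a time.
    # Fleet size cancels out: only the bucket fraction and per-server
    # reboot time matter for the total duration.
    def rollout_hours(bucket_fraction, reboot_minutes):
        waves = 1 / bucket_fraction        # e.g. 1% buckets -> 100 waves
        return waves * reboot_minutes / 60

    print(rollout_hours(0.01, 5))   # 1% at a time, 5 min reboots -> ~8.3 hours
    print(rollout_hours(0.10, 5))   # 10% at a time -> ~0.8 hours, more risk per wave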

The larger the bucket, the more likely a bad patch is noticed by the public, and the more likely that an ordinary traffic spike (which your extra capacity is ordinarily there for) will overwhelm your servers (since that extra capacity is being used to handle the patch rollout and rebooting). Sure, you can plan and add even more capacity to compensate, which makes it take even longer to roll out a patch, and now Finance is knocking at your door wondering if you really need all these servers and whether you could decommission some of them to save money.

It's a fundamentally difficult problem at hyperscaler scale.


I think AWS does pretty well with that philosophy at hyperscale.

AWS has millions of servers in a single AZ.


If you've seen any of Werner Vogels's talks, you will notice that Amazon and AWS feel the same way. Any scalable service should be able to withstand the loss of some of its components and keep operating.


Edit: ignore the below numbers, I got hours and minutes confused.

If each server spends 7.5 minutes each month rebooting, that is 1% of all your computers wasted. If you have 10,000 servers, that's worth 100 servers. If you have 1 million servers, that's worth 10,000 servers. If each server costs $10,000, that's $100 million of compute capacity. You can see how that amount of lost compute capacity can start to justify spending engineering time on driving down the amount of time servers spend rebooting.


There’s 1,440 minutes per day. In a month with 30 days that’s 43,200 minutes per month.

7.5min/43,200min = 0.00017361

Where are you getting 1% of all your computers wasted per month???


Sorry I got hours and minutes confused and overstated the benefits.

So the correct numbers would be 7.5 minutes of downtime divided by 43,200 minutes times 1,000,000 servers. That’s 173 servers wasted. That is probably still enough servers wasted to devote some engineering time to increasing utilization.
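As a quick check of that arithmetic (Python, using the numbers above):

    downtime_min = 7.5                 # reboot downtime per server per month
    month_min = 30 * 24 * 60           # 43,200 minutes in a 30-day month
    fleet = 1_000_000
    wasted = downtime_min / month_min * fleet
    print(int(wasted))                 # -> 173 server-equivalents lost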


Capacity is measured by peak usage, but most data centers experience a daily traffic cycle with low times at night. We just patch/upgrade our servers at the local low time, when traffic is much lower and a lot of your machines are idle.


https://youtube.com/watch?v=ILTqn1EYIXQ is the original talk, which says it takes 4 days to deploy a KLP (kernel live patch) to the whole fleet.


There's more in-depth information on that subject here:

https://www.usenix.org/conference/osdi23/presentation/grubic


[flagged]


Ads are just what pays the bills. The cluster is also used for all sorts of useful things, like mass communication, especially during emergencies, support groups, and providing information to help manage a pandemic.

Software engineers aren't nuclear physicists or machinists. I'm not sure how you want to use them to build nuclear reactors.


People become software engineers instead of choosing a different profession because there is so much money in ads.


And lawyers only choose that profession to ambulance chase, and doctors only choose that profession to do boob jobs. People get into software development for more than just money, y'know?


It’s a very reductive take. Meta is helping a lot of people get in touch and keep in touch for free (my family is all over the place and we use WhatsApp extensively to communicate). There’s clearly a demand for this, and doing it for the price of a couple of ads doesn’t seem sad to me.


Did Meta create WhatsApp? Or did they just buy it to sell ads and track its users?


No, the users are the products. It’s not free.


You’re not paying them out of your own pocket. I’ll happily use a better word for that if you have one.


Ass, gas, or grass; nobody rides for free. In this case you're paying with your ass. You're turning tricks for Zuck.

Meta pimps your mind out to anybody who wants to take a crack at you, to manipulate your political views or extract money out of you. In return, they give you purses and pay for you to get your nails did.


It’s not free.


Ads act as a recommendation engine, helping people find solutions to their problems sometimes before they ever realize there is a solution.


Ads are intended to convince people to spend money. In a perfect world that might mean actually being useful solutions, but I have zero faith that that's even a passing concern in reality.


Yeah, and casinos offer people some good clean fun. Definitely no incentives here to ”suggest” things to people that they don’t need or benefit from having, right?


I don't know what ads you are getting, but most of them are either trash dropshipping products, gambling products (often bad mobile games), or other worthless stuff.

There are some platforms with relatively "high" quality ads, but Facebook is not one of them.


lol come on now. It’s to make Facebook a bunch of money, not to tell the future.

I use Facebook’s products and they’ve never helped me or anyone I know in the way you describe.



