Rollback is of course useful when things go wrong (and, by the way, the routers CF uses support rollback natively). What I’m questioning is flapping as a structured way to carry out network changes.
Even a slow flap can cause issues downstream. Imagine a router handling hundreds of thousands of routes. Its software has a memory leak so any route received increases its RAM usage. A slow flap may well bring that router to a halt. Now you might say, “hey, this is not my fault”, but it is still something that could happen to your routers or your peers.
Another aspect is that network devices can get Terabits/s of traffic. Now, a router is mostly stateless, but if you do this flapping thing to a firewall, what you get is a lot of sessions with behavior1 and then switching to behavior2 and so on, which can cause high buffer utilization or packet drops.
So, yes, of course you “flap” (rollback) when things go wrong, but you probably don’t do it intentionally to test what’s going on in a network change.
> Its software has a memory leak so any route received increases its RAM usage.
Surely you realize this is a weak reason? The argument against is that someone else’s misbehaving software is my problem? I mean, anyone sane in networking would treat this as not their problem (or at least work with the major providers for whom it is a problem to make this possible).
However, the strongest reason I don’t buy this is that routes change regularly as a matter of course. Changing a route forward and back is no different from changing it twice, so this bug would already be causing you issues, and the flap adds maybe a small percentage of extra advertisements.
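To put rough numbers on that, here is a back-of-the-envelope sketch. Every figure in it is hypothetical (not a measurement from any real network), and the function name `extra_update_fraction` is made up for illustration. It assumes one flap cycle costs two updates per prefix: one to advertise the new route and one to revert it.

```python
def extra_update_fraction(baseline_updates: int, flapped_prefixes: int, cycles: int = 1) -> float:
    """Fraction of extra route updates a dry-run flap adds on top of normal churn."""
    if baseline_updates <= 0:
        raise ValueError("baseline_updates must be positive")
    # Assumption: one cycle = advertise the new route, then revert it,
    # so each flapped prefix generates two updates per cycle.
    extra_updates = 2 * flapped_prefixes * cycles
    return extra_updates / baseline_updates


# Hypothetical figures: a peer already sees 1,000,000 updates per day,
# and the dry run flaps 500 prefixes once.
print(f"{extra_update_fraction(1_000_000, 500):.2%}")  # prints 0.10%
```

If the ratio comes out small for your own numbers, a leak that triggers on every received route would be hit by normal churn long before the dry run matters.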
> what you get is a lot of sessions with behavior1 and then switching to behavior2 and so on, which can cause high buffer utilization or packet drops.
Again, this relies largely on FUD rather than concrete explanations. BGP routes change regularly and often. Such issues, if they exist, are already problems, and briefly advertising a new route as a dry run doesn’t alter them in any meaningful way. The problem is that you’re treating a “flap” as somehow magically different from any normal route change when it isn’t meaningfully so.
In the session scenario, I was talking about firewalls, not BGP routers (although, of course, you could have firewall features on a BGP router).
What I'm saying is that there are robust ways to validate and carry out network changes, including gradual rollout (if that's what you want) using route or firewall rule priority or other mechanisms.
I remain skeptical about this flapping strategy, but if it works in your setup, good for you.