In busy systems, setting GOMAXPROCS equal to cpu.limits will still lead to the process exceeding its CFS quota and being throttled, because the Go runtime has additional threads beyond the GOMAXPROCS worker threads. IME, the only way to (nearly) guarantee there's no throttling is to set cpu.limits to GOMAXPROCS+1, leaving some headroom for those extra threads. Unfortunately, uber/automaxprocs (which makes the same error) is widely adopted, as is using the downward API to do the same thing in other cases.
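As a rough sketch of the inverse of that advice (not what uber/automaxprocs does): set GOMAXPROCS one below the CPU limit so the runtime's extra threads have headroom before the CFS quota kicks in. The CPU_LIMIT env var here is a hypothetical one you'd populate yourself, e.g. via a downward API resourceFieldRef on limits.cpu:

    package main

    import (
        "os"
        "runtime"
        "strconv"
    )

    func init() {
        // CPU_LIMIT is assumed to be a whole number of CPUs, injected via
        // the downward API (resourceFieldRef: limits.cpu, divisor "1").
        limit, err := strconv.Atoi(os.Getenv("CPU_LIMIT"))
        if err != nil || limit < 2 {
            return // fall back to the runtime default
        }
        // Leave roughly one CPU of quota for GC, sysmon, cgo and other
        // threads that exist outside the GOMAXPROCS pool.
        runtime.GOMAXPROCS(limit - 1)
    }

    func main() {
        // ... application code ...
    }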
Clicked expecting this to be about the history (https://officialrealbook.com/history/) of jazz fake books, which eventually evolved into the Real Book!
This is a design flaw in Kubernetes. The article doesn't really explain what's happening though. The real problem is that there is no synchronization between the ingress controller (which manages the ingress software's configuration, e.g. nginx, from the Endpoints resources), kube-proxy (which manages iptables rules from the Endpoints resource), and the kubelet (which sends the signals to the container). A preStop hook w/ a sleep equivalent to an acceptable timeout will handle the 99%+ cases (and the cases it doesn't will have exceeded your timeout anyhow). Things become more complicated when there are sidecar containers (say an envoy or nginx routing to another container in the same pod); that often requires shenanigans such as shared emptyDir{} volumes plus something that waits (with fsnotify or similar) for socket files to be closed to ensure requests have fully completed.
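For the simple no-sidecar case, the same delay can also live inside the application instead of a preStop hook. A minimal Go sketch, assuming a plain net/http server and an arbitrary 15-second grace period standing in for whatever timeout is acceptable:

    package main

    import (
        "context"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080"}

        go func() {
            sig := make(chan os.Signal, 1)
            signal.Notify(sig, syscall.SIGTERM)
            <-sig

            // Keep serving while kube-proxy and the ingress controller catch
            // up with the Endpoints change, then drain in-flight requests.
            time.Sleep(15 * time.Second)

            ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
            defer cancel()
            srv.Shutdown(ctx)
        }()

        srv.ListenAndServe()
    }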
It's more of a design compromise than an outright flaw, though. Since you can't know whether your order to shut down a pod has arrived or not in a distributed system (per the CAP theorem), you either do it the way k8s has already implemented it, or you accept potentially unbounded pod shutdown (and, by extension, release rollout) durations during network partitions. K8s just chose Availability over Consistency in this case.
You can argue whether it would have been preferable to choose C over A instead (or, even better, to make this configurable), but in a distributed system you always have to trade one of the two off. The hacks with shared emptyDir volumes just move the system back into "Consistency" mode, only in a hacky way.
I would say that's true for networking.k8s.io/v1beta1 Ingress, but not for networking.k8s.io/v1 which is much better.
There are still some issues around separation of "concerns", maybe, e.g.:
Should the Ingress also handle redirecting? ALB Ingress has its own annotations DSL to support this, and nginx has a completely different annotations DSL for the same thing. I don't think Envoy does, though.
But then there's the question of supporting CDNs; some controllers support it with annotations and some through `pathType: ImplementationSpecific` and a `backend.resource` CRD (which doesn't have to be a CRD; these could become native networking.k8s.io/v1 extensions in the future that controllers can opt in to support). This becomes great when combined with the operator framework (+ embedded kubebuilder).
So, I think there's a lot of potential for things to get better.
A great success story in the ecosystem is cert-manager, which a lot of controllers rely on as a peer dependency in the cluster.
> A preStop hook w/ a sleep equivalent to an acceptable timeout will handle the 99%+ cases
That's precisely what we did at one of my previous clients. To increase portability, we wrote the smallest possible sleep equivalent in C, statically linked it, stuck it into a ConfigMap and mounted it into the pods so every workload would have the same pre-stop hook.
It was funny to watch when a new starter in the team would find out about that very elegant, stable and useful hack and go "wtf is going on here?" :D
This dealt with pretty much all our 5XXs due to unclean shutdowns.
I mean, technically, you can recreate this scenario on a single host as well: send a SIGTERM to an application and try to swap in another instance of it.
System fundamentals are at the heart of that problem: SIGTERM is just a signal, and an application can choose to acknowledge it and do something, or catch it and ignore it. The system has no way of knowing what the application chose to do.
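A trivial Go sketch of that point: from the outside, this looks no different to a process that's draining gracefully, so the kubelet just waits out terminationGracePeriodSeconds and then SIGKILLs it.

    package main

    import (
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        // A process is free to shrug off SIGTERM entirely.
        signal.Ignore(syscall.SIGTERM)
        for {
            time.Sleep(time.Hour) // keep running until something sends SIGKILL
        }
    }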
All that to say, I'm not sure it's as much of a flaw in Kubernetes as much as it's the way systems work and Kubernetes is reflecting that.
In my view it is a clear flaw that the signal to terminate can arrive while the server is still getting new requests. Being able to steer traffic based on your knowledge of the state of the system is one of the reasons why you'd want to set up an integrated environment where the load-balancer and servers are controlled from the same process.
The time to send the signal is entirely under control of the managing process. It could synchronize with the load-balancer before sending pods the term signal, and I'm unclear why this isn't done.
I don't think there is anything reasonable to synchronize with that will guarantee no new connections. You can remove the address from the control plane synchronously, but the stale config might live on in the kubelet or kube-proxy distributed throughout the cluster. I don't think you want to have blocking synchronization with every node every time you want to stop a pod.
The alternative is that you wait some amount of time before dying instead of explicit synchronization, which is exactly what this lame-duck period is. You find out that you should die ASAP, and then you decide how long you want to wait until you actually die.
I don't really see an issue with adding synchronisation; there's no fundamental reason why having endpoint consumers acknowledge updates before terminating removed pods would be horrifically expensive. Especially with EndpointSlices.
With 10,000 nodes running kube-proxy it is a bit expensive and, more importantly, error-prone. A problem on a single node that wasn't even talking to the app could stop that app from exiting indefinitely if acks were required, and clusters this size already do gigabits of traffic in Endpoints watches.
Additionally, acks aren't possible for clients of headless services, so having just kube-proxy handle this doesn't go far enough.
But yeah, maybe accept that as a tradeoff for clusterip services, but more deeply integrate the real load balancer options.
Its design is good enough. There's just enough protocol to make it portable, and it's almost completely extensible so you can make it do basically anything.
Hey, this is Aaron from GitHub. We're using devicemapper w/ LVM backed pools. Would love to hear about your experience there. We definitely see this problem during periods of high container churn.
That's funny, we have an internal bug open right now about kernel panics that happen with devicemapper (with XFS as the base filesystem). We found that the issue was exacerbated by using loopback devices, but on paper it should still happen in non-loopback mode (the current theory is that it's a bug in XFS). Our kernel team is still investigating, but they cannot seem to reproduce the issue with direct-lvm (and loop-lvm is inconsistent in reproducing it).
If you can consistently reproduce the issue, would you mind providing the backtrace and/or coredump? Is it possible for you to reproduce the issue on a machine without needing to be hit by GitHub-levels of traffic, and if so can you provide said reproducer?
For reference, our backtraces show the kernel dying in xfs_vm_writepage. Though of course different kernel versions may have varying backtraces.
You can reach me on the email in my profile, or asarai(at)suse.com.
My schroot tool used for building Debian packages could panic a kernel in under five minutes reliably, when it was rapidly creating and destroying LVM snapshots in parallel (24 parallel jobs, with lifetimes ranging from seconds to hours, median a minute or so).
This was due to udev races in part (it likes to open and poke around with LVs in response to a trigger on creation, which races with deletion if it's very quick). I've seen undeletable LVs and snapshots, oopses and full lockups of the kernel with no panic. This stuff appears not to have been stress tested.
I switched to Btrfs snapshots, which were more reliable, but the rapid snapshot churn would unbalance the filesystem into a read-only state in just 18 hours or so. Overlays worked, but with caveats. We ended up going back to unpacking tarballs for reliability. I'm currently writing ZFS snapshot support; I should have done it years ago instead of bothering with Btrfs.
In my work identity, we saw a similar problem in our testing, where blkid would cause undesired IO on fresh devices. Eventually, we stopped blkid from scanning our device-mapper devices on state changes with a file /etc/udev/59-no-scanning-our-devices.rules containing:
ENV{DM_NAME}=="ourdevice", OPTIONS:="nowatch"
Alternatively, you could call 'udevadm settle' after device creation before doing anything else, which will let blkid get its desired IO done, I think.
Yes, we did something similar to disable the triggers. Unfortunately, while this resolved some issues such as being unable to delete LVs which were erroneously in use, it didn't resolve the oopses and kernel freezes which were presumably locking problems or similar inside the kernel.
A known (and now fixed) kernel issue affects the scheduler and cgroups subsystem, triggering crashes under Kubernetes load (fixed by 754bd598be9bbc9 and 094f469172e00d).
The fix was merged in Linux 4.7 (and backported to -stable, in 4.4.70).
So if you run an older kernel, maybe you are hit by this?
These GitHub commands are provided by hub:
pull-request Open a pull request on GitHub
fork Make a fork of a remote repository on GitHub and add as remote
create Create this repository on GitHub and add GitHub as origin
browse Open a GitHub page in the default browser
compare Open a compare page on GitHub
release List or create releases (beta)
issue List or create issues (beta)
ci-status Show the CI status of a commit
With close to a decade of remote work experience under my belt, at companies ranging from more than 50% remote (GitHub) down to me being the only remote guy, I have concluded that some things are best left in the physical world...
There is an actual whiteboard on my wall and a webcam on a stand. When I want to whiteboard something, I get up, switch camera angles, and literally draw it out. I have found this to be far more productive than any whiteboard apps out there.
For something more interactive, Google Draw isn't awful, but it's not pen and paper.
Thanks for commenting! That's true, I do have a whiteboard within arm's reach in my home office. Sometimes I'll take a picture of it and send it along when talking about a problem.