We revisit the G1GC recommendation every once in a while. In fact, I am running benchmarks and tests of G1GC versus CMS with Elasticsearch 6.0.0 right now, so that we have a better idea of where we stand.
Disclaimer: I'm an Elasticsearch dev employed by Elastic.
Cool, I've a pretty big cluster with some GC issues (p90 - 15s, p99 - 60s) during node failures, and would be super interested in those results! If there's anything a user can do to help, my email is on my user page :D
We have observed in the past that long GC pauses are the cause of node failures. During a long GC pause the node doesn't respond, so the master node decides that it has left the cluster :\
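To make that failure mode concrete, here is a rough toy model in Python. It is not Elasticsearch's actual fault-detection code: the function name `node_dropped` is hypothetical, and the ping interval, timeout, and retry count are made-up illustrative values, not Elasticsearch defaults. It only sketches the idea that a node frozen in GC misses consecutive pings until the master gives up on it.

```python
def node_dropped(pause_s, ping_interval_s=1.0, ping_timeout_s=5.0, retries=3):
    """Toy model of a master pinging a node that is frozen in a GC pause.

    The pause starts at t=0 and lasts pause_s seconds. A ping fails if the
    node is still paused when the ping's timeout expires. After `retries`
    consecutive failed pings the master treats the node as gone.
    All default values are illustrative assumptions, not real settings.
    """
    t = 0.0        # time at which the current ping is sent
    failures = 0
    while failures < retries:
        if pause_s - t <= ping_timeout_s:
            # The node wakes up in time to answer this ping.
            return False
        failures += 1
        # Assumed behaviour: wait out the timeout, then one interval, then retry.
        t += ping_timeout_s + ping_interval_s
    return True
```

With these assumed numbers, calling `node_dropped(15)` and `node_dropped(60)` shows how a shorter pause can be survived while a longer one gets the node removed, even though the node itself is healthy once the pause ends.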
Ya, we often see a node die of natural causes, and then the garbage produced from recovering the node and relocating the data ends up bringing down the rest of the cluster via long GC pauses.
Disclaimer: I'm an Elasticsearch dev employed by Elastic.