Just my 2 cents. I've been running my scientific computing code (QM/MM, not ML) on a cluster with various configs (6xK40, 4xK80, 6xK20, etc.), and the performance I've seen from the K80 is quite strange. I've been using CUDA_DEVICE 0,1,2,3 of that config, and if I try to use more than one logical GPU, the scaling is not 1:1 but more like 1:0.6.

The only conclusion I've been able to reach is that the K80 presents itself as 2 separate devices (0,1 or 2,3 in that config), but the performance is nowhere near 2x. There is quite a lot of PCI bus contention, which badly hurts the performance of my code (it just runs many <10 ms kernels at a time). So far, 2xK40 seems to be a better value and performance proposition than 1xK80 on the same bus, although the flops/watt side of that equation greatly favors the K80.
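
In case anyone wants to put a number on this kind of scaling for their own code, here is a minimal Python sketch. It assumes you have timed the same workload once on a single logical GPU and once on N of them. The function name and the timings are made up for illustration; they are not measurements from my runs.

```python
def scaling_report(t_single, t_multi, n_gpus):
    """Return (speedup, efficiency) for one workload timed on 1 GPU and on n_gpus GPUs.

    speedup    = t_single / t_multi   (ideal value: n_gpus)
    efficiency = speedup / n_gpus     (ideal value: 1.0)
    """
    if t_single <= 0 or t_multi <= 0:
        raise ValueError("timings must be positive")
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    speedup = t_single / t_multi
    efficiency = speedup / n_gpus
    return speedup, efficiency


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, for illustration only.
    speedup, efficiency = scaling_report(t_single=100.0, t_multi=62.5, n_gpus=2)
    print(f"speedup: {speedup:.2f}x, efficiency per GPU: {efficiency:.0%}")
```

An efficiency well below 1.0 when going from one half of a K80 to both halves, but not when going from one K40 to two, would be consistent with the bus contention described above, though it does not prove it by itself.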