While I’ve observed PyTorch running faster for my (convolution-based) research, it’s within a few milliseconds of TensorFlow and Keras. If that kind of difference mattered (it might for some uses), I’d imagine you’d use cuDNN directly. I guess that’s the point: the libraries are all wrapping the same library. It’s like measuring IO performance between programming language standard libraries (they should all be close to the speed of the underlying system call).
We just plain can't do data augmentation quickly enough with TF. Queues, schmeues, doesn't matter. It still tops out at about 35 MB/s on MS COCO and starves even a single Titan Xp. On the same hardware, with the same data augmentation steps, PyTorch gets ~50 MB/s and saturates the GPU, since it never has to wait for data. In fact it can read even faster than that, and automatically parallelize the forward pass across several GPUs. You still retain full control over placement, however. Super slick.
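The idea that keeps the GPU fed is running augmentation in parallel workers so the training loop never blocks on the CPU. Here's a minimal, dependency-free sketch of that pattern (the dataset, the squaring "augmentation," and the function names are all toy stand-ins, not real PyTorch API; PyTorch's DataLoader does this with worker processes rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def augment_batch(batch):
    # Stand-in for real augmentation (crops, flips, color jitter):
    # square each value so there is some per-sample CPU work.
    return [[x * x for x in sample] for sample in batch]

def batches(dataset, batch_size):
    # Slice the dataset into fixed-size batches.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]

def prefetched(dataset, batch_size=2, workers=2):
    # map() preserves batch order while workers augment concurrently,
    # analogous in spirit to DataLoader(num_workers=workers).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(augment_batch, batches(dataset, batch_size))

data = [[1, 2], [3, 4], [5, 6], [7, 8]]
out = list(prefetched(data))
# out == [[[1, 4], [9, 16]], [[25, 36], [49, 64]]]
```

The training loop then consumes `prefetched(...)` and, ideally, always finds the next batch already augmented, which is the difference between starving the GPU at 35 MB/s and saturating it.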