You're welcome! Yes, we use a KV cache. Implementing it efficiently, in both hardware requirements and compute time, is one of the benefits of our deterministic chip architecture (and deterministic system architecture).
I think the batch size is currently 1. Unlike graphics processors, which really need data parallelism to get good throughput, our LPU architecture delivers good throughput even at batch size 1.
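
For anyone unfamiliar with the term, here is a generic sketch of what a KV cache does during decoding at batch size 1. It's a toy in plain Python and says nothing about how this is implemented on the LPU. The names `KVCache`, `attend`, and `decode_without_cache` are made up for illustration, and the query/key/value vectors are passed in directly rather than computed by a real model.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(dim)
    ]


class KVCache:
    """Hypothetical single-sequence (batch size 1), single-head cache."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Store this token's key and value once; later steps reuse them
        # instead of recomputing them for the whole prefix.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


def decode_without_cache(qs, ks, vs):
    """Reference path: rebuild the full prefix of keys/values at every step."""
    return [
        attend(qs[t], ks[: t + 1], vs[: t + 1])
        for t in range(len(qs))
    ]


if __name__ == "__main__":
    qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ks = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
    vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    cache = KVCache()
    for q, k, v in zip(qs, ks, vs):
        print(cache.step(q, k, v))
```

Both paths give the same attention outputs. The cached path just does one append per generated token instead of re-deriving keys and values for every earlier token, at the cost of keeping them in memory.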