The XOR trick was sometimes useful in the past, on weird CPUs that had non-equivalent registers and which also lacked register exchange instructions (the Intel/AMD CPUs have non-equivalent registers, but they have a register exchange instruction, so they do not need this trick).
The XOR trick is not useful on modern CPUs for swapping memory blocks, because on modern CPUs the slowest operations are the memory accesses and the XOR trick needs too many memory accesses.
For swapping memory, the fastest way needs 4 memory accesses: load X, load Y, store X where Y was, store Y where X was. Each "load" and "store" in this sequence may consist of multiple load or store instructions, if multiple registers are used as the intermediate buffer. Ideally, an intermediate register buffer matching the cache line size should be used, with accesses aligned to cache lines.
Hopefully, std::rotate is written in such a way that it is compiled into such a sequence of machine instructions.
The XOR trick is not useful on modern CPUs for swapping memory blocks, because on modern CPUs the slowest operations are the memory accesses and the XOR trick needs too many memory accesses.
For swapping memory, the fastest way needs 4 memory accesses: load X, load Y, store X where Y was, store Y where X was. Each "load" and "store" in this sequence may consist of multiple load or store instructions, if multiple registers are used as the intermediate buffer. Ideally, an intermediate register buffer matching the cache line size should be used, with accesses aligned to cache lines.
Hopefully, std::rotate is written in such a way that it is compiled into such a sequence of machine instructions.