Thanks for the data point. Water-cooled, impressive.
But what is the problem here? Power is work per unit time, so drawing more power can be a good thing, especially if the work is useful. That is more likely in the SIMD case, which pays a much lower OoO (out-of-order) cost than scalar code per unit of work accomplished. At some point the chip hits a power or thermal limit and throttles. That means less speedup than we might otherwise have seen, but it's still a useful speedup relative to scalar, right?
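To put rough numbers on that argument, here is a back-of-the-envelope sketch in Python. Every figure in it (lane count, clock speeds, wattages, one operation per cycle) is a made-up assumption for illustration, not a measurement from any real chip, and the helper names are hypothetical.

```python
# Back-of-the-envelope model: throttled SIMD vs. unthrottled scalar.
# All numbers are hypothetical placeholders, not measurements.

def throughput(lanes, ghz):
    """Elements processed per second, assuming one (vector) op per cycle."""
    return lanes * ghz * 1e9

def energy_per_element(lanes, ghz, watts):
    """Joules spent per element: power divided by throughput."""
    return watts / throughput(lanes, ghz)

# Hypothetical scalar run: 1 lane, full clock, modest power draw.
scalar = dict(lanes=1, ghz=3.0, watts=50.0)
# Hypothetical SIMD run: 8 lanes, clock throttled by 20%, higher power draw.
simd = dict(lanes=8, ghz=2.4, watts=80.0)

speedup = throughput(simd["lanes"], simd["ghz"]) / throughput(scalar["lanes"], scalar["ghz"])
energy_ratio = energy_per_element(**scalar) / energy_per_element(**simd)

print(f"speedup over scalar: {speedup:.1f}x")
print(f"scalar energy per element / SIMD energy per element: {energy_ratio:.1f}x")
```

With these assumed inputs, the throttled SIMD run is still roughly 6.4x faster than scalar and spends about a quarter of the energy per element, even though it draws more power while running. Real ratios depend entirely on the workload and the hardware.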