The bottleneck with the pointer table may be the summation.
While the fetches of elements can be parallelized, the summation cannot, because each addition needs the result of the previous addition before it can start.
In some experiments I did with summation code, summing odd and even values into separate bins gave a considerable speedup. This only helps when the code does not too closely resemble a signal processing algorithm, though, since the compiler can otherwise already optimize that pattern on its own.
That was part of my video titled "new computers don't speed up old code".
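
For illustration, here is a minimal sketch of the two-bin idea in C. It is not the code from the experiments: the function names, the integer element type, and reading "odd and even" as odd- and even-indexed elements are all assumptions. The single-accumulator version forms one long dependency chain, while the two-bin version keeps two independent chains that the CPU can overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline (hypothetical name): one accumulator, so every addition
   has to wait for the result of the previous one. */
uint64_t sum_single(const uint32_t *values, size_t count)
{
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* Two bins (hypothetical name): even-indexed and odd-indexed elements
   go into separate accumulators that do not depend on each other. */
uint64_t sum_two_bins(const uint32_t *values, size_t count)
{
    uint64_t even = 0, odd = 0;
    size_t i = 0;

    /* Process elements in pairs. */
    for (; i + 1 < count; i += 2) {
        even += values[i];
        odd  += values[i + 1];
    }

    /* Pick up the last element when count is odd. */
    if (i < count)
        even += values[i];

    /* Combine the two bins once at the end. */
    return even + odd;
}
```

With integers both functions return the same result. With floating-point values the split changes the rounding order, so the sums can differ slightly, and that is also why a compiler will not normally make this change for you unless you allow it (for example with `-ffast-math`). Whether the two-bin version is actually faster depends on the compiler and optimization level, since a simple integer loop like this one may get vectorized anyway.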