Doesn’t that make sense, though? Each one manipulates a different layer in the memory hierarchy, letting the programmer control the latency and throughput implications. I see it as a good thing.
I wonder if some Apple-made software, like Final Cut, makes use of all of those "duplicated" instructions at the same time to get better performance...
I know the multitasking nature of the OS probably makes this happen across different programs anyway, but it would nonetheless be pretty cool!
Would it be possible to use all of them at the same time? Not necessarily in a practical way, just for fun? Could the different CPU-side approaches be executed, to some extent, by one core at the same time, given that it's superscalar?
I inferred that by "neural accelerators" they meant the Neural Engine cores, or it could be a bigger/different AMX (which really should become a standard, btw).
1. CPU, via SIMD/NEON instructions (just dot products)
2. CPU, via AMX coprocessor (entire matrix multiplies, M1-M3)
3. CPU, via SME (M4)
4. GPU, via Metal (compute shaders + simdgroup-matrix + MPS matrix kernels)
5. Neural Engine, via CoreML (advisory)
Apple also appears to be adding a “Neural Accelerator” to each core on the M5?
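For anyone curious what paths 1 and 2 look like from user code, here's a minimal sketch of my own (not from the thread): path 1 uses NEON intrinsics directly for a dot product, while path 2 calls Accelerate's BLAS, which is the supported way to reach the undocumented AMX coprocessor on M1–M3 (Apple doesn't expose AMX instructions publicly, so whether a given call lands on AMX is an assumption about Accelerate's internals).

```c
// Minimal sketch: two of the matrix-math paths on Apple Silicon.
// Compile on macOS with: clang -O2 demo.c -framework Accelerate
#include <arm_neon.h>
#include <Accelerate/Accelerate.h>
#include <stdio.h>

// Path 1: CPU SIMD/NEON -- a 4-wide fused-multiply-add dot product.
static float neon_dot(const float *a, const float *b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    float sum = vaddvq_f32(acc);            // horizontal add of the 4 lanes
    for (; i < n; i++) sum += a[i] * b[i];  // scalar tail
    return sum;
}

int main(void) {
    enum { N = 8 };
    float a[N * N], b[N * N], c[N * N];
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Path 2: Accelerate's BLAS does a whole matrix multiply; on M1-M3
    // this is the sanctioned route to the matrix coprocessor.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0f, a, N, b, N, 0.0f, c, N);

    printf("dot = %f, gemm[0] = %f\n", neon_dot(a, b, N * N), c[0]);
    return 0;
}
```

Paths 3–5 need different entry points entirely (SME intrinsics/streaming mode, Metal compute shaders, and a CoreML model respectively), which is part of why they can coexist without stepping on each other.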