
The activations (outputs) of one layer must be encoded the same way as that layer's weights and the next layer's weights, or the computation fails (unless you manage to write clever kernels that do math at several precisions simultaneously, but even then you're introducing more lossiness than just using a binary representation for those values).
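
Rough sketch of what I mean, assuming a recent PyTorch build where float16 matmul and the torch.float8_e4m3fn dtype are available (the exact error type/text will vary by version and backend):

    import torch

    W = torch.randn(64, 64).to(torch.float16)   # layer weights in float16
    x = torch.randn(1, 64).to(torch.float16)    # activation in float16

    y = x @ W                        # float16 @ float16 -> float16: fine

    x8 = x.to(torch.float8_e4m3fn)   # quantize just the activation
    try:
        _ = x8 @ W                   # precisions no longer match; the stock
                                     # matmul kernel rejects this
    except Exception as e:           # exact error depends on build/backend
        print("mixed-precision matmul failed:", e)

    y2 = x8.to(torch.float16) @ W    # casting back up works, but the float8
                                     # round trip only added quantization error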

Example: multiplying a bunch of float16s together gives you a float16, which is passed on to the next layer of float16s. Why should forcing the output of that first step down to float8 confer any advantage here? The only way I can see this argument working is if you make all the layers float8 too, and the reason you can do that is that the output of the first step can be faithfully represented in float8 because it never blows up. If that's what the author is saying, it wasn't very clear.
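
Here's the round trip I'm describing, under the same assumptions as above (recent PyTorch, torch.float8_e4m3fn available; the tensor names are just for illustration):

    import torch

    h  = torch.randn(1, 256).to(torch.float16)    # output of layer 1
    W2 = torch.randn(256, 256).to(torch.float16)  # weights of the (still float16) layer 2

    h8 = h.to(torch.float8_e4m3fn)                # force the intermediate down to float8...
    y_roundtrip = h8.to(torch.float16) @ W2       # ...then widen it again to feed layer 2
    y_direct    = h @ W2

    # The round trip changes nothing about how layer 2 computes; it only adds
    # quantization error. The float8 step pays off only if layer 2's weights
    # and kernel are float8 too (and the values stay inside float8's range).
    print((y_roundtrip - y_direct).abs().max())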


