I would argue we, as a community, need to write portable, yet optimised, byte order convertors: htobe, htole, htobel, htolel, htobell, htolell, and vice versa.
Converting the byte order of an integer is a mostly pointless operation (which is what the article tries to say); what is needed is a portable, yet optimized, way to build and parse portable binary structures. In my opinion there are two reasons why too much optimization here is a complete waste of time:
1) Even if compilers are not able to optimize manual conversion of an integer to/from discrete bytes into the same code as a word-sized access with an optional byte swap, it mostly doesn't matter: there isn't going to be any significant difference in performance between one four-byte access and four one-byte accesses, as in both cases you end up with the same number of actual memory transactions, which is the slow part, due to caches.
2) When you are handling a portable binary representation of something, it is always connected to some IO, which is slow already, so any performance boost you get from a micro-optimization like this is completely negligible.
I tend to just hand-write a few lines of C to pack/unpack integers explicitly when needed, as that seems to me the most productive thing you can do.
By the way, all the big endian <-> little endian functions you propose boil down to two implementations for each operand size: a no-op and a mirroring of all bytes, both of which are mostly trivial.
What is really missing is a portable and efficient way to encode floating point numbers, as there is no portable way to find out their endianness, and in the floating point case it is more complex than just big vs. little endian.