Unicode collation is very complex. If you wanted to sort according to that, then you would need to devise an isomorphism between unicode strings and bitstrings such that lexicographic ordering in the bitstring domain agrees with unicode collation in the unicode domain. I'm not saying it's impossible, but it's not at all obvious how to do it. On the other hand, if all you want is some total order for acceleration structures, then you can indeed just use an arbitrary encoding like utf8 and do the obvious thing.
> The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc
The obvious solution to this is to pre-compile a list of all possible Unicode strings up to the required length, sort them, and create a lookup table using their indices. I wonder for what kind of project this would be useful.
Seems like it should work for arbitrary byte strings (any charset, any encoding)but obviously the performance characteristics will differ because of non-uniform distribution. But that happens even in ASCII.
Lexicographic encoding of UTF-8 byte sequences matches lexicographic order of the sequence of Unicode code-points. So you can sort UTF-8 strings as byte strings. Not that sorting by code-points has much meaning, but you can use the Unicode collation algorithm first.
TIL, thanks. That makes sense given how the length prefixes look like but I never thought about it. I wonder if this was by chance or if the UTF-8 creator thought about it.