
Presumably it would be fine, since Unicode input is still a string of bytes at the end of the day.


Unicode collation is very complex. If you wanted to sort according to that, you would need to devise an order-preserving encoding from Unicode strings to bitstrings, such that lexicographic ordering in the bitstring domain agrees with Unicode collation in the Unicode domain. I'm not saying it's impossible, but it's not at all obvious how to do it. On the other hand, if all you want is some total order for acceleration structures, then you can indeed just use an arbitrary encoding like UTF-8 and do the obvious thing.


Isn't that exactly what the Unicode collation algorithm is?

https://unicode.org/reports/tr10

tl;dr from its Wikipedia page:

> The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc.


Ah! You are right.


You can just use the ICU library to do that. It's a very common thing to do when working with Unicode collation.
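As a sketch of the same idea without pulling in ICU: Python's standard library exposes the C library's collation via `locale.strxfrm`, which maps a string to a sort key whose ordinary comparison follows the locale's collation rules. The locale name below is an assumption; which locales are installed varies by system.

```python
import locale

# Try a UTF-8 English locale; fall back to the default "C" locale
# if it isn't installed (availability is system-dependent).
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
except locale.Error:
    pass

words = ["banana", "à", "b", "Apple"]

# strxfrm turns each string into a key; comparing the keys
# byte by byte implements the locale's collation order.
print(sorted(words, key=locale.strxfrm))
```

ICU's collator API can produce the binary sort keys described in TR10 directly, with finer control over case and accent handling.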


The obvious solution to this is to pre-compile a list of all possible Unicode strings up to the required length, sort them, and create a lookup table using their indices. I wonder for what kind of project this would be useful.


If that works for you for anything but very small strings, I want to buy your computer.


Even for 3-codepoint strings it already wouldn't fit in any computer that exists. And that is not sufficient to encode all 1-glyph strings.


It says it sorts one byte at a time. I think this would break for anything that isn't UTF-8.


Seems like it should work for arbitrary byte strings (any charset, any encoding), but obviously the performance characteristics will differ because of the non-uniform distribution. But that happens even in ASCII.


Yes, you’ll get something sorted based on the bytes in the string but it won’t be lexicographically correct - for example, à will be sorted after b.
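A quick illustration in Python, which compares strings by code point (matching UTF-8 byte order):

```python
words = ["b", "à"]

# 'à' is U+00E0 (0xC3 0xA0 in UTF-8), which compares greater than
# 'b' (U+0062), so a byte/code-point sort puts 'à' after 'b' even
# though a dictionary would list it before.
print(sorted(words))                              # → ['b', 'à']
print(sorted(w.encode("utf-8") for w in words))   # same order as raw bytes
```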


This would even break UTF-8, since multi-byte characters are a thing!


How would that break anything? The strings aren't being split.


Lexicographic ordering of UTF-8 byte sequences matches the lexicographic order of the underlying sequences of Unicode code points. So you can sort UTF-8 strings as byte strings. Not that sorting by code point has much meaning, but you can apply the Unicode collation algorithm first.
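A quick sanity check of that property (a sketch; Python compares `str` by code point and `bytes` lexicographically):

```python
import random

def random_codepoint() -> int:
    """Any Unicode scalar value; skip surrogates, which UTF-8 can't encode."""
    while True:
        cp = random.randint(0, 0x10FFFF)
        if not 0xD800 <= cp <= 0xDFFF:
            return cp

def random_string(n: int = 5) -> str:
    return "".join(chr(random_codepoint()) for _ in range(n))

# For many random pairs, code-point order and UTF-8 byte order agree.
for _ in range(10_000):
    a, b = random_string(), random_string()
    assert (a < b) == (a.encode("utf-8") < b.encode("utf-8"))
print("code-point order and UTF-8 byte order agree")
```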


  % printf "%s\n" 𝐴 𝐵 | sort
  𝐴
  𝐵
  % printf "%s" 𝐴 𝐵 | xxd -b -c4
  00000000: 11110000 10011101 10010000 10110100  ....
  00000004: 11110000 10011101 10010000 10110101  ....
  % printf "%s" 𝐴 𝐵 | xxd -c1 -ps | sort | xxd -r -ps | xxd -b -c4
  00000000: 10010000 10010000 10011101 10011101  ....
  00000004: 10110100 10110101 11110000 11110000  ....
  % printf "%s" 𝐴 𝐵 | xxd -c1 -ps | sort | xxd -r -ps
  ????????


TIL, thanks. That makes sense given what the length prefixes look like, but I never thought about it. I wonder whether this was by chance or whether the UTF-8 creators thought about it.


Indeed, and is à less than b? Not in Unicode!


UTF-8 is backward compatible with ASCII, so anything that is just a standard ASCII character (like every character in this comment) is just a single byte.
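That's easy to check: for text in the ASCII range, the UTF-8 encoding is byte-for-byte identical to the ASCII encoding, one byte per character:

```python
s = "Hello, plain ASCII!"

# UTF-8 and ASCII encodings coincide on the ASCII range,
# and each character occupies exactly one byte.
assert s.encode("utf-8") == s.encode("ascii")
assert len(s.encode("utf-8")) == len(s)
print("ASCII text is unchanged under UTF-8")
```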



