
Presumably it would be fine, since Unicode input is still a string of bytes at the end of the day.


Unicode collation is very complex. If you wanted to sort according to that, you would need to devise an order-preserving encoding from Unicode strings to bitstrings, such that lexicographic ordering in the bitstring domain agrees with Unicode collation in the Unicode domain. I'm not saying it's impossible, but it's not at all obvious how to do it. On the other hand, if all you want is some total order for acceleration structures, then you can indeed just use an arbitrary encoding like UTF-8 and do the obvious thing.


Isn't that exactly what the Unicode collation algorithm is?

https://unicode.org/reports/tr10

tl;dr from its Wikipedia page:

> The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc.


Ah! You are right.


You can just use the ICU library to do that. It's a very common thing to do when working with Unicode collation.
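As a sketch of the same idea without pulling in ICU: Python's standard library exposes the C library's collation via `locale.strxfrm`, which maps a string to a sort key whose ordinary comparison follows the locale's collation rules. The locale name below is an assumption; which locales are installed varies by system.

```python
import locale

# Try a UTF-8 English locale; fall back to the default "C" locale
# if it isn't installed (availability is system-dependent).
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
except locale.Error:
    pass

words = ["banana", "à", "b", "Apple"]

# strxfrm turns each string into a key; comparing the keys
# byte by byte implements the locale's collation order.
print(sorted(words, key=locale.strxfrm))
```

ICU's collator API can produce the binary sort keys described in TR10 directly, with finer control over case and accent handling.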


The obvious solution to this is to pre-compile a list of all possible Unicode strings up to the required length, sort them, and create a lookup table using their indices. I wonder for what kind of project this would be useful.


If that works for you for anything but very small strings, I want to buy your computer.


Even for 3-codepoint strings it already wouldn't fit in any computer that exists. And that is not sufficient to encode all 1-glyph strings.


It says it sorts one byte at a time. I think this would break for anything that isn't UTF-8.


Seems like it should work for arbitrary byte strings (any charset, any encoding), but obviously the performance characteristics will differ because of the non-uniform distribution. But that happens even in ASCII.


Yes, you’ll get something sorted based on the bytes in the string but it won’t be lexicographically correct - for example, à will be sorted after b.
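A quick illustration in Python, which compares strings by code point (matching UTF-8 byte order):

```python
words = ["b", "à"]

# 'à' is U+00E0 (0xC3 0xA0 in UTF-8), which compares greater than
# 'b' (U+0062), so a byte/code-point sort puts 'à' after 'b' even
# though a dictionary would list it before.
print(sorted(words))                              # → ['b', 'à']
print(sorted(w.encode("utf-8") for w in words))   # same order as raw bytes
```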


This would even break UTF-8, since multi-byte characters are a thing!


How would that break anything? The strings aren't being split.


Lexicographic ordering of UTF-8 byte sequences matches the lexicographic order of the underlying sequences of Unicode code points. So you can sort UTF-8 strings as byte strings. Not that sorting by code point has much meaning, but you can apply the Unicode collation algorithm first.
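A quick sanity check of that property (a sketch; Python compares `str` by code point and `bytes` lexicographically):

```python
import random

def random_codepoint() -> int:
    """Any Unicode scalar value; skip surrogates, which UTF-8 can't encode."""
    while True:
        cp = random.randint(0, 0x10FFFF)
        if not 0xD800 <= cp <= 0xDFFF:
            return cp

def random_string(n: int = 5) -> str:
    return "".join(chr(random_codepoint()) for _ in range(n))

# For many random pairs, code-point order and UTF-8 byte order agree.
for _ in range(10_000):
    a, b = random_string(), random_string()
    assert (a < b) == (a.encode("utf-8") < b.encode("utf-8"))
print("code-point order and UTF-8 byte order agree")
```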


  % printf "%s\n" 𝐴 𝐵 | sort
  𝐴
  𝐵
  % printf "%s" 𝐴 𝐵 | xxd -b -c4
  00000000: 11110000 10011101 10010000 10110100  ....
  00000004: 11110000 10011101 10010000 10110101  ....
  % printf "%s" 𝐴 𝐵 | xxd -c1 -ps | sort | xxd -r -ps | xxd -b -c4
  00000000: 10010000 10010000 10011101 10011101  ....
  00000004: 10110100 10110101 11110000 11110000  ....
  % printf "%s" 𝐴 𝐵 | xxd -c1 -ps | sort | xxd -r -ps
  ????????


TIL, thanks. That makes sense given what the length prefixes look like, but I never thought about it. I wonder whether this was by chance or whether the UTF-8 creators thought about it.


Indeed, and is à less than b? Not in Unicode!


UTF-8 is backward compatible with ASCII, so anything that is just a standard ASCII character (like every character in this comment) is just a single byte.
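That's easy to check: for text in the ASCII range, the UTF-8 encoding is byte-for-byte identical to the ASCII encoding, one byte per character:

```python
s = "Hello, plain ASCII!"

# UTF-8 and ASCII encodings coincide on the ASCII range,
# and each character occupies exactly one byte.
assert s.encode("utf-8") == s.encode("ascii")
assert len(s.encode("utf-8")) == len(s)
print("ASCII text is unchanged under UTF-8")
```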



