
> UTF-8 is well-designed that way in that there's never ambiguity about whether you're looking at the beginning byte of a code point.

To expand: if the most significant bit is 0, the byte is an ASCII codepoint. If the top two bits are '10', it's a continuation byte, and if they're '11', it's the start of a multibyte codepoint. In that last case, the number of leading 1 bits gives the length of the sequence ('110' is 2 bytes, '1110' is 3, '11110' is 4), which makes it easy to skip ahead and count codepoints. [0]
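
For what it's worth, here's a rough Python sketch of those bit tests. The names classify_byte and sequence_length are my own, not from any library, and it does no real validation (overlong lead bytes like 0xC0 are not rejected, for instance).

```python
def classify_byte(byte: int) -> str:
    """Classify a single UTF-8 byte by its high bits."""
    if (byte & 0x80) == 0x00:   # 0xxxxxxx
        return "ascii"
    if (byte & 0xC0) == 0x80:   # 10xxxxxx
        return "continuation"
    return "lead"               # 11xxxxxx


def sequence_length(first: int) -> int:
    """Return how many bytes the codepoint starting with `first` occupies."""
    if (first & 0x80) == 0x00:  # 0xxxxxxx
        return 1
    if (first & 0xE0) == 0xC0:  # 110xxxxx
        return 2
    if (first & 0xF0) == 0xE0:  # 1110xxxx
        return 3
    if (first & 0xF8) == 0xF0:  # 11110xxx
        return 4
    # Continuation bytes and 0xF8..0xFF cannot start a codepoint.
    raise ValueError(f"0x{first:02x} cannot start a UTF-8 sequence")
```

For example, [classify_byte(b) for b in "é".encode("utf-8")] gives ["lead", "continuation"], since "é" is encoded as the two bytes C3 A9.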

So a naive codepoint reversal algorithm would start at the end and move backwards until it sees either an ASCII codepoint or the start of a multibyte one. Upon reaching it, append those 1-4 bytes to a new buffer. Continue until you reach the start of the string.
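
A minimal Python sketch of that backwards walk, assuming the input is already valid UTF-8. The function name reverse_code_points is made up for illustration, and the check for a stray continuation byte at the front is my addition rather than part of the description above.

```python
def reverse_code_points(data: bytes) -> bytes:
    """Reverse the codepoints of a UTF-8 byte string without decoding it."""
    out = bytearray()
    end = len(data)  # one past the last byte of the codepoint being scanned
    for i in range(len(data) - 1, -1, -1):
        # Anything that is not 10xxxxxx starts a codepoint (ASCII or lead byte).
        if (data[i] & 0xC0) != 0x80:
            out += data[i:end]  # append the whole 1-4 byte sequence
            end = i
    if end != 0:
        # Leftover bytes at the front were continuation bytes with no lead byte.
        raise ValueError("input begins with a continuation byte")
    return bytes(out)
```

Something like reverse_code_points("héllo".encode("utf-8")).decode("utf-8") should match "héllo"[::-1]. Being naive, it reverses codepoints rather than what a reader sees as characters, so combining marks end up attached to the wrong neighbor.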

[0]: https://en.wikipedia.org/wiki/UTF-8#Encoding


