Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I didn't explain this well but the transformation is lossless. No data is lost from compressing like this. It has no impact on the concurrency protocol or network protocol; it just impacts how the data is stored locally.

If we need to, we could split the run back out again into individual characters without losing information. And that does happen - we do that if something later gets inserted into the middle of the run of characters. Pausing doesn't help. Even the same user could later type something in the middle of a word they'd typed previously.

This limits what we can run-length encode. The code in diamond only run-length encodes items when they are:

- Typed sequentially in time

- And sequentially in insert location (position 10, 11, 12, 13, ...)

- And either all of the characters have been deleted or none of them have been deleted

Notice the other fields in that example in the post. Each subsequent character has:

- An id field = the previous id + 1

- A parent field = the id of the previous item

- The same value for isDeleted

If any of this stuff didn't match, the run would be split up.

Instead of storing those fields individually, we can store the id and parent of the first item, an isDeleted flag and a length field. Thats all the information we actually need to store. With yjs style entries (which is what diamond-types actually implements), the code is here if you can make heads or tails of it. The poorly named "truncate" method splits an item in two. Spans are only extended if can_append() returns true: https://github.com/josephg/diamond-types/blob/c4d24499b70a23...

With real data from Martin's editing trace, 182 000 inserted characters get compacted down to 12 500 list items. Which is still pretty good - thats 14x fewer entries. And it means performance doesn't stutter when large paste events happen.



I have a related question to this, if you’re storing [“hello”] as one chunk, what happens when you perform an edit to say adding an extra [“e”] after the [“e”]? In the unoptimised structure I know you can just add the new [“e”] as a child of the original [“e”]. So here would you then delete the chunk [“hello”] and split it into two halves like [“he”] and [“llo”]?


Yes exactly. You replace the chunk with “he” and “llo” then insert the extra item in the middle. The result is [“he”, “e”, “llo”]. The code to do all this inline in a b-tree is pretty hairy but it works pretty well!


Ah, I see. I had thought that the consolidation gave a batched update with a single ID, so 'h' + 'ell' + 'o' would have IDs of 1, 2, and 3 respectively. That would have made an editing conflict in the middle of 'ell' impossible.


Ah that makes sense! Concurrent changes are one thing, but concurrent changes are rare. The big problem if we did it that way is that it would become impossible to insert in the middle of "ell", because we wouldn't be able to name any of those internal positions.

To get around that we assign a sequential ID for every inserted character, regardless of how they're typed or stored. Typing "hello" would assign IDs 1-5 even if you paste "hello" from the clipboard.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: