
[atom/xray]: Help wanted: Optimize fragment identifiers #26

Open
chaintip opened this issue Mar 26, 2018 · 2 comments

chaintip commented Mar 26, 2018

Repository: atom/xray
Issue #24: Help wanted: Optimize fragment identifiers

This is an optimization that's in my head, but I haven't performed the measurements necessary to prove that it's really worth the complexity. If you want to work on this, helping with those measurements would be an important first step.

Background

Xray's buffer is a CRDT that stores every piece of text ever inserted in a giant copy-on-write B-tree. Insertions are immutable, and each insertion is initially added to the tree in its whole form. Insertions are subsequently split into fragments whenever new text is inserted within an existing piece of inserted text. As a result, the tree ends up containing a sequence of interleaved fragments, each originating from a different original insertion.

Retaining all of these fragments and their connection back to the original insertion allows us to describe positions in the buffer logically, in terms of an (insertion_id, offset) pair that uniquely describes a position inside the immutable insertions.

To determine where such a logical position lands in the document, we need to determine which fragment of the position's insertion contains the position's offset. To enable this, each insertion is associated with an auxiliary tree via the Buffer::insertions map. These trees contain only fragments from the original insertion, and map offset ranges to FragmentIds. Once you find the fragment id for a particular (insertion_id, offset) pair, you can use it to perform a lookup in the main fragments tree, which contains the full text of the document.

See the presentation I gave at InfoQ 2017 for a clearer explanation.

Dense ids

In the teletype-crdt JS library, we map from an (insertion_id, offset) pair to a fragment in the global document by making each fragment a member of two splay trees simultaneously. To resolve a logical position, you look up the tree containing all the fragments of a single insertion, seek to the fragment containing the given offset, then follow that same fragment's pointers in the main document tree to resolve its location. It's an unusual arrangement, and while the talk may clarify it somewhat, understanding it isn't mandatory here. In Xray, we're storing fragments in a persistent B-tree rather than a splay tree, which means we really can't associate fragments with parent pointers. This means we'll need a new strategy for determining where a particular insertion fragment lives inside the main tree.

In Xray, we instead rely on representing fragment ids as dense, ordered identifiers. "Dense" essentially means that between any two of these identifiers, there should always be room to allocate another identifier. You could think of infinite precision floating point numbers as satisfying this property. Between 1 and 2 there is 1.5, and between 1.5 and 2 there is 1.75, and so on. The inspiration for this approach came from a different text-based CRDT called Logoot.

Currently, we represent a FragmentId as a vector of 16-bit integers. To compare two ids, we simply compare the elements of the two vectors lexicographically. When we want to create a new id ordered between two existing ids, we search for the first index at which the elements of the two vectors leave room between them. Once we find one, we allocate a new integer randomly between those two elements.

The LSEQ strategy, discussed in a follow-up paper, seems like a good source of inspiration for optimizing these ids so as to minimize their length. The longer the ids, the more memory they consume and the longer they take to compare. The linked paper has a really good discussion of the performance profile of allocating these ids under different insertion patterns in a sequence.

One key idea in the paper is to allocate ids closer to the left or right based on a random strategy at each index of the vector. Another idea is to increase the base of each number in the id, so that the key space expands exponentially as an id gets deeper. It's all in the paper.

The next valuable optimization would build on the LSEQ work and avoid heap allocation whenever ids are below a certain length. We can do this with unsafe code that implements a tagged pointer. Basically, if the id is short enough, we should be able to stuff it into 63 bits of a 64-bit word, using the remaining bit as a tag. If the id exceeds that length, we can interpret those bits as a pointer to a heap-allocated vector of u16s. The comparison code can then be polymorphic over the two representations, and hopefully most of the time we wouldn't need the allocation.

It will be an experiment. Tying it to benchmarks of editing performance will be important to ensure the idea warrants the complexity.


Total Bounty: 0 BCH | ~0.00 USD

To claim this bounty, get a pull request merged that includes “@chaintip fixes #24” in its creation comment.

Returned Tips:

  • 0.00714285 BCH | ~2.94 USD from DesWurstes, returned 3 years ago.
  • 0.00029641 BCH | ~0.12 USD from DesWurstes, returned 3 years ago.

chaintip commented Apr 3, 2018

You can't tip a bounties issue... it creates all sorts of confusion :)
