This repository has been archived by the owner on Jul 23, 2019. It is now read-only.

Help wanted: Optimize fragment identifiers #24

Open
nathansobo opened this issue Mar 6, 2018 · 2 comments


@nathansobo

This is an optimization that's in my head, but I haven't performed the measurements necessary to prove that it's really worth the complexity. If you want to work on this, helping with those measurements would be an important first step.

Background

Xray's buffer is a CRDT that stores every piece of text ever inserted in a giant copy-on-write B-tree. Insertions are immutable, and each insertion is initially added to the tree in its whole form. Insertions are subsequently split into fragments whenever new text is inserted within an existing piece of inserted text. As a result, the tree ends up containing a sequence of interleaved fragments originating from different insertions.

Retaining all of these fragments and their connection back to the original insertion allows us to describe positions in the buffer logically, as an (insertion_id, offset) pair that uniquely identifies a position within one of the immutable insertions.

To determine where such a logical position lands in the document, we need to determine which fragment of the position's insertion contains the specified offset. To enable this, each insertion is associated with an auxiliary tree via the Buffer::insertions map. These trees contain only fragments from the original insertion, and they map offset ranges to FragmentIds. Once you find a fragment id for a particular (insertion_id, offset) pair, you can use it to perform a lookup in the main fragments tree, which contains the full text of the document.
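The two-stage lookup might look roughly like this sketch. The types are hypothetical simplifications: the real Xray structures are copy-on-write B-trees, not a HashMap of sorted vectors, and `fragment_id_for` is an invented name.

```rust
use std::collections::HashMap;
use std::ops::Range;

// Simplified stand-ins for Xray's real types.
type InsertionId = usize;
type FragmentId = Vec<u16>; // dense, ordered identifier

struct Buffer {
    // insertion id -> list of (offset range within that insertion, fragment id)
    insertions: HashMap<InsertionId, Vec<(Range<usize>, FragmentId)>>,
}

impl Buffer {
    // Stage 1 of the lookup: resolve a logical (insertion_id, offset) pair
    // to the id of the fragment containing it. That id would then be used
    // to seek into the main fragments tree (not shown).
    fn fragment_id_for(&self, insertion_id: InsertionId, offset: usize) -> Option<&FragmentId> {
        self.insertions
            .get(&insertion_id)?
            .iter()
            .find(|(range, _)| range.contains(&offset))
            .map(|(_, id)| id)
    }
}
```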

See the presentation I gave at InfoQ 2017 for a clearer explanation.

Dense ids

In the teletype-crdt JS library, we map from an (insertion_id, offset) pair to a fragment in the global document by making each fragment a member of two splay trees simultaneously. You look up the tree containing all the fragments of a single insertion, seek to the fragment containing a particular offset, then follow that same fragment's pointers in the main document tree to resolve its location. It's kind of weird, and the talk may clarify it somewhat, though understanding it isn't mandatory. In Xray, we store fragments in a persistent B-tree rather than a splay tree, which means fragments can't hold parent pointers. We therefore need a new strategy for determining where a particular insertion fragment lives inside the main tree.

In Xray, we instead rely on representing fragment ids as dense, ordered identifiers. "Dense" essentially means that between any two of these identifiers, there should always be room to allocate another identifier. You could think of infinite precision floating point numbers as satisfying this property. Between 1 and 2 there is 1.5, and between 1.5 and 2 there is 1.75, and so on. The inspiration for this approach came from a different text-based CRDT called Logoot.

Currently, we represent a FragmentId as a vector of 16-bit integers. To compare two ids, we simply compare their vectors lexicographically. To create a new id ordered between two existing ids, we search for the first index at which the two vectors' elements differ by more than one, leaving room for an integer strictly between them; if no such index exists, we extend the new id past the end of the shorter vector, where the full range becomes available. Once we find room, we allocate a new integer randomly between the two bounds.
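As a concrete sketch of the allocation scheme described above (not Xray's actual implementation): ids are `Vec<u16>` compared lexicographically, and `between` finds the first index with room for a new element, falling back to extending the id when adjacent elements leave no gap. For determinism this sketch picks the midpoint where the real scheme would pick randomly.

```rust
// Allocate an id ordered strictly between `left` and `right`.
// Precondition: left < right lexicographically.
fn between(left: &[u16], right: &[u16]) -> Vec<u16> {
    let mut id = Vec::new();
    let mut bounded = true; // does `right` still constrain deeper indices?
    for i in 0.. {
        let l = left.get(i).copied().unwrap_or(0);
        let r = if bounded {
            right.get(i).copied().unwrap_or(u16::MAX)
        } else {
            u16::MAX // once we've diverged below `right`, it no longer bounds us
        };
        if r.saturating_sub(l) > 1 {
            // Room at this index: allocate an integer strictly between l and r.
            id.push(l + (r - l) / 2); // midpoint here; the real scheme randomizes
            return id;
        }
        // No room yet: copy the left element and descend one level deeper.
        id.push(l);
        if r > l {
            bounded = false;
        }
    }
    unreachable!()
}
```

For example, between `[1]` and `[3]` there is room at index 0, but between `[1]` and `[2]` the id must grow to a second element.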

The LSEQ strategy, discussed in a follow-up paper, seems like a good source of inspiration for minimizing the length of these ids. The longer the ids, the more memory they consume and the longer they take to compare. The linked paper has a really good discussion of how these ids perform under different insertion patterns in a sequence.

One key idea in the paper is to allocate ids closer to the left or right based on a random strategy at each index of the vector. Another idea is to increase the base of each number in the id, so that the key space expands exponentially as an id gets deeper. It's all in the paper.
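A minimal illustration of those two LSEQ ideas, with made-up parameters (the paper's actual constants and random strategy differ): the base doubles at each depth, and allocation stays within a small boundary of whichever edge the depth has committed to.

```rust
const INITIAL_BASE_BITS: u32 = 4; // depth 0 holds values in 0..16
const BOUNDARY: u64 = 4;          // max distance from the chosen edge

// The key space doubles at each depth: 16, 32, 64, ...
fn base_at(depth: u32) -> u64 {
    1u64 << (INITIAL_BASE_BITS + depth)
}

// Pick a value strictly inside (lower, upper), biased toward one edge.
// `prefer_left` models the per-depth random boundary+/boundary- choice.
// Precondition: upper - lower >= 2, so a value fits strictly between.
fn allocate(lower: u64, upper: u64, prefer_left: bool) -> u64 {
    let step = BOUNDARY.min(upper - lower - 1).max(1);
    if prefer_left {
        lower + 1 + (step - 1) / 2 // deterministic stand-in for a random pick
    } else {
        upper - 1 - (step - 1) / 2
    }
}
```

Allocating near an edge keeps ids short when insertions keep arriving from the same direction (e.g. typing left to right), which is the common editing pattern the paper optimizes for.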

The next valuable optimization would build on the LSEQ work and avoid heap allocation whenever ids are below a certain length. We can do this with unsafe code that implements a tagged pointer: if the id is short enough, we stuff it into the upper 63 bits of a 64-bit word, using the low bit as a tag; if it's longer, we interpret those bits as a pointer to a heap-allocated vector of u16s. The comparison code can then be polymorphic over the two representations, and hopefully most of the time we'd avoid the allocation.
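A rough sketch of the tagged-pointer idea, assuming 64-bit pointers whose low bit is always clear for heap allocations (which are at least 2-byte aligned). All names are hypothetical, and real code would need `Drop`, `Clone`, and `Ord` impls; as written, the heap case leaks.

```rust
struct FragmentId(usize);

impl FragmentId {
    const TAG: usize = 1; // low bit set => inline representation

    // Pack ids of up to 3 u16 elements inline: a 2-bit length plus
    // 48 bits of elements fits comfortably in the upper 63 bits.
    fn new(elements: &[u16]) -> FragmentId {
        if elements.len() <= 3 {
            let mut bits = elements.len();
            for (i, &e) in elements.iter().enumerate() {
                bits |= (e as usize) << (16 * i + 2);
            }
            FragmentId((bits << 1) | Self::TAG)
        } else {
            // Too long: fall back to a heap-allocated vector.
            // (Low bit of the pointer is 0, distinguishing it from the tag.)
            FragmentId(Box::into_raw(Box::new(elements.to_vec())) as usize)
        }
    }

    // Decode either representation back into a plain vector. Comparison
    // code could decode like this, or compare the inline form directly.
    fn elements(&self) -> Vec<u16> {
        if self.0 & Self::TAG != 0 {
            let bits = self.0 >> 1;
            let len = bits & 0b11;
            (0..len)
                .map(|i| ((bits >> (16 * i + 2)) & 0xFFFF) as u16)
                .collect()
        } else {
            unsafe { (*(self.0 as *const Vec<u16>)).clone() }
        }
    }
}
```

With LSEQ-style allocation keeping most ids short, most ids would take the inline path and never touch the heap.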

This will be an experiment. Tying it to benchmarks of editing performance will be important for ensuring the idea warrants the complexity.

@DesWurstes

This comment has been minimized.

@chaintip

This comment has been minimized.
