Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-playable Event Graph (REG) storage format specification #8

Open
streamich opened this issue Dec 2, 2023 · 6 comments
Open

Re-playable Event Graph (REG) storage format specification #8

streamich opened this issue Dec 2, 2023 · 6 comments

Comments

@streamich
Copy link
Collaborator

streamich commented Dec 2, 2023

This is a proposal to formalize and document serialization formats for Re-playable Event Graphs (REGs).

What are REGs? I am not an expert on REGs, but essentially it is a way to represents a sequence of all editing events in a format like this:

export interface ConcurrentTrace {
  kind: 'concurrent';
  endContent: 'string';
  numAgents: number;
  txns: ConcurrentTraceTransaction[];
}

export interface ConcurrentTraceTransaction {
  parents: number[];
  numChildren: number;
  agent: number;
  patches: [position: number, remove: number, insert: string][];
}

The key here is that:

  1. The full editing history is stored.
  2. The operations are stored in the local form [position: number, remove: number, insert: string]. (There is no CRDT metadata.)
  3. Full causal graph is stored, i.e. each transaction has a list of parents: [], which is a set of operations that this transaction depends on (knows about at the time of creation).

What I would like to get from this issue:

  1. Firstly, I would like to see if there is any interest for having a shared formal specification for the REG serialization format.
  2. Requirements: what would be a list of requirements for a REG serialization format?
  3. Implementation details: any tricks/gotchas/suggestions of the low-level implementation details?

cc @josephg @ept @dmonad @mweidner037 @gritzko @archagon @zxch3n

@streamich
Copy link
Collaborator Author

streamich commented Dec 2, 2023

Let me try to answer the above from the perspective of the json-joy library.

json-joy has its own native—JSON CRDT Patch—format, but it seems it could also support the REG format. I am still thinking about the best use cases: it can definitely be used for testing, could be used for full history export, maybe for some compact or cold storage purposes.

json-joy already converts the REG-like format concurrent editing traces into the JSON CRDT Patch format, for testing purposes. It can probably convert the other way around too.

With regard to requirements, I think the following are important:

  1. I would love to have some human-readable JSON format, call it verbose format. It could
    be something similar to the existing concurrent trace format.
  2. I would also want to have a binary format, which semantically would be the same as verbose format,
    but it would encode the data in the most efficient binary way using all the RLE tricks.
  3. The json-joy library provides a full JSON CRDT, hence I would want this format to:
    1. Support LWW-register operations and non-text operations (all JSON data types and binary data).
    2. Or, alternatively, if this format is kept succinct, just for text. Then maybe some
      consideration should be given to how this format could be integrated into a larger JSON or rich-text format.
  4. Since this format could be used by different CRDT implementations, it should not leak CRDT specific
    details, such as the sorted list type (RGA, Fugue, etc.).

I'll skip the implementation details for now, as I am a REG newbie, but I would love to hear from the REG experts what they think about the above.

@ept
Copy link

ept commented Dec 2, 2023

You describe parents as "a set of operations that this transaction depends on" — presumably you mean a set of transactions that it depends on? We also need to specify how these transactions are identified; probably easiest to just give each an ID that can be an arbitrary string.

The numChildren seems redundant, since it could be computed by looking for all the later transactions that have the current transaction as a parent. If there's a bug in the trace collector that sets numChildren to the wrong value, what would happen? Maybe it would be easier to not store numChildren in the trace.

In the edge case where two concurrent transactions insert at the same position, the trace might need to contain an indicator to show which one of the two should come first in the merged result (otherwise it will not be possible to reliably reproduce endContent). It could be done by ordering by agent, for example, but whatever the mechanism is, we should probably be explicit about it. I believe the current set of traces don't have any instances of concurrent insertion at the same position, but we may want to collect such traces in the future, since they open up a whole bunch of interesting challenges.

Are you thinking this REG data format will be only for the purpose of encoding test datasets, benchmarks, etc., or will it also be used for storing end-user documents in production software? If it's the former, I don't think we need a binary encoding, since human-readable verboseness is preferable for this use case.

If we want to consider extending the format beyond plain text, there are lots of potential opportunities, but also a lot of complexity. For example, there are several different ways of representing rich text, and it's not immediately obvious to me whether a single representation is sufficient to encode edits for different rich text collaboration algorithms. If you want to represent spreadsheets or vector graphics it gets more complex still. My suggestion would be to focus the REG format on plain text initially, and to discuss extensions to other data types separately once we have some experience with plain text.

@josephg
Copy link
Owner

josephg commented Dec 3, 2023

Sounds great - I'd love this too!

@streamich:

  1. Firstly, I would like to see if there is any interest for having a shared formal specification for the REG serialization format.

Absolutely!

  1. Requirements: what would be a list of requirements for a REG serialization format?
  2. Implementation details: any tricks/gotchas/suggestions of the low-level implementation details?

I think what you've suggested is a good start.

We'd also need to specify the IDs of each patch - which is a bit trickier because:

  • Different implementations might use different ID formats (eg hashes of the subgraph, (agent, seq) pairs, UUIDs?). An (agent,seq) pair is quite common, but even then diamond types consumes 1 sequence number per keystroke in a transaction, and I think automerge consumes 1 sequence number for each entire transaction.
  • The IDs need to be ordered in a consistent way to be able to process concurrent changes.

It might be possible to use an opaque string and encode your algorithms' ID format into the string in such a way that its always ordered as expected. For example, if your system orders concurrent items first by agent ID then sequence number, the opaque ID could be "AAAAASSSSSSS", storing the agent bytes then sequence number bytes. Doing this could be done in such a way that it preserves lexical ordering. The problem is that it'd make it harder to load an arbitrary log into an arbitrary system.

Another approach would be to just have a top level field specifying the type of the IDs for the trace. But again, that'd quite dramatically limit compatibility.

@ept:

You describe parents as "a set of operations that this transaction depends on" — presumably you mean a set of transactions that it depends on? We also need to specify how these transactions are identified; probably easiest to just give each an ID that can be an arbitrary string.

The format above looks like its based on the concurrent editing trace format I've defined in this repository. In that case, yes - the JSON file itself contains a big list of transactions. The parents field lists the indexes of previous transactions. This is simple, straightforward and I think it should work for any format.

In the edge case where two concurrent transactions insert at the same position, the trace might need to contain an indicator to show which one of the two should come first in the merged result (otherwise it will not be possible to reliably reproduce endContent). It could be done by ordering by agent, for example, but whatever the mechanism is, we should probably be explicit about it. I believe the current set of traces don't have any instances of concurrent insertion at the same position, but we may want to collect such traces in the future, since they open up a whole bunch of interesting challenges.

I absolutely agree we want these sort of traces. I have a few traces like this that I use for benchmarking and conformance testing. And a script that lets us reconstruct an editing trace from any file in any git repository.

I think it'd be pretty tricky to use an indicator like that in the transaction itself. It could operate as a checksum, but what if the expected ordering doesn't match your local algorithm? Perhaps a better approach would be to just name the algorithm used at the top level when replaying the graph of events - eg "textOrdering": "fuguemax"?

Re: rich text, at the risk of eating my words, I think the format is reasonably consistent even if different CRDTs implement rich text somewhat differently. The format I'd expect is a combination of text editing events (insert / delete) and annotations, annotating ranges like addAnnotation(Bold, from: 100, to: 110). At least thats what we did on Wave. I'm not sure if thats compatible with how peritext works. The tricky parts would be standardizing the rules for extending ranges - eg, if an item is subsequently inserted at position 110, is it bold? And what are the rules when merging conflicting annotations?

But, baby steps. Lets start with a plain text editing format and go from there. I think JSON would be reasonably easy too - though I'm hesitant to support the full range of JSON-patch operations. But we could start with a set operation and object-delete operation.

@ept
Copy link

ept commented Dec 4, 2023

The parents field lists the indexes of previous transactions. This is simple, straightforward and I think it should work for any format.

Ah yes, that's good.

Perhaps a better approach would be to just name the algorithm used at the top level when replaying the graph of events - eg "textOrdering": "fuguemax"?

That works. The merging algorithm can then take into account other information on the transactions when ordering the insertions, e.g. using the agent property as tie-breaker.

The format I'd expect is a combination of text editing events (insert / delete) and annotations, annotating ranges like addAnnotation(Bold, from: 100, to: 110). At least thats what we did on Wave. I'm not sure if thats compatible with how peritext works.

Yes, that would be fine for Peritext.

The tricky parts would be standardizing the rules for extending ranges - eg, if an item is subsequently inserted at position 110, is it bold? And what are the rules when merging conflicting annotations?

Yeah, insertion at the boundary of formatting spans is tricky. Many rich text editors I've tried treat different types of annotation differently – for example, adding chars at the end of a bold span makes them bold too, but adding chars at the end of a link places them outside of the link. If there's the end of a link and the end of a bold in the same place, the behaviour varies depending on the app.

But anyway, I agree that let's start with plain text, and we can cross the rich text bridge later when we come to it.

@streamich
Copy link
Collaborator Author

streamich commented Dec 4, 2023

Are you thinking this REG data format will be only for the purpose of encoding test datasets, benchmarks, etc., or will it also be used for storing end-user documents in production software?

The idea for standardizing is to hopefully use it outside of just testing. One place I have in mind is exporting full history of json-joy document editing (not for testing, but for end users) traces. I was thinking those traces could be made really compact if converted to binary REG format with RLE of metadata.

@mweidner037
Copy link

Prior work on an update format for collaborative rich-text documents: https://github.com/matrix-org/collaborative-documents

Instead of directly capturing the causal graph, that proposal leaves space for algorithm-specific metadata; I found it hard to picture how that would work with a concrete algorithm. But maybe the document format could serve as inspiration here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants