From bb3ca7c70928eab94248103fed7d4557b030dad1 Mon Sep 17 00:00:00 2001 From: Doug Hoyte Date: Wed, 8 Mar 2023 11:17:57 -0500 Subject: [PATCH] wip --- docs/xor.md | 55 +++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 49 insertions(+), 6 deletions(-) diff --git a/docs/xor.md b/docs/xor.md index 51fa97a..d61dd49 100644 --- a/docs/xor.md +++ b/docs/xor.md @@ -1,6 +1,6 @@ # XOR Syncing -This document describes a method for syncing nostr events over a nostr protocol extension. It is loosely based on the method described [here](https://github.com/AljoschaMeyer/set-reconciliation). +This document describes a method for syncing nostr events over a nostr protocol extension. It is roughly based on the method described [here](https://github.com/AljoschaMeyer/set-reconciliation). If both sides of the sync have common events, then this protocol will use less bandwidth than transferring the full set of events (or even just their IDs). @@ -76,7 +76,7 @@ The have and need ID fields are event IDs truncated to `idSize`, concatenated to ### Varint -Varints are a format for storing unsigned integers in a small number of bytes, commonly called BER (Binary Encoded Representation). They are stored as base 128 digits, most significant digit first, with as few digits as possible. Bit eight (the high bit) is set on each byte except the last. +Varints (variable-sized integers) are a format for storing unsigned integers in a small number of bytes, commonly called BER (Binary Encoded Representation). They are stored as base 128 digits, most significant digit first, with as few digits as possible. Bit eight (the high bit) is set on each byte except the last. Varint := * @@ -84,11 +84,13 @@ Varints are a format for storing unsigned integers in a small number of bytes, c The protocol frequently transmits the bounds of a range of events. Each range is specified by a (inclusive) lower bound, followed by an (exclusive) upper bound. -Each bound is specified by a timestamp offset (see below), and a disambiguating prefix of an event (in case multiple events have the same timestamp): +Each bound is specified by a timestamp (see below), and a disambiguating prefix of an event (in case multiple events have the same timestamp): - Bound := * + Bound := * -Since the upper bound's timestamp is always `>=` to the lower bound's, ranges in a message are always encoded in ascending order, and ranges never overlap, each timestamp is encoded as an offset from the previous bound. The initial offset is 0. +The timestamp is encoded specially. The value `0` is reserved for the maximum allowable timestamp (such that all valid events preceed it). All other values are `1 + offset`, where offset is the difference between this timestamp and the previously encoded timestamp. The initial offset starts at 0 and resets at the beginning of each message. + +Offsets are always positive numbers since the upper bound's timestamp is always `>=` to the lower bound's, ranges in a message are always encoded in ascending order, and ranges never overlap. ### Range @@ -96,7 +98,7 @@ A range consists of a pair of bounds, a mode, and a payload (determined by the m Range := -If the mode is 0, then the payload is the byte-wise exclusive OR of all the IDs in this range, truncated to `idSize`: +If the mode is 0, then the payload is the bitwise eXclusive OR of all the IDs in this range, truncated to `idSize`: Xor := {idSize} @@ -109,3 +111,44 @@ If the mode is `>= 8`, then this is a list of IDs in the range. The number of en A reconcilliation message is just an ordered list of ranges, or the empty string if no ranges need reconciling: Message := * + + +## Algorithm + +After the initial setup, the algorithm is the same for both client and relay (who alternate between sender and receiver). + +The message receiver should loop over all the ranges of its incoming message and build its next outgoing message at the same time. + +Each range in a message encodes the information for the receiving side to determine if their portion of the event set matches the sender's portion. For every incoming range, the receiver will add 0 more ranges to its next outgoing message, depending on whether more syncing needs to occur for this range or not. + +### Base Case + +If the mode of a range is `>= 8`, then the sender has decided to send all the IDs it has within this range. Typically it will choose a cut-off, and use this mode if there are fewer than `N` events to send. + +For each received ID, the receiver should determine if it has this event. If not, it should add it to its next `needIDs` message. Additionally, each event that the receiver has that does not appear in the received IDs list should be added to the next `haveIDs` message. Since this range has been fully reconciled, no additional ranges should be pushed onto the next outgoing message. + +### Splitting + +If the mode is `0`, then the sender is sending a bitwise eXclusive OR of all the IDs it has within this range, indicating that it has too many events to send outright, and would like to determine if the two sides have some common events. + +Upon receipt of an XOR range, the receiver should compute and compare the XOR of its own events within this range. If they are identical, then the receiver should simply ignore the range. Otherwise, it should respond with additional ranges. If its own set of events within this range is small, it can reply with the base case mode above. + +However, if a receiver's set of events within the range is large, it should instead "split" the range into multiple sub-ranges, each of which gets appended onto the next outgoing message. These sub-ranges must totally cover the original range, meaning that the start of the first sub-range must match the original range's start (*not* the first matching element in its set). Likewise for the end of the range. Also, each sub-range's lower bound must be equal to the previous range's upper bound. The overhead of this duplication is mostly erased by the timestamp offset encoding described above. + +How to split the range is implementation-dependent. The simplest way is to divide the elements that fall within the range into N equal-sized buckets, and emit an XOR-mode sub-range for each bucket. However, an implementation could choose different grouping criteria. For example, events with similar timestamps could be grouped into a single bucket. If the implementation believes the other side is less likely to have recent events, it could make the most recent bucket a base-case-mode. + +Note that if alternate grouping strategies are used, an implementation should never reply to an XOR-range with another single XOR-range, otherwise the protocol may never terminate. + +### Termination + +If a receiver has looped over all incoming ranges and has added no new ranges to the next outgoing message, the sets have been fully reconciled. It still must send a final message with the value of an empty string (`""`) to let the other side know it is finished, as well as to transmit any final have/need IDs. + + + + + +## Extensions + +Don't need to push all your ranges right away if message is getting too big + +Don't need to send need or have IDs if you're not interested in syncing in that direction