remove prefix matching for ids and authors filter fields

- this also fixes https://github.com/hoytech/strfry/issues/109
Doug Hoyte
2024-08-30 15:56:41 -04:00
parent 80249f2078
commit 21d4e07b87
9 changed files with 42 additions and 70 deletions
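In effect, entries in the `ids` and `authors` filter fields now only match as complete 32-byte (64 hex character) values: a filter like `{"authors":["ab"]}` no longer matches every pubkey beginning with those bytes.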


@@ -239,7 +239,7 @@ In nostr, each `REQ` message from a subscriber can contain multiple filters. We
 A `FilterGroup` is a vector of `Filter` objects. When the Ingester receives a `REQ`, the JSON filter items are compiled into `Filter`s and the original JSON is discarded. Each filter item's specified fields are compiled into sorted lookup tables called filter sets.
-In order to determine if an event matches against a `Filter`, first the `since` and `until` fields are checked. Then, each field of the event for which a filter item was specified is looked up in the corresponding lookup table. Specifically, the upper-bound index is determined using a binary search (for example `std::upper_bound`). This is the first element greater than the event's item. Then the preceding table item is checked for either a prefix (`ids`/`authors`) or exact (everything else) match.
+In order to determine if an event matches against a `Filter`, first the `since` and `until` fields are checked. Then, each field of the event for which a filter item was specified is looked up in the corresponding lookup table. Specifically, the upper-bound index is determined using a binary search (for example `std::upper_bound`). This is the first element greater than the event's item. Then the preceding table item is checked for a match.
 Since testing `Filter`s against events is performed so frequently, it is a performance-critical operation and some optimisations have been applied. For example, each filter item in the lookup table is represented by a 4 byte data structure, one of which is the first byte of the field and the rest are offset/size lookups into a single memory allocation containing the remaining bytes. Under typical scenarios, this will greatly reduce the amount of memory that needs to be loaded to process a filter. Filters with 16 or fewer items can often be rejected with the load of a single cache line. Because filters aren't scanned linearly, the number of items in a filter (ie amount of pubkeys) doesn't have a significant impact on processing resources.
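To make the table lookup concrete, here is a minimal sketch of the membership test described above, assuming a plain sorted `std::vector<std::string>`; the `FilterSet`/`doesMatch` names are illustrative and the compact 4-byte item encoding is omitted:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative stand-in for a compiled filter set: items are sorted and
// deduplicated once at compile time, then probed with a binary search.
struct FilterSet {
    std::vector<std::string> items;

    explicit FilterSet(std::vector<std::string> arr) {
        std::sort(arr.begin(), arr.end());
        arr.erase(std::unique(arr.begin(), arr.end()), arr.end());
        items = std::move(arr);
    }

    bool doesMatch(const std::string &candidate) const {
        // First element strictly greater than the candidate...
        auto it = std::upper_bound(items.begin(), items.end(), candidate);
        // ...so an exact match, if present, is the element just before it.
        return it != items.begin() && *std::prev(it) == candidate;
    }
};
```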
@@ -251,7 +251,7 @@ Because events are stored in the same flatbuffers format in memory and "in the d
 When a user's `REQ` is being processed for the initial "old" data, each `Filter` in its `FilterGroup` is analysed and the best index is determined according to a simple heuristic. For each filter item in the `Filter`, the index is scanned backwards starting at the upper-bound of that filter item. Because all indices are composite keyed with `created_at`, the scanner also jumps to the `until` time when possible. Each event is compared against the compiled `Filter` and, if it matches, sent to the Websocket thread to be sent to the subscriber. The scan completes when one of the following is true:
-* The key no longer matches the filter item (exact or prefix, depending on field)
+* The key no longer matches the filter item
 * The event's `created_at` is before the `since` filter field
 * The filter's `limit` field of delivered events has been reached
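As a rough sketch of how such a backwards scan can be seeded (hypothetical `ScanCursor`/`makeIdCursor` names, not strfry's actual `DBScan` types), assuming index keys are the raw 32-byte value followed by an 8-byte `created_at`:

```cpp
#include <string>
#include <string_view>

// Appending eight 0xFF bytes to the 32-byte id sorts at (or past) the newest
// possible composite key for that id; the scan then walks backwards from there.
struct ScanCursor {
    std::string startKey; // seed position for the backwards scan
};

inline ScanCursor makeIdCursor(const std::string &id) { // id: 32 raw bytes
    return ScanCursor{ id + std::string(8, '\xFF') };
}

// The backwards scan continues only while the key still belongs to this id;
// the caller separately enforces the `since` and `limit` stop conditions.
inline bool keyStillMatches(std::string_view key, const std::string &id) {
    return key.starts_with(id);
}
```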
@@ -280,7 +280,7 @@ After the subscription is all caught up to the current transaction's snapshot, t
 Whenever a new event is processed, all of its fields are looked up in the various monitor sets, which provides a list of filters that should be fully processed to check for a match. If an event has no fields in common with a filter, a match will not be attempted for this filter.
-For example, for each prefix in the `authors` field in a filter, an entry is added to the `allAuthors` monitor set. When a new event is subsequently detected, the `pubkey` is looked up in `allAuthors` according to a binary search. Then the data-structure is scanned until it stops seeing records that are prefix matches against the `pubkey`. All of these matching records are pointers to corresponding `Filter`s of the REQs that have subscribed to this author. The filters must then be processed to determine if the event satisfies the other parameters of each filter (`since`/`until`/etc).
+For example, for each item in the `authors` field in a filter, an entry is added to the `allAuthors` monitor set. When a new event is subsequently detected, the `pubkey` is looked up in `allAuthors` according to a binary search. Then the data-structure is scanned until it stops seeing records that match the `pubkey`. All of these matching records are pointers to corresponding `Filter`s of the REQs that have subscribed to this author. The filters must then be processed to determine if the event satisfies the other parameters of each filter (`since`/`until`/etc).
 After comparing the event against each filter detected via the inverted index, that filter is marked as "up-to-date" with this event's ID, whether the filter matched or not. This prevents needlessly re-comparing this filter against the same event in the future (in case one of the *other* index lookups matches it). If a filter *does* match, then the entire filter group is marked as up-to-date. This prevents sending the same event multiple times in case multiple filters in a filter group match, and also prevents needlessly comparing other filters in the group against an event that has already been sent.
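A minimal sketch of the monitor-set lookup just described, with `std::map` standing in for strfry's `btree_map` and illustrative names; once prefix matching is removed, the scan reduces to a single exact lookup:

```cpp
#include <map>
#include <string>
#include <vector>

struct Filter; // a compiled filter, as described earlier

// Maps an author pubkey (32 raw bytes) to the filters subscribed to it.
using AuthorMonitors = std::map<std::string, std::vector<Filter*>>;

// Returns the filters that *might* match a new event's pubkey. Each returned
// filter must still be fully checked (since/until/kinds/etc.) by the caller.
inline const std::vector<Filter*> *lookupAuthors(const AuthorMonitors &allAuthors,
                                                 const std::string &pubkey) {
    auto it = allAuthors.find(pubkey); // exact match only; no prefix scan needed
    return it == allAuthors.end() ? nullptr : &it->second;
}
```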

TODO

@@ -2,8 +2,6 @@
 latest negentropy
 make all std::function's const refs
 split source meta-data off Event table
-remove prefix matching
-fix related github bug, add tests


@@ -100,17 +100,6 @@ struct ActiveMonitors : NonCopyable {
             }
         };

-        auto processMonitorsPrefix = [&](btree_map<std::string, MonitorSet> &m, const std::string &key, std::function<bool(const std::string&)> matches){
-            auto it = m.lower_bound(key.substr(0, 1));
-
-            if (it == m.end()) return;
-
-            while (it != m.end() && it->first[0] == key[0]) {
-                if (matches(it->first)) processMonitorSet(it->second);
-                it = std::next(it);
-            }
-        };
-
         auto processMonitorsExact = [&]<typename T>(btree_map<T, MonitorSet> &m, const T &key, std::function<bool(const T &)> matches){
             auto it = m.upper_bound(key);
@@ -128,15 +117,15 @@ struct ActiveMonitors : NonCopyable {
         {
             auto id = std::string(packed.id());

-            processMonitorsPrefix(allIds, id, static_cast<std::function<bool(const std::string&)>>([&](const std::string &val){
-                return id.starts_with(val);
+            processMonitorsExact(allIds, id, static_cast<std::function<bool(const std::string&)>>([&](const std::string &val){
+                return id == val;
             }));
         }

         {
             auto pubkey = std::string(packed.pubkey());

-            processMonitorsPrefix(allAuthors, pubkey, static_cast<std::function<bool(const std::string&)>>([&](const std::string &val){
-                return pubkey.starts_with(val);
+            processMonitorsExact(allAuthors, pubkey, static_cast<std::function<bool(const std::string&)>>([&](const std::string &val){
+                return pubkey == val;
             }));
         }


@@ -24,7 +24,6 @@ struct DBScan : NonCopyable {
     enum class KeyMatchResult {
         Yes,
        No,
-        NoButContinue,
    };

    struct ScanCursor {
@@ -111,13 +110,13 @@ struct DBScan : NonCopyable {
            cursors.reserve(f.ids->size());

            for (uint64_t i = 0; i < f.ids->size(); i++) {
-                std::string prefix = f.ids->at(i);
+                std::string search = f.ids->at(i);
                cursors.emplace_back(
-                    padBytes(prefix, 32 + 8, '\xFF'),
+                    search + std::string(8, '\xFF'),
                    MAX_U64,
-                    [prefix](std::string_view k){
-                        return k.starts_with(prefix) ? KeyMatchResult::Yes : KeyMatchResult::No;
+                    [search](std::string_view k){
+                        return k.starts_with(search) ? KeyMatchResult::Yes : KeyMatchResult::No;
                    }
                );
            }
@@ -161,22 +160,15 @@ struct DBScan : NonCopyable {
                for (uint64_t j = 0; j < f.kinds->size(); j++) {
                    uint64_t kind = f.kinds->at(j);
-                    std::string prefix = f.authors->at(i);
-                    if (prefix.size() == 32) prefix += lmdb::to_sv<uint64_t>(kind);
+                    std::string search = f.authors->at(i);
+                    search += lmdb::to_sv<uint64_t>(kind);
                    cursors.emplace_back(
-                        padBytes(prefix, 32 + 8 + 8, '\xFF'),
+                        search + std::string(8, '\xFF'),
                        MAX_U64,
-                        [prefix, kind](std::string_view k){
-                            if (!k.starts_with(prefix)) return KeyMatchResult::No;
-                            if (prefix.size() == 32 + 8) return KeyMatchResult::Yes;
-
-                            ParsedKey_StringUint64Uint64 parsedKey(k);
-                            if (parsedKey.n1 == kind) return KeyMatchResult::Yes;
-
-                            // With a prefix pubkey, continue scanning (pubkey,kind) backwards because with this index
-                            // we don't know the next pubkey to jump back to
-                            return KeyMatchResult::NoButContinue;
+                        [search, kind](std::string_view k){
+                            if (!k.starts_with(search)) return KeyMatchResult::No;
+                            return KeyMatchResult::Yes;
                        }
                    );
                }
@@ -189,13 +181,13 @@ struct DBScan : NonCopyable {
            cursors.reserve(f.authors->size());

            for (uint64_t i = 0; i < f.authors->size(); i++) {
-                std::string prefix = f.authors->at(i);
+                std::string search = f.authors->at(i);
                cursors.emplace_back(
-                    padBytes(prefix, 32 + 8, '\xFF'),
+                    search + std::string(8, '\xFF'),
                    MAX_U64,
-                    [prefix](std::string_view k){
-                        return k.starts_with(prefix) ? KeyMatchResult::Yes : KeyMatchResult::No;
+                    [search](std::string_view k){
+                        return k.starts_with(search) ? KeyMatchResult::Yes : KeyMatchResult::No;
                    }
                );
            }


@@ -29,8 +29,9 @@ struct FilterSetBytes {
        std::sort(arr.begin(), arr.end());

-        for (const auto &item : arr) {
-            if (items.size() > 0 && item.starts_with(at(items.size() - 1))) continue; // remove duplicates and redundant prefixes
+        for (size_t i = 0; i < arr.size(); i++) {
+            const auto &item = arr[i];
+            if (i > 0 && item == arr[i - 1]) continue; // remove duplicates
            items.emplace_back(Item{ (uint16_t)buf.size(), (uint8_t)item.size(), (uint8_t)item[0] });
            buf += item;
        }
@@ -72,7 +73,7 @@ struct FilterSetBytes {
        }

        if (first == 0) return false;
-        if (candidate.starts_with(std::string_view(buf.data() + items[first - 1].offset, items[first - 1].size))) return true;
+        if (candidate == std::string_view(buf.data() + items[first - 1].offset, items[first - 1].size)) return true;
        return false;
    }
@@ -123,14 +124,14 @@ struct NostrFilter {
        for (const auto &[k, v] : filterObj.get_object()) {
            if (v.is_array() && v.get_array().size() == 0) {
                neverMatch = true;
-                break;
+                continue;
            }

            if (k == "ids") {
-                ids.emplace(v, true, 1, 32);
+                ids.emplace(v, true, 32, 32);
                numMajorFields++;
            } else if (k == "authors") {
-                authors.emplace(v, true, 1, 32);
+                authors.emplace(v, true, 32, 32);
                numMajorFields++;
            } else if (k == "kinds") {
                kinds.emplace(v);
@@ -159,7 +160,7 @@ struct NostrFilter {
            }
        }

-        if (tags.size() > 2) throw herr("too many tags in filter"); // O(N^2) in matching, just prohibit it
+        if (tags.size() > 3) throw herr("too many tags in filter"); // O(N^2) in matching, just prohibit it

        if (limit > maxFilterLimit) limit = maxFilterLimit;


@@ -15,5 +15,4 @@ std::string renderPercent(double p);
 uint64_t parseUint64(const std::string &s);
 std::string parseIP(const std::string &ip);
 uint64_t getDBVersion(lmdb::txn &txn);
-std::string padBytes(std::string_view str, size_t n, char padChar);
 void exitOnSigPipe();


@@ -110,11 +110,6 @@ uint64_t getDBVersion(lmdb::txn &txn) {
 }

-std::string padBytes(std::string_view str, size_t n, char padChar) {
-    if (str.size() > n) throw herr("unable to pad, string longer than expected");
-    return std::string(str) + std::string(n - str.size(), padChar);
-}
-
 void exitOnSigPipe() {
    struct sigaction act;
    memset(&act, 0, sizeof act);


@@ -46,7 +46,7 @@ sub doesMatchSingle {
    if ($filter->{ids}) {
        my $found;
        foreach my $id (@{ $filter->{ids} }) {
-            if (startsWith($ev->{id}, $id)) {
+            if ($ev->{id} eq $id) {
                $found = 1;
                last;
            }
@@ -57,7 +57,7 @@ sub doesMatchSingle {
    if ($filter->{authors}) {
        my $found;
        foreach my $author (@{ $filter->{authors} }) {
-            if (startsWith($ev->{pubkey}, $author)) {
+            if ($ev->{pubkey} eq $author) {
                $found = 1;
                last;
            }
@@ -117,7 +117,3 @@ sub doesMatchSingle {
    return 1;
 }

-sub startsWith {
-    return rindex($_[0], $_[1], 0) == 0;
-}


@@ -89,7 +89,9 @@ c1e5e04d92d9bd20701bff4cbdac1cdc317d405035883b7adcf9a6a5308d0f54
 my $topics = [qw{
    bitcoin
+    nos
    nostr
+    nostrnovember
    gitlog
    introductions
    jb55
@@ -111,14 +113,14 @@ sub genRandomFilterGroup {
    if (rand() < .15) {
        $f->{ids} = [];
        for (1..(rand()*10)) {
-            push @{$f->{ids}}, randPrefix($ids->[int(rand() * @$ids)], $useLimit);
+            push @{$f->{ids}}, $ids->[int(rand() * @$ids)];
        }
    }

    if (rand() < .3) {
        $f->{authors} = [];
        for (1..(rand()*5)) {
-            push @{$f->{authors}}, randPrefix($pubkeys->[int(rand() * @$pubkeys)], $useLimit);
+            push @{$f->{authors}}, $pubkeys->[int(rand() * @$pubkeys)];
        }
    }
@@ -174,13 +176,6 @@ sub genRandomFilterGroup {
    return \@filters;
 }

-sub randPrefix {
-    my $v = shift;
-    my $noPrefix = shift;
-    return $v if $noPrefix || rand() < .5;
-    return substr($v, 0, (int(rand() * 20) + 1) * 2);
-}
-
 sub genRandomMonitorCmds {
    my $nextConnId = 1;
    my @out;
@@ -236,6 +231,13 @@ sub testMonitor {
    my $fge = encode_json($interestFg);
    print "filt: $fge\n\n";

+    # HACK for debugging:
+    #$fge = q{[{"#t":["nostrnovember","nostr"]}]};
+    #$monCmds = [
+    #    ["sub",1000000,"mysub",decode_json($fge)],
+    #    ["interest",1000000,"mysub"],
+    #];
+
    print "DOING MONS\n";
    my $pid = open2(my $outfile, my $infile, './strfry monitor | jq -r .id | sort | sha256sum');
    for my $c (@$monCmds) { print $infile encode_json($c), "\n"; }