mirror of
https://github.com/hoytech/strfry.git
synced 2025-06-16 16:28:50 +00:00
remove prefix matching for ids and authors filter fields
- this also fixes https://github.com/hoytech/strfry/issues/109
@@ -239,7 +239,7 @@ In nostr, each `REQ` message from a subscriber can contain multiple filters. We

 A `FilterGroup` is a vector of `Filter` objects. When the Ingester receives a `REQ`, the JSON filter items are compiled into `Filter`s and the original JSON is discarded. Each filter item's specified fields are compiled into sorted lookup tables called filter sets.

-In order to determine if an event matches against a `Filter`, first the `since` and `until` fields are checked. Then, each field of the event for which a filter item was specified is looked up in the corresponding lookup table. Specifically, the upper-bound index is determined using a binary search (for example `std::upper_bound`). This is the first element greater than the event's item. Then the preceding table item is checked for either a prefix (`ids`/`authors`) or exact (everything else) match.
+In order to determine if an event matches against a `Filter`, first the `since` and `until` fields are checked. Then, each field of the event for which a filter item was specified is looked up in the corresponding lookup table. Specifically, the upper-bound index is determined using a binary search (for example `std::upper_bound`). This is the first element greater than the event's item. Then the preceding table item is checked for a match.

 Since testing `Filter`s against events is performed so frequently, it is a performance-critical operation and some optimisations have been applied. For example, each filter item in the lookup table is represented by a 4-byte data structure: one byte is the first byte of the field, and the rest are offset/size lookups into a single memory allocation containing the remaining bytes. Under typical scenarios, this greatly reduces the amount of memory that needs to be loaded to process a filter. Filters with 16 or fewer items can often be rejected with the load of a single cache line. Because filters aren't scanned linearly, the number of items in a filter (i.e. the number of pubkeys) doesn't have a significant impact on processing resources.
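The lookup described above can be sketched as a standalone function. This is a hypothetical simplification: the real code searches the packed `FilterSetBytes` structure rather than a `std::vector<std::string>`.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical standalone version of the exact-match lookup described above.
// `table` must be sorted, as the compiled filter sets are.
bool filterSetContains(const std::vector<std::string> &table,
                       const std::string &candidate) {
    // upper_bound returns the first element greater than the candidate...
    auto it = std::upper_bound(table.begin(), table.end(), candidate);
    if (it == table.begin()) return false; // every element is greater: no match
    // ...so the element immediately before it is the only possible exact match.
    return *std::prev(it) == candidate;
}
```

A single binary search plus one comparison suffices once prefix matching is gone.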
@@ -251,7 +251,7 @@ Because events are stored in the same flatbuffers format in memory and "in the d

 When a user's `REQ` is being processed for the initial "old" data, each `Filter` in its `FilterGroup` is analysed and the best index is determined according to a simple heuristic. For each filter item in the `Filter`, the index is scanned backwards starting at the upper-bound of that filter item. Because all indices are composite keyed with `created_at`, the scanner also jumps to the `until` time when possible. Each event is compared against the compiled `Filter` and, if it matches, sent to the Websocket thread to be sent to the subscriber. The scan completes when one of the following is true:

-* The key no longer matches the filter item (exact or prefix, depending on field)
+* The key no longer matches the filter item
 * The event's `created_at` is before the `since` filter field
 * The filter's `limit` of delivered events has been reached
@@ -280,7 +280,7 @@ After the subscription is all caught up to the current transaction's snapshot, t

 Whenever a new event is processed, all of its fields are looked up in the various monitor sets, which provides a list of filters that should be fully processed to check for a match. If an event has no fields in common with a filter, a match will not be attempted for this filter.

-For example, for each prefix in the `authors` field in a filter, an entry is added to the `allAuthors` monitor set. When a new event is subsequently detected, the `pubkey` is looked up in `allAuthors` according to a binary search. Then the data-structure is scanned until it stops seeing records that are prefix matches against the `pubkey`. All of these matching records are pointers to corresponding `Filter`s of the REQs that have subscribed to this author. The filters must then be processed to determine if the event satisfies the other parameters of each filter (`since`/`until`/etc).
+For example, for each item in the `authors` field in a filter, an entry is added to the `allAuthors` monitor set. When a new event is subsequently detected, the `pubkey` is looked up in `allAuthors` according to a binary search. Then the data-structure is scanned until it stops seeing records that match the `pubkey`. All of these matching records are pointers to corresponding `Filter`s of the REQs that have subscribed to this author. The filters must then be processed to determine if the event satisfies the other parameters of each filter (`since`/`until`/etc).

 After comparing the event against each filter detected via the inverted index, that filter is marked as "up-to-date" with this event's ID, whether the filter matched or not. This prevents needlessly re-comparing this filter against the same event in the future (in case one of the *other* index lookups matches it). If a filter *does* match, then the entire filter group is marked as up-to-date. This prevents sending the same event multiple times in case multiple filters in a filter group match, and also prevents needlessly comparing other filters in the group against an event that has already been sent.
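As an illustration of the monitor-set lookup after this change (using a plain `std::map` in place of strfry's `btree_map`, with ints standing in for `Filter` pointers):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for a set of pointers to subscribed Filters (ints for brevity).
using MonitorSet = std::vector<int>;

// Exact-match lookup in the allAuthors monitor set: after this commit no
// prefix scan is needed, so a single find() suffices.
MonitorSet lookupAuthor(const std::map<std::string, MonitorSet> &allAuthors,
                        const std::string &pubkey) {
    auto it = allAuthors.find(pubkey);
    return it == allAuthors.end() ? MonitorSet{} : it->second;
}
```

Each returned entry would then be checked against the filter's remaining constraints (`since`/`until`/etc.).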
TODO
@@ -2,8 +2,6 @@
 latest negentropy
 make all std::function's const refs
 split source meta-data off Event table
-remove prefix matching
-fix related github bug, add tests
@@ -100,17 +100,6 @@ struct ActiveMonitors : NonCopyable {
        }
    };

-    auto processMonitorsPrefix = [&](btree_map<std::string, MonitorSet> &m, const std::string &key, std::function<bool(const std::string&)> matches){
-        auto it = m.lower_bound(key.substr(0, 1));
-
-        if (it == m.end()) return;
-
-        while (it != m.end() && it->first[0] == key[0]) {
-            if (matches(it->first)) processMonitorSet(it->second);
-            it = std::next(it);
-        }
-    };
-
    auto processMonitorsExact = [&]<typename T>(btree_map<T, MonitorSet> &m, const T &key, std::function<bool(const T &)> matches){
        auto it = m.upper_bound(key);
@@ -128,15 +117,15 @@ struct ActiveMonitors : NonCopyable {

    {
        auto id = std::string(packed.id());
-        processMonitorsPrefix(allIds, id, static_cast<std::function<bool(const std::string&)>>([&](const std::string &val){
-            return id.starts_with(val);
+        processMonitorsExact(allIds, id, static_cast<std::function<bool(const std::string&)>>([&](const std::string &val){
+            return id == val;
        }));
    }

    {
        auto pubkey = std::string(packed.pubkey());
-        processMonitorsPrefix(allAuthors, pubkey, static_cast<std::function<bool(const std::string&)>>([&](const std::string &val){
-            return pubkey.starts_with(val);
+        processMonitorsExact(allAuthors, pubkey, static_cast<std::function<bool(const std::string&)>>([&](const std::string &val){
+            return pubkey == val;
        }));
    }
@@ -24,7 +24,6 @@ struct DBScan : NonCopyable {
    enum class KeyMatchResult {
        Yes,
        No,
-        NoButContinue,
    };

    struct ScanCursor {
@@ -111,13 +110,13 @@ struct DBScan : NonCopyable {

        cursors.reserve(f.ids->size());
        for (uint64_t i = 0; i < f.ids->size(); i++) {
-            std::string prefix = f.ids->at(i);
+            std::string search = f.ids->at(i);

            cursors.emplace_back(
-                padBytes(prefix, 32 + 8, '\xFF'),
+                search + std::string(8, '\xFF'),
                MAX_U64,
-                [prefix](std::string_view k){
-                    return k.starts_with(prefix) ? KeyMatchResult::Yes : KeyMatchResult::No;
+                [search](std::string_view k){
+                    return k.starts_with(search) ? KeyMatchResult::Yes : KeyMatchResult::No;
                }
            );
        }
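The `search + std::string(8, '\xFF')` bound above replaces the old `padBytes` call because ids are now always exactly 32 bytes; a minimal sketch of the idea:

```cpp
#include <cassert>
#include <string>

// Index keys are the 32-byte id/pubkey followed by a big-endian 64-bit
// created_at. Appending eight 0xFF bytes to an exact 32-byte item yields a
// key that sorts at or after every (item, created_at) entry for that item,
// giving the backward scan its starting upper bound.
std::string scanUpperBound(const std::string &item) {
    return item + std::string(8, '\xFF');
}
```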
@@ -161,22 +160,15 @@ struct DBScan : NonCopyable {
            for (uint64_t j = 0; j < f.kinds->size(); j++) {
                uint64_t kind = f.kinds->at(j);

-                std::string prefix = f.authors->at(i);
-                if (prefix.size() == 32) prefix += lmdb::to_sv<uint64_t>(kind);
+                std::string search = f.authors->at(i);
+                search += lmdb::to_sv<uint64_t>(kind);

                cursors.emplace_back(
-                    padBytes(prefix, 32 + 8 + 8, '\xFF'),
+                    search + std::string(8, '\xFF'),
                    MAX_U64,
-                    [prefix, kind](std::string_view k){
-                        if (!k.starts_with(prefix)) return KeyMatchResult::No;
-                        if (prefix.size() == 32 + 8) return KeyMatchResult::Yes;
-
-                        ParsedKey_StringUint64Uint64 parsedKey(k);
-                        if (parsedKey.n1 == kind) return KeyMatchResult::Yes;
-
-                        // With a prefix pubkey, continue scanning (pubkey,kind) backwards because with this index
-                        // we don't know the next pubkey to jump back to
-                        return KeyMatchResult::NoButContinue;
+                    [search, kind](std::string_view k){
+                        if (!k.starts_with(search)) return KeyMatchResult::No;
+                        return KeyMatchResult::Yes;
                    }
                );
            }
@@ -189,13 +181,13 @@ struct DBScan : NonCopyable {

        cursors.reserve(f.authors->size());
        for (uint64_t i = 0; i < f.authors->size(); i++) {
-            std::string prefix = f.authors->at(i);
+            std::string search = f.authors->at(i);

            cursors.emplace_back(
-                padBytes(prefix, 32 + 8, '\xFF'),
+                search + std::string(8, '\xFF'),
                MAX_U64,
-                [prefix](std::string_view k){
-                    return k.starts_with(prefix) ? KeyMatchResult::Yes : KeyMatchResult::No;
+                [search](std::string_view k){
+                    return k.starts_with(search) ? KeyMatchResult::Yes : KeyMatchResult::No;
                }
            );
        }
@@ -29,8 +29,9 @@ struct FilterSetBytes {

        std::sort(arr.begin(), arr.end());

-        for (const auto &item : arr) {
-            if (items.size() > 0 && item.starts_with(at(items.size() - 1))) continue; // remove duplicates and redundant prefixes
+        for (size_t i = 0; i < arr.size(); i++) {
+            const auto &item = arr[i];
+            if (i > 0 && item == arr[i - 1]) continue; // remove duplicates
            items.emplace_back(Item{ (uint16_t)buf.size(), (uint8_t)item.size(), (uint8_t)item[0] });
            buf += item;
        }
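The new duplicate-removal loop above is equivalent to running `std::unique` over the sorted array; a minimal sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// After prefix support is removed, de-duplicating the sorted filter items
// only requires comparing adjacent elements, i.e. std::unique, instead of
// the old starts_with check for redundant prefixes.
std::vector<std::string> dedupSorted(std::vector<std::string> arr) {
    std::sort(arr.begin(), arr.end());
    arr.erase(std::unique(arr.begin(), arr.end()), arr.end());
    return arr;
}
```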
@@ -72,7 +73,7 @@ struct FilterSetBytes {
        }

        if (first == 0) return false;
-        if (candidate.starts_with(std::string_view(buf.data() + items[first - 1].offset, items[first - 1].size))) return true;
+        if (candidate == std::string_view(buf.data() + items[first - 1].offset, items[first - 1].size)) return true;

        return false;
    }
@@ -123,14 +124,14 @@ struct NostrFilter {
        for (const auto &[k, v] : filterObj.get_object()) {
            if (v.is_array() && v.get_array().size() == 0) {
                neverMatch = true;
-                break;
+                continue;
            }

            if (k == "ids") {
-                ids.emplace(v, true, 1, 32);
+                ids.emplace(v, true, 32, 32);
                numMajorFields++;
            } else if (k == "authors") {
-                authors.emplace(v, true, 1, 32);
+                authors.emplace(v, true, 32, 32);
                numMajorFields++;
            } else if (k == "kinds") {
                kinds.emplace(v);
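The `(32, 32)` arguments are presumably minimum and maximum item lengths in bytes, so `ids`/`authors` entries must now be exactly 32 bytes rather than at least 1; a hypothetical sketch of such a length check (not strfry's actual `FilterSetBytes` constructor):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical length check matching the presumed meaning of the (min, max)
// arguments: ids and authors must now be exactly 32 bytes (64 hex characters).
void checkLen(const std::string &item, size_t minLen, size_t maxLen) {
    if (item.size() < minLen || item.size() > maxLen)
        throw std::runtime_error("filter item length out of range");
}
```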
@@ -159,7 +160,7 @@ struct NostrFilter {
        }
    }

-    if (tags.size() > 2) throw herr("too many tags in filter"); // O(N^2) in matching, just prohibit it
+    if (tags.size() > 3) throw herr("too many tags in filter"); // O(N^2) in matching, just prohibit it

    if (limit > maxFilterLimit) limit = maxFilterLimit;
@@ -15,5 +15,4 @@ std::string renderPercent(double p);
 uint64_t parseUint64(const std::string &s);
 std::string parseIP(const std::string &ip);
 uint64_t getDBVersion(lmdb::txn &txn);
-std::string padBytes(std::string_view str, size_t n, char padChar);
 void exitOnSigPipe();
@@ -110,11 +110,6 @@ uint64_t getDBVersion(lmdb::txn &txn) {
 }


-std::string padBytes(std::string_view str, size_t n, char padChar) {
-    if (str.size() > n) throw herr("unable to pad, string longer than expected");
-    return std::string(str) + std::string(n - str.size(), padChar);
-}
-
 void exitOnSigPipe() {
    struct sigaction act;
    memset(&act, 0, sizeof act);
@@ -46,7 +46,7 @@ sub doesMatchSingle {
    if ($filter->{ids}) {
        my $found;
        foreach my $id (@{ $filter->{ids} }) {
-            if (startsWith($ev->{id}, $id)) {
+            if ($ev->{id} eq $id) {
                $found = 1;
                last;
            }
@@ -57,7 +57,7 @@ sub doesMatchSingle {
    if ($filter->{authors}) {
        my $found;
        foreach my $author (@{ $filter->{authors} }) {
-            if (startsWith($ev->{pubkey}, $author)) {
+            if ($ev->{pubkey} eq $author) {
                $found = 1;
                last;
            }
@@ -117,7 +117,3 @@ sub doesMatchSingle {

    return 1;
 }
-
-sub startsWith {
-    return rindex($_[0], $_[1], 0) == 0;
-}
@@ -89,7 +89,9 @@ c1e5e04d92d9bd20701bff4cbdac1cdc317d405035883b7adcf9a6a5308d0f54

 my $topics = [qw{
    bitcoin
+    nos
    nostr
    nostrnovember
+    gitlog
    introductions
    jb55
@@ -111,14 +113,14 @@ sub genRandomFilterGroup {
    if (rand() < .15) {
        $f->{ids} = [];
        for (1..(rand()*10)) {
-            push @{$f->{ids}}, randPrefix($ids->[int(rand() * @$ids)], $useLimit);
+            push @{$f->{ids}}, $ids->[int(rand() * @$ids)];
        }
    }

    if (rand() < .3) {
        $f->{authors} = [];
        for (1..(rand()*5)) {
-            push @{$f->{authors}}, randPrefix($pubkeys->[int(rand() * @$pubkeys)], $useLimit);
+            push @{$f->{authors}}, $pubkeys->[int(rand() * @$pubkeys)];
        }
    }
@@ -174,13 +176,6 @@ sub genRandomFilterGroup {
    return \@filters;
 }

-sub randPrefix {
-    my $v = shift;
-    my $noPrefix = shift;
-    return $v if $noPrefix || rand() < .5;
-    return substr($v, 0, (int(rand() * 20) + 1) * 2);
-}
-
 sub genRandomMonitorCmds {
    my $nextConnId = 1;
    my @out;
@@ -236,6 +231,13 @@ sub testMonitor {
    my $fge = encode_json($interestFg);
    print "filt: $fge\n\n";

+    # HACK for debugging:
+    #$fge = q{[{"#t":["nostrnovember","nostr"]}]};
+    #$monCmds = [
+    #    ["sub",1000000,"mysub",decode_json($fge)],
+    #    ["interest",1000000,"mysub"],
+    #];
+
    print "DOING MONS\n";
    my $pid = open2(my $outfile, my $infile, './strfry monitor | jq -r .id | sort | sha256sum');
    for my $c (@$monCmds) { print $infile encode_json($c), "\n"; }