allow filtering for indexed values > 255 bytes

Doug Hoyte
2023-02-05 15:02:36 -05:00
parent 1987c5a669
commit 93ca4b9044
2 changed files with 8 additions and 5 deletions


@@ -195,7 +195,7 @@ A `FilterGroup` is a vector of `Filter` objects. When the Ingester receives a `R
 In order to determine if an event matches against a `Filter`, first the `since` and `until` fields are checked. Then, each field of the event for which a filter item was specified is looked up in the corresponding lookup table. Specifically, the upper-bound index is determined using a binary search (for example `std::upper_bound`). This is the first element greater than the event's item. Then the preceding table item is checked for either a prefix (`ids`/`authors`) or exact (everything else) match.
-Since testing `Filter`s against events is performed so frequently, it is a performance-critical operation and some optimisations have been applied. For example, each filter item in the lookup table is represented by a 4-byte data structure, one byte of which is the first byte of the field and the rest are offset/size lookups into a single memory allocation containing the remaining bytes. Under typical scenarios, this will greatly reduce the amount of memory that needs to be loaded to process a filter. Filters with 16 or fewer items can often be rejected with the load of a single cache line. Because filters aren't scanned linearly, the number of items in a filter (i.e. the number of pubkeys) doesn't have a significant impact on processing resources.
+Since testing `Filter`s against events is performed so frequently, it is a performance-critical operation and some optimisations have been applied. For example, each filter item in the lookup table is represented by an 8-byte data structure, one byte of which is the first byte of the field and the rest are offset/size lookups into a single memory allocation containing the remaining bytes. Under typical scenarios, this will greatly reduce the amount of memory that needs to be loaded to process a filter. Filters with 8 or fewer items can often be rejected with the load of a single cache line. Because filters aren't scanned linearly, the number of items in a filter (i.e. the number of pubkeys) doesn't have a significant impact on processing resources.
 #### DBScan
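The lookup described in the documentation above is easy to sketch in isolation. The following is a minimal illustration, not strfry's actual implementation: it assumes the filter items are kept in a sorted `std::vector<std::string>` and shows how `std::upper_bound` plus a single prefix check against the preceding element decides a match.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch of the matching step described above. `sortedItems`
// stands in for the filter's lookup table, `candidate` for the event's field.
bool matchesPrefix(const std::vector<std::string> &sortedItems, const std::string &candidate) {
    // Binary search for the first element strictly greater than the candidate.
    auto it = std::upper_bound(sortedItems.begin(), sortedItems.end(), candidate);
    if (it == sortedItems.begin()) return false; // all items sort above the candidate
    --it; // the preceding item is the only possible prefix match
    return candidate.compare(0, it->size(), *it) == 0; // true iff *it prefixes candidate
}
```

Checking only the one preceding element works because duplicates and redundant prefixes are stripped when the table is built, as visible in the constructor diff below.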


@@ -7,9 +7,10 @@
 struct FilterSetBytes {
     struct Item {
-        uint16_t offset;
-        uint8_t size;
+        uint32_t offset;
+        uint16_t size;
         uint8_t firstByte;
+        uint8_t padding;
     };

     std::vector<Item> items;
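For context, the widened `Item` is exactly the 8-byte structure the documentation change above refers to: 4 + 2 + 1 + 1 bytes with explicit padding, so eight of them fit in one 64-byte cache line. A restatement of the layout, with a `static_assert` of my own added to make the size claim checkable:

```cpp
#include <cstdint>

struct Item {
    uint32_t offset;    // byte offset into the shared allocation (was uint16_t)
    uint16_t size;      // post-hex-decode length (was uint8_t, hence the old 255-byte ceiling)
    uint8_t firstByte;  // first byte of the item, enabling a cheap early reject
    uint8_t padding;    // explicit padding: keeps sizeof(Item) at exactly 8
};

static_assert(sizeof(Item) == 8, "a 64-byte cache line holds exactly 8 items");
```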
@@ -18,6 +19,8 @@ struct FilterSetBytes {
     // Sizes are post-hex decode
     FilterSetBytes(const tao::json::value &arrHex, bool hexDecode, size_t minSize, size_t maxSize) {
+        if (maxSize > std::numeric_limits<uint16_t>::max()) throw herr("filter maxSize too big");
+
         std::vector<std::string> arr;
         uint64_t totalSize = 0;
@@ -34,11 +37,11 @@ struct FilterSetBytes {
         for (const auto &item : arr) {
             if (items.size() > 0 && item.starts_with(at(items.size() - 1))) continue; // remove duplicates and redundant prefixes
-            items.emplace_back(Item{ (uint16_t)buf.size(), (uint8_t)item.size(), (uint8_t)item[0] });
+            items.emplace_back(Item{ (uint32_t)buf.size(), (uint16_t)item.size(), (uint8_t)item[0] });
             buf += item;
         }

-        if (buf.size() > 65535) throw herr("total filter items too large");
+        if (buf.size() > 1'000'000) throw herr("total filter items too large");
     }

     std::string at(size_t n) const {
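Taken together, the two new guards preserve the struct's invariants: no single item may exceed what `Item::size` (`uint16_t`) can represent, and the shared buffer, now capped at 1'000'000 bytes instead of 65535, stays comfortably within what `Item::offset` (`uint32_t`) can address. A hypothetical standalone restatement of the checks, with `herr` swapped for a standard exception:

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical restatement of the commit's two size guards.
void checkFilterLimits(size_t maxSize, size_t totalBufSize) {
    // Item::size is uint16_t, so no single decoded item may exceed 65535 bytes.
    if (maxSize > std::numeric_limits<uint16_t>::max())
        throw std::runtime_error("filter maxSize too big");
    // Item::offset is uint32_t; the 1'000'000-byte cap keeps every offset far below it.
    if (totalBufSize > 1'000'000)
        throw std::runtime_error("total filter items too large");
}
```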