More so than other sketches in the library, the Frequent Items sketch can take some practice to use since it identifies exceptionally heavy hitters rather than returning a "top N" list. We assume readers have already familiarized themselves with the sketch documentation and are aware of the key concepts around use of this sketch.
from datasketches import frequent_strings_sketch, frequent_items_error_type
We'll use a very small sketch in this case so that we can easily fill it, otherwise the difference between error types is more difficult to demonstrate.
k = 3
fi = frequent_strings_sketch(k)
A brief digression into implementation details to help explain what we're doing here. The Frequent Items sketch maintains a list of items, but purges the least frequent items when the list fills. For this example, we'll keep inserting items until after a purge takes place.
We'll insert items with exponentially decreasing weights, which in this case gives us a more interesting set of results when we later query things.
n = 8
for i in range(0,n):
fi.update(str(i), 2 ** (n-i))
i += 1
print('Update ' + str(i) + ': ' + str(fi.get_num_active_items()) + ' items')
Update 1: 1 items Update 2: 2 items Update 3: 3 items Update 4: 4 items Update 5: 5 items Update 6: 6 items Update 7: 3 items Update 8: 4 items
We can see where the purge happened, and in this case we inserted a low-weight item after the purge. We can now compare querying items to exclude either false positives or false negatives.
NO_FALSE_POSITIVES
returns all items with a lower bound above the a posteriori errorNO_FALSE_NEGATIVES
returns all items with an upper bound above the a posteriori errorThe latter option will always include any results from the first set and may include others. Items are returned as (id, estimate, lower_bound, upper_bound) and are sorted by decreasing weight.
fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES)
[('0', 256, 224, 256), ('1', 128, 96, 128)]
fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES)
[('0', 256, 224, 256), ('1', 128, 96, 128), ('2', 64, 32, 64), ('7', 34, 2, 34)]
The sketch also allows us to query for individual items directly.
print(fi.get_estimate("0"))
print(fi.get_upper_bound("2"))
print(fi.get_lower_bound("7"))
256 64 2
We can also query for items not in the the list, whether the item has never been seen or if it has been evicted from the active set.
fi.get_estimate("5")
0
The sketch may also be serialized for archiving, and reconstructed.
sk_bytes = fi.serialize()
len(sk_bytes)
84
fi2 = frequent_strings_sketch.deserialize(sk_bytes)
print(fi2)
### Frequent items sketch summary: lg cur map size : 3 lg max map size : 3 num active items : 4 total weight : 510 max error : 32 ### End sketch summary
Frequent Items sketches support merge()
to combine sketches. Keep in mind that the combined sketches may not have any meaningfully frequent items, even if there were frequent items in one of the input sketches.
We'll start by creating a sketch with lots of equally-weighted very light items, but with a combined weight several times greater than that of the first sketch, and then merge that into the first sketch.
fi2 = frequent_strings_sketch(k)
wt = fi.get_total_weight()
for i in range(0,4*wt):
fi2.update(str(i))
fi.merge(fi2)
Even though all these new items have weight 1, there are so many of them that we have nothing if we ask for no fasle positives.
len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES))
0
We do, however, see a few potentially heavy items if we request no false negatives.
len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES))
3