#!/usr/bin/env python # coding: utf-8 # ## Frequent Items Sketch Examples # ### Basic Sketch Usage # More so than other sketches in the library, the Frequent Items sketch can take some practice to use since it identifies exceptionally heavy hitters rather than returning a "top N" list. We assume readers have already familiarized themselves with the [sketch documentation](https://datasketches.github.io/docs/Frequency/FrequentItemsOverview.html) and are aware of the key concepts around use of this sketch. # In[2]: from datasketches import frequent_strings_sketch, frequent_items_error_type # We'll use a very small sketch in this case so that we can easily fill it, otherwise the difference between error types is more difficult to demonstrate. # In[3]: k = 3 fi = frequent_strings_sketch(k) # A brief digression into implementation details to help explain what we're doing here. The Frequent Items sketch maintains a list of items, but purges the least frequent items when the list fills. For this example, we'll keep inserting items until after a purge takes place. # # We'll insert items with exponentially decreasing weights, which in this case gives us a more interesting set of results when we later query things. # In[4]: n = 8 for i in range(0,n): fi.update(str(i), 2 ** (n-i)) i += 1 print('Update ' + str(i) + ': ' + str(fi.get_num_active_items()) + ' items') # We can see where the purge happened, and in this case we inserted a low-weight item after the purge. We can now compare querying items to exclude either false positives or false negatives. # - `NO_FALSE_POSITIVES` returns all items with a _lower_ bound above the a posteriori error # - `NO_FALSE_NEGATIVES` returns all items with an _upper_ bound above the a posteriori error # # The latter option will always include any results from the first set and may include others. Items are returned as (id, estimate, lower_bound, upper_bound) and are sorted by decreasing weight. # In[5]: fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES) # In[6]: fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES) # The sketch also allows us to query for individual items directly. # In[7]: print(fi.get_estimate("0")) print(fi.get_upper_bound("2")) print(fi.get_lower_bound("7")) # We can also query for items not in the the list, whether the item has never been seen or if it has been evicted from the active set. # In[8]: fi.get_estimate("5") # The sketch may also be serialized for archiving, and reconstructed. # In[9]: sk_bytes = fi.serialize() len(sk_bytes) # In[11]: fi2 = frequent_strings_sketch.deserialize(sk_bytes) print(fi2) # ### Merging Example # Frequent Items sketches support `merge()` to combine sketches. Keep in mind that the combined sketches may not have any meaningfully frequent items, even if there were frequent items in one of the input sketches. # # We'll start by creating a sketch with lots of equally-weighted very light items, but with a combined weight several times greater than that of the first sketch, and then merge that into the first sketch. # In[12]: fi2 = frequent_strings_sketch(k) wt = fi.get_total_weight() for i in range(0,4*wt): fi2.update(str(i)) fi.merge(fi2) # Even though all these new items have weight 1, there are so many of them that we have nothing if we ask for no fasle positives. # In[13]: len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES)) # We do, however, see a few potentially heavy items if we request no false negatives. # In[14]: len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES))