#!/usr/bin/env python
# coding: utf-8

# ## Frequent Items Sketch Examples

# ### Basic Sketch Usage

# More so than other sketches in the library, the Frequent Items sketch can take some practice to use since it identifies exceptionally heavy hitters rather than returning a "top N" list. We assume readers have already familiarized themselves with the [sketch documentation](https://datasketches.github.io/docs/Frequency/FrequentItemsOverview.html) and are aware of the key concepts around use of this sketch.

# In[2]:


from datasketches import frequent_strings_sketch, frequent_items_error_type


# We'll use a very small sketch in this case so that we can easily fill it, otherwise the difference between error types is more difficult to demonstrate.

# In[3]:


k = 3
fi = frequent_strings_sketch(k)


# A brief digression into implementation details to help explain what we're doing here. The Frequent Items sketch maintains a list of items, but purges the least frequent items when the list fills. For this example, we'll keep inserting items until after a purge takes place.
# 
# We'll insert items with exponentially decreasing weights, which in this case gives us a more interesting set of results when we later query things.

# In[4]:


n = 8
for i in range(0,n):
    fi.update(str(i), 2 ** (n-i))
    i += 1
    print('Update ' + str(i) + ': ' + str(fi.get_num_active_items()) + ' items')


# We can see where the purge happened, and in this case we inserted a low-weight item after the purge. We can now compare querying items to exclude either false positives or false negatives.
#   - `NO_FALSE_POSITIVES` returns all items with a _lower_ bound above the a posteriori error
#   - `NO_FALSE_NEGATIVES` returns all items with an _upper_ bound above the a posteriori error
# 
# The latter option will always include any results from the first set and may include others. Items are returned as (id, estimate, lower_bound, upper_bound) and are sorted by decreasing weight.

# In[5]:


fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES)


# In[6]:


fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES)


# The sketch also allows us to query for individual items directly.

# In[7]:


print(fi.get_estimate("0"))
print(fi.get_upper_bound("2"))
print(fi.get_lower_bound("7"))


# We can also query for items not in the the list, whether the item has never been seen or if it has been evicted from the active set.

# In[8]:


fi.get_estimate("5")


# The sketch may also be serialized for archiving, and reconstructed.

# In[9]:


sk_bytes = fi.serialize()
len(sk_bytes)


# In[11]:


fi2 = frequent_strings_sketch.deserialize(sk_bytes)
print(fi2)


# ### Merging Example

# Frequent Items sketches support `merge()` to combine sketches. Keep in mind that the combined sketches may not have any meaningfully frequent items, even if there were frequent items in one of the input sketches.
# 
# We'll start by creating a sketch with lots of equally-weighted very light items, but with a combined weight several times greater than that of the first sketch, and then merge that into the first sketch.

# In[12]:


fi2 = frequent_strings_sketch(k)
wt = fi.get_total_weight()
for i in range(0,4*wt):
    fi2.update(str(i))
fi.merge(fi2)


# Even though all these new items have weight 1, there are so many of them that we have nothing if we ask for no fasle positives.

# In[13]:


len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES))


# We do, however, see a few potentially heavy items if we request no false negatives.

# In[14]:


len(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_NEGATIVES))