Theta Sketch Examples¶

Basic Sketch Usage¶

In [1]:

from datasketches import theta_sketch, update_theta_sketch, compact_theta_sketch
from datasketches import theta_union, theta_intersection, theta_a_not_b

To start, we'll create a sketch with 1 million points in order to demonstrate basic sketch operations.

In [2]:

n = 1000000
k = 12
sk1 = update_theta_sketch(k)
for i in range(0, n):
    sk1.update(i)
print(sk1)

### Update Theta sketch summary:
   lg nominal size      : 12
   lg current size      : 13
   num retained keys    : 6560
   resize factor        : 8
   sampling probability : 1
   seed hash            : 37836
   ordered?             : false
   theta (fraction)     : 0.00654224
   theta (raw 64-bit)   : 60341508738660257
   estimation mode?     : true
   estimate             : 1.00271e+06
   lower bound 95% conf : 978261
   upper bound 95% conf : 1.02778e+06
### End sketch summary

The summary contains most data fo interest, but we can also query for specific information. And in this case, since we know the exact number of distinct items presented ot the sketch, we can look at the estimate, upper, and lower bounds as a percentage of the exact value.

In [3]:

print("Upper bound (1 std. dev) as % of true value:\t", round(100*sk1.get_upper_bound(1) / n, 4))
print("Sketch estimate as % of true value:\t\t", round(100*sk1.get_estimate() / n, 4))
print("Lower bound (1 std. dev) as % of true value:\t", round(100*sk1.get_lower_bound(1) / n, 4))

Upper bound (1 std. dev) as % of true value:	 101.5208
Sketch estimate as % of true value:		 100.2715
Lower bound (1 std. dev) as % of true value:	 99.0374

We can serialize and reconstruct the sketch. If we compact the sketch prior to serialization, we can still query the rebuilt sketch but cannot update it further.

In [4]:

sk1_bytes = sk1.compact().serialize()
len(sk1_bytes)

Out[4]:

In [5]:

new_sk1 = theta_sketch.deserialize(sk1_bytes)
print("Estimate: \t\t", new_sk1.get_estimate())
print("Estimation mode: \t", new_sk1.is_estimation_mode())

Estimate: 		 1002714.745231455
Estimation mode: 	 True

Sketch Unions¶

Theta Sketch unions make use of a separate union object. The union will accept input sketches with different values of $k$.

For this example, we will create a sketch with distinct values that partially overlap those in sk1.

In [6]:

offset = int(3 * n / 4)
sk2 = update_theta_sketch(k+1)
for i in range(0, n):
    sk2.update(i + offset)
print(sk2)

### Update Theta sketch summary:
   lg nominal size      : 13
   lg current size      : 14
   num retained keys    : 12488
   resize factor        : 8
   sampling probability : 1
   seed hash            : 37836
   ordered?             : false
   theta (fraction)     : 0.0123336
   theta (raw 64-bit)   : 113757656857900725
   estimation mode?     : true
   estimate             : 1.01252e+06
   lower bound 95% conf : 994626
   upper bound 95% conf : 1.03073e+06
### End sketch summary

We can now feed the sketches into the union. As constructed, the exact number of unique values presented to the two sketches is $\frac{7}{4}n$.

In [ ]:

In [7]:

union = theta_union(k)
union.update(sk1)
union.update(sk2)
result = union.get_result()
print("Union estimate as % of true value: ", round(100*result.get_estimate()/(1.75*n), 4))

Union estimate as % of true value:  99.6787

Sketch Intersections¶

Beyond unions, theta sketches also support intersctions through the use of an intersection object. These set intersections can have vastly superior error bounds than the classic inclusion-exclusion rule used with sketches like HLL.

In [8]:

intersection = theta_intersection()
intersection.update(sk1)
intersection.update(sk2)
print("Has result: ", intersection.has_result())
result = intersection.get_result()
print(result)

Has result:  True
### Compact Theta sketch summary:
   num retained keys    : 1668
   seed hash            : 37836
   ordered?             : true
   theta (fraction)     : 0.00654224
   theta (raw 64-bit)   : 60341508738660257
   estimation mode?     : true
   estimate             : 254959
   lower bound 95% conf : 242739
   upper bound 95% conf : 267789
### End sketch summary

In this case, we expect the sets to have an overlap of $\frac{1}{4}n$.

In [9]:

print("Intersection estimate as % of true value: ", round(100*result.get_estimate()/(0.25*n), 4))

Intersection estimate as % of true value:  101.9834

Set Subtraction (A-not-B)¶

Finally, we have the set subtraction operation. Unlike theta_union and theta_intersection, theta_a_not_b always takes as input 2 sketches at a time, namely $a$ and $b$, and directly returns the result as a sketch.

In [10]:

anb = theta_a_not_b()
result = anb.compute(sk1, sk2)
print(result)

### Compact Theta sketch summary:
   num retained keys    : 4892
   seed hash            : 37836
   ordered?             : true
   theta (fraction)     : 0.00654224
   theta (raw 64-bit)   : 60341508738660257
   estimation mode?     : true
   estimate             : 747756
   lower bound 95% conf : 726670
   upper bound 95% conf : 769452
### End sketch summary

By using the same two sketches as before, the expected result here is $\frac{3}{4}n$.

In [11]:

print("A-not-B estimate as % of true value: ", round(100*result.get_estimate()/(0.75*n), 4))

A-not-B estimate as % of true value:  99.7008