from datasketches import cpc_sketch, cpc_union
We'll create a sketch with log2(k) = 12
sk = cpc_sketch(12)
Insert ~2 million points. Values are hashed, so using sequential integers is fine for demonstration purposes.
n = 1 << 21
for i in range(0, n):
    sk.update(i)
print(sk)
### CPC sketch summary:
   lgK            : 12
   seed hash      : 93cc
   C              : 38212
   flavor         : 4
   merged         : false
   compressed     : false
   intresting col : 5
   HIP estimate   : 2.09721e+06
   kxp            : 11.4725
   offset         : 6
   table          : allocated
   num SV         : 135
   window         : allocated
### End sketch summary
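To illustrate the remark that sequential integers are fine because values are hashed, here is a minimal sketch that hashes consecutive integers and counts how evenly the results spread over a few buckets. It uses BLAKE2b from the standard library purely as a stand-in; it is not the hash function the sketch uses internally, and `bucket_counts` is a hypothetical helper written for this illustration.

```python
import hashlib


def bucket_counts(num_items, num_buckets=16):
    """Hash 0..num_items-1 and count how many hashes land in each bucket."""
    counts = [0] * num_buckets
    for i in range(num_items):
        # Stand-in hash (assumption): BLAKE2b, not the sketch's own hash function.
        digest = hashlib.blake2b(i.to_bytes(8, "little"), digest_size=8).digest()
        h = int.from_bytes(digest, "little")
        counts[h % num_buckets] += 1
    return counts


counts = bucket_counts(16_000)
# With a well-mixing hash, every bucket should hold close to 16000 / 16 = 1000 items.
print(min(counts), max(counts))
```

Even though the inputs are perfectly regular, the hashed values behave like random draws, which is the property a sketch relies on.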
Since we know the exact value of n we can look at the estimate and upper/lower bounds as a % of the true value. We'll look at the bounds at 1 standard deviation. In this case, the true value does lie within the bounds, but since these are probabilistic bounds the true value will sometimes be outside them (especially at 1 standard deviation).
print("Upper bound (1 std. dev) as % of true value: ", round(100*sk.get_upper_bound(1) / n, 4))
Upper bound (1 std. dev) as % of true value: 100.9281
print("Estimate as % of true value: ", round(100*sk.get_estimate() / n, 4))
Estimate as % of true value: 100.0026
print("Lower bound (1 std. dev) as % of true value: ", round(100*sk.get_lower_bound(1) / n, 4))
Lower bound (1 std. dev) as % of true value: 99.0935
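To get a feel for how often the true value should fall inside the bounds, the following sketch computes the coverage of a symmetric interval under the assumption that the estimation error is approximately normally distributed. That assumption is ours, for illustration only; the sketch's actual error distribution may differ somewhat, and `normal_coverage` is a hypothetical helper.

```python
import math


def normal_coverage(num_std_devs):
    """Probability that a normal variable falls within +/- num_std_devs of its mean."""
    return math.erf(num_std_devs / math.sqrt(2))


for num_std_devs in (1, 2, 3):
    # Prints roughly 68.27, 95.45 and 99.73 percent.
    print(num_std_devs, round(100 * normal_coverage(num_std_devs), 2))
```

Under this assumption, bounds at 1 standard deviation would miss the true value roughly a third of the time, which is why wider bounds are usually preferable when a miss is costly.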
Finally, we can serialize and deserialize the sketch, which will give us back the same structure.
sk_bytes = sk.serialize()
len(sk_bytes)
2484
sk2 = cpc_sketch.deserialize(sk_bytes)
print(sk2)
### CPC sketch summary:
   lgK            : 12
   seed hash      : 93cc
   C              : 38212
   flavor         : 4
   merged         : false
   compressed     : false
   intresting col : 5
   HIP estimate   : 2.09721e+06
   kxp            : 11.4725
   offset         : 6
   table          : allocated
   num SV         : 135
   window         : allocated
### End sketch summary
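For a sense of scale, the serialized size reported above can be compared with the cost of keeping every distinct value. The arithmetic below assumes each value would be stored as a raw 64-bit integer with no container overhead, which is a simplification chosen for illustration.

```python
n = 1 << 21            # number of distinct values inserted into the sketch
sketch_bytes = 2484    # len(sk_bytes) reported above
raw_bytes = n * 8      # assumption: 8 bytes per value, no container overhead

print(raw_bytes)                        # 16777216
print(round(raw_bytes / sketch_bytes))  # 6754
```

Under that assumption the serialized sketch is several thousand times smaller than the exact set, at the price of an estimate that is off by a fraction of a percent in this run.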
Here, we'll create two sketches with partial overlap in values. For good measure, we'll let k be larger in one sketch. In most applications we'd build all new sketches with the same size; differences typically creep in only when combining new and historical data.
k = 12
n = 1 << 20
offset = int(3 * n / 4)
sk1 = cpc_sketch(k)
sk2 = cpc_sketch(k + 1)
for i in range(0, n):
    sk1.update(i)
    sk2.update(i + offset)
Create a union object and add the sketches to that. To demonstrate smoothly handling multiple sketch sizes, we'll use a size of k+1 here.
union = cpc_union(k+1)
union.update(sk1)
union.update(sk2)
Note how lgK in the result has automatically adopted the value of the smaller input sketch (12), even though the union was created with k+1.
result = union.get_result()
print(result)
### CPC sketch summary:
   lgK            : 12
   seed hash      : 93cc
   C              : 37418
   flavor         : 4
   merged         : true
   compressed     : false
   intresting col : 5
   HIP estimate   : 0
   kxp            : 4096
   offset         : 6
   table          : allocated
   num SV         : 123
   window         : allocated
### End sketch summary
We can again compare against the exact result, in this case 1.75*n:
print("Estimate as % of true value: ", round(100*result.get_estimate() / (7*n/4), 4))
Estimate as % of true value: 99.6646
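The exact value of 1.75*n follows from the way the two input ranges overlap: the first sketch sees the values 0 to n-1, the second sees offset to offset+n-1 with offset = 3n/4, so together they cover 0 to 7n/4-1. The check below confirms this with plain Python sets on a smaller n; `exact_union_size`, `n_small`, and `offset_small` are hypothetical names used only for this illustration.

```python
def exact_union_size(n, offset):
    """Exact number of distinct values in range(n) combined with range(offset, offset + n)."""
    return len(set(range(n)) | set(range(offset, offset + n)))


n_small = 1 << 10                      # small stand-in for the n = 1 << 20 used above
offset_small = int(3 * n_small / 4)    # same offset rule as in the sketch example

# Both numbers should match: the exact union size and 7n/4.
print(exact_union_size(n_small, offset_small), 7 * n_small // 4)
```

An exact set is practical at this size, but its memory grows linearly with the number of distinct values, which is the cost the sketch avoids.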