axiomhq / hyperloglog

вторник, 27 июня 2017 г. в 03:12:52

HyperLogLog with lots of sugar (Sparse, LogLog-Beta bias correction and TailCut space reduction)

HyperLogLog

An improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset using 20-50% less space than other usual HyperLogLog implementations.

This work is based on "Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen".

Implementation

The core difference to other implementations are:

use metro hash instead of xxhash
sparse representation for lower cadinalities (like HyperLogLog++)
loglog-beta for dynamic bias correction medium and high cardinalities.
4-bit register instead of 5 (HLL) and 6 (HLL++), but most implementations use 1 byte registers out of convinience

In general it borrows a lot from the InfluxData's fork of Clark Duvall HyperLogLog++ implementation, but uses 50% less space.

Results

A direct comparsion with the HyperLogLog++ implementation used by InfluxDB, yielded the following results.

Exact	Axiom	Influx
10	10 (0.0% off)	10 (0.0% off)
50	50 (0.0% off)	50 (0.0% off)
250	250 (0.0% off)	250 (0.0% off)
1250	1249 (0.08% off)	1249 (0.08% off)
6250	6250 (0.0% off)	6250 (0.0% off)
31250	31008 (0.7744% off)	31565 (1.0080% off)
156250	156013 (0.1517% off)	156652 (0.2573% off)
781250	782364 (0.1426% off)	775988 (0.6735% off)
3906250	3869332 (0.9451% off)	3889909 (0.4183% off)
10000000	9952682 (0.4732% off)	9889556 (1.1044% off)

Note

A big thank you to Prof. Shigang Chen and his team at the University of Florida who are actively conducting research around "Big Network Data".