axiomhq / hyperloglog
- вторник, 27 июня 2017 г. в 03:12:52
HyperLogLog with lots of sugar (Sparse, LogLog-Beta bias correction and TailCut space reduction)
An improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset using 20-50% less space than other usual HyperLogLog implementations.
This work is based on "Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen".
The core difference to other implementations are:
In general it borrows a lot from the InfluxData's fork of Clark Duvall HyperLogLog++ implementation, but uses 50% less space.
A direct comparsion with the HyperLogLog++ implementation used by InfluxDB, yielded the following results.
Exact | Axiom | Influx |
---|---|---|
10 | 10 (0.0% off) | 10 (0.0% off) |
50 | 50 (0.0% off) | 50 (0.0% off) |
250 | 250 (0.0% off) | 250 (0.0% off) |
1250 | 1249 (0.08% off) | 1249 (0.08% off) |
6250 | 6250 (0.0% off) | 6250 (0.0% off) |
31250 | 31008 (0.7744% off) | 31565 (1.0080% off) |
156250 | 156013 (0.1517% off) | 156652 (0.2573% off) |
781250 | 782364 (0.1426% off) | 775988 (0.6735% off) |
3906250 | 3869332 (0.9451% off) | 3889909 (0.4183% off) |
10000000 | 9952682 (0.4732% off) | 9889556 (1.1044% off) |
A big thank you to Prof. Shigang Chen and his team at the University of Florida who are actively conducting research around "Big Network Data".