Count-distinct problem Contents Formal definition Naive solution HyperLogLog algorithm Streaming...
Statistical algorithms
IP addressesrouterunique visitorsDNARFIDsensor networksstreaming algorithmsstreaming algorithmshashminimum-variance unbiased estimatormaximum likelihoodHyperLogLogFlajoletDaniel M. KaneIPHyperLogLogHyperLogLog
In computer science, the count-distinct problem[1]
(also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements.
This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.
Contents
1 Formal definition
2 Naive solution
3 HyperLogLog algorithm
4 Streaming algorithms
4.1 Min/max sketches
4.2 Bottom-m sketches
5 Weighted count-distinct problem
6 Solving the weighted count-distinct problem
7 See also
8 References
Formal definition
Instance: A stream of elements x1,x2,…,xs{displaystyle x_{1},x_{2},ldots ,x_{s}} with repetitions, and an integer m{displaystyle m}. Let n{displaystyle n} be the number of distinct elements, namely n=|{x1,x2,…,xs}|{displaystyle n=|left{{x_{1},x_{2},ldots ,x_{s}}right}|}, and let these elements be {e1,e2,…,en}{displaystyle left{{e_{1},e_{2},ldots ,e_{n}}right}}.
Objective: Find an estimate n^{displaystyle {widehat {n}}} of n{displaystyle n} using only m{displaystyle m} storage units, where m≪n{displaystyle mll n}.
An example of an instance for the cardinality estimation problem is the stream: a,b,a,c,d,b,d{displaystyle a,b,a,c,d,b,d}. For this instance, n=|{a,b,c,d}|=4{displaystyle n=|left{{a,b,c,d}right}|=4}.
Naive solution
The naive solution to the problem is as follows:
Initialize a counter, c, to zero, c←0{displaystyle cleftarrow 0}.
Initialize an efficient dictionary data structure, D, such as hash table or search tree in which insertion and membership can be performed quickly.
For each element xi{displaystyle x_{i}}, a membership query is issued.
If xi{displaystyle x_{i}} is not a member of D (xi∉D{displaystyle x_{i}notin D})
Add xi{displaystyle x_{i}} to D
Increase c by one, c←c+1{displaystyle cleftarrow c+1}
Otherwise (xi∈D{displaystyle x_{i}in D}) do nothing.
Output n=c{displaystyle n=c}.
As long as the number of distinct elements is not too big, D fits in main memory and an exact answer can be retrieved.
However, this approach does not scale for bounded storage, or if the computation performed for each element xi{displaystyle x_{i}} should be minimized. In such a case, several streaming algorithms have been proposed that use a fixed number of storage units.
HyperLogLog algorithm
Streaming algorithms
To handle the bounded storage constraint, streaming algorithms use a randomization to produce a non-exact estimation of the distinct number of elements, n{displaystyle n}.
State-of-the-art estimators hash every element ej{displaystyle e_{j}} into a low-dimensional data sketch using a hash function, h(ej){displaystyle h(e_{j})}.
The different techniques can be classified according to the data sketches they store.
Min/max sketches
Min/max sketches [2][3] store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al. [4] presents max sketch which is the minimum-variance unbiased estimator for the problem. The continuous max sketches estimator [5] is the maximum likelihood estimator. The estimator of choice in practice is the HyperLogLog algorithm.[6]
The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element ej{displaystyle e_{j}} is associated with a uniform RV, h(ej)∼U(0,1){displaystyle h(e_{j})sim U(0,1)}, the expected minimum value of h(e1),h(e2),…,h(en){displaystyle h(e_{1}),h(e_{2}),ldots ,h(e_{n})} is 1/(n+1){displaystyle 1/(n+1)}. The hash function guarantees that h(ej){displaystyle h(e_{j})} is identical for all the appearances of ej{displaystyle e_{j}}. Thus, the existence of duplicates does not affect the value of the extreme order statistics.
There are other estimation techniques other than min/max sketches. The first paper on count-distinct estimation by Flajolet et al. [7] describes a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by Daniel M. Kane, Jelani Nelson, and David P. Woodruff.[8]
Bottom-m sketches
Bottom-m sketches
[9]
are a generalization of min sketches, which maintain the m{displaystyle m} minimal values, where m≥1{displaystyle mgeq 1}.
See Cosma et al.[2] for a theoretical overview of count-distinct estimation algorithms, and Metwally
[10]
for a practical overview with comparative simulation results.
Weighted count-distinct problem
In its weighted version, each element is associated with a weight and the goal is to estimate the total sum of weights.
Formally,
Instance: A stream of weighted elements x1,x2,…,xs{displaystyle x_{1},x_{2},ldots ,x_{s}} with repetitions, and an integer m{displaystyle m}. Let n{displaystyle n} be the number of distinct elements, namely n=|{x1,x2,…,xs}|{displaystyle n=|left{{x_{1},x_{2},ldots ,x_{s}}right}|}, and let these elements be {e1,e2,…,en}{displaystyle left{{e_{1},e_{2},ldots ,e_{n}}right}}. Finally, let wj{displaystyle w_{j}} be the weight of ej{displaystyle e_{j}}.
Objective: Find an estimate w^{displaystyle {widehat {w}}} of w=∑j=1nwj{displaystyle w=sum _{j=1}^{n}w_{j}} using only m{displaystyle m} storage units, where m≪n{displaystyle mll n}.
An example of an instance for the weighted problem is: a(3),b(4),a(3),c(2),d(3),b(4),d(3){displaystyle a(3),b(4),a(3),c(2),d(3),b(4),d(3)}. For this instance, e1=a,e2=b,e3=c,e4=d{displaystyle e_{1}=a,e_{2}=b,e_{3}=c,e_{4}=d}, the weights are w1=3,w2=4,w3=2,w4=3{displaystyle w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3} and ∑wj=12{displaystyle sum {w_{j}}=12}.
As an application example, x1,x2,…,xs{displaystyle x_{1},x_{2},ldots ,x_{s}} could be IP packets received by a server. Each packet belongs to one of n{displaystyle n} IP flows e1,e2,…,en{displaystyle e_{1},e_{2},ldots ,e_{n}}. The weight wj{displaystyle w_{j}} can be the load imposed by flow ej{displaystyle e_{j}} on the server. Thus, ∑j=1nwj{displaystyle sum _{j=1}^{n}{w_{j}}} represents the total load imposed on the server by all the flows to which packets x1,x2,…,xs{displaystyle x_{1},x_{2},ldots ,x_{s}} belong.
Solving the weighted count-distinct problem
Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem
.[11]
For example, the weighted estimator proposed by Cohen et al.[5] can be obtained when the continuous max sketches estimator is extended to solve the weighted problem.
In particular, the HyperLogLog algorithm [6] can be extended to solve the weighted problem. The extended HyperLogLog algorithm offers the best performance, in terms of statistical accuracy and memory usage, among all the other known algorithms for the weighted problem.
See also
- Count–min sketch
- Streaming algorithm
- Maximum likelihood
- Minimum-variance unbiased estimator
References
^ Ullman, Jeff; Rajaraman, Anand; Leskovec, Jure. "Mining data streams" (PDF)..mw-parser-output cite.citation{font-style:inherit}.mw-parser-output .citation q{quotes:"""""""'""'"}.mw-parser-output .citation .cs1-lock-free a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/6/65/Lock-green.svg/9px-Lock-green.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output .citation .cs1-lock-limited a,.mw-parser-output .citation .cs1-lock-registration a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Lock-gray-alt-2.svg/9px-Lock-gray-alt-2.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output .citation .cs1-lock-subscription a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Lock-red-alt-2.svg/9px-Lock-red-alt-2.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output .cs1-subscription,.mw-parser-output .cs1-registration{color:#555}.mw-parser-output .cs1-subscription span,.mw-parser-output .cs1-registration span{border-bottom:1px dotted;cursor:help}.mw-parser-output .cs1-ws-icon a{background:url("//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Wikisource-logo.svg/12px-Wikisource-logo.svg.png")no-repeat;background-position:right .1em center}.mw-parser-output code.cs1-code{color:inherit;background:inherit;border:inherit;padding:inherit}.mw-parser-output .cs1-hidden-error{display:none;font-size:100%}.mw-parser-output .cs1-visible-error{font-size:100%}.mw-parser-output .cs1-maint{display:none;color:#33aa33;margin-left:0.3em}.mw-parser-output .cs1-subscription,.mw-parser-output .cs1-registration,.mw-parser-output .cs1-format{font-size:95%}.mw-parser-output .cs1-kern-left,.mw-parser-output .cs1-kern-wl-left{padding-left:0.2em}.mw-parser-output .cs1-kern-right,.mw-parser-output .cs1-kern-wl-right{padding-right:0.2em}
^ ab Cosma, Ioana A.; Clifford, Peter (2011). "A statistical analysis of probabilistic counting algorithms". Scandinavian Journal of Statistics.
^ Giroire, Frederic; Fusy, Eric (2007). 2007 Proceedings of the Fourth Workshop on Analytic Algorithmics and Combinatorics (ANALCO). pp. 223–231. CiteSeerX 10.1.1.214.270. doi:10.1137/1.9781611972979.9. ISBN 978-1-61197-297-9.
^ Chassaing, Philippe; Gerin, Lucas (2006). "Efficient estimation of the cardinality of large data sets". Proceedings of the 4th Colloquium on Mathematics and Computer Science. arXiv:math/0701347. Bibcode:2007math......1347C.
^ ab Cohen, Edith (1997). "Size-estimation framework with applications to transitive closure and reachability". J. Comput. Syst. Sci.
^ ab Flajolet, Philippe; Fusy, Eric; Gandouet, Olivier; Meunier, Frederic (2007). "HyperLoglog: the analysis of a near-optimal cardinality estimation algorithm". Analysis of Algorithms.
^ Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications". J. Comput. Syst. Sci.
^ Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An Optimal Algorithm for the Distinct Elements Problem". Proceedings of the 29th Annual ACM Symposium on Principles of Database Systems (PODS).
^ Cohen, Edith; Kaplan, Haim (2008). "Tighter estimation using bottom k sketches" (PDF). PVLDB.
^ Metwally, Ahmed; Agrawal, Divyakant; Abbadi, Amr El (2008), Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic, Proceedings of the 11th international conference on Extending Database Technology: Advances in Database Technology
^ Cohen, Reuven; Katzir, Liran; Yehezkel, Aviv (2014). "A Unified Scheme for Generalizing Cardinality Estimators to Sum Aggregation". Information Processing Letters. 115 (2): 336–342. doi:10.1016/j.ipl.2014.10.009.