Vybrané partie z dátových štruktúr
2-INF-237, LS 2016/17

Úvod · Pravidlá · Prednášky · Prezentácia · Ako poznámkovať · Moodle
Táto stránka sa týka školského roku 2016/17. V školskom roku 2017/18 predmet vyučuje Jakub Kováč, stránku predmetu je https://people.ksp.sk/~kuko/ds


Streaming model: Rozdiel medzi revíziami

Z VPDS
Prejsť na: navigácia, hľadanie
(Count-Min sketch)
(Count-Min sketch)
Riadok 28: Riadok 28:
 
** for a fixed row i: <math>E[A[i,h_i(j)] = F[j]+ \sum_{k\ne j} F[k]/w</math>, because every other element k has probability 1/w to hash to column h_i(j) and thus it cntributes F[k]/w to  
 
** for a fixed row i: <math>E[A[i,h_i(j)] = F[j]+ \sum_{k\ne j} F[k]/w</math>, because every other element k has probability 1/w to hash to column h_i(j) and thus it cntributes F[k]/w to  
 
** <math>E[A[i,h_i(j)]\le F[j]+M\epsilon/e</math>
 
** <math>E[A[i,h_i(j)]\le F[j]+M\epsilon/e</math>
** By Markov inequality <math>\Pr(A[i,h_i(j)]-F[j]\le M\epsilon) \le 1/e</math>
+
** By Markov inequality <math>\Pr(A[i,h_i(j)]-F[j]\ge M\epsilon) \le 1/e</math>
 
** Probability that this happens in every row i is at most <math>e^{-d} \le \delta</math>
 
** Probability that this happens in every row i is at most <math>e^{-d} \le \delta</math>
 
* Memory does not depend on n
 
* Memory does not depend on n
 
* Note: if we insert million elements and then delete all but 4 of them, in the final structure we are very likely to be able to identify the 4 remaining ones as M is only 4
 
* Note: if we insert million elements and then delete all but 4 of them, in the final structure we are very likely to be able to identify the 4 remaining ones as M is only 4

Verzia zo dňa a času 17:31, 17. apríl 2017

Definition

  • Input: stream of n elements
  • Goal: do one pass through the stream (or only a few passes), use a small memory (e.g. O(1) or O(log n)) and answer a specific query about the stream
    • Good for processing very fast streams, such as IP packets passing a router or measurements of a very low-powered device
    • Not enough memory to store the whole stream, each item should be processed fast
  • Strict turnstile model:
    • underlying set {0,...,n-1}
    • virtual vector F of length n initialized to zeroes
    • stream consists of operations (j,x) meaning F[j]+=x
    • At every point we have F[j]>=0 for each j
  • Cash register model: all values x are positive
  • Example: if all values x are +1, we are simply counting frequencies of elements from {0,...,n-1} on input

Count-Min sketch

  • Strict turnstile model
  • CM sketch with parameters epsilon and delta
  • Array of counters A of depth d=\lceil \ln(1/\delta )\rceil and width w=\lceil e/\epsilon \rceil
  • Each row i of A has a hash function h_i from {0,...,n-1} to {0,...,w-1} (assume totally random, but a weaker assumption of pairwise independence sufficient for analysis)
  • Update (j,x) does A[i,h_i(j)]+=x for all hash function i=0,...,n-1
  • Query F[j]=? returns min_i A[i,h_i(j)]
  • Let F be the correct answer, F' the answer returned
  • Clearly F'>=F, because each A[i,k] may have contributions from other elements as well
  • With probability at least 1-delta we have F'[j]\leq F[j]+\epsilon \sum _{i}F[i]
  • Proof:
    • Let M=\sum _{k}F[k]
    • for a fixed row i: E[A[i,h_{i}(j)]=F[j]+\sum _{{k\neq j}}F[k]/w, because every other element k has probability 1/w to hash to column h_i(j) and thus it cntributes F[k]/w to
    • E[A[i,h_{i}(j)]\leq F[j]+M\epsilon /e
    • By Markov inequality \Pr(A[i,h_{i}(j)]-F[j]\geq M\epsilon )\leq 1/e
    • Probability that this happens in every row i is at most e^{{-d}}\leq \delta
  • Memory does not depend on n
  • Note: if we insert million elements and then delete all but 4 of them, in the final structure we are very likely to be able to identify the 4 remaining ones as M is only 4