Vybrané partie z dátových štruktúr
2-INF-237, LS 2016/17

Úvod · Pravidlá · Prednášky · Prezentácia · Ako poznámkovať · Moodle
Táto stránka sa týka školského roku 2016/17. V školskom roku 2017/18 predmet vyučuje Jakub Kováč, stránku predmetu je https://people.ksp.sk/~kuko/ds


Hešovanie: Rozdiel medzi revíziami

Z VPDS
Prejsť na: navigácia, hľadanie
(Bloom filters (1970))
Riadok 18: Riadok 18:
 
** if yes, claim x is in the set, but possibility of error
 
** if yes, claim x is in the set, but possibility of error
 
** if no, answer no, surely true
 
** if no, answer no, surely true
* if hash functions totally random and independent (impractical) and  b = lg(e)kn bits, error is at most 2^{−k}, independent of the size of the universe
+
 
** lg(e) is approx 1.44. for 1% error, k=7, about 10 bits per element
+
Lemma: if hash functions totally random and independent (impractical) and  b = lg(e)kn bits, error is at most 2^{−k}.
 +
* proof later
 +
* error independent of the size of the universe U
 +
* lg(e) is approx 1.44. for 1% error, k=7, about 10 bits per element
 
* compare to a hash table, which needs at lead n lg u bits to store data items themselves
 
* compare to a hash table, which needs at lead n lg u bits to store data items themselves
* uses e.g. an approximate index of a larger data structure on a disk - if x not in the filter, do not bother looking at the disk, but small number of false positives not a problem
+
 +
Use of Bloom filters
 +
* e.g. an approximate index of a larger data structure on a disk - if x not in the filter, do not bother looking at the disk, but small number of false positives not a problem
 +
 
 +
Proof of lemma
 +
 
 +
References
 +
* Bloom, Burton H. (1970), [http://dx.doi.org/10.1145%2F362686.362692 "Space/Time Trade-offs in Hash Coding with Allowable Errors"], Communications of the ACM 13 (7): 422–426
 +
* Bloom filter. In Wikipedia, The Free Encyclopedia. Retrieved January 17, 2015, from http://en.wikipedia.org/w/index.php?title=Bloom_filter&oldid=641921171
  
 
===Locality Sensitive Hashing===
 
===Locality Sensitive Hashing===

Verzia zo dňa a času 10:16, 17. január 2015

Zdroje

Osnova

  • vlastnosti hešovacích funkcií: totally random, universal, k-wise independent, simple tabulation;
  • chaining, perfect hashing, linear probing,
  • Bloom filters, locality sensitive hashing

V sylabe na skúšku je len perfect hashing, rozoberané na Erikovej prednáške

Bloom filters (1970)

  • a bit string of length b, and k hash functions hi : U -> {1, . . . , b}.
  • insert(x): set bits h1 (x), . . . , hk (x) to 1.
  • test if x in the set: compute h1 (y), . . . , hk (y) and check whether all these bits are 1
    • if yes, claim x is in the set, but possibility of error
    • if no, answer no, surely true

Lemma: if hash functions totally random and independent (impractical) and b = lg(e)kn bits, error is at most 2^{−k}.

  • proof later
  • error independent of the size of the universe U
  • lg(e) is approx 1.44. for 1% error, k=7, about 10 bits per element
  • compare to a hash table, which needs at lead n lg u bits to store data items themselves

Use of Bloom filters

  • e.g. an approximate index of a larger data structure on a disk - if x not in the filter, do not bother looking at the disk, but small number of false positives not a problem

Proof of lemma

References

Locality Sensitive Hashing

  • Har-Peled, Sariel, Piotr Indyk, and Rajeev Motwani. "Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality." Theory of Computing 8.1 (2012): 321-350.
    • Proprocess: A set P of n binary sequences of length k, radius r, gap c
    • Query: binary sequence q of length q
    • Output: If there exist a sequence within Hamming distance r from q, find a sequence within r*c of q with probability at least f.
    • Hash each point in P using L hashing functions, each using O(log n) randomly chosen sequence positions (hidden constant depends on d, r, c), put all results to bins in one table
    • Given q, find it using all L hash function, check every collision, report if distance at most cr. If checked more than 3L collisions, stop.
    • L = O(n^{1/c})
    • Space O(nL), query time O(Ld+L(d/r)log n)
    • Probability of failure less than some constant < 1, can be boosted by repeating independently