Vybrané partie z dátových štruktúr
2-INF-237, LS 2016/17
Úsporné dátové štruktúry
Z VPDS
Uvod do uspornych datovych struktur
- Retazec dlzky n nad abecedou velkosti sigma mozeme ulozit pomocou n lg sigma bitov (predpokladajme pre jednoduchost, ze sigma je mocnina 2)
- napr pre binarny retazec potrebuje n bitov
- Sufixove pole na umoznuje rychlo hladat vzorku, ale potrebuje n lg n bitov, pripadne 3 n lg n, co je rychlejsie rastuca funkcia ako n
- sufixove stromy maju tiez Theta(n log n) bitov s vacsou konstantou
- vedeli by sme rychlo vyhladavat vzorku aj s ovela mensim indexom?
- oblast uspornych datovych struktur
Uvod do rank/select
Chceme reprezentovat staticku mnozinu A subseteq {0..n-1}, nech m = |A|
- pole cisel v A v utriedenom poradi (m log n bitov)
- bitovy vektor dlzky n (n bitov)
Pole Bitovy vektor Pamat v bitoch m log n n Je x v A? O(log m) O(1) rank(x)>rank(x-1) max. y v A t.z. y<=x O(log m) O(n) select(rank(x)) i-ty najmensi prvok z A O(1) O(n) select(i) kolko je v A prvkov <=x O(log m) O(n) rank(x)
Ak m je pomerne velke (cca Omega(n/log n)), pamat mensia s vyuzitim bitoveho pola
Ak chcem rank v O(1)
- trivialne riesenie: pamatam si rank(x) pre kazde x, n log n bitov
- ukazeme si, ze staci o(n) bitov navyse k bitovemu vektoru s n bitmi
- prakticke riesenie:
- zvolim si hodnotu t
- pole R dlzky n/t, R[i] = rank(i*t)
- rank(x) = R[floor(x/t)]+pocet jedniciek v A[t*floor(x/t)+1..x]
- pamat n + (n/t)*lg(n) bitov, cas O(t)
- O(1) riesenie zalozene na tejto myslienke, trochu zlozitejsie detaily
Ak chcem select v O(1)
- trivialne riesenie je utriedene pole prvkov (pozri vyssie)
- da sa spravit usporne o(n) bitov navyse k bitovemu vektoru s n bitmi, ale zlozitejsie
- prakticky kompromis: binarne vyhladavanie, log n volani rank
Rank
- overview na slajde
- rank = rank of super-block + rank of block within superblock + rank of position within block
- block and superblock: integer division (or shifts of sizes power of 2)
- time O(1), memory n + O(n log log n / log n) bits
Select
- let t1 = lg n lglg n
- store select(t1 * i) for each i = 0..n/t1 memory O(n / t1 * lg n) = O(n/ lg lg n)
- divides bit vector into super-blocks of unequal size
- consider one such block of size r. Distinguish between small super-block of size r<t1^2 and big super-block of size >= t1^2
- large super-block: store array of indices of 1 bits
- at most n/(t1 ^2) large super-blocks, each has t1 1 bits, each 1 bit needs lg n bits, overall O(n/lg lg n) bits
- small super-block:
- repeat the same with t2 = (lg lg n)^2
- store select(t2*i) within super-block for each i = 0..r/t2 memory O(n/t2 * lg lg n) = O(n/lg lg n)
- divides super-block into blocks, split blocks into small (size less than t2^2) and large
- for large blocks store relative indices of 1 bits: n/(t2 ^2) large blocks, each has t2 1 bits, each 1 bit needs lg t1 = O(lg lg n) bits, overall O(n/log log n)
- for small blocks of size at most t2^2, store as (t2^2)-bit integer, lookup table of all answers. Size of table is O(2^(t2^2) * t2 * log(t1)); t2^2 < (1/2) log n for large enough n, so 2^(t2^2) is O(sqrt(n)), the rest is polylog
Wavelet tree: Rank nad vacsou abecedou
- nad binarnou abecedou vieme spravit rank_1(i): pocet vyskytov 1 v A[0..i]
- poznamka: ak chceme spocitat rank_0(i), pouzijeme i+1-rank_1(i)
- original paper: Grossi, Gupta and Vitter 2003 High-order entropy-compressed text indexes
- survey G. Navarro, Wavelet Trees for All, Proceedings of 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), 2012
- http://alexbowe.com/wavelet-trees/
- namiesto bitoveho pola mame retazec S nad vacsou abecedou Sigma
- Chceme podporovat rank_a(i), co je pocet vyskytov pismena a v retazci S[0..i]
- OPT je n log_2 sigma
- Pouzijeme rank na binarnych retazcoch nasledovne:
- rozdelime abecedu na dve casti Sigma_0 a Sigma_1
- Vytvorime bin. retazec B dlzky n: B[i]=j iff S[i] je v Sigma_j
- Vytvorime retazce S_0 a S_1, kde S_j obsahuje pismena z S, ktore su v Sigma_j, sucet dlzok je n
- Rekurzivne pokracujeme v deleni abecedy a retazcov, az kym nedostaneme abecedy velkosti 2, kde pouzijeme binarny rank
- rank_a(i) v tejto strukture:
- ak a in Sigma_j, i_2=rank_j(i) v retazci B
- pokracuj rekurzivne s vypoctom rank_a(i_2) v lavom alebo pravom podstrome
- Velkost: binarne retazce n log_2 sigma + struktury pre rank nad bin. retazcami (mozem skonkatenovat na kazdej urovni a dostat o(n log_2 sigma)
- cas O(log sigma), robim jeden rank na kazdej urovni
- extrahovanie S[i] tiez v case O(log sigma)