Vybrané partie z dátových štruktúr
2-INF-237, LS 2016/17

Úvod · Pravidlá · Prednášky · Prezentácia · Ako poznámkovať · Moodle
Táto stránka sa týka školského roku 2016/17. V školskom roku 2017/18 predmet vyučuje Jakub Kováč, stránku predmetu je https://people.ksp.sk/~kuko/ds


Úsporné dátové štruktúry: Rozdiel medzi revíziami

Z VPDS
Prejsť na: navigácia, hľadanie
(Rank in compressed string: RRR)
(Rank in compressed string: RRR)
 
22 medziľahlých revízií od jedného používateľa nie je zobrazených.
Riadok 1: Riadok 1:
==Uvod do uspornych datovych struktur==
 
* Retazec dlzky n nad abecedou velkosti sigma mozeme ulozit pomocou n lg sigma bitov (predpokladajme pre jednoduchost, ze sigma je mocnina 2)
 
* napr pre binarny retazec potrebuje n bitov
 
* Sufixove pole na umoznuje rychlo hladat vzorku, ale potrebuje n lg n bitov, pripadne 3 n lg n, co je rychlejsie rastuca funkcia ako n
 
* sufixove stromy maju tiez Theta(n log n) bitov s vacsou konstantou
 
* vedeli by sme rychlo vyhladavat vzorku aj s ovela mensim indexom?
 
* oblast uspornych datovych struktur
 
 
 
==Rank and select==
 
==Rank and select==
===Uvod do rank/select===
+
===Introduction to rank/select===
Chceme reprezentovat staticku mnozinu A subseteq {0..n-1}, nech m = |A|
+
Our goal is to represent a static set <math>A \subseteq \{0\dots n-1\}</math>, let m = |A|
* pole cisel v A v utriedenom poradi (m log n bitov)
+
 
* bitovy vektor dlzky n (n bitov)
+
Two natural representations:
 +
* sorted array of elements of A  
 +
* bit vector of length n  
 
<pre>
 
<pre>
                          Pole       Bitovy vektor              
+
                                Array       Bit vector              
Pamat v bitoch            m log n        n       
+
Memory in bits                  m log n        n       
Je x v A?                 O(log m)      O(1)        rank(x)>rank(x-1)                 
+
Is x in A?                     O(log m)      O(1)        rank(x)>rank(x-1)                 
max. y v A t.z. y<=x     O(log m)      O(n)        select(rank(x))
+
max. y in A s.t. y<=x           O(log m)      O(n)        select(rank(x))
i-ty najmensi prvok z A   O(1)          O(n)        select(i)
+
i-th smallest element of A     O(1)          O(n)        select(i)
kolko je v A prvkov <=x   O(log m)      O(n)        rank(x)
+
how many elements of A are <=x O(log m)      O(n)        rank(x)
 
</pre>
 
</pre>
Ak m je pomerne velke (cca Omega(n/log n)), pamat mensia s vyuzitim bitoveho pola
+
If m relatively large, e.g. <math>m=\Omega(n/\log n)</math>, bit array uses less memory
  
Ak chcem rank v O(1)
+
To get rank in O(1)
* trivialne riesenie: pamatam si rank(x) pre kazde x, n log n bitov
+
* trivial solution: keep rank(x) for each <math>x\in \{0,\dots, n-1\}</math>, n log n bits
* ukazeme si, ze staci o(n) bitov navyse k bitovemu vektoru s n bitmi
+
* we will show that the bit vector of size n and o(n) additional bits are sufficient
* prakticke riesenie:  
+
* a simple practical solution for space reduction:  
** zvolim si hodnotu t
+
** choose value t
** pole R dlzky n/t, R[i] = rank(i*t)  
+
** array R of length n/t, R[i] = rank(i*t)  
** rank(x) = R[floor(x/t)]+pocet jedniciek v A[t*floor(x/t)+1..x]
+
** rank(x) = R[floor(x/t)]+number of 1 bits in A[t*floor(x/t)+1..x]
** pamat n + (n/t)*lg(n) bitov, cas O(t)
+
** memory n + (n/t)*lg(n) bits, time O(t)
* O(1) riesenie zalozene na tejto myslienke, trochu zlozitejsie detaily
+
* O(1) solution based on this idea, slightly more complex details
  
Ak chcem select v O(1)
+
To get select in O(1)
* trivialne riesenie je utriedene pole prvkov (pozri vyssie)
+
* trivial solution is sorted array (see above)
* da sa spravit usporne o(n) bitov navyse k bitovemu vektoru s n bitmi, ale zlozitejsie
+
* can be done in n+o(n) bits, more complex than rank
* prakticky kompromis: binarne vyhladavanie, log n volani rank
+
* practical compromise: binary search using log n calls of rank
  
Rank
+
===Rank===
 
* overview na slajde
 
* overview na slajde
 
* rank = rank of super-block + rank of block within superblock + rank of position within block
 
* rank = rank of super-block + rank of block within superblock + rank of position within block
Riadok 43: Riadok 37:
 
* time O(1), memory n + O(n log log n / log n) bits
 
* time O(1), memory n + O(n log log n / log n) bits
  
Select
+
data structures:
 +
* R1: Array of ranks at superblock boundaries
 +
* R2: Array of ranks at block boundaries within superblocks
 +
* R3: precomputed table of ranks for each type of block and each position within block of size t2
 +
* B: bit array
 +
 
 +
rank(i):
 +
* superblock = i/t1 (integer division)
 +
* block = i/t2
 +
* index = block*t2;
 +
* type = B[index..index+t2-1]
 +
* return R1[superblock]+R2[block]+R3[type, i%t2]
 +
 
 +
===Select===
 
* let t1 = lg n lglg n
 
* let t1 = lg n lglg n
 
* store select(t1 * i) for each i = 0..n/t1 memory O(n / t1 * lg n) = O(n/ lg lg n)
 
* store select(t1 * i) for each i = 0..n/t1 memory O(n / t1 * lg n) = O(n/ lg lg n)
Riadok 57: Riadok 64:
 
** for small blocks of size at most t2^2, store as (t2^2)-bit integer, lookup table of all answers. Size of table is O(2^(t2^2) * t2 * log(t1)); t2^2  < (1/2) log n for large enough n, so 2^(t2^2) is O(sqrt(n)), the rest is polylog
 
** for small blocks of size at most t2^2, store as (t2^2)-bit integer, lookup table of all answers. Size of table is O(2^(t2^2) * t2 * log(t1)); t2^2  < (1/2) log n for large enough n, so 2^(t2^2) is O(sqrt(n)), the rest is polylog
  
==Wavelet tree: Rank nad vacsou abecedou==
+
==Wavelet tree: Rank over larger alphabets==
* nad binarnou abecedou vieme spravit rank_1(i): pocet vyskytov 1 v A[0..i]
+
* look at rank as a string operation over binary string B: count the number 1's in a prefix B[0..i]
* poznamka: ak chceme spocitat rank_0(i), pouzijeme i+1-rank_1(i)
+
* denote it more explicitly as rank_1(i)
 +
* note: we can compute rank_0(i) as i+1-rank_1(i)
  
* original paper: Grossi, Gupta and Vitter 2003 High-order entropy-compressed text indexes
+
* instead of a binary string consider string S over a larger alphabet of size sigma
* survey G. Navarro, Wavelet Trees for All, Proceedings of 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), 2012
+
* rank_a(i): the number of occurrences of character a in a prefix S[0..i] of string S
* http://alexbowe.com/wavelet-trees/
+
* OPT is n lg sigma
 
+
* Use rank on binary strings as follows:
* namiesto bitoveho pola mame retazec S nad vacsou abecedou Sigma
+
* Chceme podporovat rank_a(i), co je pocet vyskytov pismena a v retazci S[0..i]
+
* OPT je n lg sigma
+
* Pouzijeme rank na binarnych retazcoch nasledovne:
+
 
** rozdelime abecedu na dve casti Sigma_0 a Sigma_1  
 
** rozdelime abecedu na dve casti Sigma_0 a Sigma_1  
 
** Vytvorime bin. retazec B dlzky n: B[i]=j iff S[i] je v Sigma_j
 
** Vytvorime bin. retazec B dlzky n: B[i]=j iff S[i] je v Sigma_j
 +
** Predspracujeme B pre rank
 
** Vytvorime retazce S_0 a S_1, kde S_j obsahuje pismena z S, ktore su v Sigma_j, sucet dlzok je n
 
** Vytvorime retazce S_0 a S_1, kde S_j obsahuje pismena z S, ktore su v Sigma_j, sucet dlzok je n
 
** Rekurzivne pokracujeme v deleni abecedy a retazcov, az kym nedostaneme abecedy velkosti 2, kde pouzijeme binarny rank
 
** Rekurzivne pokracujeme v deleni abecedy a retazcov, az kym nedostaneme abecedy velkosti 2, kde pouzijeme binarny rank
Riadok 76: Riadok 81:
 
** ak a in Sigma_j, i_2=rank_j(i) v retazci B
 
** ak a in Sigma_j, i_2=rank_j(i) v retazci B
 
** pokracuj rekurzivne s vypoctom  rank_a(i_2) v lavom alebo pravom podstrome
 
** pokracuj rekurzivne s vypoctom  rank_a(i_2) v lavom alebo pravom podstrome
* Velkost: binarne retazce n log_2 sigma + struktury pre rank nad bin. retazcami (mozem skonkatenovat na kazdej urovni a dostat o(n log_2 sigma)
+
* Velkost: binarne retazce n log_2 sigma + struktury pre rank nad bin. retazcami (mozem skonkatenovat na kazdej urovni a dostat o(n log sigma) + O(sigma lg n) pre strom ako taky
 
* cas O(log sigma), robim jeden rank na kazdej urovni
 
* cas O(log sigma), robim jeden rank na kazdej urovni
 
* extrahovanie S[i] tiez v case O(log sigma)
 
* extrahovanie S[i] tiez v case O(log sigma)
Riadok 83: Riadok 88:
 
** pokracuj rekurzivne s hladanim S_j[i2]
 
** pokracuj rekurzivne s hladanim S_j[i2]
  
===Back to FM index===
+
* original paper: Grossi, Gupta and Vitter 2003 High-order entropy-compressed text indexes
* jednoducha finta na zlepsenie pamate pre rank:
+
* survey G. Navarro, Wavelet Trees for All, Proceedings of 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), 2012
** ulozime si rank len pre niektore i, zvysne dopocitame na zaklade pocitania vyskytov v L (vydelime pamat k, vynasobime cas k)
+
* http://alexbowe.com/wavelet-trees/
* rank + L pomocou wavelet tree v n lg sigma + o(n log sigma) bitoch, hladanie vyskytov sa spomali na O(m log sigma)
+
* ako kompaktnejsie ulozit SA
+
* mozeme ulozit SA len pre niektore riadky, pre ine sa pomocou LF transformacie dostaneme k najblizsiemu riadku pre ktory mame SA
+
** ako presne ulozit "niektore riadky" ked su nerovnomerne rozmiestnene v SA (rovnomerne v T)?
+
* samotne L mozeme ulozit aj komprimovanie s vhodnymi indexami, vid. clanok od F a M alebo ho mame reprezentovane implicitne vo wavelet tree
+
  
 
==Rank in compressed string: RRR==
 
==Rank in compressed string: RRR==
Riadok 118: Riadok 118:
 
* Let S be a string in which a occurs n_a times
 
* Let S be a string in which a occurs n_a times
 
* Its entropy is H(S) = sum_a \frac{n_a}{n} \log \frac{n}{n_a}
 
* Its entropy is H(S) = sum_a \frac{n_a}{n} \log \frac{n}{n_a}
* let x_i be the number of 1s in block i, let n_1 be the overall number of 1s
+
* let x_i be the number of 1s in block i, let n_1 be the overall number of 1s, i.e. <math>n_1 = \sum x_i
  
* <math>\sum_i \lceil \log_2 {t_2 \choose x_{i}}\rceil</math>
+
* <math>\sum_i \lceil \lg {t_2 \choose x_{i}}\rceil</math>
* <math>\le \frac{n}{t_2}+\sum_i \log_2 {t_2 \choose x_{i}}</math>
+
* <math>\le \frac{n}{t_2}+\sum_i \lg {t_2 \choose x_{i}}</math>
 
* <math>= o(n)+\lg \prod_i {t_2 \choose x_{i}}</math>
 
* <math>= o(n)+\lg \prod_i {t_2 \choose x_{i}}</math>
 
* <math>\le o(n)+\lg {n \choose n_1}</math> (choice of x_i in each block is subset of choice of n_1 overall)
 
* <math>\le o(n)+\lg {n \choose n_1}</math> (choice of x_i in each block is subset of choice of n_1 overall)
Riadok 128: Riadok 128:
 
* <math>\lg {n \choose n_1} = [\ln(n!) - \ln(n_1!) - \ln(n_0!))]/\ln(2)</math>
 
* <math>\lg {n \choose n_1} = [\ln(n!) - \ln(n_1!) - \ln(n_0!))]/\ln(2)</math>
 
* <math>= [n\ln(n)-n-n_1\ln(n_1)+n_1-n_0\ln n_0+n_0+O(\log n)]/\ln(2)</math>
 
* <math>= [n\ln(n)-n-n_1\ln(n_1)+n_1-n_0\ln n_0+n_0+O(\log n)]/\ln(2)</math>
* <math>= n_1\log_2 (n/n_1) + n_0\log_2 (n/n_0) + O(\log n)</math> (linear terms cancel out, n\ln(n) divided into two parts: n_0\ln(n) and n_1\ln(n))
+
* <math>= n_1\lg (n/n_1) + n_0\lg (n/n_0) + O(\log n)</math> (linear terms cancel out, n\ln(n) divided into two parts: n_0\ln(n) and n_1\ln(n))
 
* <math>= n H(B)+O(\log n)</math>
 
* <math>= n H(B)+O(\log n)</math>
  
Riadok 139: Riadok 139:
  
 
Instead of wavelet tree, store indicator vector for each character from alphabet in RRR string
 
Instead of wavelet tree, store indicator vector for each character from alphabet in RRR string
* <math>\sum_a [n_a \lg \frac{n}{n_a} + (n-n_a)\lg \frac{n}{n-a} + o(n)]</math>
+
* <math>\sum_a [n_a \lg \frac{n}{n_a} + (n-n_a)\lg \frac{n}{n-n_a} + o(n)]</math>
* the first terms in the sum for n H(T), the last terms sum to <math>o(\sigma n)</math>, middle terms can be bound by O(n)
+
* the first terms in the sum equal to n H(T), the last terms sum to <math>o(\sigma n)</math>, middle terms can be bound by O(n)
* e.g. if <math>n_a = n/\sigma</math>, the sum of the middle terms will be
+
* <math>\lg \frac{n}{n-n_a} = \lg(1+\frac{n_a}{n-n_a}) = \ln(1+\frac{n_a}{n-n_a})/\ln(2) \le (1/\ln(2))\cdot \frac{n_a}{n-n_a}</math>
** <math>\sigma (n-n/\sigma)\lg \frac{n}{n-n/\sigma} \le n \sigma \lg \frac{\sigma}{\sigma-1} = n \lg  \left(\frac{\sigma}{\sigma-1}\right)^\sigma</math>
+
* We have used inequality ln(1+x)<=x for x>=-1
** <math>\left(\frac{\sigma}{\sigma-1}\right)^\sigma = \left(1+\frac{1}{\sigma-1}\right)^{\sigma-1\frac{\sigma}{\sigma-1}}=O(1)</math>
+
* Then <math>\sum_a (n-n_a)\lg \frac{n}{n-n_a}\le (1/\ln(2)) \sum_a (n-n_a)\frac{n_a}{n-n_a} = (1/\ln(2)) \sum_a n_a = n/\ln(2) = O(n)</math>
  
 
==Binarny strom==
 
==Binarny strom==
Riadok 157: Riadok 157:
 
   A  B  C  D  E  .  F  .  .  G  H  .  .  .  .  .  .
 
   A  B  C  D  E  .  F  .  .  G  H  .  .  .  .  .  .
 
</pre>
 
</pre>
Lemma: Pre i-ty vnutorny vrchol (i-ta jednotka vo vektore) su jeho deti na poziciach 2i a 2i+1
+
Lemma: Pre i-ty vnutorny vrchol (i-ta jednotka vo vektore, na pozicii select(i)) su jeho deti na poziciach 2i a 2i+1
 
* indukcia na i
 
* indukcia na i
 
* pre i=1 (koren) ma deti na poziciach 2 a 3
 
* pre i=1 (koren) ma deti na poziciach 2 a 3
Riadok 185: Riadok 185:
 
*'''Iny dokaz''' [http://www.ams.org/journals/bull/2002-39-01/S0273-0979-01-00928-4/]
 
*'''Iny dokaz''' [http://www.ams.org/journals/bull/2002-39-01/S0273-0979-01-00928-4/]
 
** Nech X(n,m,k) je mnozina vsetkych postupnosti dlzky n+m obsahujucich n +1, m -1 a takych ze kazdy prefixovy sucet je aspon k.  
 
** Nech X(n,m,k) je mnozina vsetkych postupnosti dlzky n+m obsahujucich n +1, m -1 a takych ze kazdy prefixovy sucet je aspon k.  
** |X(n,m,-infty)| = n+m nad n
+
** <math>|X(n,m,-\infty)| = {n+m \choose n}</math>
 
** my chceme |X(n,n,0)|
 
** my chceme |X(n,n,0)|
 
** nech Y = X(n,n,-infty)-X(n,n,0), t.j. vsetky zle postupnsti
 
** nech Y = X(n,n,-infty)-X(n,n,0), t.j. vsetky zle postupnsti
Riadok 192: Riadok 192:
 
** naopak vezmime postupnost z X(n-1,n+1,-infty), najdeme prvy bod, kde je zaporny prefixovy sucet (-1), od toho bodu napravo zmenme znamienka, dostavame postupnost z Y
 
** naopak vezmime postupnost z X(n-1,n+1,-infty), najdeme prvy bod, kde je zaporny prefixovy sucet (-1), od toho bodu napravo zmenme znamienka, dostavame postupnost z Y
 
** tieto dve zobrazenia su navzajom inverzne, preto musi ist o bijekciu
 
** tieto dve zobrazenia su navzajom inverzne, preto musi ist o bijekciu
** preto |X(n,n,0)| = |X(n,n,-infty)|-|X(n-1,n+1,-infty)|=(2n nad n) - (2n nad n-1) = (2n nad n)/(n+1)
+
** preto <math>|X(n,n,0)| = |X(n,n,-\infty)|-|X(n-1,n+1,-\infty)|={2n \choose n} - {2n \choose n-1}</math>
 +
** <math>{2n \choose n} - {2n \choose n-1} = \frac{(2n)!}{n!n!}-\frac{(2n)!}{(n-1)!(n+1)!} = \frac{(2n)!(n+1-n)}{n!(n+1)!} = {2n \choose n}/(n+1)</math>
 +
 
 +
==Connection to FM index (to be covered later in the course)==
 +
* Retazec dlzky n nad abecedou velkosti sigma mozeme ulozit pomocou n lg sigma bitov (predpokladajme pre jednoduchost, ze sigma je mocnina 2)
 +
* napr pre binarny retazec potrebuje n bitov
 +
* Sufixove pole na umoznuje rychlo hladat vzorku, ale potrebuje n lg n bitov, pripadne 3 n lg n, co je rychlejsie rastuca funkcia ako n
 +
* sufixove stromy maju tiez Theta(n log n) bitov s vacsou konstantou
 +
* vedeli by sme rychlo vyhladavat vzorku aj s ovela mensim indexom?
 +
* oblast uspornych datovych struktur
 +
 
 +
===Back to FM index===
 +
* jednoducha finta na zlepsenie pamate pre rank:
 +
** ulozime si rank len pre niektore i, zvysne dopocitame na zaklade pocitania vyskytov v L (vydelime pamat k, vynasobime cas k)
 +
* rank + L pomocou wavelet tree v n lg sigma + o(n log sigma) bitoch, hladanie vyskytov sa spomali na O(m log sigma)
 +
* ako kompaktnejsie ulozit SA
 +
* mozeme ulozit SA len pre niektore riadky, pre ine sa pomocou LF transformacie dostaneme k najblizsiemu riadku pre ktory mame SA
 +
** let us say we want to print SA[i], but it is not stored, let denote its unknown value x
 +
** L[i] is T[x-1]
 +
** let j = LF[i], j is the row in which SA[j]=x-1
 +
** if SA[j] stored, return SA[j]+1, otherwise continue in the same way
 +
** ako presne ulozit "niektore riadky" ked su nerovnomerne rozmiestnene v SA (rovnomerne v T)?
 +
* samotne L mozeme ulozit aj komprimovane s vhodnymi indexami, vid. clanok od F a M alebo ho mame reprezentovane implicitne vo wavelet tree

Aktuálna revízia z 21:34, 23. máj 2017

Rank and select

Introduction to rank/select

Our goal is to represent a static set A\subseteq \{0\dots n-1\}, let m = |A|

Two natural representations:

  • sorted array of elements of A
  • bit vector of length n
                                Array        Bit vector             
Memory in bits                  m log n         n       
Is x in A?                      O(log m)       O(1)         rank(x)>rank(x-1)                 
max. y in A s.t. y<=x           O(log m)       O(n)         select(rank(x))
i-th smallest element of A      O(1)           O(n)         select(i)
how many elements of A are <=x  O(log m)       O(n)         rank(x)

If m relatively large, e.g. m=\Omega (n/\log n), bit array uses less memory

To get rank in O(1)

  • trivial solution: keep rank(x) for each x\in \{0,\dots ,n-1\}, n log n bits
  • we will show that the bit vector of size n and o(n) additional bits are sufficient
  • a simple practical solution for space reduction:
    • choose value t
    • array R of length n/t, R[i] = rank(i*t)
    • rank(x) = R[floor(x/t)]+number of 1 bits in A[t*floor(x/t)+1..x]
    • memory n + (n/t)*lg(n) bits, time O(t)
  • O(1) solution based on this idea, slightly more complex details

To get select in O(1)

  • trivial solution is sorted array (see above)
  • can be done in n+o(n) bits, more complex than rank
  • practical compromise: binary search using log n calls of rank

Rank

  • overview na slajde
  • rank = rank of super-block + rank of block within superblock + rank of position within block
  • block and superblock: integer division (or shifts of sizes power of 2)
  • time O(1), memory n + O(n log log n / log n) bits

data structures:

  • R1: Array of ranks at superblock boundaries
  • R2: Array of ranks at block boundaries within superblocks
  • R3: precomputed table of ranks for each type of block and each position within block of size t2
  • B: bit array

rank(i):

  • superblock = i/t1 (integer division)
  • block = i/t2
  • index = block*t2;
  • type = B[index..index+t2-1]
  • return R1[superblock]+R2[block]+R3[type, i%t2]

Select

  • let t1 = lg n lglg n
  • store select(t1 * i) for each i = 0..n/t1 memory O(n / t1 * lg n) = O(n/ lg lg n)
  • divides bit vector into super-blocks of unequal size
  • consider one such block of size r. Distinguish between small super-block of size r<t1^2 and big super-block of size >= t1^2
  • large super-block: store array of indices of 1 bits
    • at most n/(t1 ^2) large super-blocks, each has t1 1 bits, each 1 bit needs lg n bits, overall O(n/lg lg n) bits
  • small super-block:
    • repeat the same with t2 = (lg lg n)^2
    • store select(t2*i) within super-block for each i = 0..n/t2 memory O(n/t2 * lg lg n) = O(n/lg lg n)
    • divides super-block into blocks, split blocks into small (size less than t2^2) and large
    • for large blocks store relative indices of 1 bits: n/(t2 ^2) large blocks, each has t2 1 bits, each 1 bit needs lg t1^2 = O(lg lg n) bits, overall O(n/log log n)
    • for small blocks of size at most t2^2, store as (t2^2)-bit integer, lookup table of all answers. Size of table is O(2^(t2^2) * t2 * log(t1)); t2^2 < (1/2) log n for large enough n, so 2^(t2^2) is O(sqrt(n)), the rest is polylog

Wavelet tree: Rank over larger alphabets

  • look at rank as a string operation over binary string B: count the number 1's in a prefix B[0..i]
  • denote it more explicitly as rank_1(i)
  • note: we can compute rank_0(i) as i+1-rank_1(i)
  • instead of a binary string consider string S over a larger alphabet of size sigma
  • rank_a(i): the number of occurrences of character a in a prefix S[0..i] of string S
  • OPT is n lg sigma
  • Use rank on binary strings as follows:
    • rozdelime abecedu na dve casti Sigma_0 a Sigma_1
    • Vytvorime bin. retazec B dlzky n: B[i]=j iff S[i] je v Sigma_j
    • Predspracujeme B pre rank
    • Vytvorime retazce S_0 a S_1, kde S_j obsahuje pismena z S, ktore su v Sigma_j, sucet dlzok je n
    • Rekurzivne pokracujeme v deleni abecedy a retazcov, az kym nedostaneme abecedy velkosti 2, kde pouzijeme binarny rank
  • rank_a(i) v tejto strukture:
    • ak a in Sigma_j, i_2=rank_j(i) v retazci B
    • pokracuj rekurzivne s vypoctom rank_a(i_2) v lavom alebo pravom podstrome
  • Velkost: binarne retazce n log_2 sigma + struktury pre rank nad bin. retazcami (mozem skonkatenovat na kazdej urovni a dostat o(n log sigma) + O(sigma lg n) pre strom ako taky
  • cas O(log sigma), robim jeden rank na kazdej urovni
  • extrahovanie S[i] tiez v case O(log sigma)
    • samotny retazec S teda nemusim ukladat
    • nech j = B[i], i_2 = rank_j(i) v retazci B
    • pokracuj rekurzivne s hladanim S_j[i2]
  • original paper: Grossi, Gupta and Vitter 2003 High-order entropy-compressed text indexes
  • survey G. Navarro, Wavelet Trees for All, Proceedings of 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), 2012
  • http://alexbowe.com/wavelet-trees/

Rank in compressed string: RRR

  • Raman, Raman, Rao, from their 2002 paper Succinct indexable dictionaries with applications to encoding k-ary trees and multisets
  • http://alexbowe.com/rrr/

class of a block: number of 1s

  • R1: Array of ranks at superblock boundaries
  • R2: Array of ranks at block boundaries within superblocks
  • R3: precomputed table of ranks for each class, each signature within class and each position within block of size t2
  • S1: Array of superblock starts in compressed bit array
  • S2: Array of block offsets in compressed bit array
  • C: class (the number of 1s) in each block
  • L: size of signature for each class (ceil lg (t2 choose class))
  • B: compressed bit array (concatenated signatures pf blocks)

rank(i):

  • superblock = i/t1 (integer division)
  • block = i/t2
  • index = S1[superblock]+S2[block]
  • class = C[block]
  • length = L[class]
  • signature = B[index..index+length-1]
  • return R1[superblock]+R2[block]+R3[class, signature, i%t2]

Analysis of RRR:

  • Let S be a string in which a occurs n_a times
  • Its entropy is H(S) = sum_a \frac{n_a}{n} \log \frac{n}{n_a}
  • let x_i be the number of 1s in block i, let n_1 be the overall number of 1s, i.e. n_{1}=\sum x_{i}*<math>\sum _{i}\lceil \lg {t_{2} \choose x_{{i}}}\rceil
  • \leq {\frac  {n}{t_{2}}}+\sum _{i}\lg {t_{2} \choose x_{{i}}}
  • =o(n)+\lg \prod _{i}{t_{2} \choose x_{{i}}}
  • \leq o(n)+\lg {n \choose n_{1}} (choice of x_i in each block is subset of choice of n_1 overall)


  • \lg {n \choose n_{1}}=[\ln(n!)-\ln(n_{1}!)-\ln(n_{0}!))]/\ln(2)
  • =[n\ln(n)-n-n_{1}\ln(n_{1})+n_{1}-n_{0}\ln n_{0}+n_{0}+O(\log n)]/\ln(2)
  • =n_{1}\lg(n/n_{1})+n_{0}\lg(n/n_{0})+O(\log n) (linear terms cancel out, n\ln(n) divided into two parts: n_0\ln(n) and n_1\ln(n))
  • =nH(B)+O(\log n)

Use in wavelet tree (see Navarro survey of wavelet trees):

  • top level n0 zeroes, n1 ones, level 1: n00, n01, n10, n11
  • top level entropy encoding uses n0 lg(n/n0) + n1 lg(n/n1) bits
  • next level uses n00 lg(n0/n00) + n01 lg(n0/n01) + n10 lg(n1/n10) + n11 lg(n1/n11)
  • add together n00 lg(n/n00) + n01 lg(n/n01) + n10 lg(n/n10) + n11 lg(n/n11)
  • continue like this until you get sum_c n_c lg(n/n_c) which is nH(T)

Instead of wavelet tree, store indicator vector for each character from alphabet in RRR string

  • \sum _{a}[n_{a}\lg {\frac  {n}{n_{a}}}+(n-n_{a})\lg {\frac  {n}{n-n_{a}}}+o(n)]
  • the first terms in the sum equal to n H(T), the last terms sum to o(\sigma n), middle terms can be bound by O(n)
  • \lg {\frac  {n}{n-n_{a}}}=\lg(1+{\frac  {n_{a}}{n-n_{a}}})=\ln(1+{\frac  {n_{a}}{n-n_{a}}})/\ln(2)\leq (1/\ln(2))\cdot {\frac  {n_{a}}{n-n_{a}}}
  • We have used inequality ln(1+x)<=x for x>=-1
  • Then \sum _{a}(n-n_{a})\lg {\frac  {n}{n-n_{a}}}\leq (1/\ln(2))\sum _{a}(n-n_{a}){\frac  {n_{a}}{n-n_{a}}}=(1/\ln(2))\sum _{a}n_{a}=n/\ln(2)=O(n)

Binarny strom

  • pridame pomocne listy tak, aby kazdy povodny vrchol bol vnut vrchol s 2 detmi
  • ak sme mali n vrcholov, teraz mame n vnut. vrcholov a n+1 listov, t.j. 2n+1 vrcholov
  • ocislujeme ich v level-order (prehladavanie do sirky) cislami 1..2n+1
  • vrchol je reprezentovany tymto cislom
  • datova struktura: bitovy vektor dlzky 2n+1, ktory ma na pozicii i 1 prave vtedy ak vrchol i je vnutorny vrchol
  • nad tymto vektorom vybudujeme rank/select struktury v o(n) pridavnej pamati
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
  1  1  1  1  1  0  1  0  0  1  1  0  0  0  0  0  0
  A  B  C  D  E  .  F  .  .  G  H  .  .  .  .  .  .

Lemma: Pre i-ty vnutorny vrchol (i-ta jednotka vo vektore, na pozicii select(i)) su jeho deti na poziciach 2i a 2i+1

  • indukcia na i
  • pre i=1 (koren) ma deti na poziciach 2 a 3
  • deti i-teho vnutorneho vrcholu su v poradi hned po detoch (i-1)-veho
    • dva pripady podla toho, ci (i-1)-vy a i-ty su na tej istej alebo roznych urvniach, ale v oboch pripadoch medzi nimi len pomocne listy a teda medzi ich detmi nic (obrazok)
    • kedze deti (i-1)-veho su na poziciach podla IH 2i-2 a 2i-1, deti i-teho budu na 2i a 2i+1.

Dosledok:

  • left_child(i) = 2 rank(i)
  • right_child(i) = 2 rank(i)+1
  • parent(i) = select(floor(i/2))

Ak chceme pouzit ako lexikograficky strom nad binarnou abecedou: potrebujeme zistit, ci vnut. vrchol zodpoveda slovu v mnozine

  • jednoduchu fix s n dalsimi bitmi: z cisla vrchola 1..2n+1 zistime rankom cislo vnut vrchola 0..n-1 a pre kazde si pamatame bit, ci je to slovo z mnoziny
  • da sa asi aj lepsie

Catalanove cisla

Ekvivalentne problemy:

  • kolko je binarnych stromov s n vrcholmi?
  • kolko je usporiadanych zakorenenych stromov s n+1 vrcholmi?
  • kolko je dobre uzatvorkovanych vyrazov z n parov zatvoriek?
  • kolko je postupnosti, ktore obsahuju n krat +1 a n krat -1 a pre ktore su vsetky prefixove sucty nezaporne?

Chceme dokazat, ze odpoved je Catalanovo cislo <tex>C_n = {2n\choose n}/(n+1)</tex> (C_0 = 1, C_1=1, C_2 = 2, C_3 = 5,...)

  • Rekurencia <tex>T(n) = \sum_{i=0}^{n-1} T(i)T(n-i-1)</tex> (nech najlavejsia zatvorka obsahuje vo vnutri i parov), riesime metodami kombinatorickej analyzy (generujuce funkcie)
  • Iny dokaz [1]
    • Nech X(n,m,k) je mnozina vsetkych postupnosti dlzky n+m obsahujucich n +1, m -1 a takych ze kazdy prefixovy sucet je aspon k.
    • |X(n,m,-\infty )|={n+m \choose n}
    • my chceme |X(n,n,0)|
    • nech Y = X(n,n,-infty)-X(n,n,0), t.j. vsetky zle postupnsti
    • najdeme bijekciu medzi Y a X(n-1,n+1,-infty)
    • vezmime postupnost z Y, najdime prvy zaporny prefixovy sucet (-1), od toho bodu napravo zmenme znamienka, dostavame postupnost z X(n-1,n+1,-infty) (obrazkovy dokaz so schodmi)
    • naopak vezmime postupnost z X(n-1,n+1,-infty), najdeme prvy bod, kde je zaporny prefixovy sucet (-1), od toho bodu napravo zmenme znamienka, dostavame postupnost z Y
    • tieto dve zobrazenia su navzajom inverzne, preto musi ist o bijekciu
    • preto |X(n,n,0)|=|X(n,n,-\infty )|-|X(n-1,n+1,-\infty )|={2n \choose n}-{2n \choose n-1}
    • {2n \choose n}-{2n \choose n-1}={\frac  {(2n)!}{n!n!}}-{\frac  {(2n)!}{(n-1)!(n+1)!}}={\frac  {(2n)!(n+1-n)}{n!(n+1)!}}={2n \choose n}/(n+1)

Connection to FM index (to be covered later in the course)

  • Retazec dlzky n nad abecedou velkosti sigma mozeme ulozit pomocou n lg sigma bitov (predpokladajme pre jednoduchost, ze sigma je mocnina 2)
  • napr pre binarny retazec potrebuje n bitov
  • Sufixove pole na umoznuje rychlo hladat vzorku, ale potrebuje n lg n bitov, pripadne 3 n lg n, co je rychlejsie rastuca funkcia ako n
  • sufixove stromy maju tiez Theta(n log n) bitov s vacsou konstantou
  • vedeli by sme rychlo vyhladavat vzorku aj s ovela mensim indexom?
  • oblast uspornych datovych struktur

Back to FM index

  • jednoducha finta na zlepsenie pamate pre rank:
    • ulozime si rank len pre niektore i, zvysne dopocitame na zaklade pocitania vyskytov v L (vydelime pamat k, vynasobime cas k)
  • rank + L pomocou wavelet tree v n lg sigma + o(n log sigma) bitoch, hladanie vyskytov sa spomali na O(m log sigma)
  • ako kompaktnejsie ulozit SA
  • mozeme ulozit SA len pre niektore riadky, pre ine sa pomocou LF transformacie dostaneme k najblizsiemu riadku pre ktory mame SA
    • let us say we want to print SA[i], but it is not stored, let denote its unknown value x
    • L[i] is T[x-1]
    • let j = LF[i], j is the row in which SA[j]=x-1
    • if SA[j] stored, return SA[j]+1, otherwise continue in the same way
    • ako presne ulozit "niektore riadky" ked su nerovnomerne rozmiestnene v SA (rovnomerne v T)?
  • samotne L mozeme ulozit aj komprimovane s vhodnymi indexami, vid. clanok od F a M alebo ho mame reprezentovane implicitne vo wavelet tree