Vybrané partie z dátových štruktúr
2-INF-237, LS 2016/17
Sufixové stromy a polia: Rozdiel medzi revíziami
Z VPDS
(→Algorithm 3) |
(→Algorithm 1) |
||
5 medziľahlých revízií od jedného používateľa nie je zobrazených. | |||
Riadok 53: | Riadok 53: | ||
* list of leaves in DFS order | * list of leaves in DFS order | ||
* each subtree an interval of the list | * each subtree an interval of the list | ||
− | * separate to sublists: L_i = list of suffixes of S_i in | + | * separate to sublists: L_i = list of suffixes of S_i in DFS order |
* compute lca for each two consecutive members of each L_i | * compute lca for each two consecutive members of each L_i | ||
* in each node counter h(v): how many times found as lca | * in each node counter h(v): how many times found as lca | ||
Riadok 74: | Riadok 74: | ||
* see slides | * see slides | ||
− | == | + | ==Searching in suffix arrays== |
===Algorithm 1=== | ===Algorithm 1=== | ||
<pre> | <pre> | ||
− | //find max i such that T[SA[i]..n]< X | + | //find max i such that T[SA[i]..n] < X |
L = 0; R = n; | L = 0; R = n; | ||
while(L < R){ | while(L < R){ | ||
k = (L + R + 1) / 2; | k = (L + R + 1) / 2; | ||
− | if(T[SA[k] | + | h = 0; |
− | + | while(T[SA[k] + h] == X[h]) h++; | |
− | + | if(T[SA[k]+h] < X[h]) L = k; | |
− | else | + | else R = k - 1; |
− | + | ||
− | + | ||
} | } | ||
return L; | return L; | ||
Riadok 95: | Riadok 93: | ||
* O(log n) iterations | * O(log n) iterations | ||
* String comparison O(m) | * String comparison O(m) | ||
− | * Run twice: once for X=P | + | * Run twice: once for X=P¢, once for X=P#, where ¢ < $ < ''a'' < # for all ''a'' in the alphabet |
− | * If the first search | + | * If the first search returns i, the second search j, patter occurrences will be at SA[i+1],...,SA[j] in T |
− | * | + | * Counting occurrences O(m log n), printing their positions O(m log n + k), where k is the number of occurrences |
===Algorithm 2=== | ===Algorithm 2=== | ||
Riadok 104: | Riadok 102: | ||
* XL = lcp(T[SA[L]..n], X) | * XL = lcp(T[SA[L]..n], X) | ||
* XR = lcp(T[SA[R]..n], X) | * XR = lcp(T[SA[R]..n], X) | ||
+ | |||
+ | [[Image:S_array-alg2.png|200px]] | ||
<pre> | <pre> | ||
Riadok 132: | Riadok 132: | ||
* if LCP(L,k) < XL, this imples T[SA[k]..n] >X and therefore set R=k, XR=LCP(L,k) | * if LCP(L,k) < XL, this imples T[SA[k]..n] >X and therefore set R=k, XR=LCP(L,k) | ||
* if LCP(L,k) = XL, compare X and T[SA[k]..n] starting at XL+1 (which is maximum of XL, XR) | * if LCP(L,k) = XL, compare X and T[SA[k]..n] starting at XL+1 (which is maximum of XL, XR) | ||
+ | |||
+ | [[Image:S_array-alg3-a.png|200px]] | ||
+ | [[Image:S_array-alg3-b.png|200px]] | ||
+ | [[Image:S_array-alg3-c.png|230px]] | ||
Running time | Running time | ||
Riadok 145: | Riadok 149: | ||
Which LCP values are needed? | Which LCP values are needed? | ||
− | [[Image:S_array1.png| | + | [[Image:S_array1.png|270px]] |
==Literatúra== | ==Literatúra== | ||
* Dan Gusfield (1997) [http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521585198 Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.] Cambridge University Press. Prezenčne v knižnici so signatúrou I-INF-G-8. | * Dan Gusfield (1997) [http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521585198 Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.] Cambridge University Press. Prezenčne v knižnici so signatúrou I-INF-G-8. | ||
* Poznámky z predmetu Vyhľadávanie v texte [http://compbio.fmph.uniba.sk/vyuka/vvt/poznamky/main-2013-05-20.pdf] (zčasti písané študentami a nedokončené) | * Poznámky z predmetu Vyhľadávanie v texte [http://compbio.fmph.uniba.sk/vyuka/vvt/poznamky/main-2013-05-20.pdf] (zčasti písané študentami a nedokončené) |
Aktuálna revízia z 09:46, 22. marec 2017
Predbežná verzia poznámok, nebojte sa upravovať: pripojiť informácie z prezentácie, pridať viac vysvetľujúceho textu a pod. Cast prednasok tu nie je zastupena, ale najdete ich v poznamkach z predmetu Vyhľadávanie v texte - vid linka na spodku stranky.
Obsah
Intro to suffix trees
- on slides
- example of generalized suffix tree for S1=aba$, S2=bba$
- at the end: uses of suffix trees (assume we can build in O(n))
- string matching - build tree, then each pattern in O(m+k) - search through a subtree - its size?
- find longest word that occurs twice in S - node with highest string depth O(n)
- find longest word that is at least two strings in input set - node with highest string depth that has two different string represented - how to do details O(n)
- next all maximal repeats
Maximal pairs
- Definition, motivation on a slide
- example: S=xabxabyabz, max pairs xab-xab, ab-ab (2x)
- Each maximal repeat is an internal node (cannot end in the middle of an edge). Why?
- Let v is a leaf for suffix S[i..n-1], let l(v)=S[i-1] - left character
- Def. Node v is diverse (roznorody) if its subtree contains leaves labeled by at least 2 different values of l(v)
- Theorem. Node v corresponds to a maximal repeat <=> v is diverse
- Proof: => obvious.
- <= Let v corresponds to string A
- v is diverse => xA, yA are substrings in S for some x!=y
- v is a node => Ap, Aq are substrings in S for some p!=q
- If xA is always followed by the same character, e.g. p, then Aq must be preceded by some z!=x, we have pairs xAp, zAq
- Otherwise we have xA is followed by at least 2 different characters, say p and q. If yAp in S, use maximal pairs yAp, xAq, otherwise use maximal pairs xAp, yA?
Summary
- create suffix tree for #S$
- compute l(v) for each leaf
- compute diversity of nodes in a bottom-up fashion
- print all substrings (their indices) corresponding to diverse nodes
Further extensions: find only repeats which are not parts of other repeats, finding all maximal pairs,...
LCA v sufixovych stromoch
- def na slajde
- majme listy zodpovedajuce sufixom S[i..n-1] a S[j..n-1]
- lca(u,v) = najdlhsi spolocny prefix S[i..n-1] a S[j..n-1]
- funguje aj v zovseobecnenych sufixovych stromoch, vtedy sufixy mozu byt z roznych retazcov
Aproximate string matching
- on slides, do running time analysis O(nk)
- wildcards
Counting documents
- list of leaves in DFS order
- each subtree an interval of the list
- separate to sublists: L_i = list of suffixes of S_i in DFS order
- compute lca for each two consecutive members of each L_i
- in each node counter h(v): how many times found as lca
- compute in each node l(v): number of leaves in subtree, s(v): sum of h(v) in subtree
- C(v) = l(v)-s(v)
Proof:
- consider node v and string Si
- if Si occurs k>0 times in subtree of v, it should be counted once in c(v)
- counted k-times in l(v), k-1 times in s(v)
- if Si not in subtree of v, it should be counted 0 times
- counted 0 times in both l(v) and s(v)
Printing documents
Muthukrishnan, S. "Efficient algorithms for document retrieval problems." SODA 2002. Strany 657-666.
- Preprocess texts
- Query: which documents contain pattern P?
- We can do O(m+k) where k=number of occurrences of P by simple search in suffix tree
- Using RMQ and print+small, we can achieve O(m+p) where p is the number of documents containing P
- see slides
Searching in suffix arrays
Algorithm 1
//find max i such that T[SA[i]..n] < X L = 0; R = n; while(L < R){ k = (L + R + 1) / 2; h = 0; while(T[SA[k] + h] == X[h]) h++; if(T[SA[k]+h] < X[h]) L = k; else R = k - 1; } return L;
- O(log n) iterations
- String comparison O(m)
- Run twice: once for X=P¢, once for X=P#, where ¢ < $ < a < # for all a in the alphabet
- If the first search returns i, the second search j, patter occurrences will be at SA[i+1],...,SA[j] in T
- Counting occurrences O(m log n), printing their positions O(m log n + k), where k is the number of occurrences
Algorithm 2
Keep invariant:
- XL = lcp(T[SA[L]..n], X)
- XR = lcp(T[SA[R]..n], X)
//find max i such that T[SA[i]..n-1]< X L = 0; R = n; XL = lcp(X, T[SA[L]..n]); XR = lcp(X, T[SA[R]..n]); while(R - L > 1){ k = (L + R + 1) / 2; h = min(XL, XR); while(T[SA[k]+h]==X[h]) { h++; } if(T[SA[k]+h]<X[h]){ L = k; XL = h; } else { R = k; XR = h; } } sequential search in SA[L..R];
Faster on some inputs, but still O(m log n) - see exercise
Algorithm 3
- similar to alg 2, but want to start from max(XL, XR), not min(XL, XR)
- recall that LCP(i,j) = lcp(T[SA[i]..n],T[SA[j]..n])
- assume we know LCP(i,j) for any i,j (later)
Assume XL>=XR in a given iteration (XL<XR is solved similarly. Consider three cases:
- if LCP(L,k) > XL, this imples T[SA[k]..n] < X and therefore set L=k, XL stays the same
- if LCP(L,k) < XL, this imples T[SA[k]..n] >X and therefore set R=k, XR=LCP(L,k)
- if LCP(L,k) = XL, compare X and T[SA[k]..n] starting at XL+1 (which is maximum of XL, XR)
Running time
- Overall O(log n) iterations
- each O(1) + character comparisons
- max(XL,XR) does not decrease
- overall O(log n) character comparisons ends with inequality
- character equality increases max(XL, XR) by 1
- max(XL, XR) <=m
- therefore at most m such comparisons with equality
- running time O(m+\log n)
Which LCP values are needed?
Literatúra
- Dan Gusfield (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. Prezenčne v knižnici so signatúrou I-INF-G-8.
- Poznámky z predmetu Vyhľadávanie v texte [1] (zčasti písané študentami a nedokončené)