Sufixové stromy a polia: Rozdiel medzi revíziami

Verzia zo dňa a času 14:44, 25. marec 2014

Obsah

1 Suffix trees and applications
2 Literatúra

Suffix trees and applications

Intro to suffix trees

on slides
- example of generalized suffix tree for S1=aba$, S2=bba$
at the end: uses of suffix trees (assume we can build in O(n))
- string matching - build tree, then each pattern in O(m+k) - search through a subtree - its size?
- find longest word that occurs twice in S - node with highest string depth O(n)
- find longest word that is at least two strings in input set - node with highest string depth that has two different string represented - how to do details O(n)
- next all maximal repeats

Maximal pairs

Definition, motivation on a slide
example: S=xabxabyabz, max pairs xab-xab, ab-ab (2x)
Each maximal repeat is an internal node (cannot end in the middle of an edge). Why?
Let v is a leaf for suffix S[i..n-1], let l(v)=S[i-1] - left character
Def. Node v is diverse (roznorody) if its subtree contains leaves labeled by at least 2 different values of l(v)
Theorem. Node v corresponds to a maximal repeat <=> v is diverse
Proof: => obvious.
<= Let v corresponds to string A
v is diverse => xA, yA are substrings in S for some x!=y
v is a node => Ap, Aq are substrings in S for some p!=q
If xA is always followed by the same character, e.g. p, then Aq must be preceded by some z!=x, we have pairs xAp, zAq
Otherwise we have xA is followed by at least 2 different characters, say p and q. If yAp in S, use maximal pairs yAp, xAq, otherwise use maximal pairs xAp, yA?

Summary

create suffix tree for #S$
compute l(v) for each leaf
compute diversity of nodes in a bottom-up fashion
print all substrings (their indices) corresponding to diverse nodes

Further extensions: find only repeats which are not parts of other repeats, finding all maximal pairs,...

LCA v sufixovych stromoch

def na slajde
majme listy zodpovedajuce sufixom S[i..n-1] a S[j..n-1]
lca(u,v) = najdlhsi spolocny prefix S[i..n-1] a S[j..n-1]
funguje aj v zovseobecnenych sufixovych stromoch, vtedy sufixy mozu byt z roznych retazcov

Aproximate string matching

on slides, do running time analysis O(nk)
wildcards

Counting documents

list of leaves in DFS order
each subtree an interval of the list
separate to sublists: L_i = list of suffixes of S_i in pre-order
compute lca for each two consecutive members of each L_i
in each node counter h(v): how many times found as lca
compute in each node l(v): number of leaves in subtree, s(v): sum of h(v) in subtree
C(v) = l(v)-s(v)

Proof:

consider node v and string Si
if Si occurs k>0 times in subtree of v, it should be counted once in c(v)
- counted k-times in l(v), k-1 times in s(v)
if Si not in subtree of v, it should be counted 0 times
- counted 0 times in both l(v) and s(v)

Literatúra

Dan Gusfield (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. Prezenčne v knižnici so signatúrou I-INF-G-8.
Poznámky z predmetu Vyhľadávanie v texte [1] (zčasti písané študentami a nedokončené)

@@ Riadok 6: / Riadok 6: @@
 [[Image:Suffix-gensuffix.png|250px]]
 [[Image:Suffix-count.png|250px]]
+==Suffix trees and applications==
+===Intro to suffix trees===
+* on slides
+** example of generalized suffix tree for S1=aba$, S2=bba$
+* at the end: uses of suffix trees (assume we can build in O(n))
+** string matching - build tree, then each pattern in O(m+k) - search through a subtree - its size?
+** find longest word that occurs twice in S - node with highest string depth O(n)
+** find longest word that is at least two strings in input set - node with highest string depth that has two different string represented - how to do details O(n)
+** next all maximal repeats
+===Maximal pairs===
+* Definition, motivation on a slide
+* example: S=xabxabyabz, max pairs xab-xab, ab-ab (2x)
+* Each maximal repeat is an internal node (cannot end in the middle of an edge). Why?
+* Let v is a leaf for suffix S[i..n-1], let l(v)=S[i-1] - left character
+* Def. Node v is diverse (roznorody) if its subtree contains leaves labeled by at least 2 different values of l(v)
+* Theorem. Node v corresponds to a maximal repeat <=> v is diverse
+* Proof: => obvious.
+* <= Let v corresponds to string A
+* v is diverse => xA, yA are substrings in S for some x!=y
+* v is a node => Ap, Aq are substrings in S for some p!=q
+* If xA is always followed by the same character, e.g. p, then Aq must be preceded by some z!=x, we have pairs xAp, zAq
+* Otherwise we have xA is followed by at least 2 different characters, say p and q. If yAp in S, use maximal pairs yAp, xAq, otherwise use maximal pairs xAp, yA?
+Summary
+* create suffix tree for #S$
+* compute l(v) for each leaf
+* compute diversity of nodes in a bottom-up fashion
+* print all substrings (their indices) corresponding to diverse nodes
+Further extensions: find only repeats which are not parts of other repeats, finding all maximal pairs,...
+===LCA v sufixovych stromoch===
+* def na slajde
+* majme listy zodpovedajuce sufixom S[i..n-1] a S[j..n-1]
+* lca(u,v) = najdlhsi spolocny prefix S[i..n-1] a S[j..n-1]
+* funguje aj v zovseobecnenych sufixovych stromoch, vtedy sufixy mozu byt z roznych retazcov
+===Aproximate string matching===
+* on slides, do running time analysis O(nk)
+* wildcards
+===Counting documents===
+* list of leaves in DFS order
+* each subtree an interval of the list
+* separate to sublists: L_i = list of suffixes of S_i in pre-order
+* compute lca for each two consecutive members of each L_i
+* in each node counter h(v): how many times found as lca
+* compute in each node l(v): number of leaves in subtree, s(v): sum of h(v) in subtree
+* C(v) = l(v)-s(v)
+Proof:
+* consider node v and string Si
+* if Si occurs k>0 times in subtree of v, it should be counted once in c(v)
+** counted k-times in l(v), k-1 times in s(v)
+* if Si not in subtree of v, it should be counted 0 times
+** counted 0 times in both l(v) and s(v)
 ==Literatúra==
 * Dan Gusfield (1997) [http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521585198 Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.] Cambridge University Press. Prezenčne v knižnici so signatúrou I-INF-G-8.
 * Poznámky z predmetu Vyhľadávanie v texte [http://compbio.fmph.uniba.sk/vyuka/vvt/poznamky/main-2013-05-20.pdf] (zčasti písané študentami a nedokončené)

Sufixové stromy a polia: Rozdiel medzi revíziami

Verzia zo dňa a času 14:44, 25. marec 2014

Obsah

Suffix trees and applications

Intro to suffix trees

Maximal pairs

LCA v sufixovych stromoch

Aproximate string matching

Counting documents

Literatúra

Navigačné menu

Osobné nástroje

Menné priestory

Varianty

Zobrazení

Operácie

Hľadať

Navigácia

Nástroje