Vybrané partie z dátových štruktúr
2-INF-237, LS 2016/17

Úvod · Pravidlá · Prednášky · Prezentácia · Ako poznámkovať · Moodle
Táto stránka sa týka školského roku 2016/17. V školskom roku 2017/18 predmet vyučuje Jakub Kováč, stránku predmetu je https://people.ksp.sk/~kuko/ds


Sufixové stromy a polia

Z VPDS
Prejsť na: navigácia, hľadanie

Predbežná verzia poznámok, nebojte sa upravovať: pripojiť informácie z prezentácie, preložiť do slovenčiny, pridať viac vysvetľujúceho textu a pod. Cast prednasok tu nie je zastupena, ale najdete ich v poznamkach z predmetu Vyhľadávanie v texte - vid linka na spodku stranky.

Intro to suffix trees

  • on slides
    • example of generalized suffix tree for S1=aba$, S2=bba$
  • at the end: uses of suffix trees (assume we can build in O(n))
    • string matching - build tree, then each pattern in O(m+k) - search through a subtree - its size?
    • find longest word that occurs twice in S - node with highest string depth O(n)
    • find longest word that is at least two strings in input set - node with highest string depth that has two different string represented - how to do details O(n)
    • next all maximal repeats

Suffix-trie.png Suffix-comp.png Suffix-suffix.png Suffix-gentrie.png Suffix-gensuffix.png


Maximal pairs

  • Definition, motivation on a slide
  • example: S=xabxabyabz, max pairs xab-xab, ab-ab (2x)
  • Each maximal repeat is an internal node (cannot end in the middle of an edge). Why?
  • Let v is a leaf for suffix S[i..n-1], let l(v)=S[i-1] - left character
  • Def. Node v is diverse (roznorody) if its subtree contains leaves labeled by at least 2 different values of l(v)
  • Theorem. Node v corresponds to a maximal repeat <=> v is diverse
  • Proof: => obvious.
  • <= Let v corresponds to string A
  • v is diverse => xA, yA are substrings in S for some x!=y
  • v is a node => Ap, Aq are substrings in S for some p!=q
  • If xA is always followed by the same character, e.g. p, then Aq must be preceded by some z!=x, we have pairs xAp, zAq
  • Otherwise we have xA is followed by at least 2 different characters, say p and q. If yAp in S, use maximal pairs yAp, xAq, otherwise use maximal pairs xAp, yA?

Summary

  • create suffix tree for #S$
  • compute l(v) for each leaf
  • compute diversity of nodes in a bottom-up fashion
  • print all substrings (their indices) corresponding to diverse nodes

Further extensions: find only repeats which are not parts of other repeats, finding all maximal pairs,...

LCA v sufixovych stromoch

  • def na slajde
  • majme listy zodpovedajuce sufixom S[i..n-1] a S[j..n-1]
  • lca(u,v) = najdlhsi spolocny prefix S[i..n-1] a S[j..n-1]
  • funguje aj v zovseobecnenych sufixovych stromoch, vtedy sufixy mozu byt z roznych retazcov

Aproximate string matching

  • on slides, do running time analysis O(nk)
  • wildcards

Counting documents

Suffix-count.png

  • list of leaves in DFS order
  • each subtree an interval of the list
  • separate to sublists: L_i = list of suffixes of S_i in pre-order
  • compute lca for each two consecutive members of each L_i
  • in each node counter h(v): how many times found as lca
  • compute in each node l(v): number of leaves in subtree, s(v): sum of h(v) in subtree
  • C(v) = l(v)-s(v)

Proof:

  • consider node v and string Si
  • if Si occurs k>0 times in subtree of v, it should be counted once in c(v)
    • counted k-times in l(v), k-1 times in s(v)
  • if Si not in subtree of v, it should be counted 0 times
    • counted 0 times in both l(v) and s(v)


=Suffix arrays

S array1.png

Literatúra