Vybrané partie z dátových štruktúr
2-INF-237, LS 2016/17
Vyhľadávanie kľúčových slov
Z VPDS
Predbežná verzia poznámok, nebojte sa upravovať: pripojiť informácie z prezentácie, pridať viac vysvetľujúceho textu a pod.
Obsah
Keyword search, inverted index, tries, set intersection
- uvod: slides
Full-text keyword search (Plnotextove vyhladavanie klucovych slov)
- definicia problemu: slides
- zacat diskusiu o tom, co treba indexovat
- Inverted index: for each word list the documents in which it appears.
Ema: 0,2 Emu: 1 ma: 0,1,2 Mama: 1,2 mamu: 0 sa: 2
- What data structures do we need:
- Map from words to lists (e.g. sorted array, binary search tree, hash)
- List of documents for each word (array or linked list)
- Budeme navrhovat riesenia a analyzovat zlozitost:
- cas predspracovania, cas na jeden dotaz
- Parametre: n pocet roznych slov, N celkovy pocet vyskytov (sucet dlzok zoznamov), m - max. dlzka slova, p - pocet najdenych vyskytov v danom dotaze, sigma - velkost abecedy
| Štruktúra | query | preprocessing |
|---|---|---|
| Utriedené pole |
|
|
| Vyhľ. strom |
|
|
| Hašovanie - najh. prípad |
|
|
| Hašovanie - priem. prípad |
|
|
| Lex. strom |
|
(neskor)
|
Tries (lexikograficke stromy)
- A different structure for storing a collection of words
- Edges from each node labeled with characters of the alphabet (0 to |Sigma| edges)
- Each node represents a string obtained by concatenation of characters on the edges
- If this string is in the collection, the node stores pointer to the list of occurrences (or other data)
- Small alphabet 0..sigma-1: in each node array of pointers child of length sigma-1, list of occurrences (possibly empty)
- Search for a word w of length m in the trie:
node = root;
for(i=0; i<m; i++) {
node = node->child[w[i]];
if(! node) return empty_list;
}
return node->list;
Time: O(m)
- Adding an occurrence of word w in document d to the trie:
node = root;
for(i=0; i<m; i++) {
if(! node->child[w[i]]) {
node->child[w[i]] = new node;
}
node = node->child[w[i]];
}
node->list.add(d)
Time: O(m) (assuming list addition in O(1))
Deletion of a word: what needs to be done (when do we delete a vertex), possibly a problem with deletion from list of occurrences
Summary and implementations
- several words, total length D, alphabet size sigma, query size m
- full array: memory O(D sigma), search O(m), insert all words O(D sigma)
- sorted array of variable size memory O(D), search O(m log sigma), add one word O(m log sigma + sigma), all words O(D log sigma + z sigma)
- binary tree with letters as keywords: as before, but add only O(m log sigma)
- hashing of alphabet: on average get rid of log sigma but many practical issues (resizing...)
Multiple keywords
- Given several keywords, find documents that contain all of them (intersection, AND)
- Let us consider 2 keywords, one with m documents, other with n, m<n
- Assume that for each keyword we have a sorted array of occurrences (sorted e.g. by document ID, pagerank,...)
- Solution 1: Similar to merge in mergesort O(m+n) = O(n):
i=0; j=0;
while(i<m && j<n) {
if(a[i]==b[j]) { print a[i]; i++; j++; }
else if(a[i]<b[j]) { i++; }
else { j++; }
}
What if m<<n?
- do m times binary search
for(i=0; i<m; i++) {
k = binary_search(b, a[i])
if(a[i]==b[k]) print k;
}
- Rewrite a little bit to get an algorithm that generalizes the two (and possibly works faster)
j = 0; // current position in b
for(i=0; i<m; i++) {
find smallest k>=j s.t. b[k]>=a[i]; (*)
if(k==n) { break; }
j=k;
if(a[i]==b[j]) {
print a[i];
j++;
}
}
How to do (*):
- linear search k=j; while(k<n && b[k]<a[i]) { k++; } the same algorithm as before
- binary search in b[j..n]
, faster for very small m
- doubling search/galloping search Bentley, Yao 1976 (general idea), Demaine, Lopez-Ortiz, Munro 2000 (application to intersection, more complex alg.), see also Baeza-Yates, Salinger (nice overview):
step = 1; k = j; k'=k;
while(k<n && b[k]<a[i]) {
k'=k; k=j+step; step*=2;
}
binary search in b[k'..k];
- If k=j+x, doubling search needs O(log x) steps
- Overall running time of the algorithm: doubling search at most m times, each time eliminates some number of elements x_i, time log(x1)+log(x2)+...log(xm) where x1+x2+...+xm<=n For what value of x_i is the sum is maximized? Happens for equally sized xi's, each n/m elements,
time. For constant m logarithmic, for large m linear. Also better than simple merge if different clusters of similar values in each array.
- If more than 2 keywords, iterate (possibly starting with smallest lists - why better?), or extend the algorithm
(neskor)