Vyhľadávanie kľúčových slov

Predbežná verzia poznámok, nebojte sa upravovať: pripojiť informácie z prezentácie, pridať viac vysvetľujúceho textu a pod.

Obsah

1 Keyword search, inverted index, tries, set intersection

Keyword search, inverted index, tries, set intersection

uvod: slides

Full-text keyword search (Plnotextove vyhladavanie klucovych slov)

definicia problemu: slides
zacat diskusiu o tom, co treba indexovat
Inverted index: for each word list the documents in which it appears.

Ema: 0,2
Emu: 1
ma: 0,1,2
Mama: 1,2
mamu: 0
sa: 2

What data structures do we need:
- Map from words to lists (e.g. sorted array, binary search tree, hash)
- List of documents for each word (array or linked list)

Budeme navrhovat riesenia a analyzovat zlozitost:
- cas predspracovania, cas na jeden dotaz
- Parametre: n pocet roznych slov, N celkovy pocet vyskytov (sucet dlzok zoznamov), m - max. dlzka slova, p - pocet najdenych vyskytov v danom dotaze, sigma - velkost abecedy

Štruktúra	query	preprocessing
Utriedené pole	$O(m\log n+p)$	$O(mN\log N)$
Vyhľ. strom	$O(m\log n+p)$	$O(mN\log n)$
Hašovanie - najh. prípad	$O(mn+p)$	$O(mNn)$
Hašovanie - priem. prípad	$O(m+p)$	$O(mN)$
Lex. strom	$O(m\log \sigma +p)$	$O(mN\log \sigma )$ (neskor)

Tries (lexikograficke stromy)

A different structure for storing a collection of words
Edges from each node labeled with characters of the alphabet (0 to |Sigma| edges)
Each node represents a string obtained by concatenation of characters on the edges
- If this string is in the collection, the node stores pointer to the list of occurrences (or other data)
Small alphabet 0..sigma-1: in each node array of pointers child of length sigma-1, list of occurrences (possibly empty)

Search for a word w of length m in the trie:

node = root;
for(i=0; i<m; i++) {
   node = node->child[w[i]];
   if(! node) return empty_list;
}
return node->list;

Time: O(m)

Adding an occurrence of word w in document d to the trie:

node = root;
for(i=0; i<m; i++) {
   if(! node->child[w[i]]) {
     node->child[w[i]] = new node;
   }
   node = node->child[w[i]];
}
node->list.add(d)

Time: O(m) (assuming list addition in O(1))

Deletion of a word: what needs to be done (when do we delete a vertex), possibly a problem with deletion from list of occurrences

Summary and implementations

several words, total length D, alphabet size sigma, query size m
full array: memory O(D sigma), search O(m), insert all words O(D sigma)
sorted array of variable size memory O(D), search O(m log sigma), add one word O(m log sigma + sigma), all words O(D log sigma + z sigma)
binary tree with letters as keywords: as before, but add only O(m log sigma)
hashing of alphabet: on average get rid of log sigma but many practical issues (resizing...)

Multiple keywords

Given several keywords, find documents that contain all of them (intersection, AND)
Let us consider 2 keywords, one with m documents, other with n, m<n
Assume that for each keyword we have a sorted array of occurrences (sorted e.g. by document ID, pagerank,...)
Solution 1: Similar to merge in mergesort O(m+n) = O(n):

i=0; j=0;
while(i<m && j<n) {
  if(a[i]==b[j]) { print a[i]; i++; j++; }
  else if(a[i]<b[j]) { i++; } 
  else { j++; }
}

What if m<<n?

do m times binary search

for(i=0; i<m; i++) {
   k = binary_search(b, a[i])
   if(a[i]==b[k]) print k;
}

Rewrite a little bit to get an algorithm that generalizes the two (and possibly works faster)

j = 0; // current position in b
for(i=0; i<m; i++) {
    find smallest k>=j s.t. b[k]>=a[i]; (*)
    if(k==n) { break; }
    j=k;
    if(a[i]==b[j]) {
      print a[i]; 
      j++;
    }
}

How to do (*):

linear search k=j; while(k<n && b[k]<a[i]) { k++; } the same algorithm as before
binary search in b[j..n] $O(m\log n)$ , faster for very small m
doubling search/galloping search Bentley, Yao 1976 (general idea), Demaine, Lopez-Ortiz, Munro 2000 (application to intersection, more complex alg.), see also Baeza-Yates, Salinger (nice overview):

step = 1; k = j; k'=k;
while(k<n && b[k]<a[i]) {
   k'=k; k=j+step; step*=2; 
}
binary search in b[k'..k];

If k=j+x, doubling search needs O(log x) steps
Overall running time of the algorithm: doubling search at most m times, each time eliminates some number of elements x_i, time log(x1)+log(x2)+...log(xm) where x1+x2+...+xm<=n For what value of x_i is the sum is maximized? Happens for equally sized xi's, each n/m elements, $O(m\log(n/m))$ time. For constant m logarithmic, for large m linear. Also better than simple merge if different clusters of similar values in each array.
If more than 2 keywords, iterate (possibly starting with smallest lists - why better?), or extend the algorithm

Vyhľadávanie kľúčových slov

Obsah

Keyword search, inverted index, tries, set intersection

Full-text keyword search (Plnotextove vyhladavanie klucovych slov)

Tries (lexikograficke stromy)

Multiple keywords

Navigačné menu

Osobné nástroje

Menné priestory

Varianty

Zobrazení

Operácie

Hľadať

Navigácia

Nástroje