Vybrané partie z dátových štruktúr
2-INF-237, LS 2016/17
Data structures for external memory
Sources:
- Lecture L07 by Erik Demaine at MIT: http://courses.csail.mit.edu/6.851/spring12/lectures/
- The chapter on B-trees from Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms (the 3rd edition was used).
- Online algorithms and paging: http://courses.csail.mit.edu/6.854/03/scribe/scribe19.ps
Outline:
- introduction to external memory from Erik Demaine's lecture
- brief review of B-trees: structure, search, insertion, height bound
- brief review of mergesort in external memory
- introduction to the cache-oblivious model from Erik Demaine's lecture
- brief introduction to online algorithms, competitive analysis, paging
- static search trees (vEB layout) from Erik Demaine's lecture
- a sketch of how dynamic trees (cache-oblivious B-trees) look; we did not go into all the details
External memory model, I/O model
Introduction
- big and slow disk, fast memory of a limited size M words
- disk reads/writes in blocks of size B words
- when analyzing algorithms, count how many disk reads/writes are done
Example: scanning through n elements (e.g. to compute their sum): Theta(n/B) memory transfers
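
A small illustrative sketch (not from the lecture) of how this cost is counted: the disk is modeled as a flat Python list and only whole blocks of B words are brought into fast memory.

import math

def scan_sum(disk, B):
    # Sum n elements stored on 'disk' (a flat list), reading blocks of B words.
    n = len(disk)
    block_reads = 0
    total = 0
    for start in range(0, n, B):
        block = disk[start:start + B]          # one memory transfer brings in up to B words
        block_reads += 1
        total += sum(block)                    # computation inside fast memory is not counted
    assert block_reads == math.ceil(n / B)     # Theta(n/B) memory transfers
    return total, block_reads
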
B-trees
- I/O cost of a binary search tree? Each node on the search path may lie in a different block, so a plain binary search tree can need O(log n) memory transfers
- from CLRS 3rd edition
- parameter T related to block size B (so that each node fits into a constant number of blocks)
- each node v has some number v.n of keys
- if it is an internal node, it has v.n+1 children
- keys in a node are sorted
- each subtree contains only values between two successive keys in the parent
- all leaves are at the same depth
- each node except root has at least T-1 keys and each internal node except root has at least T children
- each node has at most 2T-1 keys and at most 2T children
- height is O(log_T n)
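
A short derivation of the height bound, following Theorem 18.1 in CLRS: the root contributes at least one key, every other node at least T-1 keys, and level i >= 1 contains at least 2T^{i-1} nodes, so

n \;\ge\; 1 + (T-1)\sum_{i=1}^{h} 2T^{i-1}
  \;=\; 1 + 2(T-1)\,\frac{T^{h}-1}{T-1}
  \;=\; 2T^{h}-1
\quad\Longrightarrow\quad
h \;\le\; \log_T\frac{n+1}{2} \;=\; O(\log_T n).
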
Search:
- O(log_T n) block reads
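
A minimal sketch of the search (my own illustration, not the CLRS pseudocode); a node is assumed to store a sorted list of keys and a list of children, which is empty in a leaf.

import bisect

class BTreeNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # sorted; between T-1 and 2T-1 keys (except in the root)
        self.children = children or []  # empty in a leaf, len(keys) + 1 children otherwise

def btree_search(node, key):
    # One block read per node visited, hence O(log_T n) memory transfers.
    i = bisect.bisect_left(node.keys, key)       # search inside the node is free (in memory)
    if i < len(node.keys) and node.keys[i] == key:
        return node, i                           # found: i-th key of this node
    if not node.children:                        # reached a leaf without finding the key
        return None
    return btree_search(node.children[i], key)   # descend into the i-th subtree
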
Insert:
- find a leaf where the new key belongs
- if the leaf is full (2T-1 keys), split it into two halves of T-1 keys each and insert the median key into the parent
- the new element can then be inserted into one of the two new leaves
- we keep splitting full ancestors until we reach a node that is not full or until we reach the root; if the root is full, we create a new root with two children (increasing the height of the tree)
- O(log_T n) block reads and writes
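
A minimal sketch of the splitting step described above, reusing the illustrative BTreeNode from the search sketch (a bottom-up variant; CLRS instead splits full nodes proactively on the way down).

def split_child(parent, i, T):
    # Split parent's full child at index i (2T-1 keys) into two nodes of T-1 keys each
    # and move the median key up into the parent; a constant number of block writes.
    full = parent.children[i]
    median = full.keys[T - 1]
    right = BTreeNode(keys=full.keys[T:], children=full.children[T:])
    full.keys = full.keys[:T - 1]            # left half keeps the first T-1 keys
    full.children = full.children[:T]        # and, if internal, the first T children
    parent.keys.insert(i, median)
    parent.children.insert(i + 1, right)
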
Sorting
- mergesort
- I/O cost of ordinary implementation?
- create sorted runs of size M using O(N/B) disk reads and writes
- repeatedly merge M/B - 1 runs into one; one input block per run plus one output block fit in memory, so each pass uses O(N/B) disk reads and writes
- log_{M/B-1} (N/M) passes
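
A small illustrative sketch of this pass structure (my own simplification: the whole input sits in a Python list and heapq.merge stands in for the block-by-block multiway merge).

import heapq

def external_mergesort(data, M, B):
    # Phase 1: sort chunks of M elements in memory -> sorted runs, O(N/B) transfers.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    # Phase 2: repeatedly merge 'fanout' runs at a time; one block per input run
    # plus one output block must fit into fast memory, so fanout is about M/B - 1.
    fanout = max(2, M // B - 1)
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fanout]))
                for i in range(0, len(runs), fanout)]
        passes += 1      # each pass reads and writes all N elements: O(N/B) transfers
    # passes is roughly ceil(log_fanout(N/M)), matching the bound above
    return (runs[0] if runs else []), passes
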
Summary
- B-trees can do search tree operations (insert, delete, search/predecessor) in Theta(log_{B+1} n) = Theta(log n / log (B+1)) memory transfers
- compared to the O(log n) running time of an ordinary search tree, the number of memory transfers is smaller by a factor of roughly log B
- search optimal in comparison model
- proof outline through information theory (a short symbolic version is sketched after this list)
- the goal of the search is to tell us between which two elements the query is located
- there are Theta(n) possible answers, thus Theta(log n) bits of information
- in the normal comparison model, one comparison of the query to an element of the set gives us at most 1 bit of information
- in the I/O model, reading a block of B elements and comparing the query to them gives us at best its position within these B elements, i.e. Theta(log B) bits
- to get the Theta(log n) bits of information needed for the answer, we need Theta(log n / log B) block reads
- this can be transformed into a more formal and precise proof
- sorting: O((N/B) log_{M/B} (N/B)) memory transfers
- compared to the O(N log N) running time of ordinary sorting, the number of memory transfers is smaller by a factor of more than B
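
A short symbolic version of the search lower bound argument above (constants are informal): the answer is one of about n+1 gaps between the stored elements, and one block read has at most B+1 distinguishable outcomes, so in the comparison I/O model

\#\text{block reads} \;\ge\; \log_{B+1}(n+1) \;=\; \frac{\log(n+1)}{\log(B+1)} \;=\; \Omega\!\left(\frac{\log n}{\log B}\right).
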
Cache oblivious model
Paging
- the cache holds k = M/B blocks; each slot holds one page (block) of size B
- the input is a sequence of page requests; if the requested page is not in memory, bring it in and evict some other page (a page fault)
- goal is to minimize the number of page faults
- optimal offline algorithm (knows the whole sequence of page requests)
- At a page fault remove the page that will not be used for the longest time
- example of an on-line algorithm: FIFO (a sketch is given after this list)
- at a page fault, evict the page that has been in memory the longest
- uses at most k times as many page faults as the optimal algorithm (k-competitive)
- no deterministic alg. can be better than k-competitive
- it is conservative: in any segment of requests containing at most k distinct pages it incurs at most k page faults
- every conservative alg. is k-competitive
- Compare a conservative paging alg. (e.g. FIFO) on memory with k blocks to optimum offline alg. on a smaller memory of size h - competitive ratio k/(k-h)
- if h = k/2, we get competitive ratio 2
- divide the input sequence into maximal segments, each containing k distinct pages (the first request of the next segment is distinct from the k pages of the previous segment)
- FIFO incurs at most k page faults in each segment
- the optimum has at most h of the segment's pages in memory at the beginning of the segment, so at least k-h of them cause a fault
- in fact a ratio of k/(k-h+1) can be proved, which works even for h = k
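
The sketch of FIFO referred to above (an illustration of my own; page identifiers can be arbitrary hashable values and only the number of faults is tracked).

from collections import deque

def fifo_paging(requests, k):
    # Count page faults of FIFO with k page slots on a sequence of page requests.
    cache = deque()       # pages in order of arrival into memory
    in_cache = set()
    faults = 0
    for page in requests:
        if page in in_cache:
            continue                      # hit: no memory transfer
        faults += 1                       # page fault: bring the page in
        if len(cache) == k:
            evicted = cache.popleft()     # evict the page that entered memory first
            in_cache.remove(evicted)
        cache.append(page)
        in_cache.add(page)
    return faults

For example, fifo_paging([1, 2, 3, 1, 2, 4, 1], k=3) returns 5: the first three requests are faults, and so are the request for page 4 and the following request for page 1, which was evicted when 4 was brought in.
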