**Vybrané partie z dátových štruktúr2-INF-237, LS 2016/17**

# Dátové štruktúry pre externú pamäť

Z VPDS

## Obsah

## External memory model, I/O model

### Introduction

- big and slow disk, fast memory of a limited size M words
- disk reads/writes in blocks of size B words
- when analyzing algorithms, count how many disk reads/writes (memory transfers) are done

Example: scanning through n elements (e.g. to compute their sum): Theta(ceiling of n/B) memory transfers

### B-trees

- I/O cost of binary search tree woth arbitray node placement?
- parameter T related to block size B (so that each node fits into a constant number of blocks)
- each node v has some number v.n of keys
- if it is an internal node, it has v.n+1 children
- keys in a node are sorted
- each subtree contains only values between two successive keys in the parent

- all leaves are in the same depth
- each node except root has at least T-1 keys and each internal node except root has at least T children
- each node has at most 2T-1 keys and at most 2T children
- height is O(log_T n)

Search:

- O(log_T n) block reads

Insert:

- find a leaf where the new key belongs
- if leaf is full (2T-1 keys), split into two equal parts of size T-1 and insert median to the parent (recursively)
- new element is inserted to one of the new leaves
- we continue splitting ancestors until we find a node which is not full or until we reach the root. If root is full, we create a new root with 2 children (increasing the height of the tree)

- O(log_T n) transfers

### Sorting

external mergesort

- I/O cost of ordinary mergesort implementation?
- create sorted runs of size M using O(n/B) transfers
- repeatedly merge M/B-1 runs into one in 1 pass, this uses O(n/B) disk reads and writes per pass
- read one block from each run, use one block for output
- when output block gets full, write it to disk
- when some input gets exhausted, read next block from the run (if any)

- log_{M/B-1} (n/M) passes

### Summary

- B-trees can do search tree operations (insert, delete, search/predecessor) in Theta(log_{B+1} n) = Theta(log n / log (B+1)) memory transfers
- number of transfers compared to usual search tree running time: factor of 1/log B

- search optimal in comparison model
- proof outline through information theory
- goal of the search is to tell us between which two elements is query located
- there are Theta(n) possible answers, thus Theta(log n) bits of information
- in normal comparison model one comparison of query to an elements of the set gives us at most 1 bit of information
- in I/O model reading a block of B elements and comparing query to them gives us in the best case its position within these B elements, i.e. Theta(log B) bits
- to get Theta(log n) bits of information for the answer, we need Theta(log n/log B) block reads
- this can be transformed to a more formal and precise proof

- sorting O((n/B) log_{M/B} (n/M)) memory transfers
- number of memory transfers compared to usual sorting running time: factor of more than 1/B
- also much better than sorting using B-trees, which would take O(N * log_B n)

## Cache oblivious model

### Introduction

- in the I/O model the algorithm explicitly requests block transfers, knows B, controls memory allocation
- in the cache oblivious model algorithm does not know B or M, memory operates as a cache
- M/B slots, each holding one block from disk
- algorithm requests reading a word from disk
- if the block containing this word in cache, no transfer
- replace one slot with block holding requested item, write original block if needed (1 or 2 transfers)
- which one to replace: classical on-line problem of paging

### Paging

- cache has size k = M/B slots, each holds one page (above called block)
- sequence of page requests, if the requested page not in memory not in memory, bring it in and remove some other page (page fault)
- goal is to minimize the number of page faults

- optimal offline algorithm (knows the whole sequence of page requests)
- At a page fault remove the page that will not be used for the longest time

- example of an on-line algoritm: FIFO:
- at a page fault remove page which is in the memory longest time
- uses at most k times as many page faults as the optimal algorithm (k-competitive)
- no deterministic alg. can be better than k-competitive
- it is conservative: in a segment of requests containing at most k distinct pages it does at most k page faults
- proof by induction on segment length, hypothesis: each element which was inserted within segment is never removed within segment

- every conservative alg. is k-competitive

- Compare a conservative paging alg. (e.g. FIFO) on memory with k blocks to optimum offline alg. on a smaller memory of size h - competitive ratio k/(k-h)
- if h = k/2, we get competitive ratio 2
- divide input sequence into maximal blocks, each containing k distinct elements (first element of the next block is distinct from the k elements of the previous block)
- FIFO uses at most k page faults in each block
- optimum has at most h pages from a block in memory at the beginning - at least k-h pages will cause a fault
- in fact we can prove k/(k-h+1), works even for k=h

### Back to cache-oblivious model

- we analyze algorithms under the assumption that paging algorithm uses off-line optimum
- instead it could use e.g. FIFO in memory 2M and increase the number of transfers by a constant factor
- advantages of cache-oblivious model:
- may adapt to changing M, B
- good for a whole memory hierarchy (several levels of cache, disk, network,...)

- scanning still Theta(ceiling of N/B)
- search trees still Theta(log_{B+1} N)
- sorting Theta(n/B log_{M/B}(N/B)) but requires big enough M i.e. M=Omega(B^{1+epsilon})

## Static cache-oblivious search trees

- first published by Harald Prokop, Master thesis, MIT 1999 [1]
- alternative to a binary search
- exercise: how many transfers for binary search?

- use perfectly balanced binary search tree
- search algorithm as usual (follows path from root down as usual)
- nodes are stored on disk in some order which we can choose
- the goal is to choose it so that for any B, M and any search we use only few blocks
- for example, what would happen if we store nodes in pre-order or level-order?
- what would be a good order for fixed known B?
- e.g. store top lg_B n levels together in one block, then split the rest into subtrees and again store top of each subtree in a block etc.

- we do not know B, we will use van Emde Boas order which does this recursively for several values of "B" (later in the course we will see van Emde Boas trees for integer keys)

### van Emde Boas order

- split tree of height lg n into top and bottom, each of height (1/2) lg n
- top part is a small tree with about sqrt(n) vertices
- bottom part consists of about sqrt(n) small trees, each size about sqrt(n)
- each of these small trees is processed recursively and the results are concatenated
- for example a tree with 4 levels is split into 5 trees with 2 levels, resulting in the following ordering:

1 2 3 4 7 10 13 5 6 8 9 11 12 14 15

- if B = sqrt n, we need only 2 trasfers
- but we will show next that for any B<=M/2 and for any path from the root we need only Theta(log_{B+1} n) block transfers

### Analysis

- k: height of trees in some level of recursion, such that size of these trees is <=B and the next higher level has tree size >B
- (1/2) lg B <= k <= lg B
- careful: of size of these trees is x=2^k, we have sqrt(B) <= x <= B
- length of path lg n, visits lg n / k trees of height k, which is Theta(lg n / lg B)
- each tree of height k occupies at most B cells in memory (assuming each node fits in a word, otherwise everything is multiplied by a constant factor...)
- traversing the tree might need up to 2 memory transfers if there is a block boundary inside interval for the tree (whole tree fits in memory)

## Dynamic cache oblivious trees

- only rough sketch

uses "ordered file maintenance"

- maintain n items in an array of size O(n) with gaps of size O(1)
- updates: delete item, insert item between given two items (similar to linked list)
- update rewrites interval of size at most O(log^2 n) amortized in O(1) scans
- done by keeping appropriate density in a hierarchy of intervals
- we will not cover details

simpler version of data structure

- our keep elements in "ordered file" in a sorted order
- build a full binary tree on top of array (segment tree)
- each node stores maximum in its subtree
- tree stored in vEB order
- when array gets too full, double the size, rebuild everything

search

- search checks max in left child and decides to move left or right
- basically follows a path from root, uses O(log_B n) transfers

update

- search to find the leaf
- update "ordered file", resulting in changes in interval of size L (L=O(log^2 n) amortized)
- then update all ancestors of these values in the tree by postorder traversal

analysis of update

- let he size of interval update in ordered file be L (amortized O(log^2 n))
- again let k be height of vEM subtrees which have size <=B but next higher level has size >B
- size of these trees is at least sqrt(B)
- (1) ordered file needs O(L/B+1) transfers
- (2) consider bottom 2 levels: O(1+L/sqrt(B)) trees of size k need updating (+1 from ceilings)
- each tree stored in <=2 blocks
- postorder traversals revisits parent block between child blocks, but if M>=4B, stays in memory
- overall O(L/sqrt(B)) transfers for bottom two layers
- (3) at the top of 2 levels, at most O(L/sqrt(B)+1) nodes, all their ancestors also need updating (set of nodes X)
- vertices have 2 children in X, some only one child
- The number of vertices with 2 children is O(L/sqrt(B)+1), this is ok, even if each causes separate transfer
- Vertices with one child are arranged in at most 2 root-to-leaf paths, each path requires at most O(log_B n) transfers as in search
- Consider lca of all leaves in L: from root to lca, one a single path, there splits into 2. Each node in the left path has interval with rightmost point inside the updated interval. There are two possibilities: (A) the right subtree is completely inside the updated interval, which means all nodes in the right subtree 2 children inside; the path continues left (B) the right tree is only partially inside the updated interval, then the left subtree is outside and the path continues right

- (1)+(2)+(3): amortized O(log^2 n / sqrt(B) + log_B n) - too big for small values of B

improvement of update

- split sorted data into buckets, each bucket of size between (1/4) lg n and lg n
- each bucket stored sequentially on disk,
- minimum element from each bucket used as a leaf in the above data structure, now contains only O(n/log n) elements
- update/search inside bucket: 1 scan, O(log n/B)
- if a bucket gets too large, split into two buckets (similarly to B-trees)
- if a bucket gets too small, connect with adjacent bucket, maybe split again more evenly
- for example, if adjacent bucket more than (1/2) lg n, split again to get two buckets of size at least (3/8) lg n; for small adjacent bucket get a combined bucket with size at most 3/4 lg n; in either case at least (1/8) lg n from the end of admissible interval of bucket sizes (1/4)lg n ad lg n.

- splitting/connecting propagates as update to the structure for n/log n items
- but this happens at most once very (1/8)lg n updates of this particular bucket, so amortized time for update divided by log n
- however, O(log_B n) term still necessary for search

## Sources

- Prednaska L07 Erika Demaina z MIT: http://courses.csail.mit.edu/6.851/spring12/lectures/
- Kapitola o B-trees z Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. (ja som pouzila 3. vydanie)
- online algoritmy a paging http://courses.csail.mit.edu/6.854/03/scribe/scribe19.ps