**Vybrané partie z dátových štruktúr2-INF-237, LS 2016/17**

# Amortizované vyhľadávacie stromy a link-cut stromy

**Predbežná verzia poznámok, nebojte sa upravovať: pripojiť informácie z prezentácie, pridať viac vysvetľujúceho textu a pod.**

## Scapegoat trees

### Introduction

- Lazy amortized binary search trees
- Relatively simple to implement
- Do not require balancing information stored in nodes (needs only left child, right child and key)
- Insert and delete O(log n) amortized
- Search O(log n) worst-case
- Height of tree always O(log n)
- For now let us consider only insert and search
- Store the total number of nodes
*n* - Invariant: keep the height of the tree at most
- Note: 3/2 can be changed to 1/alpha for alpha in (1/2,1)

- Let D(v) denotes the size (the number of nodes) of subtree rooted at v

### Insert

- Start with a usual insert, as in unbalanced BST
- If a node inserted at higher depth, restructure as follows
- Follow the path back from inserted node x to the root
- Find the first node u on this path and its parent p for which D(u)/D(p) > 2/3
- This means that node p is very unbalanced - one of its children containing more than 2/3 of the total subtree size
- Node p will be called the
*scapegoat* - We will show later that a scapegoat always exists on the path to root (Lemma 1)

- Completely rebuild subtree rooted at p so that it contains the same values but is completely balanced
- This is done by traversing entire subtree in inorder, storing keys in an array (sorted)
- Root of the subtree will be the median of the array, left and right subtrees constructed recursively
- Running time O(D(p))
- This can be also done using linked lists built from right pointers in the node, with only O(log n) auxiliary memory for stack
- The height was before violated by 1, this improves it at least by 1, so the invariant is satisfied
- Why we always get improvement by at least 1:
- Let v be a sibling of u, if no sibling set D(v)=0
- D(u) > 2D(p)/3 > D(p)/3 >= D(p)-D(u), do D(p)-D(u)<=D(u)-1
- D(v) = D(p)-D(u)-1 <= D(u)-2
- so the sizes of left and right subtree of p differ by at least 2
- Let height of D(u) be h. The last level contains only the last inserted node, so D(u)-1 nodes can fit to h-1 levels
- In the rebuilt tree, the size of left and right subtree will differ by at most 1 and thus both will have at most D(u)-1 nodes and so they can fit into h-1 levels; the hight will thus decrease by at least 1

- Finding the scapegoat also takes O(D(p)) because we need to compute D(v) at every visited node
- Each computation in time O(D(v))
- But up to node u, child has always size at most 2/3 of parent
- Therefore sizes of subtrees for which we compute D(v) grow exponentially (geometric sum) and are dominated by the last 2 computations of D(u) and D(p)
- TODO: formulate lemma to imply both this and Lemma 1

Lemma 1: If a node x in a tree with n nodes is in depth greater than , then on the path from x to the root there is a node u and it parent p such that D(u)/D(p) > 2/3

Proof:

- Assume that this is not the case, let nodes nodes on the path from x to root be v_k, v_{k-1},...v_0, where v_k=x and v_0 is the root and
- D(v_0) = n.
- Since D(v_i)<=2/3 D(v_{i-1}), we have D(v_i) <= (2/3)^i n.
- Thus we have D(x)<1, which is a contradiction, because the subtree rooted at
*x*contains at least one node

### Amortized analysis of insert

- Potential of a node v with children x,y:
- if |D(x)-D(y)| <= 1 (sizes of subtrees approximately the same)

- Potential of tree: sum of potentials of subtrees
- Insert in depth d except rebuilding:
- real cost O(d)
- change of Phi at most 3d, because potential increases by at most 3 for nodes on the path to newly inserted vertex and does not change elsewhere
- amortized cost O(d) = O(log n)

- Rebuilding:
- real cost m = D(p) where p is the scapegoat
- let u and v be children of p before rebuilt, D(u)>D(v)
- m = D(p)=D(u)+D(v)+1, D(u)>2m/3, D(v)<m/3, D(u)-D(v)>1/3 m
- before rebuilt Phi(p) > m-3
- potential in all nodes in the rebuilt subtree is 0
- change of Phi <= -m +3
- amortized cost O(1)

### Delete

- Very lazy version
- Do not delete nodes, only mark them as deleted
- Keep counter of deleted nodes
- When more than half nodes marked as deleted, completely rebuild the whole tree without them
- Each lazy delete pays a constant towards future rebuild. At least n/2 deleted before next rebuild.
- Variant: delete nodes as in unbalanced binary tree. This never increases height. Rebuild when many nodes deleted.

### Uses of scapegoat trees

Scapegoat trees are useful if nodes of tree hold auxiliary information which is hard to rebuild after rotation

- we will see an example in range trees used in geometry
- another simple example from Morin's book Open Data Structures:
- we want to maintain a sequence of elements (conceptually something like a linked list)
- insert gets a pointer to one node of the list and inserts a new node before it, returns pointer to the new node
- compare gets pointers to two nodes from the same list and decided which is earlier in the list
- idea: store in a scapegoat tree, key is the position in the list. Each node holds the path from the root as a binary number stored in an int (height is logarithmic, so it should fit)
- Exercise: details?

### Sources

- https://en.wikipedia.org/wiki/Scapegoat_tree
- https://people.ksp.sk/~kuko/gnarley-trees/?page_id=55
- I. Galperin, R.L.Rivest. Scapegoat trees. SODA 1993 [1]
- Similar idea also A.Andersson. Improving partial rebuilding by using simple balance criteria. WADS 1989. [2]

- Pat Morin. Open Data Structures, chapter 8 [3]

## Splay trees

- intro on slides

### Intuition and examples

- if height of tree is large, search is expensive, we use potential
- potential of path-like trees should be large, balanced trees smaller
- what happens if we apply splay on a leaf on a path?
- side benefit: if we search for some nodes more often, they tend to be near the root

### Amortized analysis of splaying

- real cost: number of rotations
- D(x): number of descendants of node x, including x (size of a node)
- r(x): lg (D(x)) rank of a node (lg denotes log_2)
- Exercise: asymptotically estimate potential of a path and balanced tree with n nodes

Lemma1 (proof later): consider one step of splaying x (1 or 2 rotations)

- let r(x) be rank of x before splaying, r'(x) after splaying
- amortized cost of one step of splaying is at most
- 3(r'(x)-r(x)) for zig-zag and zig-zig
- 3(r'(x)-r(x))+1 for zig

Lemma2: Amortized cost of splaying x to the root in a tree with n nodes is O(log n)

- Let r_i be rank of x after i steps of splaying, let k be the number of steps
- amortized cost is at most
- (use Lemma 1, +1 is from the last step of splaying)

- telescopic sum, terms cancel, we get

Theorem: Amortized cost of insert, search and delete in splay tree is O(log n)

- search:
- walk down to x, then splay x (if x not found, splay the last visited node)
- amortized cost is O(log n)

- insert:
- insert x as in normal binary search tree, then splay x.
- Inserting increase rank of all nodes on the path P from root t to x
- consider node v and its parent p on this path P
- D'(v) = 1 + D(v) <= D(p)
- r'(v) <= r(p)
- change in potential
- amortized cost of insert + splaying is thus O(log n)

- delete:
- delete x as from binary search tree (may in fact delete node for predecessor or successor of x), then splay parent of deleted node
- deletion itself decreases potential, amortized cost of splaying is O(log n)
- amortized cost is O(log n)

- overall time in each operation proportional to the length of the path traversed and therefore to the number of rotations done in splaying
- therefore amortized time O(log n) for each operation.

Proof of Lemma 1:

- case zig: cost 1, r'(x) = r(y), all nodes except x and y keep their ranks
- amortized cost

- case zig-zig: cost 2,
- amortized cost
- we have r'(x) = r(z), r'(y)< r'(x), r(y)> r(x)
- therefore
- therefore we need (r(x)-r'(x))+(r'(z)-r'(x))<=-2 or lg (D(x)/D'(x)) + lg (D'(z)/D'(x)) <= -2
- observe that D(x)+D'(z) <= D'(x) (in fact D'(x)=D(x)+D(z)+1 where +1 corresponds to node y)
- let a = D(x)/D'(x), b = D'(z)/D'(x), we have a+b<=1
- binary logarithm is concave, i.e. (lg(a)+lg(b))/2 <= lg((a+b)/2)
- and if a+b<=1, lg((a+b)/2) <= lg(1/2) = -1
- lg (D(x)/D'(x)) + lg (D'(z)/D'(x)) = lg a + lg b <= -2

- case zig-zag is analogous
- amortized cost
- r'(x) = r(z), r(x)<=r(y)
- therefore am. cost at most
- we will prove that 2 <= 2r'(x)-r'(y)-r'(z) and this will imply amortized cost is at most 2(r'(x)-r(x))
- we have D'(y)+D'(z) <= D'(x), use the same logarithmic trick as before

### Adaptability of splay trees

If some element is accessed frequently, it tends to be located near the root, making access time shorter

**Weighted potential function**

- Assign each node x fixed weight w(x)>0
- D(x): the sum of weights in the subtree rooted at x
- r(x) = lg(D(x)) (rank of node x)

**Weighted version of Lemma 2:**

- Amortized cost of splaying x to the root is 1+3(r(t)-r(x)) = O(1+log (D(t)/D(x))), where t is the original root before splaying.
- Proof similar to Lemma 2
- By setting weight 1 for every node we get Lemma 2 as a corollary

**Static optimality theorem:**

- Starting with a tree with n nodes, execute a sequence of m find operations, where find(x) is done q(x)>=1 times.
- The total access time is

- where H is the entropy of the sequence of operations.

Proof:

- Set w(x)=q(x)/m. In weighted Lemma 2, w(T)<=1, D(x)>=w(x)=q(x)/m.
- Amortized cost of splaying x is thus O(1+log(m/q(x))); this is done q(x) times.
- After summing this expression for all operations we get the stated bound.
- However, we have to add a decrease of potential (using up money from the bank)
- with these weights, D(x)<=1, r(x)<=0 and thus Phi(x)<=0
- on the other hand, D(x)>w(x),
- Since q(x)>=1 and log(m/q(x))>=0, this is within stated upper bound

Note:

- Lower bound for a static tree is (see Melhorn 1975); our tree not static

**Sequential access theorem:**

- Starting from any tree with n nodes, splaying each node to the root once in the increasing order of the keys has total time O(n).
- Proved by Tarjan 1985, improved proof Elmasry 2004 (proof not covered in the course)
- Trivial upper bound from our amortized analysis is O(n log n)

### Collection of splay trees

Image now a collection of splay trees. The following operations can be done in O(log n) amortized time (each operation gets pointer to a node)

- findMin(v): find the minimum element in a tree containing v and make it root of that tree
- splay v, follow left pointers to min m, splay m, return m

- join(v1, v2): all elements in tree of v1 must be smaller than all elements in tree of v2. Join these two trees into one
- m = findMin(v2), splay(v1), connect v1 as a left child of m, return m
- connecting increases rank of m by at most log n

- splitAfter(x) - split tree containing v into 2 trees, one containing keys <=x, one containing keys > x, return root of the second tree
- splay(v), then cut away its right child and return it
- decreases rank of v

- splitBefore(v) - split tree containing v into 2 trees, one containing keys <x, one containing keys >=x, return root of the first tree
- analogous, cut away left child

### Sources

- Sleator, Daniel Dominic, and Robert Endre Tarjan. "Self-adjusting binary search trees." Journal of the ACM 32, no. 3 (1985): 652-686.
- Original publication of splay trees.

- Elmasry A. On the sequential access theorem and deque conjecture for splay trees. Theoretical Computer Science. 2004 Apr 10;314(3):459-66.
- Proof of an improved version of the sequential access theorem

- Mehlhorn K. Nearly optimal binary search trees. Acta Informatica. 1975 Dec 1;5(4):287-95.
- Proof of the lower bound on static optma binary search trees (theorem 2, page 292)

## Link-cut trees

### Union/find

Maintains a collection of disjoint sets, supports operations

- union(v, w): connects sets containing v and w
- find(v): returns representative element of the set containing v (can be used to test of v and w are in the same set)

Can be used to maintain connected components if we add edges to the graph, also useful in Kruskal's algorithm for minimum spanning tree

Implementation

- each set represented by a (non-binary) tree, in which each node v has a pointer to its parent v.p
- find(v) follows pointers to the root of the tree, returns the root
- union calls find for v and w and joins one with the other
- if we keep track of tree height and always join shorter tree below higher tree plus do path compression, we get amortized time O(alpha(m+n,n)) where alpha is inverse ackermann function, extremely slowly growing (n = number of elements, m = number of queries)

Exercise: implement as a collection of splay trees

- hint: want to use join for union, but what about the keys?
- splay trees actually do not need to store key values if we do not search

### Link/cut trees

Maintains a collection of disjoint rooted trees on *n* nodes

- findRoot(v) find root of tree containing
*v* - link(v,w)
*w*a root,*v*not in tree of*w*, make*w*a child of*v* - cut(v) cut edge connecting
*v*to its parent (*v*not a root)

O(log n) amortized per operation. We will show O(log^2) amortized time. Can be also modified to support these operations in worst-case O(log n) time, add more operations e.g. with weights in nodes.

Similar as union-find, but adds cut plus we cannot rearrange trees as needed as in union-find

### Simpler version for paths

Simpler version for paths (imagine them as going upwards):

- findPathHead(v) - highest element on path containing v
- linkPaths(v, w) - join paths containing v and w (head of v's path will remain head)
- splitPathAbove(v) - remove edge connecting v to its parent p, return some node in the path containing p
- splitPathBelow(v) - remove edge connecting v to its child c, return some node in the path containing c

Can be done by splay trees

- keep each path in a splay tree, using position as a key (not stored anywhere)
- findPathHead(v): findMin(v)
- linkPaths(v, w): join(v, w)
- splitPathAbove(v): splitBefore(v)
- splitPathBelow(v): splitAfter(v)

### Back to link/cut trees

- label each edge as dashed or solid; among children of each node at most one connected by a solid edge
- series of solid paths (some solid paths contain only a single vertex)
- each solid path represented as a splay tree as above
- each node remembers the dashed edge going to its parent (if any) - only in the head of a solid path
- expose(v): make v the lower end of a solid path to root (make all edges on the path to root solid and make other edges dashed as necessary)
- assume for now that this can be done in O(log^2 n) amortized time

- findRoot(v): findPathHead(expose(v))
- link(v,w):
- expose(v)
- linkPaths(v, w)

- cut(v):
- expose(v) note: v not a root and thus connected to its parent by a solid edge
- splitPathAbove(v)

expose(v)

- series of splices where each splice adds one solid edge on the path from v to root

y = cutPathBellow(v); if(y!=NULL) findPathHead(y).dashed = v; while(true) { x = findPathHead(v); w = x.dashed; if(w == NULL) break; x.dashed = NULL; q = splitPathBelow(w); if(q != NULL) { findPathHead(q).dashed = w; } linkPaths(w, x); v = x; }

- we will prove that m link-cut operations with n nodes cause O(m log n) splices which means O(log n) splices amortized per expose
- each splice O(log n) amortized time (from splay trees), therefore O(log^2 n) amortized time per link-cut operation

### Heavy-light decomposition

edge from v to its parent p is called heavy if D(v)> D(p)/2, otherwise it is light. Observations:

- each path from v to root at most lg n light edges because with each light edge the size of the subtree goes down by at least 1/2
- each node has at most one child connected to it by a heavy edge

### Proof that there are O(log n) splices amortized per expose

- potential function: number of heavy dashed edges
- real cost: number of splices (= number of iterations of the while loop in expose)

consider a single splice, there are two cases:

- case 1: new solid edge is heavy
- before it was dashed heavy edge, not it is solid heavy edge, so potential decreases by 1
- some edge might have been changed from solid to dashed, but this edge is for sure light, because its sibling is heavy
- overall change of potential is therefore -1, real cost 1, amortized cost 0

- case 2: new solid edge is light
- change of potential is at most 1, because we might have created one new dashed edge and it is possible that this edge is heavy
- real cost is 1, therefore amortized cost is at most 2

Amortized cost of the whole expose is therefore 2L+1, where L is the number of new light solid edges created in expose

- +1 comes from the child of v which was potentially changed from solid to dashed in the first step
- all new light solid edges are light edges on the path from v to root, therefore L=O(log n)
- the total amortized cost of expose is therefore O(log n)

some operations change the tree and create new heavy dashed edges from light dashed edges

- consider edge from
*u*to its parent*p* - it can become heavy if
*u*becomes heavier by linking in its subtree- this never happens because edge
*u->p*would be solid before linking

- this never happens because edge
- it can also become heavy if some other child
*x*of*p*becomes lighter by cutting something in its subtree- after cutting,
*x*is connected to*p*by a light edge - there are at most lg
*n*vertices on the path exposed by cutting that are light after the cut and can play role of*x* - each
*x*has at most one sibling*u*that becomes heavy - therefore there are at most lg
*n*new heavy dashed edges, which adds O(log n) splices to amortized cost of cut

- after cutting,

### Fully dynamic connectivity

- insert delete edges, insert delete isolated vertices, test if there is a path from v to w O(log^2 n) amortized
- Holm, Jacob, Kristian De Lichtenberg, and Mikkel Thorup. "Poly-logarithmic deterministic fully-dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity." Journal of the ACM (JACM) 48, no. 4 (2001): 723-760.
- uses link-cut trees as components
- good idea for presentation?

- if no delete, can use union-find

### Applications

- At the time of publication improved best known running time of maximum flow problem from O(nm log^2 n) to O(nm log n). Since then some improvements.
- lca(u,v) least common ancestor:
- expose(u)
- expose(v)
- w = findPathHead(u)
- if w.dashed==null return u else return w.dashed
- we have seen O(1) lca after preprocessing a static tree, this would be O(log n) dynamic

### Sources

- Sleator, Daniel D., and Robert Endre Tarjan. ``A data structure for dynamic trees.
*Journal of computer and system sciences 26, no. 3 (1983): 362-391.*

- Sleator, Daniel Dominic, and Robert Endre Tarjan. "Self-adjusting binary search trees." Journal of the ACM 32, no. 3 (1985): 652-686.

- Erik Demaine and his students. "Notes for Lecture 19 Dynamic graphs" MIT course 6.851: Advanced Data Structures, Spring 2012. [4]

- Mehta, Dinesh P. and Sartaj Sahni eds. "Handbook of data structures and applications." CRC Press, 2004. Section 35.2