CSCI 4380 Database Systems 2025 spring Instructor: Lei Yu Assistant Prof, Department of Computer Science Outline: • Test 2 grades released. • Makeup test grades will be available soon. • Lecture 21 Exercise after class: 48 hours • Conventional indexes • B-Trees The Main Purpose of Index Structures Speedup the search process 3 indexσa=6(R) blocks containing the desired tuples quickly figure out disks otherwise have to scan the entire R also need to handle dynamic changes of R Root B+Tree Example n=3 10 0 12 0 15 0 18 0 30 3 5 11 30 35 10 0 10 1 11 0 12 0 13 0 15 0 15 6 17 9 18 0 20 0 Sample non-leaf to keys to keys to keys to keys < 57 57£ k<81 81£k<95 ³95 57 81 95 Sample leaf node: From non-leaf node to next leaf in sequence 57 81 95 To re co rd w ith k ey 5 7 To re co rd w ith k ey 8 1 To re co rd w ith k ey 8 5 In textbook’s notation n=3 Leaf: Non-leaf: 30 35 30 30 35 30 B+tree rules q Rule 1. All leaves are at same lowest level (balanced tree) q Rule 2. Pointers in leaves point to records except for “sequence pointer” q Rule 3. Number of keys/pointers in nodes: 8 Max. # pointers Max. # keys Min. # pointers Min. # keys Non-leaf n+1 n é(n+1)/2ù é(n+1)/2ù-1 Leaf n+1 n ë(n+1)/2û+1 ë(n+1)/2û Root n+1 n 2 1 A B+tree node of order n pn+1knk2p2k1p1 p3 …… k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn pn k1 k2 … p1 p2 p3 Minimum Occupancy Rule (or Minimum Fill Factor Rule) • Balance and Predictable Height: To keep the tree from becoming lopsided or sparse, we enforce a rule that each node (except the root) must be at least half full. • Space Efficiency: Nodes map directly to disk blocks, and Accessing a half-empty block is a waste of I/O. Max. # pointers Max. # keys Min. # pointers Min. # keys Non-leaf n+1 n é(n+1)/2ù é(n+1)/2ù-1 Leaf n+1 n ë(n+1)/2û+1 ë(n+1)/2û Root n+1 n 2 1 Full node min. node Non-leaf Leaf n=3 12 0 15 0 18 0 30 3 5 11 30 35 co un ts e ve n if nu ll Search in a B+tree 1. Start from the root 2. Search in a leaf block 3. May not have to go to the data file 11 Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); Search in B+tree 12 A tree node Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Search in B+tree 13 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Search in B+tree 14 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root Search(*prt, 130) Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Search in B+tree 15 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root Search(*prt, 130) Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Search in B+tree 16 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root Search(*prt, 130) 100 £ 130 Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Search in B+tree 17 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root Search(*prt, 130) 100 £ 130 12 0 £ 13 0 < 15 0 Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Search in B+tree 18 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root Search(*prt, 130) 12 0 £ 13 0 < 15 0 130 =130 return 100 £ 130 Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Range Search in B+tree 19 To research all records whose key values are between k1 and k2: Range Search in B+tree 20 To research all records whose key values are between k1 and k2: Range-Search(ptr, k1, k2) 1. Call Search(ptr, k1); 2. Follow the sequence pointers until the search key value is larger than k2. Range Search in B+tree 21 Range-Search(ptr, k1, k2) 1. Call Search(ptr, k1); 2. Follow the sequence pointers until the search key value is larger than k2. 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Range Search in B+tree 22 Range-Search(ptr, k1, k2) 1. Call Search(ptr, k1); 2. Follow the sequence pointers until the search key value is larger than k2. 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root Range-Search(*prt, 50, 125) Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Range Search in B+tree 23 Range-Search(ptr, k1, k2) 1. Call Search(ptr, k1); 2. Follow the sequence pointers until the search key value is larger than k2. 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root Range-Search(*prt, 50, 125) Search(*prt, 50) Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Range Search in B+tree 24 Range-Search(ptr, k1, k2) 1. Call Search(ptr, k1); 2. Follow the sequence pointers until the search key value is larger than k2. 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root 50 < 1 00 30 £ 50 Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); Range-Search(*prt, 50, 125) Search(*prt, 50)A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Range Search in B+tree 25 Range-Search(ptr, k1, k2) 1. Call Search(ptr, k1); 2. Follow the sequence pointers until the search key value is larger than k2. 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 root 120 £ 125 return 50 < 1 00 30 £ 50 110 £ 125 101 £ 125 100 £ 125 return returnreturn 135 > 125 STOP 35 < 50 Not Return Search(ptr, k); \\ search a record of key value k in the subtree rooted at ptr \\ assume the B+tree is a dense index of order n Case 1. ptr is a leaf \\ pn+1 is the sequence pointer IF (k = ki) for a key ki in *ptr THEN return(pi); ELSE return(Null); Case 2. ptr is not a leaf find a key ki in *ptr such that ki-1 ≤ k < ki; return(Search(pi, k)); Range-Search(*prt, 50, 125) Search(*prt, 50)A tree node k1 k2 ‧‧‧‧‧‧ kn p1 p2 p3 pn+1pn Insert into B+tree 26 Insert into B+tree Basic idea: 1. Find the leaf L where the record r should be inserted; 2. If L has further room, then insert r into L, and return; 3. If L is full, spilt L plus r into two leaves (each is about half full): this causes an additional child for the parent P of L, thus we need to add a child to P; 4. If P is already full, then we have to split P and add an additional child to P’s parent … (recursively) 27 Insert into B+tree q Simple case (space available for new child) q Leaf overflow q Non-leaf overflow q New root 28 Insert into B+tree q Simple case (space available for new child) q Leaf overflow q Non-leaf overflow q New root 29 order n=3 30 I. Simple case: Insert key 32 100 30 40 3 5 11 30 31 order n=3 31 I. Simple case: Insert key 32 100 30 40 3 5 11 30 31 Insert(prt, 32) order n=3 32 I. Simple case: Insert key 32 100 30 40 3 5 11 30 31 Insert(prt, 32) 32 < 1 00 30 £ 32 <40 order n=3 33 I. Simple case: Insert key 32 100 30 40 3 5 11 30 31 Insert(prt, 32) 32 < 1 00 30 £ 32 <40 room for 32 order n=3 34 I. Simple case: Insert key 32 100 30 40 3 5 11 30 31 32 Insert(prt, 32) 32 < 1 00 30 £ 32 <40 Insert into B+tree q Simple case (space available for new child) q Leaf overflow q Non-leaf overflow q New root 35 Insert into B+tree q Simple case (space available for new child) q Leaf overflow q Non-leaf overflow q New root 36 Insert into B+tree q Simple case (space available for new child) q Leaf overflow q Non-leaf overflow q New root 37 Idea: when there is no space for a new child (all pointers are in use), split the node into two nodes, with both at least half full. 38 Complication: node overflow 39 Complication: Leaf overflow 40 pn+1 kn+1 p1 k1 … p k p k … … pn kn p* + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 Complication: Leaf overflow p k p’ 41 p1 k1 … p k --- p** p k … pn kn pn+1 kn+1 --- p* + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 pn+1 kn+1 p1 k1 … p k p k … … pn kn p* + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 Complication: Leaf overflow p k k p’ p k p’ 42 p1 k1 … p k --- p** p k … pn kn pn+1 kn+1 --- p* + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 pn+1 kn+1 p1 k1 … p k p k … … pn kn p* + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 Complication: Leaf overflow p k k p’ p k p’ 43 p1 k1 … p k --- p** p k … pn kn pn+1 kn+1 --- p* + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 pn+1 kn+1 p1 k1 … p k p k … … pn kn p* + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 Complication: Leaf overflow p k k p’ q p k p’ 44 p1 k1 … p k --- p** p k … pn kn pn+1 kn+1 --- p* p k k p’ q + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 pn+1 kn+1 p1 k1 … p k p k … … pn kn p* + 1 /2 p k p’ + 1 /2 + 1 /2 +1 + 1 /2 +1 Complication: Leaf overflow ? What is the key here? 45 p1 k1 … p k --- p** p k … pn kn pn+1 kn+1 --- p* p k k p’ q + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 pn+1 kn+1 p1 k1 … p k p k … … pn kn p* + 1 /2 p k p’ + 1 /2 + 1 /2 +1 + 1 /2 +1 Complication: Leaf overflow ? What is the key here? 46 p1 k1 … p k --- p** p k … pn kn pn+1 kn+1 --- p* p k k p’ q + 1 /2 + 1 /2 + 1 /2 +1 + 1 /2 +1 pn+1 kn+1 p1 k1 … p k p k … … pn kn p* + 1 /2 p k p’ + 1 /2 + 1 /2 +1 + 1 /2 +1 Complication: Leaf overflow + 1 /2 +1 k ≤ x < k; + 1 /2 +1 47 100 30 3 5 11 30 31 Leaf overflow: Insert key 7 (order n = 3) 48 100 30 3 5 11 30 31 Leaf overflow: Insert key 7 (order n = 3) 7 49 100 30 3 5 30 31 Leaf overflow: Insert key 7 (order n = 3) 7 11 50 100 30 3 5 30 31 Leaf overflow: Insert key 7 (order n = 3) 7 11 3 5 11 7 51 100 30 3 5 30 31 Leaf overflow: Insert key 7 (order n = 3) 7 11 3 5 11 7 52 100 30 3 5 30 31 Leaf overflow: Insert key 7 (order n = 3) 7 11 3 5 11 7 53 100 30 3 5 30 31 Leaf overflow: Insert key 7 (order n = 3) 7 11 3 5 11 7 54 100 7 30 3 5 30 31 Leaf overflow: Insert key 7 (order n = 3) 7 11 3 5 11 7 55 Complication: nonleaf overflow 56 kn+1 pn+2p1 k1 … p k p k p k … pn kn pn+1 ( + 1)/2( + 1)/2 -1 ( + 1)/2 -1 ( + 1)/2 p k’ p’ ( + 1)/2 +1 ( + 1)/2 +1 Complication: nonleaf overflow 57 kn+1 pn+2p1 k1 … p k p k p k … pn kn pn+1 ( + 1)/2( + 1)/2 -1 ( + 1)/2 -1 ( + 1)/2 p k’ p’ p1 k1 p2 k2 … p k p( + 1)/2 -1 ( + 1)/2 -1 p k … pn kn pn+1 kn+1 pn+2( + 1)/2 +1 k ( + 1)/2p k’ p’q ( + 1)/2 ( + 1)/2 +1 ( + 1)/2 +1 ( + 1)/2 +1 Complication: nonleaf overflow 58 kn+1 pn+2p1 k1 … p k p k p k … pn kn pn+1 ( + 1)/2( + 1)/2 -1 ( + 1)/2 -1 ( + 1)/2 p k’ p’ p1 k1 p2 k2 … p k p( + 1)/2 -1 ( + 1)/2 -1 p k … pn kn pn+1 kn+1 pn+2( + 1)/2 +1 k ( + 1)/2p k’ p’q ( + 1)/2 ( + 1)/2 +1 ( + 1)/2 +1 ( + 1)/2 +1 Complication: nonleaf overflow 59 kn+1 pn+2p1 k1 … p k p k p k … pn kn pn+1 ( + 1)/2( + 1)/2 -1 ( + 1)/2 -1 ( + 1)/2 p k’ p’ p1 k1 p2 k2 … p k p( + 1)/2 -1 ( + 1)/2 -1 p k … pn kn pn+1 kn+1 pn+2( + 1)/2 +1 k ( + 1)/2p k’ p’q ( + 1)/2 ( + 1)/2 +1 ( + 1)/2 +1 ( + 1)/2 +1 Complication: nonleaf overflow 60 kn+1 pn+2p1 k1 … p k p k p k … pn kn pn+1 ( + 1)/2( + 1)/2 -1 ( + 1)/2 -1 ( + 1)/2 p k’ p’ p1 k1 p2 k2 … p k p( + 1)/2 -1 ( + 1)/2 -1 p k … pn kn pn+1 kn+1 pn+2( + 1)/2 +1 k ( + 1)/2p k’ p’q ( + 1)/2 ( + 1)/2 +1 ( + 1)/2 +1 ( + 1)/2 +1 Complication: nonleaf overflow ? What is the key here? 61 kn+1 pn+2p1 k1 … p k p k p k … pn kn pn+1 ( + 1)/2( + 1)/2 -1 ( + 1)/2 -1 ( + 1)/2 p k’ p’ p1 k1 p2 k2 … p k p( + 1)/2 -1 ( + 1)/2 -1 p k … pn kn pn+1 kn+1 pn+2( + 1)/2 +1 k ( + 1)/2p k’ p’q ( + 1)/2 ( + 1)/2 +1 ( + 1)/2 +1 ( + 1)/2 +1 Complication: nonleaf overflow ? not used What is the key here? 62 kn+1 pn+2p1 k1 … p k p k p k … pn kn pn+1 ( + 1)/2( + 1)/2 -1 ( + 1)/2 -1 ( + 1)/2 p k’ p’ p1 k1 p2 k2 … p k p( + 1)/2 -1 ( + 1)/2 -1 p k … pn kn pn+1 kn+1 pn+2( + 1)/2 +1 k ( + 1)/2p k’ p’q ( + 1)/2 ( + 1)/2 +1 ( + 1)/2 +1 ( + 1)/2 +1 Complication: nonleaf overflow 63 Nonleaf overflow: Insert key 160 (order n = 3) 100 120 150 180 150 156 179 180 210 160 64 Nonleaf overflow: Insert key 160 (order n = 3) 100 120 150 180 150 156 180 210 150 156 179 160 160 179 65 Nonleaf overflow: Insert key 160 (order n = 3) 100 120 150 180 150 156 180 210 150 156 179 160 160 179 66 Nonleaf overflow: Insert key 160 (order n = 3) 100 120 150 180 150 156 180 210 150 156 179 160 160 179 160 67 Nonleaf overflow: Insert key 160 (order n = 3) 100 120 150 180 150 156 180 210 150 156 179 160 160 179 no room! 160 68 Nonleaf overflow: Insert key 160 (order n = 3) 100 120 150 156 180 210 150 156 179 160 160 179 120 150 180 160 160 180150 69 Nonleaf overflow: Insert key 160 (order n = 3) 100 120 150 156 180 210 150 156 179 160 160 179 160 180150 120 150 180 160 70 Nonleaf overflow: Insert key 160 (order n = 3) 100 150 120 150 156 180 210 150 156 179 160 160 179 160 180150 120 150 180 160 Delete in B+tree 71 Delete in B+tree Basic idea: 1. Find the leaf L where the record r should be deleted; 2. If L remains at least half-full after deleting r, then delete r, and return; 3. Else consider the sibling L’ of L; 3.1 If L’ is more than half-full, then move a record from L’ to L, and return; 3.2 Else combine L and L’ into a single leaf; 3.3 But now you need to consider the case of deleting a child from L’s parent … (recursively) 72 q Simple case: the node remains at least half-full after deletion. q Re-distribute keys among siblings q Coalesce with a sibling and delete a pointer from its father 73 Delete in B+tree q Simple case: the node remains at least half-full after deletion. q Re-distribute keys among siblings q Coalesce with a sibling and delete a pointer from its father 74 Delete in B+tree 75 order n=3Simple case: Delete key 35 76 order n=3Simple case: Delete key 35 105 10 40 3 5 8 10 20 35 Delete(prt, 35) 40 50 77 order n=3Simple case: Delete key 35 105 10 40 3 5 8 10 20 35 Delete(prt, 35) 40 50 Delete 35 78 order n=3Simple case: Delete key 35 105 10 40 3 5 8 10 20 35 Delete(prt, 35) 40 50 After deletion, the node still keeps at least half-full 79 order n=3Simple case: Delete key 35 105 10 40 3 5 8 10 20 40 50 q Simple case: the node remains at least half-full after deletion. q Re-distribute keys among siblings q Coalesce with a sibling and delete a pointer from its father 80 Delete in B+tree 81 Key re-distribution at leaves p1 k1 … pi ki --- p’ k’ p k p” k” q1 h1 q2 h2 … qt ht --- p* q* i = ë(n+1)/2û - 1 t > ë(n+1)/2û 82 p1 k1 … pi ki --- p’ k’ p k p” k” q1 h1 q2 h2 … qt ht --- p* q* p1 k1 … pi ki q1 h1 --- p’ k’ p k p” k” q2 h2 … qt ht --- p* q* i = ë(n+1)/2û - 1 t > ë(n+1)/2û Key re-distribution at leaves 83 p1 k1 … pi ki --- p’ k’ p k p” k” q1 h1 q2 h2 … qt ht --- p* q* p1 k1 … pi ki q1 h1 --- p’ k’ p k p” k” q2 h2 … qt ht --- p* q* h2 i = ë(n+1)/2û - 1 t > ë(n+1)/2û Key re-distribution at leaves 84 order n=3 Key re-distribution at leaves: Delete 5 85 order n=3 105 10 40 3 5 10 20 35 Delete(prt, 5) 40 50 Key re-distribution at leaves: Delete 5 86 105 10 40 3 5 10 20 35 Delete(prt, 5) 40 50 Delete 5 order n=3 Key re-distribution at leaves: Delete 5 87 105 10 40 3 5 10 20 35 Delete(prt, 5) 40 50 order n=3 Key re-distribution at leaves: Delete 5 88 105 10 40 3 5 10 20 35 Delete(prt, 5) 40 50 Less than half-full !! order n=3 Key re-distribution at leaves: Delete 5 89 105 10 40 3 5 10 20 35 Delete(prt, 5) 40 50 Look at the sibling, which is more than half-full, so we can redistribute the keysa order n=3 Key re-distribution at leaves: Delete 5 90 105 10 40 3 5 10 20 35 Delete(prt, 5) 40 50 Look at the sibling, which is more than half-full, so we can redistribute the keys order n=3 Key re-distribution at leaves: Delete 5 91 105 10 40 3 10 20 35 Delete(prt, 5) 40 50 redistribution 3 10 20 35 order n=3 Key re-distribution at leaves: Delete 5 92 105 10 40 3 10 20 35 Delete(prt, 5) 40 50 redistribution 3 10 20 35 20 order n=3 Key re-distribution at leaves: Delete 5 93 105 10 40 3 10 20 35 Delete(prt, 5) 40 50 redistribution 3 10 20 35 20 20 order n=3 Key re-distribution at leaves: Delete 5 94 105 20 40 3 10 20 35 40 50 order n=3 Key re-distribution at leaves: Delete 5 95 Key re-distribution at nonleaves 96 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù 97 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù 98 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - p1 k1 … pi ki pi+1 q1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù 99 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - p1 k1 … pi ki pi+1 q1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - ? i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù 100 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - p1 k1 … pi ki pi+1 q1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - ? i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù 101 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - p1 k1 … pi ki pi+1 q1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - k’ i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù 102 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - p1 k1 … pi ki pi+1 q1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - k’ ? i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù 103 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - p1 k1 … pi ki pi+1 q1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - k’ ? i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù 104 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - p1 k1 … pi ki pi+1 q1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - k’ h1 i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù 105 Key re-distribution at nonleaves p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 q2 h2 … qt ht qt+1 - p1 k1 … pi ki pi+1 q1 p’ k’ p k p” k” q2 h2 … qt ht qt+1 k’ h1 i+1 = é(n+1)/2ù - 1 t+1> é(n+1)/2ù q Simple case: the node remains at least half-full after deletion. q Re-distribute keys among siblings q Coalesce with a sibling and delete a pointer from its father 106 Delete in B+tree q Simple case: the node remains at least half-full after deletion. q Re-distribute keys among siblings q Coalesce with a sibling and delete a pointer from its father 107 Delete in B+tree Observation: when two siblings both are no more than half full, coalesce them into a single node (which is nearly full) 108 Node Coalescence 109 Leaf Coalescence 110 p1 k1 … pi ki --- p’ k’ p k p” k” q1 h1 … qt ht --- p* q* i = ë(n+1)/2û - 1 t = ë(n+1)/2û Leaf Coalescence 111 p1 k1 … pi ki --- p’ k’ p k p” k” q1 h1 … qt ht --- p* q* p1 k1 … pi ki p’ k’ p k p” k” q1 h1 … qt ht - q* q* i = ë(n+1)/2û - 1 t = ë(n+1)/2û Leaf Coalescence 112 p1 k1 … pi ki --- p’ k’ p k p” k” q1 h1 … qt ht --- p* q* p1 k1 … pi ki p’ k’ p k p” k” q1 h1 … qt ht - q* q* i = ë(n+1)/2û - 1 t = ë(n+1)/2û Leaf Coalescence 113 Leaf coalescence : Delete key 5 114 order n=3 Leaf coalescence : Delete key 5 38 10 3 5 10 20 Delete(prt, 5) 40 50 60 80 60 75 80 90 115 order n=3 Leaf coalescence : Delete key 5 38 10 3 5 10 20 Delete(prt, 5) 40 50 60 80 60 75 Delete 5 80 90 116 order n=3 Leaf coalescence : Delete key 5 38 10 3 5 10 20 Delete(prt, 5) 40 50 60 80 60 75 80 90 117 order n=3 Leaf coalescence : Delete key 5 38 10 3 5 10 20 Delete(prt, 5) 40 50 60 80 60 75 The sibling is just half-full, so we should coalesce 80 90 118 order n=3 Leaf coalescence : Delete key 5 38 10 3 10 20 10 20 Delete(prt, 5) 40 50 60 80 60 75 3 5 10 20 Leaf coalescence 80 90 119 order n=3 Leaf coalescence : Delete key 5 38 10 3 10 20 10 20 Delete(prt, 5) 40 50 60 80 60 75 3 5 10 20 Less than half-full Leaf coalescence 80 90 120 order n=3 Leaf coalescence : Delete key 5 38 10 3 10 20 10 20 Delete(prt, 5) 40 50 60 80 60 75 3 5 10 20 Less than half-full Leaf coalescence more than half-full, so we can re- distribute pointers at nonleaves 80 90 121 order n=3 38 10 3 10 20 10 20 Delete(prt, 5) 40 50 60 80 60 75 3 5 10 20 Less than half-full Leaf coalescence more than half-full, so we can re- distribute pointers at nonleaves 80 90 Leaf coalescence : Delete key 5 122 order n=3 Key re-distribution at Nonleaves 38 10 3 10 20 10 20 Delete(prt, 5) 40 50 60 80 60 75 3 5 10 20 Less than half-full Leaf coalescence more than half-full, so we can re- distribute pointers at nonleaves 80 90 123 order n=3 Key re-distribution at Nonleaves 38 10 3 10 20 10 20 Delete(prt, 5) 40 50 60 80 60 75 3 5 10 20 Less than half-full Leaf coalescence 38 60 more than half-full, so we can re- distribute pointers at nonleaves 80 90 124 order n=3 Key re-distribution at Nonleaves 38 10 3 10 20 10 20 Delete(prt, 5) 40 50 60 80 60 75 3 5 10 20 Less than half-full Leaf coalescence 38 60 80 more than half-full, so we can re- distribute pointers at nonleaves 80 90 125 order n=3 Key re-distribution at Nonleaves 60 38 3 10 20 40 50 80 80 60 75 80 90 126 Nonleaf Coalescence 127 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 i+1 = é(n+1)/2ù - 1 t+1= é(n+1)/2ù Nonleaf Coalescence 128 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 q1 h1 … qt ht qt+1 i+1 = é(n+1)/2ù - 1 t+1= é(n+1)/2ù Nonleaf Coalescence 129 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 q1 h1 … qt ht qt+1 ? i+1 = é(n+1)/2ù - 1 t+1= é(n+1)/2ù Nonleaf Coalescence 130 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 q1 h1 … qt ht qt+1 ? i+1 = é(n+1)/2ù - 1 t+1= é(n+1)/2ù Nonleaf Coalescence 131 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 q1 h1 … qt ht qt+1 k i+1 = é(n+1)/2ù - 1 t+1= é(n+1)/2ù Nonleaf Coalescence 132 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 p1 k1 … pi ki pi+1 p’ k’ p k p” k” q1 h1 … qt ht qt+1 q1 h1 … qt ht qt+1 k i+1 = é(n+1)/2ù - 1 t+1= é(n+1)/2ù Nonleaf Coalescence 133 order n=3 Nonleaf coalescence : Delete key 5 134 order n=3 Nonleaf coalescence : Delete key 5 55 10 Delete(prt, 5) 55 58 60 61 723 5 10 20 135 order n=3 Nonleaf coalescence : Delete key 5 55 10 Delete(prt, 5) 55 58 60 61 723 5 10 20 delete 5 136 order n=3 Nonleaf coalescence : Delete key 5 55 10 Delete(prt, 5) 55 58 60 61 723 5 10 20 137 order n=3 Nonleaf coalescence : Delete key 5 55 10 Delete(prt, 5) 55 58 60 61 723 10 20 10 20 leaf coalescence 3 5 10 20 138 order n=3 Nonleaf coalescence : Delete key 5 55 10 Delete(prt, 5) 55 58 60 61 723 10 20 10 20 leaf coalescence 3 5 10 20 less than half-full 139 order n=3 Nonleaf coalescence : Delete key 5 55 10 Delete(prt, 5) 55 58 60 61 723 10 20 10 20 leaf coalescence 3 5 10 20 less than half-full just half-full, so we need to coalesce 140 order n=3 Nonleaf coalescence : Delete key 5 55 55 60 Delete(prt, 5) 55 58 60 61 723 10 20 10 20 leaf coalescence 3 5 10 20 55 60 nonleaf coalescence 141 order n=3 Nonleaf coalescence : Delete key 5 55 55 60 Delete(prt, 5) 55 58 60 61 723 10 20 10 20 leaf coalescence 3 5 10 20 55 60 nonleaf coalescence new root 142 order n=3 Nonleaf coalescence : Delete key 5 55 60 55 58 61 723 10 20 B+tree deletions in practice q Often, coalescing is not implemented Too hard and not worth it! 143 Pseudo Code for Insertion in B+tree Insert(pt, (p, k), (p', k')); \\ technically, the smallest key kmin in pt is also returned \\ (p, k) is a pointer-key pair to be inserted into the subtree rooted at pt; p' is a new sibling \\ of pt, if created, and k' is the smallest key value in p' ; Case 1. pt = (p1, k1, ..., pi, ki, --, pn+1) is a leaf \\ pn+1 is the sequence pointer IF (i < n) THEN insert (p, k) into pt; return(p' = Null, k' = 0); ELSE re-arrange (p1, k1), ..., (pn, kn), and (p, k) into (ρ1, κ1, ..., ρn+1, κn+1); create a new leaf p''; put (ρr+1, κr+1, …, ρn+1, κn+1, --, pn+1) in p''; \\ r = ë(n+1)/2û put (ρ1, κ1, ..., ρr, κr, --, p'') in pt; \\ pn+1 and p'' are sequence pointers in p'' and pt. IF pt is the root THEN create a new root with children pt and p'' (and key κr+1) ELSE return(p' = p'', k' = κr+1); Case 2. pt is not a leaf find a key ki in pt such that ki-1 ≤ k < ki; Insert(pi , (k, p), (k", p")); IF (p" = Null) THEN return(k' = 0, p' = Null); ELSE IF there is room in pt, THEN insert (k", p") into pt; return(k' = 0, p' = Null); ELSE re-arrange the content in pt and (k", p") into (ρ1, κ1, ..., ρn+1, κn+1, ρn+2); create a new node p''; put (ρr+1, κr+1, …, ρn+1, κn+1, ρn+2, -- ) in p''; \\ r = é(n+1)/2ù leave (ρ1, κ1, ..., ρr-1, κr-1, ρr , -- ) in pt; IF pt is the root THEN create a new root with children pt and p'' (and key κr) ELSE return(p' = p'', k' = κr ). 144 Pseudo Code for Deletion in a B+tree Delete(pt, (k,p), belowmin); \\ technically, the smallest key kmin in *pt is also returned \\ (k,p) is the data record to be deleted from the subtree rooted at pt; belowmin = true if \\ after deletion, pt has fewer than the required min # of pointers; Case 1. pt is a leaf delete (k,p) in pt; IF pt has at least ë(n+1)/2û data pointers OR pt is the root THEN return (belowmin = false) ELSE return (belowmin = true); Case 2. pt is not a leaf find a key ki in pt such that ki-1 ≤ k < ki; Delete(pi , (k, p), belowmin'); IF (not belowmin') THEN return(belowmin= false); ELSE IF pi has an adjacent sibling p' that has more than the required min # of pointers THEN move one key-pointer pair from p' to pi; ELSE combine pi with an adjacent sibling of pi into a single node; IF pt is the root with only one pointer pi THEN pt = pi; return(belowmin= false); IF pt has at least é(n+1)/2ù pointers OR pt is the root THEN return(belowmin= false) ELSE return(belowmin= true); 145 Application of B+ Tree • Range query 19. Range Search in B+tree • (a,b) • What if b is infinite? - to the end of the chain of leaves. • What if a is –infinite? – start from the first leaf. • Duplicate keys • B-tree on multiple attributes • Scan efficiency B-tree with Duplicate keys • A non-leaf node stores the first key of the first node that is not already present in the previous left sibling. • If there is no such key, then a null (placeholder) value is stored at this location. B-tree with Duplicate keys • A non-leaf node stores the first key of the first node that is not already present in the previous left sibling. • If there is no such key, then a null (placeholder) value is stored at this location. B-tree with Duplicate keys • A non-leaf node stores the first key of the first node that is not already present in the previous left sibling. • If there is no such key, then a null (placeholder) value is stored at this location. B-tree with Duplicate keys • A non-leaf node stores the first key of the first node that is not already present in the previous left sibling. • If there is no such key, then a null (placeholder) value is stored at this location. B-tree with Duplicate keys • A non-leaf node stores the first key of the first node that is not already present in the previous left sibling. • If there is no such key, then a null (placeholder) value is stored at this location. Any key search becomes range query. Index on multiple attributes • An index on multiple attributes like A,B will sort tuples first by A and then by B Index on R(X, Y) X = 'C' & Y = 8 X = 'J' & 3 <= Y <= 8 Search for J,3 - J,8 'C' <= X <= 'E' & Y = 4 Search for C,4 - E,4 'C' <= X <= 'E' & 3 <= Y <= 8 Search for C,3 - E,8 Y <= 5 B-tree indices • Each node in a B-tree is a disk page (stored on disk) • Each B-tree is created on one or more key values • Leaf nodes store one entry for each tuple: each entry has a key value and a pointer to a tuple (dense index: one leaf node for each tuple) Example • 92,000 tuples in instructors Index on instructors(coursecode, instructor) • To index each tuple: 9 (course code) + 16 (instructor) + 20 byte (for tuple address) 45 bytes per tuple! • Index page (assume) is 8KB Store: 8000/45 = about 177 tuples per node • Each node must have: 177 tuples max (88 tuples min) Assume each node is 100% full (mostly): • At leaf level: ceil(92,000/177) = 520 nodes in the leaf level • The level above: ceil(520/177) = 3 nodes in the next internal level The level above: ceil(3/177) = 1 node (root) • Assume each node is 50% full (mostly): • At leaf level: 1040 nodes At a level above: 12 nodes At a level above: 1 node Cost of queries/index search • Number of disk pages read/written • For any search, find the first leaf node in the search and then scan leaf nodes left to right following sibling pointers until the end of the range is reached. This will find all the matching tuples for a query. • Depending on the query, these tuples may need to be read from disk as well. Example • R(A,B,C,D,E), Assume: 100,000 tuples 10,000 pages ( 10 tuples per page) • Index I1 on R(A): 400 pages at leaf, 3 levels (root+internal+leaf) • Index I2 on R(B): 400 pages at leaf, 3 levels (root+internal+leaf) • Index I3 on R(A,B): 1,000 pages at leaf, 3 levels (root+internal+leaf) • Index I4 on R(B,A): 1,000 pages at leaf, 3 levels (root+internal+leaf) Example • SELECT A,B,C,D,E FROM R WHERE 10 <= A and A <= 20 ; (returns: 100 tuples) Query plan 1 -> Scan the whole relation: Cost = 10,000 Query plan 2: Scan I1 (find tuple pointers that match 10<=A<=20) + Read all tuples that I find based on the index from relation R 100,000/400 = 250 tuples per node 100 tuples for this query will fit in a single leaf node! Cost to scan the index = 1 (root) + 1 (internal) + 1 or 2 (leaf) = 4 max To read the matching 100 tuples: I need to read pages of R How many pages do I need to read? Best case: 10 pages (all packed sorted by A) + 4 = 14 pages Worst case: 100 Cost of this query (max) : 4 + 100 = 104 pages Example • Query plan 3: Use index I3: • 100,000 (tuples of R) / 1,000 = 100 tuples per node • Cost of index search = 4 (max) • Cost of searching the relation = 100 Total cost = 104 (max) • Query plan 4: Use index I4: Index I4 on R(B,A): 1,000 pages at leaf, 3 levels (root+internal+leaf) • Scan all leaf, cost of index search = 1 (root) + 1 (internal) + 1,000 (all leaf) = 1002 • Scan the tuples = 100 • Cost = 1,102 Example • SELECT B FROM R WHERE 10 <= A and A <= 20 ; (returns: 100 tuples) • Query plans: • 1. Scan all of R = 10,000 • 2. Use I1 on R(A) = 104 (max) • 3. Use I3 on R(A,B) = 4 (index only search) • 4. Use I4 on R(B,A) = 1,002 (index only search)
学霸联盟