0.2.5 B+-Trees
A B+-tree is a data structure to store vast amounts of information.
Typically B+-trees are used to store amounts of data that will not fit
in main system memory. To do this, secondary storage (usually disk)
is used to store the leaf nodes of the tree. Only the internal nodes
of the tree are stored in computer memory. In a B+-tree the leaf
nodes are the only ones that actually store data items. All other
nodes are called index nodes or i-nodes
and simply store ``guide'' values which allow us to traverse the tree
structure from the root down and arrive at the leaf node containing
the data item we seek. Because disk I/O is very slow in comparison to
memory access these leaf nodes store more than one data item each. In
fact, the data structure will perform best when the size of the leaf
nodes closely approximates the size of a disk sector under most
operating systems. Thus, when we search a B+-tree (by traversing from
the root node down to the proper data node) we still must read that
data node from the disk and search its contents. Another way to
improve the speed of a query operation is to keep a memory cache of
recently read leaf nodes.
The ancestor of the B+-tree is a structure known as a B-tree in which
data items can be stored in any node on the tree. A more complicated
and slightly more robust varient of the B-tree is called a B*-tree.
While conceptually simple in nature, the challenging aspect of B+-trees
is effecting their growth and collapse. When data nodes in the tree
become too full they split into two nodes. This process demands that
a new guide value be added to the parent index node so that future
queries can arbitrate between the two new nodes. However, adding this
new guide value to the parent may cause it, in turn, to split. In
fact, all the nodes on a path from a leaf to the root may split when
you add a new value to a leaf node. If the root node splits, a new
leaf node is created and your tree grows by one level.
Conversely, the deletion of data items in a leaf node may cause that
node to become too empty. When a data node becomes too empty,
neighboring nodes are examined and merged into underfull node. This
causes the deletion of a guide value in the parent index node which,
in turn, may cause it to become too empty. Much like the ripple
caused by an insertion operation, a key deletion may cause a
merge-delete wave to run from a leaf node all the way up to the
root. When the children of the root are merged into one and the root
node's only value is deleted the root index node is deallocated, its
child node becomes the new root, and the tree shrinks by one level.
As you can see, insertion and deletion processes are recursive in
nature and can cascade up or down the B+-tree affecting its shape
dramatically.
|