December 24, 2012

This Week's Coding Practice - Binary Search Trees

This week's topic, as you may have guessed already, was Binary Search Trees (BST).  A Binary Tree (BT) is a hierarchical structure in which each node has one parent (with the exception of the root node, which is an orphan) and at most two children.  Nodes without children are called leaf nodes.  A BST is a special kind of BT that satisfies the following invariant: an in-order traversal of the tree yields nodes in ascending order.  An auto-balancing BST is a BST that also satisfies another invariant: the height of the tree (the maximum number of links from the root to a leaf node) needs to be relatively small.

The first invariant allows the BST to be searched efficiently: you start at the root, and at each node, you compare the value you're searching for and the value of the node.  If they are equal, you've found your node and you're all done. If the former is less than the latter, then you continue down to the left child of the node; otherwise, you go down the right child of the node.  If you can't continue because a particular child doesn't exist, then the BST doesn't contain the value you're looking for.  How long does this retrieval operation take?  Well, at each step you go down one level, so you will need at most h steps, where h is the height of the tree.

This is where the second invariant comes in.  Consider this example: an ordered list of n numbers is being added, one by one, to a BST that does not satisfy the second invariant.  What will the resulting tree look like after the nth addition?  It will resemble a linked list n elements long.  Searching through this list will be at most O(n).  The benefit of using such a BST over a simple linked list would be lost.  If, on the other hand, the BST "magically" kept its height relatively small, then you would be able to perform more effective lookups.  How small is "relatively small?"  Well, consider the number of nodes n (unknown) for a simple BT with height h: you would have 1 node at the top (the root), 2 nodes on the second level (children of the root), 4 nodes at the third level (grandchildren of the root) and so on.  This is a geometric progression with a common ratio of 2 and an initial value of 1, which sums to $2^h - 1$.  This is the value of n.  Obviously, the relationship between h and n is: $h = \lceil \log_2(n+1) \rceil$ .  The ceiling is there to account for the fact that not all non-leaf nodes will have two children.  This is the optimal height of a BST.  In practice, keeping the height of the tree at precisely this value is overkill, so the invariant is considered satisfied if the height is within some factor of this optimal value.  The benefit of satisfying these two invariants is assurance that retrievals will take at most O(log(n)) of the number of nodes in the tree.  The above condition implies that the left and right subtrees must be of similar heights (i.e. balanced).  This is the reason that such BSTs are called auto-balancing.

One main question remains: how does an auto-balancing BST maintain its height to satisfy the second invariant?  The answer depends on the implementation, of which there are several.  I'll discuss this in more detail in the next post.  For now, I leave you with the solution for this week's problem: how do you check that a BT is a auto-balancing BST?

As an aside, yet another hooray for StackExchange, which got me up and running with LaTeX on Blogger in no time.