## 30 October 2007

### Optimal alphabetic binary trees

An astute reader pointed out that my "proof" below cannot be correct. See the comment.

I was thinking about how I would solve the two exercises that I proposed in the previous post. (I should stop this habit of proposing exercises' for which I have only a hunch that they are doable.) Anyway, I bumped into this interesting problem related to Huffman coding.

Problem. We are given the weights w1,…,wn and we must construct a binary tree that has them as leafs and minimizes ∑k wk lk, where lk is the length of the path from the root to the leaf with weight wk. The order of the leafs cannot be changed. Prove the correctness of the greedy algorithm that always glues the lightest leftmost pair of trees.

Comment. If the weights are 1,2,3,4,5,6 then the algorithm proceeds as follows: (1,2),3,4,5,6 ↝ ((1,2),3),4,5,6 ↝ ((1,2),3),(4,5),6 ↝ (((1,2),3),(4,5)),6 ↝ ((((1,2),3),(4,5)),6). At each step a pair of adjacent trees is glued. The cost of the result is 54 and you can't do better. There are two differences between this and Huffman's algorithm. The important difference is that we are not allowed to shuffle the numbers. The trivial difference is that in Huffman's algorithm the numbers are usually probabilities that add up to 1. If you thought Huffman coding is about compressing data then you were right. But it's also about doing binary search well. Read on.

Proof. The cost ∑k wk lk can be written as ∑x W(x), where x ranges over all the internal nodes and W(x) is the weight of the tree rooted in x (which you get by adding the leaf weights). The children of an optimal binary tree must be optimal binary trees themselves. (Otherwise the overall cost could be improved by replacing them with subtrees that are optimal.) So far, we know that a dynamic programming solution would work.

The tree ((x,y),z) has cost 2 W(x) + 2 W(y) + W(z) and the tree (x,(y,z)) has cost W(x) + 2 W(y) + 2 W(z). (Here x, y, and z are roots of arbitrary trees.) This shows that if W(x)+W(y)>W(y)+W(z) then it is possible to improve the second tree is better than the first. In other words, an optimal solution of the form ((x,y),z) always has W(x)+W(y) ≤ W(y)+W(z). At this point we know that if the lightest pair of adjacent trees is unique then that's what we should glue.

What remains to be proved is that, in case of a tie, choosing the leftmost lightest pair is enough.

If the lightest pairs are disjoint then they should all be glued and the order does not matter. Otherwise the weights of of the trees must have one of the following configurations:

1. l a b a ba b r, where l>b and r>a
2. l a b a ba b a r, where l>b and r>b

possibly with a=b. (Also, l and r might be missing completely. That is, we may be at the edge of the forest.) The leftmost' strategy leads to

1. l(a+b)(a+b)⋯(a+b)r
2. l(a+b)(a+b)⋯(a+b)br

Now I have a good hunch that this is close to the end of the proof using an argument of the type “for a given number of leafs the optimal cost is a monotonic function of the sum of their weights,” but I'm a little tired. End of partial proof.

The algorithm described above can be implemented in O(n2) time, but I believe an implementation in O(nlgn) time is possible. In fact Wikipedia mentions that Hu-Tucker gave an algorithm with this upper bound and that algorithm might be very well close to what I have in mind (you put pairs of adjacent trees in a linked list and in a priority queue). For now I'm resistent to look up what others have done because I want to think about it myself a bit more.

How does this apply to solve the second exercise you ask? It's pretty simple. This tree corresponds to the optimal binary search strategy, if you set the weights to be the probabilities that the answer takes various values. So the next step is to analyze what happens when the weights are a geometric progression 1,q,q2,…

(For the curious, all this suggests that if n is not huge then the best strategy on average is to check ψn and then ψ12,… in order. This seems to give an average case lgn times better than the code I gave but at the cost of messing up the worst case. So it's not really a better solution. Note that I only expect people who tried to solve that exercise to understand this parenthetic paragraph.)

PS: This post is dedicated to my wife on her 28th birthday. Oh, and my grandmother had her 81th birthday four days ago. Happy birthday!