31 October 2007

David Parnas is wrong!

The article “Stop the Numbers Game” from the last CACM is an interesting read. But David Parnas is wrong.

Here is what he says.

Counting papers encourages (1) writing hastily written papers, (2) overly large groups, (3) self-plagiarism, (4) small and insignificant studies, and (5) half-baked ideas. Even more, it leads to (1) publishing pacts where a groups appears on all individual papers, (2) clique building, (3) `anything goes', (4) `bespoke research', (4) minimum publishable increment, and (5) organizing workshops and conferences for the only purpose of publishing a clique's papers. Even even more, authors are not always authors. Many of these result from referees not doing a good job.

These are good points.

Parnas then proposes a better strategy for evaluation: “When serving on recruiting, promotion, or grant-award committees, read the candidate's papers and evaluate the contents carefully.” In other words, the evaluation should be a thorough review. Here is what the author says about reviews in the same article: “Anyone with experience as an editor knows there is tremendous variation in the seriousness, objectivity, and care with which referees perform their task.” (Let's ignore the obnoxious tone of this sentence that implies that the reader is inexperienced if she disagrees.) Well, here's the crux: The same observation applies to Parnas' proposed solution! If the real reviews today don't work then why should we expect that some idealized review process will work better in the future? The only difference is that the review targets a bunch of papers of the same author instead of a single paper. That's hardly a reason to have different expectations for the quality of the review. If anything, it indicates that reviewing an author will be done even worse than paper reviewing is done today: It requires more time and researchers care a lot about their time.

Should we despair? No: The numbers' game is here to save us!

Forget for a while about papers and think about the web, Google, and PageRank. Why does it work? Because numbers are used in a smart way. First, ditch supercomputers. Many cheap computers can do a better job than a couple of very high-quality super-servers. Second, ditch librarians. If you remember, most of the Google concurrents thought portals are the future — web information organized by professionals. Well, they were wrong. Many cheap computers that are crunching numbers aggregating the opinions of many people in a smart way are doing a much, much better job. What's the essential piece of information PageRank uses? Links. That's like citations. Parnas complains that some citations are negative. So what? Some links are negative too and still Google works best. I would even say that negative or positive is less important than interesting. Even bad papers can be interesting and can teach us something!

Frankly, I find it alarming that a mathematically inclined mind (as a computer scientist should be!) advocates not using numbers.

30 October 2007

Optimal alphabetic binary trees

An astute reader pointed out that my "proof" below cannot be correct. See the comment.

I was thinking about how I would solve the two exercises that I proposed in the previous post. (I should stop this habit of proposing `exercises' for which I have only a hunch that they are doable.) Anyway, I bumped into this interesting problem related to Huffman coding.

Problem. We are given the weights w1,…,wn and we must construct a binary tree that has them as leafs and minimizes ∑k wk lk, where lk is the length of the path from the root to the leaf with weight wk. The order of the leafs cannot be changed. Prove the correctness of the greedy algorithm that always glues the lightest leftmost pair of trees.

Comment. If the weights are 1,2,3,4,5,6 then the algorithm proceeds as follows: (1,2),3,4,5,6 ↝ ((1,2),3),4,5,6 ↝ ((1,2),3),(4,5),6 ↝ (((1,2),3),(4,5)),6 ↝ ((((1,2),3),(4,5)),6). At each step a pair of adjacent trees is glued. The cost of the result is 54 and you can't do better. There are two differences between this and Huffman's algorithm. The important difference is that we are not allowed to shuffle the numbers. The trivial difference is that in Huffman's algorithm the numbers are usually probabilities that add up to 1. If you thought Huffman coding is about compressing data then you were right. But it's also about doing binary search well. Read on.

Proof. The cost ∑k wk lk can be written as ∑x W(x), where x ranges over all the internal nodes and W(x) is the weight of the tree rooted in x (which you get by adding the leaf weights). The children of an optimal binary tree must be optimal binary trees themselves. (Otherwise the overall cost could be improved by replacing them with subtrees that are optimal.) So far, we know that a dynamic programming solution would work.

The tree ((x,y),z) has cost 2 W(x) + 2 W(y) + W(z) and the tree (x,(y,z)) has cost W(x) + 2 W(y) + 2 W(z). (Here x, y, and z are roots of arbitrary trees.) This shows that if W(x)+W(y)>W(y)+W(z) then it is possible to improve the second tree is better than the first. In other words, an optimal solution of the form ((x,y),z) always has W(x)+W(y) ≤ W(y)+W(z). At this point we know that if the lightest pair of adjacent trees is unique then that's what we should glue.

What remains to be proved is that, in case of a tie, choosing the leftmost lightest pair is enough.

If the lightest pairs are disjoint then they should all be glued and the order does not matter. Otherwise the weights of of the trees must have one of the following configurations:

  1. l a b a ba b r, where l>b and r>a
  2. l a b a ba b a r, where l>b and r>b

possibly with a=b. (Also, l and r might be missing completely. That is, we may be at the edge of the forest.) The `leftmost' strategy leads to

  1. l(a+b)(a+b)⋯(a+b)r
  2. l(a+b)(a+b)⋯(a+b)br

Now I have a good hunch that this is close to the end of the proof using an argument of the type “for a given number of leafs the optimal cost is a monotonic function of the sum of their weights,” but I'm a little tired. End of partial proof.

The algorithm described above can be implemented in O(n2) time, but I believe an implementation in O(nlgn) time is possible. In fact Wikipedia mentions that Hu-Tucker gave an algorithm with this upper bound and that algorithm might be very well close to what I have in mind (you put pairs of adjacent trees in a linked list and in a priority queue). For now I'm resistent to look up what others have done because I want to think about it myself a bit more.

How does this apply to solve the second exercise you ask? It's pretty simple. This tree corresponds to the optimal binary search strategy, if you set the weights to be the probabilities that the answer takes various values. So the next step is to analyze what happens when the weights are a geometric progression 1,q,q2,…

(For the curious, all this suggests that if n is not huge then the best strategy on average is to check ψn and then ψ12,… in order. This seems to give an average case lgn times better than the code I gave but at the cost of messing up the worst case. So it's not really a better solution. Note that I only expect people who tried to solve that exercise to understand this parenthetic paragraph.)

PS: This post is dedicated to my wife on her 28th birthday. Oh, and my grandmother had her 81th birthday four days ago. Happy birthday!

25 October 2007

Reachability analysis — part 2

In a previous post I described the motivation for doing reachability analysis on annotated code. That post ends with an algorithm design challenge. Suppose that we are given the simple path φ0→φ1→⋯→φn. The precondition of node i is ψi0∧⋯∧φi−1. Notice that ψ0=⊤. For simplicity, let us assume that φn=⊤. We wish to know which preconditions are satisfiable and which are not. We are allowed to use a prover, which is a big black box with two bulbs on it: the green one flashes if the last precondition fed to the prover is satisfiable and the red one flashes if the precondition is unsatisfiable. The bulbs consume a lot of electricity and we want to be environment friendly, so we try to reduce the number of prover queries. This instance of the problem can be solved using binary search.

int firstUnsat(vector<Formula> pre) {
  if (isSat(pre[pre.size() - 1])) return pre.size(); // all SAT
  int left = 0, right = pre.size() - 1;
  while (left + 1 < right) {
    // isSat(left) && !isSat(right)
    int middle = (left + right) / 2;
    (isSat(pre[middle]) ? left : right) = middle;
  }
  return right;
}

Exercise. Let's consider a Java program that is made of a sequence of n+1 statements. Each statement contains a `reachability bug' with a small (independent) probability є. The algorithm above essentially detects the earliest reachability bug. What is the average number of prover queries? (In the code, prover queries are represented by calls to isSat.)

Exercise. Prove that it is not possible to do better (asymptotically) than the algorithm above. (Edit: Perhaps a better question: "Is it possible to do better on average without sacrificing the worst case behavior?")

Now let's move on to a more general problem. Instead of starting with a simple path we start with a dag flow graph. For simplicity we can assume that there is only one source node, which we can take to be 0. The precondition of node 3 in the graph {0→1, 0→2, 1→3, 2→3} is ψ3=(φ0∧φ1)∨(φ0∧φ2). In other words, we look at all paths 0↝ 3 and take the disjunction. (Of course, there are better ways of computing something equivalent than looking at all paths.) What would be a good algorithm for finding which preconditions are satisfiable and which not when the flow graph is a dag? I'll write about it in the next part of this post series.

06 October 2007

Computer Science paper titles

Take the DBLP database of papers in Computer Science, grab all the titles, do a frequency count for the words, keep only nouns. You get the list of the 20 most used nouns in CS paper titles:

  1. systems
  2. system
  3. data
  4. analysis
  5. networks
  6. model
  7. design
  8. algorithm
  9. approach
  10. information
  11. time
  12. software
  13. distributed
  14. learning
  15. network
  16. performance
  17. parallel
  18. web
  19. control
  20. algorithms

(Note that I did not conflate singular with plural usages.) Remembering Shannon's theory, next time when you see a title made out of almost only these words you should realize that it is a CS paper, but not much more is revealed by the title.

02 October 2007

Car drivers versus computer scientists

Worse, today's national curriculum teaches pupils to use computers but not to understand them. It's training, rather than education. As Mr Rodd's BCS colleague David Evans puts it, it's as if we thought that "educating a nation of car drivers will create an automotive engineering industry."

telegraph.co.uk