Theory and Practice: Reachability analysis

Compilers do reachability analysis (also known as dead code elimination) as an optimization that cuts down the size of the produced (machine/byte)-code. Mikoláš had the idea that dead-code should result in a warning because it is likely a bug and, moreover, if we use code annotations we can do a more precise analysis of what is reached and what is not. So we implemented this in the context of JML-annotated Java.

//@ requires x >= 0;
int f(int x) {
  if (x < 0) return 1 / 0;
  return 1;
}

The bytecode produced by the Sun Java compiler is the following.

int f(int);
  Code:
   0:   iload_1
   1:   ifge    8
   4:   iconst_1
   5:   iconst_0
   6:   idiv
   7:   ireturn
   8:   iconst_1
   9:   ireturn

As you can see, the bad division ends up in the bytecode and no warning is produced. But from the comments we see that the intention of the programmer is to push the obligation of providing a non-negative x to the clients of f. What a smart compiler should do is check at all call-sites that this obligation is fulfilled, detect that the condition in the if statement will always be false, and provide a warning saying so. That's basically what we do. (Note that we don't generate bytecode; that's left to the Java compiler.)

The first (boring) phase is to reduce the Java code to an intermediate representation that is much simpler. I'll handwave over this part. Suffice to say that there are many smart people working on how to do it and that current implementations (in ESC/Java and Spec^#, for example) are satisfactory.

The intermediate representation we get to work on is an acyclic control flow graph labeled with first order logic formulas. A node in this graph represents a statement in the program (roughly). The state of a program that executed a certain path in the flow graph is described by the conjunction of all formulas on that path. If we don't care about how a program gets to statement s we can take the disjunction (logical or) of all execution paths that lead to s. The result describes the state of a program about to execute s, irrespective of how the control flow got there. Such a formula is called the precondition of s. So if the graph has the edges a→ b, a→ c, b→ d, and c→ d, then we know that the statement (a∧ b)∨(a∧ c) is true about a program that is about to “execute” the statement in the node labelled d. (Of course, there are more efficient ways than looking at all paths.)

We then use a theorem prover to see if some statement precondition can never be true. (Alternatively, the set of states described by it is empty.) Theorem provers are reasonably good at answering this kind of questions. Still, they are quite slow, and asking the question for each statement results in an unusable extended static checker. We should be able to do much better. For example, if the flow graph is a chain a₁→ a₂→⋯→ a_n and we ask about a_n and the theorem prover says it is reachable, then we might as well conclude that everything is fine, solving the problem with 1 query instead of n. More generally, the answers given by the prover can be propagated in the flow graph using some simple rules. It should then be possible to use these rules to decide about which statements to ask the prover in order to minimize the expected total number of prover queries. I'll describe this in a subsequent post.

Theory and Practice

17 September 2007

Reachability analysis — part 1

No comments:

Post a Comment