Theory and Practice: Program Analysis, from 2000 to 2009

Very extremely roughly, static program analysis is the area concerned with looking at the text of computer programs and deciding based on that whether they are crap or not. There is, of course, a very easy way to write a sound program analyzer. In Python3 it would look like this: print('crap'). It's sound because it never lets bad programs go without calling them out. But you often want something a bit more friendly (aka complete, aka precise). Being precise is difficult, and that's why this is still an area of active research. Here are few papers from the decade 2000–2009 that are influential, and pointers to associated tools.

Extended Static Checking for Java. This paper presents a tool, ESC/Java, that finds bugs in annotated Java programs. An annotation could say, for example, that an array should be sorted. The basic technique, verification-condition generation, was developed in a community that loathes false negatives; that is, bugs that go uncaught. But ESC/Java recognizes that a zero-tolerance for false negatives leads to a deluge of false positives, which in turn turns off users.

ESC/Java has a pipeline architecture. Java source code with annotations written as special comments is first translated to a simple imperative programming language, similar to Dijkstra's guarded commands. From this, a formula in first-order logic is built, which is fed to an automatic theorem prover. If the prover returns a counterexample, then it is translated into a trace through the original program, and printed as a warning. (The frontend/backend split makes it easy to reuse code. For example, I helped change the backend to also do reachability analysis, without having to care about any fancy feature of Java.)

When a method is checked, the bodies of the methods it calls are not visible, for speed. But the contract annotations of the called methods are used. (This is similar to how a Java compiler doesn't need to know the bodies of the called methods, but does need to know their signatures.)

ESC/Java has quite a few children and grandchildren, most of them living in Seattle. One of them is Spec#, which does for C# what ESC/Java did for Java. There are a few differences. For example, Spec# goes back to insisting on soundness, has a fancier intermediate representation (Boogie), comes with a more developed methodology for specifying object invariants, and uses an SMT solver (Z3).

Formal Certification of a Compiler Back-end or: Programming a Compiler with a Proof Assistant This paper reports the progress on a certified compiler from C to PowerPC. In 2006, at least, the limitations of that compiler were severe. The input language is not quite C, but rather a tiny C-like language called Cminor. The output uses less than half of PowerPC's instructions, plus a handful of unverified macros. The frontend of the compiler isn't verified. The compiler is implemented and verified in Coq. But, you don't run the verified code: You extract OCaml from Coq, compile it, and then run it. So, the compiler is not verified verified.

Despite these limitations, this is still impressive work. I'll give just two reasons for it. There is a group of researchers that devised a technique to generate evil tests for compilers (Csmith). The technique is so effective that it discovered hundreds of bugs in each C compiler it was tried on. Here is what those researchers have to say about CompCert, as the compiler presented in this paper is known:

The striking thing about our CompCert results is that the middleend bugs we found in all other compilers are absent. As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task. The apparent unbreakability of CompCert supports a strong argument that developing compiler optimizations within a proof framework, where safety checks are explicit and machine-checked, has tangible benefits for compiler users.
(from Finding and Understanding Bugs in C Compilers, via CompCert's website)

The second reason is simply the size of the project: more than 46000 lines. What do those lines do? A little over 1000 do what a compiler does. The others say what the compiler should do, and prove it.

So, what does a compiler do in 1000 lines? It seems to me that most of the work goes into coping with the finite limits on various hardware resources. For example, in C you don't have a limit on the number of variables, or at least not a low limit. But, hardware processors have a very small number of registers. Similarly, in C you don't have to worry much about what happens when you call a function, but in hardware you may need to explicitly move data around from the caller to the callee.

I should mention that there exist alternative approaches for ensuring that the output of a compiler is correct, such as proof-carrying code and translation validation. Section 2 of this paper gives a very nice explanation of these alternatives and of how they relate to certifying the compiler itself.

In the meantime CompCert evolved, and is a compiler you can use now. One interesting development is the addition of support for concurrency (CompCertTSO).

The SLAM Project: Debugging System Software via Static Analysis. This project increased the quality of Windows device-drivers. SLAM checks at compile time properties like ‘functions f and g must be called alternatively’. In theory, the applicability of such a tool is not limited to Windows device-drivers. But, it is easier to engineer ideas into use if you focus on a niche application. One of the main contributions of the SLAM project is the enormous engineering behind it. Before SLAM, people believed that some ideas will be useful in practice; after SLAM, people know that some ideas are useful in practice.

SLAM used several existing ideas, such as predicate abstraction and model checking. In predicate abstraction, the program is transformed into one that has only boolean variables. For example, instead of keeping track of an integer variable, we keep track only of whether the integer is 0. Presumably, we'd be interested in this predicate (‘is the variable 0’) because it affects the control flow. In model checking, the program (now with only boolean variables) is seen as a huge digraph. Roughly, each vertex/state corresponds to a place in the program text, and an assignment of (boolean) values to variables. The question is whether the error state is reachable from any initial state. To solve this question, SLAM uses binary decision diagrams.

In many cases, the existing ideas needed to be tweaked before being implemented. For example: Predicate abstraction existed, but it was not clear how to apply it in a modular way, in the presence of function calls. Model checking existed, but not for programs with recursive functions. And so on.

SLAM's industrial incarnation is the Static Driver Verifier. SDV comes with a set of specifications that all Windows device drivers should satisfy, a model for the environment in which device-drivers run, and scripts that make it easy to use. I think these parts are interesting because, in contrast to the rest of SLAM, they are not automated. Specifications must be written by hand, and the environment of your code must be written by hand. Could some of these task be automated? I'm sure some people are looking into it.

Finally, a key ingredient of SLAM is refinement. Here is SLAM thinking: ‘OK, I kept track of whether this variable is 0, but that didn't work. Let's see what happens if I also keep track whether it is positive.’ But, more on refinement in the comments on the next paper.

Lazy Abstraction. Building on the ideas of SLAM, the model checker BLAST gains speed by being lazy. As the adage says, the best way to optimize a piece of code is to remove it. The second best way is not to run it. BLAST uses the third best way: not to run it too often. Viewed from a different (but similarly high-level) perspective, BLAST illustrates another tension pervasive in computer science — the tension between correctness and efficiency.

CEGAR (counter-example guided abstraction refinement) is the basic architecture of both SLAM and BLAST. The simplest way to explain CEGAR is this: build model, verify model, refine model, go back to verify. (Here, the model is some abstraction of the real program.) The simplest way to explain it is also the simplest way to see that it is correct, and the simplest way to implement it. But, it is not the most efficient way. Refinement, you may guess, does not produce radically different models. To be efficient, you don't want to redo much verification work in every iteration.

This is what lazy abstraction means: refine the model locally, and reverify it locally. Instead of replacing a model with a new model, you just change a bit of it. Instead of verifying a new model, you just check the bit you changed, and perhaps a little around it. Here's a quote from the paper's introduction:

It is easy to envision extreme cases in which our algorithm achieves large savings. Suppose, for example, that the program flow graph make an initial choice between two unconnected subgraphs. Then, once one of the subgraphs has been verified with a given set of abstract predicates, it does not need to be revisited even if the other subgraph requires additional predicates.

I'm a fan of extreme cases. To see what happens in non-extreme cases you'll have to see the paper. If you are more into tools, check out BLAST and one of its ideological descendants, CPAChecker.

Mobile Values, New Names, and Secure Communication. There is flavour of pure theoreticians that brood over a mixture of programming languages and concurrent systems. Their favorite toys include the calculus of communicating systems (CCS), communicating sequential processes (CSP), and π-calculus. These are tiny programming languages, with clear semantics and nice accompanying theorems. But they are tiny. In general, there are good reasons to organize programming languages into a core, together with desugarings. This paper starts from the premise that having a tiny core is bad. Yes, you don't need to name functions in λ-calculus, but not doing so is cumbersome. Similarly, certain things are cumbersome to do in π-calculus.

The paper presents a version of π-calculus, called applied π-calculus, enhanced with several constructs: function names, equality, and a ‘floating let’. Applied π-calculus aims to be a language that is simple enough to make theoretical analysis feasible, but versatile enough to express realistic concurrent programs. The main source of realistic programs that the authors used are security protocols, including things like Diffie–Hellman key exchange.

Conclusion. Out of five papers, four are closely associated with tools. Programming language theoreticians like tools. Or, at least, they like to cite papers that talk about tools.

Theory and Practice

11 February 2014

Program Analysis, from 2000 to 2009

No comments:

Post a Comment