## 19 June 2014

### PLDI 2014 — Compiler Validation via Equivalence Modulo Inputs

The title is stolen from a paper that was presented at PLDI. I did not see the presentation, so this summary is based on my reading of the paper.

The evolution of a programmer includes several predictable phases. Somewhere towards the beginning, there is a phase of extreme conceit. In this phase, the programmer believes that he is not the cause of any bug. Rather than concede this, the programmer would rather believe that a cosmic ray hit the processor and caused it to malfunction. Repeatedly. Or that the compiler is wrong. For some people this phase never ends, and they never become programmers.

Thus, anyone who has survived a non-negligible amount of time as a programmer has a healthy attitude towards bugs. Bugs are everywhere, but especially in newly written code. Old code that is fairly stable and has many users is pretty much bug-less by comparison. For example, most programmers spend their whole lives without finding even one bug in the compiler. And virtually all programmers spend their whole lives without finding a single bug in hardware.

And then there is another phase in one's career, which you attain after reading papers like the one reviewed here. Or, alternatively, after spending an afternoon browsing the bug-tracking system of some widely used compiler. Yes, compiler bugs are rare by comparison. Still, there are many such bugs.

Csmith is a tool that found hundreds of bugs in C compilers. The idea is simple: generate many programs and try to compile them. The generated programs typically take no input, do some computation, and print a result. You then compile and run these programs. If the compiler crashes, then that's a bug. If two different compilers generate executables that behave differently, then that's a bug.

Le, Afshari, and Su build on top of Csmith to find even more bugs. It turns out that many of the programs generated by Csmith contain unreachable statements. Since there is no input, unreachable statements are easy to detect: run the program once and observe statement coverage. Removing any subset of the unreachable statements should not change the semantics of the program. The authors generate such variants, then compile and run them. Often enough, a variant behaves differently from the original, even when both are compiled by the same compiler. That can happen only if there is a bug in the compiler.

It is rather remarkable that any bug is found by the procedure described above. It is even more remarkable that more than 100 bugs were found this way, in compilers such as gcc, clang, and icc. Why does this procedure expose bugs at all? Here is a fictional example inspired by one in the paper.

```c
int a, b;
void f(int c) {
    for (a = 0; a < 2; ++a) {
        if (c) for (;;) { b = 2147483647 + a; }
    }
}
int main() { f(0); }
```


If you are a compiler, you might notice that 2147483647 + a keeps being evaluated in the inner loop, has no side-effects, and always computes the same value. So, you decide to move it outside of that loop, between if(c) and for(;;). Next, you decide to also move it before the test if(c). Why not? Now you unroll the outer loop twice. At this point, function f looks, roughly, like this:

```c
void f(int c) {
    b = 2147483647 + 0; if (c) for (;;) {}
    b = 2147483647 + 1; if (c) for (;;) {}
    a = 2;
}
```


Finally, you look at 2147483647 + 1 and see a signed overflow. Yay! Now you can make demons fly into the programmer's nose.

What went wrong? The problem was the ‘why not?’ step. The code if(c) {b=...; for(;;) {}} does exhibit undefined behavior for most values of c, but not when c is 0. The transformed code b=...; if(c)for(;;){} exhibits undefined behavior unconditionally. Only after this transformation would the compiler be justified in concluding that calling f(0) may do anything.

How could an unreachable statement interrupt this chain of events and hence hide the bug? Here is an example:

```c
int a, b;
void f(int c) {
    for (a = 0; a < 2; ++a) {
        if (c) for (;;) { b = 2147483647 + a; ++a; }
    }
}
int main() { f(0); }
```


The first step in the previous chain of events was to notice that b = 2147483647 + a was repeatedly evaluated in a loop and did the same thing every time. That is no longer the case: the added ++a changes a on every iteration, so the statement is not loop-invariant and cannot be hoisted out of for(;;).

The next question is why this happens so often. The authors found 147 confirmed bugs in 11 months. That's a new bug every 2.3 days! The short answer is that I don't know. However, it must have something to do with what the programs generated by Csmith look like. From what I wrote above, it sounds like this procedure was applied only to programs produced by Csmith. In fact, the authors also applied it to open source projects and to existing compiler test-suites. But in these other settings the procedure is not nearly as effective. Unless I misunderstood something, when seeded with non-Csmith programs, this approach found only 4 bugs in 11 months.

In case you wonder what other things I left out or over-simplified, I recommend you look at the paper — it is a pleasure to read. Link: [Le, Afshari, Su, Compiler Validation via Equivalence Modulo Inputs, 2014].

John said...

Yeah, the success of their method mainly on Csmith output is a fun puzzle. Part of the explanation is that Csmith programs don't just contain dead code; they contain A LOT of dead code. Random logic tends to be mostly dead, for whatever reason.