## 18 June 2014

### PLDI 2014 — Presentations I've Seen

These are tiny summaries of some of the talks from PLDI 2014.

I wish I could say these are taken from my notes, but I can't. Although I took notes during most of the talks, I also lost the notes. That's not too bad: I take notes mostly because it helps me focus, not because I read them after. Sometimes I do take a brief look at them after, but this is a secondary matter.

What exactly is inexact computing good for? Many people are familiar with randomized algorithms. The most widespread ones are probably universal hashing and quicksort. So, many people understand that randomization is useful in software. But few people understand that randomization can also be useful in hardware. This keynote address gave several examples. If you give an integrated circuit less power than it was designed for, then it starts to err. The probability of error does not vary linearly with the power: If you are willing to accept wrong results half of the time from some switch, then it is sufficient to consume much less than half the power. The second observation was that some switches are more important than others. For example, in an adder circuit, you'd probably want to reduce power mostly in the parts that process the less significant bits. The examples got more and more complicated, but I'll stop here.

Test-Driven Repair of Data Races in Structured Parallel Programs. The idea of this paper is very simple, and therefore nice. Suppose you have a program that is not supposed to have data races. At this point, most people worry about how to make sure that data races can't happen. In contrast, this paper says: Look, this particular input clearly exhibits this data race — how do I get rid of it in the cheapest way possible? One obvious question (which, being obvious, was asked) is how many tests/inputs does it usually take until you get rid of all data races. The answer: ‘Usually one test case is enough.’ Sounds like an extraordinary claim that requires extraordinary evidence. [Update: Rishi Surendran clarified in an email that what was said in the presentation (and what is in the paper) is weaker: One test was enough for the 12 benchmarks mentioned in the paper.]

How you rule out races depends on the underlying concurrency primitives. In this case they were async and finish. The first one, async, starts a child thread, which runs in parallel with the current one. The second one, finish, waits for all children to finish. The example used during the talk was cute, and you may enjoy viewing it as a puzzle. Here it is. Consider the following two pieces of (pseudo)code:

void merge_sort_rec(a, i, k) {
j = (i + k) / 2;
async merge_sort_rec(a, i, j);
async merge_sort_rec(a, j, k);
merge(a, i, j, k);
}
void merge_sort(a) { merge_sort_rec(a, 0, len(a)); }

void quick_sort_rec(a, i, k) {
if (i + 1 >= k) return;
j = partition(a, i, k);
async quick_sort_rec(a, i, j);
async quick_sort_rec(a, j, k);
}
void quick_sort(a) { quick_sort_rec(a, 0, len(a)); }

Question: Where should you insert finish? Well, this article solved this question automatically, with some heuristics.

Stochastic Optimization of Floating-Point Programs using Tunable Precision. The idea of this paper was to use a Markov Chain Monte Carlo simulation to optimize the inner loop of numeric programs. If you are not familiar with MCMC, here is a crash course. Imagine a huge digraph that comes with a cost function for vertices. ‘Huge’ means that you don't actually store the whole graph, but you can compute the neighbors of a given vertex. The goal is to randomly walk on this graph such that the time you spend on a vertex is proportional to the cost of that vertex. To do so, you pick probabilities for arcs. Imagine some sort of flow on the digraph: For arc $x \to y$, the flow is $p(x)p(x \to y)$. Think of $p(x)$ as the label of vertex $x$, and of $p(x \to y)$ as the label of the arc $x \to y$. Both labels are probabilities, so they are in the interval $[0,1]$. In the stationary case, you expect the flow going into a vertex to equal the flow going out of a vertex. (Buzzword: ‘global balance condition’) This holds if for any two vertices the net flow between them is zero: $p(x)p(x \to y)=p(y)p(y \to x)$ (Buzzword: ‘detailed balance condition’) If you are given the probabilities for vertices and you want to pick the probabilities for arcs such that the detailed balance condition holds you can do as follows. Let's assume, without loss of generality, that $p(x) \ge p(y)$. You can pick $p(y \to x)=1$ and $p(x \to y)=p(y)/p(x)$. If you then do a random walk with these probabilities, then you can use another buzzword: What you do is using the Metropolis algorithm. By the way, the interpretations of these choices are: $p(y \to x)=1$ means that you always go to a neighbor with higher score; $0<p(x \to y)<1$ means that there is some chance to go to neighbors that have worse scores.

But I digress. In this paper the vertices are programs. Their cost function is defined by comparison with a reference implementation. The score is higher when a certain piece of memory is as close as possible to the reference, while interpreting the bits as representing floating point numbers. But, also, the score is higher when the program is shorter and faster. By tweaking the score function, and then doing a MCMC search the authors were able to find implementations that are better than those designed by humans. The way I see it, this is a simple but cool application of methods from statistical physics. And, just to be clear, I mean ‘simple’ in a positive way. I like simple.

Expressing and Verifying Probabilistic Assertions. For some reason, the presenter seemed to claim that this paper introduces the notion of probabilistic assertion. I suspect that people doing probabilistic model checking would disagree. However, the method of verifying the assertions did not ring a bell, so it is novel. The process works in three phases.

1. VCgen. Actually, the authors called it ‘symbolic execution’. Of course, symbolic execution and VCgen (verification condition generation) are rather similar; see, for example, [Matthews et al., Verification condition generation via theorem proving, 2006]. Still, I would say VCGen for what they do in this paper because the result is an expression, whose atoms are random variables. This expression is a Bayesian network.
2. Simplification. The next phase applies various algebraic simplifications. For some reason, the authors split this phase into two: partial evaluation of non-random parts, and the others. Also, the partial evaluation was slotted in the first phase!
3. Sampling. If simplification doesn't result in something really simple, then the expression is sampled. Since you do sampling, you conclude things like ‘with confidence $c$, the assertion holds with probability $p$’

The authors implemented a tool that works on LLVM bitcode. I think it is very nice to have such a tool. I'll try to post a link to it. (By googling, I failed to find a link.)

FlashExtract: A Framework for Data Extraction by Examples. Suppose you have a text file with many addresses, each taking four lines. You would like to extract the postcode of each. One option is to write a small parser. Another option is to point at a few postcodes with your mouse, and let FlashExtract synthesize a parser. How this works remains a mystery. I suppose I need to read the paper.

Web scrapping seems like an obvious application of FlashExtract.

Laws of Concurrent Programming. Tony Hoare talked about some algebraic laws that capture the essence of many other formalisms. In particular, it captures the essence of Hoare logic and of Milner's processes. I was a bit worried about the 1-minute teaser I was supposed to make afterwards so I couldn't focus much. Three years ago I saw a similar presentation, and I posted a summary. I can tell, however, that the presentation at PLDI 2014 went into much more detail. Do you know of a (free) write-up of these ideas?

Tracelet-Based Code Search in Executables. I am very not confident about what this presentation was about. The fault is entirely mine: During the presentation I kept drifting to other thoughts. So, the summary here may very well be based more on my imagination than on the presentation. Nevertheless, here is what I think was said.

The goal is to find code similar to some reference. Instead of working at source level, the authors work on executables. It is unclear to me what is the main reason. It could be that it is easier: There are fewer processor architectures than high-level languages. It could be that working on executables has other use-cases; for example, one day perhaps this technology could be used to identify malicious code like viruses. Or it could be both. The example that was given by the presenter was finding which versions of OpenSSL have the heartbleed vulnerability. (Lately, it seems many brain cycles are devoted to bleeding hearts.)

OK, back to the subject. The metric works roughly like this. First, you extract tracelets out of each program. A tracelet is simply a piece of the program (a substring, where letters are instructions). It sounded like they might have some strange requirement that tracelets don't contain control flow instructions. (One argument was that if you look at the flowgraphs, then it's really hard to tell how to match them. Why would that imply that you have to limits tracelets to basic blocks?) Now you have these pieces of code. The next thing to do is to look at all pairs of tracelets, one from the first program, the other from the second program. For each pair of tracelets you compute an edit distance, in a way that is robust in the face of renaming registers and shifting offsets. Finally, you aggregate all these distances into one grand similarity metric.

I think most of my complaints from above are caused by me dozing off. However, something nags me a bit more seriously. I really have no recollection whatsoever of how the search is performed. Sure, having a metric is a prerequisite. But then you have to use that metric in an efficient way. For example, do they embed their metric in an Euclidian space and then use some nearest-neighbors algorithm?

Resugaring: Lifting Evaluation Sequences through Syntactic Sugar. To understand this paper, I think, it helps to focus on its most direct application: a systematic way of implementing debuggers for functional languages. Functional programmers usually think of the execution of their program as a term rewrite sequence $x_0 \to x_1 \to \cdots \to x_n$. Each $x$ is a program. Each transition $x_{k-1} \to x_k$ is an instance of a rule. Implementations of functional programs distinguish between two kinds of rules: desugaring and evaluation. The desugaring rules are applied first, and only then the evaluation ones. One could take the view that desugaring rules are just a special case of evaluation. (For example, my notation above doesn't distinguish between the two kinds.) The more traditional view of desugaring is that it gets rid of certain syntactic constructs. Desugaring rewrites the program in some core language. Everything else, including evaluation, happens in the core language. The non-core parts are there only to sugar-coat programs for humans. Which brings us to debugging.

Debugging means that you get to see not only the program you wrote $x_0$ and its result $x_n$, but also the intermediate steps $x_1,\ldots,x_{n-1}$. The problem is that you like sweet things, and, although the program you wrote is sweet, it soon ceases to be so after you hand it over to the interpreter. We'd like to pretend the desugaring steps never happened. We must do so while keeping in mind that the evaluation steps do not apply to sweet programs. The generic strategy is to find programs $y_0,\ldots,y_n$ such that $y_k \leadsto x_k$. Clearly, $\leadsto$ is supposed to designate some sort of correspondence, but I didn't say what exactly it stands for: it stands for repeated application of desugaring rules that were used somewhere in the (prefix of) the original sequence $x_0\to\cdots\to x_n$. As an example, suppose that the first transition $x_0 \to x_1$ is a desugaring one. Then it is OK to pick $y_0=y_1=x_0$: the desugaring $y_0 \leadsto x_0$ is obtained by applying $0$ desugaring rules, and the desugaring $y_1 \leadsto x_1$ is obtained by applying the same rule as was used by the transition $x_0 \to x_1$.

You need some other criteria as well. For example, what is there to stop you from choosing $y_k=x_k$? Somehow, you need to ensure that the desugaring rules are used as much as possible in getting from $y_k$ to $x_k$. Or rather, you want to ensure that $y_k$ is as similar as possible to $y_{k-1}$. Or perhaps the criterion is that the whole sequence is somehow good. To me it seems that some proper optimization problem might hide in here. The presentation seemed to suggest that there is a clear solution that qualifies as ‘best effort’. Let me state this differently. The presenter did a very good job (better than I did above) at presenting criteria that a solution should satisfy. (These included things like $y_k\leadsto x_k$, expressed in the language of lenses.) But these criteria are binary, while the problem feels more like an optimisation problem. An optimisation problem should come with a cost function. One should be able to tell not whether a solution is good, but also how good. I find it hard to believe that for all $x$-sequences there is exactly one $y$-sequence that all humans would describe as the best desugaring.

Fast: a Transducer-Based Language for Tree Manipulation. I went to this talk because it contained the keywords ‘infinite alphabets’. (I've been interested in register automata.) The simplest way to see what they do is to take a look at Fast's web interface. What you'll see is basically a very simple language that is specialized for transforming trees into other trees. The syntax is very much inspired by the syntax of functional languages, with some quirks. Why would you use a language with quirks that also lacks things like first-class functions? Because you get the ability to decide certain properties automatically.

Look at the preimage example in a different tab, so you understand what I say below. The example defines a tree data structure BT. The nodes are either Nil, or are Bin and have two children. In general, tree nodes might also have some values attached to them, and these values might come from infinite sets/types. In this example, there is one integer value. There is an asymmetry here: The children of a Bin are not named, but the integer value is named i by the Alphabet BT definition. The code also defines a function mapInc that increments the i-values of all the nodes of a tree. Finally, the code defines two subsets of values of the type BT, namely those trees whose i-values are all odd, and those trees whose i-values are all even. Note that in the definition of oddTree there is no need to introduce the name i, but the subtrees of Bin need to be given names. The asymmetry again. But, don't worry about it, for we are about to get to the cool stuff.

Now you can ask whether it is true that the result of mapInc is an oddTree exactly when mapInc is applied to an evenTree. The Fast tool handles trees itself, and discharges the part about arithmetic to Z3: Decidability modulo theories — I think this is how Loris D'Antoni described it.

Another interesting thing you can do with functions written in this language is compose them. This is possible if some extra conditions are met. You may know an optimization technique called deforestation, which removes intermediate results of a composed function. That is roughly how composition works here. The result could be further optimized, as follows. As you noticed, functions do a pattern match on the tree part, and a predicate check on the value part. Composition may lead to branches whose predicates are unsatisfiable. The solver Z3 is asked if this is the case. If it is, those branches are cut.

Automating Formal Proofs for Reactive Systems. This work is very much in line with the work on Fast. Both papers attack the verification problem by designing a specialized language. The specialized language is good for implementing a certain class of algorithms, but it is also easy to verify. This paper also has a language with a functional flavor. The application domain is that of reactive systems. I think this means, roughly, that the type of functions written in this language is $\alpha^\omega \to \beta^\omega$. Unlike the work on Fast, here there is no algorithm that always works when asked to verify a program. Instead, there is an algorithm that handles all (or almost all?) programs on which it was tried. As I understood, this algorithm takes the form of a bunch of Coq tactics.

Comment. The first and only time I attended POPL was in January 2014. I was blown away by the quality of the presentations. I was used to presentations from other conferences, which shall remain unnamed, where 90% of the talks were completely incomprehensible to me. At POPL, there were far fewer incomprehensible presentations, maybe even less than half. Now I also attended PLDI for the first time. The presentations were even better. I have a hard time recalling even one that I would classify as incomprehensible.

In Finding Moonshine there is a short discussion on mathematical presentations. Apparently, there are two traditions. The French tradition values presentations where the goal is to teach the audience something new, even if the new thing was not discovered by the presenter. The German tradition values presentations that make the presenter look smart. The German tradition won, and most math seminars are incomprehensible. Or so says Marcus du Sautoy, and I have no reason to doubt. It's a bit funny that Americans can be seen as following the French tradition, in PLDI.

I should stress that all of the above is based on the presentations, not on the papers. I hope you enjoyed it nevertheless.

Changes:
1. Yaniv David pointed out that the heartbleed bug is in OpenSSL, not OpenSSH as the text said initially.
2. Rishi Surendran clarified that the ‘extraordinary claim’ I mentioned is not quite what he said. I added a note. Also, a previous version used the name join for the concurrency primitive that should have been finish.

Jules said...

Re: Laws of Concurrent Programming, here's a reference that mentions the different kinds of triples (Hoare, Plotkin, Dijkstra):

On Locality and the Exchange Law for Concurrent Processes.
CAR Hoare, A Hussain, B Moller, PW O'Hearn, RL Petersen and G Struth. CONCUR 2011
http://www.eecs.qmul.ac.uk/~rusmus/articles/LocalityBimonoid.pdf

It is true that it's difficult to find a more in-depth reference for just these particular concepts, even though I've seen them presented in talks quite a few times.

Here's the paper associated with the RAMICS invited talk by Tony Hoare; perhaps the PLDI talk was similar: