Theory and Practice: November 2005

A lot has happened since my last post. My weblog is the target of spammers now. This is why you need to recognize a text before posting comments. I don't work for Nobug anymore. I didn't have time to participate in TopCoder competitions in the last two months. I am a PhD student here. I will also give a guest lecture for a "Software Engineering for Research Teams" course on Tuesday. Here are my notes.

When users say that a program is good they mean that it is friendly, fast, small or correct. When programmers say that a program is good they may also refer to how maintainable it is. A program is friendly if its (user) interface works well; it is fast if it uses small amounts of CPU time; it is small if it uses memory efficiently; it is correct if it meets its specification. A maintainable program is one that is easy to understand, to modify and to extend. All these are strongly influenced by how readable it is.

Programmers have various goals when they code; otherwise they are not (good) programmers. The goals correspond to various notions of goodness. We can distinguish between these programmer types:

The Diplomat. She is constantly thinking about the user of her code, tries to be empathetic and to design the best interfaces. The whole program is structured around an I/O core that is designed from the start.
The Racer. He is obsessed with speed. He wakes up in the morning thinking about the fastest strategy for brushing his teeth (while ensuring some level of quality). Every line is carefully tweaked for performance. Bitwise operations are common.
The Smurf. He wants to work on embedded systems. No. He wouldn't work on anything else. The program he's most proud of is a FORTRAN compiler that works on a computer with 8192 bytes of memory.
The Formalist. He is obsessed with correctness. He says that it is impossible to know if a program works unless it has specifications and refuses to write code without them. For even the simplest of the loops he thinks of what is the invariant and makes it explicit. He doesn't have time to worry whether his program is useful.

Only bad programmers fit into one of the categories above. Good ones know to migrate when there is a need to do so. To do that they use a meta-goal: readability. Keeping readability in mind helps them decide what are the most important things to watch for in any given situation. Is it efficiency? Is it correctness? Is it something else? Keeping readability in mind puts one in a mood similar to the one you are in when you prepare a presentation. You will do your best when you talk in front of people because they will judge you by that. Readability works the same way. You will write better code if you think of your reader. Practices like peer review and pair programming are corollaries of concentrating on readability.

There is yet another way to look at readability.

For most real projects, even for research projects, the most time consuming tasks are trivial and not extremely demanding. A prime example is maintenance. That is good to some extent. No one can be creative all the time. Still, it would be good if we could save ourselves some grief by reducing the time needed by boring tasks. If you write readable code you will spend much less time maintaining it. It will have less bugs in the first place; and it will be far easier to understand later, when you've forgotten the details. This is also good from a project management perspective. What is the bottleneck for most programming projects? Well, it certainly isn't the coding. That consumes only a very small fraction of the overall time. Maintenance on the other hand is a very good candidate to the title of the most common bottleneck in software development projects. So if we apply basic performance-enhancing principles, anything we can do to attack the bottleneck is worth doing. Concentrate on readability.

But enough with dry generalities. Let's see a simple example. I have asked some of my friends to solve this problem.

Given a list, say for each element the 0-index where it appears for the first time. So the list x = [0, 1, 4, 2, 4, 1, 0, 2] is transformed into y = [0, 1, 2, 3, 2, 1, 0, 3]. Notice that for all i we have x_{y_i} = x_i. Use any programming language and any data representation you want. The exact phrasing I used in my email is: Write a program that gives the 0-index of the first occurrence of elements in a list. And then I added the example and the property specified above. Not a very clear description. But that's how life is. (I'm sorry - I can't seem to be able to stop finding excuses for myself)

I want to spend some time analysing the replies I got plus some of my own implementations. You say it is a trivial problem? Well, sure it is. I'm not talking about problem solving abilities here. I want to concentrate on the coding. I want to try to guess the goals each programmer had in mind when he wrote a piece of code. Is he a diplomat? A racer? A smurf? A formalist? I also want to pay attention to features that make it easy or hard for me to read the program.

Try to do the same. It may very well be that something I find hard to read will look perfectly natural to you. The point is that such an exercise makes you think about goals and readability on a concrete example. You can get a feel of what they mean. In fact, please stop reading for a while and try to solve this problem yourself. This way you'll be able to do the same kind of analysis on your own solution.

Now the solutions; in random order.

Solution 1 - C++

vector<BYTE> x;
map<BYTE, BYTE> pozitii;//harta cu x_i careia ii corespunde pozitia din x (<x_i, i>)
//x_i parca e ... cheia si pozitia e ... valoare? (nu imi amintesc care e terminologia) ... msdn.microsoft.com mi-a dat dreptate!


//se umple vectorul x cu ce valori se doreste...
push_back sau constructor pe baza de BYTE xs[] ={0, 1, 4, ...}
//pot umbla in vector cu un iterator ... dar nu imi amintesc cum obtin pozitia pe baza iteratorului asa ca ... sarim peste detalii

for (BYTE i = 0; it < x.size(); i++)
{
  //caut x.at(i) in harta pozitii
  map<BYTE, BYTE>::iterator it = pozitii.find(x.at(i));
  if (it == pozitii.end())
       {//daca nu il gasesc, adaug in y pozitia din vectorul x ...
               y.push_back(i);
       //... si in harta pozitii cheia x.at(i) cu valoarea i
               pozitii[x.at(i)] = i;
       }
       else
       {//daca il gasesc in harta pun in vectorul y valoarea corespunzatoare cheii x.at(i)
               y.push_back(it->second);
       }
}

The idea behind this solution is to traverse the input list from head to tail and keep an associative table that says for each value at what index it was first seen.

This solution looks more like pseudo-C++-style-code than actual C++ code. I conclude from here that the author is certainly not a Formalist. He might be a Smurf: he uses very small data types (BYTE). By looking at the Romanian comments I can tell he's not a Diplomat. He is however a Racer: this solution is the only O(n lg n) solution I have received. This is done at the expense of bending the problem statement. Sure, the example uses a list of integers as input but the text doesn't say anywhere that the input list will contain integers. This looks like another argument for not considering this guy a Formalist. But the same misunderstanding appeared in all but one of the solutions I have received. So I'm contemplating the possibility that this is a result of a not very clear problem statement.

All in all the characteristics that seem to fit this programmer (when he wrote this code) are: Racer Smurf. That's no surprise: in the real life he writes software for embedded system in the industry.

Personally I don't like the comments written in Romanian. Centuries ago Latin was the language of culture. Now English (and Mathematics) is the language of science. The advantage in ease of communication brought by the adoption of a single language outweighs all other concerns. Your native language is one of your main thinking instruments. It is also very important for areas like Literature. But its place is not inside a computer program. That is more so especially if you believe open source is a good way of writing programs. So, for me, the comments detracted from the readability of this piece of code. Even if they were in English I wouldn't like them very much: some say things that are of no interest to me in relation to this problem (like the fact that pairs in a map are made of a key and a value) and some are mere repetitions of what the code already says (that is, they are at the same level of abstraction as the code).

A technical matter that bothers me nevertheless is that the key of the map is a BYTE. That makes no sense.

We can see some concern for safety because x.at(i) is used instead of x[i]. That looks slightly uglier but makes it easier to find bugs. Readable code makes it easier to prevent bugs. So it's a question of how confident you are.

Solution 2 - Java

int size = 10;
int var[] = {1, 3, 5, 1, 2, 4, 2, 5, 3, 66};
int ix[] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
      
for (int i = 0; i<size; i++) {
 if (ix[i] < 0) {
  ix[i] = i;
  for (int j = i+1; j<size; j++) {
   if (var[j] == var[i]) {
    ix[j] = i;
   }
  }
 }
}
      
for (int i = 0; i<size; i++) {
 System.out.println(i + ": " + var[i] + " - " + ix[i]);
}

The idea of this solution surprised me a bit. It goes through the input list from head to tail and whenever it finds a value that wasn't seen so far sets all the corresponding positions in the result list.

This guy clearly has some concern for correctness since he tests the code. I would call him a Semiformalist, others would call him an eXtremist, while others would say he's a Scaffolder. Yes I know that word doesn't exist. What I mean is that he writes scaffolding code, in the sense this word is used by Bentley (Programming Pearls) and Kernighan&Pike (The Practice of Programming). I believe he is also a Semiracer. That is because he wrote me a second optimized version, which I lost. There is no obvious trace of smurfness here. But we can see a certain degree of diplomacy: I believe the variable names are particularly good. So he is a Semidiplomat Semiracer, Semiformalist. A very balanced guy.

The variable names make the program easy to read even if the idea of the solution was a bit surprising to me. I continue to wonder why did I found this approach surprising? The code is clearly layed out, with a good usage of vertical spaces to separate sections of code. Another thing that makes the code easy to read is that he avoids magic numbers (size).

Still, the code is still not as general as it could be and is not packaged in a reusable form. In fact I was very surprised that no one packaged the code in a reusable form. And the appropriate package here is very lightweight: a function.

Solution 3 - Java

public static void main(String args[]){
    int alfabet[] = new int[10];
    System.out.println("Sir = " + sir);
    for (int i=0;i<10;i++) {
        alfabet[i] = -1;
    }
    int[] zeroIndex = new int[sir.length()];
    for (int i=0;i<sir.length();i++) {
        int value = Integer.parseInt(sir.substring(i,i+1));
        if (alfabet[value] == -1) {
            zeroIndex[i] = i;
            alfabet[value] = i;
        }
        else {
            zeroIndex[i] = alfabet[value];
        }
    }
    System.out.println("Sir = " + sir);
    System.out.println("ZeroIndex = ");
    for (int i=0;i<zeroIndex.length;i++){
        System.out.print(zeroIndex[i]);
    }
}

The idea is very similar to the one used in solution 1.

This guy is definitely a Racer and not a Formalist. He bended the statement up to the point he was able to solve very efficiently (O(n) time) a very specific sub-problem of what I've asked. There is also a very small amount of smurfness: the source code is incredibly space-less; yet, not up to the point where it would impact readability significantly. Test code is present and so I'd say he's some sort of minimal formalist: the eXtremist kind. I'm really not sure what to say about his diplomacy. This is one of the few solutions that included some kind of a wrapper. yet it is a trivial one (main). So he's a Racer with vague smurfness tendencies.

From a readability point of view this code seems pretty average. There are no comments. The layout is OK. The standard case and bracket conventions are observed. There is only one little thing that bothers me. Two variables have Romanian names (sir = string, alfabet = whose meaning should be obvious).

A good point compared to the previous solution is that it lets the compiler to find the length of the array instead of giving it by "manual counting". This is a big plus for maintainability and is also a bit easier to read. When you see sir.length() you know immediately whose length we are talking about. Compare that with a simple size.

Solution 4 - Java

public  static void main(String[] args) {

    int[] intInput={6, 8, 5, 2, 6, 7, 8, 8, 8};
    int[] intOutput=new int[intInput.length];
    int i,j;
                       
    for (i=0;i<intInput.length;i++){
        j=0;
        while (intInput[j]!=intInput[i]){
            j++;
        }
        intOutput[i]=j;
    }
    for (i=0;i<intOutput.length;i++){
        System.out.print(intOutput[i]+" ");
    }
}

Please don't judge me by the languages in which my friends code :-) Oh, shit, I use the same language!

This solution uses what is in my opinion the most straightforward idea. Construct the output element with element and search the input each time for the first occurrence. Like solution 2 it reduces the problem to the case when the input contains integers (so is more general than solution 1 and 3).

This guy is not a racer. He just implemented the most straightforward solution. He is a Semi diplomat for the same reasons as 2 (good variable names, spacing, etc.). Smurf? I see no evidence of that. Formalist? Yes. A bit. There is test code and the simplest solution is also the hardest to get wrong. So he's a Semiformalist semidiplomat.

The fact that I consider this to be the most natural idea made it very easy for me to read the code. The horizontal layout is a bit too crowded for my taste. But it helps that it observes Sun's coding conventions.

Solution 5 - Haskell

Finally we get to my own solutions. I won't comment on them. Here is the first try. Try to do for them what I did for the others. Better yet, leave a comment on my weblog so I know what you think.

We are asked to write a function that given a list of Eq
elements like x = "baubaulerau" gives a list of integers
that indicate the index of the first occurence of the
corresponding element y = [0, 1, 2, 0, 1, 2, 6, 7, 8, 1, 2].

For that we use an auxiliary function that gives the index
of the first occurrence of an element in a list. (TODO: see
if such a function isn't already available from a standard 
library).

> first_idx :: Eq a => [a] -> a -> Int
> first_idx (x:xs) e
>     | x == e    = 0
>     | otherwise = 1 + first_idx xs e

With this definition a O(n^2) time solution is simple.

> first_occ :: Eq a => [a] -> [Int]
> first_occ x = map (first_idx x) x

Sure, if we where guaranteed that the elements of the input
list have a total order relation then we could use a set
data structure to solve the problem in O(n lg n) time. Also,
hashing would result in a O(n) solution. Unfortunately
hashing doesn't play too well with extreme functional purists.

Solution 6 - Haskell

This is my second try. I got back to the code because I knew there is a TODO in there somewhere.

We try to solve here a toy problem, emphasizing program
readability. Given a list of elements return an integer
list of the same size. Each integer should be the 0-index
of the first occurrence of the corresponding element in the
input list.

For example the integer list [0, 1, 4, 2, 4, 1, 0, 2]
should produce [0, 1, 2, 3, 2, 1, 0, 3].

Above I gave a slight reformulation of the original statement
that I've sent to my friends.

All we are guaranteed is that its elements can be compared;
this is implied by a hidden assumption behind the expression
"first occurrence". In these conditions the best solutions
consume O(n^2) time in the worst case.

The idea of the solution is simple: we just need to transform
every element into the index of its first occurrence. So we
need a function that gives the index of the first occurrence
of a certain element and then use [map].

We put the function in a module and import other modules to
make the code more readable.

> module ToyProblem where
> import Data.Maybe
> import Data.List
> import Debug.QuickCheck

With this prelude the solution is a trivial translation into
Haskell of the idea presented earlier.

> firstOccurrence :: Eq a => [a] -> [Int]
> firstOccurrence xs = map (indexIn xs) xs
>     where indexIn xs e = fromJust (elemIndex e xs)

Here are some simple tests.

> simpleTests :: Bool
> simpleTests = 
>     firstOccurrence [0,1,4,2,4,1,0,2] == [0,1,2,3,2,1,0,3] &&
>     firstOccurrence "baubaul e rau" == [0,1,2,0,1,2,6,7,8,7,10,1,2]

If we denote [y = firstOccurence x] we see that a simple 
property that has to be satisfied by [firstOccurence] is
that forall i: x_(y_i) = x_i. Based on this we can throw
in same random tests as well to increase our confidence
that the code written above is correct.

> indexIn xs = choose (0, length xs - 1)
> prop_SameElement xs = 
>     xs /= [] ==> forAll (indexIn xs) $ \i -> xs!!(ys!!i) == xs!!i
>         where
>         ys         = firstOccurrence xs
>         types      = xs :: [Int]

Another simple property that we can check is that forall i: y_i <= i.

> prop_GoodIndex xs =
>     xs /= [] ==> forAll (indexIn xs) $ \i -> ys!!i <= i
>         where
>         ys    = firstOccurrence xs
>         types = xs :: [Int]

Hooray. It looks like the implementation works. Well, kind of.
One thing that I have noticed during testing is that my definition
is a little shaky when applied to an empty list. Due to lazy 
evaluation it should work fine. But it works fine only in the
toplevel. Otherwise it complains about some type errors for
expressions like [firstOccurrece [] == []].

Solution 7 - Java

And, if everybody's doing it in Java, why can't I?

import java.util.ArrayList;
import java.util.List;

class Util<T> {
    /**
       @pre lst != null
       @post result != null
       @post \forall i : lst.get(\result.get(i)) == lst.get(i)
     */
    public List<Integer> firstOccurence(List<T> lst) {
        List<Integer> result = new ArrayList<Integer>(lst.size());
        for (int i = 0; i < lst.size(); ++i) {
            result.add(lst.indexOf(lst.get(i)));
        }
        return result;
    }
}

Solution 8 - MATLAB

for k=1:length(a)
 index=find(a==a(k));
 b(index)=index(1);
end

This uses the same idea as solution 4.

The code is very succinct and therefore hard to get wrong. This guy is a Formalist. It is also the only solution I've received that does not restrict itself to integer input. So more reasons to believe he's a Formalist. Aside from that we can see vague smurfness tendencies (look at the variable names). I would give the verdict: Formalist.

The use of the library function find improves readability a lot. I have also received an accompanying comment that clearly proves the author was actively thinking about readability: "To my shame, I was unable to get rid of the for". I love this solution. I don't know MATLAB that well so I wonder if there is any map or apply function there as you find in Mathematica and Haskell (and even in C++ but with horrendous syntax).

I will kill my objectivity completely by confessing that this is my favourite. Which is yours?

Conclusion

No. I'm not crazy. This type of "coding competitions" happen all the time on mailing lists. See for example this one. In the past I have also written literate programs that are now on the Net: this and this.

What I would like you to remember from all this is that it is important to have your programming style. It is even more important to be able to adapt your style to the style of the team. One way to achieve these is by actively thinking about your reader and approaching the task of programming in the same way a writer writes a novel and a professor gives a lecture.

Theory and Practice

04 November 2005

Writing Readable Code