## 01 April 2011

### Some Random Code Metrics

How compressible is source code? What is the average size of a file? The answers to these and other useless questions are born from too much caffeine.

The projects:

• Ant: A build system used by many Java projects, which is itself written in Java.
• CPython: The reference Python implementation, written in C.
• Frama-C: A set of analyzers for C programs, themselves written mostly in OCaml.
• FreeBoogie: A project of mine written in Java.
• GHC: The Glasgow Haskell Compiler, written mostly in Haskell.
• jEdit: A nice source code editor, written in Java.
• Linux: An operating system macro-kernel, written in C.
• OCaml: Compiler, standard library, and related tools for the language OCaml, written in OCaml.
• SGB: The Stanford GraphBase, a library for handling graphs plus a few example algorithms, written by Knuth in CWEB.

Methodology. Get the repo, spend $\le 1$ min choosing a subset of files that look like "the source", run a few quick commands in the shell, don't check the results. That's so you know how much you can trust what follows. Nevertheless, it's likely I got the orders of magnitude right.

Project size contest. The most common measure for a project's size is the number of lines of code. This comes with plenty of caveats. In any case, if we define "lines of code" to be "number of '\n' characters in the files that Radu happened to choose as the 'source'" then here are the results:

1. Linux: 9.1 million
2. GHC: 460 thousand
3. CPython: 450 thousand
4. OCaml: 260 thousand
5. Ant: 200 thousand
6. jEdit: 160 thousand
7. Frama-C: 126 thousand
8. SGB: 19 thousand
9. FreeBoogie: 8 thousand
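The "lines of code" definition above is easy to sketch. This is a minimal Python version, not the shell commands actually used for the numbers; `count_newlines` is a name made up for illustration, and choosing which files count as "the source" is left to the caller.

```python
from pathlib import Path


def count_newlines(paths):
    """Return the total number of newline bytes across the given files."""
    total = 0
    for path in paths:
        # Read raw bytes so that odd encodings cannot skew the count.
        total += Path(path).read_bytes().count(b"\n")
    return total
```

For example, `count_newlines(Path("src").rglob("*.java"))` would count the newlines in every Java file under a hypothetical `src` directory. A final line with no trailing newline is not counted, which matches the definition.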

There's an alternative measure that I like much better and that is not used much: the compressed size of the source code. Why do I like it? Because it should make a good proxy for the information content in the source code. For example, it doesn't matter much if coders use space indentation or tab indentation, long lines or short lines, etc. There are plenty of caveats here too. For example, it is likely that the true information content (say, Kolmogorov complexity) is much lower, and that would be apparent if compressors exploited the structure of the language in which the code is written.

Anyway, here is the bzip2 size of the projects.

1. Linux: 45 MB
2. GHC: 4.3 MB
3. CPython: 2.0 MB
4. OCaml: 1.2 MB
5. Ant: 930 kB
6. Frama-C: 735 kB
7. jEdit: 690 kB
8. SGB: 190 kB
9. FreeBoogie: 50 kB
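The compressed size can be sketched with Python's standard `bz2` module. I am assuming here that the files are concatenated and compressed as one stream; compressing each file separately and summing would give somewhat larger numbers. The function name `bzip2_size` is hypothetical.

```python
import bz2
from pathlib import Path


def bzip2_size(paths):
    """Return the bzip2-compressed size, in bytes, of the files' concatenation."""
    compressor = bz2.BZ2Compressor(9)  # 9 is the highest compression level
    size = 0
    for path in paths:
        # Feed each file into the same stream, in the order given.
        size += len(compressor.compress(Path(path).read_bytes()))
    # flush() emits whatever the compressor is still buffering.
    size += len(compressor.flush())
    return size
```

The order of the files may change the result slightly, so it is worth sorting the paths to get repeatable numbers.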

File size contest. All these projects are broken up into files, which roughly correspond to modules or abstraction boundaries. The idea is that you should be able to focus on the internals of one file at a time without needing to know too much about the other files. And that is true of the compiler too, not only of you. Or, at least, that's one way to look at it.

So, lines per file contest:

1. CPython: 860
2. Linux: 680
3. SGB: 580
4. OCaml: 330
5. jEdit: 330
6. Frama-C: 300
7. GHC: 280
8. Ant: 260
9. FreeBoogie: 140

And compressed bytes per file contest:

1. SGB: 5900 B
2. CPython: 3700 B
3. Linux: 3300 B
4. GHC: 2600 B
5. Frama-C: 1700 B
6. OCaml: 1500 B
7. jEdit: 1400 B
8. Ant: 1200 B
9. FreeBoogie: 780 B
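A quick sanity check on the two per-file lists: if "per file" means "total divided by number of files" (an assumption on my part), then each list implies a file count, and the two counts should roughly agree. The sketch below uses the rounded figures from the lists above, taking MB and kB to mean 10^6 and 10^3 bytes.

```python
# (total lines, total bzip2 bytes, lines per file, bzip2 bytes per file),
# copied from the rounded figures in the lists above.
FIGURES = {
    "Linux": (9_100_000, 45_000_000, 680, 3300),
    "GHC": (460_000, 4_300_000, 280, 2600),
    "CPython": (450_000, 2_000_000, 860, 3700),
    "OCaml": (260_000, 1_200_000, 330, 1500),
    "Ant": (200_000, 930_000, 260, 1200),
    "jEdit": (160_000, 690_000, 330, 1400),
    "Frama-C": (126_000, 735_000, 300, 1700),
    "SGB": (19_000, 190_000, 580, 5900),
    "FreeBoogie": (8_000, 50_000, 140, 780),
}


def implied_file_counts(total_lines, total_bytes, lines_per_file, bytes_per_file):
    """Return the file count implied by each of the two per-file averages."""
    return total_lines / lines_per_file, total_bytes / bytes_per_file


for name, figures in FIGURES.items():
    by_lines, by_bytes = implied_file_counts(*figures)
    print(f"{name:<10} {by_lines:8.0f} {by_bytes:8.0f}")
```

With these inputs the two estimates stay within about 15% of each other for every project, which is about as good as figures rounded to two digits allow.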

Information density. So, which project should I read if I want to get the most per byte? And which one can be read on the bus without missing much? Well, here's information per character (where information is "measured" as earlier: compressed size, so these are basically inverses of compression ratios).

1. SGB: 0.25
2. GHC: 0.21
3. FreeBoogie: 0.19
4. Linux: 0.18
5. Frama-C: 0.16
6. jEdit: 0.16
7. CPython: 0.15
8. OCaml: 0.14
9. Ant: 0.14

Looks like GHC is almost as incompressible as the code Knuth writes.
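For the curious, the density figure is just compressed size over raw size. Here is a minimal sketch, assuming one byte per character and bzip2 as the compressor; `information_per_character` is a made-up name.

```python
import bz2


def information_per_character(data: bytes) -> float:
    """Return compressed size divided by raw size, in bytes."""
    if not data:
        raise ValueError("cannot compute a density for empty input")
    return len(bz2.compress(data)) / len(data)
```

Highly repetitive input scores close to 0, while already-random bytes score around 1, so lower numbers mean more redundancy.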