Theory and Practice: Some Random Code Metrics

How compressible is source code? What is the average size of a file? The answers to these and other useless questions are born from too much caffeine.

The projects:

Ant: A build system used by many Java projects, which is itself written in Java.
CPython: The reference Python implementation, written in C.
Frama-C: Analyzers of C programs, themselves written mostly in OCaml.
FreeBoogie: A project of mine written in Java.
GHC: The most used Haskell compiler, written in Haskell.
jEdit: A nice source code editor, written in Java.
Linux: An operating system macro-kernel, written in C.
OCaml: Compiler, standard library, and related tools for the language OCaml, written in OCaml.
SGB: A library for handling graphs and a few example algorithms written by Knuth in CWEB.

Methodology. Get the repo, spend $\le 1$ min choosing a subset of files that look like "the source", run a few quick commands in the shell, don't check the results. That's so you know how much you can trust what follows. Nevertheless, it's likely I got the orders of magnitude right.

Project size contest. The most common measure for a project's size is the number of lines of code. This comes with plenty of caveats. In any case, if we define "lines of code" to be "number of '\n' characters in the files that Radu happened to choose as the 'source'" then here are the results:

Linux: 9.1 million
GHC: 460 thousand
CPython: 450 thousand
OCaml: 260 thousand
Ant: 200 thousand
jEdit: 160 thousand
Frama-C: 126 thousand
SGB: 19 thousand
FreeBoogie: 8 thousand
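The definition above is easy to reproduce. Here is a minimal sketch, not the commands actually used for this post (those were quick shell one-liners). The helper names and the pick-by-extension rule are assumptions made for illustration.

```python
from pathlib import Path


def source_files(root, suffixes):
    # Hypothetical selection rule: every regular file under `root` whose
    # extension is in `suffixes`, for example {".c", ".h"}.
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes
    )


def count_newlines(data: bytes) -> int:
    # "Lines of code" as defined above: the number of newline bytes.
    return data.count(b"\n")


def project_lines(files) -> int:
    # Total newline count over all the chosen files.
    return sum(count_newlines(Path(f).read_bytes()) for f in files)
```

Something like `project_lines(source_files("linux", {".c", ".h"}))` would then give a number comparable to the ones above, up to the choice of files. Note that a last line without a trailing newline is not counted.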

There's an alternative measure that I like much better and that is not used much: the compressed size of the source code. Why do I like it? Because it should make a good proxy for the information content of the source code. For example, it doesn't matter much whether coders indent with spaces or tabs, use long lines or short lines, etc. There are plenty of caveats here too. For example, it is likely that the true information content (say, Kolmogorov complexity) is much lower, and that would become apparent if compressors exploited the structure of the language in which the code is written.

Anyway, here is the bzip2 size of the projects.

Linux: 45 MB
GHC: 4.3 MB
CPython: 2.0 MB
OCaml: 1.2 MB
Ant: 930 kB
Frama-C: 735 kB
jEdit: 690 kB
SGB: 190 kB
FreeBoogie: 50 kB
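For the curious, here is a rough sketch of how such a number could be computed with Python's standard bz2 module. It assumes the chosen files are simply fed, one after another, into a single bzip2 stream at the highest compression level; the post does not say exactly how the files were bundled, so treat that as a guess.

```python
import bz2


def compressed_size(chunks) -> int:
    # Feed every chunk into ONE bzip2 stream and count the output bytes.
    # Assumption: one stream for the whole project, compression level 9.
    comp = bz2.BZ2Compressor(9)
    size = 0
    for chunk in chunks:
        size += len(comp.compress(chunk))
    # flush() emits whatever is still buffered, plus the stream trailer.
    size += len(comp.flush())
    return size
```

Passing `(Path(f).read_bytes() for f in files)` as `chunks` keeps only one file in memory at a time. Compressing each file separately and adding up the sizes would give a larger total, since every stream pays for its own header and cannot reuse patterns seen in other files.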

File size contest. All these projects are broken up into files, which roughly correspond to modules or abstraction boundaries. The idea is that you should be able to focus on the internals of one file at a time without needing to know too much about the other files. And that is true of the compiler too, not only of you. Or, at least, that's one way to look at it.

So, lines per file contest:

CPython: 860
Linux: 680
SGB: 580
OCaml: 330
jEdit: 330
Frama-C: 300
GHC: 280
Ant: 260
FreeBoogie: 140

And compressed bytes per file contest:

SGB: 5900 B
CPython: 3700 B
Linux: 3300 B
GHC: 2600 B
Frama-C: 1700 B
OCaml: 1500 B
jEdit: 1400 B
Ant: 1200 B
FreeBoogie: 780 B
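Both per-file lists are plain averages: a project total divided by the number of files. That allows a quick cross-check using only the rounded figures from this post, sketched below for Linux. The function names are made up for illustration, and 1 MB is assumed to mean 10^6 bytes.

```python
def implied_file_count(total_lines: float, lines_per_file: float) -> int:
    # total lines / average lines per file = approximate number of files
    return round(total_lines / lines_per_file)


def bytes_per_file(compressed_bytes: float, n_files: int) -> float:
    # Average compressed size of one file.
    return compressed_bytes / n_files


# Rounded figures from the lists above: 9.1 million lines, 680 lines per
# file, 45 MB compressed (assuming 1 MB = 10**6 bytes).
linux_files = implied_file_count(9.1e6, 680)
linux_bytes_per_file = bytes_per_file(45e6, linux_files)
```

This gives roughly 13 thousand files and a bit under 3400 compressed bytes per file, which agrees with the 3300 B listed above to within rounding.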

Information density. So, which project should I read if I want to get the most per byte? And which one can be read on the bus without missing much? Well, here's information per character, in compressed bytes per source character (where information is "measured" as earlier, by compressed size, so these are basically inverses of compression ratios).

SGB: 0.25
GHC: 0.21
FreeBoogie: 0.19
Linux: 0.18
Frama-C: 0.16
jEdit: 0.16
CPython: 0.15
OCaml: 0.14
Ant: 0.14

Looks like GHC is almost as incompressible as the code Knuth writes.
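The density figure is just compressed size divided by original size. A minimal sketch, assuming one byte per character and bzip2 at its default level, as provided by Python's bz2 module:

```python
import bz2


def density(data: bytes) -> float:
    # Compressed bytes per original byte: the inverse of the compression
    # ratio. Assumes one byte per character, true for mostly-ASCII code.
    if not data:
        raise ValueError("density is undefined for empty input")
    return len(bz2.compress(data)) / len(data)
```

Highly repetitive input scores close to 0. Random bytes score about 1, or slightly more, because the compressed stream carries some overhead of its own.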

01 April 2011