11 December 2013

Edit Distance, Benchmarks

You could compute the edit distance of two strings by memoization or by dynamic programming. The latter should be more efficient, at the expense of clarity. But, how much more efficient? This post shows some (micro) benchmarks.

Background. What do I mean by ‘edit distance’, ‘memoization’, and ‘dynamic programming’? Skip to Test Data if you already know.

The edit distance problem is the following: Given two strings, compute their edit distance. The edit distance is the minimum length of an edit script that transforms one string into the other. An edit script is a sequence of edit operations. Let's say that the only operations are ‘insert a letter’ and ‘delete a letter’. For example, the distance between ‘sun’ and ‘snow’ is 3: delete ‘u’, then insert ‘o’ and ‘w’.

Some problems can be solved by first solving smaller versions of them (subproblems) and then combining the results. A fancy word for that is recursion. Now imagine a dag (directed acyclic graph) with arcs from problems to their subproblems. There could be multiple paths from a problem to a sub-sub-problem. So, we should be careful to solve each subproblem at most once. Recursion plus not repeating work has two fancy names: memoization and dynamic programming.

OK, OK, perhaps the different names reflect some details yet unmentioned. Let's see.

Remember the dependency graph. We must label its vertices with the result of the subproblem they represent. We may only label a vertex if all its successors have been labelled. That is the only constraint. The fancy name for this is that we must process vertices in the reverse of a topological order. Algorithmists say: In dynamic programming, you consciously decide the order; in memoization, you use an order resembling DFS (depth first search), and usually don't even realize this is what you do. Conversely, languages people say: Memoization lets you focus on the important parts (how to decompose the problem at hand); dynamic programming makes you consider irrelevant details (the order). They are all right, of course, and there is no reason for contention.

But, let's get back to our problem, and start with the important part: How to decompose it? We want to compute the distance $d(s,t)$ between strings $s$ and $t$. To decompose the problem, let's first decompose the strings. The obvious way to decompose a string $s$ is to say that it is the concatenation of two strings $s_1$ and $s_2$. The case $|s_1|=0$ and the case $|s_2|=0$ are not very useful. The case $|s_1|=1$ is often sufficient for problem decomposition, as is the case $|s_2|=1$. Let's see if it is sufficient in this case. First, $s$ is either the empty string $\epsilon$ or it is the concatenation $as'$ of a letter $a$ and a smaller string $s'$. Similarly, $t$ is either $\epsilon$ or the concatenation $bt'$. If the problem is $(s,t)$, then the potential subproblems are $(s,t')$, $(s',t)$, and $(s',t')$. How can we use solutions for the latter to compute a solution for the former? A moment's scribbling gives the answer: $$d(s,t)=\begin{cases} |s| & \text{if $t=\epsilon$} \\ |t| & \text{if $s=\epsilon$} \\ d(s',t') & \text{if $a=b$} \\ 1+\min\bigl(d(s,t'),d(s',t)\bigr) & \text{otherwise} \end{cases} $$

The C implementation below is not too far from the math notation above.

#include <string.h> /* strlen */

int min(int a, int b) { return a < b? a : b; }
/* distance between the suffixes s[i..] and t[j..] */
int d1(char *s, int i, char *t, int j) {
  if (!s[i]) return strlen(t+j);
  if (!t[j]) return strlen(s+i);
  if (s[i] == t[j]) return d1(s, i+1, t, j+1);
  return 1 + min(d1(s, i, t, j+1), d1(s, i+1, t, j));
}
int d1_wrap(char *s, char *t) { return d1(s, 0, t, 0); }
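
If you want to try the snippets, a minimal driver could look like this (it assumes the pieces of code in this post are pasted into one file):

#include <stdio.h>

int main(void) {
  char s[] = "sun", t[] = "snow";
  printf("%d\n", d1_wrap(s, t)); /* prints 3 */
  return 0;
}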

We decomposed the problem. Now we must figure out how to avoid re-solving a subproblem. Let's be sloppy and just do the obvious thing: Before solving a subproblem, first check if we solved it before. This one is called memoization.

#define max_len (1 << 8)
int d2_cache[max_len][max_len];
int d2(char *s, int i, char *t, int j) {
  if (d2_cache[i][j] >= 0) return d2_cache[i][j];
  if (!s[i]) return d2_cache[i][j] = strlen(t+j);
  if (!t[j]) return d2_cache[i][j] = strlen(s+i);
  if (s[i] == t[j]) return d2_cache[i][j] = d2(s, i+1, t, j+1);
  return d2_cache[i][j] = 1 + min(d2(s, i, t, j+1), d2(s, i+1, t, j));
}
int d2_wrap(char *s, char *t) {
  int m = strlen(s);
  int n = (strlen(t) + 1) * sizeof(int); /* bytes per row, not letters */
  for (int i=0;i<=m;++i) memset(d2_cache[i],-1,n); /* -1 marks 'not computed yet' */
  return d2(s, 0, t, 0);
}

Above I said that memoization gives an order that resembles DFS. A draft of that sentence said that the order is that of DFS, but I fixed it. Can you see what was wrong?

I transformed d1 into d2 manually. But, it should be clear that I did nothing smart. A computer could do it. Well, almost. There are two issues with asking the computer to do it, one mundane and one less so. The mundane issue is that higher-order code, which takes code as input and produces code as output, is rather painful to do in C. The less mundane issue is that an array won't always do as a cache. A general higher-order memoize function would use some sort of dictionary data structure, usually a hashtable. A hashtable would have different performance characteristics, compared to an array.
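
To make the ‘higher-order’ point concrete, here is a rough C++ sketch of what such a memoize could look like; the name memoize and the use of std::map (instead of a hashtable) are illustrative choices, and this is not one of the versions benchmarked below.

#include <functional>
#include <map>

/* A sketch of a higher-order memoize: it takes a function and returns a
   function with the same signature that caches results in a dictionary.
   A std::map stands in for a hashtable, to avoid writing a hash function. */
template <typename A, typename R>
std::function<R(A)> memoize(std::function<R(A)> f) {
  std::map<A, R> cache;
  return [f, cache](A x) mutable -> R {
    auto it = cache.find(x);
    if (it != cache.end()) return it->second;
    return cache[x] = f(x);
  };
}

One wrinkle: for a recursive function like d1, the recursive calls must themselves go through the memoized wrapper, otherwise only the top-level result gets cached.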

Do you expect a hashtable to do better or worse than an array for this problem? What about the case $s=t$?

I wouldn't bet any significant amount of money against the hashtable version, without profiling. So, I wrote a version with a hashtable as well.

#include <cstring>  /* strlen */
#include <algorithm>
#include <unordered_map>
#include <utility>
using namespace std;

struct PairHash {
  size_t operator()(const pair<int,int>& p) const {
    return (p.first << 8) + p.second;
  }
};
unordered_map<pair<int,int>,int,PairHash> h_cache;
int d6(char *s, int i, char *t, int j) {
  pair<int, int> ij = make_pair(i, j);
  auto it = h_cache.find(ij);
  if (it != h_cache.end()) return it->second;
  int& r = h_cache[ij]; /* references into unordered_map survive later insertions */
  if (!s[i]) return r = strlen(t+j);
  if (!t[j]) return r = strlen(s+i);
  if (s[i] == t[j]) return r = d6(s, i+1, t, j+1);
  return r = 1 + min(d6(s, i, t, j+1), d6(s, i+1, t, j));
}

int d6_wrap(char *s, char *t) {
  h_cache.clear();
  return d6(s, 0, t, 0);
}

The code above is using the hashtable from the standard library of C++. All the other code in this post uses plain C, so that as few things as possible are hidden from plain sight. For the hashtable, I thought it might be reasonable to use a standard implementation, even if this hides some things.

We decomposed the problem, and we made sure we don't re-solve subproblems. Now let's think more carefully about the order in which subproblems should be solved. Subproblems can be identified by pairs $(i,j)$ of indices. Imagine all subproblems laid out in an $(m+1) \times (n+1)$ grid. To solve $(i,j)$ we need to have $(i,j+1)$, $(i+1,j)$, and $(i+1,j+1)$. Each subproblem has up to three outgoing arcs: to the right, below, and right-and-below. Thus, three orders come to mind: row-by-row, column-by-column, and diagonal-by-diagonal. Others are possible, but these are the obviously structured ones. The code below goes row-by-row; a sketch of the diagonal-by-diagonal order follows it.

int d3(char *s, char *t) {
  int m, n;
  int i, j;
  for (m=0; s[m]; ++m);
  for (n=0; t[n]; ++n);
  for (j=n; j>=0; --j) d2_cache[m][j] = n-j;
  for (i=m-1; i>=0; --i) {
    d2_cache[i][n] = m-i;
    for (j=n-1; j>=0; --j)
      d2_cache[i][j] =
        s[i]==t[j]
        ? d2_cache[i+1][j+1]
        : (1 + min(d2_cache[i+1][j], d2_cache[i][j+1]));
  }
  return d2_cache[0][0];
}
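
For comparison, the diagonal-by-diagonal order mentioned above could look something like the sketch below. The name d3_diag is mine, it reuses d2_cache, and it is not among the versions benchmarked below.

/* A sketch of the diagonal-by-diagonal order: process the anti-diagonals
   i+j = m+n, m+n-1, ..., 0; the cells within one diagonal are independent. */
int d3_diag(char *s, char *t) {
  int m, n;
  for (m=0; s[m]; ++m);
  for (n=0; t[n]; ++n);
  for (int d = m+n; d >= 0; --d) {
    int lo = d-n > 0 ? d-n : 0;  /* smallest valid i on this diagonal */
    int hi = d < m ? d : m;      /* largest valid i on this diagonal */
    for (int i = lo; i <= hi; ++i) {
      int j = d - i;
      if (i == m) d2_cache[i][j] = n-j;
      else if (j == n) d2_cache[i][j] = m-i;
      else d2_cache[i][j] =
        s[i]==t[j]
        ? d2_cache[i+1][j+1]
        : (1 + min(d2_cache[i+1][j], d2_cache[i][j+1]));
    }
  }
  return d2_cache[0][0];
}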

It now becomes clear that some space is wasted by the two-dimensional cache array: After we use row $i+1$ to compute row $i$, we never again need it. I won't be profiling space usage. Still, in some situations space is time, because of locality of reference. So, let's try this version as well.

int d1_cache[2][max_len]; /* oh, dear! so, d2_cache meant "2d cache"? */
int d4(char *s, char *t) {
  int m, n;
  int i, j;
  int *now, *nxt;
  for (m=0; s[m]; ++m);
  for (n=0; t[n]; ++n);
  if (n > m) {
    char *x = t; t = s; s = x;
    int a = m; m = n; n = a;
  }
  now = d1_cache[0];
  nxt = d1_cache[1];
  for (j=n; j>=0; --j) nxt[j] = n-j;
  for (i=m-1; i>=0; --i) {
    int *tmp = nxt; nxt = now; now = tmp;
    nxt[n] = m-i;
    for (j=n-1; j>=0; --j)
      nxt[j] = s[i]==t[j]? now[j+1] : (1 + min(now[j], nxt[j+1]));
  }
  return nxt[0];
}

Can you tell why I put the if (n > m) { … } in there?

Hmm, are the memory accesses that jump between two arrays going to be cache-friendly? I wonder what would happen if I interleave their elements in one array? Sure, I'd have to do a bit more computation to get the right indices, but somehow it is more obviously cache-friendly. Let's see.

/* what does d0 stand for? hmm, names in this program are rather random */
int d0_cache[2 * max_len];
int d5(char *s, char *t) {
  int m, n;
  int i, j;
  int nxt;
  for (m=0; s[m]; ++m);
  for (n=0; t[n]; ++n);
  if (n > m) {
    char *x = t; t = s; s = x;
    int a = m; m = n; n = a;
  }
  nxt = 1;
  for (j=2*n+nxt; j>=0; j-=2) d0_cache[j] = n-(j>>1);
  for (i=m-1; i>=0; --i) {
    nxt = !nxt;
    d0_cache[(n<<1)|nxt] = m-i;
    for (j=((n-1)<<1)|nxt; j>=0; j-=2)
      d0_cache[j] = s[i]==t[j>>1]? d0_cache[(j^1)+2] : (1 + min(d0_cache[j^1], d0_cache[j+2]));
  }
  return d0_cache[nxt];
}

I didn't pay too much attention to how I do the bit-fiddling stuff. Is there a better way?

These are the solutions I will benchmark. Here are friendly names for them.

d1        stupid
d2_wrap   memoized with array
d6_wrap   memoized with hashtable
d3        DP
d4        fringe DP
d5        interleaved fringe DP

Test Data. In general, there are two ways to obtain test data: take it from somewhere, or generate it. I did the latter for two reasons. First, I couldn't think of a good source of instances for the edit distance problem. (Do you know any?) Second, I was curious what happens when the size of the input changes.

Once you decide to generate test data, you need to decide from which random distribution you sample. I did the simplest thing that came to mind. Given the length $n$, generate a string by sampling (without replacement) $n$ letters uniformly from the alphabet. Then obtain the second string by choosing some subset of $\min(n,5)$ letters, and changing those letters, again by uniform sampling from the alphabet. Thus, the distance between the two strings is $10$, or slightly smaller, for all word lengths $n \ge 5$.
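
Concretely, a generator along these lines could look as follows. This is only a sketch: the alphabet 'a' to 'z', the use of rand, and the name gen_pair are illustrative choices, and it samples the letters of the first string with replacement, for simplicity.

#include <stdlib.h>

/* A sketch of the test data generator described above.
   s and t must have room for n+1 characters each. */
void gen_pair(int n, char *s, char *t) {
  for (int i = 0; i < n; ++i) s[i] = 'a' + rand() % 26;
  s[n] = 0;
  for (int i = 0; i <= n; ++i) t[i] = s[i];
  int changes = n < 5 ? n : 5;
  int pos[5];
  for (int k = 0; k < changes; ++k) {
    int i, fresh;
    do {  /* pick a position that was not picked before */
      i = rand() % n;
      fresh = 1;
      for (int l = 0; l < k; ++l) if (pos[l] == i) fresh = 0;
    } while (!fresh);
    pos[k] = i;
    t[i] = 'a' + rand() % 26;  /* change the letter at that position */
  }
}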

There are better algorithms for the case in which the edit distance tends to be much smaller than the length of the strings. Can you find one?

For each $n$ from $1$ to $40$ I generate $10^6$ such pairs. This way, the number of seconds for handling all tests is the number of microseconds for handling one test, on average.
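
The measurement itself can be as simple as the sketch below; the name benchmark is illustrative, and note that this sketch also charges the time spent in gen_pair to each test.

#include <stdio.h>
#include <time.h>

/* A sketch of the measurement loop: for each length n, time one of the
   implementations on 10^6 generated pairs. */
void benchmark(int (*dist)(char*, char*)) {
  char s[64], t[64];
  for (int n = 1; n <= 40; ++n) {
    clock_t start = clock();
    for (int rep = 0; rep < 1000000; ++rep) {
      gen_pair(n, s, t);
      dist(s, t);
    }
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    /* total seconds for 10^6 tests == average microseconds per test */
    printf("n=%d  %.2f us/test\n", n, seconds);
  }
}

For example, benchmark(d4) prints one line per word length.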

Benchmark Results. The graph below summarizes the results. The stupid implementation isn't included for lack of time. (It's too slow.)

OK, so what we see here is that memoized with a hashtable is pretty bad. Let's remove this case, so we can better compare the others.

Right. So memoized with array is only about twice as bad as DP. Let's remove it too, so we can better compare the three DP implementations.

Clearly, my bit fiddling didn't help: interleaved fringe DP is even worse than DP with a two-dimensional array. On the other hand, keeping only the fringe in memory shaves off about 10% of the running time.

Discussion. What does this all mean? If I had to solve some problem that decomposes into smaller problems (with overlap), then I'd try memoized with a hashtable, because:

  1. the code is simpler, very close to what I'd write in math on paper
  2. with a hashtable (as opposed to an array), I don't have to think about any bounds, so there are fewer opportunities for mistakes
  3. memoization with a hashtable can be done very conveniently in languages with higher-order functions (see this video for an example in OCaml)

But isn't it slow? It is, but correctness trumps efficiency. When the difference is in constant factors, or even logarithmic factors, I usually prefer to try the simpler code first, and change only if profiling shows that performance is a problem. If performance is a problem, then the good news is that there is a lot of room for improvement.
