Theory and Practice: merging lists

Literate programming is a nice approach to coding with some shortcomings. One of them is the lack of tools. However three of its central ideas can be applied by every programmer:

Write programs for humans not for computers.
Put links in your code that help the reader navigate the code (similar to the web).
Keep code and documentation close to each other.

The first one is very important. When you imagine that you are preparing a presentation for a group of people you are much more likely to get it right than when you imagine that you write a recipe for a (dumb) computer. Many slogans you read on "software engineering" websites are natural results of the "expository mode" state of mind: choose names carefully, thoroughly comment your API (non-private members in OOP), use pair-programming, conduct code and design reviews, etc. Code that is written with its readers in mind is a pleasure to read. I'll try to pleasure the few (non-existent?) readers of this blog with a series of small programs written with elegance as the ultimate purpose. If you ever find a nicer solution then please post a comment so that we all can see your jewel. A couple of days ago a guy (jartur) asked on the yahoo group ocaml-beginners about a program that would take two lists that are known to be sorted according to a predicate and merges them into one list that is still sorted and, in the process, removes possible repetitions. First of all the specification needs to be a bit clarified. A comparison function takes an ordered pair and returns a boolean: whether the parameters are in the correct order. We assume that it imposes a strict or non-strict total order relation. A list x is sorted only if i < j implies cmp x_i x_j. A list has a repeated element if there is an i such that cmp x_i x_i+1 = cmp x_i+1 x_i (in other words we use the equivalence relation induced by cmp to decide if an element is repeated or not). With these definitions we can restate the problem as follows. We are given two lists that are sorted and have no repeated elements. We must produce a list that has the same two properties and whose set of elements is the union of the set of elements of the given lists. All this implicitly refers to a comparator function cmp. Here is an example. Suppose the comparator function is:

let cmp a b = (String.length a) < (String.length b);;

It imposes a strict total order on strings. If the two input lists are ["x"; "bau"] and ["y"; "yy"; "bau"] then there are two answers that are considered valid according to our specification: ["x"; "yy"; "bau"] and ["y"; "yy"; "bau"]. Because as far as cmp is concerned "x" = "y": they have the same length. Now, the first thing to notice is the symmetry of the problem. We are given two lists: they are not named in any way. It would be better to have a solution that exploits this symmetry. Another thing we can notice immediately is that if one of the lists is empty then the other list is a good answer.

let rec merge cmp l1 l2 = match (l1, l2) with
  | ([], lst) | (lst, []) -> lst
  <both lists are not empty case>

[Note: this nice symmetric pattern matching was suggested by Seth Fogarty.] If both lists have a head then we have two situations: either the heads are equivalent or one of them is smaller. If they are equivalent then the answer list starts with either of them and continues with the merging of the tails (i.e. the rest of the two lists after removing both heads). If one head is smaller then it will also be the head of the answer. And the tail of the answer is the merging of the tail of the list from which we removed the head and the whole other list. That's an awful many words. The code will clear things up.

let rec merge cmp l1 l2 = match (l1, l2) with
  | ([], lst) | (lst, []) -> lst
  | (h1 :: t1, h2 :: t2) ->
      match (cmp h1 h2, cmp h2 h1) with
      | (true, false) -> h1 :: (merge cmp t1 l2)
      | (false, true) -> h2 :: (merge cmp l1 t2)
      | _ -> h1 :: (merge cmp t1 t2);;

. Notice that we should really try to preserve the symmetry of the problem in the text of the program. It makes it much clear. Notice also the order of the parameters: cmp is not the first one by chance. If we think how this function is to be used we should see that specialization for different comparators is very likely. The client can write for example:

let merge_less = merge (<);;

. Since we are looking at cmp we can also see a performance problem. We will have O(min(m,n)) recursive calls and cmp is passed as a parameter although it does not change! Better:

let merge cmp l1 l2 =
  let rec rm l1 l2 = match (l1, l2) with
  | ([], lst) | (lst, []) -> lst
  | (h1 :: t1, h2 :: t2) ->
    match (cmp h1 h2, cmp h2 h1) with
      | (true, false) -> h1 :: (rm t1 l2)
      | (false, true) -> h2 :: (rm l1 t2)
      | _ -> h1 :: (rm t1 t2) in
  rm l1 l2;;

. Another thing to notice here is the function names. The one that is visible from outside, who knows from what other parts of the code, has a longer, more meaningful name: merge. The one that has a "very local" visibility has a much shorter name: rm; it is meant to suggest "recursive merge" but it is not really essential for it to succeed; given it's narrow visibility, the brevity is more important. Now, that is a nice solution! The only problem is that it is a bit slow since it is not tail recursive. The problem comes from the fact that in OCaml lists are singly linked. In other words they have only a cons constructor/destructor, and do not have a snoc constructor/destructor. If you are not familiar with functional languages you might wonder what the heck am I talking about :) Well, it is pretty simple. If you have a cons-list then what you can do with it is (1) split it into the first element and the rest of the list through pattern matching on (head :: tail) and (2) construct a new list by adding an element at its beginning using the same (::) operator. All this means is that the natural way of "reading" lists in OCaml is from left to right and the natural way of constructing them is from right to left. Our solution has poor performace because we go against this second rule-of-thumb: construct lists from last element to the first element. A common trick is to split the processing in two steps: construct the reverse of what you need, and then reverse it. This is a trick I learned from Brian Hurt. Notice that to reverse a list you need to "read" (deconstruct) the input from left to right and "write" (construct) the output from right to left: so it is a "natural" operation in OCaml. The solution becomes:

let merge cmp l1 l2 =
  let rec rm l1 l2 acc = match (l1, l2) with
    | ([], lst) | (lst, []) -> List.rev_append lst acc
    | (h1 :: t1, h2 :: t2) ->
      match (cmp h1 h2, cmp h2 h1) with
        | (true, false) -> rm t1 l2 (h1 :: acc)
        | (false, true) -> rm l1 t2 (h2 :: acc)
        | _ -> rm t1 t2 (h1 :: acc) in
  List.rev (rm l1 l2 []);;

. The use of the accumulator (acc) is also a common trick to help separate the process in two steps. Now, don't you like this function? For comparison look at jartur's original code (he's just learning ocaml...):

let rec merge al bl sf =
        let hl il el = match il with
        | [] -> []
        | hd::tl -> if hd=el then tl else il
       
        and

        z sff a b = if (sff a b) then a else b
       
        in
       
        if bl=[] then al else match al with
        | [] -> bl
        | hd::tl -> let x=(z sf hd (List.hd bl)) in
                        x::(merge (hl al x) (hl bl x) sf)
;;

. Brrrrr.... The whole point of this entry is that if you think about your audience, exploit symmetries in the problem statement, and avoid being overspecific (i.e. adding in the mix details that are not essential) you will produce programs that are much more readable and, usually, more efficient.

Theory and Practice

12 November 2004

merging lists

No comments:

Post a Comment