29 August 2007

Self management &mdash part 1

I'm going to Croatia. I'll be driving at least 8 hours. The other half of the way my father will be the driver. If I let him drive. I didn't decide yet. Driving is a hobby of mine and it is a hobby that was neglected for the past 2 years, since I started the PhD. I like both to be able to give passengers a smooth ride and to be able to drive as in a rally, but mostly the former. I guess it's the age. I remember that I used to drive fast quite often after I got my license. And when I say fast I mean to the limit. For example tires screeched at almost all curves. I had no accident while driving fast. Once I scratched the car. Apart from that I had no accident that was my fault. Once a guy reversed into my car while I was not moving. Lesson: Never assume you are being seen no matter as crazy the alternative seems. Another time I was driving straight ahead on the main road and a guy that was turning left hit the back of my car. I guess he was in a hurry or he thought I'm driving faster. Both these incidents resulted only in broken bumpers. Those are all the incidents involving contacts between my car and other cars in the last 9 years.

What does this mean? The article Culpable versus non-culpable traffic accidents; What is wrong with this picture? by Wålberg and Dorn from the Journal of Safety Research 2007 gives some figures. According to their data Swedish bus drivers have one accident every 4961 hours, while British bus drivers have one every 1518 hours. I drove probably about 50 hours in the last two years. Before that I would guess I averaged about one hour per day. That sums up to about 2600 hours of driving. That is, I have one accident every 867 hours of driving. Which means I suck. I could make it look better by pointing fingers or by saying when I had the last accident. But it's probably better if I get the message and try to improve.

Which brings me to the subject of this post. I believe that there are a few important ingredients that help improve a skill:

  • you should know that there is plenty of room for improvement,
  • you should be confident that you can improve,
  • you should continuously monitor your progress, and
  • you should establish (attainable and objective) next targets.

I now know that there is plenty of room for improvement: I know that, at least by one measure, Swedish bus drivers are 5.7 times better than me. Confidence that I can improve was never a problem of mine. Maybe I'm overconfident sometimes. But I know that for others it can be a problem. The next step is to monitor my progress. One thing to monitor is how many hours I drive. The next big target for me is to drive 1954 hours without any crash. That will make me as good as an average British bus driver. Notice that I said big. To keep me motivated I need to find smaller targets. I used to have a small piece of paper on me with a list of five things I want to improve. That list should evolve. But what I already have is a good starting point. It's fun to look back now. Here is the list:

  1. Don't change gears while steering.
  2. Use the horn to make yourself seen, not to `yell' at others.
  3. ALWAYS know what happens behind you before steering or braking.
  4. Be careful at traffic signs.
  5. Always have clean windows and mirrors.

Nowadays I would say that I seldom change gears while steering. I virtually never use the horn to `yell' at others so I can erase this one. It's annoying to realize that sometimes I still brake without looking in the mirror. My perception of traffic signs is now much better, but the main reason I do not overlook them is probably that I no longer have a standard route. Years ago, habit was the main reason why I missed (changed) traffic signs. I did not improve at all when it comes to windows and mirrors. Still, I absolutely hate things put in the car that hinder visibility.

Unfortunately, I don't expect to improve much in the next two years because the probability that I'll get a car while being a PhD student is 0. This is a dilemma for me.

Now let's see how the same schema applies (or not) to another skill, say, programming.

The first step is to know that there is plenty of room for improvement. There is. The TopCoder ratings show that I suck at solving Aha! problems and at finding good heuristics. The rating for developing well-documented and robust programs comes from only one participation so it is not reliable. But it is says I suck. Another area where I suck is familiarity with contemporary technologies. Since I suck at so many things I need to prioritize. I'm pretty sure I do not want to improve my knowledge of contemporary technologies. On one hand they come and go fast, so at this stage in my life (PhD) it would be an effort badly spent. On the other hand I tend to associate technology with stupid people. I know this is not true but I just can't help it. I've heard the `school doesn't teach modern technologies and you have to learn by yourself' excuse so many times that I feel like running and yelling whenever someone repeats it. (Learn by yourself as opposed to what? Having someone to stuff the textbooks down your throat? Learning is something that students do, by definition.) Finally, I'm yet to see a “modern technology” that takes more than a day to get started with, and more than a week to become reasonably efficient with it. What remains is the set of three TopCoder categories I mentioned: Aha! problems, heuristics for hard problems, and well-documented and robust programs. (These are known as Algorithms, Marathons, and Component competitions in the TopCoder world.) The trouble is that I can't pick one. All have a certain appeal to me. The only reason I participated in more Algorithm competitions than in Marathon competitions than in Component competitions is that I favored those that take less time. One option would be to focus on Knuth. His books and programs are generally a nice blend, and of very high quality. The nagging problem with this is that he is somewhat conducive to a work style that ignores existent good tools and languages. This is partly on purpose, because he doesn't want to be dependent on technologies that come and go. But it can be annoying sometimes. For example, his CWEB system for literate programming is a strain for most editors (only ViM does proper syntax highlight) and it encourages C-style programming even when C++ has better alternatives (such as defines in C versus constants in C++). To some point this doesn't matter for learning principles but it does tend to form bad habits that kick you when you want to get things done.

For now I want to remain flexible in what I want to become good at. For the short term I want to make good progress in reading TAoCP and in writing good literate programs.

The second step is to be confident. Am I confident? Not really. I feel somewhat threatened by the lack of time and by my slow progress. I want to believe that my slow progress is generated by a lack of focus, but I'm not sure if I do believe it. In short, I need to work on my confidence. A good way to build confidence is to have some set of concrete achievements to which I can look back later on. I'll collect these on my webpage as programs.

The third step is to monitor my progress. The webpage will help. So will my TopCoder rating.

The fourth step is to have some small targets to follow. Here is the list for now.

  1. Release a functional FreeBoogie on 1st of October. (I promised one for the end of the summer but somehow I did not work on it for the last two or three months. I don't know how I manage so often to not work on something I promise. It's almost like promising something makes me less likely to do it.)
  2. Increase the TopCoder algorithm rating to ≥ 1900 by the end of the year. This should be doable since I've been there. Remember that rating is just a way to measure progress. If I think about the rating before and during a match I usually end up much worse. I should be thinking about the problems.
  3. Finish reading chapter 1 of TAoCP by the end of the year. This involves solving all exercises rated ≤ 10 and all those that are starred. For the starred ones, trying for 3 hours and then understanding the suggested solution is acceptable too. I know that there are some exercises that I just can't solve.

OK. So we saw two examples of how to apply this simple improvement strategy: driving and programming. Since I am a PhD candidate it makes sense to apply it to one more skill: doing research.

What is good research? I gave the answer a while back in terms of what a good researcher does: (1) works hard, (2) picks good problems, and (3) presents well the results. I don't need to work harder. I do it quite well (when I'm not on holiday as I am now). Do I pick good problems? I'm not sure. Do I present well the results? No.

Let's go thru the four steps again.

Is there room for improvement? Well, certainly. I am virtually unknown as a researcher. I need to publish more to let people know what I do. I need to interact more at conferences to let people know what I do. I need to give good presentations of what I do: at conferences, in articles, and on this blog. In research the objective measure of success that comes to mind is the h-index. (Sure, that is an imperfect measure. But that can be said about grades in school, TopCoder ratings, number of articles, and virtually any objective measure. The point is that looking at numbers helps one's self-confidence if those numbers improve.) Let's see, what is my h-index? It can't be too high because I have only three articles that I co-authored. Two have no citation and the third has 4. (There is no self-citation involved.) That would make my h-index equal to 1. The very best in Computer Science are somewhere above 60. So yes, there is plenty of room for improvement.

Am I confident? Yes, I am. But I do believe that I have a big problem with focusing on doing work that can be seen and measured as opposed to playing around. I have to drill this into my mind: If I don't communicate it well it's like I didn't do it — it won't benefit anyone. Writing things down may seem as slowing me down, but it should ultimately be beneficial.

Monitor my progress. I will keep track of articles I write on my webpage. I will also make it easy to contribute to FreeBoogie. The reason is that that piece of software is my best chance to become known in my research community if done properly. It's also the best way I see of contributing to the community. It uses my programming skills and gives back a platform for experimenting. At least that's the ideal.

Let's see some targets now.

  1. I must write one article every two months.
  2. I must improve old articles and build upon them as opposed to keep starting new things.
  3. I must turn FreeBoogie into a platform that others will want to use for research.
  4. I must give a short presentation of each of the articles I co-author on this blog for a wider audience. In fact I should have in mind a programmer from industry that is not familiar with applied formal methods when I write the posts.

That's it.

I will write a `self management' post only once every three months, roughly. I want this blog to be mainly about technical problems and about research. Nevertheless, a little introspection is good for one's improvement.

24 August 2007

Toy problems

These are problems I've heard about recently and seem cute.

Merge arrays. You are given an array x with m comparable objects followed by n empty cells. You are also given an array y with n comparable objects. Both x and y are sorted. Your task is to add to x the element of y such that x remains sorted. You may use only a constant amount of memory. [From TopCoder forums.]

Election winner. You are given an array containing votes. A candidate wins if he has more than half the votes. Your task is to find the winner. If there is no winner by the previous definition than you can choose as winner whoever you want. You may use only a constant amount of memory. [From the 0xDE blog.]

Handshake party. You are invited to a party. At the end you ask each person how meny times they shook hands. You notice that the response you receive is an odd number for an even number of people. Why? [Henry. Also in Combinatorics by van Lint.]

Handshake party, continued. You then ask each person with how many other people they shook hands. If no one is handshakeless then you expect to receive at least two identical answers. Why? [Combinatorics by van Lint.]

Sex in New York (Times). You are given a bipartite graph. Show that the sum of the degrees for the left nodes is equal to the sum of degrees for the right nodes. What is the relation between the average number of sexual parteners of women and the average number of sexual parteners of men? [From NYT, via WebDiarios de Motocicleta blog.]

Graphs at IOI 2007. Find a maximum weighted matching in an arbitrary graph knowing that the degree of the vertices is bounded by a constant. You are supposed to give a nice and simple implementation. [Via WebDiarios de Motocicleta; I don't know (yet) the `nice' answer. Here is the general case.]

PS1: Some of these are also on my webpage.

PS2: Did you notice the list of nice blog posts below?

19 August 2007

Orange Romania, 3G, and Linux

This post describes how I got the Business Everywhere, Option 3G/EDGE package from Orange Romania working on Linux, which unfortunately is not officially supported. These days I found a higher kind of fun than configuring things so I won't make a habit of such posts. I make an exception now, however, because this information may save someone's time and because in the process of figuring things out I had to give AT commands to a modem, which reminded me of the old days when I was a kid and only started to use a PC.

To be more precise I talk about Ubuntu Linux. In general, your mileage may vary and you may need to adapt these instructions, but they will at least give you a starting point.

Here is what you need to do.

sudo apt-get install pppd wvdial
sudo vim /etc/ppp/orange
 #debug 
 nodetach
 connect "/usr/bin/wvdial --chat --config /etc/wvdial.conf"
 /dev/noz0 
 460800 
 noauth
 noccp 
 novj 
 usepeerdns
 defaultroute
sudo vim /etc/wvdial.conf
 [Dialer Defaults]
 Modem = /dev/noz0
 Baud = 460800
 Init1 = ATZ
 Init2 = ATQ0 V1 E1 S0=0 &C1 &D2
 Init3 = AT+CGDCONT=1,"IP","internet"
 Dial Command = ATD
 Idle Seconds = 3000
 Username = dummy
 Password = dummy
 Phone = *99#

To connect to the Internet you say pppd call orange. You type C-c to hung up.

The above tells pppd that authentication is not needed (noauth), disables some stuff that does not seem to be supported by Orange (noccp, novj), sets up the phone number (*99#), and the access point (internet). If you have problems it might help to uncomment the debug line and see what goes wrong.

If the SIM is protected you may need to unlock it first. Both minicom and kermit can be used to send AT commands to the modem. Here is how to send a SMS (and how to unlock the SIM if not already unlocked).

rg@rg-laptop:~/blog$ kermit
C-Kermit 8.0.211, 10 Apr 2004, for Linux
 Copyright (C) 1985, 2004,
  Trustees of Columbia University in the City of New York.
Type ? or HELP for help.
(/home/rg/blog/) C-Kermit>set line /dev/noz0
(/home/rg/blog/) C-Kermit>c
Connecting to /dev/noz0, speed 115200
 Escape character: Ctrl-\ (ASCII 28, FS): enabled
Type the escape character followed by C to get back,
or followed by ? to see other options.
----------------------------------------------------
at+cpin="0000"
OK
at+cpms="SM"
+CPMS: 0,50,0,50,0,50

OK
at+cmgf=1
OK
at+cmgs="PHONE NUMBER"
> MESSAGE
> <TYPE C-z TO END THE MESSAGE>

+CMGS: 10

OK

(Back at localhost)
----------------------------------------------------
(/home/rg/blog/) C-Kermit>q
 A serial connection might still be active on /dev/noz0.

OK to exit? y
Closing /dev/noz0...OK

Very similar settings are used to get Zapp (Romania) working on Linux. Zapp provides Linux support on their webpage.

UCD CSIC 2007

On 11th of August we had the 2007 edition of the CSI programming competition. I decided the theme to be Harry Potter. The problems and the reference solutions are available online. We also have pictures. As you can guess from the balloons it was an ACM ICPC style contest. We (at least me and Joe) plan to send a team to the regionals to represent University College Dublin (and Ireland).

In this post I'll say a few things about the problems. I recommend you spend some time thinking about them before reading on.

Problem A was taken straight from the Harry Potter book. The trick is to not start coding until you think it through. Most contestants actually transformed the rules into code. Experienced programmers know that that is a sin. Those kind of rules must be encoded as data, otherwise it is a nightmare to get it right, or to adapt to changes in the rules.

Problem B tests coding skill. Here Java has a huge disadvantage because the standard IO routines are extremely slow compared to the standard IO routines of C. Complaining about this is pointless. First, fast IO can be done in Java at the price of writing C-style code. Second, no one stops you to use C. That aside, my C solution works in a little over 0.3 seconds on the test data (on my computer) so the 60 seconds limit should be enough even for Java, provided that the implementation has the correct complexity. What is the correct complexity? When you are trying to see if the word at index 0 is the start of an evil almost-anagram you look at the words 0, .., n such that they have at least 11 letters. The first important observation is that there is no need to look at 0, ..., n+1. The second important observation is that, when you look at word 1, you can start with 1, ..., n, because 1, ..., n-1 is guaranteed not to have enough letters. Hence, the solution should be O(M+N), where M is the size of the input and N is the size of the reference string, not O(MN).

Problem C is easy for anyone with experience in algorithm contests and hard for others. The solution is a simple BFS.

Problem D is by far the most interesting one. It is more or less taken directly out of an article by Kleinberg, Papadimitriou, and Raghavan on Segmentation Problems. They give a theorem that says that the problem has a O(C3) solution using dynamic programming based on the observation that optimal groups are made of clients that have consecutive optimal prices. They omit the "obvious" proof. I have no idea what solution they have in mind, for mine is O(C2). Since the winnings are (ab x)x the optimal price is a/(2b) and the maximum winnings are a2/(4b). These formulas hold for individual clients but also for groups of clients, with a and b being sums in the latter case. One approach is to try all possible ways of grouping clients and the use these formulas to get the optimal price and profit for each group. This is too slow. Suppose now that we know the optimal group prices and we try to figure out the groups. Because the winnings are given by a function symmetric around the maximum we should sell to each client using the group price closest to its individual optimal price. This shows that the optimal groups are indeed made out of consecutive clients, when sorted by their individual optimal prices.

10 August 2007

No debugging

We all know a big company that doesn't do any debugging. Instead, it lets its users find the bugs and improve the programs. That's a great idea!

Murray Gell-Mann, On Getting Creative Ideas, 2007

03 August 2007

Why Interfaces Should Be Fat and Should Not Contain Efficiency Guarantees

Joshua Bloch, the designer of a big part of the Java API, says that the fundamental law of API design is "to leave [something] out whenever you are in doubt." In a recent edition of the ACM Queue magazine Michi Henning gives some rules of thumb for API design, including: "An API should be minimal, without imposing undue inconvenience on the caller." We learn from Eemonn McManus that it's good to "be minimalist" and, in particular, "if your API has more methods than it needs, then it's taking up more space than it needs." That obviously sounds like something bad. It also sounds a bit like a tautology but that's beside the point. Bill Venners says that an API designer should "strive for interfaces that are minimal and complete." The first guideline in a compilation done by a fellow blogger says that a good API "has as few public members per class and as few classes as possible." These things are not hard to find: They are all top hits for a google on API design. I also remember myself arguing for "leaving things out" on more than one occasion and even in situations where the subject was life, universe, and everything, not just APIs.

So what's with the crazy title? Who would prefer slim to fat, and not knowing to knowing? The answer has three letters: STL—possibly the best API in the world. Actually, no, the answer is slightly longer.

I hate guidelines. Whatever you do please don't read the rest as a plead for some competing guidelines. No. It is simply a plead against a particular and also a plead for thought. Maybe a strangely worded one but take my word—that's what it is. Don't do something because some guideline, rule of thumb, or law tells you to do. Do it because you thought about it and you found good reasons to believe that it, for various values of it, is the right thing to do.

I'm debating the rule of slim interfaces because it is an example of a well-established rule. The trouble with well-established rules is that they are admired, applied, spread like plague, and are almost never questioned. When you question something the purpose is not to demolish it. The purpose is to understand why it stands, when it stands, and whether it stands at all. To do a good job the mindset must be "I must find a situation where this doesn't hold, a counterexample." If you fail, then you would have had first hand experience that make you see why the rule is good.

Enough digression.

The 2007 edition of the ICFP Contest was quite fun. To get going, teams had to write an interpreter for a very low level functional language. To do so efficiently it turned out that one needs fast concatenation and subrange extraction from sequences of characters. Both must be done is sublinear time. Otherwise it takes a few hours to run a program. Not fun. If you have access to TAoCP then you can find in Section 6.2.3 algorithms that tell you how to use arrays and simple logic tests to represent sequences as AVL trees. That gives you logarithmic time for both the needed operations. But it's a pain to implement. The memory needed by AVL trees is linear in the length of the sequence. It sounds good. Except it isn't. The constant factor is about 15, which means that for processing a program that takes 25MB in a file one would need about 375MB of memory. That sucks because memory is time. And virtual memory is an eternity.

C++ programmers were lucky. You can simply implement a program that uses strings to represent sequences of characters, try it, see that it is terribly slow, then replace string by crope in one place, and watch it fly. Most compilers already support ropes, a STL data structure that makes its way towards the next version of the standard. (The man pulling the ropes is Hans Boehm.) There is no subtype relation between string and crope. Why was this possible? Because both implement (almost) the same methods!

If you follow the slim rule blindly the you would not implement concatenation for strings. It is a very slow operation so the stupid user of the API should be forbidden to shoot himself by writing terribly slow code. If concatenation is what he wants then let him use ropes. Strings won't give him the means to hang himself. Besides, if he wants to use strings and only occasionally do a concatenation then he can write a simple loop that appends characters one by one. It's quite easy to do and it will make it obvious how slow that part of the code is. One should avoid expensive functions in an API. So concatenation for strings is out.

But that would prevent programmers from writing the interpreter using strings and then switching to ropes by one local change! The programmers should have used ropes in the first place, you say? But that was a hard problem. The programmers had to experiment a bit to see how fast things are. Besides, blaming the users (in this case the programmers) sounds like a lame excuse for the API designer.

I think that string should implement concatenation because (1) it can, (2) concatenation is a well defined operation (3) that is implemented efficiently by ropes, which are a similar data structure, and (4) it is sometimes needed in practice. It should also have the same name/syntax as in the case of ropes. This clearly goes against the minimalist point of view. Nevertheless, you saw an example where such a strategy leads to good results.

Now, how about not having efficiency guarantees?

My point is that the interface is the sequence and its operations. Both string and crope should be seen merely as implementations. And they should have efficiency guarantees.

So, write fat interfaces with no efficiency guarantees and have lots of implementations, all faithfully adhering to the interface. Explain the efficiency of the implementations as well as possible. Not only time. Not only asymptotics. Explain how it works, give exact formulas if possible, give the results of real benchmarks. (And tell me how it worked out ☺.)

There is one argument in the slim rule that I don't see how to crack. (Your help is welcome.) An operation that does not have a clear spec should not go in the interface, because there is a nonnegligible chance that it will change, and changes in the API are usually accompanied by havoc, some even warn about the possible extinction of life on Earth.

Bleah!