10 February 2010

Graph Clustering and Answer Set Programming

A short summary of two articles I read on the bus.

Cut-clustering. Flake, Tarjan, and Tsioutsiouliklis published in 2004 an algorithm for clustering graphs. I always enjoy reading articles co-authored by Tarjan because they are so clear and so light on notation. The algorithm acts on undirected graphs. First, a special node is connected to all other nodes thru edges of cost α, a parameter. Initially all nodes are their own cluster, but clusters will be merged as nodes are processed. To heuristically speed up the runtime, nodes are processed in decreasing order of their "connectivity", which is the sum of costs on adjacent edges. Processing one node consists of finding a minimum cut between it and the special node; for all nodes on the same side of the cut as the node being processed (1) we mark them as touched and (2) we merge their clusters. Touched nodes are not processed further. (It doesn't hurt but it's faster not to process them.) That's it.

The article discusses why this works and what properties the resulting clustering has. In short, by making α bigger you get smaller clusters, but there is no fine control on the number and size of clusters. On the bright side, the "quality" of the clusters is good. Also, by varying α the resulting clustering is hierarchical.

The authors present three case studies: the graph of citations for papers in CiteSeer, websites close to dmoz, and something about websites discussing 9/11. The CiteSeer results are interesting. Here are the top level clusters:

  • 61837 paper on image, learning, problem, linear, algorithm
    • 36986 papers on learning, image, neural, recognition, knowledge
    • 8179 papers on numerical, equations, linear, matrix, nonlinear
    • 3650 papers on wavelets, coding, transform
    • 2386 papers on constraint satisfaction
    • 2377 papers on bayesian, monte carlo, markov
  • 34353 papers on parallel, memory, distributed, execution, processors
  • 20593 papers on network, service, internet, security, traffic
  • 15415 papers on logic programming, software, language, semantics

It would be interesting to compare experimentally this approach with something like k-means.

Answer set programming. In 2002, Lifschitz wrote a nice introduction to answer set programming. I shall summarize only the basics. A program that finds a big clique in a graph looks like this:

const j = 3.
vertex(0..5).  % there are six vertices and here are the edges...
edge(0,1). edge(1,2). edge(2,0). edge(3,4). edge(4.5). edge(5,3). edge(4,0). edge(2,5).

j {in(X) : vertex(X)}. % we want at least 3 vertices ``in'' the clique
joined(X,Y) :- edge(X,Y).
joined(X,Y) :- edge(Y,X).
:- in(X), in(Y), X!=Y, vertex(X), vertex(Y), not joined(X,Y).
  % if X and Y are not joined, then they can't be two distinct selected vertices

SMODELS runs the above program. Beware, you need to change the sources to compile it with a modern GCC.

The theory behind it goes as follow. There is a set S of literals, which come in pairs, the positive literal l and the negative literal -l. Even if above we used some syntactic sugar, the program is essentially a set of rules of the form:

A, not B, :- C, not D

where A, B, C, and D are sets of literals. The set X is an answer set for such a program when

  • it is consistent (it does not contain both l and -l) and
  • it is an answer set for the reduct program

The reduct program is obtained by keeping only the rules for which XB and XD=∅ and by removing the not parts of the kept rules. The set X is an answer set for a not-free program when

  • X is closed under the program and
  • X is a minimal (with respect to set inclusion)

Finally, a set X is closed under a (not-free) program when for each rule

  • CX or
  • AX

A nice exercise would be to find an encoding of SAT in answer set programming. (Lifschitz has a solution.) The converse is the subject of a paper by Fangzhen Lin and Yuting Zhao from 2002 (but published in 2004).

update 1: when α is bigger the clusters are smaller.
update 2: fixed the problem pointed out by Mikolas in the first comment.

2 comments:

mikolas said...

Is "X is closed under the program" as stable? BTW "either" in that def is probably too strict.

rgrig said...

1. "X is closed under the program" as defined by Lifschitz is different from "X is a stable model" as defined by Lin&Zhao. A stable model is an answer set for the program you obtain by removing rules that have A=B=∅ (these rules are "constraints"). Unless I misunderstood the definitions, of course :)

2. The definition is correct without "either". I fixed it. Thanks for pointing it out.

Post a Comment

Note: (1) You need to have third-party cookies enabled in order to comment on Blogger. (2) Better to copy your comment before hitting publish/preview. Blogger sometimes eats comments on the first try, but the second works. Crazy Blogger.