*A short summary of two articles I read on the bus.*

**Cut-clustering**.
Flake, Tarjan, and Tsioutsiouliklis published in 2004
an algorithm for clustering graphs. I always enjoy reading
articles co-authored by Tarjan because they are so clear and
so light on notation. The algorithm acts on undirected graphs.
First, a special node is connected to all other nodes thru edges
of cost α, a parameter. Initially all nodes are their own
cluster, but clusters will be merged as nodes are processed.
To heuristically speed up the runtime, nodes are processed in
decreasing order of their "connectivity", which is the
sum of costs on adjacent edges. Processing one node consists of
finding a minimum cut between it and the special node; for all
nodes on the same side of the cut as the node being processed
(1) we mark them as touched and (2) we merge their clusters.
Touched nodes are not processed further. (It doesn't hurt but it's
faster not to process them.) That's it.

The article discusses why this works and what properties the resulting clustering has. In short, by making α bigger you get smaller clusters, but there is no fine control on the number and size of clusters. On the bright side, the "quality" of the clusters is good. Also, by varying α the resulting clustering is hierarchical.

The authors present three case studies: the graph of citations for papers in CiteSeer, websites close to dmoz, and something about websites discussing 9/11. The CiteSeer results are interesting. Here are the top level clusters:

- 61837 paper on image, learning, problem, linear, algorithm
- 36986 papers on learning, image, neural, recognition, knowledge
- 8179 papers on numerical, equations, linear, matrix, nonlinear
- 3650 papers on wavelets, coding, transform
- 2386 papers on constraint satisfaction
- 2377 papers on bayesian, monte carlo, markov

- 34353 papers on parallel, memory, distributed, execution, processors
- 20593 papers on network, service, internet, security, traffic
- 15415 papers on logic programming, software, language, semantics

It would be interesting to compare experimentally this approach with
something like
*k*-means.

**Answer set programming**.
In 2002,
Lifschitz wrote a nice introduction to answer set programming. I shall
summarize only the basics. A program that finds a big clique in
a graph looks like this:

const j = 3. vertex(0..5). % there are six vertices and here are the edges... edge(0,1). edge(1,2). edge(2,0). edge(3,4). edge(4.5). edge(5,3). edge(4,0). edge(2,5). j {in(X) : vertex(X)}. % we want at least 3 vertices ``in'' the clique joined(X,Y) :- edge(X,Y). joined(X,Y) :- edge(Y,X). :- in(X), in(Y), X!=Y, vertex(X), vertex(Y), not joined(X,Y). % if X and Y are not joined, then they can't be two distinct selected vertices

SMODELS runs the above program. Beware, you need to change the sources to compile it with a modern GCC.

The theory behind it goes as follow. There is a set *S*
of literals, which come in pairs, the positive literal *l*
and the negative literal -*l*. Even if above we used some syntactic
sugar, the program is essentially a set of rules of the form:

*A*,

**not**

*B*, :-

*C*,

**not**

*D*

where *A*, *B*, *C*, and *D* are sets of
literals. The set *X* is an answer set for such a program
when

- it is consistent (it does not contain both
*l*and -*l*)**and** - it is an answer set for the reduct program

The reduct program is obtained
by keeping only the rules for which *X*⊇*B* and
*X*∩*D*=∅ and by removing the **not** parts
of the kept rules. The set *X* is an answer set for a **not**-free
program when

*X*is closed under the program**and***X*is a minimal (with respect to set inclusion)

Finally, a set *X* is closed under a (**not**-free) program
when for each rule

*C*⊄*X***or***A*⊆*X*

A nice exercise would be to find an encoding of SAT in answer set programming. (Lifschitz has a solution.) The converse is the subject of a paper by Fangzhen Lin and Yuting Zhao from 2002 (but published in 2004).

**update 1**: when α is bigger the clusters are *smaller*.

**update 2**: fixed the problem pointed out by Mikolas in the first comment.

## 2 comments:

Is "X is closed under the program" as stable? BTW "either" in that def is probably too strict.

1. "X is closed under the program" as defined by Lifschitz is different from "X is a stable model" as defined by Lin&Zhao. A stable model is an answer set for the program you obtain by removing rules that have A=B=∅ (these rules are "constraints"). Unless I misunderstood the definitions, of course :)

2. The definition is correct without "either". I fixed it. Thanks for pointing it out.

## Post a Comment

Note: (1) You need to have third-party cookies enabled in order to comment on Blogger. (2) Better to copy your comment before hitting publish/preview. Blogger sometimes eats comments on the first try, but the second works. Crazy Blogger.