12 July 2014

Minimizing Weighted Automata via Linear Algebra

In this post I present the minimization problem for weighted automata, closely following a paper by Stefan Kiefer and Björn Wachter. I do add many steps that aren't strictly necessary; to me, they make the result look less like magic. I really like the mixture of automata and linear algebra.

Definition. Finite automata describe functions $\Sigma^*\to2$. Weighted automata describe functions $\Sigma^*\to\mathbb{R}$. I will only describe weighted automata, because they are more general and less known. The configuration of a weighted automaton with $n$ states is a vector with $n$ components. Processing one letter amounts to multiplying this vector by a matrix. For example, the configuration after processing the word hello is $x^T A_hA_eA_lA_lA_o$. Here $x$ is the initial configuration. (When I write $x,y,z,\ldots$ I mean column vectors. But, it is more convenient to see the configuration of the automaton as a row vector. Otherwise, you'd have to reverse words, as in $A_oA_lA_lA_eA_hx$.) Each configuration of the automaton determines a real number: just take the scalar product between the configuration and a fixed vector $y$. Thus, the word hello is mapped to the number $x^TA_hA_eA_lA_lA_oy$. To summarize, a weighted automaton of size $n$ has:

• an initial configuration $x\in\mathbb{R}^n$,
• a square matrix $A_{\alpha}\in\mathbb{R}^{n\times n}$ for each letter $\alpha\in\Sigma$, and
• accepting weights, grouped in a vector $y\in\mathbb{R}^n$.

Example. To define the language of strings of odd length we could take \begin{align} &x=\left[\matrix{1\cr0\cr0\cr0\cr}\right] &&A_{\alpha} = \left[\matrix{0&1&0&0\cr0&0&1&0\cr0&0&0&1\cr1&0&0&0}\right] &&y=\left[\matrix{0\cr1\cr0\cr1}\right] \end{align} This is a deterministic finite automaton with $4$ states, out of which $2$ are accepting. ∎
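This is easy to check mechanically. Below is a small numpy sketch (the helper name `weight` and the single-letter alphabet are my choices, not from the paper) that encodes this example and evaluates words exactly as the definition prescribes:

```python
import numpy as np

def weight(x, As, y, word):
    """Compute x^T A_{w_1} ... A_{w_k} y for the word w = w_1 ... w_k."""
    v = x                      # configuration, used as a row vector
    for letter in word:
        v = v @ As[letter]     # processing a letter = one matrix product
    return v @ y

# The example automaton: a 4-state cycle; odd-length words get weight 1.
x = np.array([1.0, 0.0, 0.0, 0.0])
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
y = np.array([0.0, 1.0, 0.0, 1.0])
As = {'a': A}

assert weight(x, As, y, 'a') == 1.0     # odd length
assert weight(x, As, y, 'aa') == 0.0    # even length
```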

Change of base. The automaton is represented by vectors $x$, $y$ and matrices $A_{\alpha}$. In turn, vectors and matrices are represented as collections of real numbers. These numbers implicitly depend on the choice of a basis. This observation gives us a simple way to design automata equivalent to a given one. Suppose we are given $x$, $y$, and $A_{\alpha}$. We pick some invertible matrix $C$ and set \begin{align} & x'=Cx && y'=Cy && A'_{\alpha}=CA_{\alpha}C^{-1} \end{align} This is a very standard thing to do in vector spaces. It is easy to see that the primed automaton computes exactly the same weight. For example, the weight of the word abc is \begin{align} &\phantom{=} x'^T A'_aA'_bA'_c y' \\ &= (Cx)^T (CA_aC^{-1}) (CA_bC^{-1}) (CA_cC^{-1}) (Cy) \\ &= x^T (C^T C) A_a (C^{-1}C) A_b (C^{-1}C) A_c (C^{-1}C) y \\ &= x^T (C^T C) A_aA_bA_c y \end{align} Hmmm … not quite the same: the matrix $C$ must also come from an orthonormal basis; that is, $C^T C=I_n$. But this is doable. And, of course, the same calculation works for any word, not just abc. Two automata that describe the same function $\Sigma^*\to\mathbb{R}$ are said to be equivalent.

Example. Continuing the previous example, we could pick $C=A_{\alpha}$. It is invertible and $C^T C=I_4$. The resulting automaton is the following: \begin{align} &x=\left[\matrix{0\cr0\cr0\cr1\cr}\right] &&A_{\alpha} = \left[\matrix{0&1&0&0\cr0&0&1&0\cr0&0&0&1\cr1&0&0&0}\right] &&y=\left[\matrix{1\cr0\cr1\cr0}\right] \end{align} This example is perhaps too simple: It amounts to renaming states. One could also pick a more complicated $C$, though. ∎
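This particular change of base is easy to verify with a few lines of numpy; the sketch below just redoes the example's arithmetic and confirms that the weights are unchanged:

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0, 1.0])

C = A                              # a permutation matrix, so C^T C = I_4
xp, yp, Ap = C @ x, C @ y, C @ A @ C.T

assert np.allclose(xp, [0, 0, 0, 1])
assert np.allclose(yp, [1, 0, 1, 0])
assert np.allclose(Ap, A)          # conjugating the cyclic shift by itself

# Both automata assign the same weight to every word length.
for k in range(8):
    Ak, Apk = np.linalg.matrix_power(A, k), np.linalg.matrix_power(Ap, k)
    assert np.isclose(x @ Ak @ y, xp @ Apk @ yp)
```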

Changing the number of states. The calculation for the change of base suggests that what we really want is this: \begin{align} & C^TC=I_n && x'=Cx && y'=Cy && A'_{\alpha}=CA_{\alpha}C^T \end{align} We don't need to insist on $C$ being invertible, and therefore square! Now $n'$, the size of the primed automaton, may be different from $n$. Well, it must be that $n'\ge n$: The matrix $C$ has dimension $n'\times n$, and $C^TC=I_n$ says that the columns of $C$ form an orthonormal basis for an $n$-dimensional subspace of $\mathbb{R}^{n'}$. In other words, we have a very simple recipe for increasing the number of states. For any $n'\ge n$ it's easy to find some $n$-dimensional subspace of $\mathbb{R}^{n'}$, then choose an orthonormal basis for it, call it $C$, and then compute $Cx$, $Cy$, and $CA_{\alpha}C^T$.
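Here is the growing recipe as a numpy sketch. The target size $6$ and the random subspace are arbitrary choices of mine; any $C$ with orthonormal columns works:

```python
import numpy as np

x = np.array([1.0, 0.0, 0.0, 0.0])
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
y = np.array([0.0, 1.0, 0.0, 1.0])

# Orthonormal basis of a random 4-dimensional subspace of R^6:
rng = np.random.default_rng(0)
C, _ = np.linalg.qr(rng.standard_normal((6, 4)))   # C is 6x4, C^T C = I_4
assert np.allclose(C.T @ C, np.eye(4))

xp, yp, Ap = C @ x, C @ y, C @ A @ C.T             # a 6-state equivalent

for k in range(8):
    Ak, Apk = np.linalg.matrix_power(A, k), np.linalg.matrix_power(Ap, k)
    assert np.isclose(x @ Ak @ y, xp @ Apk @ yp)
```

The key step is that $(CA_{\alpha}C^T)^k = CA_{\alpha}^kC^T$, which uses $C^TC=I_n$ in exactly the way the calculation above does.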

Reducing the number of states. Of course, increasing the number of states seems like a rather dumb thing to do. So, the next thing that comes to mind is to ask when the process from above can be reversed. That is, if we are given $x'$, $y'$, $A'_{\alpha}$, can we compute $x$, $y$, $A_{\alpha}$? These are necessary conditions: \begin{align} & x=C^Tx' && y=C^Ty' && A_{\alpha}=C^TA'_{\alpha}C \end{align} (From $Cx=x'$ we get $C^TCx=C^Tx'$ and finally $x=C^Tx'$. The others are similar.) These conditions are not sufficient. For example, $x'=Cx$ does not follow from $x=C^Tx'$, unless $x'$ is in the image of $C$. If it is, then there must be a $z$ such that $x'=Cz$, and it must be that $z=x$: $$x=C^Tx'=C^TCz=z$$ The same argument works for $y$. So, we know that we probably want to pick $C$ such that its image includes the given $x'$ and $y'$.

But what about $A_{\alpha}$? If we follow the same line of reasoning, we want to find some sufficient condition under which $A_{\alpha}=C^TA'_{\alpha}C$ implies $A'_{\alpha}=CA_{\alpha}C^T$. So, let's try it. We are given $A'_{\alpha}$ and, optimistically, we compute $A_{\alpha}$ as $C^TA'_{\alpha}C$. Then we check if we got the right result by plugging into the desired condition $A'_{\alpha}=CA_{\alpha}C^T$: $$A'_{\alpha} = CC^TA'_{\alpha}CC^T$$ These matrices are the same if they affect all vectors $z'$ in the same way: $$A'_{\alpha} z' = CC^TA'_{\alpha}CC^T z'$$ And this provides a strong hint as to what sufficient conditions we are looking for. The following (optimistic) calculation almost suggests itself: \begin{align} &\phantom{=} CC^TA'_{\alpha}CC^Tz' \\ &= \{ \text{ASSUME } z'=Cz \text{ for some } z \} \\ &\phantom{=} CC^TA'_{\alpha}CC^TCz \\ &= \{ \text{because } C^TC=I_n \} \\ &\phantom{=} CC^TA'_{\alpha}Cz \\ &= \{ \text{because } z'=Cz \} \\ &\phantom{=} CC^TA'_{\alpha}z' \\ &= \{ \text{denote } A'_{\alpha}z' \text{ by } u' \} \\ &\phantom{=} CC^Tu' \\ &= \{ \text{ASSUME } u'=Cu \text{ for some } u \} \\ &\phantom{=} CC^TCu \\ &= \{ \text{because } C^TC=I_n \}\\ &\phantom{=} Cu \\ &= \{ \text{because } Cu=u'=A'_{\alpha}z' \} \\ &\phantom{=} A'_{\alpha}z' \end{align} So, the equality almost works, except we had to make two new assumptions. We now know that $A'_{\alpha}$ and $CC^TA'_{\alpha}CC^T$ affect a vector $z'$ in the same way provided that: (1) $z'$ is in the image of $C$, and (2) $A'_{\alpha}z'$ is in the image of $C$. This looks very promising.
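The two conditions can be sanity-checked numerically. In the sketch below, $CC^T$ is the orthogonal projector onto the image of $C$; the coefficient vector $[2,3]$ is an arbitrary choice of mine, and the particular $C$ is one whose image happens to be invariant under $A$, so both conditions hold:

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

# Columns span a 2-dimensional subspace that A happens to map into itself.
C = np.array([[0, 1],
              [1, 0],
              [0, 1],
              [1, 0]], dtype=float) / np.sqrt(2)
assert np.allclose(C.T @ C, np.eye(2))

zp = C @ np.array([2.0, 3.0])      # (1) z' is in the image of C
# (2) A z' is again in the image of C, since the image is A-invariant here,
# so the projector CC^T leaves both z' and A z' unchanged:
assert np.allclose(C @ C.T @ A @ C @ C.T @ zp, A @ zp)
```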

Let's recap the setting. We are given $x'$, $y'$, $A'_{\alpha}$. We pick some $C$ such that $C^TC=I_n$. We compute $x=C^Tx'$ and $y=C^Ty'$ and $A_{\alpha}=C^TA'_{\alpha}C$. We hope that if the image of $C$ includes all the right things, then we have an equivalent automaton. Let's look at the weight associated to the word abc. \begin{align} &\phantom{=} x^TA_aA_bA_cy \\ &= \{ \text{we computed } x, y, A_{\alpha} \text{ as } x=C^Tx' \ldots \} \\ &\phantom{=} (C^Tx')^T (C^TA'_aC) (C^TA'_bC) (C^TA'_cC) (C^Ty') \\ &= \{ \text{associativity} \} \\ &\phantom{=} x'^T (CC^T) A'_a (CC^T) A'_b (CC^T) A'_c (CC^T) y' \\ &= \{ \text{ASSUME } y' \text{ and } A'_cy' \text{ are in the image of } C \} \\ &\phantom{=} x'^T (CC^T) A'_a (CC^T) A'_b A'_c y' \\ &= \{ \text{ASSUME } A'_bA'_cy' \text{ and } A'_aA'_bA'_cy' \text{ are in the image of } C \} \\ &\phantom{=} x'^T A'_a A'_b A'_c y' \end{align} Here we need to assume that the image of $C$ contains $y'$, $A'_cy'$, $A'_bA'_cy'$, $A'_aA'_bA'_cy'$. We did not need to assume that the image of $C$ contains $x'$. In other words, we set out to revert the procedure for growing automata, but found something more general.

The general case should now be clear, so let's summarize it. If the columns of $C$ form an orthonormal basis for a subspace that contains all vectors of the form $A'_{\alpha_1}\cdots A'_{\alpha_k}y'$, one for each word $\alpha_1\ldots\alpha_k$ (including the empty word, which gives $y'$ itself), then $x$, $y$, $A_{\alpha}$ computed as above indeed give an equivalent automaton. Let's call this the backward minimization procedure. By symmetry, there is also a forward minimization procedure. If the columns of $C$ form an orthonormal basis for a subspace that contains all vectors of the form $A'^T_{\alpha_k}\cdots A'^T_{\alpha_1}x'$, then $x$, $y$, $A_{\alpha}$ computed as above indeed give an equivalent automaton.
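The backward procedure needs nothing more than Gram–Schmidt: close the span of $y'$ under all the matrices $A'_{\alpha}$, keeping only genuinely new directions. A numpy sketch (the names `backward_basis` and `add` are mine):

```python
import numpy as np

def backward_basis(As, y, tol=1e-9):
    """Orthonormal basis, as columns of C, for the smallest subspace
    that contains y and is closed under every matrix in As."""
    basis = []

    def add(v):
        for b in basis:                    # Gram-Schmidt against the basis
            v = v - (b @ v) * b
        if np.linalg.norm(v) > tol:
            basis.append(v / np.linalg.norm(v))
            return True
        return False

    add(y)
    frontier = list(basis)
    while frontier:
        v = frontier.pop()
        for A in As.values():
            if add(A @ v):                 # a new direction: explore it too
                frontier.append(basis[-1])
    return np.column_stack(basis)

x = np.array([1.0, 0.0, 0.0, 0.0])
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
y = np.array([0.0, 1.0, 0.0, 1.0])

C = backward_basis({'a': A}, y)            # here C turns out to be 4x2
xm, ym, Am = C.T @ x, C.T @ y, C.T @ A @ C

for k in range(8):
    assert np.isclose(x  @ np.linalg.matrix_power(A,  k) @ y,
                      xm @ np.linalg.matrix_power(Am, k) @ ym)
```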

Example. Let's minimize our (first) example automaton.

The forward minimization procedure doesn't help much. If we start with $x=[1,0,0,0]^T$ and keep applying $A^T_{\alpha}$, then we can also reach $[0,1,0,0]^T$ and $[0,0,1,0]^T$ and $[0,0,0,1]^T$. The space that includes all these vectors is clearly $4$-dimensional.
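A quick numeric confirmation: stack the reachable configurations (as columns, which is why the transpose shows up) and compute the rank of their span.

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])

vecs = [x]
for _ in range(3):                 # x, A^T x, (A^T)^2 x, (A^T)^3 x
    vecs.append(A.T @ vecs[-1])

# The forward-reachable configurations already span all of R^4.
assert np.linalg.matrix_rank(np.column_stack(vecs)) == 4
```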

What about the backward minimization procedure? If we start with $y=[0,1,0,1]^T$ and keep applying $A_{\alpha}$ we'll generate only one more vector, namely $[1,0,1,0]^T$. We can cover these two vectors with a $2$-dimensional orthonormal basis. $$C = \frac{1}{\sqrt{2}} \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}$$ The minimized automaton will have $x=[0,1/\sqrt{2}]^T$ and $y=[\sqrt{2},0]^T$ and $A_{\alpha}=\begin{bmatrix}0&1\\1&0\end{bmatrix}$. It is not too hard to check that these indeed compute weight $1$ for words of odd length, and weight $0$ for the others. ∎
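The final check is also mechanical; this numpy sketch redoes the arithmetic of the example:

```python
import numpy as np

A4 = np.array([[0, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 1],
               [1, 0, 0, 0]], dtype=float)
x4 = np.array([1.0, 0.0, 0.0, 0.0])
y4 = np.array([0.0, 1.0, 0.0, 1.0])

C = np.array([[0, 1],
              [1, 0],
              [0, 1],
              [1, 0]], dtype=float) / np.sqrt(2)

x2, y2, A2 = C.T @ x4, C.T @ y4, C.T @ A4 @ C
assert np.allclose(x2, [0, 1 / np.sqrt(2)])
assert np.allclose(y2, [np.sqrt(2), 0])
assert np.allclose(A2, [[0, 1], [1, 0]])

# Weight 1 for odd-length words, 0 for even-length ones:
for k in range(8):
    assert np.isclose(x2 @ np.linalg.matrix_power(A2, k) @ y2, k % 2)
```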

Reference. The paper I used is [Kiefer, Wachter, Stability and Complexity of Minimizing Probabilistic Automata, 2014]. If you read the paper, you'll find

• that applying the two minimization procedures (in any order) yields a minimal automaton, and
• that the algorithm is similar to Brzozowski's algorithm.

As the title of the paper suggests, many other topics are covered.