## 29 September 2010

### Expected Size of a Set

You insert $a_1$, $a_2$, $\ldots$, $a_n$ into an initially empty set data structure $S$. What is the size $s$ of $S$ at the end?

Say that each $a_k$ is taken from $1.\,.\,m$ and all $m^n$ sequences are equally likely. If $n=0$ then $s=0$; if $n\to\infty$ then $s\to m$; and a quick and dirty simulator shows that if $n=m$ then $s\approx0.63m$. The number $c(n,k,l)$ of sequences of length $n$ that have exactly $k$ distinct elements taken from $1.\,.\,l$ satisfies $$c(n,k,l)=c(n,k,l-1)+\sum_{q>0} {n \choose q} c(n-q,k-1,l-1).$$ This gives a way to compute $s$ exactly for $m=n<100$, and it seems that indeed $s/n$ approaches $\approx0.63$ from above as $n$ grows.
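Such a quick and dirty simulator might look like this (a minimal sketch; the function name and the trial count are my choices):

```python
import random

def simulate(m, n, trials=2000):
    """Estimate the expected set size after inserting n uniform draws
    from an alphabet of m letters, by plain Monte Carlo."""
    total = 0
    for _ in range(trials):
        total += len({random.randrange(m) for _ in range(n)})
    return total / trials

# With n = m the ratio s/m hovers around 0.63.
print(simulate(1000, 1000) / 1000)
```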

Do you know a formula for $0.63$? For $s(m=n)$? For $s(m,n)$? (I don't.)

ja said...

1 - 1/e

rgrig said...

Do you know why?

ja said...

Inclusion/exclusion, I think.

Let the measure of a sequence be the number of distinct elements.

Assume $m = n$. There are $n!$ sequences with measure $n$. How many sequences are there with measure $< n-1$?

rgrig said...

I don't follow.

Let me tell you what I tried today.

First, I want to explain the equation above, after changing the notation. $$c(m,n,s)=c(m-1,n,s)+\sum_{k>0} {n\choose k} c(m-1,n-k,s-1)$$ Here $c(m,n,s)$ is the number of sequences $a_1, a_2, \ldots, a_n$ over the alphabet $1.\,.\,m$ that result in a set of size $s$. The letter $m$ appears $k=0,1,2,\ldots$ times in such a sequence. For each $k$, there are ${n\choose k}$ possible choices for the positions of the letter $m$. Deleting the letter $m$ leaves a sequence of length $n-k$ over the alphabet $1.\,.\,m-1$, which leads to a set of size $s$ (when $k=0$) or $s-1$ (when $k>0$).

The boundary condition should tell us how to evaluate $c(m,n,s)$ when one of the arguments is $0$. We have $c(m,0,0)=1$ for all $m$ (an empty sequence yields the empty set) and $0$ in all other boundary cases.

Note that $c(m,n,s)=0$ when $s>m$ or $s>n$. Also, we may let the sum start at $k=0$ if we compensate for the extra term: $$c(m,n,s)=c(m-1,n,s)-c(m-1,n,s-1)+\sum_k {n\choose k} c(m-1,n-k,s-1)$$

As a sanity check we may verify that $c(n,n,n)=n!$ as you noticed. The base case is given by the boundary condition. In $$c(n+1,n+1,n+1)=c(n,n+1,n+1)+\sum_{k>0}{n+1\choose k}c(n,n+1-k,n)$$ the only non-zero term is ${n+1\choose 1}c(n,n,n)$ which evaluates to $(n+1)!$ if we use the induction hypothesis.
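The recurrence is easy to memoize; here is a sketch (function names are mine, and I take the boundary condition to be that the empty sequence over any alphabet gives the empty set):

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def c(m, n, s):
    """Number of sequences of length n over the alphabet 1..m whose set
    of distinct letters has size s, via the recurrence above."""
    if s < 0 or s > m or s > n:
        return 0
    if n == 0:
        return 1 if s == 0 else 0  # empty sequence -> empty set
    if m == 0:
        return 0                   # no letters, but n > 0
    return c(m - 1, n, s) + sum(
        comb(n, k) * c(m - 1, n - k, s - 1) for k in range(1, n + 1))

def expected_size(m, n):
    """s(m, n) = sum_s s * c(m, n, s) / m^n."""
    return sum(s * c(m, n, s) for s in range(m + 1)) / m ** n

# Sanity checks from the discussion:
assert all(c(n, n, n) == factorial(n) for n in range(1, 8))
print(expected_size(30, 30) / 30)  # approaches 1 - 1/e from above
```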

Second, here's what I tried today. Let $$C(x,y,z)=\sum_{m,n,s} c(m,n,s) x^m \frac{y^n}{n!} z^s$$ Then $C=xC-xzC+xze^yC+(1-x)^{-1}$. The last term comes from the boundary condition; $e^y$ appears because the sum looks like a binomial convolution with $\langle1,1,1,\ldots\rangle$. So $$C=\frac{1}{(1-x)(1-(ze^y-z+1)x)}$$

Now, writing $[x^m]F$ for the coefficient of $x^m$ in the formal series $F$, we have $\sum_s c(m,n,s) z^s=[y^n/n!][x^m]C$ because $C=\sum_{m,n} x^m y^n/n! \sum_s c(m,n,s) z^s$. Let us write $D(z)$ for $\sum_s c(m,n,s) z^s$. The series $C$ is useful because $D$ can be computed from it, and once $D$ is known the answer to the original question is $$s(m,n)=\frac{D'(1)}{m^n}$$

To compute $[x^m]C$ we treat $y$ and $z$ as constants. Since $1/(1-cx)$ is the ogf of $\langle c^m\rangle_m$, we have that $C$ is the ogf of the convolution of $\langle1\rangle_m$ and $\langle(ze^y-z+1)^m\rangle_m$, so $$[x^m]C=\sum_{k=0}^m (ze^y-z+1)^k$$ To compute $[y^n/n!][x^m]C$ we treat $z$ as a constant. We see that $ze^y-z+1$ is the egf of $\langle1,z,z,z,\ldots\rangle$, so $(ze^y-z+1)^k$ is the egf of the binomial convolution of $k$ copies of this sequence. Here the computation becomes a bit hairy. I got $s(m,n)=(0^{n+1}+1^{n+1}+\cdots+m^{n+1})/m^n$, which must be wrong because it gives $s(n,n)>n$ for $n>1$.
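The overshoot is easy to see numerically; already at $m=n=2$ the candidate formula gives $9/4>2$, even though the set can never have more than $n$ elements. A quick check (the name `candidate` is mine):

```python
def candidate(m, n):
    """The (incorrect) closed form obtained above."""
    return sum(i ** (n + 1) for i in range(m + 1)) / m ** n

# A set built from n insertions can never exceed n elements, yet:
for n in range(2, 6):
    print(n, candidate(n, n))
```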

I might let this rest for a while. Mistakes tend to glare at you after a break.

rgrig said...

Last night (while half asleep) I realized that it is very easy to get an approximation of $s(m,n)$. Imagine $m$ bins, of which $x$ are empty. Then you throw a ball into one of the $m$ bins, chosen uniformly. The expected number of empty bins afterwards is $(x-1)x/m+x(m-x)/m$, which is $x(m-1)/m$. Hand-waving a bit, we see that $$m-s(m,n)\approx m\left(1-\frac{1}{m}\right)^n$$ which explains the $1-1/e$.
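For small cases one can compare this estimate with brute-force enumeration over all $m^n$ sequences, and the two match (a sketch; `exact` and `approx` are my names):

```python
from itertools import product

def exact(m, n):
    """Exact s(m, n) by enumerating all m^n sequences."""
    return sum(len(set(seq)) for seq in product(range(m), repeat=n)) / m ** n

def approx(m, n):
    """The bin-counting estimate m * (1 - (1 - 1/m)^n)."""
    return m * (1 - (1 - 1 / m) ** n)

for m, n in [(2, 3), (3, 3), (4, 5), (5, 4)]:
    print(m, n, exact(m, n), approx(m, n))
```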

The remaining problem is to make the above argument rigorous.

This morning I realized that I had come up with exactly the same solution as in this comment three years ago. The funny thing is that at the time a physicist was trying to convince me that the argument was not rigorous enough.

rgrig said...

... and now I just noticed that someone gave me the rigorous argument. I asked the question on math.stackexchange yesterday.

I feel pretty stupid now.