Processing math: 100%

29 September 2010

Expected Size of a Set

You insert a1, a2, , an into an initially empty set data structure S. What is the size s of S at the end?

Say that each ak is taken from 1..m and all mn sequences are equally likely. If n=0 then s=0; if n then sm; and a quick and dirty simulator shows that if n=m then s0.63m. The number c(n,k,l) of sequences of length n that have exactly k unique elements taken from 1..l satisfies c(n,k,l)=c(n,k,l1)+q>0(nq)c(nq,k1,l1). This gives a way to compute s exactly for m=n<100, and it seems that indeed s/n approaches 0.63 from above as n grows.

Do you know a formula for 0.63? For s(m=n)? For s(m,n)? (I don't.)

6 comments:

ja said...

1 - 1/e

rgrig said...

Do you know why?

ja said...

Inclusion/exclusion, I think.

Let the measure of a sequence be the number of distinct elements.

Assume m=n. There are n! sequences with measure n. How many sequences are there with measure <n1?

rgrig said...

I don't follow.

Let me tell you what I tried today.

First, I want to explain the equation above, after changing the notation. c(m,n,s)=c(m1,n,s)+k>0(nk)c(m1,nk,s1) Here c(m,n,s) is the number of sequences a0,a1,,an over the alphabet 1..m that result in a set of size s. The letter m appears k=0,1,2, times in such a sequence. For each k, there are (nk) possible positions of the letter m. By deleting letter m we obtain a sequence of length nk over the alphabet 1..m1. This smaller sequence leads to a set of size s or s1 depending on whether k=0 or k>0.

The boundary condition should tell us how to evaluate c(m,n,s) when one of the argument is 0. We have c(0,0,s)=1 and 0 in all other boundary cases.

Note that c(m,n,s)=0 when s>m or s>n. Also, we may simplify the boundary of the sum, c(m,n,s)=c(m1,n,s)c(m1,n,s1)+k(nk)c(m1,nk,s1)

As a sanity check we may verify that c(n,n,n)=n! as you noticed. The base case is given by the boundary condition. In c(n+1,n+1,n+1)=c(n,n+1,n+1)+k>0(n+1k)c(n,n+1k,n) the only non-zero term is (n+11)c(n,n,n) which evaluates to (n+1)! if we use the induction hypothesis.

Second, here's what I tried today. Let C(x,y,z)=m,n,sc(m,n,s)xmynn!zs Then C=xCxzC+xzeyC+(1x)1. The last term comes from the boundary condition; ey appears because the sum looks like a binomial convolution with 1,1,1, So C=1(1x)(1(zeyz+1)x)

Now, writing [xm]F for the coefficient of xm in the formal series F, we have sc(m,n,s)zs=[yn/n!][xm]C because C=m,nxmyn/n!sc(m,n,s)zs. Let us write D(z) for sc(m,n,s)zs. The series C is useful because D can be computed from it and once D is known the answer to the original question is s(m,n)=D(1)mn

To compute [xm]C we treat y and z as constants. Since 1/(1cx) is the ogf of cmm we have that C is the ogf of the convolution of 1m and (zeyz+1)mm, so [xm]C=mk=0(zeyz+1)k. To compute [yn/n!][xm]C we treat z as a constant. We see that zeyz+1 is the egf of 1,z,z,z, so (zeyz+1)k is the egf of the binomial convolution of m copies of such sequences. Here the computation becomes a bit hairy. I got s(m,n)=(0n+1+1n+1++mn+1)/mn, which must be wrong because s(n,n)>n for n>0.

I might leave this rest for awhile. Mistakes tend to glare at you after taking a break.

rgrig said...

Last night (while half asleep) I realized that it is very easy to get an approximation of s(m,n). Imagine m bins out of which x are free. Then you throw a ball in one of the m bins. The expected number of empty bins afterwards is (x1)x/m+x(mx)/m which is x(m1)/m. Hand-waving a bit, we see that ms(m,n)m(11m)n which explains the 11/e.

The remaining problem is to make the above argument rigorous.

This morning I realized that I came up with exactly the same solution as in this comment three years ago. The funny thing is that at the time a physicist was trying to convince me that the argument is not rigorous enough.

rgrig said...

... and now I just noticed that someone gave me the rigouros argument. I asked the question on mathexchange yesterday.

I feel pretty stupid now.

Post a Comment

Note: (1) You need to have third-party cookies enabled in order to comment on Blogger. (2) Better to copy your comment before hitting publish/preview. Blogger sometimes eats comments on the first try, but the second works. Crazy Blogger.