You insert a1, a2, …, an into an initially empty set data structure S. What is the size s of S at the end?
Say that each ak is taken from 1..m and all mn sequences are equally likely. If n=0 then s=0; if n→∞ then s→m; and a quick and dirty simulator shows that if n=m then s≈0.63m. The number c(n,k,l) of sequences of length n that have exactly k unique elements taken from 1..l satisfies c(n,k,l)=c(n,k,l−1)+∑q>0(nq)c(n−q,k−1,l−1). This gives a way to compute s exactly for m=n<100, and it seems that indeed s/n approaches ≈0.63 from above as n grows.
Do you know a formula for 0.63? For s(m=n)? For s(m,n)? (I don't.)
6 comments:
1 - 1/e
Do you know why?
Inclusion/exclusion, I think.
Let the measure of a sequence be the number of distinct elements.
Assume m=n. There are n! sequences with measure n. How many sequences are there with measure <n−1?
I don't follow.
Let me tell you what I tried today.
First, I want to explain the equation above, after changing the notation. c(m,n,s)=c(m−1,n,s)+∑k>0(nk)c(m−1,n−k,s−1) Here c(m,n,s) is the number of sequences a0,a1,…,an over the alphabet 1..m that result in a set of size s. The letter m appears k=0,1,2,… times in such a sequence. For each k, there are (nk) possible positions of the letter m. By deleting letter m we obtain a sequence of length n−k over the alphabet 1..m−1. This smaller sequence leads to a set of size s or s−1 depending on whether k=0 or k>0.
The boundary condition should tell us how to evaluate c(m,n,s) when one of the argument is 0. We have c(0,0,s)=1 and 0 in all other boundary cases.
Note that c(m,n,s)=0 when s>m or s>n. Also, we may simplify the boundary of the sum, c(m,n,s)=c(m−1,n,s)−c(m−1,n,s−1)+∑k(nk)c(m−1,n−k,s−1)
As a sanity check we may verify that c(n,n,n)=n! as you noticed. The base case is given by the boundary condition. In c(n+1,n+1,n+1)=c(n,n+1,n+1)+∑k>0(n+1k)c(n,n+1−k,n) the only non-zero term is (n+11)c(n,n,n) which evaluates to (n+1)! if we use the induction hypothesis.
Second, here's what I tried today. Let C(x,y,z)=∑m,n,sc(m,n,s)xmynn!zs Then C=xC−xzC+xzeyC+(1−x)−1. The last term comes from the boundary condition; ey appears because the sum looks like a binomial convolution with ⟨1,1,1,…⟩ So C=1(1−x)(1−(zey−z+1)x)
Now, writing [xm]F for the coefficient of xm in the formal series F, we have ∑sc(m,n,s)zs=[yn/n!][xm]C because C=∑m,nxmyn/n!∑sc(m,n,s)zs. Let us write D(z) for ∑sc(m,n,s)zs. The series C is useful because D can be computed from it and once D is known the answer to the original question is s(m,n)=D′(1)mn
To compute [xm]C we treat y and z as constants. Since 1/(1−cx) is the ogf of ⟨cm⟩m we have that C is the ogf of the convolution of ⟨1⟩m and ⟨(zey−z+1)m⟩m, so [xm]C=m∑k=0(zey−z+1)k. To compute [yn/n!][xm]C we treat z as a constant. We see that zey−z+1 is the egf of ⟨1,z,z,z,…⟩ so (zey−z+1)k is the egf of the binomial convolution of m copies of such sequences. Here the computation becomes a bit hairy. I got s(m,n)=(0n+1+1n+1+⋯+mn+1)/mn, which must be wrong because s(n,n)>n for n>0.
I might leave this rest for awhile. Mistakes tend to glare at you after taking a break.
Last night (while half asleep) I realized that it is very easy to get an approximation of s(m,n). Imagine m bins out of which x are free. Then you throw a ball in one of the m bins. The expected number of empty bins afterwards is (x−1)x/m+x(m−x)/m which is x(m−1)/m. Hand-waving a bit, we see that m−s(m,n)≈m(1−1m)n which explains the 1−1/e.
The remaining problem is to make the above argument rigorous.
This morning I realized that I came up with exactly the same solution as in this comment three years ago. The funny thing is that at the time a physicist was trying to convince me that the argument is not rigorous enough.
... and now I just noticed that someone gave me the rigouros argument. I asked the question on mathexchange yesterday.
I feel pretty stupid now.
Post a Comment
Note: (1) You need to have third-party cookies enabled in order to comment on Blogger. (2) Better to copy your comment before hitting publish/preview. Blogger sometimes eats comments on the first try, but the second works. Crazy Blogger.