The (a,b,1) class

This post is a continuation of the preceding post on the (a,b,0) class. This post introduces the class of discrete distributions called the (a,b,1) class.

The discussion in this post contains a great deal of technical detail. A concise summary of the (a,b,0) class and the (a,b,1) class is found here.

The (a,b,1) Class

A counting distribution is a discrete probability distribution that takes on values in the non-negative integers. Let N be a random variable having the counting distribution being considered. For each integer k=0,1,2,\cdots, let P_k=P[N=k]. Recall that a counting distribution is a member of the (a,b,0) class of distributions if the following recursive relation holds for some constants a and b.

(1)……….\displaystyle \frac{P_k}{P_{k-1}}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots

For a member of the (a,b,0) class, the initial probability P_0 is fixed (the probabilities P_k must sum to 1). Thus a member of the (a,b,0) class has two parameters, namely a and b in the recursive relation (1). A counting distribution is a member of the (a,b,1) class of distributions if the following recursive relation holds for some constants a and b.

(2)……….\displaystyle \frac{P_k}{P_{k-1}}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=2,3,4 \cdots

The recursion in the (a,b,1) class begins at k=2. That means that the initial probability P_0 must be an assumed value. The probability P_1 is then the value such that the sum P_1+P_2+\cdots is 1-P_0. Thus a member of the (a,b,1) class has three parameters: a, b and P_0.

There are two subclasses in the (a,b,1) class of distributions. They are determined by whether P_0=0 or P_0>0. The (a,b,1) distributions in the first category are called zero-truncated distributions. The (a,b,1) distributions in the second category are called zero-modified distributions.

This is how we will proceed. Using a given distribution in the (a,b,0) class as a starting point, we show how to derive the zero-truncated distribution. From a given zero-truncated distribution, we show how to derive the zero-modified distribution.

The name of an (a,b,1) distribution is the name of the corresponding (a,b,0) distribution with either zero-truncated or zero-modified as a prefix. For example, if the starting point is the negative binomial distribution in the (a,b,0) class, then the derived distributions in the (a,b,1) class are the zero-truncated negative binomial distribution and the zero-modified negative binomial distribution.

There are only three distributions in the (a,b,0) class – Poisson, binomial and negative binomial. The (a,b,1) class therefore contains the zero-truncated and zero-modified versions of these three distributions. However, the (a,b,1) class also contains distributions that are not modifications of the (a,b,0) distributions. We discuss three such additional distributions – the extended truncated negative binomial (ETNB) distribution, the logarithmic distribution and the Sibuya distribution. These three distributions, which are not derived from an (a,b,0) distribution, are discussed in a separate section below.

We present three examples demonstrating how to use a base (a,b,0) negative binomial distribution to derive a zero-truncated negative binomial distribution (Example 1) and a zero-modified negative binomial distribution (Example 2). We also give an example for the ETNB distribution (Example 3).

Notations

To facilitate discussion, let’s fix some notations. To clearly denote the distributions, notations without superscripts and subscripts refer to the (a,b,0) distributions. Notations with the superscript T (or subscript T) refer to the zero-truncated distributions in the (a,b,1) class. Likewise notations with the superscript M (or subscript M) refer to the zero-modified distributions in the (a,b,1) class.

For example, the following are the probability function (pf) and the probability generating function (pgf) of a distribution from the (a,b,0) class.

(3)……….\displaystyle \begin{aligned}&P_k=P[N=k] \\&P(z)=\sum \limits_{k=0}^\infty P_k z^k  \end{aligned}

The following shows the notations for the pf and pgf for a zero-truncated distribution from the (a,b,1) class.

(4)……….\displaystyle \begin{aligned}&P_k^T=P[N_T=k] \\&P^T(z)=\sum \limits_{k=1}^\infty P_k^T z^k  \end{aligned}

The following shows the notations for the pf and pgf for a zero-modified distribution from the (a,b,1) class.

(5)……….\displaystyle \begin{aligned}&P_k^M=P[N_M=k] \\&P^M(z)=\sum \limits_{k=0}^\infty P_k^M z^k  \end{aligned}

Whenever it is convenient to do so, N denotes a random variable from the (a,b,0) class while N_T and N_M denote random variables for the zero-truncated distribution and the zero-modified distribution, respectively.

Zero-Truncated Distributions

The focus in this section is on the zero-truncated distributions that originate from the (a,b,0) class. The three distributions indicated above (ETNB, logarithmic and Sibuya) are discussed in a separate section below.

Suppose we start with a distribution from the (a,b,0) class, with the notations P_k, P(z) and N as indicated above. We show how to derive the corresponding zero-truncated distribution in the (a,b,1) class. For the zero-truncated distribution, there are two ways to compute probabilities. One is the recursion relation:

(6)……….\displaystyle \frac{P_k^T}{P_{k-1}^T}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=2,3,4 \cdots

The recursion relation (6) is identical to the one in (2). The recursion begins at k=2. The zero-truncated probabilities can also be derived from the (a,b,0) probabilities as follows:

(7)……….\displaystyle P_k^T=\frac{1}{1-P_0} P_k \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,4 \cdots

The probabilities P_k^T in (7) can be regarded as conditional probabilities – the probability that N=k given that N>0. From a procedural standpoint, the probabilities P_k^T are the (a,b,0) probabilities P_k multiplied by 1/(1-P_0) to make the probabilities sum to 1. With the probabilities established, the probability generating function (pgf) and mean and moments of the zero-truncated distribution can also be expressed in terms of the corresponding quantities of the (a,b,0) distribution.

(8)……….\displaystyle P^T(z)=\sum \limits_{k=1}^\infty P_k^T \ z^k =\frac{1}{1-P_0} \ [P(z)-P_0]

(9)……….\displaystyle E[N_T]=\frac{1}{1-P_0} \ E[N]

(10)……..\displaystyle E[N_T^2]=\frac{1}{1-P_0} \ E[N^2]

(11)……..\displaystyle \begin{aligned} Var[N_T]&=\frac{1}{1-P_0} \ E[N^2]-\biggl( \frac{1}{1-P_0} \ E[N]\biggr)^2 \\&=\frac{1}{1-P_0} \ Var[N]+\biggl(1-\frac{1}{1-P_0} \biggr) \ \frac{1}{1-P_0} \ E[N]^2  \end{aligned}

The above items express quantities of the zero-truncated distribution in terms of the corresponding quantities of the (a,b,0) distribution. They can also be derived from definitions using the probability function (7). The following shows the factorial moments of the zero-truncated distribution.

(12)……..\displaystyle \mu_{(1)}=E[N_T]=\frac{a+b}{(1-a) (1-P_0)}

(13)……..\displaystyle \begin{aligned} \mu_{(j)}&=E \{ N_T \ [N_T-1] \ [N_T-2] \cdots [N_T-(j-1)] \}\\&\text{ } \\&=\frac{(a j+b) \ \mu_{(j-1)}}{1-a}  \end{aligned}

The first factorial moment is identical to the mean of N_T. The P_0 in \mu_{(1)} is the probability at zero for the corresponding member of the (a,b,0) class. The higher factorial moments are derived recursively as in the (a,b,0) case. The raw moments E[N_T^k] can be derived using the factorial moments. The variance, as derived from the factorial moments, is:

(14)……..\displaystyle Var[N_T]=\frac{(a+b) \ [1-(a+b+1) \ P_0]}{[(1-a) \ (1-P_0) ]^2}

Example 1
It is helpful to go through an example. First, we set up an (a,b,0) distribution – a negative binomial distribution with parameters r=2 and \theta=3. The (a,b,0) parameters are a=3/4 and b=3/4. The following gives the pf and the recursive relation for this negative binomial distribution, as well as the mean, variance and pgf.

……….\displaystyle P_k=(1+k) \ \frac{1}{16} \ \biggl(\frac{3}{4} \biggr)^k \ \ \ \ \ \ \ \ k=0,1,2,3,\cdots

……….\displaystyle P_k=\biggl(\frac{3}{4}+\frac{3}{4} \ \frac{1}{k} \biggr) \ P_{k-1} \ \ \ \ \ \ \ \ k=1,2,3,\cdots

……….\displaystyle E[N]=6

……….\displaystyle Var[N]=24

……….\displaystyle P(z)=[1-3 \ (z-1)]^{-2}

The following table shows the first 5 probabilities for the zero-truncated negative binomial distribution.

Table – Zero-Truncated Negative Binomial

\bold k (a,b,0) \bold P_{\bold k} Zero-Truncated \bold  P_{\bold k}^{\bold T}
0 \displaystyle \frac{1}{16}
1 \displaystyle \frac{3}{32} \displaystyle \frac{3}{30}
2 \displaystyle \frac{27}{256} \displaystyle \frac{27}{240}
3 \displaystyle \frac{27}{256} \displaystyle \frac{27}{240}
4 \displaystyle \frac{405}{4096} \displaystyle \frac{405}{3840}
5 \displaystyle \frac{729}{8192} \displaystyle \frac{729}{7680}

The probabilities P_k are generated by either the (a,b,0) pf or the recursive relation. The probabilities P_k^T are generated by relation (7) or by the recursive relation (6). The following lists the mean, variance and pgf of the zero-truncated negative binomial example.

……….\displaystyle E[N_T]=\frac{16}{15} \ 6=\frac{32}{5}=6.4

……….\displaystyle Var[N_T]=\frac{576}{25}=23.04

……….\displaystyle P^T(z)=\frac{16}{15} \biggl( [1-3 \ (z-1)]^{-2}-\frac{1}{16} \biggr)
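
As a quick check on Example 1, here is a short Python sketch (an illustration, not part of the original discussion) that generates the (a,b,0) probabilities with recursion (1), truncates them via (7), and recovers the mean and variance in line with (9) and (11); the cutoff of 400 terms is arbitrary but leaves a negligible tail.

```python
from fractions import Fraction

# Example 1: (a,b,0) negative binomial with r = 2 and theta = 3,
# so a = b = 3/4 and P_0 = 1/16.
a, b, P0 = Fraction(3, 4), Fraction(3, 4), Fraction(1, 16)

# (a,b,0) probabilities from recursion (1)
P = [P0]
for k in range(1, 401):
    P.append((a + b / k) * P[-1])

# zero-truncated probabilities via relation (7)
PT = {k: P[k] / (1 - P0) for k in range(1, 401)}
print(PT[1], PT[2], PT[5])   # 1/10, 9/80, 243/2560 (the reduced forms of 3/30, 27/240, 729/7680)

mean = sum(k * float(p) for k, p in PT.items())
second = sum(k * k * float(p) for k, p in PT.items())
print(mean, second - mean ** 2)   # approximately 6.4 and 23.04
```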

Zero-Modified Distributions

The goal of this section is to derive a zero-modified distribution from a given zero-truncated distribution, whether that distribution is derived from an (a,b,0) distribution as discussed in the preceding section or is a zero-truncated distribution that does not originate from the (a,b,0) class.

We now take a zero-truncated distribution as a given and derive the probabilities P_k^M and other distributional quantities. As in the case of zero-truncated distribution, one way to generate probabilities is through the recursion process:

(15)……..\displaystyle \frac{P_k^M}{P_{k-1}^M}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=2,3,4 \cdots

The probability P_0^M>0 is an assumed value. The probability P_1^M is the value that ensures that all the probabilities sum to 1. As indicated the recursion begins at k=2. Another way to calculate probabilities is through the zero-truncated distribution:

(16)……..\displaystyle P_k^M=(1-P_0^M) \ P_k^T \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,4 \cdots

Of course, if the zero-truncated distribution is based on a distribution from the (a,b,0) class, we can express the zero-modified probabilities as, after plugging (7) into (16):

(17)……..\displaystyle P_k^M=\frac{1-P_0^M}{1-P_0} \ P_k \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,4 \cdots

Further distributional quantities can now be derived:

(18)……..\displaystyle P^M(z)=P_0^M \cdot 1+(1-P_0^M) \ P^T(z)

(19)……..\displaystyle E[N_M]=(1-P_0^M) \ E[N_T]

(20)……..\displaystyle Var[N_M]=(1-P_0^M) \ Var[N_T]+P_0^M \ (1-P_0^M) \ E[N_T]^2

The result (18) is the pgf of the zero-modified distribution based on the pgf of the given zero-truncated distribution. In words, (19) says that the mean of the modified distribution is 1-P_0^M times the mean of the given zero-truncated distribution. In words, (20) says that the variance of the zero-modified distribution is 1-P_0^M times the variance of the given zero-truncated distribution plus P_0^M (1-P_0^M) times the square of the mean of the truncated distribution.

If the given zero-truncated distribution is actually obtained from a member of the (a,b,0) class, then the above three results can be expressed in terms of (a,b,0) information, after plugging the corresponding information for N_T into (18), (19) and (20).

(21)……..\displaystyle P^M(z)=\biggl(1-\frac{1-P_0^M}{1-P_0} \biggr) \cdot 1+\frac{1-P_0^M}{1-P_0} \ P(z)

(22)……..\displaystyle E[N_M]=\frac{1-P_0^M}{1-P_0} \ E[N]

(23)……..\displaystyle Var[N_M]=\frac{1-P_0^M}{1-P_0} \ Var[N]+ \biggl(1-\frac{1-P_0^M}{1-P_0} \biggr) \ \frac{1-P_0^M}{1-P_0} \ E[N]^2

Example 2
Consider the zero-truncated negative binomial distribution from Example 1. We now generate information on the corresponding zero-modified negative binomial distribution with the assumed value of P_0^M=0.2. The following table gives several probabilities.

Table – Zero-Modified Negative Binomial

\bold k (a,b,0) \bold P_{\bold k} Zero-Truncated \bold  P_{\bold k}^{\bold T} Zero-Modified \bold  P_{\bold k}^{\bold M}
0 \displaystyle \frac{1}{16} 0.2
1 \displaystyle \frac{3}{32} \displaystyle \frac{3}{30} \displaystyle \frac{2.4}{30}
2 \displaystyle \frac{27}{256} \displaystyle \frac{27}{240} \displaystyle \frac{21.6}{240}
3 \displaystyle \frac{27}{256} \displaystyle \frac{27}{240} \displaystyle \frac{21.6}{240}
4 \displaystyle \frac{405}{4096} \displaystyle \frac{405}{3840} \displaystyle \frac{324}{3840}
5 \displaystyle \frac{729}{8192} \displaystyle \frac{729}{7680} \displaystyle \frac{583.2}{7680}

The zero-modified probabilities P_k^M are calculated according to (16). With the assumed value P_0^M=0.2, 1-P_0^M=0.8. We simply multiply each zero-truncated probability by 0.8. Using (18), (19) and (20), we obtain the mean, variance and pgf of the zero-modified negative binomial example.

……….\displaystyle E[N_M]=0.8 \ E[N_T]=0.8 (6.4)=5.12

……….\displaystyle Var[N_M]=24.9856

……….\displaystyle P^M(z)=0.2+0.8 \biggl[ \frac{16}{15} \biggl( [1-3 \ (z-1)]^{-2}-\frac{1}{16} \biggr) \biggr]
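
Continuing the sketch from Example 1, the following Python snippet (illustrative only) applies relation (16) with the assumed value P_0^M=0.2 and reproduces the zero-modified mean and variance.

```python
from fractions import Fraction

# zero-truncated probabilities from Example 1 (same recursion as before)
a, b, P0 = Fraction(3, 4), Fraction(3, 4), Fraction(1, 16)
P = [P0]
for k in range(1, 401):
    P.append((a + b / k) * P[-1])
PT = {k: P[k] / (1 - P0) for k in range(1, 401)}

# zero-modified distribution via (16) with assumed P_0^M = 0.2
P0M = Fraction(1, 5)
PM = {0: P0M, **{k: (1 - P0M) * p for k, p in PT.items()}}
print(float(PM[1]), float(PM[4]))   # about 0.08 (= 2.4/30) and 0.084375 (= 324/3840)

mean = sum(k * float(p) for k, p in PM.items())
second = sum(k * k * float(p) for k, p in PM.items())
print(mean, second - mean ** 2)     # approximately 5.12 and 24.9856
```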

Additional Zero-Truncated Distributions

As indicated earlier, the (a,b,1) class contains distributions other than the ones derived from the three (a,b,0) distributions. These distributions also have zero-truncated versions as well as zero-modified versions. We discuss the truncated versions here. They are: the extended truncated negative binomial (ETNB) distribution, the logarithmic distribution and the Sibuya distribution. The extended truncated negative binomial (ETNB) distribution results from relaxing the r parameter of the negative binomial distribution. The logarithmic distribution and the Sibuya distribution are derived from the ETNB distribution. The modified versions of these three distributions can then be obtained by going through the process outlined in the preceding section.

ETNB
Recall that the (a,b,0) negative binomial distribution has two parameters r and \theta. The following gives the parameters a and b used in the (a,b,0) recursion and the first two probabilities.

……….\displaystyle a=\frac{\theta}{1+\theta} \ \ \ \ \ \ \ \ \ \ \ \ \ \ b=(r-1) \ \frac{\theta}{1+\theta} \ \ \ \ \ \ r>0, \ \theta >0

……….\displaystyle P_0=\biggl( \frac{1}{1+\theta} \biggr)^r \ \ \ \ \ \ P_1=r \ \biggl( \frac{1}{1+\theta} \biggr)^r \biggl( \frac{\theta}{1+\theta} \biggr)=\frac{r \ \theta}{(1+\theta)^{r+1}}

The extended truncated negative binomial distribution results from extending the r parameter so that -1<r<0 is allowed in addition to the usual r>0. With the extension of r, the ETNB probabilities are generated according to the truncated probabilities of (7). In effect, we pretend that we are starting from a base (a,b,0) negative binomial distribution even though the r parameter could be such that -1<r<0. Thus the two parameters of the zero-truncated ETNB distribution are given by the following:

(24)……..\displaystyle a=\frac{\theta}{1+\theta} \ \ \ \ \ \ b=(r-1) \ \frac{\theta}{1+\theta} \ \ \ \ \ \ -1<r<\infty, \ r \ne 0, \ \theta >0

What do we do with the ETNB parameters indicated in (24)? Using these a and b, we can generate the “negative binomial” values P_k according to the recursive relation (1) with P_0=[1/(1+\theta)]^r. However, with r being negative, these values of P_k are not probabilities (in fact they are negative). The value of 1/(1-P_0) is also negative when r is negative, so by (7) the zero-truncated probabilities P_k^T are positive. Thus the “negative binomial” distribution using a negative r is not really a distribution. It is just a device to define the ETNB distribution.

Using the idea in the preceding paragraph, we can also come up with a direct formula for the ETNB probabilities P_k^T. The following gives the first three probabilities.

……….\displaystyle P_1^T=\frac{1}{1-P_0} \ P_1=\frac{1}{1-P_0} \ \frac{r \ \theta}{(1+\theta)^{r+1}}=\frac{r \theta}{(1+\theta)^{r+1}-(1+\theta) }

……….\displaystyle \begin{aligned} P_2^T&=\frac{1}{1-P_0} \ P_2\\&=\frac{1}{1-P_0} \ \frac{r (r+1)}{2} \frac{1}{(1+\theta)^r} \ \biggl(\frac{\theta}{1+\theta} \biggr)^2 \\&=\frac{r (r+1)}{2} \ \frac{1}{(1+\theta)^r-1} \ \biggl(\frac{\theta}{1+\theta} \biggr)^2  \end{aligned}

……….\displaystyle \begin{aligned} P_3^T&=\frac{1}{1-P_0} \ P_3\\&=\frac{1}{1-P_0} \ \frac{r (r+1) (r+2)}{3!} \frac{1}{(1+\theta)^r} \ \biggl(\frac{\theta}{1+\theta} \biggr)^3 \\&=\frac{r (r+1) (r+2)}{3!} \ \frac{1}{(1+\theta)^r-1} \ \biggl(\frac{\theta}{1+\theta} \biggr)^3  \end{aligned}

Based on the pattern of the above three probabilities, the ETNB probability P_k^T, k=1,2,3,\cdots, is:

(25)……..\displaystyle P_k^T=\frac{1}{1-P_0} \ P_k=\frac{r (r+1) \cdots (r+k-1)}{k!} \ \frac{1}{(1+\theta)^r-1} \ \biggl(\frac{\theta}{1+\theta} \biggr)^k

All other distributional quantities such as the pgf, the mean and the higher moments can be derived based on the ETNB pf P_k^T. For example, the mean, variance and pgf are:

(26)……..\displaystyle E[N_T]=\frac{1}{1-P_0} \ E[N]=\frac{1}{1-P_0} \ r \ \theta=\frac{r \ \theta}{1-(1+\theta)^{-r}}

(27)……..\displaystyle \begin{aligned} Var[N_T]&=\frac{1}{1-P_0} \ Var[N]+\biggl(1-\frac{1}{1-P_0} \biggr) \ \frac{1}{1-P_0} \ E[N]^2 \\&=r \ \theta \ \frac{(1+\theta)-(1+\theta+ r \ \theta) \ (1+\theta)^{- r}}{[1-(1+\theta)^{-r}]^2}  \end{aligned}

(28)……..\displaystyle \begin{aligned} P^T(z)&=\frac{1}{1-P_0} \ (P(z)-P_0) \\&=\frac{1}{1-(1+\theta)^{-r}} \ \biggl[ \biggl(1-\theta (z-1) \biggr)^{-r} -(1+\theta)^{-r}\biggr]  \end{aligned}

Logarithmic Distribution
This is a truncated distribution that is derived from ETNB by letting r \rightarrow 0. The following shows the information that is needed for the recursive generation of probabilities.

(29)……..\displaystyle a=\frac{\theta}{1+\theta} \ \ \ \ \ \ \ \ b=-\frac{\theta}{1+\theta}

(30)……..\displaystyle P_1^T=\frac{\theta}{(1+\theta) \ \ln(1+\theta)}

The parameter b is obtained by letting r \rightarrow 0 in the b for ETNB. The logarithmic P_1^T is obtained by taking the limit of the ETNB P_1^T as r \rightarrow 0 (using L’Hopital’s rule). The rest of the pf P_k^T for k=2,3,\cdots can be generated from the recursive relation (6). Unlike a zero-truncated distribution that is derived from an (a,b,0) distribution, the distributional quantities of the logarithmic distribution cannot be obtained from an (a,b,0) distribution. Thus in order to gain more information about the logarithmic distribution, its pf must be used. The mean and variance of the logarithmic distribution are:

(31)……..\displaystyle E[N_T]=\frac{\theta}{\ln(1+\theta) }

(32)……..\displaystyle Var[N_T]=\frac{\theta \biggl[1+\theta-\theta / \ln(1+\theta)  \biggr]}{\ln(1+\theta)}

Sibuya Distribution
This is a truncated distribution that is derived from ETNB by letting \theta \rightarrow \infty and making -1<r<0. The following shows the information that is needed for the recursive generation of probabilities.

(33)……..\displaystyle a=1 \ \ \ \ \ \ \ \ b=r-1

(34)……..\displaystyle P_1^T=-r

All three items (a, b and P_1^T) are obtained by letting \theta \rightarrow \infty in the corresponding items for ETNB. To see that \displaystyle P_1^T=-r, rewrite the ETNB P_1^T as follows:

……….\displaystyle P_1^T=\frac{r \theta}{(1+\theta)^{r+1}-(1+\theta)}=\frac{r \theta}{(1+\theta) \ [(1+\theta)^r-1]}=r \ \frac{\theta}{1+\theta} \ \frac{1}{(1+\theta)^r-1}

As \theta \rightarrow \infty, the ratio \theta / (1+\theta) goes to 1. As \theta \rightarrow \infty, (1+\theta)^r goes to 0 because r is negative. Thus the above P_1^T goes to -r. With the a and b in (33) and the P_1^T in (34), the rest of the Sibuya pf can be generated by the recursive relation in (6). Note that the mean does not exist for the Sibuya distribution. The following is the pgf of the Sibuya distribution.

(35)……..\displaystyle P^T(z)=1-(1-z)^{-r}

Once these three zero-truncated distributions are obtained, we can derive the zero-modified versions of these distributions using the process described earlier.
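
To see the recursion (6) at work for these two distributions, here is a small Python sketch (an illustration that is not part of the original discussion); the choices \theta=3 for the logarithmic and r=-0.5 for the Sibuya are arbitrary.

```python
import math

def truncated_probs(a, b, p1, n_terms):
    """Generate P_1^T, ..., P_n^T from the (a,b,1) recursion (6), starting at P_1^T = p1."""
    probs = {1: p1}
    for k in range(2, n_terms + 1):
        probs[k] = (a + b / k) * probs[k - 1]
    return probs

# logarithmic distribution with theta = 3 (a = 3/4, b = -3/4)
theta = 3.0
log_probs = truncated_probs(theta / (1 + theta), -theta / (1 + theta),
                            theta / ((1 + theta) * math.log(1 + theta)), 400)
print(sum(log_probs.values()))                 # very close to 1
print(sum(k * p for k, p in log_probs.items()),
      theta / math.log(1 + theta))             # both about 2.164, matching (31)

# Sibuya distribution with r = -0.5 (a = 1, b = r - 1, P_1^T = -r)
r = -0.5
sib_probs = truncated_probs(1.0, r - 1.0, -r, 400)
print(sum(sib_probs.values()))   # only about 0.97 after 400 terms -- the heavy tail is why the mean does not exist
```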

Example 3
We demonstrate how ETNB probabilities are calculated. Let r=-\frac{1}{2} and \theta=3. Then the parameters for the “artificial” negative binomial distribution are:

……….\displaystyle a=\frac{3}{4} \ \ \ \ \ \ \ \ b=\biggl(-\frac{1}{2}-1 \biggr) \frac{3}{4}=-\frac{9}{8}

The P_0 for the artificial negative binomial distribution is P_0=(1/4)^{-0.5}=2, making 1/(1-P_0)=-1. We generate the artificial negative binomial values recursively using the a and b. Then we multiply by 1/(1-P_0)=-1 to get the zero-truncated probabilities according to (7).

Table – Zero-Truncated ETNB

\bold k Artificial \bold P_{\bold k} Zero-Truncated \bold  P_\bold k^{\bold T}
0 \displaystyle 2
1 \displaystyle -\frac{3}{4} \displaystyle \frac{3}{4}
2 \displaystyle -\frac{9}{64} \displaystyle \frac{9}{64}
3 \displaystyle -\frac{27}{512} \displaystyle \frac{27}{512}
4 \displaystyle -\frac{405}{16384} \displaystyle \frac{405}{16384}
5 \displaystyle -\frac{1701}{131072} \displaystyle \frac{1701}{131072}

The entries in the column labeled artificial P_k are obviously not probabilities. They are generated recursively using a=3/4 and b=-9/8. Multiplying the column labeled artificial P_k by 1/(1-P_0)=-1 gives the ETNB probabilities, which can also be computed directly using (25).

Using (26) and (27), the ETNB mean and variance are E[N_T]=3/2 and Var[N_T]=3/2. With an assumed value of P_0^M=0.1, we generate the first 5 zero-modified ETNB probabilities in the following table.

Table – Zero-Modified ETNB

\bold k Artificial \bold P_{\bold k} Zero-Truncated \bold  P_\bold k^{\bold T} Zero-Modified \bold  P_\bold k^{\bold M}
0 \displaystyle 2 0.1
1 \displaystyle -\frac{3}{4} \displaystyle \frac{3}{4} \displaystyle \frac{2.7}{4}
2 \displaystyle -\frac{9}{64} \displaystyle \frac{9}{64} \displaystyle \frac{8.1}{64}
3 \displaystyle -\frac{27}{512} \displaystyle \frac{27}{512} \displaystyle \frac{24.3}{512}
4 \displaystyle -\frac{405}{16384} \displaystyle \frac{405}{16384} \displaystyle \frac{364.5}{16384}
5 \displaystyle -\frac{1701}{131072} \displaystyle \frac{1701}{131072} \displaystyle \frac{1530.9}{131072}

With the assumed value of P_0^M=0.1, the zero-modified probabilities are obtained by multiplying the zero-truncated probabilities by 1-P_0^M=0.9. Using (19) and (20), the zero-modified ETNB mean and variance are: E[N_M]=1.35 and Var[N_M]=1.5525.
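
The Example 3 calculations can be reproduced with the following Python sketch (illustrative only): the artificial negative binomial values come from recursion (1), the ETNB probabilities from (7), and the zero-modified version uses the assumed P_0^M=0.1.

```python
from fractions import Fraction

# Example 3: ETNB with r = -1/2 and theta = 3, so a = 3/4, b = -9/8 and the
# artificial P_0 = (1/4)^(-1/2) = 2.
a, b, P0 = Fraction(3, 4), Fraction(-9, 8), Fraction(2)

P = [P0]                              # artificial negative binomial values
for k in range(1, 401):
    P.append((a + b / k) * P[-1])
PT = {k: P[k] / (1 - P0) for k in range(1, 401)}   # multiply by 1/(1-P_0) = -1
print(PT[1], PT[2], PT[5])            # 3/4, 9/64, 1701/131072

mean = sum(k * float(p) for k, p in PT.items())
second = sum(k * k * float(p) for k, p in PT.items())
print(mean, second - mean ** 2)       # both approximately 1.5

# zero-modified ETNB with assumed P_0^M = 0.1
P0M = Fraction(1, 10)
PM = {0: P0M, **{k: (1 - P0M) * p for k, p in PT.items()}}
meanM = sum(k * float(p) for k, p in PM.items())
secondM = sum(k * k * float(p) for k, p in PM.items())
print(meanM, secondM - meanM ** 2)    # approximately 1.35 and 1.5525
```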

Practice Problems

The discussion in this post contains a great deal of technical detail. A concise summary of the (a,b,0) class and the (a,b,1) class is found here.

Practice problems on (a,b,0) class

Practice problems on (a,b,1) class


\copyright 2019 – Dan Ma

The (a,b,0) class

This post introduces the class of discrete distributions called the (a,b,0) class.

A counting distribution is a discrete distribution that takes on values in the non-negative integers 0,1,2, … Examples include the Poisson distribution, the binomial distribution and the negative binomial distribution (see here for a discussion). These distributions are potential models for the number of occurrences of some random events of interest, e.g. the number of losses in actuarial applications. The discussion below shows that the notion of the (a,b,0) class is another way to describe the big three counting distributions of Poisson, binomial and negative binomial. The notion of the (a,b,1) class is a generalization of the (a,b,0) class and is defined in a subsequent post.

The (a,b,0) Class

The (a,b,0) class is at heart a recursive algorithm to generate probabilities. Let’s fix some notations. Let N be a counting random variable. For each k=0,1,2,3,\cdots, let P_k=P(N=k). The counting random variable N is said to be a member of the (a,b,0) class of distributions if for some constants a and b the following recursive relation holds

    \displaystyle (1) \ \ \ \ \ \frac{P_k}{P_{k-1}}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots

Note that the recursive relation (1) generates all the probabilities P_k for all integers k starting at 1. The relation (1) does not account for P_0. Does that mean that the initial probability P_0 can be any arbitrary probability value? Note that the recursive relation (1) means that each P_k is ultimately expressed in terms of P_0.

    P_0=P_0

    \displaystyle P_1=(a+b) P_0

    \displaystyle P_2=\biggl(a+\frac{b}{2} \biggr) (a+b) P_0

    \cdots

    \displaystyle P_k=\biggl(a+\frac{b}{k} \biggr) \biggl(a+\frac{b}{k-1} \biggr) \cdots \biggl(a+\frac{b}{2} \biggr) (a+b) P_0

    \cdots

When a and b are fixed, the value of P_0 is also fixed since the probabilities must sum to 1. In fact P_0 is determined as follows.

    \displaystyle P_0 \biggl[ 1+(a+b)+\biggl(a+\frac{b}{2} \biggr) (a+b)+\cdots+\biggl(a+\frac{b}{k} \biggr) \cdots \biggl(a+\frac{b}{2} \biggr) (a+b)+\cdots \biggr]=1

    \displaystyle (2) \ \ \ \ \ P_0 =\biggl[ 1+(a+b)+\biggl(a+\frac{b}{2} \biggr) (a+b)+\cdots+\biggl(a+\frac{b}{k} \biggr) \cdots \biggl(a+\frac{b}{2} \biggr) (a+b)+\cdots \biggr]^{-1}

Thus a member of the (a,b,0) class has two parameters, namely a and b, which completely determine the distribution.
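
To make (2) concrete, here is a brief Python sketch (not part of the original post) that accumulates the products in the recursion and inverts the truncated sum; the parameter values are illustrative.

```python
def p0_from_ab(a, b, n_terms=1000):
    """Approximate P_0 from (2) by truncating the infinite sum at n_terms."""
    total, term = 1.0, 1.0
    for k in range(1, n_terms + 1):
        term *= a + b / k         # running product (a+b)(a+b/2)...(a+b/k)
        total += term
    return 1.0 / total

print(p0_from_ab(0.75, 0.75))     # 0.0625 = (1/4)^2, a negative binomial P_0 (see Table 1 below)
print(p0_from_ab(0.0, 2.0))       # about 0.1353 = e^(-2), the Poisson P_0 with lambda = 2
```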

Example 1
As an example, let a=0 and b=\lambda where \lambda>0 is a fixed positive constant. Using (1), we see that

    \displaystyle P_1=\lambda \ P_0
    \displaystyle P_2=\frac{1}{2!} \ \lambda^2 P_0
    \displaystyle P_3=\frac{1}{3!} \ \lambda^3 P_0
    ……..\cdots
    ……..\cdots
    ……..\cdots
    \displaystyle P_n=\frac{1}{n!} \lambda^n P_0
    ……..\cdots
    ……..\cdots
    ……..\cdots

According to (2), P_0=e^{-\lambda}

    \displaystyle \begin{aligned} P_0&=\biggl(1+ \lambda+\frac{1}{2!} \ \lambda^2 +\cdots+\frac{1}{n!} \lambda^n +\cdots \biggr)^{-1} \\&=(e^{\lambda})^{-1}\\&=e^{-\lambda}  \end{aligned}

With P_0=e^{-\lambda}, the probabilities P_n are from a Poisson distribution. Thus, when the parameter a is 0, and the parameter b is a positive constant, the corresponding distribution from the (a,b,0) class is a Poisson distribution.

Only Three Members in the (a,b,0) Class

In essence, the (a,b,0) class has only three members, namely the big 3 discrete distributions – the Poisson distribution, the binomial distribution and the negative binomial distribution, with each distribution represented by a different sign of the parameter a. Using the recursive relation (1), it can be shown that each of the big three distributions belongs to the (a,b,0) class. The following table shows the parameters a and b in the three cases.

Table 1

Distribution Usual Parameters Parameter a Parameter b
Poisson \lambda 0 \lambda
Binomial n and p \displaystyle -\frac{p}{1-p} \displaystyle (n+1) \ \frac{p}{1-p}
Negative binomial r and p 1-p (r-1) \ (1-p)
Negative binomial r and \theta \displaystyle \frac{\theta}{1+\theta} \displaystyle (r-1) \ \frac{\theta}{1+\theta}
Geometric p 1-p 0
Geometric \theta \displaystyle \frac{\theta}{1+\theta} 0

Table 1 shows how to parametrize the three distributions. For example, for the binomial distribution with parameters n (the number of trials) and p (the probability of success), the (a,b,0) parameters are a=-p/(1-p) and b=-(n+1) a. The two rows for negative binomial reflect two different parametrizations. Of course, the geometric distribution is simply a negative binomial distribution when the parameter r=1. Essentially Table 1 consists of three different distributions.

Table 1 works in the opposite direction as well. Any set of (a,b,0) parameters a and b must fit into one of the distributions listed in Table 1. In other words, the recursive relation (1) produces no new counting distribution. Any counting distribution satisfying (1) must be one of the big 3 counting distributions listed in Table 1.

Note that under the recursive relation (1), not all combinations of a and b produce a probability distribution. For example, when both a and b are negative constants, the resulting values P_k alternate in sign and are negative for odd k. More generally, when a+b<0, the value P_1=(a+b) P_0 is already negative. When a+b=0, P_0=1, i.e. the distribution is a point mass at 0. So we restrict attention to the case a+b>0.

To echo the point made previously, it is the case that when a+b>0 and when the recursive relation (1) produces a viable probability distribution, the resulting distribution must be one of the three distributions listed in Table 1. This point is not entirely obvious. Any interested reader can see chapter 6 of [1].

Table 1 indicates that the sign of the parameter a determines the form of the (a,b,0) distribution. If a=0, it is a Poisson distribution. If a is negative, it is a binomial distribution. If a is positive, it is a negative binomial distribution.

Examples

We now present a few more examples illustrating the working of the (a,b,0) recursive relation.

Example 2
This example illustrates that knowing three consecutive probabilities of a member of the (a,b,0) class determines the entire distribution. For example, suppose we know that

    P_1=0.0567
    P_2=0.07938
    P_3=0.09261

These three consecutive probabilities produce the following two linear equations of a and b.

    \displaystyle \frac{P_2}{P_1}=\frac{0.07938}{0.0567}=a+\frac{b}{2}
    \displaystyle \frac{P_3}{P_2}=\frac{0.09261}{0.07938}=a+\frac{b}{3}

Solving these two linear equations produces a=0.7 and b=1.4. Since a is positive, this is a negative binomial distribution. The corresponding negative binomial parameters are r=3 and p=0.3. With this information, the (a,b,0) distribution in question is completely determined. The following are the several distributional quantities.

    P_0=0.3^3=0.027
    P_4=(0.7+\frac{1.4}{4}) \ P_3=0.0972405
    \displaystyle E(N)=r \frac{1-p}{p}=3 \frac{0.7}{0.3}=7
    \displaystyle Var(N)=r \frac{1-p}{p^2}=3 \frac{0.7}{0.3^2}=\frac{7}{0.3}=23.3333
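
Example 2 can be verified with a short Python sketch (illustrative only):

```python
# Example 2: three consecutive probabilities determine a and b.
P1, P2, P3 = 0.0567, 0.07938, 0.09261

r2, r3 = P2 / P1, P3 / P2        # a + b/2 and a + b/3
b = (r2 - r3) * 6                # subtracting the equations: b/2 - b/3 = b/6
a = r2 - b / 2
print(a, b)                      # approximately 0.7 and 1.4

# a > 0, so the member is negative binomial: a = 1-p and b = (r-1)(1-p)
p = 1 - a
r = b / (1 - p) + 1
print(r, p)                                    # approximately 3 and 0.3
print(r * (1 - p) / p, r * (1 - p) / p ** 2)   # mean 7 and variance 23.33...
```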

Example 3
Actually any three given probabilities determine the entire (a,b,0) distribution. They do not have to be consecutive. Suppose we are given the following probabilities.

    P_1=0.33554432
    P_2=0.29360128
    P_4=0.0458752

Applying the recursive relation (1) produces the following equations.

    \displaystyle \frac{P_2}{P_1}=\frac{0.29360128}{0.33554432}=a+\frac{b}{2}
    \displaystyle \frac{P_3}{P_2}=\frac{P_3}{0.29360128}=a+\frac{b}{3}
    \displaystyle \frac{P_4}{P_3}=\frac{0.0458752}{P_3}=a+\frac{b}{4}

The above 3 equations lead to the following two equations.

    \displaystyle \frac{P_2}{P_1}=\frac{0.29360128}{0.33554432}=a+\frac{b}{2}
    \displaystyle P_4=0.0458752=\biggl(a+\frac{b}{4} \biggr) \biggl(a+\frac{b}{3} \biggr) \ 0.29360128

Of the above two equations, one is a linear equation and one is a quadratic equation. Solving these two equations produces a=-0.25 and b=2.25. Since a is negative, this is a binomial distribution. Using the translation in Table 1 gives the following equations.

    \displaystyle -\frac{p}{1-p}=-0.25 \ \ \ \ \ \ \ \ \ \ \ (n+1) \ \frac{p}{1-p}=2.25

Solving these equations gives n=8 and p=0.2. The (a,b,0) distribution in question is then completely determined.

Factorial Moments

Another distributional quantity that can give insight into the (a,b,0) class is the factorial moment. For any random variable X, its nth factorial moment is

    (3) \ \ \ \ \ \mu_{(n)}=E[X (X-1) (X-2) \cdots (X-(n-1))]

For example, the first three factorial moments are:

    \mu_{(1)}=E[X]

    \mu_{(2)}=E[X (X-1)]

    \mu_{(3)}=E[X (X-1) (X-2)]

For any member of the (a,b,0) class with parameters a and b, the first factorial moment is:

    \displaystyle (4) \ \ \ \ \ \mu_{(1)}=\frac{a+b}{1-a}

The higher (a,b,0) factorial moments can be obtained recursively as follows:

    \displaystyle (5) \ \ \ \ \ \mu_{(j)}=\frac{a j +b}{1-a} \ \mu_{(j-1)} \ \ \ \ \ \ \ j \ge 2

The recursive formula (5) is a good way to determine the raw moments of a member of the (a,b,0) class. For example, the following calculates the second raw moment and the variance of the random variable N, assumed to be a member of the (a,b,0) class with parameters a and b.

    \displaystyle \mu_{(1)}=E[N]=\frac{a+b}{1-a}

    \displaystyle \mu_{(2)}=\frac{2 a +b}{1-a} \ \frac{a+b}{1-a}=\frac{(2a+b) (a+b)}{(1-a)^2}=E[N (N-1)]=E[N^2]-E[N]

    \displaystyle E[N^2]=\frac{(2a+b) (a+b)}{(1-a)^2}+\frac{a+b}{1-a}=\frac{(a+b) (a+b+1)}{(1-a)^2}

    \displaystyle Var(N)=E[N^2]-E[N]^2=\frac{a+b}{(1-a)^2}

One interesting characteristic of the (a,b,0) class is that knowing limited distributional information determines the distribution. Example 2 and Example 3 show that knowing three point masses completely determines the (a,b,0) distribution. The above derivation shows that knowing the mean and the variance also completely determines the (a,b,0) distribution.
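
The factorial-moment recursion translates into a few lines of Python (a sketch; the parameter values are the ones from Example 2 and an illustrative Poisson case):

```python
def abzero_mean_var(a, b):
    """Mean and variance of an (a,b,0) member via the factorial moments (4) and (5)."""
    mu1 = (a + b) / (1 - a)              # first factorial moment = mean
    mu2 = (2 * a + b) / (1 - a) * mu1    # second factorial moment E[N(N-1)]
    second_raw = mu2 + mu1               # E[N^2]
    return mu1, second_raw - mu1 ** 2

print(abzero_mean_var(0.7, 1.4))   # (7.0, 23.33...), the negative binomial of Example 2
print(abzero_mean_var(0.0, 2.0))   # (2.0, 2.0), a Poisson with lambda = 2
```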

Fitting (a,b,0) Distributions

If the (a,b,0) recursive formula in (1) generates no new distributions, why study the (a,b,0) class and not just focus on the Poisson, binomial and negative binomial distributions individually? One reason for studying the recursive (a,b,0) formula is that it gives a graphical way to choose an appropriate member of the (a,b,0) class. To see this, rewrite (1) as follows:

    \displaystyle (6) \ \ \ \ \ k \ \frac{P_k}{P_{k-1}}=a k+ b \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots

Note that the quantity on the right side of (6) is a linear function of the integers k. If we plot the left hand side quantity of (6) with k on the x-axis, the plot should be a linear one with the slope being the parameter a and the y-intercept being the parameter b (of course assuming it is an (a,b,0) distribution).

The relation (6) gives a way to quickly determine whether a given sample is taken from a member of the (a,b,0) class. To do this, calculate the ratio of the observed frequencies in two consecutive categories and multiply by k. In other words, compute ratios such as the following for values of k:

    \displaystyle (7) \ \ \ \ \ k \ \frac{\hat{P}_k}{\hat{P}_{k-1}}=k \ \frac{n_k}{n_{k-1}} \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots

where n_k is the observed frequency for the category k. The ratio of n_k to n_{k-1} multiplied by k is a stand-in for the left hand side of (6). Then plot these values against k. A linear trend that is observed in the graph is evidence that the data in the sample is taken from an (a,b,0) distribution.

The slope of the plotted line gives an indication of which (a,b,0) member to use. If the plot is approximately horizontal, then the Poisson model is appropriate. If the plot is a line with negative slope, then the binomial model is more appropriate. If the plot is approximately a line with positive slope, use the negative binomial model. For this approach to work properly, a large observed data set is preferred.
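
The diagnostic in (6) and (7) is easy to sketch in Python (assuming NumPy and matplotlib); the observed frequencies below are made up for illustration, and a least-squares fit is just one convenient way to estimate the slope and intercept.

```python
import numpy as np
import matplotlib.pyplot as plt

counts = {0: 103, 1: 205, 2: 210, 3: 145, 4: 74, 5: 31}    # hypothetical frequencies n_k

ks = sorted(counts)[1:]                                     # k = 1, 2, ..., 5
ratios = [k * counts[k] / counts[k - 1] for k in ks]

slope, intercept = np.polyfit(ks, ratios, 1)    # slope estimates a, intercept estimates b
print(slope, intercept)   # roughly 0 and 2: a near-horizontal line, pointing to a Poisson model

fitted = [slope * k + intercept for k in ks]
plt.scatter(ks, ratios)
plt.plot(ks, fitted)
plt.xlabel("k")
plt.ylabel("k * n_k / n_(k-1)")
plt.show()
```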

The (a,b,1) Class

It is possible that the (a,b,0) distributions do not adequately describe a random counting phenomenon being observed. For example, the sample data may indicate that the probability at zero may be larger than is indicated by the distributions in the (a,b,0) class. One alternative is to assign a larger value for P_0 and recursively generate the subsequent probabilities P_k for k=2,3,\cdots. The class of the distributions defined by this recursive scheme is called the (a,b,1) class, which is discussed in the next post.

Practice Problems

Practice problems on (a,b,0) class

Practice problems on (a,b,1) class

Reference

  1. Panjer H. H., Willmot G. E., Insurance Risk Models, Society of Actuaries, Chicago, 1992.


\copyright 2018 – Dan Ma

The big 3 claim frequency models

We now turn the attention to discrete distributions. In particular the focus is on counting distributions. These are the discrete distributions that have positive probabilities only on the non-negative integers 0, 1, 2, 3, … One important application is finding suitable counting distributions for modeling the number of losses or the number of claims to an insurer, or more generally the number of other random events that are of interest in actuarial applications. From a claim perspective, these counting distributions would be models for claim frequency. Combining frequency models with models for claim severity provides a more complete picture of the risk exposure of the insurer than using claim severity alone. This post and several subsequent posts are preparation for the discussion on modeling aggregate losses and claims. Another upcoming topic is the effect of insurance coverage modifications (e.g. deductibles) on claim frequency and claim severity.

This post focuses on the three commonly used counting distributions – Poisson, binomial and negative binomial (the big 3). These three distributions are the basis for defining a large class of other counting distributions.

Probability Generating Function

Let Y be a random variable with positive probabilities only on the non-negative integers, i.e. P(Y=k) is positive only for k=0,1,2,\cdots. The function P(Y=k) is the probability of the occurrence of the event Y=k, i.e. the observed value of the random variable Y is k. It is called the probability function of the random variable Y (also called probability mass function). From the probability function, many other distributional quantities can be derived, e.g. mean, variance and higher moments.

We can also elicit information about Y from its generating function. The generating function (or probability generating function) of Y is defined by:

    \displaystyle P_Y(z)=p_0+p_1 z +p_2 z^2 + \cdots=\sum \limits_{j=0}^\infty p_j z^j

where each p_j=P(Y=j). The generating function P_Y(z) is defined wherever the infinite sum converges. At minimum, P_Y(z) converges for \lvert z \rvert \le 1. For some distributions P_Y(z) converges for all real z, e.g. when Y has a Poisson distribution (see below).

One reason for paying attention to the generating function is that the moments of Y can be generated from P_Y(z). Taking the nth derivative of P_Y(z) and evaluating it at z=1 gives the nth factorial moment of Y.

    \displaystyle E[Y (Y-1) (Y-2) \cdots (Y-(n-1))]=P_Y^{(n)}(1)

It follows that E(Y)=P_{Y}^{(1)}(1). Since E[Y (Y-1)]=P_Y^{(2)}(1), the second moment is E(Y^2)=P_{Y}^{(1)}(1)+P_Y^{(2)}(1). In general, the nth moment E(Y^n) can be expressed in terms of P_Y^{(k)}(1) for k \le n.

Another application of generating function is that P_Y(z) encodes the probability function P(Y=k), which is obtained by taking the derivatives of P_Y(z) and then evaluated at z=0.

    \displaystyle P(Y=n)=\frac{P_{Y}^{(n)}(0)}{n!}

where n=0,1,2,3,\cdots. Another useful property about generating function is that the probability distribution of a random variable is uniquely determined by its generating function. This fundamental property is useful in determining the distribution of an independent sum. The generating function of an independent sum of random variables is simply the product of the individual generating functions. If the product is the generating function of a certain distribution, then the independent sum must be of the same distribution.
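
As a small illustration of these two uses of the generating function, here is a SymPy sketch (assuming SymPy is available; the Poisson pgf used here is introduced later in this post):

```python
import sympy as sp

z, lam = sp.symbols('z lam', positive=True)
pgf = sp.exp(lam * (z - 1))          # Poisson pgf, used here as the running example

# probabilities: P(Y = n) = P^(n)(0) / n!
deriv = pgf
for n in range(4):
    print(n, sp.simplify(deriv.subs(z, 0) / sp.factorial(n)))   # exp(-lam)*lam**n/n!
    deriv = sp.diff(deriv, z)

# factorial moments: P^(n)(1) = E[Y(Y-1)...(Y-n+1)]
mean = sp.diff(pgf, z).subs(z, 1)                  # lam
second_factorial = sp.diff(pgf, z, 2).subs(z, 1)   # lam**2
variance = sp.simplify(second_factorial + mean - mean ** 2)
print(mean, variance)                              # lam and lam
```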

For a more detailed discussion on probability generating function, see this blog post in a companion blog.

Poisson Distribution

We now describe the three counting distributions indicated at the beginning of the post. We start with the Poisson distribution. Consider a random variable X that only takes on the non-negative integers. For each k=0,1,2,\cdots, let p_k=P(X=k).

The random variable X has a Poisson distribution if its probability function is:

    \displaystyle p_k=\frac{e^{-\lambda} \lambda^k}{k!} \ \ \ \ \ \ k=0,1,2,\cdots

for some positive constant \lambda. This constant \lambda is the parameter of the Poisson distribution in question. It is also the mean and variance of the Poisson distribution. The following is the probability generating function of the Poisson distribution.

    \displaystyle P_X(z)=e^{\lambda \ (z-1)}

The Poisson generating function is defined for all real numbers z. The mean and variance and higher moments can be computed using the generating function.

    E(X)=P_X^{(1)}(1)=\lambda

    E[X (X-1)]=P_X^{(2)}(1)=\lambda^2

    E[X^2]=P_X^{(1)}(1)+P_X^{(2)}(1)=\lambda+\lambda^2

    Var(X)=E(X^2)-E(X)^2=\lambda

One interesting characteristic of the Poisson distribution is that its mean is the same as its variance. From a mathematical standpoint, the Poisson distribution arises from the Poisson process (see a more detailed discussion here). Another discussion of the Poisson distribution is found here.

One useful characteristic of the Poisson distribution is that combining independent Poisson distributions results in another Poisson distribution. Suppose that X_1,X_2,\cdots,X_n are independent Poisson random variables with means \lambda_1,\lambda_2,\cdots,\lambda_n, respectively. Then the probability generating function of the sum X=X_1+X_2+\cdots +X_n is simply the product of the individual probability generating functions.

    \displaystyle P_X(z)=\prod \limits_{j=1}^n e^{\lambda_j (z-1)}=e^{\lambda \ (z-1)}

where \lambda=\lambda_1+\lambda_2+\cdots +\lambda_n. The probability generating function of the sum X is the generating function of a Poisson distribution. Thus independent sum of Poisson distributions is a Poisson distribution with parameter being the sum of the individual Poisson parameters.

Another useful property is that of splitting a Poisson distribution. For example, suppose that the number of claims N in a given year follows a Poisson distribution with mean \lambda per year. Also suppose that the claims can be classified into W distinct types such that the probability of a claim being of type i is p_i, i=1,2,\cdots, W and such that p_1+\cdots+p_W=1. If we are interested in studying the number N_i of claims in a year that are of type i, i=1,2,\cdots,W, then N_1,N_2,\cdots,N_W are independent Poisson random variables with means \lambda \ p_1, \lambda \ p_2,\cdots,\lambda \ p_W, respectively. For a mathematical discussion of this Poisson splitting phenomenon, see this blog post in a companion blog.
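
The splitting property can be checked with a quick simulation sketch (illustrative only; it assumes NumPy, and the values \lambda=5 and type probabilities 0.5, 0.3, 0.2 are made up):

```python
import numpy as np

rng = np.random.default_rng(2019)
lam = 5.0                          # overall Poisson claim rate per year (made up)
p = np.array([0.5, 0.3, 0.2])      # hypothetical probabilities of the claim types
years = 100_000

counts = np.zeros((years, len(p)))
for i in range(years):
    n = rng.poisson(lam)                # total number of claims in the year
    counts[i] = rng.multinomial(n, p)   # split the claims by type

# each column should behave like Poisson(lam * p_i), so mean and variance agree
print("lam * p_i       :", lam * p)
print("sample means    :", counts.mean(axis=0))
print("sample variances:", counts.var(axis=0))
```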

Binomial Distribution

Consider a series of independent events each of which results in one of two distinct outcomes (one is called Success and the other Failure) in such a way that the probability of observing a Success in a trial is constant across all trials (these are called Bernoulli trials). For a binomial distribution, we observe n such trials and count the number of successes in these n trials.

More specifically, let p be the probability of observing a Success in a Bernoulli trial. Let X be the number of Successes observed in n independent trials. Then the random variable X is said to have a binomial distribution with parameters n and p.

Note that the random variable X is the independent sum of X_1,X_2,\cdots,X_n where X_i is the number of Success in the ith Bernoulli trial. Thus X_i is 1 with probability p and is 0 with probability 1-p. Its probability generating function would be:

    \displaystyle g(z)=(1-p) z^0 +p z^1=1-p+p z

As a result, the probability generating function for X would be g(z) raised to n.

    \displaystyle P_X(z)=(1-p+p z)^n

The generating function P_X(z) is defined for all real values z. Differentiating P_X(z) twice produces the mean and variance.

    E(X)=n p

    E[X (X-1)]=n (n-1) p^2

    Var(X)=n p (1-p)

By differentiating P_X(z) and evaluating at z=0, we obtain the probability function.

    \displaystyle P(X=k)=\frac{P_X^{(k)}(0)}{k!}=\binom{n}{k} p^k (1-p)^{n-k}

where k=0,1,2,\cdots,n.

By taking the product of probability generating functions, it follows that the independent sum Y=Y_1+Y_2+\cdots+Y_m, where each Y_i has a binomial distribution with parameters n_i and p, has a binomial distribution with parameters n=n_1+\cdots+n_m and p. In other words, as long as the probability of success p is identical in the binomial distributions, the independent sum is always a binomial distribution.

Note that the variance of the binomial distribution is less than the mean. Thus the binomial distribution is a suitable candidate for modeling frequency in situations where the sample variance is smaller than the sample mean.

Negative Binomial Distribution

As mentioned above, the Poisson distribution requires that the mean and the variance are equal. The binomial distribution requires that the variance is smaller than the mean. Thus these two counting distributions are not appropriate in all cases. The negative binomial distribution is an excellent alternative to the Poisson distribution and the binomial distribution, especially in the cases where the observed variance is greater than the observed mean.

The negative binomial naturally arises from the same probability experiment that generates the binomial distribution. Consider a series of independent Bernoulli trials each of which results in one of two distinct outcomes (called success and failure) in such a way that the probability of success p is constant across the trials. Instead of observing the outcomes in a fixed number of trials, we now observe the trials until r number of success have occurred.

As we observe the Bernoulli trials, let Y be the number of failures until the rth success has occurred. The random variable Y has a negative binomial distribution with parameters r and p. The parameter r is necessarily a positive integer and the parameter p is a real number between 0 and 1. The following is the probability function for the random variable Y.

    \displaystyle \begin{aligned} P(Y=k)&=\binom{r+k-1}{k} \ p^r (1-p)^k \\&=\frac{(r+k-1)!}{k! \ (r-1)!} \ p^r (1-p)^k \ \ \ \ \ \ k=0,1,2,\cdots \\& \end{aligned}

In the above probability function, the parameter r must be a positive integer. The binomial coefficient \binom{r+k-1}{k} is computed by its usual definition. The above probability function can be relaxed so that r can be any positive real number. The key to the relaxation is a reformulation of the binomial coefficient.

    \displaystyle  \binom{n}{j}=\left\{ \begin{array}{ll}                     \displaystyle  \frac{n (n-1) (n-2) \cdots (n-(j-1))}{j!} &\ n>j-1, j=1,2,3,\cdots \\           \text{ } & \text{ } \\           \displaystyle  1 &\  j=0           \end{array} \right.

Note that in the above formulation, the n in \binom{n}{j} does not have to be an integer. If n were to be a positive integer, the usual definition \binom{n}{j}=\frac{n!}{j! (n-j)!} would lead to the same calculation. The reformulation is a generalization of the usual binomial coefficient definition.

With the new definition of binomial coefficient, the following is the probability function of the negative binomial distribution in the general case.

    \displaystyle P(Y=k)=\binom{r+k-1}{k} \ p^r (1-p)^k \ \ \ \ \ \ k=0,1,2,\cdots

The following is the same probability function with the binomial coefficient explicitly written out.

    \displaystyle  P(Y=k)=\left\{ \begin{array}{ll}                     \displaystyle  p^r &\ k=0 \\           \text{ } & \text{ } \\           \displaystyle  \frac{(k-1+r) \cdots (1+r) r}{k!} \ p^r \ (1-p)^k &\ k=1,2,\cdots           \end{array} \right.

For either of the above versions, the mean and variance are:

    \displaystyle E(Y)=\frac{r (1-p)}{p}

    \displaystyle Var(Y)=\frac{r (1-p)}{p^2}
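
As an illustration of the relaxed pf, here is a short Python sketch (the parameter values r = 2.5 and p = 0.25 are arbitrary choices, not from the text):

```python
import math

def gen_binom(n, j):
    """Generalized binomial coefficient C(n, j); n need not be an integer."""
    if j == 0:
        return 1.0
    return math.prod(n - i for i in range(j)) / math.factorial(j)

def neg_binom_pf(k, r, p):
    """P(Y = k) for a negative binomial with any real r > 0 and 0 < p < 1."""
    return gen_binom(r + k - 1, k) * p ** r * (1 - p) ** k

r, p = 2.5, 0.25                   # r is deliberately not an integer
probs = [neg_binom_pf(k, r, p) for k in range(400)]
print(sum(probs))                                   # very close to 1
mean = sum(k * q for k, q in enumerate(probs))
second = sum(k * k * q for k, q in enumerate(probs))
print(mean, r * (1 - p) / p)                        # both about 7.5
print(second - mean ** 2, r * (1 - p) / p ** 2)     # both about 30
```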

Another formulation of the negative binomial distribution is that it is a Poisson-gamma mixture. The following is the probability function.

    \displaystyle  P(Y=k)=\left\{ \begin{array}{ll}                     \displaystyle  \biggl(\frac{1}{\beta+1} \biggr)^r &\ k=0 \\           \text{ } & \text{ } \\           \displaystyle  \frac{(k-1+r) \cdots (1+r) r}{k!} \ \biggl(\frac{1}{\beta+1} \biggr)^r \biggl(\frac{\beta}{\beta+1} \biggr)^k &\ k=1,2,\cdots           \end{array} \right.

It is still a 2-parameter discrete distribution. The parameters r and \beta originate from the parameters of the gamma distribution in the Poisson-gamma mixture. The mean and variance are:

    \displaystyle E(Y) = r \ \beta

    \displaystyle Var(Y)=r \ \beta \ (1+\beta)

The negative binomial distribution has been discussed at length in blog posts in several companion blogs. For the natural interpretation of negative binomial distribution based on counting the number of failures until the rth success, see this blog post. This is an excellent introduction.

For the general version of the negative binomial distribution where the parameter r can be any positive real number, see this blog post.

For the version of negative binomial distribution from a Poisson-Gamma mixture point of view, see this blog post.

This blog post has additional facts about the negative binomial distribution. This blog post summarizes the various versions as well as focusing on the calculation of probabilities.

More Counting Distributions

The three counting distributions – Poisson, binomial and negative binomial – provide a versatile tool kit in modeling the number of random events such as losses to the insured or claims to the insurer. The tool kit can be greatly expanded by modifying these three distributions to generate additional distributions. The new distributions belong to the (a,b,0) and (a,b,1) classes. This topic is discussed in the subsequent posts.


\copyright 2018 – Dan Ma

The Chi-Squared Distribution, Part 3a

This post is part 3 of a three-part series on the chi-squared distribution. In this post, we discuss the roles played by the chi-squared distribution in experiments or random phenomena that result in measurements that are categorical rather than quantitative (part 2 deals with quantitative measurements). An introduction to the chi-squared distribution is found in part 1.

The chi-squared test discussed here is also referred to as Pearson’s chi-squared test, which was formulated by Karl Pearson in 1900. It can be used to assess three types of comparison on categorical variables – goodness of fit, homogeneity, and independence. As a result, we break up the discussion into 3 parts – part 3a (goodness of fit, this post), part 3b (test of homogeneity) and part 3c (test of independence).

_______________________________________________________________________________________________

Multinomial Experiments

Let’s look at the setting for Pearson’s goodness-of-fit test. Consider a random experiment consisting of a series of independent trials each of which results in exactly one of k categories. We are interested in summarizing the counts of the trials that fall into the k distinct categories. Some examples of such random experiments are:

  • In rolling a die n times, consider the counts of the faces of the die.
  • Perform a series of experiments each of which is a toss of three coins. Summarize the experiments according to the number of heads, 0, 1, 2, and 3, that occur in each experiment.
  • Blood donors can be classified into the blood types A, B, AB and O.
  • Record the number of automobile accidents per week in a one-mile stretch of highway. Classify the weekly accident counts into the groupings 0, 1, 2, 3, 4 and 5+.
  • A group of auto insurance policies are classified into the claim frequency rates of 0, 1, 2, 3, 4+ accidents per year.
  • Auto insurance claims are classified into various claim size groupings, e.g. under 1000, 1000 to 5000, 5000 to 10000 and 10000+.
  • In auditing financial transactions in financial documents (accounting statements, expense reports etc), the leading digits of financial figures can be classified into 9 cells: 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Each of these examples can be referred to as a multinomial experiment. The characteristics of such an experiment are

  • The experiment consists of performing n identical trials that are independent.
  • For each trial, the outcome falls into exactly one of k categories or cells.
  • The probability of the outcome of a trial falling into a particular cell is constant across all trials.

For cell j, let p_j be the probability of the outcome falling into cell j. Of course, p_1+p_2+\cdots+p_k=1. We are interested in the joint random variables Y_1,Y_2,\cdots,Y_k where Y_j is the number of trials whose outcomes fall into cell j.

If k=2 (only two categories for each trial), then the experiment is a binomial experiment. Then one of the categories can be called success (with cell probability p) and the other is called failure (with cell probability 1-p). If Y_1 is the count of the successes, then Y_1 has a binomial distribution with parameters n and p.

In general, the variables Y_1,Y_2,\cdots,Y_k have a multinomial distribution. To be a little more precise, the random variables Y_1,Y_2,\cdots,Y_{k-1} have a multinomial distribution with parameters n and p_1,p_2,\cdots,p_{k-1}. Note that the last variable Y_k is deterministic since Y_k=n-(Y_1+\cdots+Y_{k-1}).

In the discussion here, the objective is to make inference on the cell probabilities p_1,p_2,\cdots,p_k. The hypotheses in the statistical test are expressed in terms of specific values of p_j, j=1,2,\cdots,k. For example, the null hypothesis may be of the following form:

    H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k

where p_{j,0} are the hypothesized values of the cell probabilities. It is cumbersome to calculate the probabilities for the multinomial distribution. As a result, it would be difficult (if not impossible) to calculate the exact level of significance, which is the probability of type I error. Thus it is critical to use a test statistic that does not depend on the multinomial distribution. Fortunately this problem was solved by Karl Pearson. He formulated a test statistic that has an approximate chi-squared distribution.

_______________________________________________________________________________________________

Test Statistic

The random variables Y_1,Y_2,\cdots,Y_k discussed above have a multinomial distribution with parameters n and p_1,p_2,\cdots,p_k. Of course, each p_j is the probability that the outcome of a trial falls into cell j. The marginal distribution of each Y_j is a binomial distribution with parameters n and p_j, with p_j being the probability of success. Thus the expected value and the variance of Y_j are E[Y_j]=n p_j and Var[Y_j]=n p_j (1-p_j). The following is the chi-squared test statistic.

    \displaystyle \chi^2=\sum \limits_{j=1}^k \frac{(Y_j-n \ p_j)^2}{n \ p_j}=\sum \limits_{j=1}^k \frac{(Y_j-E[Y_j])^2}{E[Y_j]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

The statistic defined in (1) was proposed by Karl Pearson in 1900. It is defined by summing the squares of the difference of the observed counts Y_j and the expected counts E[Y_j] where each squared difference is normalized by the expected count (i.e. divided by the expected count). On one level, the test statistic in (1) seems intuitive since it involves all the k deviations Y_j-E[Y_j]. If the observed values Y_j are close to the expected cell counts, then the test statistic in (1) would have a small value.

The chi-squared test statistic defined in (1) has an approximate chi-squared distribution when the number of trials n is large. The proof of this fact will not be discussed here. We demonstrate with the case for k=2.

    \displaystyle \begin{aligned} \sum \limits_{j=1}^2 \frac{(Y_j-n \ p_j)^2}{n \ p_j}&=\frac{p_2 \ (Y_1-n p_1)^2+p_1 \ (Y_2-n p_2)^2}{n \ p_1 \ p_2} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+p_1 \ ((n-Y_1)-n (1-p_1))^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+ p_1 \ (Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\biggl( \frac{Y_1-n p_1}{\sqrt{n \ p_1 \ (1-p_1)}} \biggr)^2 =\biggl( \frac{Y_1-E[Y_1]}{\sqrt{Var[Y_1]}} \biggr)^2 \end{aligned}

The quantity inside the brackets in the last step is approximately standard normal according to the central limit theorem. Since the square of a standard normal distribution has a chi-squared distribution with one degree of freedom (see Part 1), the last step in the above derivation has an approximate chi-squared distribution with 1 df.

In order for the chi-squared distribution to provide an adequate approximation to the test statistic in (1), a rule of thumb requires that the expected cell counts E[Y_j] are at least five. The null hypothesis to be tested is that the cell probabilities p_j are certain specified values p_{j,0} for j=1,2,\cdots,k. The following is the formal statement.

    H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The null hypothesis is to be tested against all possible alternatives. In other words, the alternative hypothesis H_1 is the statement that p_j \ne p_{j,0} for at least one j.

The chi-squared test statistic in (1) can be used for a goodness-of-fit test, i.e. to test how well a probability model fits the sample data, or in other words, to test whether the observed data come from a hypothesized probability distribution. Example 2 below tests whether the Poisson model is a good fit for a set of claim frequency data.

_______________________________________________________________________________________________

Degrees of Freedom

Now that we have addressed the distribution of the test statistic in (1), we need to address two more issues. One is the direction of the hypothesis test (one-tailed or two-tailed). The second is the degrees of freedom. The direction of the test is easy to see. Note that the chi-squared test statistic in (1) is always nonnegative. Moreover, a large difference between observed cell counts and expected cell counts would contradict the null hypothesis. Thus if the chi-squared statistic has a large value, we should reject the null hypothesis. So the correct test to use is the upper-tailed chi-squared test.

The number of degrees of freedom is obtained by subtracting one from the cell count k for each independent linear restriction placed on the cell probabilities. There is at least one linear restriction. The sum of all the cell probabilities must be 1. Thus the degrees of freedom must be the result of reducing k by one at least one time. This means that the degrees of freedom of the chi-squared statistic in (1) is at most k-1.

Furthermore, in the calculation of the specified cell probabilities, if any parameter is unknown and must be estimated from the data, then k-1 is reduced further, by one for each parameter estimated. Any unknown parameter that needs to be estimated from data should be estimated by maximum likelihood (MLE). All these points will be demonstrated by the examples below.

If the value of the chi-squared statistic in (1) is “large”, we reject the null hypothesis H_0 stated in (2). By large we mean that the value of the chi-squared statistic exceeds the critical value for the desired level of significance. The critical value is the point in the chi-squared distribution (with the appropriate df) whose upper-tail area is \alpha, where \alpha is the desired level of significance (e.g. \alpha=0.05, \alpha=0.01 or some other appropriate level). Instead of using the critical value, the p-value approach can also be used. The critical value or p-value can be looked up in a table or computed using software. For the examples below, chi-squared functions in Excel are used.

_______________________________________________________________________________________________

Examples

Example 1
Suppose that we wish to test whether a given die is a fair die. We roll the die 240 times and the following table shows the results.

    \displaystyle \begin{array} {rr} \text{Cell} & \text{Frequency} \\ 1 & 38  \\ 2 & 35 \\ 3 & 37 \\ 4 & 38  \\ 5 & 42 \\ 6 & 50 \\ \text{ } & \text{ } \\ \text{Total} & 240   \end{array}

The null hypothesis

    \displaystyle H_0: p_1=p_2=p_3=p_4=p_5=p_6=\frac{1}{6}

is tested against the alternative that at least one of the equalities is not true. This example is the simplest problem for testing cell probabilities. Since the specified values for the cell probabilities in H_0 are fully known (no parameters need to be estimated), the degrees of freedom is one less than the cell count. Thus df = 5. The following is the chi-squared statistic based on the data and the null hypothesis.

    \displaystyle \begin{aligned} \chi^2&=\frac{(38-40)^2}{40}+\frac{(35-40)^2}{40}+\frac{(37-40)^2}{40} \\& \ \ + \frac{(38-40)^2}{40}+\frac{(42-40)^2}{40}+\frac{(50-40)^2}{40}=3.65  \end{aligned}

At the \alpha=0.05 level of significance, the chi-squared critical value at df = 5 is \chi_{0.05}^2(5)=11.07049769. Since 3.65 < 11.07, the hypothesis that the die is fair is not rejected at \alpha=0.05. The p-value is P[\chi^2 > 3.65] \approx 0.6. With such a large p-value, we also come to the conclusion that the null hypothesis is not rejected. \square

In all the examples, the critical values and the p-values are obtained by using the following functions in Excel.

    critical value
    =CHISQ.INV.RT(level of significance, df)

    p-value
    =1 - CHISQ.DIST(test statistic, df, TRUE)
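
For readers who prefer a scripting language over Excel, the following Python sketch is one possible equivalent (it assumes the scipy library is available) and reproduces the two quantities used in Example 1.

    # One possible Python equivalent of the Excel functions above.
    from scipy.stats import chi2

    alpha, df, statistic = 0.05, 5, 3.65
    print(chi2.ppf(1 - alpha, df))   # critical value, about 11.0705
    print(chi2.sf(statistic, df))    # p-value, about 0.6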

Example 2
We now give an example of the chi-squared goodness-of-fit test. The numbers of auto accident claims per year for 700 drivers are recorded by an insurance company. The claim frequency data is shown in the following table.

    \displaystyle \begin{array}{rrr} \text{Claim Count} & \text{ } & \text{Frequency} \\ 0 & \text{ } & 351  \\ 1 & \text{ } & 241 \\ 2 & \text{ } & 73 \\ 3 & \text{ } & 29  \\ 4 & \text{ } & 6 \\ 5+ & \text{ } & 0 \\ \text{ } & \text{ } & \text{ } \\ \text{Total} & \text{ } & 700   \end{array}

Test the hypothesis that the annual claim count for a driver has a Poisson distribution. Use \alpha=0.05. Assume that the claim frequencies of the drivers in question are independent.

The hypothesized distribution of the annual claim frequency is a Poisson distribution with unknown mean \lambda. The MLE of the parameter \lambda is the sample mean, which in this case is \hat{\lambda}=\frac{0 \cdot 351+1 \cdot 241+2 \cdot 73+3 \cdot 29+4 \cdot 6}{700}=\frac{498}{700}=0.711428571.

Under the assumption that the claim frequency is Poisson with mean \hat{\lambda}, the cell probabilities are calculated using \hat{\lambda}.

    \displaystyle p_1=P[Y=0]=e^{-\hat{\lambda}}=0.4909

    \displaystyle p_2=P[Y=1]=\hat{\lambda} \ e^{-\hat{\lambda}}=0.3493

    \displaystyle p_3=P[Y=2]=\frac{1}{2} \ \hat{\lambda}^2 \ e^{-\hat{\lambda}}=0.1242

    \displaystyle p_4=P[Y=3]=\frac{1}{3!} \ \hat{\lambda}^3 \ e^{-\hat{\lambda}}=0.0295

    \displaystyle p_5=P[Y \ge 4]=1-P[Y=0]-P[Y=1]-P[Y=2]-P[Y=3]=0.0061

Then the null hypothesis is:

    H_0: p_1=0.4909, p_2=0.3493, p_3=0.1242, p_4=0.0295, p_5=0.0061

The null hypothesis is tested against all alternatives. The following table shows the calculation of the chi-squared statistic.

    \displaystyle \begin{array}{rrrrrrrrr}   \text{Cell} & \text{ } & \text{Claim Count} & \text{ } & \text{Cell Probability} & \text{ } & \text{Expected Count} & \text{ } & \text{Chi-Squared}   \\ 1 & \text{ } & 0 & \text{ } & 0.4909 & \text{ } & 343.63 & \text{ } & 0.15807   \\ 2 & \text{ } & 1 & \text{ } & 0.3493 & \text{ } & 244.51 & \text{ } & 0.05039   \\ 3 & \text{ } & 2 & \text{ } & 0.1242 & \text{ } & 86.94 & \text{ } & 2.23515   \\ 4 & \text{ } & 3 & \text{ } & 0.0295 & \text{ } & 20.65 & \text{ } & 3.37639    \\ 5 & \text{ } & 4+ & \text{ } & 0.0061 & \text{ } & 4.27 & \text{ } & 0.70091   \\ \text{ } & \text{ } & \text{ }   \\ \text{Total} & \text{ } & \text{ } & \text{ } & 1.0000 & \text{ } & \text{ } & \text{ } & 6.52091   \end{array}

The degrees of freedom of the chi-squared statistic is df = 5 - 1 - 1 = 3. The first reduction of one is due to the linear restriction that all cell probabilities sum to 1. The second reduction is due to the fact that one unknown parameter \lambda has to be estimated using sample data. Using Excel, the critical value is \chi_{0.05}^2(3)=7.814727903. The p-value is P[\chi^2 > 6.52091]=0.088841503. Thus the null hypothesis is not rejected at the level of significance \alpha=0.05. \square
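
The calculation in Example 2 can be reproduced with a short script. The sketch below is one way to do it in Python (scipy is assumed to be available); the cell probabilities are rounded to four decimal places so that the output matches the table above.

    # A sketch of the goodness-of-fit calculation in Example 2.
    import math
    from scipy.stats import chi2

    observed = [351, 241, 73, 29, 6]                  # claim counts 0, 1, 2, 3, 4+
    n = sum(observed)                                 # 700 drivers
    lam = sum(k * observed[k] for k in range(5)) / n  # MLE of lambda, 498/700

    # Poisson cell probabilities for 0, 1, 2, 3 claims, rounded as in the table,
    # with the last cell (4 or more claims) absorbing the remaining probability.
    probs = [round(math.exp(-lam) * lam ** k / math.factorial(k), 4) for k in range(4)]
    probs.append(1 - sum(probs))

    expected = [n * p for p in probs]
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    df = len(observed) - 1 - 1                        # = 3, as explained above
    print(chi_sq)                                     # about 6.52
    print(chi2.sf(chi_sq, df))                        # p-value, about 0.0888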

Example 3
For many data sets, especially data sets with numbers spread across multiple orders of magnitude, the first digits occur according to the probability distribution indicated below:

    Probability for leading digit 1 = 0.301
    Probability for leading digit 2 = 0.176
    Probability for leading digit 3 = 0.125
    Probability for leading digit 4 = 0.097
    Probability for leading digit 5 = 0.079
    Probability for leading digit 6 = 0.067
    Probability for leading digit 7 = 0.058
    Probability for leading digit 8 = 0.051
    Probability for leading digit 9 = 0.046

This probability distribution was discovered by Simon Newcomb in 1881 and was rediscovered by physicist Frank Benford in 1938. Since then this distribution has become known as Benford’s law. Thus in many data sets, the leading digit 1 occurs about 30% of the time. Data sets for which this law is applicable include demographic data (e.g. income data of a large population, census data such as populations of cities and counties) and scientific data. The law is also applicable to certain financial data, e.g. tax data, stock exchange data, corporate disbursement and sales data. Thus Benford’s law is a useful tool for forensic accounting and auditing.

The following shows the distribution of first digits in the population counts of all 3,143 counties in the United States (from US census data).

    Count for leading digit 1 = 972
    Count for leading digit 2 = 573
    Count for leading digit 3 = 376
    Count for leading digit 4 = 325
    Count for leading digit 5 = 205
    Count for leading digit 6 = 209
    Count for leading digit 7 = 179
    Count for leading digit 8 = 155
    Count for leading digit 9 = 149

Use the chi-squared goodness-of-fit test to test the hypothesis that the leading digits in the county population data follow Benford’s law. This example is also discussed in this blog post. \square
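
Below is a possible computational sketch for Example 3 in Python (the layout is our own assumption; the counts and Benford probabilities are the ones listed above). It computes the chi-squared statistic and p-value; carrying out the test and interpreting the result is left to the reader.

    # A sketch of the chi-squared test for Example 3 (Benford's law).
    from scipy.stats import chi2

    benford = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
    observed = [972, 573, 376, 325, 205, 209, 179, 155, 149]

    n = sum(observed)                      # 3,143 county population counts
    expected = [n * p for p in benford]
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    df = len(observed) - 1                 # no parameters are estimated from the data
    print(chi_sq, chi2.sf(chi_sq, df))     # test statistic and p-value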

For further information and more examples on chi-squared test, please see the sources listed in the reference section.

_______________________________________________________________________________________________

Reference

  1. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 7th ed., W. H. Freeman and Company, New York, 2012
  2. Wackerly D. D., Mendenhall III W., Scheaffer R. L., Mathematical Statistics with Applications, Thomson Learning, Inc., California, 2008

_______________________________________________________________________________________________
\copyright \ 2017 - \text{Dan Ma}

More topics on the exponential distribution

This is a continuation of two previous posts on the exponential distribution (an introduction and a post on the connection with the Poisson process). This post presents more properties that are not discussed in the two previous posts. Of course, a serious and in-depth discussion of the exponential distribution can fill volumes. The goal here is quite modest – to present a few more properties related to the memoryless property of the exponential distribution.

_______________________________________________________________________________________________

The Failure Rate

A previous post discusses the Poisson process and its relation to the exponential distribution. Now we present another way of looking at both notions. Suppose a counting process counts the occurrences of a certain type of random event. Suppose that an event means a termination of a system, be it biological or manufactured. Furthermore suppose that the terminations occur according to a Poisson process at a constant rate \lambda per unit time. Then what is the meaning of the rate \lambda? It is the rate of termination (dying). It is usually called the failure rate (or hazard rate or force of mortality). The meaning of the constant rate \lambda is that the rate of dying is the same regardless of the location on the time scale (i.e. regardless of how long a life has lived). This means that the lifetime (the time until death) of such a system has no memory. Since the exponential distribution is the only continuous distribution with the memoryless property, the time until the next termination inherent in the Poisson process in question must be an exponential random variable with rate \lambda or mean \frac{1}{\lambda}. So the notion of the failure rate function (or hazard rate function) ties together the exponential distribution and the Poisson process and further illustrates the memoryless property.

Consider a continuous random variable X that only takes on the positive real numbers. Suppose F and f are the CDF and density function of X, respectively. The survival function is S=1-F. The failure rate (or hazard rate) \mu(t) is defined as:

    \displaystyle \mu(t)=\frac{f(t)}{1-F(t)}=\frac{f(t)}{S(t)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

The function \mu(t) can be interpreted as the rate of failure at the next instant given that the life has survived to time t. Suppose that the lifetime distribution is exponential. Because of the memoryless property, the remaining lifetime of a t-year-old item has the same distribution as the lifetime of a new item. It is then intuitively clear that the failure rate must be constant. Indeed, it is straightforward to show that if the lifetime X is an exponential random variable with rate parameter \lambda, i.e. the density is f(t)=\lambda e^{-\lambda t}, then the failure rate is \mu(t)=\lambda. This is why the parameter \lambda in the density function f(t)=\lambda e^{-\lambda t} is called the rate parameter.

On the other hand, the failure rate or hazard rate is not constant for other lifetime distributions. However, the hazard rate function \mu(t) uniquely determines the distributional quantities such as the CDF and the survival function. The definition (1) shows that the failure rate is derived using the CDF or survival function. We show that the failure rate is enough information to derive the CDF or the survival function. From the definition, we have:

    \displaystyle \mu(t)=\frac{-\frac{d}{dt} S(t)}{S(t)} \ \ \  \text{or} \ \ \ -\mu(t)=\frac{\frac{d}{dt} S(t)}{S(t)}

Integrating both sides produces the following (using the fact that S(0)=1):

    \displaystyle - \int_0^t \ \mu(x) \ dx=\int_0^t \ \frac{\frac{d}{dx} S(x)}{S(x)} \ dx=\text{ln} S(t)

Exponentiating each side produces the following:

    \displaystyle S(t)=e^{-\int_0^t \ \mu(x) \ dx} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

Once the survival function is obtained from \mu(t), the CDF and the density function can be obtained. Interestingly, the derivation in (2) gives another proof of the fact that the exponential distribution is the only continuous distribution with the memoryless property. If X is memoryless, then the failure rate must be constant. If \mu(x) is constant, then by (2), S(t) is the exponential survival function. Of course, the other direction is clear: if X is exponential, then X is memoryless.

The preceding discussion shows that having a constant failure rate is another way to characterize the exponential distribution, in particular the memoryless property of the exponential distribution. Before moving on to the next topic, here is another example of a failure rate function: \mu(t)=\frac{\alpha}{\beta} (\frac{t}{\beta})^{\alpha-1} where both \alpha and \beta are positive constants. This is the Weibull hazard rate function. The derived survival function is S(t)=e^{-(\frac{t}{\beta})^\alpha}, which is the survival function of the Weibull distribution. This distribution is an excellent model choice for describing the lifetimes of manufactured objects. See here for an introduction to the Weibull distribution.
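
As a quick numerical illustration of relation (2), the Python sketch below (the parameter values and step count are arbitrary choices on our part) integrates the Weibull hazard rate numerically and compares the result with the closed-form Weibull survival function.

    # Numerical check of S(t) = exp(-integral of mu from 0 to t) for the Weibull hazard.
    import math

    alpha, beta = 2.0, 3.0
    def mu(t):
        return (alpha / beta) * (t / beta) ** (alpha - 1)   # Weibull hazard rate

    def survival_from_hazard(t, steps=100000):
        h = t / steps
        integral = sum(mu((i + 0.5) * h) for i in range(steps)) * h   # midpoint rule
        return math.exp(-integral)

    t = 2.5
    print(survival_from_hazard(t))             # from relation (2)
    print(math.exp(-(t / beta) ** alpha))      # closed-form Weibull survival function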

_______________________________________________________________________________________________

The Minimum Statistic

The following result is about the minimum of independent exponential distributions.

    Suppose that X_1,X_2,\cdots,X_n are independent exponential random variables with rates \lambda_1,\lambda_2,\cdots,\lambda_n, respectively. Then the minimum of the sample, denoted by Y=\text{min}(X_1,X_2,\cdots,X_n), is also an exponential random variable with rate \lambda_1+\lambda_2 +\cdots+\lambda_n. \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

For the minimum to be > y, all sample items must be > y. Thus P[Y> y] is:

    \displaystyle \begin{aligned} P[Y> y]&=P[X_1> y] \times P[X_2> y] \times \cdots \times P[X_n> y] \\&=e^{-\lambda_1 y} \times e^{-\lambda_2 y} \times \cdots \times e^{-\lambda_n y}\\&=e^{-(\lambda_1+\lambda_2+\cdots+\lambda_n) y}  \end{aligned}

This means that Y=\text{min}(X_1,X_2,\cdots,X_n) has an exponential distribution with rate \lambda_1+\lambda_2+\cdots+\lambda_n. As a result, the smallest item of a sample of independent exponential observations also follows an exponential distribution with rate being the sum of all the individual exponential rates.
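
The result in (3) is easy to check by simulation. The following Python sketch (rates and sample size chosen arbitrarily) compares the simulated behavior of the minimum with the exponential distribution whose rate is the sum of the individual rates.

    # Simulation check of (3): the minimum of independent exponentials.
    import random

    rates = [0.5, 1.0, 2.0]              # arbitrary rates; the total rate is 3.5
    trials = 100000
    mins = [min(random.expovariate(r) for r in rates) for _ in range(trials)]

    print(sum(mins) / trials)                     # about 1/3.5 = 0.2857
    print(sum(m > 0.3 for m in mins) / trials)    # about exp(-3.5 * 0.3) = 0.35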

Example 1
Suppose that X_1,X_2,X_3 are independent exponential random variables with rates \lambda_i for i=1,2,3. Calculate E[\text{min}(X_1,X_2,X_3)] and E[\text{max}(X_1,X_2,X_3)].

Let X=\text{min}(X_1,X_2,X_3) and Y=\text{max}(X_1,X_2,X_3). Then X is an exponential distribution with rate \lambda_1+\lambda_2+\lambda_3. As a result, E[X]=\frac{1}{\lambda_1+\lambda_2+\lambda_3}. Finding the expected value of the maximum requires calculation. First calculate P[Y \le y].

    \displaystyle \begin{aligned} P[Y \le y]&=P[X_1 \le y] \times P[X_2 \le y] \times  P[X_3 \le y] \\&=(1-e^{-\lambda_1 y}) \times (1-e^{-\lambda_2 y}) \times (1-e^{-\lambda_3 y}) \\&=1-e^{-\lambda_1 y}-e^{-\lambda_2 y}-e^{-\lambda_3 y}+e^{-(\lambda_1+\lambda_2) y}+e^{-(\lambda_1+\lambda_3) y} \\& \ \ \ +e^{-(\lambda_2+\lambda_3) y}-e^{-(\lambda_1+\lambda_2+\lambda_3) y}  \end{aligned}

Differentiate F_Y(y)=P[Y \le y] to obtain the density function f_Y(y).

    \displaystyle \begin{aligned} f_Y(y)&=\lambda_1 e^{-\lambda_1 y}+\lambda_2 e^{-\lambda_2 y}+ \lambda_3 e^{-\lambda_3 y} \\& \ \ \ -(\lambda_1+\lambda_2) e^{-(\lambda_1+\lambda_2) y}-(\lambda_1+\lambda_3) e^{-(\lambda_1+\lambda_3) y}-(\lambda_2+\lambda_3)e^{-(\lambda_2+\lambda_3) y} \\& \ \ \ \ +(\lambda_1+\lambda_2+\lambda_3)e^{-(\lambda_1+\lambda_2+\lambda_3) y}  \end{aligned}

Each term in the density function f_Y(y) is, apart from its sign, an exponential density. Integrating term by term, the mean of the maximum is:

    \displaystyle E[Y]=\frac{1}{\lambda_1}+\frac{1}{\lambda_2}+ \frac{1}{\lambda_3} -\frac{1}{\lambda_1+\lambda_2}-\frac{1}{\lambda_1+\lambda_3}-\frac{1}{\lambda_2+\lambda_3}+\frac{1}{\lambda_1+\lambda_2+\lambda_3}

To make sense of the numbers, let \lambda_1=\lambda_2=\lambda_3=1. Then E[X]=\frac{1}{3}=\frac{2}{6} and E[Y]=\frac{11}{6}. In this case, the expected value of the maximum is 5.5 times the expected value of the minimum. For \lambda_1=1, \lambda_2=2 and \lambda_3=3, E[X]=\frac{1}{6} and E[Y]=\frac{73}{60}=\frac{7.3}{6}. In the second case, the expected value of the maximum is 7.3 times the expected value of the minimum. \square
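
A quick simulation can be used to sanity-check the two expected values. The Python sketch below uses the second set of rates from the example; it is only an illustration, not part of the original calculation.

    # Simulation check of E[min] and E[max] for rates 1, 2, 3.
    import random

    rates = [1.0, 2.0, 3.0]
    trials = 100000
    total_min, total_max = 0.0, 0.0
    for _ in range(trials):
        draws = [random.expovariate(r) for r in rates]
        total_min += min(draws)
        total_max += max(draws)

    print(total_min / trials)   # about 1/6 = 0.1667
    print(total_max / trials)   # about 73/60 = 1.2167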

_______________________________________________________________________________________________

Ranking Independent Exponential Distributions

In this section, X_1,X_2,\cdots,X_n are independent exponential random variables where the rate of X_i is \lambda_i for i=1,2,\cdots,n. What is the probability P[X_{j(1)} <X_{j(2)}<\cdots<X_{j(k)}]? Here the subscripts j(1),\cdots,j(k) are distinct integers from \left\{1,2,\cdots,n \right\}. For example, for a sample size of 2, what are the probabilities P[X_1<X_2] and P[X_2<X_1]? For a sample size of 3, what are P[X_1<X_2<X_3] and P[X_2<X_1<X_3]? First, consider the case of ranking two independent exponential random variables.

    Ranking X_1 and X_2.

      \displaystyle P[X_1<X_2]=\frac{\lambda_1}{\lambda_1+\lambda_2} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

    where \lambda_1 is the rate of X_1 and \lambda_2 is the rate of X_2. Note that this probability is the ratio of the rate of the variable required to be smaller over the total rate. The probability P[X_1<X_2] can be computed by evaluating the following integral:

      \displaystyle P[X_1<X_2]=\int_0^\infty \int_x^\infty \ \lambda_1 e^{-\lambda_1 \ x} \ \lambda_2 e^{-\lambda_2 \ y} \ dy \ dx

The natural next step is to rank three or more exponential random variables. Ranking three variables would require a triple integral and ranking more variables would require a larger multiple integral. Instead, we rely on a useful fact about the minimum statistic. First, another basic result.

    When one of the variables is the minimum:

      \displaystyle P[X_i=\text{min}(X_1,\cdots,X_n)]=\frac{\lambda_i}{\lambda_1+\lambda_2+\cdots+\lambda_n} \ \ \ \ \ \ \ \ \ \ \ \ \ (5)

    The above says that the probability that the ith random variable is the smallest is simply the ratio of the rate of the ith variable over the total rate. This follows from (4) since we are ranking the two exponential variables X_i and \text{min}(X_1,\cdots,X_{i-1},X_{i+1},\cdots,X_n).
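
As a quick check of (5), the Python sketch below (rates chosen arbitrarily) estimates by simulation the probability that the first variable is the smallest of three independent exponentials.

    # Simulation check of (5): P[X_1 = min] = lambda_1 / (lambda_1 + lambda_2 + lambda_3).
    import random

    rates = [1.0, 2.0, 3.0]
    trials = 100000
    hits = 0
    for _ in range(trials):
        draws = [random.expovariate(r) for r in rates]
        if draws[0] == min(draws):
            hits += 1

    print(hits / trials)        # about 1/(1 + 2 + 3) = 0.1667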

We now consider the following theorem.

Theorem 1
Let X_1,X_2,\cdots,X_n be independent exponential random variables. Then the minimum statistic \text{min}(X_1,X_2,\cdots,X_n) and the rank ordering of X_i are independent.

The theorem basically says that the probability of a ranking is not dependent on the location of the minimum statistic. For example, if we know that the minimum is more than 3, what is the probability of X_1<X_2<X_3? The theorem is saying that conditioning on \text{min}(X_1,X_2,X_3)>3 makes no difference on the probability of the ranking. Let Y=\text{min}(X_1,X_2,\cdots,X_n). The following establishes the theorem.

    \displaystyle P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)} \ |\ \text{min}(X_1,X_2,\cdots,X_n)>t]

    \displaystyle =P[X_{j(1)}-t<X_{j(2)}-t<\cdots<X_{j(k)}-t \ |\ \text{min}(X_1,X_2,\cdots,X_n)>t]

    \displaystyle =P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)}]

The key to the proof is the step from the second line to the third line. Assume that each X_i is the lifetime of a machine. When \text{min}(X_1,X_2,\cdots,X_n)>t, all the lifetimes satisfy X_i>t. By the memoryless property, the remaining lifetimes X_i-t are independent and exponential with the original rates. In other words, each X_i-t has the same exponential distribution as X_i. Consequently, the second line equals the third line. To make it easier to see this step, think of the two-dimensional case with X_1 and X_2. \square

The following is a consequence of Theorem 1.

Corollary 2
Let X_1,X_2,\cdots,X_n be independent exponential random variables. Then the event X_i=\text{min}(X_1,X_2,\cdots,X_n) and the rank ordering of X_1,\cdots,X_{i-1},X_{i+1}\cdots,X_n are independent.

Another way to state the corollary is that the knowledge that X_i is the smallest in the sample has no effect on the ranking of the variables other than X_i. This is a consequence of Theorem 1. To see why, let t=X_i=\text{min}(X_1,X_2,\cdots,X_n).

    \displaystyle P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)} \ |\ t=X_i=\text{min}(X_1,X_2,\cdots,X_n)]

    \displaystyle =P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)} \ |\ \text{min}(X_1,\cdots,X_{i-1},X_{i+1}\cdots,X_n)>t]

    \displaystyle =P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)}] \square

We now present examples demonstrating how these ideas are used.

Example 2
Suppose that a bank has three tellers for serving its customers. The random variables X_1,X_2,X_3 are independent exponential random variables where X_i is the time spent by teller i serving a customer. The rate parameter of X_i is \lambda_i where i=1,2,3. If all three tellers are busy serving customers, what is P[X_1<X_2<X_3]? If the bank has 4 tellers instead, then what is P[X_1<X_2<X_3<X_4]?

The answer is given by the following:

    \displaystyle \begin{aligned} P[X_1<X_2<X_3]&=P[X_1=\text{min}(X_1,X_2,X_3)] \\& \ \ \ \times P[X_2<X_3 | X_1=\text{min}(X_1,X_2,X_3)] \\&\text{ } \\&=P[X_1=\text{min}(X_1,X_2,X_3)] \times P[X_2<X_3] \\&\text{ } \\&=\frac{\lambda_1}{\lambda_1+\lambda_2+\lambda_3} \times \frac{\lambda_2}{\lambda_2+\lambda_3}  \end{aligned}

The derivation uses Corollary 2 and the idea in (5). The same idea can be used for P[X_1<X_2<X_3<X_4].

    \displaystyle \begin{aligned} P[X_1<X_2<X_3<X_4]&=P[X_1=\text{min}(X_1,X_2,X_3,X_4)] \\& \ \ \times P[X_2<X_3<X_4 | X_1=\text{min}(X_1,X_2,X_3,X_4)] \\&\text{ } \\&=P[X_1=\text{min}(X_1,X_2,X_3,X_4)] \times P[X_2<X_3<X_4] \\&\text{ } \\&=P[X_1=\text{min}(X_1,X_2,X_3,X_4)] \\& \ \ \ \times P[X_2=\text{min}(X_2,X_3,X_4)] \times P[X_3<X_4] \\&\text{ } \\&=\frac{\lambda_1}{\lambda_1+\lambda_2+\lambda_3+\lambda_4} \times \frac{\lambda_2}{\lambda_2+\lambda_3+\lambda_4} \times \frac{\lambda_3}{\lambda_3+\lambda_4} \end{aligned}

The above applies the independence result twice, the first time on X_1,X_2,X_3,X_4 and the second time on X_2,X_3,X_4. This approach is much preferred over direct calculation, which would involve multiple integrals that are tedious and error prone. \square
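
The product formula just derived can also be verified by simulation. The Python sketch below (rates chosen arbitrarily) compares a simulated estimate of P[X_1<X_2<X_3] with the product of the ratios.

    # Simulation check of P[X_1 < X_2 < X_3] against the product formula.
    import random

    r1, r2, r3 = 1.0, 2.0, 3.0
    trials = 200000
    hits = 0
    for _ in range(trials):
        x1, x2, x3 = (random.expovariate(r) for r in (r1, r2, r3))
        if x1 < x2 < x3:
            hits += 1

    formula = r1 / (r1 + r2 + r3) * r2 / (r2 + r3)
    print(hits / trials, formula)     # both about 0.0667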

Example 3
As in Example 2, a bank has three tellers for serving its customers. The service times of the three tellers are independent exponential random variables. The mean service time for teller i is \frac{1}{\lambda_i} minutes where i=1,2,3. You walk into the bank and find that all three tellers are busy serving customers. You are the only customer waiting for an available teller. Calculate the expected amount of time you spend at the bank.

Let T be the total time you spend in the bank, which is T=W+S where W is the waiting time for a teller to become free and S is the service time of the teller helping you. When you walk into the bank, the tellers are already busy. Let X_i be the remaining service time for teller i, i=1,2,3. By the memoryless property, X_i is exponential with the original mean \frac{1}{\lambda_i}. As a result, the rate parameter of X_i is \lambda_i.

The waiting time W is simply \text{min}(X_1,X_2,X_3). Thus E[W]=\frac{1}{\lambda_1+\lambda_2+\lambda_3}. To find E[S], we need to consider three cases, depending on which teller finishes serving the current customer first.

    \displaystyle \begin{aligned} E[S]&=E[S | X_1=\text{min}(X_1,X_2,X_3)] \times P[X_1=\text{min}(X_1,X_2,X_3)] \\& \ \ + E[S | X_2=\text{min}(X_1,X_2,X_3)] \times P[X_2=\text{min}(X_1,X_2,X_3)]  \\& \ \ + E[S | X_3=\text{min}(X_1,X_2,X_3)] \times P[X_3=\text{min}(X_1,X_2,X_3)]\end{aligned}

Finishing the calculation,

    \displaystyle \begin{aligned} E[S]&=\frac{1}{\lambda_1} \times \frac{\lambda_1}{\lambda_1+\lambda_2+\lambda_3}  + \frac{1}{\lambda_2} \times \frac{\lambda_2}{\lambda_1+\lambda_2+\lambda_3}  + \frac{1}{\lambda_3} \times \frac{\lambda_3}{\lambda_1+\lambda_2+\lambda_3}   \\&=\frac{3}{\lambda_1+\lambda_2+\lambda_3} \end{aligned}

    \displaystyle E[T]=\frac{1}{\lambda_1+\lambda_2+\lambda_3}+\frac{3}{\lambda_1+\lambda_2+\lambda_3}=\frac{4}{\lambda_1+\lambda_2+\lambda_3}
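
The answer in Example 3 can be checked with a small Monte Carlo experiment. The Python sketch below (rates chosen arbitrarily) simulates the remaining service times, the wait for the first free teller, and your own service time.

    # Monte Carlo check of Example 3: E[T] = 4 / (lambda_1 + lambda_2 + lambda_3).
    import random

    rates = [1.0, 2.0, 3.0]
    trials = 100000
    total = 0.0
    for _ in range(trials):
        remaining = [random.expovariate(r) for r in rates]   # remaining service times
        w = min(remaining)                                   # waiting time for a free teller
        i = remaining.index(w)                               # the teller who frees up first
        s = random.expovariate(rates[i])                     # your own service time
        total += w + s

    print(total / trials)    # about 4/(1 + 2 + 3) = 0.6667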

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}

The exponential distribution and the Poisson process

This post is a continuation of the previous post on the exponential distribution. The previous post discusses the basic mathematical properties of the exponential distribution including the memoryless property. This post looks at the exponential distribution from another angle by focusing on the intimate relation with the Poisson process.

_______________________________________________________________________________________________

The Poisson Process

A previous post shows that a sub family of the gamma distribution that includes the exponential distribution is derived from a Poisson process. This post gives another discussion on the Poisson process to draw out the intimate connection between the exponential distribution and the Poisson process.

Suppose that random events of a certain type occur at the rate of \lambda events per time interval of length 1. As the random events occur, we wish to count the occurrences. Starting at time 0, let N(t) be the number of events that occur by time t. A counting process is the collection of all the random variables N(t). That is, we are interested in the collection \left\{N(t): t \ge 0 \right\}. More specifically, we are interested in a counting process that satisfies the following three axioms:

  • The probability of having more than one occurrence in a short time interval is essentially zero.
  • The probability of the occurrence of a random event in a short time interval is proportional to the length of the time interval and does not depend on where the time interval is located. More specifically, the probability of the occurrence of the random event in a short interval of length h is approximately \lambda \ h.
  • The numbers of random events occurring in non-overlapping time intervals are independent.

Any counting process that satisfies the above three axioms is called a Poisson process with the rate parameter \lambda. Of special interest are the counting random variables N(t), which is the number of random events that occur in the interval [0,t] and N(s+t)-N(s), which is the number of events that occur in the interval [s,s+t]. There are also continuous variables that are of interest. For example, the time until the occurrence of the first event, denoted by S_1, and in general, the time until the occurrence of the nth event, denoted by S_n. We are also interested in the interarrival times, i.e. the times between the occurrences of two consecutive events. These are denoted by T_1, T_2, T_3,\cdots where T_n is the time between the occurrence of the (n-1)st event and the occurrence of the nth event. Of course, S_n=T_1+T_2+\cdots+T_n.

_______________________________________________________________________________________________

Some Basic Results

Given a Poisson process \left\{N(t): t \ge 0 \right\} with rate parameter \lambda>0, we discuss the following basic results:

    Counting Variables

  • The discrete random variable N(t) is a Poisson random variable with mean \lambda \ t.
  • The discrete random variable N(s+t)-N(s) is a Poisson random variable with mean \lambda \ t.

    Waiting Time Variables

  • The interarrival times, T_n, where n=1,2,3,\cdots, are independent and identically distributed exponential random variables with rate parameter \lambda (or mean \frac{1}{\lambda}).
  • The waiting time until the nth event has a gamma distribution with shape parameter n and rate parameter \lambda.

The result that N(t) is a Poisson random variable is a consequence of the fact that the Poisson distribution is the limit of the binomial distribution. To see this, let’s say we have a Poisson process with rate \lambda>0. Note that \lambda is the rate of occurrence of the event per unit time interval. Then subdivide the interval [0,t] into n subintervals of equal length. When n is sufficiently large, we can assume that there can be at most one event occurring in a subinterval (using the first two axioms in the Poisson process). Each subinterval is then like a Bernoulli trial (either 0 events or 1 event occurring in the subinterval). The probability of having exactly one event occurring in a subinterval is approximately \lambda (\frac{t}{n})=\frac{\lambda t}{n}. By the third axiom in the Poisson process, the n subintervals are independent Bernoulli trials. Thus the total number of events occurring in these n subintervals is a binomial random variable with n trials and with probability of success in each trial being \frac{\lambda t}{n}. It can be shown mathematically that when n \rightarrow \infty, the binomial distributions converge to the Poisson distribution with mean \lambda \ t. This fact is shown here and here. It follows that N(t) has a Poisson distribution with mean \lambda \ t.

To show that the increment N(s+t)-N(s) is a Poisson distribution, we simply count the events in the Poisson process starting at time s. It is clear that the resulting counting process is also a Poisson process with rate \lambda. We can use the same subdivision argument to derive the fact that N(s+t)-N(s) is a Poisson random variable with mean \lambda \ t. The subdividing is of course on the interval [s,s+t].

Based on the preceding discussion, given a Poisson process with rate parameter \lambda, the number of occurrences of the random events in any interval of length t has a Poisson distribution with mean \lambda \ t. Thus in a Poisson process, the numbers of events that occur in any two intervals of the same length have the same distribution. Any counting process that satisfies this property is said to possess stationary increments. On the other hand, any counting process that satisfies the third axiom of the Poisson process (the numbers of occurrences of events in disjoint intervals are independent) is said to have independent increments. Thus a Poisson process possesses independent increments and stationary increments.

Here is an interesting observation as a result of the possession of independent increments and stationary increments in a Poisson process. By independent increments, the process from any point forward is independent of what had previously occurred. By stationary increments, from any point forward, the occurrences of events follow the same distribution as in the previous phase. Starting with a Poisson process, if we count the events from some point forward (calling the new point as time zero), the resulting counting process is probabilistically the same as the original process. The probabilistic behavior of the new process from some point on is not dependent on history. The probability statements we can make about the new process from some point on can be made using the same parameter as the original process. In other words, a Poisson process has no memory.

We now discuss the continuous random variables derived from a Poisson process. The time until the first event, T_1, has an exponential distribution with mean \frac{1}{\lambda}. To see this, note that for T_1>t to happen, there must be no events occurring in the interval [0,t]. Thus P[T_1>t]=P[N(t)=0]=e^{-\lambda t}. This means that T_1 has an exponential distribution with rate \lambda. What about T_2 and T_3 and so on? After the first event has occurred, we can reset the counting process to count the events starting at time T_1. Then the time until the next occurrence is also an exponential random variable with rate \lambda. Consequently, all the interarrival times T_n are exponential random variables with the same rate \lambda. Furthermore, by the discussion in the preceding paragraph, the exponential interarrival times T_1, T_2, T_3,\cdots are independent.

As a consequence of the T_n being independent exponential random variables, the waiting time until the nth event S_n is a gamma random variable with shape parameter n and rate parameter \lambda. Note that S_n=T_1+\cdots+T_n and that the independent sum of n identically distributed exponential random variables has a gamma distribution with shape parameter n and rate parameter \lambda, the common exponential rate.

The connection between exponential/gamma and the Poisson process provides an expression of the CDF and survival function for the gamma distribution when the shape parameter is an integer. Specifically, the following shows the survival function and CDF of the waiting time S_n as well as the density.

    \displaystyle P(S_n>t)=\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}

    \displaystyle P(S_n \le t)=1-\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}=\sum \limits_{k=n}^{\infty} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}

    \displaystyle f_{S_n}(t)=\frac{1}{(n-1)!} \ \lambda^n \ t^{n-1} \ e^{-\lambda t}

The key in establishing the survival function \displaystyle P(S_n>t) is that the waiting time S_n is intimately related to N(t), which has a Poisson distribution with mean \lambda t. For S_n>t to happen, there can be at most n-1 events occurring prior to time t, i.e. N(t) \le n-1.
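
The identity between the Poisson sum above and the gamma survival function of S_n can be checked numerically. The Python sketch below (scipy assumed available; the parameter values are arbitrary) evaluates both sides.

    # Numerical check: P(S_n > t) as a Poisson sum versus the gamma survival function.
    import math
    from scipy.stats import gamma

    lam, n, t = 1.5, 4, 2.0
    poisson_sum = sum(math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)
                      for k in range(n))
    print(poisson_sum)                        # survival function via the Poisson sum
    print(gamma.sf(t, a=n, scale=1 / lam))    # same value from the gamma distribution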

_______________________________________________________________________________________________

Reversing the Poisson Process

The preceding discussion shows that a Poisson process has independent exponential waiting times between any two consecutive events and gamma waiting time between any two events. Starting with a collection \left\{N(t): t \ge 0 \right\} of Poisson counting random variables that satisfies the three axioms described above, it can be shown that the sequence of interarrival times are independent exponential random variables with the same rate parameter as in the given Poisson process. Interestingly, the process can also be reversed, i.e. given a sequence of independent and identically distributed exponential distributions, each with rate \lambda, a Poisson process can be generated.

To see this, let T_1,T_2,T_3,\cdots be a sequence of independent and identically distributed exponential random variables with rate parameter \lambda. Mathematically, the T_n are just independent and identically distributed exponential random variables. Now think of them as the interarrival times between consecutive events. So the first event occurs at time S_1=T_1 and the second event occurs at time S_2=T_1+T_2 and so on. In general, the nth event occurs at time S_n=T_1+\cdots+T_n. More specifically, the counting process is \left\{N(t): t \ge 0 \right\} where N(t) is defined below:

    N(t)=\text{max} \left\{n: S_n=T_1+\cdots+T_n \le t \right\}

For N(t)=n to happen, it must be true that S_n=T_1+\cdots+T_n \le t and S_{n+1}=T_1+\cdots+T_n+T_{n+1} > t. The probability P[N(t)=n] is then

    \displaystyle \begin{aligned}P[N(t)=n]&=P[S_{n+1} > t \text{ and } S_n \le t]\\&=P[S_{n+1} > t \text{ and } (\text{ not } S_n > t)] \\&=P[S_{n+1} > t]-P[S_n > t] \\&=\sum \limits_{k=0}^{n} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}-\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!} \\&=\frac{e^{-\lambda t} \ (\lambda t)^n}{n!} \end{aligned}

The above derivation shows that the counting variable N(t) is a Poisson random variable with mean \lambda t. The derivation uses the gamma survival function derived earlier. On the other hand, suppose that the counting of events starts at a time s rather than at time 0. By the memoryless property, the time from s until the next arrival is exponential with rate \lambda and is independent of what happened before time s, and the subsequent interarrival times are the original independent exponential T_n with the same rate \lambda. By the same argument as above, N(s+t)-N(s) would be Poisson with mean \lambda t.

We have just established that the counting process resulting from independent exponential interarrival times T_n has stationary increments. The resulting counting process has independent increments too. This is because the interarrival times T_n are independent and memoryless.

From a mathematical point of view, a sequence of independent and identically distributed exponential random variables leads to a Poisson counting process. In other words, if random events occur in such a manner that the times between two consecutive events are independent and identically exponentially distributed, then such a process is a Poisson process. This characterization gives another way to work with Poisson processes.
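
The reversed construction is easy to simulate. The Python sketch below (the rate, time horizon and sample size are arbitrary choices) generates exponential interarrival times, counts the arrivals in [0,t], and compares the resulting counts with the Poisson distribution with mean \lambda t.

    # Simulating a Poisson process from exponential interarrival times.
    import math
    import random

    lam, t, trials = 2.0, 3.0, 50000
    counts = []
    for _ in range(trials):
        clock, n = 0.0, 0
        while True:
            clock += random.expovariate(lam)   # next exponential interarrival time
            if clock > t:
                break
            n += 1
        counts.append(n)

    mean = lam * t
    print(sum(counts) / trials)                             # about lambda * t = 6
    print(sum(c == 6 for c in counts) / trials)             # simulated P[N(t) = 6]
    print(math.exp(-mean) * mean ** 6 / math.factorial(6))  # Poisson probability, about 0.161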

_______________________________________________________________________________________________

Examples

Example 1
Suppose that the time until the next departure of a bus at a certain bus station is exponentially distributed with mean 10 minutes. Assume that the times in between consecutive departures at this bus station are independent.

  • Tom and his friend Mike are to take a bus trip together. Tom arrives at the bus station at 12:00 PM and is the first one to arrive. Mike arrives at the bus station at 12:30 PM. They will board the first bus to depart after the arrival of Mike. What is the probability that zero buses depart from this bus station while Tom is waiting for Mike?
  • What are the corresponding probabilities for exactly one bus and for exactly two buses?
  • What is the probability that there are at least three buses leaving the station while Tom is waiting?

Because the inter-departure times are independent and exponential with the same mean, the random events (bus departures) occur according to a Poisson process with rate \frac{1}{10} per minute, or 1 bus per 10 minutes. The number of bus departures in a 30-minute period is a Poisson random variable with mean 3 (per 30 minutes). Let N be the number of buses leaving the bus station between 12:00 PM and 12:30 PM. Thus the answers are:

    P[N=0]=e^{-3}=0.05

    P[N=1]=e^{-3} \times 3=0.149

    P[N=2]=e^{-3} \times \frac{3^2}{2}=0.224

    P[N \ge 3]=1-P[N=0]-P[N=1]-P[N=2]=0.5768 \square

Example 2
Taxis arrive at a certain street corner according to a Poisson process at the rate of two taxis per 15 minutes. Suppose that you are waiting for a taxi at this street corner and you are third in line. Assume that the people waiting for taxis do not know each other and each person will take his or her own taxi. What is the probability that you will board a taxi within 30 minutes?

The number of taxi arrivals in a 30-minute period has a Poisson distribution with a mean of 4 (per 30 minutes). Since you are third in line, you will board a taxi within 30 minutes if at least 3 taxis arrive in that period. Let N be the number of taxi arrivals in a 30-minute period. The answer is

    \displaystyle \begin{aligned}P[N \ge 3]&=1-P[N=0] -P[N=1]-P[N=2]\\&=1-e^{-4}-e^{-4} \times 4-e^{-4} \times \frac{4^2}{2} \\&=1-0.2381 \\&=0.7619 \end{aligned}

_______________________________________________________________________________________________

Remarks

In this post, we present a view of the exponential distribution through the viewpoint of the Poisson process. Any counting process that satisfies the three axioms of a Poisson process has independent exponential waiting times between consecutive events and gamma waiting times between any two events. On the other hand, given a sequence of independent and identically distributed exponential interarrival times, a Poisson process can be derived. The memoryless property of the exponential distribution plays a central role in the interplay between the Poisson process and the exponential distribution.

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}

The gamma distribution from the point of view of a Poisson process

In the previous post, the gamma distribution is defined from the gamma function. This post shows that the gamma distribution can arise from a Poisson process.

_______________________________________________________________________________________________

The Poisson Process

Consider an experiment in which events that are of interest occur at random in a time interval. The goal here is to derive two families of random variables, one continuous and one discrete. Starting at time 0, record the time of the occurrence of the first event. Then record the time at which the second random event occurs and so on (these are the continuous random variables). Out of these measurements, we can derive discrete random variables by counting the number of random events in a fixed time interval.

The recording of times of the occurrences of the random events is like placing markings on a time line to denote the arrivals of the random events. We are interested in counting the number of markings in a fixed interval. We are also interested in measuring the length from the starting point to the first marking and to the second marking and so on. Because of this interpretation, the random process discussed here can also describe random events occurring along a spatial interval, i.e. intervals in terms of distance or volume or other spatial measurements.

A Poisson process is a random process described above in which several criteria are satisfied. We show that in a Poisson process, the number of occurrences of random events in a fixed time interval follows a Poisson distribution and the time until the nth random event follows a Gamma distribution.

A good example of a Poisson process is the well-known experiment in radioactivity conducted by Rutherford and Geiger in 1910. In this experiment, \alpha-particles were emitted from a polonium source and the number of \alpha-particles was counted during an interval of 7.5 seconds (2,608 such time intervals were observed). In these 2,608 intervals, a total of 10,097 particles were observed. Thus the mean count per period of 7.5 seconds is 10097 / 2608 = 3.87.

In the Rutherford and Geiger experiment in 1910, a random event is the observation of an \alpha-particle. The random events occur at an average of 3.87 per unit time interval (7.5 seconds).

One of the criteria in a Poisson process is that in a very short time interval, the chance of having more than one random event is essentially zero. So either one random event will occur or none will occur in a very short time interval. Considering the occurrence of a random event as a success, there is either a success or a failure in a very short time interval. So a very short time interval in a Poisson process can be regarded as a Bernoulli trial.

The second criterion is that the experiment remains constant over time. Specifically this means that the probability of a random event occurring in a given subinterval is proportional to the length of that subinterval and does not depend on where the subinterval is located in the original interval. Any counting process that satisfies this criterion is said to possess stationary increments. For example, in the 1910 radioactivity study, \alpha-particles were emitted at the rate of \lambda= 3.87 per 7.5 seconds. So the probability of one \alpha-particle being emitted from the radioactive source in a one-second interval is approximately 3.87/7.5 = 0.516. Then the probability of observing one \alpha-particle in a half-second interval is approximately 0.516/2 = 0.258. For a quarter-second interval, the probability is approximately 0.258/2 = 0.129. So if we observe half as long, it will be half as likely to observe the occurrence of a random event. On the other hand, it does not matter where the quarter-second subinterval is, whether at the beginning or toward the end of the original interval of 7.5 seconds.

The third criterion is that non-overlapping subintervals are mutually independent in the sense that what happens in one subinterval (i.e. the occurrence or non-occurrence of a random event) will have no influence on the occurrence of a random event in another subinterval. Any counting process that satisfies this criterion is said to possess independent increments. In the Rutherford and Geiger experiment, the observation of one particle in one half-second period does not imply that a particle will necessarily be observed in the next half-second.

To summarize, the following are the three criteria of a Poisson process:

    Suppose that on average \lambda random events occur in a time interval of length 1.

    1. The probability of having more than one random event occurring in a very short time interval is essentially zero.
    2. For a very short subinterval of length \frac{1}{n} where n is a sufficiently large integer, the probability of a random event occurring in this subinterval is \frac{\lambda}{n}.
    3. The numbers of random events occurring in non-overlapping time intervals are independent.

_______________________________________________________________________________________________

The Poisson Distribution

We are now ready to derive the Poisson distribution from a Poisson random process.

Consider random events generated in a Poisson process and let Y be the number of random events observed in a unit time interval. Break up the unit time interval into n non-overlapping subintervals of equal size where n is a large integer. Each subinterval can have one or no random event. The probability of one random event in a subinterval is \lambda/n. The subintervals are independent. In other words, the three criteria of a Poisson process described above ensure that the n subintervals are independent Bernoulli trials. As a result, the number of events occurring in these n subintervals has a binomial distribution with n trials and probability of success \lambda/n. This binomial distribution is an approximation of the random variable Y. As n increases, the binomial approximation becomes finer and finer. The limiting distribution is a Poisson distribution, which coincides with the distribution of Y. The fact that the Poisson distribution is the limiting case of the binomial distribution is discussed here and here.

It follows that Y follows the Poisson distribution with mean \lambda. The following is the probability function.

    \displaystyle P(Y=y)=\frac{e^{-\lambda} \ \lambda^y}{y!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots

In the 1910 radioactivity study, the number of \alpha-particles observed in a 7.5-second period has a Poisson distribution with the mean of \lambda=3.87 particles per 7.5 seconds.

Sometimes it may be necessary to count the random events not in a unit time interval but in a smaller or larger time interval of length t. In a sense, the new unit of time is t and the new average rate of the Poisson process is then \lambda t. The same limiting argument with finer and finer binomial distributions again leads to a Poisson distribution. Let Y_t be the number of occurrences of the random events in a time interval of length t. Then Y_t follows a Poisson distribution with mean \lambda t. The following is the probability function.

    \displaystyle P(Y_t=y)=\frac{e^{-\lambda t} \ (\lambda t)^y}{y!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots

In the 1910 radioactivity study, the number of \alpha-particles observed in a 3.75-second period has a Poisson distribution with the mean of \lambda=3.87/2=1.935 particles per 3.75 seconds.

_______________________________________________________________________________________________

The Gamma Distribution as Derived from a Poisson Process

With the Poisson process and Poisson distribution properly set up and defined, we can now derive the gamma distribution. As before, we work with a Poisson process in which the random events arrive at an average rate of \lambda per unit time. Let W_1 be the waiting time until the occurrence of the first random event, W_2 be the waiting time until the occurrence of the second random event and so on. First examine the random variable W_1.

Consider the probability P(W_1>t). The event W_1>t means that the first random event takes place after time t. This means that there must be no occurrence of the random event in question from time 0 to time t. It follows that

    P(W_1>t)=P(Y_t=0)=e^{-\lambda t}

As a result, P(W_1 \le t)=1-e^{-\lambda t}, which is the cumulative distribution of W_1. Taking the derivative, the probability density function of W_1 is f_{W_1}(t)=\lambda e^{-\lambda t}. This is the density function of the exponential distribution with mean \frac{1}{\lambda}. Recall that \lambda is the rate of the Poisson process, i.e. the random events arrive at the mean rate of \lambda per unit time. Then the mean time between two consecutive events is \frac{1}{\lambda}.

Consider the probability P(W_2>t). The event W_2>t means that the second random event takes place after time t. This means that there can be at most one occurrence of the random events in question from time 0 to time t. It follows that

    P(W_2>t)=P(Y_t=0)+P(Y_t=1)=e^{-\lambda t}+\lambda t \ e^{-\lambda t}

As a result, P(W_2 \le t)=1-e^{-\lambda t}-\lambda t \ e^{-\lambda t}, which is the cdf of the waiting time W_2. Taking the derivative, the probability density function of W_2 is f_{W_2}(t)=\lambda^2 \ t \ e^{-\lambda t}. This is the density function of the gamma distribution with shape parameter 2 and rate parameter \lambda.

By the same reasoning, the waiting time until the n^{th} random event, W_n, follows a gamma distribution with shape parameter n and rate parameter \lambda. The survival function, cdf and the density function are:

    \displaystyle P(W_n>t)=\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}

    \displaystyle P(W_n \le t)=1-\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}=\sum \limits_{k=n}^{\infty} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}

    \displaystyle f_{W_n}(t)=\frac{1}{(n-1)!} \ \lambda^n \ t^{n-1} \ e^{-\lambda t}

The survival function P(W_n>t) is identical to P(Y_t \le n-1). The equivalence is through the translation: the event W_n>t is equivalent to the event that there can be at most n-1 random events occurring from time 0 to time t.

Example 1
Let’s have a quick example of calculation for the gamma distribution. In the study by Rutherford and Geiger in 1910, the average rate of arrivals of \alpha-particles is 3.87 per 7.5-second period, giving the average rate of 0.516 particles per second. On average, it takes 0.516^{-1} = 1.94 seconds to wait for the next particle. The probability that it takes more than 3 seconds of waiting time for the first particle to arrive is e^{-0.516 (3)} = 0.213.

How long would it take to wait for the second particle? On average it would take 2 \cdot 0.516^{-1} = 3.88 seconds. The probability that it takes more than 5 seconds of waiting time for the second particle to arrive is

    e^{-0.516 (5)}+0.516 \cdot 5 e^{-0.516 (5)}=3.58 \cdot e^{-2.58} = 0.271
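
The two waiting-time probabilities in Example 1 can be reproduced with a few lines of code. The following Python sketch simply evaluates the survival function of the waiting time for shape 1 and shape 2 at the stated rate.

    # The waiting-time probabilities in Example 1, evaluated in code.
    import math

    lam = 3.87 / 7.5                            # 0.516 particles per second
    print(1 / lam)                              # mean wait for the next particle, about 1.94 sec
    print(math.exp(-lam * 3))                   # P(wait for 1st particle > 3 sec), about 0.213
    print(math.exp(-lam * 5) * (1 + lam * 5))   # P(wait for 2nd particle > 5 sec), about 0.271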

_______________________________________________________________________________________________

Remarks

The above discussion shows that the gamma distribution arises naturally from a Poisson process, a random experiment that satisfies three assumptions that deal with independence and uniformity in time. The gamma distribution derived from a Poisson process has two parameters n and \lambda where n is a positive integer and is the shape parameter and \lambda is the rate parameter. If the random variable W follows this distribution, its pdf is:

    \displaystyle f_W(w)=\frac{1}{(n-1)!} \ \lambda^n \ w^{n-1} \ e^{-\lambda w} \ \ \ \ \ \ \ \ \ \ \ \ w>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

The above pdf can be interpreted as the density function for the waiting time until the arrival of the nth random event in a Poisson process with an average rate of arrivals at \lambda per unit time. The density function may be derived from an actual Poisson process or it may be just describing some random quantity that has nothing to do with any Poisson process. But the Poisson process interpretation is still useful. One advantage of the Poisson interpretation is that the survival function and the cdf would have an expression in closed form.

    \displaystyle P(W>w)=\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda w} \ (\lambda w)^k}{k!} \ \ \ \ \ \ \ \ \ \ \ \ w>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

    \displaystyle \begin{aligned} P(W \le w)&=1-\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda w} \ (\lambda w)^k}{k!} \\&=\sum \limits_{k=n}^{\infty} \frac{e^{-\lambda w} \ (\lambda w)^k}{k!} \ \ \ \ \ \ \ \ \ \ \ \ w>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

In the Poisson process interpretation, P(W>w) is the probability that the nth random event occurs after time w. This means that in the interval (0, w), there are at most n-1 random events. Thus the gamma survival function is identical to the cdf of a Poisson distribution. Even when W is simply a model of some random quantity that has nothing to do with a Poisson process, such interpretation can still be used to derive the survival function and the cdf of such a gamma distribution.

The gamma distribution described in the density function (1) has a shape parameter that is a positive integer. This special case of the gamma distribution sometimes goes by the name of the Erlang distribution and is important in queuing theory.

In general the shape parameter does not have to be an integer; it can be any positive real number. For the more general gamma distribution, see the previous post.

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}