# The (a,b,1) class

This post is a continuation of the preceding post on (a,b,0) class. This post introduces the class of discrete discrete distributions called the (a,b,1) class.

The discussion in this post has a great deal of technical details. A concise summary of the (a,b,0) class and (a,b,1) class is found here.

The (a,b,1) Class

A counting distribution is a discrete probability distribution that takes on the non-negative integers. Let $N$ be a random variable that is a counting distribution. For each integer $k=0,1,2,\cdots$, let $P_k=P[N=k]$ if $N$ is the counting distribution being considered. Recall that a counting distribution is a member of the (a,b,0) class of distributions if the following recursive relation holds for some constants $a$ and $b$.

(1)……….$\displaystyle \frac{P_k}{P_{k-1}}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

For a member of the (a,b,0) class, the initial probability $P_0$ is fixed (the sum of all the $P_k$ must sum to 1). Thus a member of the (a,b,0) class has two parameters, namely $a$ and $b$ in the recursive relation (1). A counting distribution is a member of the (a,b,1) class of distributions if the following recursive relation holds for some constants $a$ and $b$.

(2)……….$\displaystyle \frac{P_k}{P_{k-1}}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=2,3,4 \cdots$

The recursion in the (a,b,1) class begins at $k=2$. That means that the initial probability $P_0$ must be an assumed value. The probability $P_1$ is then the value such that the sum $P_1+P_2+\cdots$ is $1-P_0$. Thus a member of the (a,b,1) class has three parameters: $a$, $b$ and $P_0$.

There are two subclasses in the (a,b,1) class of distributions. They are determined by whether $P_0=0$ or $P_0>0$ The (a,b,1) distributions in the first category are called the zero-truncated distributions. The (a,b,1) distributions in the second category are called zero-modified distributions.

This is how we will proceed. Using a given distribution in the (a,b,0) class as a starting point, we show how to derive the zero-truncated distribution. From a given zero-truncated distribution, we show how to derive the zero-modified distribution.

The name of the (a,b,1) distribution has the same (a,b,0) name with either zero-truncated or zero-modified as the prefix. For example, if the starting point is the negative binomial distribution in the (a,b,0) class, then the derived distributions in the (a.b.1) class are the zero-truncated negative binomial distribution and the zero-modified negative binomial distribution.

There are only three distributions in the (a,b,0) class – Poisson, binomial and negative binomial. Then the (a,b,1) class contains the zero-truncated and zero-modified versions of these three distributions. However, the (a,b,1) class contained distributions that are not modifications of the (a,b,0) distributions. We discuss three such additional distributions – extended truncated negative binomial (ETNB) distribution, logarithmic distribution and Sibuya distribution. These three distributions that are not derived from an (a,b,0) distribution are discussed in a separate section below.

We present three examples demonstrating how using a base (a,b,0) negative binomial distribution to derive a zero-truncated negative binomial distribution (Example 1) and a zero-modified negative binomial distribution (Example 2). We also give an example for ETNB distribution (Example 3).

Notations

To facilitate discussion, let’s fix some notations. To clearly denote the distributions, notations without superscripts and subscripts refer to the (a,b,0) distributions. Notations with the superscript T (or subscript T) refer to the zero-truncated distributions in the (a,b,1) class. Likewise notations with the superscript M (or subscript M) refer to the zero-modified distributions in the (a,b,1) class.

For example, the following are the probability function (pf) and the probability generating function (pgf) of a distribution from the (a,b,0) class.

(3)……….\displaystyle \begin{aligned}&P_k=P[N=k] \\&P(z)=\sum \limits_{k=0}^\infty P_k z^k \end{aligned}

The following shows the notations for the pf and pgf for a zero-truncated distribution from the (a,b,1) class.

(4)……….\displaystyle \begin{aligned}&P_k^T=P[N_T=k] \\&P^T(z)=\sum \limits_{k=1}^\infty P_k^T z^k \end{aligned}

The following shows the notations for the pf and pgf for a zero-modified distribution from the (a,b,1) class.

(5)……….\displaystyle \begin{aligned}&P_k^M=P[N_M=k] \\&P^M(z)=\sum \limits_{k=0}^\infty P_k^M z^k \end{aligned}

Whenever it is convenient to do so, $N$ is a random variable from (a,b,0) while $N_T$ and $N_M$ are to denote random variables for the zero-truncated distribution and zero-modified distribution, respectively.

Zero-Truncated Distributions

The focus in this section is on the zero-truncated distributions that originate from the (a,b,0) class. The three distributions indicated above (ETNB, logarithmic and Shibuya) are discussed in a separate section below.

Suppose we start with a distribution from the (a,b,0) class, with the notations $P_k$, $P(z)$ and $N$ as indicated above. We show how to derive the corresponding zero-truncated distribution in the (a,b,1) class. For the zero-truncated distribution, there are two ways to compute probabilities. One is the recursion relation:

(6)……….$\displaystyle \frac{P_k^T}{P_{k-1}^T}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=2,3,4 \cdots$

The recursion relation (6) is identical to the one in (2). The recursion begins at $k=2$. The zero-truncated probabilities can also be derived from the (a,b,0) probabilities as follows:

(7)……….$\displaystyle P_k^T=\frac{1}{1-P_0} P_k \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,4 \cdots$

The probabilities $P_k^T$ in (7) can be regarded as conditional probabilities – the probability that $N=k$ given that $N>0$. From a procedural standpoint, the probabilities $P_k^T$ are the (a,b,0) probabilities $P_k$ multiplied by $1/(1-P_0)$ to make the probabilities sum to 1. With the probabilities established, the probability generating function (pgf) and mean and moments of the zero-truncated distribution can also be expressed in terms of the corresponding quantities of the (a,b,0) distribution.

(8)……….$\displaystyle P^T(z)=\sum \limits_{k=1}^\infty P_k^T \ z^k =\frac{1}{1-P_0} \ [P(z)-P_0]$

(9)……….$\displaystyle E[N_T]=\frac{1}{1-P_0} \ E[N]$

(10)……..$\displaystyle E[N_T^2]=\frac{1}{1-P_0} \ E[N^2]$

(11)……..\displaystyle \begin{aligned} Var[N_T]&=\frac{1}{1-P_0} \ E[N^2]-\biggl( \frac{1}{1-P_0} \ E[N]\biggr)^2 \\&=\frac{1}{1-P_0} \ Var[N]+\biggl(1-\frac{1}{1-P_0} \biggr) \ \frac{1}{1-P_0} \ E[N]^2 \end{aligned}

The goal of the above items is to inform on the zero-truncated distribution based on information from the (a,b,0) distribution. They can also be derived based on definitions using the probability function (7). The following shows the factorial means of the zero-truncated distribution.

(12)……..$\displaystyle \mu_{(1)}=E[N_T]=\frac{a+b}{(1-a) (1-P_0)}$
……..

(13)……..\displaystyle \begin{aligned} \mu_{(j)}&=E \{ N_T \ [N_T-1] \ [N_T-2] \cdots [N_T-(j-1)] \}\\&\text{ } \\&=\frac{(a j+b) \ \mu_{(j-1)}}{1-a} \end{aligned}

The first factorial mean is identical to the mean of $N_T$. The $P_0$ in $\mu_{(1)}$ is the value of zero probability for the corresponding member in the (a,b,0) class. The higher factorial moments are derived recursively as in the (a,b,0) case. The raw moments $E[N_T^k]$ can be derived using the factorial moments. The variance, as derived from the factorial moments, is:

(14)……..$\displaystyle Var[N_T]=\frac{(a+b) \ [1-(a+b+1) \ P_0]}{[(1-a) \ (1-P_0) ]^2}$

Example 1
It is helpful to go through an example. First, we set up an (a,b,0) distribution – an negative binomial distribution with parameters $r=2$ and $\theta=3$. The (a,b,0) parameters are $a=3/4$ and $b=3/4$. The following gives the pf and the recursive relation for this negative binomial distribution, as well as the mean, variance and pgf.

……….$\displaystyle P_k=(1+k) \ \frac{1}{16} \ \biggl(\frac{3}{4} \biggr)^k \ \ \ \ \ \ \ \ k=0,1,2,3,\cdots$

……….$\displaystyle P_k=\biggl(\frac{3}{4}+\frac{3}{4} \ \frac{1}{k} \biggr) \ P_{k-1} \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

……….$\displaystyle E[N]=6$

……….$\displaystyle Var[N]=24$

……….$\displaystyle P(z)=[1-3 \ (z-1)]^{-2}$

The following table shows the first 5 probabilities for the zero-truncated negative binomial distribution.

Table – Zero-Truncated Negative Binomial

$\bold k$ (a,b,0) $\bold P_{\bold k}$ Zero-Truncated $\bold P_{\bold k}^{\bold T}$
0 $\displaystyle \frac{1}{16}$
1 $\displaystyle \frac{3}{32}$ $\displaystyle \frac{3}{30}$
2 $\displaystyle \frac{27}{256}$ $\displaystyle \frac{27}{240}$
3 $\displaystyle \frac{27}{256}$ $\displaystyle \frac{27}{240}$
4 $\displaystyle \frac{405}{4096}$ $\displaystyle \frac{405}{3840}$
5 $\displaystyle \frac{729}{8192}$ $\displaystyle \frac{729}{7680}$

The probabilities $P_k$ are generated by either the (a,b,0) pf or the recursive relation. The probabilities $P_k^T$ are generated by the recursive relation (7) or by the recursive relation (6). The following lists the mean, variance and pgf of the zero-truncated negative binomial example.

……….$\displaystyle E[N_T]=\frac{16}{15} \ 6=\frac{32}{5}=6.4$

……….$\displaystyle Var[N_T]=\frac{576}{25}=23.04$

……….$\displaystyle P^T(z)=\frac{16}{15} \biggl( [1-3 \ (z-1)]^{-2}-\frac{1}{16} \biggr)$

Zero-Modified Distributions

The goal of this section is to derive a zero-modified distribution from a zero-truncated distribution, either derived from an (a,b,0) distribution as discussed in the preceding section, or a truncated distribution not originated from (a,b,0) class.

We now take a zero-truncated distribution as a given and derive the probabilities $P_k^M$ and other distributional quantities. As in the case of zero-truncated distribution, one way to generate probabilities is through the recursion process:

(15)……..$\displaystyle \frac{P_k^M}{P_{k-1}^M}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=2,3,4 \cdots$

The probability $P_0^M>0$ is an assumed value. The probability $P_1^M$ is the value that ensures that all the probabilities sum to 1. As indicated the recursion begins at $k=2$. Another way to calculate probabilities is through the zero-truncated distribution:

(16)……..$\displaystyle P_k^M=(1-P_0^M) \ P_k^T \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,4 \cdots$

Of course, if the zero-truncated distribution is based on a distribution from the (a,b,0) class, we can express the zero-modified probabilities as, after plugging (7) into (16):

(17)……..$\displaystyle P_k^M=\frac{1-P_0^M}{1-P_0} \ P_k \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,4 \cdots$

Further distributional quantities can now be derived:

(18)……..$\displaystyle P^M(z)=P_0^M \cdot 1+(1-P_0^M) \ P^T(z)$

(19)……..$\displaystyle E[N_M]=(1-P_0^M) \ E[N_T]$

(20)……..$\displaystyle Var[N_M]=(1-P_0^M) \ Var[N_T]+P_0^M \ (1-P_0^M) \ E[N_T]^2$

The result (18) is the pgf of the zero-modified distribution based on the pgf of the given zero-truncated distribution. In words, (19) says that the mean of the modified distribution is $1-P_0^M$ times the mean of the given zero-truncated distribution. In words, (20) says that the variance of the zero-modified distribution is $1-P_0^M$ times the variance of the given zero-truncated distribution plus $P_0^M (1-P_0^M)$ times the square of the mean of the truncated distribution.

If the given zero-truncated distribution is actually obtained from a member of the (a,b,0) class, then the above three results can be expressed in terms of (a,b,0) information, after plugging the corresponding information for $N_T$ into (18), (19) and (20).

(21)……..$\displaystyle P^M(z)=\biggl(1-\frac{1-P_0^M}{1-P_0} \biggr) \cdot 1+\frac{1-P_0^M}{1-P_0} \ P(z)$

(22)……..$\displaystyle E[N_M]=\frac{1-P_0^M}{1-P_0} \ E[N]$

(23)……..$\displaystyle Var[N_M]=\frac{1-P_0^M}{1-P_0} \ Var[N]+ \biggl(1-\frac{1-P_0^M}{1-P_0} \biggr) \ \frac{1-P_0^M}{1-P_0} \ E[N]^2$

Example 2
Consider the zero-truncated negative binomial distribution considered in Example 1. We now generate information on the corresponding zero-modified negative binomial distribution with the assumed value of $P_0^M=0.2$. The following table gives several probabilities.

Table – Zero-Modified Negative Binomial

$\bold k$ (a,b,0) $\bold P_{\bold k}$ Zero-Truncated $\bold P_{\bold k}^{\bold T}$ Zero-Modified $\bold P_{\bold k}^{\bold M}$
0 $\displaystyle \frac{1}{16}$ 0.2
1 $\displaystyle \frac{3}{32}$ $\displaystyle \frac{3}{30}$ $\displaystyle \frac{2.4}{30}$
2 $\displaystyle \frac{27}{256}$ $\displaystyle \frac{27}{240}$ $\displaystyle \frac{21.6}{240}$
3 $\displaystyle \frac{27}{256}$ $\displaystyle \frac{27}{240}$ $\displaystyle \frac{21.6}{240}$
4 $\displaystyle \frac{405}{4096}$ $\displaystyle \frac{405}{3840}$ $\displaystyle \frac{326.4}{3840}$
5 $\displaystyle \frac{729}{8192}$ $\displaystyle \frac{729}{7680}$ $\displaystyle \frac{583.2}{7680}$

The zero-modified probabilities $P_k^M$ are calculated according to (16). With the assumed value $P_0^M=0.2$, $1-P_0^M=0.8$. We simply multiply each zero-truncated probability by 0.8. Using (18), (19) and (20), we obtain the mean, variance and pgf of the zero-modified negative binomial example.

……….$\displaystyle E[N_M]=0.8 \ E[N_T]=0.8 (6.4)=5.12$

……….$\displaystyle Var[N_M]=24.9856$

……….$\displaystyle P^M(z)=0.2+0.8 \biggl[ \frac{16}{15} \biggl( [1-3 \ (z-1)]^{-2}-\frac{1}{16} \biggr) \biggr]$

As indicated earlier, the (a,b,1) class contains distributions other than the ones derived from the three (a,b,0) distributions. These distributions also have the zero-truncated versions as well as the zero-modified versions. We discuss the truncated versions. They are: the extended truncated negative binomial (ETNB) distribution, the logarithmic distribution and the Sibuya distribution. The extended truncated negative binomial (ETNB) distribution is resulted from relaxing the r parameter of the negative binomial distribution. The logarithmic distribution and Sibuya distribution are derived from the ETNB distribution. The modified versions of these three distributions can then be obtained by going through the process outlined in the preceding section.

ETNB
Recall that the (a,b,0) negative binomial distribution has two parameters $r$ and $\theta$. The following gives the parameters $a$ and $b$ used in the (a,b,0) recursion and the first two probabilities.

……….$\displaystyle a=\frac{\theta}{1+\theta} \ \ \ \ \ \ \ \ \ \ \ \ \ \ b=(r-1) \ \frac{\theta}{1+\theta} \ \ \ \ \ \ r>0, \ \theta >0$

……….$\displaystyle P_0=\biggl( \frac{1}{1+\theta} \biggr)^r \ \ \ \ \ \ P_1=r \ \biggl( \frac{1}{1+\theta} \biggr)^r \biggl( \frac{\theta}{1+\theta} \biggr)=\frac{r \ \theta}{(1+\theta)^{r+1}}$

The extended negative binomial distribution is resulted from extending the $r$ parameter so that $-1 is applicable in addition to the usual $r>0$. With the extension of $r$, the ETNB probabilities are generated according to the truncated probabilities of (7). In effect, we are pretending that we are starting from a base (a,b,0) negative binomial distribution even the $r$ parameter could be such that $-1. Thus the two parameters of the zero-truncated ETNB distribution are given by the following:

(24)……..$\displaystyle a=\frac{\theta}{1+\theta} \ \ \ \ \ \ b=(r-1) \ \frac{\theta}{1+\theta} \ \ \ \ \ \ -10$

What do we do with the ETNB parameters indicated in (24)? Using these $a$ and $b$, we can generate the “negative binomial” probabilities $P_k$ according to the recursive relation (1) with $P_0=[1/(1+\theta)]^r$. However, with $r$ being negative, these values of $P_k$ are not probabilities (in fact they are negative). However, the value of $1/(1-P_0)$ is also negative when $r$ is negative. Using (7), the zero-truncated probabilities $P_k^T$ are positive. Thus the “negative binomial” distribution using a negative $r$ is not really a distribution. It is just a device to define ETNB distribution.

Using the idea in the preceding paragraph, we can also come up with direct formula for the ETNB probabilities $P_k$. The following gives the first three probabilities.

……….$\displaystyle P_1^T=\frac{1}{1-P_0} \ P_1=\frac{1}{1-P_0} \ \frac{r \ \theta}{(1+\theta)^{r+1}}=\frac{r \theta}{(1+\theta)^{r+1}-(1+\theta) }$

……….\displaystyle \begin{aligned} P_2^T&=\frac{1}{1-P_0} \ P_2\\&=\frac{1}{1-P_0} \ \frac{r (r+1)}{2} \frac{1}{(1+\theta)^r} \ \biggl(\frac{\theta}{1+\theta} \biggr)^2 \\&=\frac{r (r+1)}{2} \ \frac{1}{(1+\theta)^r-1} \ \biggl(\frac{\theta}{1+\theta} \biggr)^2 \end{aligned}

……….\displaystyle \begin{aligned} P_3^T&=\frac{1}{1-P_0} \ P_3\\&=\frac{1}{1-P_0} \ \frac{r (r+1) (r+2)}{3!} \frac{1}{(1+\theta)^r} \ \biggl(\frac{\theta}{1+\theta} \biggr)^3 \\&=\frac{r (r+1) (r+2)}{3!} \ \frac{1}{(1+\theta)^r-1} \ \biggl(\frac{\theta}{1+\theta} \biggr)^3 \end{aligned}

Based on the pattern of the above three probabilities, the ENTB probability $P_k$, $k=1,2,3,\cdots$, is:

(25)……..$\displaystyle P_k=\frac{1}{1-P_0} \ P_k=\frac{r (r+1) \cdots (r+k-1)}{k!} \ \frac{1}{(1+\theta)^r-1} \ \biggl(\frac{\theta}{1+\theta} \biggr)^k$

All other distributional quantities such as pgf and means and higher moments can be derived based on the ETNB pf $P_k^T$ For example, the mean, variance and pgf are:

(26)……..$\displaystyle E[N_T]=\frac{1}{1-P_0} \ E[N]=\frac{1}{1-P_0} \ r \ \theta=\frac{r \ \theta}{1-(1+\theta)^{-r}}$

(27)……..\displaystyle \begin{aligned} Var[N_T]&=\frac{1}{1-P_0} \ Var[N]+\biggl(1-\frac{1}{1-P_0} \biggr) \ \frac{1}{1-P_0} \ E[N]^2 \\&=r \ \theta \ \frac{(1+\theta)-(1+\theta+ r \ \theta) \ (1+\theta)^{- r}}{[1-(1+\theta)^{-r}]^2} \end{aligned}

(28)……..\displaystyle \begin{aligned} P^T(z)&=\frac{1}{1-P_0} \ (P(z)-P_0) \\&=\frac{1}{1-(1+\theta)^{-r}} \ \biggl[ \biggl(1-\theta (z-1) \biggr)^{-r} -(1+\theta)^{-r}\biggr] \end{aligned}

Logarithmic Distribution
This is a truncated distribution that is derived from ETNB by letting $r \rightarrow 0$. The following shows the information that is needed for the recursive generation of probabilities.

(29)……..$\displaystyle a=\frac{\theta}{1+\theta} \ \ \ \ \ \ \ \ b=-\frac{\theta}{1+\theta}$

(30)……..$\displaystyle P_1^T=\frac{\theta}{(1+\theta) \ \ln(1+\theta)}$

The parameter $b$ is obtained by letting $r \rightarrow 0$ in the $b$ for ETNB. The logarithmic $P_1^T$ is from taking the limit of the ETNB $P_1^T$ as $r \rightarrow 0$ (using the L’Hopital’s rule). The rest of the pf $P_k$ for $k=2,3,\cdots$ can be generated from the recursive relation (6). Unlike a zero-truncated distribution that is derived from an (a,b,0) distribution, the distributional quantities of the logarithmic distribution cannot be derived from an (a,b,0) distribution. Thus in order to gain more information about the logarithmic, its pf must be used. The mean and variance for the logarithmic distribution are:

(31)……..$\displaystyle E[N_T]=\frac{\theta}{\ln(1+\theta) }$

(32)……..$\displaystyle Var[N_T]=\frac{\theta \biggl[1+\theta-\theta / \ln(1+\theta) \biggr]}{\ln(1+\theta)}$

This is a truncated distribution that is derived from ETNB by letting $\theta \rightarrow \infty$ and making $-1. The following shows the information that is needed for the recursive generation of probabilities.

(33)……..$\displaystyle a=1 \ \ \ \ \ \ \ \ b=r-1$

(34)……..$\displaystyle P_1^T=-r$

All of the three items are obtained by letting $\theta \rightarrow \infty$ in the corresponding items in ETNB. To see that $\displaystyle P_1^T=-r$, rewrite the ETNB $P_1^T$ as follows:

……….$\displaystyle P_1^T=\frac{r \theta}{(1+\theta)^{r+1}-(1+\theta)}=\frac{r \theta}{(1+\theta) \ [(1+\theta)^r-1]}=r \ \frac{\theta}{1+\theta} \ \frac{1}{(1+\theta)^r-1}$

As $\theta \rightarrow \infty$, the ratio $\theta / (1+\theta)$ goes to 1. As $\theta \rightarrow \infty$, $(1+\theta)^r$ goes to 0 because $r$ is negative. Thus the above $P_1^T$ goes to $-r$. With the $a$ and $b$ in (34) and the $P_1^T$ in (35), the rest of the Shibuya pf can be generated by the recursive relation in (6). Note that the mean does not exist for the Shibuya distribution. The following is the pgf of the Sibuya distribution.

(35)……..$\displaystyle P^T(z)=1-(1-z)^{-r}$

Once these three zero-truncated distributions are obtained, we can derive the zero-modified versions of these distributions in the process described earlier.

Example 3
We demonstrate how ETNB is calculated. Let $r=-\frac{1}{2}$ and $\theta=3$. Then the parameters for the “artificial” negative binomial distribution are:

……….$\displaystyle a=\frac{3}{4} \ \ \ \ \ \ \ \ b=\biggl(-\frac{1}{2}-1 \biggr) \frac{3}{4}=-\frac{9}{8}$

The $P_0$ for the artificial negative binomial distribution is $P_0=(1/4)^{-0.5}=2$, making $1/(1-P_0)=-1$. We generate the fake negative binomial probabilities recursively using the $a$ and $b$. Then we multiply by the $1/(1-P_0)=-1$ to get the zero-truncated probabilities according to (7).

Table – Zero-Truncated ETNB

$\bold k$ Artificial $\bold P_{\bold k}$ Zero-Truncated $\bold P_\bold k^{\bold T}$
0 $\displaystyle 2$
1 $\displaystyle -\frac{3}{4}$ $\displaystyle \frac{3}{4}$
2 $\displaystyle -\frac{9}{64}$ $\displaystyle \frac{9}{64}$
3 $\displaystyle -\frac{27}{512}$ $\displaystyle \frac{27}{512}$
4 $\displaystyle -\frac{405}{16384}$ $\displaystyle \frac{405}{16384}$
5 $\displaystyle -\frac{1701}{131072}$ $\displaystyle \frac{1701}{131072}$

The column labeled artificial $P_k$ is obviously not probabilities. It is generated recursively using $a=3/4$ and $b=-9/8$. Then multiply the column labeled artificial $P_k$ by $1/(1-P_0)=-1$ to obtain the ETNB probabilities, which can also be computed directly using (25).

Using (26) and (27), the ETNB mean and variance are $E[N_T]=3/2$ and $Var[N_T]=3/2$. With an assumed value of $P_0^M=0.1$, we generate the first 5 zero-modified ETNB probabilities in the following table.

Table – Zero-Modified ETNB

$\bold k$ Artificial $\bold P_{\bold k}$ Zero-Truncated $\bold P_\bold k^{\bold T}$ Zero-Modified $\bold P_\bold k^{\bold M}$
0 $\displaystyle 2$ 0.1
1 $\displaystyle -\frac{3}{4}$ $\displaystyle \frac{3}{4}$ $\displaystyle \frac{2.7}{4}$
2 $\displaystyle -\frac{9}{64}$ $\displaystyle \frac{9}{64}$ $\displaystyle \frac{8.1}{64}$
3 $\displaystyle -\frac{27}{512}$ $\displaystyle \frac{27}{512}$ $\displaystyle \frac{24.3}{512}$
4 $\displaystyle -\frac{405}{16384}$ $\displaystyle \frac{405}{16384}$ $\displaystyle \frac{364.5}{16384}$
5 $\displaystyle -\frac{1701}{131072}$ $\displaystyle \frac{1701}{131072}$ $\displaystyle \frac{1530.9}{131072}$

With the assumed value of $P_0^M=0.1$, the zero-modified probabilities are obtained by multiplying the zero-truncated probabilities by $1-P_0^M=0.9$. Using (19) and (20), the zero-modified ETNB mean and variance are: $E[N_M]=1.35$ and $Var[N_M]=1.5525$.

Practice Problems

The discussion in this post has a great deal of technical details. A concise summary of the (a,b,0) class and (a,b,1) class is found here.

Practice problems on (a,b,0) class

Practice problems on (a,b,1) class

Dan Ma actuarial topics
Dan Ma actuarial
Dan Ma math

Daniel Ma actuarial
Daniel Ma mathematics
Daniel Ma actuarial topics

$\copyright$ 2019 – Dan Ma

# The (a,b,0) class

This post introduces the class of discrete distributions called the (a,b,0) class.

A counting distribution is a discrete random variable that takes on values of non-negative integers 0,1,2, … Examples include the Poisson distribution, the binomial distribution and the negative binomial distribution (see here for a discussion). These distributions are potential models for the number of occurrences for some random events of interest, e.g. the number of losses in actuarial applications. The discussion below shows that the notion of (a,b,0) class is another way to describe the big three counting distributions of Poisson, binomial and negative binomial. The notion of (a,b,1) class is a generalization of the (a,b,0) class and is defined in a subsequent post.

The (a,b,0) Class

The (a,b,0) class is at heart a recursive algorithm to generate probabilities. Let’s fix some notations. Let $N$ be a counting random variable. For each $k=0,1,2,3,\cdots$, let $P_k=P(N=k)$. The counting random variable $N$ is said to be a member of the (a,b,0) class of distributions if for some constants $a$ and $b$ the following recursive relation holds

$\displaystyle (1) \ \ \ \ \ \frac{P_k}{P_{k-1}}=a + \frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

Note that the recursive relation (1) generates all the probabilities $P_k$ for all integers $k$ starting at 1. The relation (1) does not account for $P_0$. Does that mean that the initial probability $P_0$ can be any arbitrary probability value? Note that the recursive relation (1) means that each $P_k$ is ultimately expressed in terms of $P_0$.

$P_0=P_0$

$\displaystyle P_1=(a+b) P_0$

$\displaystyle P_2=\biggl(a+\frac{b}{2} \biggr) (a+b) P_0$

$\cdots$

$\displaystyle P_k=\biggl(a+\frac{b}{k} \biggr) \biggl(a+\frac{b}{k-1} \biggr) \cdots \biggl(a+\frac{b}{2} \biggr) (a+b) P_0$

$\cdots$

When $a$ and $b$ are fixed, the value of $P_0$ is also fixed since the probabilities must sum to 1. In fact $P_0$ is the following value.

$\displaystyle (2) \ \ \ \ \ P_0 =W^{-1}$

where $\displaystyle W=\biggl[ 1+(a+b)+\biggl(a+\frac{b}{2} \biggr)(a+b)+\cdots+ \biggl \{ \biggl(a+\frac{b}{k} \biggr) \cdots \biggl(a+\frac{b}{2} \biggr)(a+b) \biggr \} +\cdots \biggr]$

Thus a member of the (a,b,0) class has two parameters, namely $a$ and $b$, which completely determine the distribution.

Example 1
As an example, let $a=0$ and $b=\lambda$ where $\lambda>0$ is a fixed positive constant. Using (1), we see that

$\displaystyle P_1=\lambda \ P_0$
$\displaystyle P_2=\frac{1}{2!} \ \lambda^2 P_0$
$\displaystyle P_3=\frac{1}{3!} \ \lambda^3 P_0$
……..$\cdots$
……..$\cdots$
……..$\cdots$
$\displaystyle P_n=\frac{1}{n!} \lambda^n P_0$
……..$\cdots$
……..$\cdots$
……..$\cdots$

According to (2), $P_0=e^{-\lambda}$

\displaystyle \begin{aligned} P_0&=\biggl(1+ \lambda+\frac{1}{2!} \ \lambda^2 +\cdots+\frac{1}{n!} \lambda^n +\cdots \biggr)^{-1} \\&=(e^{\lambda})^{-1}\\&=e^{-\lambda} \end{aligned}

With $P_0=e^{-\lambda}$, the probabilities $P_n$ are from a Poisson distribution. Thus, when the parameter $a$ is 0, and the parameter $b$ is a positive constant, the corresponding distribution from the (a,b,0) class is a Poisson distribution.

Only Three Members in the (a,b,0) Class

In essence, the (a,b,0) class has only three members, namely the big 3 discrete distributions – the Poisson distribution, the binomial distribution and the negative binomial distribution, with each distribution represented by a different sign of the parameter $a$. Using the recursive relation (1), it can be shown that each of the big three distributions belongs to the (a,b,0) class. The following table shows the parameters $a$ and $b$ in the three cases.

Table 1

Distribution Usual Parameters Parameter a Parameter b
Poisson $\lambda$ 0 $\lambda$
Binomial $n$ and $p$ $\displaystyle -\frac{p}{1-p}$ $\displaystyle (n+1) \ \frac{p}{1-p}$
Negative binomial $r$ and $p$ $1-p$ $(r-1) \ (1-p)$
Negative binomial $r$ and $\theta$ $\displaystyle \frac{\theta}{1+\theta}$ $\displaystyle (r-1) \ \frac{\theta}{1+\theta}$
Geometric $p$ $1-p$ 0
Geometric $\theta$ $\displaystyle \frac{\theta}{1+\theta}$ 0

Table 1 shows how to parametrize the three distributions. For example, for the binomial distribution with parameters $n$ (the number of trials) and $p$ (the probability of success), the (a,b,0) parameters are $a=-p/(1-p)$ and $b=-(n+1) a$. The two rows for negative binomial reflect two different parametrizations. Of course, the geometric distribution is simply a negative binomial distribution when the parameter $r=1$. Essentially Table 1 consists of three different distributions.

Table 1 works in the opposite direction as well. Any set of (a,b,0) parameters $a$ and $b$ must fit into one of the distributions listed in Table 1. In other words, the recursive relation (1) produces no new counting distribution. Any counting distribution satisfying (1) must be one of the big 3 counting distributions listed in Table 1.

Note that under the recursive relation (1), not all combinations of $a$ and $b$ will make a probability distribution. For example, when both $a$ and $b$ are negative constants, the resulting probabilities $P_k$ are negative for odd $k$. When $a+b<0$, the resulting probabilities $P_k$ cannot be reliably positive in all instances. When $a+b=0$, $P_0=1$, i.e. the distribution is a point mass at 0. So we would like to restrict the attention on the case where $a+b>0$.

To echo the point made previously, it is the case that when $a+b>0$ and when the recursive relation (1) produces a viable probability distribution, the resulting distribution must be one of the three distributions listed in Table 1. This point is not entirely obvious. Any interested reader can see chapter 6 of [1].

Table 1 indicates that the sign of the parameter $a$ determines the form of the (a,b,0) distribution. If $a=0$, it is a Poisson distribution. If $a$ is negative, it is a binomial distribution. If $a$ is positive, it is a negative binomial distribution.

Examples

We now present a few more examples illustrating the working of the (a,b,0) recursive relation.

Example 2
This example illustrates that knowing three consecutive probabilities of a member of the (a,b,0) class determines the entire distribution. For example, suppose we know that

$P_1=0.0567$
$P_2=0.07938$
$P_3=0.09261$

These three consecutive probabilities produce the following two linear equations of $a$ and $b$.

$\displaystyle \frac{P_2}{P_1}=\frac{0.07938}{0.0567}=a+\frac{b}{2}$
$\displaystyle \frac{P_3}{P_2}=\frac{0.09261}{0.07938}=a+\frac{b}{3}$

Solving these two linear equations produces $a=0.7$ and $b=1.4$. Since $a$ is positive, this is a negative binomial distribution. The corresponding negative binomial parameters are $r=3$ and $p=0.3$. With this information, the (a,b,0) distribution in question is completely determined. The following are the several distributional quantities.

$P_0=0.3^3=0.027$
$P_4=(0.7+\frac{1.4}{4}) \ P_3=0.0972405$
$\displaystyle E(N)=r \frac{1-p}{p}=3 \frac{0.7}{0.3}=7$
$\displaystyle Var(N)=r \frac{1-p}{p^2}=3 \frac{0.7}{0.3^2}=\frac{7}{0.3}=23.3333$

Example 3
Actually any three given probabilities determine the entire (a,b,0) distribution. They do not have to be consecutive. Suppose we are given the following probabilities.

$P_1=0.33554432$
$P_2=0.29360128$
$P_4=0.0458752$

Applying the recursive relation (1) produces the following equations.

$\displaystyle \frac{P_2}{P_1}=\frac{0.29360128}{0.33554432}=a+\frac{b}{2}$
$\displaystyle \frac{P_3}{P_2}=\frac{P_3}{0.29360128}=a+\frac{b}{3}$
$\displaystyle \frac{P_4}{P_3}=\frac{0.0458752}{P_3}=a+\frac{b}{4}$

The above 3 equations lead to the following two equations.

$\displaystyle \frac{P_2}{P_1}=\frac{0.29360128}{0.33554432}=a+\frac{b}{2}$
$\displaystyle P_4=0.0458752=\biggl(a+\frac{b}{4} \biggr) \biggl(a+\frac{b}{3} \biggr) \ 0.29360128$

Of the above two equations, one is a linear equation and one is a quadratic equation. Solving these two equations produces $a=-0.25$ and $b=2.25$. Since $a$ is negative, this is a binomial distribution. Using the translation in Table 1 gives the following equations.

$\displaystyle -\frac{p}{1-p}=-0.25 \ \ \ \ \ \ \ \ \ \ \ (n+1) \ \frac{p}{1-p}=2.25$

Solving these equations gives $n=8$ and $p=0.2$. The (a,b,0) distribution in question is then completely determined.

Factorial Moments

Another distributional quantity that can give insight into the (a,b,0) class is the factorial moment. For any random variable $X$, its $n$th factorial moment is

$(3) \ \ \ \ \ \mu_{(n)}=E[X (X-1) (X-2) \cdots (X-(n-1))]$

For example, the first three factorial moments are:

$\mu_{(1)}=E[X]$

$\mu_{(2)}=E[X (X-1)]$

$\mu_{(3)}=E[X (X-1) (X-2)]$

For any member of the (a,b,0) class with parameters $a$ and $b$, the first factorial moment is:

$\displaystyle (4) \ \ \ \ \ \mu_{(1)}=\frac{a+b}{1-a}$

The higher (a,b,0) factorial moments can be obtained recursively as follows:

$\displaystyle (5) \ \ \ \ \ \mu_{(j)}=\frac{a j +b}{1-a} \ \mu_{(j-1)} \ \ \ \ \ \ \ j \ge 2$

The recursive formula (5) is a good way to determine the raw moments of the member of the (a,b,0) class. For example, the following calculate the second raw moment and the variance of the random variable $N$, assumed to be a member of the (a,b,0) class with parameters $a$ and $b$.

$\displaystyle \mu_{(1)}=E[N]=\frac{a+b}{1-a}$

$\displaystyle \mu_{(2)}=\frac{2 a +b}{1-a} \ \frac{a+b}{1-a}=\frac{(2a+b) (a+b)}{(1-a)^2}=E[N (N-1)]=E[N^2]-E[N]$

$\displaystyle E[N^2]=\frac{(2a+b) (a+b)}{(1-a)^2}+\frac{a+b}{1-a}=\frac{(a+b) (a+b+1)}{(1-a)^2}$

$\displaystyle Var(N)=E[N^2]-E[N]^2=\frac{a+b}{(1-a)^2}$

One interesting characteristic of the (a,b,0) class is that knowing limited distributional information determines the distribution. Example 2 and Example 3 show that knowing three point masses completely determines the (a,b,0) distribution. The above derivation shows that knowing the mean and the variance also completely determines the (a,b,0) distribution.

Fitting (a,b,0) Distributions

If the (a,b,0) recursive formula in (1) generates no new distributions, why study (a,b,0) class and why not just focus on Poisson, binomial and negative binomial distribution individually? One reason for studying the recursive (a,b,0) formula is that it gives a graphical way to choose an appropriate member of the (a,b,0) class. To see this, rewrite (1) as follows:

$\displaystyle (6) \ \ \ \ \ k \ \frac{P_k}{P_{k-1}}=a k+ b \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

Note that the quantity on the right side of (6) is a linear function of the integers $k$. If we plot the left hand side quantity of (6) with $k$ on the x-axis, the plot should be a linear one with the slope being the parameter $a$ and the y-intercept being the parameter $b$ (of course assuming it is an (a,b,0) distribution).

The relation (6) is a way to quickly determine whether a given sample is taken from a member of the (a,b,0) class. To do this, calculate the ratio of two consecutive data categories times $k$. In other words, compute ratio such as the following for values of $k$:

$\displaystyle (7) \ \ \ \ \ k \ \frac{\hat{P}_k}{\hat{P}_{k-1}}=k \ \frac{n_k}{n_{k-1}} \ \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

where $n_k$ is the observed frequency for the category $k$. The ratio of $n_k$ to $n_{k-1}$ multiplied by $k$ is a stand-in for the left hand side of (6). Then plot these values against $k$. A linear trend that is observed in the graph is evidence that the data in the sample is taken from an (a,b,0) distribution.

The slope of the plotted line gives an indication of which (a,b,0) member to use. If the plot is approximately horizontal, then the Poisson model is appropriate. If the plot is a line with negative slope, then the binomial model is more appropriate. If the plot is approximately a line with positive slope, use the negative binomial model. For this approach to work properly, large observed data set is preferred.

The (a,b,1) Class

It is possible that the (a,b,0) distributions do not adequately describe a random counting phenomenon being observed. For example, the sample data may indicate that the probability at zero may be larger than is indicated by the distributions in the (a,b,0) class. One alternative is to assign a larger value for $P_0$ and recursively generate the subsequent probabilities $P_k$ for $k=2,3,\cdots$. The class of the distributions defined by this recursive scheme is called the (a,b,1) class, which is discussed in the next post.

Practice Problems

Practice problems on (a,b,0) class

Practice problems on (a,b,1) class

Reference

1. Panjer H. H., Wilmot G. E., Insurance Risk Models, Society of Actuaries, Chicago, 1992.

Dan Ma actuarial topics
Dan Ma actuarial
Dan Ma math

Daniel Ma actuarial
Daniel Ma mathematics
Daniel Ma actuarial topics

$\copyright$ 2018 – Dan Ma

Revised December 5, 2019

# The big 3 claim frequency models

We now turn the attention to discrete distributions. In particular the focus is on counting distributions. These are the discrete distributions that have positive probabilities only on the non-negative integers 0, 1, 2, 3, … One important application is on finding suitable counting distributions for modeling the number of losses or the number claims to an insurer or more generally the number of other random events that are of interest in actuarial applications. From a claim perspective, these counting distributions would be models for claim frequency. Combining frequency models with models for claim severity would provide a more complete picture of the exposure of risks to the insurer than using claim severity alone. This post and several subsequent posts are preparation for the discussion on modeling aggregate losses and claims. Another upcoming topic would be the effect of insurance coverage modifications (e.g. deductibles) on the claim frequency and claim severity.

This post focuses on the three commonly used counting distributions – Poisson, binomial and negative binomial (the big 3). These three distributions are the basis for defining a large class of other counting distributions.

Probability Generating Function

Let $Y$ be a random variable with positive probabilities only on the non-negative integers, i.e. $P(Y=k)$ is positive only for $k=0,1,2,\cdots$. The function $P(Y=k)$ is the probability of the occurrence of the event $Y=k$, i.e. the observed value of the random variable $Y$ is $k$. It is called the probability function of the random variable $Y$ (also called probability mass function). From the probability function, many other distributional quantities can be derived, e.g. mean, variance and higher moments.

We can also elicit information about $Y$ from its generating function. The generating function (or probability generating function) of $Y$ is defined by:

$\displaystyle P_Y(z)=p_0+p_1 z +p_2 z^2 + \cdots=\sum \limits_{j=0}^\infty p_j z^j$

where each $p_j=P(Y=j)$. The generating function $P_Y(z)$ is defined wherever the infinite sum converges. At minimum, $P_Y(z)$ converges for $\lvert z \lvert \le 1$. Some $P_Y(z)$ converges for all real $z$, e.g. when $Y$ has a Poisson distribution (see below).

One reason for paying attention to generating function is that the moments of $Y$ can be generated from $P_Y(z)$. The $n$th moment of $Y$ is derived from the result of taking the $n$th derivative of $P_Y(z)$ and evaluating at $z=1$.

$\displaystyle E[Y (Y-1) (Y-2) \cdots (Y-(n-1)]=P_Y^{(n)}(1)$

The above expectation is said to be a factorial moment. It follows that $E(Y)=P_{Y}^{(1)}(1)$. Since $E[Y (Y-1)]=P_Y^{(2)}(1)$, the second moment is $E(Y^2)=P_{Y}^{(1)}(1)+P_Y^{(2)}(1)$. In general, the $n$th moment $E(Y^n)$ can be expressed in terms of $P_Y^{(k)}(1)$ for all $k \ge n$.

Another application of generating function is that $P_Y(z)$ encodes the probability function $P(Y=k)$, which is obtained by taking the derivatives of $P_Y(z)$ and then evaluated at $z=0$.

$\displaystyle P(Y=n)=\frac{P_{Y}^{(n)}(0)}{n!}$

where $n=0,1,2,3,\cdots$. Another useful property about generating function is that the probability distribution of a random variable is uniquely determined by its generating function. This fundamental property is useful in determining the distribution of an independent sum. The generating function of an independent sum of random variables is simply the product of the individual generating functions. If the product is the generating function of a certain distribution, then the independent sum must be of the same distribution.

For a more detailed discussion on probability generating function, see this blog post in a companion blog.

Poisson Distribution

We now describe the three counting distributions indicated at the beginning of the post. We start with the Poisson distribution. Consider a random variable $X$ that only takes on the non-negative integers. For each $k=0,1,2,\cdots$, let $p_k=P(X=k)$.

The random variable $X$ has a Poisson distribution if its probability function is:

$\displaystyle p_k=\frac{e^{-\lambda} \lambda^k}{k!} \ \ \ \ \ \ k=0,1,2,\cdots$

for some positive constant $\lambda$. This constant $\lambda$ is the parameter of the Poisson distribution in question. It is also the mean and variance of the Poisson distribution. The following is the probability generating function of the Poisson distribution.

$\displaystyle P_X(z)=e^{\lambda \ (z-1)}$

The Poisson generating function is defined for all real numbers $z$. The mean and variance and higher moments can be computed using the generating function.

$E(X)=P_X^{(1)}(1)=\lambda$

$E[X (X-1)]=P_X^{(2)}(1)=\lambda^2$

$E[X^2]=P_X^{(1)}(1)+P_X^{(2)}(1)=\lambda+\lambda^2$

$Var(X)=E(X^2)-E(X)^2=\lambda$

One interesting characteristic of the Poisson distribution is that its mean is the same the variance. From a mathematical standpoint, the Poisson distribution arises from the Poisson process (see a more detailed discussion here). Another discussion of the Poisson distribution is found here.

One useful characteristic of the Poisson distribution is that combining independent Poisson distributions results in another Poisson distribution. Suppose that $X_1,X_2,\cdots,X_n$ are independent Poisson random variables with means $\lambda_1,\lambda_2,\cdots,\lambda_n$, respectively. Then the probability generating function of the sum $X=X_1+X_2+\cdots +X_n$ is simply the product of the individual probability generating functions.

$\displaystyle P_X(z)=\prod \limits_{j=1}^n e^{\lambda_j (z-1)}=e^{\lambda \ (z-1)}$

where $\lambda=\lambda_1+\lambda_2+\cdots +\lambda_n$. The probability generating function of the sum $X$ is the generating function of a Poisson distribution. Thus independent sum of Poisson distributions is a Poisson distribution with parameter being the sum of the individual Poisson parameters.

Another useful property is that of splitting a Poisson distribution. For example, suppose that the number of claims $N$ in a given year follows a Poisson distribution with mean $\lambda$ per year. Also suppose that the claims can be classified into $W$ distinct types such that the probability of a claim being of type $i$ is $p_i$, $i=1,2,\cdots, W$ and such that $p_1+\cdots+p_W=1$. If we are interested in studying the number $N_i$ of claims in a year that are of type $i$, $i=1,2,\cdots,W$, then $N_1,N_2,\cdots,N_W$ are independent Poisson random variables with means $\lambda_1 p_1, \lambda_2 p_2,\cdots,\lambda_W p_W$, respectively. For a mathematical discussion of this Poisson splitting phenomenon, see this blog post in a companion blog.

Binomial Distribution

Consider a series of independent events each of which results in one of two distinct outcomes (one is called Success and the other Failure) in such a way that the probability of observing a Success in a trial is constant across all trials (these are called Bernoulli trials). For a binomial distribution, we are only interested in observing $n$ such trials and count the number of successes in these $n$ trials.

More specifically, let $p$ be the probability of observing a Success in a Bernoulli trial. Let $X$ be the number of Successes observed in $n$ independent trials. Then the random variable $X$ is said to have a binomial distribution with parameters $n$ and $p$.

Note that the random variable $X$ is the independent sum of $X_1,X_2,\cdots,X_n$ where $X_i$ is the number of Success in the $i$th Bernoulli trial. Thus $X_i$ is 1 with probability $p$ and is 0 with probability $1-p$. Its probability generating function would be:

$\displaystyle g(z)=(1-p) z^0 +p z^1=1-p+p z$

As a result, the probability generating function for $X$ would be $g(z)$ raised to $n$.

$\displaystyle P_X(z)=(1-p+p z)^n$

The generating function $P_X(z)$ is defined for all real values $z$. Differentiating $P_X(z)$ twice produces the mean and variance.

$E(X)=n p$

$E[X (X-1)]=n (n-1) p^2$

$Var(X)=n p (1-p)$

By differentiating $P_X(z)$ and evaluating at $z=0$, we obtain the probability function.

$\displaystyle P(X=k)=\frac{P_X^{(k)}(0)}{k!}=\binom{n}{k} p^k (1-p)^{n-k}$

where $k=0,1,2,\cdots,n$.

By taking the product of probability generating functions, it follows that the independent sum $Y=Y_1+Y_2+\cdots+Y_m$, where each $Y_i$ has a binomial distribution with parameters $n_i$ and $p$, has a binomial distribution with parameters $n=n_1+\cdots+n_m$ and $p$. In other words, as long as the probability of success $p$ is identical in the binomial distributions, the independent sum is always a binomial distribution.

Note that the variance of the binomial distribution is less than the mean. Thus the binomial distribution is suitable candidate for modeling frequency for the situations where the sample variance is smaller than the sample mean.

Negative Binomial Distribution

As mentioned above, the Poisson distribution requires that the mean and the variance are equaled. The binomial binomial distribution requires that the variance is smaller than the mean. Thus these two counting distributions are not appropriate in all cases. The negative binomial distribution is an excellent alternative to the Poisson distribution and the binomial distribution, especially in the cases where the observed variance is greater than the observed mean.

The negative binomial naturally arises from the same probability experiment that generates the binomial distribution. Consider a series of independent Bernoulli trials each of which results in one of two distinct outcomes (called success and failure) in such a way that the probability of success $p$ is constant across the trials. Instead of observing the outcomes in a fixed number of trials, we now observe the trials until $r$ number of success have occurred.

As we observe the Bernoulli trials, let’s $Y$ be the number of failures until the $r$th success has occurred. The random variable $Y$ has a negative binomial distribution with parameters $r$ and $p$. The parameter $r$ is necessarily a positive integer and the parameter $p$ is a real number between 0 and 1. The following is the probability function for the random variable $Y$.

\displaystyle \begin{aligned} P(Y=k)&=\binom{r+k-1}{k} \ p^r (1-p)^k \\&=\frac{(r+k-1)!}{k! \ (r-1)!} \ p^r (1-p)^k \ \ \ \ \ \ k=0,1,2,\cdots \\& \end{aligned}

In the above probability function, the parameter $r$ must be a positive integer. The binomial coefficient $\binom{r+k-1}{k}$ is computed by its usual definition. The above probability probability function can be relaxed so that $r$ can be any positive real number. The key to the relaxation is a reformulation of the binomial coefficient.

$\displaystyle \binom{n}{j}=\left\{ \begin{array}{ll} \displaystyle \frac{n (n-1) (n-2) \cdots (n-(j-1))}{j!} &\ n>j-1, j=1,2,3,\cdots \\ \text{ } & \text{ } \\ \displaystyle 1 &\ j=0 \end{array} \right.$

Note that in the above formulation, the $n$ in $\binom{n}{j}$ does not have to be an integer. If $n$ were to be a positive integer, the usual definition $\binom{n}{j}=\frac{n!}{j! (n-j)!}$ would lead to the same calculation. The reformulation is a generalization of the usual binomial coefficient definition.

With the new definition of binomial coefficient, the following is the probability function of the negative binomial distribution in the general case.

$\displaystyle P(Y=k)=\binom{r+k-1}{k} \ p^r (1-p)^k \ \ \ \ \ \ k=0,1,2,\cdots$

The following is the same probability function with the binomial coefficient explicitly written out.

$\displaystyle P(Y=k)=\left\{ \begin{array}{ll} \displaystyle p^r &\ k=0 \\ \text{ } & \text{ } \\ \displaystyle \frac{(k-1+r) \cdots (1+r) r}{k!} \ p^r \ (1-p)^k &\ k=1,2,\cdots \end{array} \right.$

For either of the above versions, the mean and variance are:

$\displaystyle E(Y)=\frac{r (1-p)}{p}$

$\displaystyle Var(Y)=\frac{r (1-p)}{p^2}$

Another formulation of the negative binomial distribution is that it is a Poisson-gamma mixture. The following is the probability function.

$\displaystyle P(Y=k)=\left\{ \begin{array}{ll} \displaystyle \biggl(\frac{1}{\beta+1} \biggr)^r &\ k=0 \\ \text{ } & \text{ } \\ \displaystyle \frac{(k-1+r) \cdots (1+r) r}{k!} \ \biggl(\frac{1}{\beta+1} \biggr)^r \biggl(\frac{\beta}{\beta+1} \biggr)^k &\ k=1,2,\cdots \end{array} \right.$

It is still a 2-parameter discrete distribution. The parameters $r$ and $\beta$ originate from the parameters of the gamma distribution in the Poisson-gamma mixture. The mean and variance are:

$\displaystyle E(Y) = r \ \beta$

$\displaystyle Var(Y)=r \ \beta \ (1+\beta)$

The negative binomial distribution has been discussed at length in blog posts in several companion blogs. For the natural interpretation of negative binomial distribution based on counting the number of failures until the $r$th success, see this blog post. This is an excellent introduction.

For the general version of the negative binomial distribution where the parameter $r$ can be any positive real number, see this blog post.

For the version of negative binomial distribution from a Poisson-Gamma mixture point of view, see this blog post.

This blog post has additional facts about the negative binomial distribution. This blog post summarizes the various versions as well as focusing on the calculation of probabilities.

More Counting Distributions

The three counting distributions – Poisson, binomial and negative binomial – provide a versatile tool kit in modeling the number of random events such as losses to the insured or claims to the insurer. The tool kit can be greatly expanded by modifying these three distributions to generate additional distributions. The new distributions belong to the (a,b,0) and (a,b,1) classes. This topic is discussed in the subsequent posts.

Dan Ma actuarial topics
Dan Ma actuarial
Dan Ma math

Daniel Ma actuarial
Daniel Ma mathematics
Daniel Ma actuarial topics

$\copyright$ 2018 – Dan Ma

# An introduction to risk measures

This post is an introduction to measures of risk. Examples of risk measures are considered. The post also focuses on coherent risk measures, which possess several desirable properties for risk measures.

In actuarial applications, an important focus is on developing loss distributions for insurance products. When the loss distribution is a probability based model, we can also use the model to derive a description of risk. Often times, a measure of risk is given by one number that describes the level of exposure to risk. One example of a measure of risk is value-at-risk (VaR), discussed here, which can quantify the likelihood of an adverse outcome. VaR can then be used to determine the amount of capital required to withstand such adverse outcomes. Actuaries and risk managers along with investors, regulators and rating agencies are particularly interested in using VaR and other measures of risk to quantify the ability of an enterprise to withstand various potential adverse events.

Two important examples of measures of risk – value-at-risk and tail-value-at-risk – have been discussed in this previous post.

Examples of Risk Measures

A measure of risk (or a risk measure) is a mapping that maps a given loss distribution to the real numbers. In a general discussion, we use the Greek letter $\rho$ to denote the mapping. For example, if $X$ is the random variable that describes the losses, then $\rho(X)$ is a number that is intended to quantify the risk exposure. We first look at some examples.

Some of the early risk measures in actuarial science were based on the so called premium principles. The purpose is to develop an appropriate premium to charge for a given risk. The following gives several risk measures based on premium principles.

$\rho(X)=E[X]$ ………………………………………………….(Equivalence Principle)

$\rho(X)=(1+k) \ E[X]$ ……………………………………….(Expected Value Principle)

$\rho(X)=E[X]+ k \ Var(X)$…………………………………(Variance Principle)

$\rho(X)=\mu+k \ \sqrt{Var(X)}$…………………………………..(Standard Deviation Principle)

Each case is a rule for assigning a premium to an insurance risk. In each of the last three cases, $k$ is a fixed constant where $k \ge 0$. The fixed constant $k$ reflects an adjustment that may include expenses and profit and/or a risk loading. The equivalence principle does not account for any risk loading; it simply charges the expected losses, i.e. the pure premium.

Example 2 (Standard Deviation Principle)
This example highlights the standard deviation premium principle:

$\rho(X)=\mu+k \ \sigma$

where $\mu=E[X]$ and $\sigma^2$ is the variance of $X$.

To shed light on the standard deviation principle, suppose that the loss $X$ has a normal distribution. The $\rho(X)$ resembles a quantile in the normal distribution where $k$ is a z-score. With $k=1.645$, $P(X>\mu+1.645 \ \sigma)=0.05$. This means that $\mu+1.645 \ \sigma$ is a threshold that indicates the level of adverse outcome such that the probability of exceeding this threshold is 5%. With $k=2.326$, $P(X>\mu+2.326 \ \sigma)=0.01$. The higher threshold $\mu+2.326 \ \sigma$ is the level of adverse outcome so that there is a 1% chance of exceeding it.

If the loss distribution $X$ is not a normal distribution, the coefficient $k$ would not be the same as the ones given for normal distribution. Whatever the loss distribution is, the fixed constant $k$ is usually chosen such that the probability of losses exceeding the risk measure equals to some pre-determined small probability level.

Example 3 (Value-at-Risk)
When the loss $X$ does not have a normal distribution, the same constants $k$ in Example 2 cannot be used. Instead, we can start with a small probability level $p$ (5%, 1%, 0.5% etc) and determine the $100p$th percentile. The corresponding $k$ can then be determined. This is the basic idea behind the risk measure of value-at-risk.

The risk measure of value-at-risk (VaR) is mathematically a percentile of the loss distribution. Given that $X$ represents the losses, $\text{VaR}_p(X)$ is the $100p$th percentile of $X$. The value-at-risk (VaR) discussed here is for gauging the exposure of risk with respect to insurance losses such that the probability of exceeding the threshold is $p$, the pre-determined security level. See this previous post for a more detailed discussion.

Example 4 (Tail-Value-at-Risk)
We now briefly discuss Tail-value-at-risk (TVaR). Suppose that $X$ is the random variable that models losses. As before we assume that $X$ takes on positive real numbers. The tail-value-at-risk at the $100p$% security level, denoted by $\text{TVaR}_p(X)$, is the expected loss on the condition that loss exceeds the 100pth percentile of $X$. The following is a more succinct way of describing it.

$\text{TVaR}_p(X)=E(X \lvert X > \pi_p)$

where $\pi_p=\text{VaR}_p(X)$. Tail-value-at-risk is a risk measure that is in many ways superior than VaR. The risk measure VaR is a merely a cutoff point and does not describe the tail behavior beyond the VaR threshold. TVaR reflects the shape of the tail beyond VaR threshold. See this previous post for a more detailed discussion.

Desirable Properties of Risk Measures

One important consideration when choosing risk measures concerns the combining of risk entities in a given company or enterprise. For example, an insurance company may have different divisions specializing in different insurance products and services. The results of the risk measure applied individually to the different divisions should be consistent to the results of the risk measure applied to the entire company. For example, it is desirable for the risk measure for two risks combined be no greater than the sum of the two results applied to two risks individually. The following is a list of four desirable properties for a risk measure. Any risk measure that satisfies these four properties is said to be a coherent risk measure.

Coherent Risk Measure
Let $X$ and $Y$ be two loss random variables. The risk measure $\rho$ is a coherent risk measure if it satisfies the following four properties.

1. $\rho(X+Y) \le \rho(X)+\rho(Y)$ ……………………………………….(Subadditivity)
2. If $X \le Y$ with probability 1, then $\rho(X) \le \rho(Y)$ ………………..(Monotonicity)
3. For any constant $c >0$, $\rho(c X)=c \ \rho(X)$ ………………………(Positive Homogeneity)
4. For any constant $c >0$, $\rho(X+c)=\rho(X)+c$ ……………….(Translation Invariance)

The first one, subadditivity, requires that for any two random loss variables $X$ and $Y$, the risk measure for $X+Y$ be no greater than the risk measures for $X$ and $Y$ separately. This is a sensible requirement for a risk measure. This reflects the notion that combining risks leads to diversification and thus to a reduction of total overall risk. Otherwise, a large enterprise would simply be broken into smaller entities in order to reduce risk.

The property of monotonicity is another sensible requirement. If the random loss $X$ is always less than the random loss $Y$, then it makes sense for the risk measure of $X$ be no greater than the risk measure of $Y$. For example, if the risk measure represents the surplus or reserve that is required to cover a random loss, then we would want the reserve for the lesser loss to be no greater than the reserve for a larger potential loss.

The property of positive homogeneity specifies that the risk measure of a constant multiple of the random loss should be the constant multiple of the risk measure. For example, expressing the loss random variable in another currency should indicate the same level of reserve or surplus.

Translation invariance says that the risk measure of combining a random loss and a fixed loss should be the risk measure of the random loss plus the fixed loss. The reserve to cover a fixed loss should be just the fixed loss.

A Closer Look at Examples

Now that we have a list of desirable properties a risk measure should have, which of the risk measures discussed above possess these properties? We comment on these risk measures one by one.

Example 5 (Equivalence Principle)
The risk measure based on the equivalence principle is a coherent risk measure since it satisfies all four properties. The properties of expected value derive the four properties of coherent measure.

• $\rho(X+Y)=E(X+Y)=E(X)+E(Y)=\rho(X)+\rho(Y)$
• If $X \le Y$ always, then $\rho(X)=E(X) \le E(Y)=\rho(Y)$.
• $\rho(cX)=E(c X)=c \ E(X)=c \ \rho(X)$
• $\rho(X+c)=E(X+c)=E(X)+c=\rho(X)+c$

Example 6 (Expected Value Principle)
For the four measures based on the premium principles, equivalence principle is the only one that is coherent. The risk measure based on the expected value principle fails translation invariance. Here’s the derivation.

• $\rho(X+Y)=(1+k) \ E(X+Y)=(1+k) \ E(X)+(1+k) \ E(Y)=\rho(X)+\rho(Y)$
• If $X \le Y$ always, then $\rho(X)=(1+k) \ E(X) \le (1+k) \ E(Y)=\rho(Y)$.
• $\rho(cX)=(1+k) \ E(c X)=c \ (1+k) \ E(X)=c \ \rho(X)$
• $\rho(X+c)=(1+k) \ E(X+c)=(1+k) \ E(X)+(1+k) \ c \ne \rho(X)+c$

Example 7 (Variance Principle)
Now, consider the risk measure based on the variance principle. Recall the risk measure based on the variance principle is of the form $\rho(L)=E(L)+k Var(L)$ where $L$ represents the random losses and $k$ is some positive constant. First we show that it does not satisfy the subadditivity property.

\displaystyle \begin{aligned} \rho(X+Y)&=E(X+Y)+k Var(X+Y) \\&=E(X)+E(Y)+k Var(X)+k Var(Y)+2 k Cov(X,Y) \\&=\rho(X)+\rho(Y)+2 k Cov(X,Y) \end{aligned}

When $Cov(X,Y)>0$, i.e. the risk $X$ and the risk $Y$ are positively correlated, we have $\rho(X+Y)>\rho(X)+\rho(Y)$. There is no benefit of risk reduction in combing risks that are positively correlated.

Next we demonstrate why the variance principle fails the monotonicity. Recall that $\rho$ satisfies the property of monotonicity if $\rho(X) \le \rho(Y)$ for any risks $X$ and $Y$ such that $X \le Y$ for all scenarios. For the risk measure $\rho(L)=E(L)+k Var(L)$, we show that for each $k>0$, we can find random variables $X$ and $Y$ such that $X \le Y$ always and $\rho(X)>\rho(Y)$. To this end, we define random variables $X_n$ and $Y$ for each positive integer $n$.

Let $n$ be any positive integer. Let $X_n$ be a random variable such that $P(X_n=-n)=0.1$ and $P(X_n=90)=0.9$. Let $Y$ be a constant value of 100, i.e. $P(Y=100)=1$. Then it is straightforward to verify that

$E(X_n)=81-0.1n$

$E(X_n^2)=7290+0.1n^2$

$Var(X_n)=7290+0.1n^2-(81-0.1n)^2=0.09 n^2+16.2 n+729$

$E(Y)=100$

$Var(Y)=0$

For $\rho(X_n)=81-0.1 n^2+k (0.09 n^2+16.2 n+729)>\rho(Y)=100$, the constant $k$ must satisfy the following.

$\displaystyle k>\frac{19+0.1 n}{0.09 n^2+16.2 n+729} \longrightarrow 0$ as $n \longrightarrow \infty$

Note that the quantity on the right approaches zero as $n$ becomes large. For each $k>0$ (however small), we can always choose an integer $n$ large enough such that the above inequality is satisfied. Then for the pair $X_n$ and $Y$, $X_n \le Y$ always and $\rho(X_n) >\rho(Y)$. Thus monotonicity fails for the variance principle.

The following derivation shows that positive homogeneity fails and translation invariance is satisfied.

$\rho(c \ X)=c \ E(x)+k \ c^2 \ Var(X) \ne c \ \rho(X)=c \ E(X)+k \ c \ Var(X)$

$\rho(X+c)=E(X+c)+k \ Var(X+c)=E(X)+c+k \ Var(X)=\rho(X)+c$

Example 8 (Standard Deviation Principle)
The risk measure according to the standard deviation principle is also not coherent even though it is an improvement to the variance principle. The standard deviation principle satisfies all properties except monotonicity.

To show subadditivity, note that the fact that $\sqrt{Var(X+Y)} \le \sqrt{Var(X)}+\sqrt{Var(Y)}$. This is derived by the following:

\displaystyle \begin{aligned} Var(X+Y)&=Var(X)+Var(Y)+2 Cov(X,Y) \\&=Var(X)+Var(Y)+2 Corr(X,Y) \sqrt{Var(X)} \sqrt{Var(Y)} \\&=Var(X)+Var(Y)+2 \sqrt{Var(X)} \sqrt{Var(Y)} \\&\le (\sqrt{Var(X)}+\sqrt{Var(Y)})^2 \end{aligned}

In the above derivation, $Corr(X,Y)$ refers to the correlation coefficient, which is always $\le 1$. The following shows the subadditivity.

\displaystyle \begin{aligned} \rho(X+Y)&=E(X+Y)+k \sqrt{Var(X+Y)} \\&\le E(X)+E(Y) +k (\sqrt{Var(X)}+\sqrt{Var(Y)}) \\&=\rho(X)+\rho(Y) \end{aligned}

The following derivation verifies positive homogeneity and translation invariance.

\displaystyle \begin{aligned} \rho(c \ X)&=E(c \ X)+k \sqrt{Var(c \ X)} \\&=c \ E(X)+ k \sqrt{c^ 2 \ Var(X)} \\&=c \ E(X)+ k \ c \ \sqrt{Var(X)} \\&=c \ \rho(X) \end{aligned}

\displaystyle \begin{aligned} \rho(X+c)&=E(X+c)+k \ \sqrt{Var(X+c)} \\&=E(X)+c+k \ \sqrt{Var(X)} \\&=\rho(X)+c \end{aligned}

We now show that the standard deviation principle does not satisfy monotonicity. The proof is similar to the one for variance principle in Example 7 but with a crucial adjustment. For the risk measure $\rho(L)=E(L)+k \sqrt{Var(L)}$, we show that for each $k>0$, we can find random variables $X$ and $Y$ such that $X \le Y$ always and $\rho(X)>\rho(Y)$. To this end, we define random variables $X_n$ and $Y$ for each positive integer $n$.

Let $n$ be any positive integer. Let $X_n$ be a random variable such that $P(X_n=- x_n)=10^{-n}$ and $P(X_n=90)=1-10^{-n}$ where $x_n$ is the positive number defined by $x_n=10^{n+1}-90$. Let $Y$ be a constant value of 100, i.e. $P(Y=100)=1$. Then it is straightforward to verify that

$E(X_n)=80$

$E(X_n^2)=10^{n+2}+6300$

$Var(X_n)=10^{n+2}+6300-80^2=10^{n+2}-100$

$E(Y)=100$

$Var(Y)=0$

For $\rho(X_n)=80+k \ \sqrt{10^{n+2}-100} > \rho(Y)=100$, the constant $k$ must satisfy the following inequality.

$\displaystyle k > \frac{20}{\sqrt{10^{n+2}-100}} \longrightarrow 0$ as $n \longrightarrow \infty$

For any $k >0$ (however small), we can always find an integer $n$ large enough so that the above inequality holds. Then the pair of random variables $X_n$ and $Y$ are such that $X_n \le Y$ always and $\rho(X_n)>\rho(Y)$. Thus monotonicity fails for the standard deviation principle.

Example 9 (Value-at-Risk)
The risk measure of VaR is not coherent despite its attractive quality. An example for the violation of subadditivity can be found in this article. Example 1 in this article gives two independent risks $X$ and $Y$ such that $\text{VaR}_{0.99}(X+Y)>\text{VaR}_{0.99}(X)+\text{VaR}_{0.99}(Y)$.

Example 10 (Tail-Value-at-Risk)
One advantage of TVaR is that it describes the tail behavior beyond the threshold of VaR. Another is that TVaR is a coherent risk measure. This fact has been verified in the article Artzner, P., Delbaen, F., Eber, J.-M., & Heath, D. (1997). Thinking Coherently. Risk, 10, 68-71.

Dan Ma actuarial topics
Dan Ma actuarial
Dan Ma math

Daniel Ma actuarial
Daniel Ma mathematics
Daniel Ma actuarial topics

$\copyright$ 2018 – Dan Ma

# Tail-value-at-risk of a mixture

The risk measures of value-at-risk and tail-value-at-risk are discussed in the preceding post. This post extends the preceding post with an algorithm on evaluating the tail-value-at-risk of a mixture distribution with discrete mixing weights.

The preceding post introduces the notions of value-at-risk (VaR) and tail-value-at-risk (TVaR). These are two particular examples of risk measures that are useful for insurance companies and other enterprises in a risk management context. For $0, VaR at the security level $p$ gives the threshold that the probability of a loss more adverse than the threshold is at most $1-p$. Thus in our context VaR is a percentile of the loss distribution. TVaR is a conditional expected value. At the security level $p$, TVaR is the expected value of the losses given that the losses exceed the threshold VaR.

The preceding post gives several representations of TVaR. It also gives the formula for TVaR for several distributions – exponential, Pareto, normal and lognormal. We now discuss TVaR of a mixture distribution. If a distribution is the mixture of two distributions and if each of the individual distributions has a clear formulation of TVaR, can we mix the two TVaR's? The answer is that we can provided some adjustments are made. The following gives the formula.

Suppose that the loss $X$ is a mixture of two distributions represented by the random variables $X_1$ and $X_2$, with weights $w$ and $1-w$, respectively. Let $\pi_p$ be the $100p$th percentile of the loss $X$, i.e. $\pi_p=\text{VaR}_p(X)$. Then the tail-value-at-risk at the $100p$ percent security level is:

\displaystyle \begin{aligned} \text{TVaR}_p(X)&=\pi_p+\frac{1}{1-p} \biggl[w \times P(X_1>\pi_p) \times e_{X_1}(\pi_p)\\& \ \ +(1-w) \times P(X_2>\pi_p) \times e_{X_2}(\pi_p)\biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (a) \end{aligned}

The comparison is with formula (3) in the preceding post. That formula shows that TVaR is $\pi_p+e(\pi_p)$. In other words, TVaR is VaR plus the mean excess loss function evaluated at $\pi_p$. The content within the squared brackets in Formula (a) is a weighted average of the two individual mean excess loss functions with the adjustment of multiplying with the probabilities $P(X_1>\pi_p)$ and $P(X_2>\pi_p)$. This formula is useful if the $\pi_p$ (VaR) can be calculated and if the mean excess loss functions are accessible.

We give an example and then show the derivation of Formula (a).

Example 1
The mean excess loss function of an exponential distribution is constant. Let’s consider the mixture of two exponential distributions. Suppose that losses follow a mixture of two exponential distributions where one distribution has mean 5 (75% weight) and the other has mean 10 (25% weight). Determine the VaR and TVaR at the security level 99%.

First, calculate the 99th percentile of the mixture, which is the solution to the following equation.

$\displaystyle 0.75 e^{-x/5}+0.25 e^{-x/10}=1-0.99=0.01$

By letting $y=e^{-x/10}$, we solve the following equation.

$\displaystyle 0.75 y^2+0.25 y-0.01=0$

Use the quadratic formula to solve for $y$. Then solve for $x$. The following is the 99th percentile of the loss $X$.

$\displaystyle \pi_p=-10 \times \text{ln} \biggl(\frac{-1+\sqrt{1.48}}{6} \biggr)=33.2168$

The following gives the TVaR.

\displaystyle \begin{aligned} \text{TVaR}_p(X)&=\pi_p+\frac{1}{1-0.99} \biggl[0.75 \times e^{-\pi_p/5} \times 5 +0.25 \times e^{-\pi_p/10} \times 10 \biggr] \\&=42.7283 \end{aligned}

Note that the mean excess loss for the first exponential distribution is 5 and for the second one is 10 (the unconditional means). The survival functions $P(X_1>\pi_p)$ and $P(X_2>\pi_p)$ are also easy to evaluate. As long as the percentile $\pi_p$ of the mixture is calculated, the formula is very useful. In this example, the two exponential parameters are set so that the calculation of percentiles uses the quadratic formula. If the parameters are set differently, then we can use software to evaluate the required percentile.

Deriving the formula

Suppose that $X$ is the mixture of $X_1$, with weight $w$, and $X_2$, with weight $1-w$. The density function for $X_1$ is $f_1(x)$ and the density function for $X_2$ is $f_2(x)$. The density function of $X$ is then $f(x)=w f_1(x)+(1-w) f_2(x)$. We derive from the basic definition of TVaR. Let $\pi_p$ be the $100p$th percentile of $X$.

\displaystyle \begin{aligned} \text{TVaR}_p(X)&=\frac{\int_{\pi_p}^\infty x f(x) \ dx}{1-p}\\&=\pi_p+\frac{\int_{\pi_p}^\infty (x-\pi_p) f(x) \ dx}{1-p} \\&=\pi_p+\frac{\int_{\pi_p}^\infty (x-\pi_p) (w f_1(x)+(1-w) f_2(x)) \ dx}{1-p} \\&=\pi_p+\frac{1}{1-p} \biggl[w \int_{\pi_p}^\infty (x-\pi_p) f_1(x) \ dx +(1-w) \int_{\pi_p}^\infty (x-\pi_p) f_2(x) \ dx\biggr] \\&=\pi_p+\frac{1}{1-p} \biggl[w \ P(X_1>\pi_p) \ \frac{\int_{\pi_p}^\infty (x-\pi_p) f_1(x) \ dx}{P(X_1>\pi_p)}\\& \ \ +(1-w) \ P(X_2>\pi_p) \ \frac{\int_{\pi_p}^\infty (x-\pi_p) f_2(x) \ dx}{P(X_2>\pi_p)} \biggr] \\&=\pi_p+\frac{1}{1-p} \biggl[w \ P(X_1>\pi_p) \ e_{X_1}(\pi_p) +(1-w) \ P(X_2>\pi_p) \ e_{X_2}(\pi_p) \biggr] \end{aligned}

The formula derived here is for mixtures for two distributions. It is straightforward to extend it for mixtures of any finite-mixture.

Practice Problems

Practice problems are available in the companion blog to reinforce the concepts of value-at-risk and tail-value-at-risk. Practice Problems 10-G and 10-H in that link are for TVaR of mixtures.

actuarial
math

Daniel Ma
mathematics

$\copyright$ 2018 – Dan Ma

# Value-at risk and tail-value-at-risk

In actuarial applications, an important focus is on developing loss distributions for insurance products. In such applications, it is desirable to employ risk measures to evaluate the exposure to risk. Such risk measures are indicators, often one or a small set of numbers, that inform actuaries and risk managers about the degree to which the risk bearing entity is subject to various aspects of risk. This post introduces two risk measures – value-at-risk (VaR) and tail-value-at-risk (TVaR).

Value-at-Risk

One natural question for a risk bearing entity (e.g. insurance companies and other enterprises) is: what is the chance of an adverse outcome? Value-at-risk (VaR) provides a ready answer to this question. Mathematically speaking, VaR is a quantile of the distribution of aggregate losses. For example, VaR at the 99% probability level indicates the level of adverse outcome such that the probability of exceeding this threshold is 1%. More broadly, VaR is the amount of capital required to ensure, with a high level of confidence, that the risk bearing entity does not become insolvent. The security level or probability level is chosen arbitrarily. In practice, it is usually a high number such as 95%, 99% or 99.5%. The preference is for a higher security level when evaluating the risk exposure for the entire enterprise. For a sub unit of the enterprise, the security level may be set to a lower number such as 95%.

Suppose that $X$ is a random variable that models the loss distribution in question. We assume that the support of $X$ is the set of positive real numbers or some appropriate subset. The value-at-risk (VaR) of $X$ at the $100p$th security level, denoted by $\text{VaR}_p(X)$, is the $100p$th percentile of $X$.

In the current discussion, we focus on loss distributions that are continuous random variables. Thus $\text{VaR}_p(X)$ is the value $\pi_p$ such that $P(X>\pi_p)=1-p$. In some ways, VaR is an attractive risk measure. Mathematically speaking, VaR has a clear and simple definition. For certain probability models, VaR can be evaluated in closed form. For those models that have no closed form for percentiles, VaR can be evaluated using software. However, VaR has limitations (this point will be briefly discussed below).

Example 1
Suppose that the loss $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$. Then $\text{VaR}_p(X)=\mu+z_p \cdot \sigma$ where $z_p$ is the $100p$th percentile of the standard normal distribution (i.e. normal with mean 0 and standard deviation 1).

VaR for normal distribution is identical to the risk measure called standard deviation principle. For any loss $X$ with mean $\mu$ and variance $\sigma^2$, the quantity $\mu+k \cdot \sigma$ for some fixed constant $k$ is a risk measure called the standard deviation principle. The constant $k$ is usually chosen such that losses will exceed $\mu+k \cdot \sigma$ with a pre-determined small probability. For losses that are normally distributed, $k=1.645$ for security level $p=0.95$ and $k=2.326$ for security level $p=0.99$.

Comment
The value-at-risk discussed here is for gauging the exposure of risk with respect to insurance losses. Thus we assume that the random variable $X$ is one that takes on positive real values (or some appropriate subset of the positive real number line). In this context, the adverse outcomes would be the extremely high values of the positive real numbers. In other words, the adverse outcomes we wish to guard against would be the right tails of the loss distribution. Thus $\text{VaR}_p(X)$ is evaluated for high values for $p$ such as 0.95 or 0.99 or higher. In the actuarial loss point of view, VaR is the high right tail of the loss distribution.

VaR is also used extensively in banking and investment industry. The random variable $X$ in that setting is a profit and loss distribution, which can extend into the negative real numbers (when there are losses). The adverse outcomes for banking and investment applications would be the extreme left tails of the profit and loss distribution. Thus when VaR is evaluated at the security level 95%, we actually calculate the 5th percentile of the profit and loss distribution.

Tail-Value-at-Risk

Tail-value-at-risk (TVaR) is risk measure that is in many ways superior than VaR. The risk measure VaR is a merely a cutoff point and does not describe the tail behavior beyond the VaR threshold. We will see that TVaR reflects the shape of the tail beyond VaR threshold.

Suppose that $X$ is the random variable that models losses. As before we assume that $X$ takes on positive real numbers. The tail-value-at-risk at the $100p$th security level, denoted by $\text{TVaR}_p(X)$, is the expected loss on the condition that loss exceeds the $100p$th percentile of $X$.

Just as in the discussion of VaR, the discussion of TVaR focuses on continuous distributions. As before, we use $\pi_p$ to denote the $100p$th percentile of $X$. The quantity $\text{TVaR}_p(X)$ can be expressed as follows:

$\displaystyle (1) \ \ \ \ \ \text{TVaR}_p(X)=E(X \lvert X> \pi_p)=\frac{\int_{\pi_p}^\infty x f(x) \ dx}{1-F(\pi_p)}=\frac{\int_{\pi_p}^\infty x f(x) \ dx}{1-p}$

The above is the formulation of TVaR is that based on the definition. To use (1) to obtain TVaR, evaluate VaR and then evaluate the integral on the right side. TVaR is obtained by dividing the integral result by $1-p$. However, there are other formulations that will give more insight into TVaR. The following is another formulation.

$\displaystyle (2) \ \ \ \ \ \text{TVaR}_p(X)=\frac{\int_p^1 \text{VaR}_w(X) \ dw}{1-p}$

The above is derived by the substitution $w=F(x)$ where $F(x)$ is the cumulative distribution function (CDF) of $X$. From a computation standpoint, (2) is not easy to use since it is usually hard to integrate VaR. The value of (2) is in the insight. From (2), we see that TVaR can be viewed as the average of all VaR at the level $w$ above $p$. Thus TVaR tells us much more about the tail of the loss distribution than VaR for just one security level. The following is another formulation.

\displaystyle \begin{aligned} (3) \ \ \ \ \ \text{TVaR}_p(X)&=\frac{\int_{\pi_p}^\infty x f(x) \ dx}{1-p} \\&=\pi_p+\frac{\int_{\pi_p}^\infty (x-\pi_p) f(x) \ dx}{1-p} \\&=\pi_p+e(\pi_p) \\&=\text{VaR}_p(X)+e(\text{VaR}_p(X))\end{aligned}

In (3), the function $e(x)$ is the mean excess loss function evaluated at $x$. Thus TVaR is the VaR plus the mean excess loss evaluated at VaR. Thus TVaR is VaR plus the average excess of all losses exceeding the threshold of VaR. As a result, TVaR gives information on the tail to the right of VaR. For those parametric loss distributions that have accessible formulations for the limited expectation $E(X \wedge x)$, the following is a useful formulation for TVaR.

\displaystyle \begin{aligned} (4) \ \ \ \ \ \text{TVaR}_p(X)&=\text{VaR}_p(X)+e(\text{VaR}_p(X)) \\&=\text{VaR}_p(X)+\frac{E[X]-E[X \wedge \text{VaR}_p(X)]}{1-p} \end{aligned}

See this blog post in a companion blog for more information on mean excess loss function and limited expectation $E(X \wedge x)$.

Tail-value-at-risk is also known as conditional tail expectation (CTE) as well as tail conditional expectation (TCE). CTE and TCE are widely used in North America. In Europe, TVaR is also known as expected shortfall (ES).

Formulas for VaR and TVaR

Many distributions have CDFs that allow relatively easy computation of percentiles. Thus VaR is quite accessible for these parametric distributions. This link has a table of distributional information for parametric distributions that includes value-at-risk. Distributions in this table that have formulas for VaR include Burr distribution, inverse Burr distribution, Pareto distribution, inverse Pareto distribution, loglogistic distribution, paralogistic distribution, inverse paralogistic distribution, Weibull distribution, inverse Weibull distribution, exponential distribution, inverse exponential distribution.

The calculation for TVaR is not so accessible in the table in the given link. In fact, for some distributions that have heavy tails, TVaR is not even defined since TVaR involves an average value of the excess of losses above a threshold. In the remainder of the post, we present formulas for TVaR for four distributions – exponential, Pareto, normal and lognormal.

Distribution VaR TVaR
Exponential $-\theta \text{ln}(1-p)$ $\theta (1- \text{ln}(1-p))$
Pareto $\displaystyle \frac{\theta}{(1-p)^{1/\alpha}}-\theta$ $\displaystyle \text{VaR}_p(X)+\frac{\theta+\text{VaR}_p(X)}{\alpha-1}$ $\alpha>1$
Pareto $\displaystyle \text{VaR}_p(X) \ \frac{\alpha}{\alpha-1}+\frac{\theta}{\alpha-1}$ $\alpha>1$
Pareto $\displaystyle \frac{\theta}{\alpha-1} \biggl[1+ \frac{\alpha}{\theta} \ \text{VaR}_p(X) \biggr]$ $\alpha>1$
Normal $\mu+z_p \ \sigma$ $\displaystyle \mu+ \sigma \ \frac{\phi(z_p)}{1-p}$
Lognormal $\displaystyle e^{\mu+z_p \ \sigma}$ $\displaystyle e^{\mu+0.5 \ \sigma^2} \biggl[\frac{\Phi(\sigma-z_p)}{1-p} \biggr]$

The exponential distribution in the table has parametrized by a scale parameter $\theta$, which happens to be the mean of the distribution (see here for more information).

The Pareto distribution in the table is Pareto Type II Lomax distribution discussed here. It is parametrized by a shape parameter $\alpha$ and a scale parameter $\theta$. The table gives three formulations of TVaR. Note that in order for the mean to exist, the shape parameter must be greater than 1. Thus TVaR is not defined when $\alpha>1$.

For the normal distribution, the parameters are $\mu$ (mean) and $\sigma$ (standard deviation). The lognormal distribution are parametrized by $\mu$ and $\sigma$, meaning that the logarithm of the lognormal distribution is a normal distribution with mean $\mu$ and standard deviation $\sigma$. To use the formulas for TVaR, note that $\phi$ and $\Phi$ are the probability density function and the cumulative distribution function of the standard normal distribution, respectively. Thus $\phi$ and $\Phi$ are:

$\displaystyle \phi(x)=\frac{1}{\sqrt{2 \pi}} \ e^{-0.5 x^2}$

$\displaystyle \Phi(x)=\int_{-\infty}^x \frac{1}{\sqrt{2 \pi}} \ e^{-0.5 t^2} \ dt$

Of course $\Phi(x)$ is evaluated by using a table or software and not by evaluating the integral.

Comment

The risk measure of value-at-risk, though simple in concept and calculation, has shortcomings. One undesirable aspect is that VaR does not possess certain desirable properties among risk measures. VaR is not a coherent risk measure. To be a coherent risk measure, it must satisfy four properties, one of which is subadditivity. When a risk measure is subadditive, the risk measure of two risks combined into one will not be greater than the risk measure for the two risks treated separately. It is desirable to have the benefits of diversification by combing several risks into one. When the risk measure is not subadditive, it cannot model the diversification of risks. It had been shown that VaR is not a subadditive risk measure. On the other hand, TVaR had been shown to be a coherent measure. It satisfies subadditivity and three other desirable properties for risk measures. The topic of risk measures is to be discussed in a subsequent post.

A Formula for a Mixture

A relatively easy to use formula for tail-value-at-risk for a mixture distribution is shown in next post.

Practice Problems

Practice problems are available in the companion blog to reinforce the concepts of value-at-risk and tail-value-at-risk.

actuarial
math

Daniel Ma
mathematics

$\copyright$ 2017 – Dan Ma

# A catalog of parametric severity models

Various parametric continuous probability models have been presented and discussed in this blog. The number of parameters in these models ranges from one to two, and in a small number of cases three. They are all potential candidates for models of severity in insurance applications and in other actuarial applications. This post highlights these models. The list presented here is not exhaustive; it is only a brief catalog. There are other models that are also suitable for actuarial applications but not accounted for here. However, the list is a good place to begin. This post also serves a navigation device (the table shown below contains links to the blog posts).

A Catalog

Many of the models highlighted here are related to gamma distribution either directly or indirectly. So the catalog starts with the gamma distribution at the top and then branches out to the other related models. Mathematically, the gamma distribution is a two-parameter continuous distribution defined using the gamma function. The gamma sub family includes the exponential distribution, Erlang distribution and chi-squared distribution. These are distributions that are gamma distributions with certain restrictions on the one or both of the gamma parameters. Other distributions are obtained by raising a distribution to a power. Others are obtained by mixing distributions.

Here’s a listing of the models. Click on the links to find out more about the distributions.

……Derived From ………………….Model
Gamma function
Gamma sub families
Independent sum of gamma
Exponentiation
Raising to a power Raising exponential to a positive power

Raising exponential to a power

Raising gamma to a power

Raising Pareto to a power

Burr sub families
Mixture
Others

The above table categorizes the distributions according to how they are mathematically derived. For example, the gamma distribution is derived from the gamma function. The Pareto distribution is mathematically an exponential-gamma mixture. The Burr distribution is a transformed Pareto distribution, i.e. obtained by raising a Pareto distribution to a positive power. Even though these distributions can be defined simply by giving the PDF and CDF, knowing how their mathematical origins informs us of the specific mathematical properties of the distributions. Organizing according to the mathematical origin gives us a concise summary of the models.

$\text{ }$

$\text{ }$

From a mathematical standpoint, the gamma distribution is defined using the gamma function.

$\displaystyle \Gamma(\alpha)=\int_0^\infty t^{\alpha-1} \ e^{-t} \ dt$

In this above integral, the argument $\alpha$ is a positive number. The expression $t^{\alpha-1} \ e^{-t}$ in the integrand is always positive. The area in between the curve $t^{\alpha-1} \ e^{-t}$ and the x-axis is $\Gamma(\alpha)$. When this expression is normalized, i.e. divided by $\Gamma(\alpha)$, it becomes a density function.

$\displaystyle f(t)=\frac{1}{\Gamma(\alpha)} \ t^{\alpha-1} \ e^{-t}$

The above function $f(t)$ is defined over all positive $t$. The integral of $f(t)$ over all positive $t$ is 1. Thus $f(t)$ is a density function. It only has one parameter, the $\alpha$, which is the shape parameter. Adding the scale parameter $\theta$ making it a two-parameter distribution. The result is called the gamma distribution. The following is the density function.

$\displaystyle f(x)=\frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ x>0$

Both parameters $\alpha$ and $\theta$ are positive real numbers. The first parameter $\alpha$ is the shape parameter and $\theta$ is the scale parameter.

As mentioned above, many of the distributions listed in the above table is related to the gamma distribution. Some of the distributions are sub families of gamma. For example, when $\alpha$ are positive integers, the resulting distributions are called Erlang distribution (important in queuing theory). When $\alpha=1$, the results are the exponential distributions. When $\alpha=\frac{k}{2}$ and $\theta=2$ where $k$ is a positive integer, the results are the chi-squared distributions (the parameter $k$ is referred to the degrees of freedom). The chi-squared distribution plays an important role in statistics.

Taking independent sum of $n$ independent and identically distributed exponential random variables produces the Erlang distribution, a sub gamma family of distribution. Taking independent sum of $n$ exponential random variables, with pairwise distinct means, produces the hypoexponential distributions. On the other hand, the mixture of $n$ independent exponential random variables produces the hyperexponential distribution.

The Pareto distribution (Pareto Type II Lomax) is the mixture of exponential distributions with gamma mixing weights. Despite the connection with the gamma distribution, the Pareto distribution is a heavy tailed distribution. Thus the Pareto distribution is suitable for modeling extreme losses, e.g. in modeling rare but potentially catastrophic losses.

As mentioned earlier, raising a Pareto distribution to a positive power generates the Burr distribution. Restricting the parameters in a Burr distribution in a certain way will produces the paralogistic distribution. The table indicates the relationships in a concise way. For details, go into the blog posts to get more information.

Tail Weight

Another informative way to categorize the distributions listed in the table is through looking at the tail weight. At first glance, all the distributions may look similar. For example, the distributions in the table are right skewed distributions. Upon closer look, some of the distributions put more weights (probabilities) on the larger values. Hence some of the models are more suitable for models of phenomena with significantly higher probabilities of large or extreme values.

When a distribution significantly puts more probabilities on larger values, the distribution is said to be a heavy tailed distribution (or said to have a larger tail weight). In general tail weight is a relative concept. For example, we say model A has a larger tail weight than model B (or model A has a heavier tail than model B). However, there are several ways to check for tail weight of a given distribution. Here are the four criteria.

Tail Weight Measure What to Look for
1 Existence of moments The existence of more positive moments indicates a lighter tailed distribution.
2 Hazard rate function An increasing hazard rate function indicates a lighter tailed distribution.
3 Mean excess loss function An increasing mean excess loss function indicates a heavier tailed distribution.
4 Speed of decay of survival function A survival function that decays rapidly to zero (as compared to another distribution) indicates a lighter tailed distribution.

Existence of moments
For a positive real number $k$, the moment $E(X^k)$ is defined by the integral $\int_0^\infty x^k \ f(x) \ dx$ where $f(x)$ is the density function of the distribution in question. If the distribution puts significantly more probabilities in the larger values in the right tail, this integral may not exist (may not converge) for some $k$. Thus the existence of moments $E(X^k)$ for all positive $k$ is an indication that the distribution is a light tailed distribution.

In the above table, the only distributions for which all positive moments exist are gamma (including all gamma sub families such as exponential), Weibull, lognormal, hyperexponential, hypoexponential and beta. Such distributions are considered light tailed distributions.

The existence of positive moments exists only up to a certain value of a positive integer $k$ is an indication that the distribution has a heavy right tail. All the other distributions in the table are considered heavy tailed distribution as compared to gamma, Weibull and lognormal. Consider a Pareto distribution with shape parameter $\alpha$ and scale parameter $\theta$. Note that the existence of the Pareto higher moments $E(X^k)$ is capped by the shape parameter $\alpha$. If the Pareto distribution is to model a random loss, and if the mean is infinite (when $\alpha=1$), the risk is uninsurable! On the other hand, when $\alpha \le 2$, the Pareto variance does not exist. This shows that for a heavy tailed distribution, the variance may not be a good measure of risk.

Hazard rate function
The hazard rate function $h(x)$ of a random variable $X$ is defined as the ratio of the density function and the survival function.

$\displaystyle h(x)=\frac{f(x)}{S(x)}$

The hazard rate is called the force of mortality in a life contingency context and can be interpreted as the rate that a person aged $x$ will die in the next instant. The hazard rate is called the failure rate in reliability theory and can be interpreted as the rate that a machine will fail at the next instant given that it has been functioning for $x$ units of time.

Another indication of heavy tail weight is that the distribution has a decreasing hazard rate function. On the other hand, a distribution with an increasing hazard rate function has a light tailed distribution. If the hazard rate function is decreasing (over time if the random variable is a time variable), then the population die off at a decreasing rate, hence a heavier tail for the distribution in question.

The Pareto distribution is a heavy tailed distribution since the hazard rate is $h(x)=\alpha/x$ (Pareto Type I) and $h(x)=\alpha/(x+\theta)$ (Pareto Type II Lomax). Both hazard rates are decreasing function.

The Weibull distribution is a flexible model in that when its shape parameter is $0<\tau<1$, the Weibull hazard rate is decreasing and when $\tau>1$, the hazard rate is increasing. When $\tau=1$, Weibull is the exponential distribution, which has a constant hazard rate.

The point about decreasing hazard rate as an indication of a heavy tailed distribution has a connection with the fourth criterion. The idea is that a decreasing hazard rate means that the survival function decays to zero slowly. This point is due to the fact that the hazard rate function generates the survival function through the following.

$\displaystyle S(x)=e^{\displaystyle -\int_0^x h(t) \ dt}$

Thus if the hazard rate function is decreasing in $x$, then the survival function will decay more slowly to zero. To see this, let $H(x)=\int_0^x h(t) \ dt$, which is called the cumulative hazard rate function. As indicated above, $S(x)=e^{-H(x)}$. If $h(x)$ is decreasing in $x$, $H(x)$ has a lower rate of increase and consequently $S(x)=e^{-H(x)}$ has a slower rate of decrease to zero.

In contrast, the exponential distribution has a constant hazard rate function, making it a medium tailed distribution. As explained above, any distribution having an increasing hazard rate function is a light tailed distribution.

The mean excess loss function
The mean excess loss is the conditional expectation $e_X(d)=E(X-d \lvert X>d)$. If the random variable $X$ represents insurance losses, mean excess loss is the expected loss in excess of a threshold conditional on the event that the threshold has been exceeded. Suppose that the threshold $d$ is an ordinary deductible that is part of an insurance coverage. Then $e_X(d)$ is the expected payment made by the insurer in the event that the loss exceeds the deductible.

Whenever $e_X(d)$ is an increasing function of the deductible $d$, the loss $X$ is a heavy tailed distribution. If the mean excess loss function is a decreasing function of $d$, then the loss $X$ is a lighter tailed distribution.

The Pareto distribution can also be classified as a heavy tailed distribution based on an increasing mean excess loss function. For a Pareto distribution (Type I) with shape parameter $\alpha$ and scale parameter $\theta$, the mean excess loss is $e(X)=d/(\alpha-1)$, which is increasing. The mean excess loss for Pareto Type II Lomax is $e(X)=(d+\theta)/(\alpha-1)$, which is also decreasing. They are both increasing functions of the deductible $d$! This means that the larger the deductible, the larger the expected claim if such a large loss occurs! If the underlying distribution for a random loss is Pareto, it is a catastrophic risk situation.

In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.

Speed of decay of the survival function to zero
The survival function $S(x)=P(X>x)$ captures the probability of the tail of a distribution. If a distribution whose survival function decays slowly to zero (equivalently the cdf goes slowly to one), it is another indication that the distribution is heavy tailed. This point is touched on when discussing hazard rate function.

The following is a comparison of a Pareto Type II survival function and an exponential survival function. The Pareto survival function has parameters ($\alpha=2$ and $\theta=2$). The two survival functions are set to have the same 75th percentile, which is $x=2$. The following table is a comparison of the two survival functions.

$\displaystyle \begin{array}{llllllll} \text{ } &x &\text{ } & \text{Pareto } S_X(x) & \text{ } & \text{Exponential } S_Y(x) & \text{ } & \displaystyle \frac{S_X(x)}{S_Y(x)} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{ } &2 &\text{ } & 0.25 & \text{ } & 0.25 & \text{ } & 1 \\ \text{ } &10 &\text{ } & 0.027777778 & \text{ } & 0.000976563 & \text{ } & 28 \\ \text{ } &20 &\text{ } & 0.008264463 & \text{ } & 9.54 \times 10^{-7} & \text{ } & 8666 \\ \text{ } &30 &\text{ } & 0.00390625 & \text{ } & 9.31 \times 10^{-10} & \text{ } & 4194304 \\ \text{ } &40 &\text{ } & 0.002267574 & \text{ } & 9.09 \times 10^{-13} & \text{ } & 2.49 \times 10^{9} \\ \text{ } &60 &\text{ } & 0.001040583 & \text{ } & 8.67 \times 10^{-19} & \text{ } & 1.20 \times 10^{15} \\ \text{ } &80 &\text{ } & 0.000594884 & \text{ } & 8.27 \times 10^{-25} & \text{ } & 7.19 \times 10^{20} \\ \text{ } &100 &\text{ } & 0.000384468 & \text{ } & 7.89 \times 10^{-31} & \text{ } & 4.87 \times 10^{26} \\ \text{ } &120 &\text{ } & 0.000268745 & \text{ } & 7.52 \times 10^{-37} & \text{ } & 3.57 \times 10^{32} \\ \text{ } &140 &\text{ } & 0.000198373 & \text{ } & 7.17 \times 10^{-43} & \text{ } & 2.76 \times 10^{38} \\ \text{ } &160 &\text{ } & 0.000152416 & \text{ } & 6.84 \times 10^{-49} & \text{ } & 2.23 \times 10^{44} \\ \text{ } &180 &\text{ } & 0.000120758 & \text{ } & 6.53 \times 10^{-55} & \text{ } & 1.85 \times 10^{50} \\ \text{ } & \text{ } \\ \end{array}$

Note that at the large values, the Pareto right tails retain much more probabilities. This is also confirmed by the ratio of the two survival functions, with the ratio approaching infinity. Using an exponential distribution to model a Pareto random phenomenon would be a severe modeling error even though the exponential distribution may be a good model for describing the loss up to the 75th percentile (in the above comparison). It is the large right tail that is problematic (and catastrophic)!

Since the Pareto survival function and the exponential survival function have closed forms, We can also look at their ratio.

$\displaystyle \frac{\text{pareto survival}}{\text{exponential survival}}=\frac{\displaystyle \frac{\theta^\alpha}{(x+\theta)^\alpha}}{e^{-\lambda x}}=\frac{\theta^\alpha e^{\lambda x}}{(x+\theta)^\alpha} \longrightarrow \infty \ \text{ as } x \longrightarrow \infty$

In the above ratio, the numerator has an exponential function with a positive quantity in the exponent, while the denominator has a polynomial in $x$. This ratio goes to infinity as $x \rightarrow \infty$.

In general, whenever the ratio of two survival functions diverges to infinity, it is an indication that the distribution in the numerator of the ratio has a heavier tail. When the ratio goes to infinity, the survival function in the numerator is said to decay slowly to zero as compared to the denominator.

It is important to examine the tail behavior of a distribution when considering it as a candidate for a model. The four criteria discussed here provide a crucial way to classify parametric models according to the tail weight.

severity models
math

Daniel Ma
mathematics

$\copyright$ 2017 – Dan Ma

# Examples of mixtures

The notion of mixtures is discussed in this previous post. Many probability distributions useful for actuarial modeling are mixture distributions. The previous post touches on some examples – negative binomial distribution (a Poisson-Gamma mixture), Pareto distribution (an exponential-gamma mixture) and the normal-normal mixture. In this post we present additional examples. We discuss the following examples.

1. Poisson-Gamma mixture = Negative Binomial.
2. Normal-Normal mixture = Normal.
3. Exponential-Gamma mixture = Pareto.
4. Exponential-Inverse Gamma mixture = Pareto.
5. Gamma-Gamma mixture = Generalized Pareto.
6. Weibull-Exponential mixture = Loglogistic.
7. Gamma-Geometric mixture = Exponential.
8. Normal-Gamma mixture = Student t.

The first three examples are discussed in the previous post. We discuss the remaining examples in this post.

The Pareto Family

Examples 3 and 4 show that Pareto distributions are mixtures of exponential distributions with either gamma or inverse gamma mixing weights. In Example 3, $X \lvert \Theta$ is an exponential distribution with $\Theta$ being a rate parameter. When $\Theta$ follows a gamma distribution, the resulting mixture is a (Type I Lomax) Pareto distribution. In Example 4, $X \lvert \Theta$ is an exponential distribution with $\Theta$ being a scale parameter. When $\Theta$ follows an inverse gamma distribution, the resulting mixture is also a (Type I Lomax) Pareto distribution.

As a mixture, Example 5 is like Example 3, except that it is a gamma-gamma mixture resulting in a generalized Pareto distribution. Example 3 has been discussed in the previous post. We now discuss Example 4 and Example 5.

Example 4. Suppose that $X \lvert \Theta$ has an exponential distribution where $\Theta$ is a scale parameter.
Further suppose that the random parameter $\Theta$ follows an inverse gamma distribution with parameters $\alpha$ and $\beta$. Then the unconditional distribution for $X$ is a (Type I Lomax) Pareto distribution with shape parameter $\alpha$ and scale parameter $\beta$.

The following gives the cumulative distribution function (CDF) and survival function of the conditional random variable $X \lvert \Theta$.

$F(x \lvert \Theta)=1-e^{- x/\Theta}$

$S(x \lvert \Theta)=e^{- x/\Theta}$

The random parameter $\Theta$ follows an inverse gamma distribution with parameters $\alpha$ and $\beta$. The following is the pdf of $\Theta$:

$\displaystyle g(\theta)=\frac{1}{\Gamma(\alpha)} \ \biggl[\frac{\beta}{\theta}\biggr]^\alpha \ \frac{1}{\theta} \ e^{-\frac{\beta}{ \theta}} \ \ \ \ \ \theta>0$

We show that the unconditional survival function for $X$ is the survival function for the Pareto distribution with parameters $\alpha$ (shape parameter) and $\beta$ (scale parameter).

\displaystyle \begin{aligned} S(x)&=\int_0^\infty S(x \lvert \theta) \ g(\theta) \ d \theta \\&=\int_0^\infty e^{- x/\theta} \ \frac{1}{\Gamma(\alpha)} \ \biggl[\frac{\beta}{\theta}\biggr]^\alpha \ \frac{1}{\theta} \ e^{-\beta / \theta} \ d \theta \\&=\int_0^\infty \frac{1}{\Gamma(\alpha)} \ \biggl[\frac{\beta}{\theta}\biggr]^\alpha \ \frac{1}{\theta} \ e^{-(x+\beta) / \theta} \ d \theta \\&=\frac{\beta^\alpha}{(x+\beta)^\alpha} \ \int_0^\infty \frac{1}{\Gamma(\alpha)} \ \biggl[\frac{x+\beta}{\theta}\biggr]^\alpha \ \frac{1}{\theta} \ e^{-(x+\beta) / \theta} \ d \theta \\&=\biggl(\frac{\beta}{x+\beta} \biggr)^\alpha \end{aligned}

Note that the the integrand in the last integral is a density function for an inverse gamma distribution. Thus the integral is 1 and can be eliminated. The result that remains is the survival function for a Pareto distribution with parameters $\alpha$ and $\beta$. The following gives the CDF and density function of this Pareto distribution.

$\displaystyle F(x)=1-\biggl(\frac{\beta}{x+\beta} \biggr)^\alpha$

$\displaystyle f(x)=\frac{\alpha \ \beta^{\alpha}}{(x+\beta)^{\alpha+1}}$

See here for further information on Pareto Type I Lomax distribution.

Example 5. Suppose that $X \lvert \Theta$ has a gamma distribution with shape parameter $k$ (a known constant) and rate parameter $\Theta$. Further suppose that the random parameter $\Theta$ follows a gamma distribution with shape parameter $\alpha$ and rate parameter $\beta$. Then the unconditional distribution for $X$ is a generalized Pareto distribution with parameters $\alpha$, $\beta$ and $k$.

Conditional on $\Theta=\theta$, the following is the density function of $X$.

$\displaystyle f(x \lvert \theta)=\frac{1}{\Gamma(k)} \ \theta^k \ x^{k-1} \ e^{-\theta x} \ \ \ \ \ x>0$

The following is the density function of the random parameter $\Theta$.

$\displaystyle g(\theta)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ \ \ \ \ \ \theta>0$

The following gives the unconditional density function for $X$.

\displaystyle \begin{aligned} f(x)&=\int_0^\infty f(x \lvert \theta) \ g(\theta) \ d \theta \\&=\int_0^\infty \frac{1}{\Gamma(k)} \ \theta^k \ x^{k-1} \ e^{-\theta x} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ d \theta \\&=\int_0^\infty \frac{1}{\Gamma(k)} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ x^{k-1} \ \theta^{\alpha+k-1} \ e^{-(x+\beta) \theta} \ d \theta \\&= \frac{1}{\Gamma(k)} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ x^{k-1} \frac{\Gamma(\alpha+k)}{(x+\beta)^{\alpha+k}} \int_0^\infty \frac{1}{\Gamma(\alpha+k)} \ (x+\beta)^{\alpha+k} \ \theta^{\alpha+k-1} \ e^{-(x+\beta) \theta} \ d \theta \\&=\frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \frac{\beta^\alpha \ x^{k-1}}{(x+\beta)^{\alpha+k}} \end{aligned}

Any distribution that has a density function described above is said to be a generalized Pareto distribution with the parameters $\alpha$, $\beta$ and $k$. Its CDF cannot be written in closed form but can be expressed using the incomplete beta function.

\displaystyle \begin{aligned} F(x)&=\int_0^x \frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \frac{\beta^\alpha \ t^{k-1}}{(t+\beta)^{\alpha+k}} \ dt \\&=\int_0^x \frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \biggl(\frac{t}{t+\beta} \biggr)^{k-1} \ \biggl(\frac{\beta}{t+\beta} \biggr)^{\alpha-1} \ \frac{\beta}{(t+\beta)^2} \ dt \\&=\frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \int_0^{\frac{x}{x+\beta}} u^{k-1} \ (1-u)^{\alpha-1} \ du, \ \ \ u=\frac{t}{t+\beta} \\&=\frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \int_0^{w} t^{k-1} \ (1-t)^{\alpha-1} \ dt, \ \ \ w=\frac{x}{x+\beta} \end{aligned}

The moments can be easily derived for the generalized Pareto distribution but on a limited basis. Since it is a mixture distribution, the unconditional mean is the weighted average of the conditional means.

\displaystyle \begin{aligned} E(X^w)&=\int_0^\infty E(X \lvert \theta) \ g(\theta) \ d \theta \\&=\int_0^\infty \frac{\Gamma(k+w)}{\theta^w \Gamma(k)} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ d \theta \\&=\frac{\beta^w \ \Gamma(k+w) \ \Gamma(\alpha-w)}{\Gamma(k) \ \Gamma(\alpha)} \int_0^\infty \frac{1}{\Gamma(\alpha-w)} \ \beta^{\alpha-w} \ \theta^{\alpha-w-1} \ e^{-\beta \theta} \ d \theta \\&=\frac{\beta^w \ \Gamma(k+w) \ \Gamma(\alpha-w)}{\Gamma(k) \ \Gamma(\alpha)} \ \ \ \ -k

Note that $E(X)$ has a simple expression $E(X)=\frac{k \beta}{\alpha-1}$ when $1<\alpha$.

When the parameter $k=1$, the conditional distribution for $X \lvert \Theta$ is an exponential distribution. Then the situation reverts back to Example 3, leading to a Pareto distribution. Thus the Pareto distribution is a special case of the generalized Pareto distribution. Both the Pareto distribution and the generalized Pareto distribution have thicker and longer tails than the original conditional gamma distribution.

It turns out that the F distribution is also a special case of the generalized Pareto distribution. The F distribution with $r_1$ and $r_2$ degrees of freedom is the generalized Pareto distribution with parameters $k=r_1/2$, $\alpha=r_2/2$ and $\beta=r_2/r_1$. As a result, the following is the density function.

\displaystyle \begin{aligned} h(x)&=\frac{\Gamma(r_1/2 + r_2/2)}{\Gamma(r_1/2) \ \Gamma(r_2/2)} \ \frac{(r_2/r_1)^{r_2/2} \ x^{r_1/2-1}}{(x+r_2/r_1)^{r_1/2+r_2/2}} \\&=\frac{\Gamma(r_1/2 + r_2/2)}{\Gamma(r_1/2) \ \Gamma(r_2/2)} \ \frac{(r_1/r_2)^{r_1/2} \ x^{r_1/2-1}}{(1+(r_1/r_2)x)^{r_1/2+r_2/2}} \ \ \ \ 0

Another way to generate the F distribution is from taking a ratio of two chi-squared distributions (see Theorem 9 in this previous post). Of course, there is no need to use the explicit form of the density function of the F distribution. In a statistical application, the F distribution is accessed using tables or software.

The Loglogistic Distribution

The loglogistic distribution can be derived as a mixture of Weillbull distribution with exponential mixing weights.

Example 6. Suppose that $X \lvert \Lambda$ has a Weibull distribution with shape parameter $\gamma$ (a known constant) and a parameter $\Lambda$ such that the CDF of $X \lvert \Lambda$ is $F(x \lvert \Lambda)=1-e^{-\Lambda \ x^\gamma}$. Further suppose that the random parameter $\Lambda$ follows an exponential distribution with rate parameter $\theta^{\gamma}$. Then the unconditional distribution for $X$ is a loglogistic distribution with shape parameter $\gamma$ and scale parameter $\theta$.

The following gives the conditional survival function for $X \lvert \Lambda$ and the exponential mixing weight.

$\displaystyle S(x \lvert \lambda)=e^{-\lambda \ x^\gamma}$

$\displaystyle g(\lambda)=\theta^\gamma \ e^{-\theta^\gamma \ \lambda}$

The following gives the unconditional survival function and CDF of $X$ as well as the PDF.

\displaystyle \begin{aligned} S(x)&=\int_0^\infty S(x \lvert \lambda) \ g(\lambda) \ d \lambda \\&=\int_0^\infty e^{-\lambda \ x^\gamma} \ \theta^\gamma \ e^{-\theta^\gamma \ \lambda} \ d \lambda \\&=\int_0^\infty \theta^\gamma \ e^{-(x^\gamma+\theta^\gamma) \ \lambda} \ d \lambda \\&=\frac{\theta^\gamma}{(x^\gamma+\theta^\gamma)} \int_0^\infty (x^\gamma+\theta^\gamma) \ e^{-(x^\gamma+\theta^\gamma) \ \lambda} \ d \lambda \\&=\frac{\theta^\gamma}{x^\gamma+\theta^\gamma} \end{aligned}

\displaystyle \begin{aligned} F(x)&=1-S(x)=1-\frac{\theta^\gamma}{x^\gamma+\theta^\gamma} =\frac{x^\gamma}{x^\gamma+\theta^\gamma} =\frac{(x/\theta)^\gamma}{1+(x/\theta)^\gamma} \end{aligned}

$\displaystyle f(x)=\frac{d}{dx} \biggl( \frac{x^\gamma}{x^\gamma+\theta^\gamma} \biggr)=\frac{\gamma \ (x/\theta)^\gamma}{x [1+(x/\theta)^\gamma]^2}$

Any distribution that has any one of the above three distributional quantities is said to be a loglogistic distribution with shape parameter $\gamma$ and scale parameter $\theta$.

One interesting point about loglogistic distribution that an inverse loglogistic distribution is another loglogistic distribution. Suppose that $X$ has a loglogistic distribution with shape parameter $\gamma$ and scale parameter $\theta$. Let $Y=\frac{1}{X}$. Then $Y$ has a loglogistic distribution with shape parameter $\gamma$ and scale parameter $\theta^{-1}$.

\displaystyle \begin{aligned} P[Y \le y]&=P[\frac{1}{X} \le y] =P[X \ge y^{-1}] =\frac{\theta^\gamma}{y^{-\gamma}+\theta^\gamma} \\&=\frac{\theta^\gamma \ y^\gamma}{1+\theta^\gamma \ y^\gamma} \\&=\frac{y^\gamma}{(\theta^{-1})^\gamma+y^\gamma} \end{aligned}

The above is a survival function for the loglogistic distribution with the desired parameters. Thus there is no need to specially call out the inverse loglogistic distribution.

In order to find the mean and higher moments of the loglogistic distribution, we take the approach of identifying the conditional Weibull means and the weight these means by the exponential mixing weights. Note that the parameter $\Lambda$ in the conditional CDF $F(x \lvert \Lambda)=1-e^{-\Lambda \ x^\gamma}$ is not a scale parameter. The Weibull distribution in this conditional CDF is equivalent to a Weibull distribution with shape parameter $\gamma$ and scale parameter $\Lambda^{-1/\gamma}$. According to formula (4) in this previous post, the $k$th moment of this Weillbull distribution is

$\displaystyle E[ (X \lvert \Lambda)^k]=\Gamma \biggl(1+\frac{k}{\gamma} \biggr) \Lambda^{-k/\gamma}$

The following gives the unconditional $k$th moment of the Weibull-exponential mixure.

\displaystyle \begin{aligned} E[X^k]&=\int_0^\infty E[ (X \lvert \Lambda)^k] \ g(\lambda) \ d \lambda \\&=\int_0^\infty \Gamma \biggl(1+\frac{k}{\gamma} \biggr) \lambda^{-k/\gamma} \ \theta^\gamma \ e^{-\theta^\gamma \ \lambda} \ d \lambda\\&=\Gamma \biggl(1+\frac{k}{\gamma} \biggr) \ \theta^\gamma \int_0^\infty \lambda^{-k/\gamma} \ e^{-\theta^\gamma \ \lambda} \ d \lambda \\&=\theta^k \ \Gamma \biggl(1+\frac{k}{\gamma} \biggr) \int_0^\infty t^{-k/\gamma} \ e^{-t} \ dt \ \ \text{ where } t=\theta^\gamma \lambda \\&=\theta^k \ \Gamma \biggl(1+\frac{k}{\gamma} \biggr) \int_0^\infty t^{[(\gamma-k)/\gamma]-1} \ e^{-t} \ dt \\&=\theta^k \ \Gamma \biggl(1+\frac{k}{\gamma} \biggr) \ \Gamma \biggl(1-\frac{k}{\gamma} \biggr) \ \ \ \ -\gamma

The range $\gamma follows from the fact that the arguments of the gamma function must be positive. Thus the $k$th moments of the loglogistic distribution are limited by its shape parameter $\gamma$. If $\gamma=1$, then $E(X)$ does not exist. For a larger $\gamma$, more moments exist but always a finite number of moments. This is an indication that the loglogistic distribution has a thick (right) tail. This is not surprising since mixture distributions (loglogistic in this case) tend to have thicker tails than the conditional distributions (Weibull in this case). The thicker tail is a result of the uncertainty in the random parameter in the conditional distribution (the Weibull $\Lambda$ in this case).

Another Way to Obtain Exponential Distribution

We now consider Example 7. The following is a precise statement of the gamma-geometric mixture.

Example 7. Suppose that $X \lvert \alpha$ has a gamma distribution with shape parameter $\alpha$ that is a positive integer and rate parameter $\beta$ (a known constant). Further suppose that the random parameter $\alpha$ follows a geometric distribution with probability function $P[Y=\alpha]=p (1-p)^{\alpha-1}$ where $\alpha=1,2,3,\cdots$. Then the unconditional distribution for $X$ is an exponential distribution with rate parameter $\beta p$.

The conditional gamma distribution has an uncertain shape parameter $\alpha$ that can take on positive integers. The parameter $\alpha$ follows a geometric distribution. Here’s the ingredients that go into the mixture.

$\displaystyle f(x \lvert \alpha)=\frac{1}{(\alpha-1)!} \ \beta^\alpha \ x^{\alpha-1} \ e^{-\beta x}$

$P[Y=\alpha]=p (1-p)^{\alpha-1}$

The following is the unconditional probability density function of $X$.

\displaystyle \begin{aligned} f(x)&=\sum \limits_{\alpha=1}^\infty f(x \lvert \alpha) \ P[Y=\alpha] \\&=\sum \limits_{\alpha=1}^\infty \frac{1}{(\alpha-1)!} \ \beta^\alpha \ x^{\alpha-1} \ e^{-\beta x} \ p (1-p)^{\alpha-1} \\&=\beta p \ e^{-\beta x} \sum \limits_{\alpha=1}^\infty \frac{[\beta(1-p) x]^{\alpha-1}}{(\alpha-1)!} \\&=\beta p \ e^{-\beta x} \sum \limits_{\alpha=0}^\infty \frac{[\beta(1-p) x]^{\alpha}}{(\alpha)!} \\&=\beta p \ e^{-\beta x} \ e^{\beta(1-p) x} \end{aligned}

The above density function is that of an exponential distribution with rate parameter $\beta p$.

Student t Distribution

Example 3 (discussed in the previous post) involves a normal distribution with a random mean. Example 8 involves a normal distribution with mean 0 and an uncertain variance, which follows a gamma distribution such that the two gamma parameters are related to a common parameter $r$, which will be the degrees of freedom of the student t distribution. The following is a precise description of the normal-gamma mixture.

Example 8. Suppose that $X \lvert \Lambda$ has a normal distribution with mean 0 and variance $1/\Lambda$. Further suppose that the random parameter $\Lambda$ follows a gamma distribution with shape parameter $\alpha$ and scale parameter $\theta$ such that $2 \alpha=\frac{2}{\theta}=r$ is a positive integer. Then the unconditional distribution for $X$ is a student t distribution with $r$ degrees of freedom.

The following gives the ingredients of the normal-gamma mixture. The first item is the conditional density function of $X$ given $\Lambda$. The second is the density function of the mixing weight $\Lambda$.

$\displaystyle f(x \lvert \lambda)=\frac{1}{\sqrt{1/\lambda} \ \sqrt{2 \pi}} \ e^{-(\lambda/2) \ x^2}=\sqrt{\frac{\lambda}{2 \pi}} \ e^{-(\lambda/2) \ x^2}$

$\displaystyle g(\lambda)=\frac{1}{\Gamma(\alpha)} \biggl( \frac{1}{\theta} \biggr)^\alpha \ \lambda^{\alpha-1} \ e^{-\lambda/\theta}$

The following calculation derives the unconditional density function of $X$.

\displaystyle \begin{aligned} f(x)&=\int_{0}^\infty f(x \lvert \lambda) \ g(\lambda) \ d \lambda \\&=\int_{0}^\infty \sqrt{\frac{\lambda}{2 \pi}} \ e^{-(\lambda/2) \ x^2} \ \frac{1}{\Gamma(\alpha)} \biggl( \frac{1}{\theta} \biggr)^\alpha \ \lambda^{\alpha-1} \ e^{-\lambda/\theta} \ d \lambda \\&=\frac{1}{\Gamma(\alpha)} \ \biggl( \frac{1}{\theta} \biggr)^\alpha \ \frac{1}{\sqrt{2 \pi}} \int_0^\infty \lambda^{\alpha+\frac{1}{2}-1} e^{-(\frac{x^2}{2}+\frac{1}{\theta} ) \lambda} \ d \lambda \\&=\frac{\Gamma(\alpha+\frac{1}{2})}{\Gamma(\alpha)} \ \biggl( \frac{1}{\theta} \biggr)^\alpha \ \frac{1}{\sqrt{2 \pi}} \ \biggl(\frac{2 \theta}{\theta x^2+2} \biggr)^{\alpha+\frac{1}{2}} \\& \times \int_0^\infty \frac{1}{\Gamma(\alpha+\frac{1}{2})} \ \biggl(\frac{\theta x^2+2}{2 \theta} \biggr)^{\alpha+\frac{1}{2}} \lambda^{\alpha+\frac{1}{2}-1} e^{-\frac{\theta x^2+2}{2 \theta} \lambda} \ d \lambda \\&=\frac{\Gamma(\alpha+\frac{1}{2})}{\Gamma(\alpha)} \ \biggl( \frac{1}{\theta} \biggr)^\alpha \ \frac{1}{\sqrt{2 \pi}} \ \biggl(\frac{2 \theta}{\theta x^2+2} \biggr)^{\alpha+\frac{1}{2}} \ \ \ \ \ -\infty

The above density function is in terms of the two parameters $\alpha$ and $\theta$. In the assumptions, the two parameters are related to a common parameter $r$ such that $\alpha=\frac{r}{2}$ and $\theta=\frac{2}{r}$. The following derivation converts to the common $r$.

\displaystyle \begin{aligned} f(x)&=\frac{\Gamma(\frac{r}{2}+\frac{1}{2})}{\Gamma(\frac{r}{2})} \ \biggl( \frac{r}{2} \biggr)^{\frac{r}{2}} \ \frac{1}{\sqrt{2 \pi}} \ \biggl(\frac{2 \frac{2}{r}}{\frac{2}{r} x^2+2} \biggr)^{\frac{r}{2}+\frac{1}{2}} \\&=\frac{\Gamma(\frac{r}{2}+\frac{1}{2})}{\Gamma(\frac{r}{2})} \ \frac{r^{r/2}}{2^{r/2}} \ \frac{1}{2^{1/2} \sqrt{\pi}} \ \biggl(\frac{2/r}{x^2/r+1} \biggr)^{(r+1)/2} \\&=\frac{\Gamma \biggl(\displaystyle \frac{r+1}{2} \biggr)}{\Gamma \biggl(\displaystyle \frac{r}{2} \biggr)} \ \frac{1}{\sqrt{\pi r}} \ \frac{1 \ \ \ \ \ }{\biggl(1+\displaystyle \frac{x^2}{r} \biggr)^{(r+1)/2}} \ \ \ \ \ -\infty

The above density function is that of a student t distribution with $r$ degrees of freedom. Of course, in performing test of significance, the t distribution is accessed by using tables or software. A usual textbook definition of the student t distribution is the ratio of a normal distribution and a chi-squared distribution (see Theorem 6 in this previous post.

$\text{ }$

$\text{ }$

$\text{ }$

$\copyright$ 2017 – Dan Ma

# Mixing probability distributions

This post discusses another way to generate new distributions from old, that of mixing distributions. The resulting distributions are called mixture distributions.

What is a Mixture?

First, let’s start with continuous mixture. Suppose that $X$ is a continuous random variable with probability density function (pdf) $f_{X \lvert \Theta}(x \lvert \theta)$ where $\theta$ is a parameter in the pdf. There may be other parameters in the distribution but they are not relevant at the moment (e.g. these other parameters may be known constants). Suppose that the parameter $\theta$ is an uncertain quantity and is a random variable with pdf $h_\Theta(\theta)$ (if $\Theta$ is a continuous random variable) or with probability function $P(\Theta=\theta)$ (if $\Theta$ a discrete random variable). Then taking the weighted average of $f_{X \lvert \Theta}(x \lvert \theta)$ with $h_\Theta(\theta)$ or $P(\Theta=\theta)$ as weight produces a mixture distribution. The following would be pdf of the resulting mixture distribution.

$\displaystyle (1a) \ \ \ \ \ f_X(x)=\int_{-\infty}^\infty f_{X \lvert \Theta}(x \lvert \theta) \ h_\Theta(\theta) \ d \theta$

$\displaystyle (1b) \ \ \ \ \ f_X(x)=\sum \limits_{\theta} \biggl(f_{X \lvert \Theta}(x \lvert \theta) \ P(\Theta=\theta) \biggr)$

Thus a continuous random variable $X$ is said to be a mixture (or has a mixture distribution) if its probability density function $f_X(x)$ is a weighted average of a family of pdfs $f_{X \lvert \Theta}(x \lvert \theta)$ where the weight is the density function or probability function of the random parameter $\Theta$. The random variable $\Theta$ is said to be the mixing random variable and its pdf or probability function is called the mixing weight.

Another definition of mixture distribution is that the cumulative distribution function (cdf) of the random variable $X$ is the weighted average of a family of cumulative distribution functions indexed by the mixing random variable $\Theta$.

$\displaystyle (2a) \ \ \ \ \ F_X(x)=\int_{-\infty}^\infty F_{X \lvert \Theta}(x \lvert \theta) \ h_\Theta(\theta) \ d \theta$

$\displaystyle (2b) \ \ \ \ \ F_X(x)=\sum \limits_{\theta} \biggl(F_{X \lvert \Theta}(x \lvert \theta) \ P(\Theta=\theta) \biggr)$

The idea of discrete mixture is similar. A discrete random variable $X$ is said to be a mixture if its probability function $P(X=x)$ or cumulative distribution function $P(X \le x)$ is a weighted average of a family of probability functions or cumulative distributions indexed by the mixing random variable $\Theta$. The mixing weight can be discrete or continuous. The following shows the probability function and the cdf of a discrete mixture distribution.

$\displaystyle (3a) \ \ \ \ \ P(X=x)=\int_{-\infty}^\infty P(X=x \lvert \Theta=\theta) \ h_\Theta(\theta) \ d \theta$

$\displaystyle (3b) \ \ \ \ \ P(X \le x)=\int_{-\infty}^\infty P(X \le x \lvert \Theta=\theta) \ h_\Theta(\theta) \ d \theta$

$\text{ }$

$\displaystyle (4a) \ \ \ \ \ P(X=x)=\sum \limits_{\theta} \biggl(P(X=x \lvert \Theta=\theta) \ P(\Theta=\theta) \biggr)$

$\displaystyle (4b) \ \ \ \ \ P(X \le x)=\sum \limits_{\theta} \biggl(P(X \le x \lvert \Theta=\theta) \ P(\Theta=\theta) \biggr)$

When the mixture distribution is a weighted average of finitely many distributions, it is called a $n$-point mixture where $n$ is the number of distributions. Suppose that there are $n$ distributions with pdfs

$f_1(x),f_2(x),\cdots,f_n(x)$ (continuous case)

or probability functions

$P(X_1=x),P(X_2=x),\cdots,P(X_n=x)$ (discrete case)

with mixing probabilities $p_1,p_2,\cdots,p_n$ where the sum of the $p_i$ is 1. Then the following gives the pdf or the probability function of the mixture distribution.

$\displaystyle (5a) \ \ \ \ \ f_X(x)=\sum \limits_{j=1}^n p_j \ f_j(x)$

$\displaystyle (5b) \ \ \ \ \ P(X=x)=\sum \limits_{j=1}^n p_j \ P(X_j=x)$

The cdf for the $n$-point mixture is similarly obtained by weighting the respective conditional cdfs as in (4b).

Distributional Quantities

Once the pdf (or probability function) or cdf of a mixture is established, the other distributional quantities can be derived from the pdf or cdf. Some of the distributional quantities can be obtained by taking weighted average of the corresponding conditional counterparts. For example, the following gives the survival function and moments of a mixture distribution. We assume that the mixing weight is continuous. For discrete mixing weight, simply replace the integral with summation.

$\displaystyle (6a) \ \ \ \ \ S_X(x)=\int_{-\infty}^\infty S_{X \lvert \Theta}(x \lvert \theta) \ h_\Theta(\theta) \ d \theta$

$\displaystyle (6b) \ \ \ \ \ E(X)=\int_{-\infty}^\infty E(X \lvert \theta) \ h_\Theta(\theta) \ d \theta$

$\displaystyle (6c) \ \ \ \ \ E(X^k)=\int_{-\infty}^\infty E(X^k \lvert \theta) \ h_\Theta(\theta) \ d \theta$

Once the moments are obtained, all distributional quantities that are based on moments can be evaluated, calculations such as variance, skewness, and kurtosis. Note that these quantities are not the weighted average of the conditional quantities. For example, variance of a mixture is not the weighted average of the variance of the conditional distributions. In fact, the variance of a mixture has two components.

$\displaystyle (7) \ \ \ \ \ Var(X)=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)]$

The relationship in (7) is called the law of total variance, which is the proper way of computing the unconditional variance $Var(X)$. The first component $E[Var(X \lvert \Theta)]$ is called the expected value of conditional variances, which is the weighted average of the conditional variances. The second component $Var[E(X \lvert \Theta)]$ is called the variance of the conditional means, which represents the additional variance as a result of the uncertainty in the parameter $\Theta$. If there is a great deal of variation among the conditional mean $E(X \lvert \Theta)$, the variation will be reflected in $Var(X)$ through the second component $Var[E(X \lvert \Theta)]$. This will be further illustrated in the examples below.

Motivation

Some of the examples discussed below have gamma distribution as mixing weights. See here for basic information on gamma distribution.

A natural interpretation of mixture is that of the uncertain parameter $\Theta$ in the conditional random variable $X \lvert \Theta$ describes an individual in a large population. For example, the parameter $\Theta$ describes a certain characteristics across the units in a population. In this section, we describe the idea of mixture in an insurance setting. The example is to mix Poisson distributions with a gamma distribution as mixing weight. We will see that the resulting mixture is a negative binomial distribution, which is more dispersed than the conditional Poisson distributions.

Consider a large group of insured drivers for auto collision coverage. Suppose that the claim frequency in a year for an insured driver has a Poisson distribution with mean $\theta$. The conditional probability function for the number of claims in a year for an insured driver is:

$\displaystyle P(X=x \lvert \Theta=\theta)=\frac{e^{-\theta} \ \theta^x}{x!} \ \ \ \ \ \ x=0,1,2,3,\cdots$ where $\theta>0$

The mean number of claims in a year for an insured driver is $\theta$. The parameter $\theta$ reflects the risk characteristics of an insured driver. Since the population of insured drivers is large, there is uncertainty in the parameter $\theta$. Thus it is more appropriate to regard $\theta$ as a random variable in order to capture the wide range of risk characteristics across the individuals in the population. As a result, the above probability function is not unconditional, but, rather, a conditional probability function of $X$.

What about the marginal (unconditional) probability function of $X$? Suppose that the pdf of $\Theta$ has a gamma distribution with the following pdf:

$\displaystyle h_{\Theta}(\theta)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta}$

where $\alpha>0$ and $\beta>0$ are known parameters of the gamma distribution. Then the unconditional pdf of $X$ is the weighted average of the conditional Poisson distribution.

\displaystyle \begin{aligned} P(X=x)&=\int_0^\infty P(X=x \lvert \Theta=\theta) \ h_{\Theta}(\theta) \ d \theta \\&=\int_0^\infty \frac{e^{-\theta} \ \theta^x}{x!} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \\&= \frac{\beta^\alpha}{x! \Gamma(\alpha)} \int_0^\infty \theta^{x+\alpha-1} \ e^{(\beta+1) \theta} \ d \theta \\&=\frac{\beta^\alpha}{x! \Gamma(\alpha)} \ \frac{\Gamma(x+\alpha)}{(\beta+1)^{x+\alpha}} \int_0^\infty \frac{1}{\Gamma(x+\alpha)} \ (\beta+1)^{x+\alpha} \ \theta^{x+\alpha-1} \ e^{(\beta+1) \theta} \ d \theta \\&=\frac{\beta^\alpha}{x! \Gamma(\alpha)} \ \frac{\Gamma(x+\alpha)}{(\beta+1)^{x+\alpha}} \\&=\frac{\Gamma(x+\alpha)}{x! \ \Gamma(\alpha)} \ \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha \biggl(\frac{1}{\beta+1} \biggr)^x \ \ x=0,1,2,\cdots \end{aligned}

Note that the integral in the 4th step is 1 since the integrand is a gamma density function. The probability function at the last step is that of a negative binomial distribution. If the parameter $\alpha$ is a positive integer, then the following gives the probability function of $X$ after simplifying the expression with gamma function.

$\displaystyle P(X=x)=\left\{ \begin{array}{ll} \displaystyle \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha &\ x=0 \\ \text{ } & \text{ } \\ \displaystyle \frac{(x-1+\alpha) \cdots (1+\alpha) \alpha}{x!} \ \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha \biggl(\frac{1}{\beta+1} \biggr)^x &\ x=1,2,\cdots \end{array} \right.$

This probability function can be further simplified as the following:

$\displaystyle P(X=x)=\binom{x+\alpha-1}{x} \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha \biggl(\frac{1}{\beta+1} \biggr)^x$

where $x=0,1,2,\cdots$. This is one form of a negative binomial distribution. The mean is $E(X)=\frac{\alpha}{\beta}$ and the variance is $Var(X)=\frac{\alpha}{\beta} (1+\frac{1}{\beta})$. The variance of the negative binomial distribution is greater than the mean. In a Poisson distribution, the mean equals the variance. Thus the unconditional claim frequency $X$ is more dispersed than its conditional distributions. This is a characteristic of mixture distributions. The uncertainty in the parameter variable $\Theta$ has the effect of increasing the unconditional variance of the mixture distribution of $X$. Recall that the variance of a mixture distribution has two components, the weighted average of the conditional variances and the variance of the conditional means. The second component represents the additional variance introduced by the uncertainty in the parameter $\Theta$.

More Examples

We now further illustrate the notion of mixture with a few more examples. Many familiar distributions are mixture distribution. The negative binomial distribution is a mixture of Poisson distributions with gamma mixing weight as discussed above. The Pareto distribution, more specifically Pareto Type I Lomax, is a mixture of exponential distributions with gamma mixing weight (see Example 2 below). Example 3 discusses the normal-normal mixture. Example 1 demonstrates numerical calculation involving a finite mixture.

Example 1
Suppose that the size of an auto collision claim from a large group of insured drivers is a mixture of three exponential distributions with means 5, 8 and 10 (with respective weights 0.75, 0.15 and 0.10, respectively). Discuss the mixture distribution.

The pdf and cdf are the weighted averages of the respective exponential quantities.

\displaystyle \begin{aligned} f_X(x)&=0.75 \ (0.2 e^{-0.2x} )+0.15 \ (0.125 e^{-0.125x} )+0.10 (0.10 e^{-0.10x}) \\&\text{ } \\&=0.15 \ e^{-0.2x} +0.01875 \ e^{-0.125x}+0.01 \ e^{-0.10x} \end{aligned}

\displaystyle \begin{aligned} F_X(x)&=0.75 \ (1- e^{-0.2x} )+0.15 \ (1- e^{-0.125x} )+0.10 (1- e^{-0.10x}) \\&\text{ } \\&=1-0.75 \ e^{-0.2x} -0.15 \ e^{-0.125x}-0.10 \ e^{-0.10x} \end{aligned}

$\displaystyle S_X(x)=0.75 \ e^{-0.2x} +0.15 \ e^{-0.125x}+0.10 \ e^{-0.10x}$

For a randomly selected claim from this population of insured drivers, what is the probability that it exceeds 10? The answer is $S_X(10)=0.1813$. The pdf and cdf of the mixture will allow us to derive other distributional quantities such as moments and then using the moments to derive skewness and kurtosis. The moments for exponential distribution has a closed form. Then the moments of the mixture distribution is simply the weighted average of the exponential moments.

$\displaystyle E(X^k)=0.75 \ [5^k \ k!]+0.15 \ [8^k \ k!]+0.10 \ [10^k \ k!]$

where $k$ is a positive integer. The following evaluate the first four moments.

$\displaystyle E(X)=0.75 \ 5+0.15 \ 8+0.10 \ 10=5.95$

$\displaystyle E(X^2)=0.75 \ (5^2 \ 2!)+0.15 \ (8^2 \ 2!)+0.10 \ (10^2 \ 2!)=76.7$

$\displaystyle E(X^3)=0.75 \ (5^3 \ 3!)+0.15 \ (8^3 \ 3!)+0.10 \ (10^3 \ 3!)=1623.3$

$\displaystyle E(X^4)=0.75 \ (5^4 \ 4!)+0.15 \ (8^4 \ 4!)+0.10 \ (10^4 \ 4!)=49995.6$

The variance of $X$ is $Var(X)=76.7-5.95^2=41.2975$. The three conditional exponential variances are 25, 64 and 100. The weighted average of these would be 38.35. Because of the uncertainty resulting from not knowing which exponential distribution the claim is from, the unconditional variance is larger than 38.35.

The skewness of a distribution is the third central moments and the kurtosis is defined as the fourth central moment. Each of them can be expressed in terms of the raw moments up to the third or fourth raw moment.

$\displaystyle \gamma=E\biggl[\biggl( \frac{X-\mu}{\sigma} \biggr)^3\biggr]=\frac{E(X^3)-3 \mu \sigma^2-\mu^3}{(\sigma^2)^{1.5}}$

$\displaystyle \text{Kurt}[X]=E\biggl[\biggl( \frac{X-\mu}{\sigma} \biggr)^4\biggr]=\frac{E(X^4)-4 \mu E(X^3)+6 \mu^2 E(X^2)-3 \mu^4}{\sigma^4}$

Note that $\mu=E(X)$ and $\sigma^2=Var(X)$. The expressions on the right hand side are in terms of the raw moments $E(X^k)$ up to $k=4$. Plugging in the raw moments produces the skewness $\gamma=2.5453$ and kurtosis $\text{Kurt}[X]=14.0097$. The excess kurtosis is then 11.0097 (subtracting 3 from the kurtosis).

The skewness and excess kurtosis of an exponential distribution are always 2 and 6, respectively. One take way is that skewness and kurtosis of a mixture is not the weighted average of the conditional counterparts. In this particular case, the mixture is more skewed than the individual exponential distributions. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution (the kurtosis of a normal distribution is 3). Since the excess kurtosis for exponential distributions is 6, this mixture distribution is considered to be heavy tailed and to have higher likelihood of outliers.

Example 2 (Exponential-Gamma Mixture)
The Pareto distribution (Type I Lomax) is a mixture of exponential distributions with gamma mixing weight. Suppose $X$ has the exponential pdf $f_{X \lvert \Theta}(x \lvert \theta)=\theta \ e^{-\theta x}$, where $x>0$, conditional on the parameter $\Theta$. Suppose that the pdf of $\Theta$ has a gamma distribution with the following pdf:

$\displaystyle h_{\Theta}(\theta)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta}$

Then the following gives the unconditional pdf of the random variable $X$.

\displaystyle \begin{aligned} f_X(x)&=\int_0^\infty f_{X \lvert \Theta}(x \lvert \theta) \ h_{\Theta}(\theta) \ d \theta \\&=\int_0^\infty \theta \ e^{-\theta x} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ d \theta \\&= \frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^\infty \theta^{\alpha+1-1} \ e^{-(x+\beta) \theta} \ d \theta \\&= \frac{\beta^\alpha}{\Gamma(\alpha)} \frac{\Gamma(\alpha+1)}{(x+\beta)^{\alpha+1}} \int_0^\infty \frac{1}{\Gamma(\alpha+1)} \ (x+\beta)^{\alpha+1} \ \theta^{\alpha+1-1} \ e^{-(x+\beta) \theta} \ d \theta \\&=\frac{\beta^\alpha}{\Gamma(\alpha)} \frac{\Gamma(\alpha+1)}{(x+\beta)^{\alpha+1}} \\&= \frac{\alpha \ \beta^{\alpha}}{(x+\beta)^{\alpha+1}} \end{aligned}

The above is the density of the Pareto Type I Lomax distribution. Pareto distribution is discussed here. The example of exponential-gamma mixture is discussed here.

Example 3 (Normal-Normal Mixture)
Conditional on $\Theta=\theta$, consider a normal random variable $X$ with mean $\theta$ and variance $v$ where $v$ is known. The following is the conditional density function of $X$.

$\displaystyle f_{X \lvert \Theta}(x \lvert \theta)=\frac{1}{\sqrt{2 \pi v}} \ \text{exp}\biggl[-\frac{1}{2v}(x-\theta)^2 \biggr] \ \ \ -\infty

Suppose that the parameter $\Theta$ is normally distributed with mean $\mu$ and variance $a$ (both known parameters). The following is the density function of $\Theta$.

$\displaystyle f_{\Theta}(\theta)=\frac{1}{\sqrt{2 \pi a}} \ \text{exp}\biggl[-\frac{1}{2a}(\theta-\mu)^2 \biggr] \ \ \ -\infty

Determine the unconditional pdf of $X$.

\displaystyle \begin{aligned} f_X(x)&=\int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi v}} \ \text{exp}\biggl[-\frac{1}{2v}(x-\theta)^2 \biggr] \ \frac{1}{\sqrt{2 \pi a}} \ \text{exp}\biggl[-\frac{1}{2a}(\theta-\mu)^2 \biggr] \ d \theta \\&=\frac{1}{2 \pi \sqrt{va}} \int_{-\infty}^\infty \text{exp}\biggl[-\frac{1}{2v}(x-\theta)^2 -\frac{1}{2a}(\theta-\mu)^2\biggr] \ d \theta \end{aligned}

The expression in the exponent has the following equivalent expression.

$\displaystyle \frac{(x-\theta)^2}{v}+\frac{(\theta-\mu)^2}{a}=\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 +\frac{(x-\mu)^2}{a+v}$

Continuing the derivation:

\displaystyle \begin{aligned} f_X(x)&=\frac{1}{2 \pi \sqrt{va}} \int_{-\infty}^\infty \text{exp}\biggl[-\frac{1}{2} \biggl(\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 +\frac{(x-\mu)^2}{a+v} \biggr) \biggr] \ d \theta \\&\displaystyle =\frac{\text{exp}\biggl[\displaystyle -\frac{(x-\mu)^2}{2(a+v)} \biggr]}{2 \pi \sqrt{va}} \int_{-\infty}^\infty \text{exp}\biggl[\displaystyle -\frac{1}{2} \biggl(\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 \biggr) \biggr] \ d \theta \\&=\frac{\text{exp}\biggl[\displaystyle -\frac{(x-\mu)^2}{2(a+v)} \biggr]}{\sqrt{2 \pi (a+v)} } \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi}} \sqrt{\frac{a+v}{va}} \ \text{exp}\biggl[-\frac{1}{2} \biggl(\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 \biggr) \biggr] \ d \theta \\&=\frac{\text{exp}\biggl[\displaystyle -\frac{(x-\mu)^2}{2(a+v)} \biggr]}{\sqrt{2 \pi (a+v)} } \end{aligned}

Note that the integrand in the integral in the third line is the density function of a normal distribution with mean $\frac{ax+v \mu}{a+v}$ and variance $\frac{va}{a+v}$. Hence the integral is 1. The last expression is the unconditional pdf of $X$, repeated as follows.

$\displaystyle f_X(x)=\frac{1}{\sqrt{2 \pi (a+v)}} \ \text{exp}\biggl[-\frac{(x-\mu)^2}{2(a+v)} \biggr] \ \ \ \ -\infty

The above is the pdf of a normal distribution with mean $\mu$ and variance $a+v$. Thus the mixing normal distribution with mean $\Theta$ and variance $v$ with the mixing weight $\Theta$ being normally distributed with mean $\mu$ and variance $a$ produces a normal distribution with mean $\mu$ (same mean as the mixing weight) and variance $a+v$ (sum of the conditional variance and the mixing variance).

The mean of the conditional normal distribution is uncertain. When the mean $\Theta$ follows a normal distribution with mean $\mu$, the mixture is a normal distribution that centers around $\mu$, however, with increased variance $a+v$. The increased variance of the unconditional distribution reflects the uncertainty of the parameter $\Theta$.

Remarks

Mixture distributions can be used to model a statistical population with subpopulations, where the conditional density functions are the densities on the subpopulations, and the mixing weights are the proportions of each subpopulation in the overall population. If the population can be divided into finite number of homogeneous subpopulations, then the model would be a finite mixture as in Example 1. In certain situations, continuous mixing weights may be more appropriate (e.g. Poisson-Gamma mixture).

Many other familiar distributions are mixture distributions and are discussed in the next post.

$\text{ }$

$\text{ }$

$\text{ }$

$\copyright$ 2017 – Dan Ma

# Transformed Pareto distribution

One way to generate new probability distributions from old ones is to raise a distribution to a power. Two previous posts are devoted on this topic – raising exponential distribution to a power and raising a gamma distribution to a power. Many familiar and useful models can be generated in this fashion. For example, Weibull distribution is generated by raising an exponential distribution to a positive power. This post discusses the raising of a Pareto distribution to a power, as a result generating Burr distribution and inverse Burr distribution.

Raising to a Power

Let $X$ be a random variable. Let $\tau$ be a positive constant. The random variables $Y=X^{1/\tau}$, $Y=X^{-1}$ and $Y=X^{-1/\tau}$ are called transformed, inverse and inverse transformed, respectively.

Let $f_X(x)$, $F_X(x)$ and $S_X(x)=1-F_X(x)$ be the probability density function (PDF), the cumulative distribution function (CDF) and the survival function of the random variable $X$ (the base distribution). The goal is to express the CDFs of the “transformed” variables in terms of the base CDF $F_X(x)$. The following table shows how.

Name of Distribution Random Variable CDF
Transformed $Y=X^{1 / \tau}, \ \tau >0$ $F_Y(y)=F_X(y^\tau)$
Inverse $Y=X^{-1}$ $F_Y(y)=1-F_X(y^{-1})$
Inverse Transformed $Y=X^{-1 / \tau}, \ \tau >0$ $F_Y(y)=1-F_X(y^{-\tau})$

If the CDF of the base distribution, as represented by the random variable $X$, is known, then the CDF of the “transformed” distribution can be derived using $F_X(x)$ as shown in this table. Thus the CDF, in many cases, is a good entry point of the transformed distribution.

Pareto Information

Before the transformation, we first list out the information on the Pareto distribution. The Pareto distribution of interest here is the Type II Lomax distribution (discussed here). The following table gives several distributional quantities for a Pareto distribution with shape parameter $\alpha$ and scale parameter $\theta$.

Pareto Type II Lomax
Survival Function $S(x)=\displaystyle \biggl( \frac{\theta}{x+\theta} \biggr)^\alpha \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x >0$
Cumulative Distribution Function $F(x)=1-\displaystyle \biggl( \frac{\theta}{x+\theta} \biggr)^\alpha \ \ \ \ \ \ \ \ \ \ \ \ \ x >0$
Probability Density Function $\displaystyle f(x)=\frac{\alpha \ \theta^\alpha}{(x+\theta)^{\alpha+1}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x >0$
Mean $\displaystyle E(X)=\frac{\theta}{\alpha-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \alpha>1$
Median $\displaystyle \theta \ 2^{\frac{\alpha}{2}}-\theta$
Mode 0
Variance $\displaystyle Var(X)=\frac{\theta^2 \ \alpha}{(\alpha-1)^2 \ (\alpha-2)} \ \ \ \ \ \ \ \alpha>2$
Higher Moments $\displaystyle E(X^k)=\frac{k! \ \theta^k}{(\alpha-1) \cdots (\alpha-k)} \ \ \ \ \ \ \alpha>k \ \ \ k$ is integer
Higher Moments $\displaystyle E(X^k)=\frac{\theta^k \ \Gamma(k+1) \Gamma(\alpha-k)}{\Gamma(\alpha)} \ \ \ \ \alpha>k$

The higher moments in the general case use $\Gamma(\cdot)$, which is the gamma function.

The Distributions Derived from Pareto

Let $X$ be a random variable that has a Pareto distribution (as described in the table in the preceding section). Assume that $X$ has a shape parameter $\alpha$ and scale parameter $\theta$. Let $\tau$ be a positive number. When raising $X$ to the power $1/\tau$, the resulting distribution is a transformed Pareto distribution and is also called a Burr distribution, which then is a distribution with three parameters – $\alpha$, $\theta$ and $\tau$.

When raising $X$ to the power $-1/\tau$, the resulting distribution is an inverse transformed Pareto distribution and it is also called an inverse Burr distribution. When raising $X$ to the power -1, the resulting distribution is an inverse Pareto distribution (it does not have a special name other than inverse Pareto).

The paralogistic family of distributions is created from the Burr distribution by collapsing two of the parameters into one. Let $\alpha$, $\theta$ and $\tau$ be the parameters of a Burr distribution. By equating $\tau=\alpha$, the resulting distribution is a paralogistic distribution. By equating $\tau=\alpha$ in the corresponding inverse Burr distribution, the resulting distribution is an inverse paralogistic distribution.

Transformed Pareto = Burr

There are two ways to create the transformed Pareto distribution. One is to start with a base Pareto with shape parameter $\alpha$ and scale parameter 1 and then raise it to $1/\tau$. The scale parameter $\theta$ is added at the end. Another way is to start with a base Pareto distribution with shape parameter $\alpha$ and scale parameter $\theta^\tau$ and then raise it to the power $1/\tau$. Both ways would generate the same CDF. We take the latter approach since it generates both the CDF and moments quite conveniently.

Let $X$ be a Pareto distribution with shape parameter $\alpha$ and scale parameter $\theta^\tau$. The following table gives the distribution information on $Y^{1/\tau}$.

Burr Distribution
CDF $F_Y(y)=\displaystyle 1-\biggl( \frac{1}{(y/\theta )^\tau+1} \biggr)^\alpha$ $y >0$
Survival Function $S_Y(x)=\displaystyle \biggl( \frac{1}{(y/\theta )^\tau+1} \biggr)^\alpha$ $y >0$
Probability Density Function $\displaystyle f_Y(y)=\frac{\alpha \ \tau \ (y/\theta)^\tau}{y \ [(y/\theta)^\tau+1 ]^{\alpha+1}}$ $y >0$
Mean $\displaystyle E(Y)=\frac{\theta \ \Gamma(1/\tau+1) \Gamma(\alpha-1/\tau)}{\Gamma(\alpha)}$ $1 <\alpha \ \tau$
Median $\displaystyle \theta \ (2^{1/\alpha}-1)^{1/\tau}$
Mode $\displaystyle \theta \ \biggl(\frac{\tau-1}{\alpha \tau+1} \biggr)^{1/\tau}$ $\tau >1$, else 0
Higher Moments $\displaystyle E(Y^k)=\frac{\theta^k \ \Gamma(k/\tau+1) \Gamma(\alpha-k/\tau)}{\Gamma(\alpha)}$ $-\tau

The distribution displayed in the above table is a three-parameter distribution. It is called the Burr distribution with parameters $\alpha$ (shape), $\theta$ (scale) and $\tau$ (power).

To obtain the moments, note that $E(Y^k)=E(X^{k/\tau})$, which is derived using the Pareto moments. The Burr CDF has a closed form that is relatively easy to compute. Thus percentiles are very accessible. The moments rely on the gamma function and are usually calculated by software.

Inverse Transformed Pareto = Inverse Burr

One way to generate inverse transformed Pareto distribution is to raise a Pareto distribution with shape parameter $\alpha$ and scale parameter 1 to the power of -1 and then add the scale parameter. Another way is to raise a Pareto distribution with shape parameter $\alpha$ and scale parameter $\theta^{-\tau}$. Both ways derive the same CDF. As in the preceding case, we take the latter approach.

Let $X$ be a Pareto distribution with shape parameter $\alpha$ and scale parameter $\theta^{-\tau}$. The following table gives the distribution information on $Y^{-1/\tau}$.

Inverse Burr Distribution
CDF $F_Y(y)=\displaystyle \biggl( \frac{(y/\theta)^\tau}{(y/\theta )^\tau+1} \biggr)^\alpha$ $y >0$
Survival Function $S_Y(x)=\displaystyle 1-\biggl( \frac{(y/\theta)^\tau}{(y/\theta )^\tau+1} \biggr)^\alpha$ $y >0$
Probability Density Function $\displaystyle f_Y(y)=\frac{\alpha \ \tau \ (y/\theta)^{\tau \alpha}}{y \ [1+(y/\theta)^\tau]^{\alpha+1}}$ $y >0$
Mean $\displaystyle E(Y)=\frac{\theta \ \Gamma(1-1/\tau) \Gamma(\alpha+1/\tau)}{\Gamma(\alpha)}$ $1 <\tau$
Median $\displaystyle \theta \ \biggl[\frac{1}{ 2^{1/\alpha}-1} \biggr]^{1/\tau}$
Mode $\displaystyle \theta \ \biggl(\frac{\alpha \tau-1}{\tau+1} \biggr)^{1/\tau}$ $\alpha \tau >1$, else 0
Higher Moments $\displaystyle E(Y^k)=\frac{\theta^k \ \Gamma(1-k/\tau) \Gamma(\alpha+k/\tau)}{\Gamma(\alpha)}$ $-\alpha \tau

The distribution displayed in the above table is a three-parameter distribution. It is called the Inverse Burr distribution with parameters $\alpha$ (shape), $\theta$ (scale) and $\tau$ (power).

Note that both the moments for Burr and inverse Burr distributions are limited, the Burr limited by the product of the parameters $\alpha$ and $\tau$ and the inverse Burr limited by the parameter $\tau$. This is not surprising since the base Pareto distribution has limited moments. This is one indication that all of these distributions have a heavy right tail.

The Paralogistic Family

With the facts of the Burr distribution and the inverse Burr distribution established, paralogistic and inverse paralogistic distributions can now be obtained. A paralogistic distribution is simply a Burr distribution with $\tau=\alpha$. An inverse paralogistic distribution is simply an inverse Burr distribution with $\tau=\alpha$. In the above tables for Burr and inverse Burr, replacing $\tau$ by $\alpha$ gives the following table.

Paralogistic Distribution
CDF $F_Y(y)=\displaystyle 1-\biggl( \frac{1}{(y/\theta )^\alpha+1} \biggr)^\alpha$ $y >0$
Survival Function $S_Y(x)=\displaystyle \biggl( \frac{1}{(y/\theta )^\alpha+1} \biggr)^\alpha$ $y >0$
Probability Density Function $\displaystyle f_Y(y)=\frac{\alpha^2 \ \ (y/\theta)^\alpha}{y \ [(y/\theta)^\alpha+1 ]^{\alpha+1}}$ $y >0$
Mean $\displaystyle E(Y)=\frac{\theta \ \Gamma(1/\alpha+1) \Gamma(\alpha-1/\alpha)}{\Gamma(\alpha)}$ $1 <\alpha^2$
Median $\displaystyle \theta \ (2^{1/\alpha}-1)^{1/\alpha}$
Mode $\displaystyle \theta \ \biggl(\frac{\alpha-1}{\alpha^2+1} \biggr)^{1/\alpha}$ $\alpha >1$, else 0
Higher Moments $\displaystyle E(Y^k)=\frac{\theta^k \ \Gamma(k/\alpha+1) \Gamma(\alpha-k/\alpha)}{\Gamma(\alpha)}$ $-\alpha
Inverse Paralogistic Distribution
CDF $F_Y(y)=\displaystyle \biggl( \frac{(y/\theta)^\alpha}{(y/\theta )^\alpha+1} \biggr)^\alpha$ $y >0$
Survival Function $S_Y(x)=\displaystyle 1-\biggl( \frac{(y/\theta)^\alpha}{(y/\theta )^\alpha+1} \biggr)^\alpha$ $y >0$
Probability Density Function $\displaystyle f_Y(y)=\frac{\alpha^2 \ (y/\theta)^{\alpha^2}}{y \ [1+(y/\theta)^\alpha]^{\alpha+1}}$ $y >0$
Mean $\displaystyle E(Y)=\frac{\theta \ \Gamma(1-1/\alpha) \Gamma(\alpha+1/\alpha)}{\Gamma(\alpha)}$ $1 <\alpha$
Median $\displaystyle \theta \ \biggl[\frac{1}{ 2^{1/\alpha}-1} \biggr]^{1/\alpha}$
Mode $\displaystyle \theta \ (\alpha-1)^{1/\alpha}$ $\alpha^2 >1$, else 0
Higher Moments $\displaystyle E(Y^k)=\frac{\theta^k \ \Gamma(1-k/\alpha) \Gamma(\alpha+k/\alpha)}{\Gamma(\alpha)}$ $-\alpha^2

Inverse Pareto Distribution

The distribution that has not been discussed is the inverse Pareto. Again, we have the option of deriving it by raising to a base Pareto with just the shape parameter to -1 and then add the scale parameter. We take the approach of raising a base Pareto distribution with shape parameter $\alpha$ and scale parameter $\theta^{-1}$. Both approaches lead to the same CDF.

Inverse Pareto Distribution
CDF $F_Y(y)=\displaystyle \biggl( \frac{y}{\theta+y} \biggr)^\alpha$ $y >0$
Survival Function $S_Y(x)=\displaystyle 1-\biggl( \frac{y}{\theta+y} \biggr)^\alpha$ $y >0$
Probability Density Function $\displaystyle f_Y(y)=\frac{\alpha \ \theta \ y^{\alpha-1}}{[\theta+y ]^{\alpha+1}}$ $y >0$
Median $\displaystyle \frac{\theta}{2^{1/\alpha}-1}$
Mode $\displaystyle \theta \ \frac{\alpha-1}{2}$ $\alpha >1$, else 0
Higher Moments $\displaystyle E(Y^k)=\frac{\theta^k \ \Gamma(1-k) \Gamma(\alpha+k)}{\Gamma(\alpha)}$ $-\alpha

The distribution described in the above table is an inverse Pareto distribution with parameters $\alpha$ (shape) and $\theta$ (scale). Note that the moments are even more limited than the Burr and inverse Burr distributions. For inverse Pareto, even the mean $E(Y)$ is nonexistent.

Remarks

The Burr and paralogistic families of distributions are derived from the Pareto family (Pareto Type II Lomax). The Pareto connection helps put Burr and paralogistic distributions in perspective. The Pareto distribution itself can be generated as a mixture of exponential distributions with gamma mixing weight (see here). Thus from basic building blocks (exponential and gamma), vast families of distributions can be created, thus expanding the toolkit for modeling. The distributions discussed here are found in the appendix that is found in this link.

$\copyright$ 2017 – Dan Ma