A catalog of parametric severity models

Various parametric continuous probability models have been presented and discussed in this blog. Most of these models have one or two parameters, and a small number have three. They are all potential candidates for models of severity in insurance applications and in other actuarial applications. This post highlights these models. The list presented here is not exhaustive; it is only a brief catalog. There are other models that are also suitable for actuarial applications but are not accounted for here. However, the list is a good place to begin. This post also serves as a navigation device (the table shown below contains links to the blog posts).

A Catalog

Many of the models highlighted here are related to the gamma distribution, either directly or indirectly. So the catalog starts with the gamma distribution at the top and then branches out to the other related models. Mathematically, the gamma distribution is a two-parameter continuous distribution defined using the gamma function. The gamma sub family includes the exponential distribution, the Erlang distribution and the chi-squared distribution. These are gamma distributions with certain restrictions on one or both of the gamma parameters. Other distributions are obtained by raising a distribution to a power. Still others are obtained by mixing distributions.

Here’s a listing of the models. Click on the links to find out more about the distributions.

Derived From        Model
Gamma function
Gamma sub families
Independent sum of gamma
Raising to a power Raising exponential to a positive power

Raising exponential to a power

Raising gamma to a power

Raising Pareto to a power

Burr sub families

The above table categorizes the distributions according to how they are mathematically derived. For example, the gamma distribution is derived from the gamma function. The Pareto distribution is mathematically an exponential-gamma mixture. The Burr distribution is a transformed Pareto distribution, i.e. obtained by raising a Pareto distribution to a positive power. Even though these distributions can be defined simply by giving the PDF and CDF, knowing their mathematical origins informs us of the specific mathematical properties of the distributions. Organizing according to the mathematical origin gives us a concise summary of the models.


Further Comments on the Table

From a mathematical standpoint, the gamma distribution is defined using the gamma function.

    \displaystyle \Gamma(\alpha)=\int_0^\infty t^{\alpha-1} \ e^{-t} \ dt

In the above integral, the argument \alpha is a positive number. The expression t^{\alpha-1} \ e^{-t} in the integrand is always positive. The area between the curve t^{\alpha-1} \ e^{-t} and the horizontal axis is \Gamma(\alpha). When this expression is normalized, i.e. divided by \Gamma(\alpha), it becomes a density function.

    \displaystyle f(t)=\frac{1}{\Gamma(\alpha)} \ t^{\alpha-1} \ e^{-t}

The above function f(t) is defined over all positive t. The integral of f(t) over all positive t is 1. Thus f(t) is a density function. It has only one parameter, \alpha, which is the shape parameter. Adding a scale parameter \theta makes it a two-parameter distribution. The result is called the gamma distribution. The following is the density function.

    \displaystyle f(x)=\frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ x>0

Both parameters \alpha and \theta are positive real numbers. The first parameter \alpha is the shape parameter and \theta is the scale parameter.

As mentioned above, many of the distributions listed in the above table are related to the gamma distribution. Some of the distributions are sub families of gamma. For example, when \alpha is a positive integer, the resulting distributions are called Erlang distributions (important in queuing theory). When \alpha=1, the results are the exponential distributions. When \alpha=\frac{k}{2} and \theta=2 where k is a positive integer, the results are the chi-squared distributions (the parameter k is referred to as the degrees of freedom). The chi-squared distribution plays an important role in statistics.
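As a quick numerical check of the special cases just described, the following Python sketch evaluates the gamma density directly from the formula above and compares it with the exponential and chi-squared densities. The function names are illustrative only (they do not come from the blog posts), and only the standard library is used.

```python
import math

def gamma_pdf(x, alpha, theta):
    """Gamma density with shape alpha and scale theta, for x > 0."""
    return x ** (alpha - 1) * math.exp(-x / theta) / (math.gamma(alpha) * theta ** alpha)

def exponential_pdf(x, theta):
    """Exponential density with mean theta."""
    return math.exp(-x / theta) / theta

def chi_squared_pdf(x, k):
    """Chi-squared density with k degrees of freedom."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

x = 1.7
# alpha = 1 reduces the gamma density to the exponential density
print(gamma_pdf(x, 1, 3.0), exponential_pdf(x, 3.0))
# alpha = k/2 and theta = 2 gives the chi-squared density with k degrees of freedom
print(gamma_pdf(x, 5 / 2, 2.0), chi_squared_pdf(x, 5))
```

Each printed pair should agree up to floating point rounding.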

Taking the sum of n independent and identically distributed exponential random variables produces the Erlang distribution, a sub family of the gamma distribution. Taking the sum of n independent exponential random variables with pairwise distinct means produces the hypoexponential distribution. On the other hand, a mixture of n exponential distributions produces the hyperexponential distribution.

The Pareto distribution (Pareto Type II Lomax) is the mixture of exponential distributions with gamma mixing weights. Despite the connection with the gamma distribution, the Pareto distribution is a heavy tailed distribution. Thus the Pareto distribution is suitable for modeling extreme losses, e.g. in modeling rare but potentially catastrophic losses.

As mentioned earlier, raising a Pareto distribution to a positive power generates the Burr distribution. Restricting the parameters in a Burr distribution in a certain way produces the paralogistic distribution. The table indicates the relationships in a concise way. For details, go into the blog posts to get more information.

Tail Weight

Another informative way to categorize the distributions listed in the table is by tail weight. At first glance, all the distributions may look similar. For example, the distributions in the table are right skewed distributions. Upon closer look, some of the distributions put more weight (probability) on the larger values. Hence some of the models are more suitable for phenomena with significantly higher probabilities of large or extreme values.

When a distribution puts significantly more probability on larger values, the distribution is said to be a heavy tailed distribution (or said to have a larger tail weight). In general, tail weight is a relative concept. For example, we say model A has a larger tail weight than model B (or model A has a heavier tail than model B). However, there are several ways to check the tail weight of a given distribution. Here are four criteria.

Tail weight measures and what to look for:
1. Existence of moments: the existence of more positive moments indicates a lighter tailed distribution.
2. Hazard rate function: an increasing hazard rate function indicates a lighter tailed distribution.
3. Mean excess loss function: an increasing mean excess loss function indicates a heavier tailed distribution.
4. Speed of decay of survival function: a survival function that decays rapidly to zero (as compared to another distribution) indicates a lighter tailed distribution.

Existence of moments
For a positive real number k, the moment E(X^k) is defined by the integral \int_0^\infty x^k \ f(x) \ dx where f(x) is the density function of the distribution in question. If the distribution puts significantly more probabilities in the larger values in the right tail, this integral may not exist (may not converge) for some k. Thus the existence of moments E(X^k) for all positive k is an indication that the distribution is a light tailed distribution.

In the above table, the only distributions for which all positive moments exist are gamma (including all gamma sub families such as exponential), Weibull, lognormal, hyperexponential, hypoexponential and beta. Such distributions are considered light tailed distributions.

When positive moments exist only up to a certain value of k, it is an indication that the distribution has a heavy right tail. All the other distributions in the table are considered heavy tailed as compared to gamma, Weibull and lognormal. Consider a Pareto distribution with shape parameter \alpha and scale parameter \theta. The moment E(X^k) exists only for k<\alpha, so the existence of the higher Pareto moments is capped by the shape parameter \alpha. If the Pareto distribution is to model a random loss, and if the mean is infinite (when \alpha \le 1), the risk is uninsurable! On the other hand, when \alpha \le 2, the Pareto variance does not exist. This shows that for a heavy tailed distribution, the variance may not be a good measure of risk.
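To make the cap on the Pareto moments concrete, here is a minimal Python sketch. It assumes the Pareto Type II (Lomax) parameterization with survival function (\theta/(x+\theta))^\alpha and uses the standard closed form E(X^k)=\theta^k \Gamma(k+1) \Gamma(\alpha-k) / \Gamma(\alpha), which is valid for 0<k<\alpha. The function name lomax_moment is made up for this illustration.

```python
import math

def lomax_moment(k, alpha, theta):
    """E(X^k) for Pareto Type II (Lomax) with shape alpha and scale theta.

    Uses theta^k * Gamma(k+1) * Gamma(alpha-k) / Gamma(alpha) for 0 < k < alpha
    and returns infinity when k >= alpha, where the defining integral diverges.
    """
    if k >= alpha:
        return math.inf
    return theta ** k * math.gamma(k + 1) * math.gamma(alpha - k) / math.gamma(alpha)

# For each shape parameter, list the first three moments (scale theta = 2)
for alpha in (0.8, 1.5, 2.0, 3.5):
    print(alpha, [lomax_moment(k, alpha, 2.0) for k in (1, 2, 3)])
```

With \alpha=0.8 every listed moment is infinite, with \alpha=1.5 or \alpha=2 only the mean is finite, and with \alpha=3.5 the first three moments are all finite.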

Hazard rate function
The hazard rate function h(x) of a random variable X is defined as the ratio of the density function and the survival function.

    \displaystyle h(x)=\frac{f(x)}{S(x)}

The hazard rate is called the force of mortality in a life contingency context and can be interpreted as the rate that a person aged x will die in the next instant. The hazard rate is called the failure rate in reliability theory and can be interpreted as the rate that a machine will fail at the next instant given that it has been functioning for x units of time.

Another indication of heavy tail weight is that the distribution has a decreasing hazard rate function. On the other hand, a distribution with an increasing hazard rate function is a light tailed distribution. If the hazard rate function is decreasing (over time if the random variable is a time variable), then the population dies off at a decreasing rate, hence a heavier tail for the distribution in question.

The Pareto distribution is a heavy tailed distribution since the hazard rate is h(x)=\alpha/x (Pareto Type I) and h(x)=\alpha/(x+\theta) (Pareto Type II Lomax). Both hazard rates are decreasing functions of x.

The Weibull distribution is a flexible model in that when its shape parameter is 0<\tau<1, the Weibull hazard rate is decreasing and when \tau>1, the hazard rate is increasing. When \tau=1, Weibull is the exponential distribution, which has a constant hazard rate.

The point about decreasing hazard rate as an indication of a heavy tailed distribution has a connection with the fourth criterion. The idea is that a decreasing hazard rate means that the survival function decays to zero slowly. This point is due to the fact that the hazard rate function generates the survival function through the following.

    \displaystyle S(x)=e^{\displaystyle -\int_0^x h(t) \ dt}

Thus if the hazard rate function is decreasing in x, then the survival function will decay more slowly to zero. To see this, let H(x)=\int_0^x h(t) \ dt, which is called the cumulative hazard rate function. As indicated above, S(x)=e^{-H(x)}. If h(x) is decreasing in x, H(x) has a lower rate of increase and consequently S(x)=e^{-H(x)} has a slower rate of decrease to zero.

In contrast, the exponential distribution has a constant hazard rate function, making it a medium tailed distribution. As explained above, any distribution having an increasing hazard rate function is a light tailed distribution.
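The hazard rate behavior described in this section can be checked with a short Python sketch. The Pareto Type II hazard rate is the one given above. The Weibull hazard rate assumes the common parameterization with survival function e^{-(x/\theta)^\tau}, where the scale parameter \theta is an assumption of this sketch. The last function illustrates S(x)=e^{-H(x)} using the Pareto cumulative hazard H(x)=\alpha \ln((x+\theta)/\theta).

```python
import math

def lomax_hazard(x, alpha, theta):
    """Hazard rate of Pareto Type II (Lomax): alpha / (x + theta)."""
    return alpha / (x + theta)

def weibull_hazard(x, tau, theta):
    """Hazard rate of a Weibull with shape tau and scale theta."""
    return (tau / theta) * (x / theta) ** (tau - 1)

def lomax_survival_from_hazard(x, alpha, theta):
    """S(x) = exp(-H(x)), with H(x) = alpha * ln((x + theta) / theta)."""
    cumulative_hazard = alpha * math.log((x + theta) / theta)
    return math.exp(-cumulative_hazard)

xs = [1, 5, 10, 50]
print([round(lomax_hazard(x, 2, 2), 4) for x in xs])      # decreasing in x
print([round(weibull_hazard(x, 0.5, 2), 4) for x in xs])  # decreasing when tau < 1
print([round(weibull_hazard(x, 2.0, 2), 4) for x in xs])  # increasing when tau > 1
# Survival recovered from the cumulative hazard vs. the closed form
print(lomax_survival_from_hazard(10, 2, 2), (2 / 12) ** 2)
```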

The mean excess loss function
The mean excess loss is the conditional expectation e_X(d)=E(X-d \lvert X>d). If the random variable X represents insurance losses, mean excess loss is the expected loss in excess of a threshold conditional on the event that the threshold has been exceeded. Suppose that the threshold d is an ordinary deductible that is part of an insurance coverage. Then e_X(d) is the expected payment made by the insurer in the event that the loss exceeds the deductible.

Whenever e_X(d) is an increasing function of the deductible d, the loss X has a heavy tailed distribution. If the mean excess loss function is a decreasing function of d, then the loss X has a lighter tailed distribution.

The Pareto distribution can also be classified as a heavy tailed distribution based on an increasing mean excess loss function. For a Pareto distribution (Type I) with shape parameter \alpha>1 and scale parameter \theta, the mean excess loss is e_X(d)=d/(\alpha-1) for d \ge \theta, which is increasing. The mean excess loss for Pareto Type II Lomax is e_X(d)=(d+\theta)/(\alpha-1), which is also increasing. Both are increasing functions of the deductible d! This means that the larger the deductible, the larger the expected claim if such a large loss occurs! If the underlying distribution for a random loss is Pareto, it is a catastrophic risk situation.

In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.
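The contrast between an increasing and a constant mean excess loss function can be seen numerically. The sketch below approximates e_X(d) as the integral of the survival function beyond d divided by S(d), using a simple midpoint rule. The parameter values (\alpha=3, \theta=2 for the Pareto Type II and rate 0.5 for the exponential) and all function names are assumptions chosen only for illustration.

```python
import math

def mean_excess(survival, d, n=20000):
    """Approximate e(d) = (integral of S(x) for x > d) / S(d).

    The substitution x = d + u / (1 - u) maps (d, infinity) to (0, 1);
    the resulting integral is approximated with the midpoint rule.
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        x = d + u / (1.0 - u)
        total += survival(x) / (1.0 - u) ** 2
    return total * h / survival(d)

def lomax_sf(x, alpha=3.0, theta=2.0):
    """Pareto Type II survival function."""
    return (theta / (x + theta)) ** alpha

def expo_sf(x, lam=0.5):
    """Exponential survival function with rate lam (mean 1 / lam)."""
    return math.exp(-lam * x)

for d in (0, 5, 20):
    # numeric Pareto value, closed form (d + theta) / (alpha - 1), numeric exponential value
    print(d, mean_excess(lomax_sf, d), (d + 2.0) / (3.0 - 1), mean_excess(expo_sf, d))
```

The Pareto column should track (d+\theta)/(\alpha-1) and grow with d, while the exponential column should stay near the mean of 2 for every deductible.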

Speed of decay of the survival function to zero
The survival function S(x)=P(X>x) captures the probability of the tail of a distribution. If the survival function of a distribution decays slowly to zero (equivalently the cdf goes slowly to one), it is another indication that the distribution is heavy tailed. This point was touched on in the discussion of the hazard rate function.

The following is a comparison of a Pareto Type II survival function and an exponential survival function. The Pareto survival function has parameters (\alpha=2 and \theta=2). The two survival functions are set to have the same 75th percentile, which is x=2. The following table is a comparison of the two survival functions.

    \displaystyle \begin{array}{llllllll} \text{ } &x &\text{ } & \text{Pareto } S_X(x) & \text{ } & \text{Exponential } S_Y(x) & \text{ } & \displaystyle \frac{S_X(x)}{S_Y(x)} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{ } &2 &\text{ } & 0.25 & \text{ } & 0.25 & \text{ } & 1  \\    \text{ } &10 &\text{ } & 0.027777778 & \text{ } & 0.000976563 & \text{ } & 28  \\  \text{ } &20 &\text{ } & 0.008264463 & \text{ } & 9.54 \times 10^{-7} & \text{ } & 8666  \\   \text{ } &30 &\text{ } & 0.00390625 & \text{ } & 9.31 \times 10^{-10} & \text{ } & 4194304  \\  \text{ } &40 &\text{ } & 0.002267574 & \text{ } & 9.09 \times 10^{-13} & \text{ } & 2.49 \times 10^{9}  \\  \text{ } &60 &\text{ } & 0.001040583 & \text{ } & 8.67 \times 10^{-19} & \text{ } & 1.20 \times 10^{15}  \\  \text{ } &80 &\text{ } & 0.000594884 & \text{ } & 8.27 \times 10^{-25} & \text{ } & 7.19 \times 10^{20}  \\  \text{ } &100 &\text{ } & 0.000384468 & \text{ } & 7.89 \times 10^{-31} & \text{ } & 4.87 \times 10^{26}  \\  \text{ } &120 &\text{ } & 0.000268745 & \text{ } & 7.52 \times 10^{-37} & \text{ } & 3.57 \times 10^{32}  \\  \text{ } &140 &\text{ } & 0.000198373 & \text{ } & 7.17 \times 10^{-43} & \text{ } & 2.76 \times 10^{38}  \\  \text{ } &160 &\text{ } & 0.000152416 & \text{ } & 6.84 \times 10^{-49} & \text{ } & 2.23 \times 10^{44}  \\  \text{ } &180 &\text{ } & 0.000120758 & \text{ } & 6.53 \times 10^{-55} & \text{ } & 1.85 \times 10^{50}  \\  \text{ } & \text{ } \\    \end{array}

Note that at the large values, the Pareto right tail retains much more probability. This is also confirmed by the ratio of the two survival functions, with the ratio approaching infinity. Using an exponential distribution to model a Pareto random phenomenon would be a severe modeling error even though the exponential distribution may be a good model for describing the loss up to the 75th percentile (in the above comparison). It is the large right tail that is problematic (and catastrophic)!

Since the Pareto survival function and the exponential survival function have closed forms, we can also look at their ratio.

    \displaystyle \frac{\text{pareto survival}}{\text{exponential survival}}=\frac{\displaystyle \frac{\theta^\alpha}{(x+\theta)^\alpha}}{e^{-\lambda x}}=\frac{\theta^\alpha e^{\lambda x}}{(x+\theta)^\alpha} \longrightarrow \infty \ \text{ as } x \longrightarrow \infty

In the above ratio, the numerator has an exponential function with a positive quantity in the exponent, while the denominator is a power of x+\theta. Since exponential growth dominates any power function, this ratio goes to infinity as x \rightarrow \infty.

In general, whenever the ratio of two survival functions diverges to infinity, it is an indication that the distribution in the numerator of the ratio has a heavier tail. When the ratio goes to infinity, the survival function in the numerator is said to decay slowly to zero as compared to the denominator.
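The comparison table above can be reproduced with a few lines of Python. The sketch assumes the Pareto Type II parameters \alpha=2 and \theta=2 stated in the text, and sets the exponential rate to \lambda=\ln(4)/2 so that both survival functions equal 0.25 at x=2. The function names are illustrative.

```python
import math

alpha, theta = 2.0, 2.0      # Pareto Type II parameters from the comparison
lam = math.log(4) / 2        # exponential rate chosen so that S_Y(2) = 0.25

def pareto_sf(x):
    """Pareto Type II survival function."""
    return (theta / (x + theta)) ** alpha

def expo_sf(x):
    """Exponential survival function with rate lam."""
    return math.exp(-lam * x)

# Columns: x, Pareto survival, exponential survival, ratio of the two
for x in (2, 10, 20, 30, 40, 60, 80, 100):
    ratio = pareto_sf(x) / expo_sf(x)
    print(f"{x:>4} {pareto_sf(x):.9f} {expo_sf(x):.3e} {ratio:.3e}")
```

The ratio column grows without bound, which is the numerical counterpart of the limit shown above.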

It is important to examine the tail behavior of a distribution when considering it as a candidate for a model. The four criteria discussed here provide a crucial way to classify parametric models according to the tail weight.


Daniel Ma

\copyright 2017 – Dan Ma

Mixing probability distributions

This post discusses another way to generate new distributions from old, that of mixing distributions. The resulting distributions are called mixture distributions.

What is a Mixture?

First, let us start with the continuous mixture. Suppose that X is a continuous random variable with probability density function (pdf) f_{X \lvert \Theta}(x \lvert \theta) where \theta is a parameter in the pdf. There may be other parameters in the distribution but they are not relevant at the moment (e.g. these other parameters may be known constants). Suppose that the parameter \theta is an uncertain quantity and is a random variable with pdf h_\Theta(\theta) (if \Theta is a continuous random variable) or with probability function P(\Theta=\theta) (if \Theta is a discrete random variable). Then taking the weighted average of f_{X \lvert \Theta}(x \lvert \theta) with h_\Theta(\theta) or P(\Theta=\theta) as weight produces a mixture distribution. The following would be the pdf of the resulting mixture distribution.

    \displaystyle (1a) \ \ \ \ \ f_X(x)=\int_{-\infty}^\infty f_{X \lvert \Theta}(x \lvert \theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (1b) \ \ \ \ \ f_X(x)=\sum \limits_{\theta} \biggl(f_{X \lvert \Theta}(x \lvert \theta) \ P(\Theta=\theta) \biggr)

Thus a continuous random variable X is said to be a mixture (or has a mixture distribution) if its probability density function f_X(x) is a weighted average of a family of pdfs f_{X \lvert \Theta}(x \lvert \theta) where the weight is the density function or probability function of the random parameter \Theta. The random variable \Theta is said to be the mixing random variable and its pdf or probability function is called the mixing weight.

Another definition of mixture distribution is that the cumulative distribution function (cdf) of the random variable X is the weighted average of a family of cumulative distribution functions indexed by the mixing random variable \Theta.

    \displaystyle (2a) \ \ \ \ \ F_X(x)=\int_{-\infty}^\infty F_{X \lvert \Theta}(x \lvert \theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (2b) \ \ \ \ \ F_X(x)=\sum \limits_{\theta} \biggl(F_{X \lvert \Theta}(x \lvert \theta) \ P(\Theta=\theta) \biggr)

The idea of discrete mixture is similar. A discrete random variable X is said to be a mixture if its probability function P(X=x) or cumulative distribution function P(X \le x) is a weighted average of a family of probability functions or cumulative distributions indexed by the mixing random variable \Theta. The mixing weight can be discrete or continuous. The following shows the probability function and the cdf of a discrete mixture distribution.

    \displaystyle (3a) \ \ \ \ \ P(X=x)=\int_{-\infty}^\infty P(X=x \lvert \Theta=\theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (3b) \ \ \ \ \ P(X \le x)=\int_{-\infty}^\infty P(X \le x \lvert \Theta=\theta) \ h_\Theta(\theta) \ d \theta


    \displaystyle (4a) \ \ \ \ \ P(X=x)=\sum \limits_{\theta} \biggl(P(X=x \lvert \Theta=\theta) \ P(\Theta=\theta) \biggr)

    \displaystyle (4b) \ \ \ \ \ P(X \le x)=\sum \limits_{\theta} \biggl(P(X \le x \lvert \Theta=\theta) \ P(\Theta=\theta) \biggr)

When the mixture distribution is a weighted average of finitely many distributions, it is called an n-point mixture where n is the number of distributions. Suppose that there are n distributions with pdfs

    f_1(x),f_2(x),\cdots,f_n(x) (continuous case)

or probability functions

    P(X_1=x),P(X_2=x),\cdots,P(X_n=x) (discrete case)

with mixing probabilities p_1,p_2,\cdots,p_n where the sum of the p_i is 1. Then the following gives the pdf or the probability function of the mixture distribution.

    \displaystyle (5a) \ \ \ \ \ f_X(x)=\sum \limits_{j=1}^n p_j \ f_j(x)

    \displaystyle (5b) \ \ \ \ \ P(X=x)=\sum \limits_{j=1}^n p_j \ P(X_j=x)

The cdf for the n-point mixture is similarly obtained by weighting the respective conditional cdfs as in (4b).

Distributional Quantities

Once the pdf (or probability function) or cdf of a mixture is established, the other distributional quantities can be derived from the pdf or cdf. Some of the distributional quantities can be obtained by taking weighted average of the corresponding conditional counterparts. For example, the following gives the survival function and moments of a mixture distribution. We assume that the mixing weight is continuous. For discrete mixing weight, simply replace the integral with summation.

    \displaystyle (6a) \ \ \ \ \ S_X(x)=\int_{-\infty}^\infty S_{X \lvert \Theta}(x \lvert \theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (6b) \ \ \ \ \ E(X)=\int_{-\infty}^\infty E(X \lvert \theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (6c) \ \ \ \ \ E(X^k)=\int_{-\infty}^\infty E(X^k \lvert \theta) \ h_\Theta(\theta) \ d \theta

Once the moments are obtained, all distributional quantities that are based on moments, such as variance, skewness and kurtosis, can be evaluated. Note that these quantities are in general not the weighted averages of the conditional quantities. For example, the variance of a mixture is not the weighted average of the variances of the conditional distributions. In fact, the variance of a mixture has two components.

    \displaystyle (7) \ \ \ \ \ Var(X)=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)]

The relationship in (7) is called the law of total variance, which is the proper way of computing the unconditional variance Var(X). The first component E[Var(X \lvert \Theta)] is called the expected value of conditional variances, which is the weighted average of the conditional variances. The second component Var[E(X \lvert \Theta)] is called the variance of the conditional means, which represents the additional variance as a result of the uncertainty in the parameter \Theta. If there is a great deal of variation among the conditional mean E(X \lvert \Theta), the variation will be reflected in Var(X) through the second component Var[E(X \lvert \Theta)]. This will be further illustrated in the examples below.


Some of the examples discussed below use the gamma distribution as the mixing weight. See here for basic information on the gamma distribution.

A natural interpretation of a mixture is that the uncertain parameter \Theta in the conditional random variable X \lvert \Theta describes an individual in a large population. For example, the parameter \Theta describes a certain characteristic that varies across the units in a population. In this section, we describe the idea of mixture in an insurance setting. The example is to mix Poisson distributions with a gamma distribution as mixing weight. We will see that the resulting mixture is a negative binomial distribution, which is more dispersed than the conditional Poisson distributions.

Consider a large group of insured drivers for auto collision coverage. Suppose that the claim frequency in a year for an insured driver has a Poisson distribution with mean \theta. The conditional probability function for the number of claims in a year for an insured driver is:

    \displaystyle P(X=x \lvert \Theta=\theta)=\frac{e^{-\theta} \ \theta^x}{x!}  \ \ \ \ \ \ x=0,1,2,3,\cdots where \theta>0

The mean number of claims in a year for an insured driver is \theta. The parameter \theta reflects the risk characteristics of an insured driver. Since the population of insured drivers is large, there is uncertainty in the parameter \theta. Thus it is more appropriate to regard \theta as a random variable in order to capture the wide range of risk characteristics across the individuals in the population. As a result, the above probability function is not unconditional, but, rather, a conditional probability function of X.

What about the marginal (unconditional) probability function of X? Suppose that \Theta has a gamma distribution with the following pdf:

    \displaystyle h_{\Theta}(\theta)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta}

where \alpha>0 and \beta>0 are known parameters of the gamma distribution. Then the unconditional probability function of X is the weighted average of the conditional Poisson probability functions.

    \displaystyle \begin{aligned} P(X=x)&=\int_0^\infty P(X=x \lvert \Theta=\theta) \ h_{\Theta}(\theta) \ d \theta \\&=\int_0^\infty \frac{e^{-\theta} \ \theta^x}{x!} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ d \theta \\&= \frac{\beta^\alpha}{x! \Gamma(\alpha)} \int_0^\infty \theta^{x+\alpha-1} \ e^{-(\beta+1) \theta} \ d \theta  \\&=\frac{\beta^\alpha}{x! \Gamma(\alpha)} \ \frac{\Gamma(x+\alpha)}{(\beta+1)^{x+\alpha}} \int_0^\infty \frac{1}{\Gamma(x+\alpha)} \ (\beta+1)^{x+\alpha} \ \theta^{x+\alpha-1} \ e^{-(\beta+1) \theta} \ d \theta \\&=\frac{\beta^\alpha}{x! \Gamma(\alpha)} \ \frac{\Gamma(x+\alpha)}{(\beta+1)^{x+\alpha}} \\&=\frac{\Gamma(x+\alpha)}{x! \ \Gamma(\alpha)} \ \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha \biggl(\frac{1}{\beta+1} \biggr)^x \ \ x=0,1,2,\cdots \end{aligned}

Note that the integral in the 4th step is 1 since the integrand is a gamma density function. The probability function at the last step is that of a negative binomial distribution. If the parameter \alpha is a positive integer, then the following gives the probability function of X after simplifying the expression with gamma function.

    \displaystyle  P(X=x)=\left\{ \begin{array}{ll}                     \displaystyle  \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha &\ x=0 \\           \text{ } & \text{ } \\           \displaystyle  \frac{(x-1+\alpha) \cdots (1+\alpha) \alpha}{x!} \ \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha \biggl(\frac{1}{\beta+1} \biggr)^x &\ x=1,2,\cdots           \end{array} \right.

This probability function can be further simplified as the following:

    \displaystyle P(X=x)=\binom{x+\alpha-1}{x} \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha \biggl(\frac{1}{\beta+1} \biggr)^x

where x=0,1,2,\cdots. This is one form of a negative binomial distribution. The mean is E(X)=\frac{\alpha}{\beta} and the variance is Var(X)=\frac{\alpha}{\beta} (1+\frac{1}{\beta}). The variance of the negative binomial distribution is greater than the mean. In a Poisson distribution, the mean equals the variance. Thus the unconditional claim frequency X is more dispersed than its conditional distributions. This is a characteristic of mixture distributions. The uncertainty in the parameter variable \Theta has the effect of increasing the unconditional variance of the mixture distribution of X. Recall that the variance of a mixture distribution has two components, the weighted average of the conditional variances and the variance of the conditional means. The second component represents the additional variance introduced by the uncertainty in the parameter \Theta.
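The Poisson-gamma result can be checked numerically. The sketch below approximates the mixing integral with a midpoint rule and compares it with the negative binomial probability function derived above. The parameter values \alpha=2 and \beta=1.5, the truncation point of the integral and the function names are all assumptions made for this illustration.

```python
import math

def poisson_pmf(x, theta):
    """Conditional Poisson probability function with mean theta."""
    return math.exp(-theta) * theta ** x / math.factorial(x)

def gamma_pdf(theta, alpha, beta):
    """Gamma mixing density with shape alpha and rate beta (the form used above)."""
    return beta ** alpha * theta ** (alpha - 1) * math.exp(-beta * theta) / math.gamma(alpha)

def mixture_pmf(x, alpha, beta, upper=40.0, n=40000):
    """Midpoint-rule approximation of the mixing integral over (0, upper)."""
    h = upper / n
    return h * sum(
        poisson_pmf(x, (i + 0.5) * h) * gamma_pdf((i + 0.5) * h, alpha, beta)
        for i in range(n)
    )

def negative_binomial_pmf(x, alpha, beta):
    """Closed form obtained at the last step of the derivation."""
    coeff = math.gamma(x + alpha) / (math.factorial(x) * math.gamma(alpha))
    return coeff * (beta / (beta + 1)) ** alpha * (1 / (beta + 1)) ** x

alpha, beta = 2.0, 1.5
for x in range(5):
    print(x, mixture_pmf(x, alpha, beta), negative_binomial_pmf(x, alpha, beta))
```

The two columns should agree to several decimal places. Truncating the integral at 40 is harmless here because the integrand is negligible beyond that point for these parameter values.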

More Examples

We now further illustrate the notion of mixture with a few more examples. Many familiar distributions are mixture distributions. The negative binomial distribution is a mixture of Poisson distributions with gamma mixing weight as discussed above. The Pareto distribution, more specifically Pareto Type II Lomax, is a mixture of exponential distributions with gamma mixing weight (see Example 2 below). Example 3 discusses the normal-normal mixture. Example 1 demonstrates numerical calculation involving a finite mixture.

Example 1
Suppose that the size of an auto collision claim from a large group of insured drivers is a mixture of three exponential distributions with means 5, 8 and 10, with respective weights 0.75, 0.15 and 0.10. Discuss the mixture distribution.

The pdf and cdf are the weighted averages of the respective exponential quantities.

    \displaystyle \begin{aligned} f_X(x)&=0.75 \ (0.2 e^{-0.2x} )+0.15 \ (0.125 e^{-0.125x} )+0.10 (0.10 e^{-0.10x}) \\&\text{ } \\&=0.15 \ e^{-0.2x} +0.01875 \ e^{-0.125x}+0.01 \ e^{-0.10x} \end{aligned}

    \displaystyle \begin{aligned} F_X(x)&=0.75 \ (1- e^{-0.2x} )+0.15 \ (1- e^{-0.125x} )+0.10 (1- e^{-0.10x}) \\&\text{ } \\&=1-0.75 \ e^{-0.2x} -0.15 \ e^{-0.125x}-0.10 \ e^{-0.10x} \end{aligned}

    \displaystyle S_X(x)=0.75 \ e^{-0.2x} +0.15 \ e^{-0.125x}+0.10 \ e^{-0.10x}

For a randomly selected claim from this population of insured drivers, what is the probability that it exceeds 10? The answer is S_X(10)=0.1813. The pdf and cdf of the mixture allow us to derive other distributional quantities such as moments, and then to use the moments to derive skewness and kurtosis. The moments of an exponential distribution have a closed form. The moments of the mixture distribution are then simply the weighted averages of the exponential moments.

    \displaystyle E(X^k)=0.75 \ [5^k \ k!]+0.15 \ [8^k \ k!]+0.10 \ [10^k \ k!]

where k is a positive integer. The following evaluates the first four moments.

    \displaystyle E(X)=0.75 \ (5)+0.15 \ (8)+0.10 \ (10)=5.95

    \displaystyle E(X^2)=0.75 \ (5^2 \ 2!)+0.15 \ (8^2 \ 2!)+0.10 \ (10^2 \ 2!)=76.7

    \displaystyle E(X^3)=0.75 \ (5^3 \ 3!)+0.15 \ (8^3 \ 3!)+0.10 \ (10^3 \ 3!)=1623.3

    \displaystyle E(X^4)=0.75 \ (5^4 \ 4!)+0.15 \ (8^4 \ 4!)+0.10 \ (10^4 \ 4!)=49995.6

The variance of X is Var(X)=76.7-5.95^2=41.2975. The three conditional exponential variances are 25, 64 and 100. The weighted average of these would be 38.35. Because of the uncertainty resulting from not knowing which exponential distribution the claim is from, the unconditional variance is larger than 38.35.
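The two components of the law of total variance in (7) can be computed separately for this example. The following Python sketch uses the weights and exponential means of Example 1 and relies on the fact that an exponential distribution with mean m has variance m^2. The variable and function names are illustrative.

```python
import math

weights = [0.75, 0.15, 0.10]
means = [5.0, 8.0, 10.0]          # exponential means from Example 1

def mixture_survival(x):
    """S_X(x) as the weighted average of the exponential survival functions."""
    return sum(p * math.exp(-x / m) for p, m in zip(weights, means))

mean = sum(p * m for p, m in zip(weights, means))
second_moment = sum(p * 2 * m ** 2 for p, m in zip(weights, means))

# First component: expected value of the conditional variances (variance is m^2)
expected_cond_var = sum(p * m ** 2 for p, m in zip(weights, means))
# Second component: variance of the conditional means
var_of_cond_means = sum(p * (m - mean) ** 2 for p, m in zip(weights, means))

print(round(mixture_survival(10), 4))
print(expected_cond_var, var_of_cond_means)
print(expected_cond_var + var_of_cond_means, second_moment - mean ** 2)
```

The first component is the 38.35 quoted above, the second component is about 2.9475, and their sum matches the unconditional variance of 41.2975 up to floating point rounding.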

The skewness of a distribution is the third standardized moment and the kurtosis is the fourth standardized moment, i.e. the third and fourth central moments divided by \sigma^3 and \sigma^4, respectively. Each of them can be expressed in terms of the raw moments up to the third or fourth raw moment.

    \displaystyle \gamma=E\biggl[\biggl( \frac{X-\mu}{\sigma} \biggr)^3\biggr]=\frac{E(X^3)-3 \mu \sigma^2-\mu^3}{(\sigma^2)^{1.5}}

    \displaystyle \text{Kurt}[X]=E\biggl[\biggl( \frac{X-\mu}{\sigma} \biggr)^4\biggr]=\frac{E(X^4)-4 \mu E(X^3)+6 \mu^2 E(X^2)-3 \mu^4}{\sigma^4}

Note that \mu=E(X) and \sigma^2=Var(X). The expressions on the right hand side are in terms of the raw moments E(X^k) up to k=4. Plugging in the raw moments produces the skewness \gamma=2.5453 and kurtosis \text{Kurt}[X]=14.0097. The excess kurtosis is then 11.0097 (subtracting 3 from the kurtosis).
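The skewness and kurtosis calculation can be reproduced with the following Python sketch, which plugs the raw moments of the mixture in Example 1 into the two formulas above. The names raw_moment, skewness and kurtosis are chosen for this illustration.

```python
import math

weights = [0.75, 0.15, 0.10]
means = [5.0, 8.0, 10.0]          # exponential means from Example 1

def raw_moment(k):
    """E(X^k) of the mixture; an exponential with mean m has E(X^k) = m^k * k!."""
    return sum(p * m ** k * math.factorial(k) for p, m in zip(weights, means))

mu = raw_moment(1)
var = raw_moment(2) - mu ** 2

# Skewness and kurtosis expressed through raw moments, as in the formulas above
skewness = (raw_moment(3) - 3 * mu * var - mu ** 3) / var ** 1.5
kurtosis = (raw_moment(4) - 4 * mu * raw_moment(3)
            + 6 * mu ** 2 * raw_moment(2) - 3 * mu ** 4) / var ** 2

print(round(skewness, 4), round(kurtosis, 4), round(kurtosis - 3, 4))
```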

The skewness and excess kurtosis of an exponential distribution are always 2 and 6, respectively. One takeaway is that the skewness and kurtosis of a mixture are not the weighted averages of the conditional counterparts. In this particular case, the mixture is more skewed than the individual exponential distributions. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution (the kurtosis of a normal distribution is 3). Since the excess kurtosis of the mixture (11.0097) exceeds the excess kurtosis of 6 for each exponential distribution, this mixture distribution is considered to have a heavier tail and a higher likelihood of outliers than the individual exponential distributions.

Example 2 (Exponential-Gamma Mixture)
The Pareto distribution (Type II Lomax) is a mixture of exponential distributions with gamma mixing weight. Suppose X has the exponential pdf f_{X \lvert \Theta}(x \lvert \theta)=\theta \ e^{-\theta x}, where x>0, conditional on the parameter \Theta=\theta. Suppose that \Theta has a gamma distribution with the following pdf:

    \displaystyle h_{\Theta}(\theta)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta}

Then the following gives the unconditional pdf of the random variable X.

    \displaystyle \begin{aligned} f_X(x)&=\int_0^\infty f_{X \lvert \Theta}(x \lvert \theta) \  h_{\Theta}(\theta) \ d \theta \\&=\int_0^\infty \theta \ e^{-\theta x} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ d \theta \\&= \frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^\infty \theta^{\alpha+1-1} \ e^{-(x+\beta) \theta} \ d \theta \\&= \frac{\beta^\alpha}{\Gamma(\alpha)} \frac{\Gamma(\alpha+1)}{(x+\beta)^{\alpha+1}} \int_0^\infty \frac{1}{\Gamma(\alpha+1)} \ (x+\beta)^{\alpha+1} \  \theta^{\alpha+1-1} \ e^{-(x+\beta) \theta} \ d \theta \\&=\frac{\beta^\alpha}{\Gamma(\alpha)} \frac{\Gamma(\alpha+1)}{(x+\beta)^{\alpha+1}} \\&= \frac{\alpha \ \beta^{\alpha}}{(x+\beta)^{\alpha+1}} \end{aligned}

The above is the density of the Pareto Type II Lomax distribution. The Pareto distribution is discussed here. The example of the exponential-gamma mixture is discussed here.
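The closed form can be checked numerically. The following is a minimal sketch that approximates the mixing integral with a midpoint rule and compares it with the Pareto density derived above. The function names, the truncation point of the integral, the step count and the parameter values \alpha=3 and \beta=2 are all assumptions made for illustration.

```python
import math


def gamma_pdf(theta, alpha, beta):
    """Gamma density with shape alpha and rate beta (the mixing weight)."""
    return (beta ** alpha * theta ** (alpha - 1)
            * math.exp(-beta * theta) / math.gamma(alpha))


def mixture_pdf(x, alpha, beta, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the mixing integral over (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        # Conditional exponential density times the gamma mixing density.
        total += theta * math.exp(-theta * x) * gamma_pdf(theta, alpha, beta)
    return total * h


def pareto_pdf(x, alpha, beta):
    """Closed-form density obtained at the end of the derivation."""
    return alpha * beta ** alpha / (x + beta) ** (alpha + 1)


alpha, beta = 3.0, 2.0  # hypothetical parameter values
for x in (0.5, 1.0, 4.0):
    print(x, mixture_pdf(x, alpha, beta), pareto_pdf(x, alpha, beta))
```

The two columns of densities should agree to several decimal places. The truncation at 60 is harmless for these parameters because the integrand decays exponentially in \theta.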

Example 3 (Normal-Normal Mixture)
Conditional on \Theta=\theta, consider a normal random variable X with mean \theta and variance v where v is known. The following is the conditional density function of X.

    \displaystyle f_{X \lvert \Theta}(x \lvert \theta)=\frac{1}{\sqrt{2 \pi v}} \ \text{exp}\biggl[-\frac{1}{2v}(x-\theta)^2 \biggr] \ \ \ -\infty<x<\infty

Suppose that the parameter \Theta is normally distributed with mean \mu and variance a (both known parameters). The following is the density function of \Theta.

    \displaystyle f_{\Theta}(\theta)=\frac{1}{\sqrt{2 \pi a}} \ \text{exp}\biggl[-\frac{1}{2a}(\theta-\mu)^2 \biggr] \ \ \ -\infty<\theta<\infty

Determine the unconditional pdf of X.

    \displaystyle \begin{aligned} f_X(x)&=\int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi v}} \ \text{exp}\biggl[-\frac{1}{2v}(x-\theta)^2 \biggr] \ \frac{1}{\sqrt{2 \pi a}} \ \text{exp}\biggl[-\frac{1}{2a}(\theta-\mu)^2 \biggr] \ d \theta \\&=\frac{1}{2 \pi \sqrt{va}} \int_{-\infty}^\infty \text{exp}\biggl[-\frac{1}{2v}(x-\theta)^2 -\frac{1}{2a}(\theta-\mu)^2\biggr] \ d \theta  \end{aligned}

The expression in the exponent has the following equivalent expression.

    \displaystyle \frac{(x-\theta)^2}{v}+\frac{(\theta-\mu)^2}{a}=\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 +\frac{(x-\mu)^2}{a+v}
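One way to verify this identity is to compare both sides as quadratic expressions in \theta. The coefficients of \theta^2 and \theta agree because

    \displaystyle \frac{1}{v}+\frac{1}{a}=\frac{a+v}{va} \ \ \ \ \text{and} \ \ \ \ \frac{x}{v}+\frac{\mu}{a}=\frac{ax+v \mu}{va}

The terms free of \theta agree because

    \displaystyle \frac{x^2}{v}+\frac{\mu^2}{a}-\frac{(ax+v \mu)^2}{va(a+v)}=\frac{(a+v)(ax^2+v \mu^2)-(ax+v \mu)^2}{va(a+v)}=\frac{va(x-\mu)^2}{va(a+v)}=\frac{(x-\mu)^2}{a+v}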

Continuing the derivation:

    \displaystyle \begin{aligned} f_X(x)&=\frac{1}{2 \pi \sqrt{va}} \int_{-\infty}^\infty \text{exp}\biggl[-\frac{1}{2} \biggl(\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 +\frac{(x-\mu)^2}{a+v}  \biggr) \biggr] \ d \theta \\&\displaystyle =\frac{\text{exp}\biggl[\displaystyle -\frac{(x-\mu)^2}{2(a+v)} \biggr]}{2 \pi \sqrt{va}}  \int_{-\infty}^\infty  \text{exp}\biggl[\displaystyle -\frac{1}{2} \biggl(\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 \biggr) \biggr] \ d \theta \\&=\frac{\text{exp}\biggl[\displaystyle -\frac{(x-\mu)^2}{2(a+v)} \biggr]}{\sqrt{2 \pi (a+v)} }  \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi}} \sqrt{\frac{a+v}{va}} \ \text{exp}\biggl[-\frac{1}{2} \biggl(\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 \biggr) \biggr] \ d \theta \\&=\frac{\text{exp}\biggl[\displaystyle -\frac{(x-\mu)^2}{2(a+v)} \biggr]}{\sqrt{2 \pi (a+v)} }  \end{aligned}

Note that the integrand in the integral in the third line is the density function of a normal distribution with mean \frac{ax+v \mu}{a+v} and variance \frac{va}{a+v}. Hence the integral is 1. The last expression is the unconditional pdf of X, repeated as follows.

    \displaystyle f_X(x)=\frac{1}{\sqrt{2 \pi (a+v)}} \ \text{exp}\biggl[-\frac{(x-\mu)^2}{2(a+v)} \biggr] \ \ \ \ -\infty<x<\infty

The above is the pdf of a normal distribution with mean \mu and variance a+v. Thus mixing normal distributions that have mean \Theta and variance v, where the mixing weight \Theta is normally distributed with mean \mu and variance a, produces a normal distribution with mean \mu (same mean as the mixing weight) and variance a+v (sum of the conditional variance and the mixing variance).

The mean of the conditional normal distribution is uncertain. When the mean \Theta follows a normal distribution with mean \mu, the mixture is a normal distribution that is still centered at \mu but has the larger variance a+v. The increased variance of the unconditional distribution reflects the uncertainty of the parameter \Theta.
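The result can also be checked by simulation. The sketch below draws \Theta from the mixing distribution first and then draws X given \Theta, so the sample mean and sample variance should be close to \mu and a+v. The function name, the seed, the sample size and the values \mu=10, a=4 and v=9 are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def simulate_normal_normal(mu, a, v, n, seed=0):
    """Draw n values of X where Theta ~ N(mu, a) and X | Theta ~ N(Theta, v)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        theta = rng.gauss(mu, math.sqrt(a))           # uncertain mean
        draws.append(rng.gauss(theta, math.sqrt(v)))  # X given that mean
    return draws


mu, a, v = 10.0, 4.0, 9.0  # hypothetical parameter values
xs = simulate_normal_normal(mu, a, v, n=100000)
print("sample mean:", statistics.fmean(xs), "theory:", mu)
print("sample variance:", statistics.variance(xs), "theory:", a + v)
```

Note that random.gauss takes a standard deviation, not a variance, hence the square roots.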


Mixture distributions can be used to model a statistical population with subpopulations, where the conditional density functions are the densities on the subpopulations, and the mixing weights are the proportions of each subpopulation in the overall population. If the population can be divided into a finite number of homogeneous subpopulations, then the model would be a finite mixture as in Example 1. In certain situations, continuous mixing weights may be more appropriate (e.g. the Poisson-Gamma mixture).
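The subpopulation view translates directly into a two-step sampling scheme: pick a subpopulation according to the mixing weights, then draw from that subpopulation's distribution. The following is a minimal sketch for a finite mixture of exponential distributions. The function name, weights and means are hypothetical and are not the values used in Example 1.

```python
import random
import statistics


def sample_finite_mixture(weights, means, n, seed=0):
    """Sample n values from a finite mixture of exponential distributions."""
    rng = random.Random(seed)
    # Step 1: choose a subpopulation (identified by its mean) for each draw.
    picks = rng.choices(means, weights=weights, k=n)
    # Step 2: draw from the chosen exponential; expovariate takes a rate.
    return [rng.expovariate(1.0 / m) for m in picks]


weights, means = [0.7, 0.3], [2.0, 10.0]  # hypothetical subpopulations
xs = sample_finite_mixture(weights, means, n=100000)
weighted_mean = sum(w * m for w, m in zip(weights, means))
print("sample mean:", statistics.fmean(xs), "weighted mean:", weighted_mean)
```

The sample mean should be close to the weighted average of the subpopulation means, since the mean of a mixture (unlike its skewness and kurtosis) is the weighted average of the conditional means.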

Many other familiar distributions are mixture distributions and are discussed in the next post.

\copyright 2017 – Dan Ma