Mixing probability distributions

This post discusses another way to generate new distributions from old, that of mixing distributions. The resulting distributions are called mixture distributions.

What is a Mixture?

First, let’s start with continuous mixture. Suppose that X is a continuous random variable with probability density function (pdf) f_{X \lvert \Theta}(x \lvert \theta) where \theta is a parameter in the pdf. There may be other parameters in the distribution but they are not relevant at the moment (e.g. these other parameters may be known constants). Suppose that the parameter \theta is an uncertain quantity and is a random variable with pdf h_\Theta(\theta) (if \Theta is a continuous random variable) or with probability function P(\Theta=\theta) (if \Theta a discrete random variable). Then taking the weighted average of f_{X \lvert \Theta}(x \lvert \theta) with h_\Theta(\theta) or P(\Theta=\theta) as weight produces a mixture distribution. The following would be pdf of the resulting mixture distribution.

    \displaystyle (1a) \ \ \ \ \ f_X(x)=\int_{-\infty}^\infty f_{X \lvert \Theta}(x \lvert \theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (1b) \ \ \ \ \ f_X(x)=\sum \limits_{\theta} \biggl(f_{X \lvert \Theta}(x \lvert \theta) \ P(\Theta=\theta) \biggr)

Thus a continuous random variable X is said to be a mixture (or has a mixture distribution) if its probability density function f_X(x) is a weighted average of a family of pdfs f_{X \lvert \Theta}(x \lvert \theta) where the weight is the density function or probability function of the random parameter \Theta. The random variable \Theta is said to be the mixing random variable and its pdf or probability function is called the mixing weight.

Another definition of mixture distribution is that the cumulative distribution function (cdf) of the random variable X is the weighted average of a family of cumulative distribution functions indexed by the mixing random variable \Theta.

    \displaystyle (2a) \ \ \ \ \ F_X(x)=\int_{-\infty}^\infty F_{X \lvert \Theta}(x \lvert \theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (2b) \ \ \ \ \ F_X(x)=\sum \limits_{\theta} \biggl(F_{X \lvert \Theta}(x \lvert \theta) \ P(\Theta=\theta) \biggr)

The idea of discrete mixture is similar. A discrete random variable X is said to be a mixture if its probability function P(X=x) or cumulative distribution function P(X \le x) is a weighted average of a family of probability functions or cumulative distributions indexed by the mixing random variable \Theta. The mixing weight can be discrete or continuous. The following shows the probability function and the cdf of a discrete mixture distribution.

    \displaystyle (3a) \ \ \ \ \ P(X=x)=\int_{-\infty}^\infty P(X=x \lvert \Theta=\theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (3b) \ \ \ \ \ P(X \le x)=\int_{-\infty}^\infty P(X \le x \lvert \Theta=\theta) \ h_\Theta(\theta) \ d \theta

    \text{ }

    \displaystyle (4a) \ \ \ \ \ P(X=x)=\sum \limits_{\theta} \biggl(P(X=x \lvert \Theta=\theta) \ P(\Theta=\theta) \biggr)

    \displaystyle (4b) \ \ \ \ \ P(X \le x)=\sum \limits_{\theta} \biggl(P(X \le x \lvert \Theta=\theta) \ P(\Theta=\theta) \biggr)

When the mixture distribution is a weighted average of finitely many distributions, it is called a n-point mixture where n is the number of distributions. Suppose that there are n distributions with pdfs

    f_1(x),f_2(x),\cdots,f_n(x) (continuous case)

or probability functions

    P(X_1=x),P(X_2=x),\cdots,P(X_n=x) (discrete case)

with mixing probabilities p_1,p_2,\cdots,p_n where the sum of the p_i is 1. Then the following gives the pdf or the probability function of the mixture distribution.

    \displaystyle (5a) \ \ \ \ \ f_X(x)=\sum \limits_{j=1}^n p_j \ f_j(x)

    \displaystyle (5b) \ \ \ \ \ P(X=x)=\sum \limits_{j=1}^n p_j \ P(X_j=x)

The cdf for the n-point mixture is similarly obtained by weighting the respective conditional cdfs as in (4b).

Distributional Quantities

Once the pdf (or probability function) or cdf of a mixture is established, the other distributional quantities can be derived from the pdf or cdf. Some of the distributional quantities can be obtained by taking weighted average of the corresponding conditional counterparts. For example, the following gives the survival function and moments of a mixture distribution. We assume that the mixing weight is continuous. For discrete mixing weight, simply replace the integral with summation.

    \displaystyle (6a) \ \ \ \ \ S_X(x)=\int_{-\infty}^\infty S_{X \lvert \Theta}(x \lvert \theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (6b) \ \ \ \ \ E(X)=\int_{-\infty}^\infty E(X \lvert \theta) \ h_\Theta(\theta) \ d \theta

    \displaystyle (6c) \ \ \ \ \ E(X^k)=\int_{-\infty}^\infty E(X^k \lvert \theta) \ h_\Theta(\theta) \ d \theta

Once the moments are obtained, all distributional quantities that are based on moments can be evaluated, calculations such as variance, skewness, and kurtosis. Note that these quantities are not the weighted average of the conditional quantities. For example, variance of a mixture is not the weighted average of the variance of the conditional distributions. In fact, the variance of a mixture has two components.

    \displaystyle (7) \ \ \ \ \ Var(X)=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)]

The relationship in (7) is called the law of total variance, which is the proper way of computing the unconditional variance Var(X). The first component E[Var(X \lvert \Theta)] is called the expected value of conditional variances, which is the weighted average of the conditional variances. The second component Var[E(X \lvert \Theta)] is called the variance of the conditional means, which represents the additional variance as a result of the uncertainty in the parameter \Theta. If there is a great deal of variation among the conditional mean E(X \lvert \Theta), the variation will be reflected in Var(X) through the second component Var[E(X \lvert \Theta)]. This will be further illustrated in the examples below.


Some of the examples discussed below have gamma distribution as mixing weights. See here for basic information on gamma distribution.

A natural interpretation of mixture is that of the uncertain parameter \Theta in the conditional random variable X \lvert \Theta describes an individual in a large population. For example, the parameter \Theta describes a certain characteristics across the units in a population. In this section, we describe the idea of mixture in an insurance setting. The example is to mix Poisson distributions with a gamma distribution as mixing weight. We will see that the resulting mixture is a negative binomial distribution, which is more dispersed than the conditional Poisson distributions.

Consider a large group of insured drivers for auto collision coverage. Suppose that the claim frequency in a year for an insured driver has a Poisson distribution with mean \theta. The conditional probability function for the number of claims in a year for an insured driver is:

    \displaystyle P(X=x \lvert \Theta=\theta)=\frac{e^{-\theta} \ \theta^x}{x!}  \ \ \ \ \ \ x=0,1,2,3,\cdots where \theta>0

The mean number of claims in a year for an insured driver is \theta. The parameter \theta reflects the risk characteristics of an insured driver. Since the population of insured drivers is large, there is uncertainty in the parameter \theta. Thus it is more appropriate to regard \theta as a random variable in order to capture the wide range of risk characteristics across the individuals in the population. As a result, the above probability function is not unconditional, but, rather, a conditional probability function of X.

What about the marginal (unconditional) probability function of X? Suppose that the pdf of \Theta has a gamma distribution with the following pdf:

    \displaystyle h_{\Theta}(\theta)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta}

where \alpha>0 and \beta>0 are known parameters of the gamma distribution. Then the unconditional pdf of X is the weighted average of the conditional Poisson distribution.

    \displaystyle \begin{aligned} P(X=x)&=\int_0^\infty P(X=x \lvert \Theta=\theta) \ h_{\Theta}(\theta) \ d \theta \\&=\int_0^\infty \frac{e^{-\theta} \ \theta^x}{x!} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta}  \\&= \frac{\beta^\alpha}{x! \Gamma(\alpha)} \int_0^\infty \theta^{x+\alpha-1} \ e^{(\beta+1) \theta} \ d \theta  \\&=\frac{\beta^\alpha}{x! \Gamma(\alpha)} \ \frac{\Gamma(x+\alpha)}{(\beta+1)^{x+\alpha}} \int_0^\infty \frac{1}{\Gamma(x+\alpha)} \ (\beta+1)^{x+\alpha} \ \theta^{x+\alpha-1} \ e^{(\beta+1) \theta} \ d \theta \\&=\frac{\beta^\alpha}{x! \Gamma(\alpha)} \ \frac{\Gamma(x+\alpha)}{(\beta+1)^{x+\alpha}} \\&=\frac{\Gamma(x+\alpha)}{x! \ \Gamma(\alpha)} \ \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha \biggl(\frac{1}{\beta+1} \biggr)^x \ \ x=0,1,2,\cdots \end{aligned}

Note that the integral in the 4th step is 1 since the integrand is a gamma density function. The probability function at the last step is that of a negative binomial distribution. If the parameter \alpha is a positive integer, then the following gives the probability function of X after simplifying the expression with gamma function.

    \displaystyle  P(X=x)=\left\{ \begin{array}{ll}                     \displaystyle  \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha &\ x=0 \\           \text{ } & \text{ } \\           \displaystyle  \frac{(x-1+\alpha) \cdots (1+\alpha) \alpha}{x!} \ \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha \biggl(\frac{1}{\beta+1} \biggr)^x &\ x=1,2,\cdots           \end{array} \right.

This probability function can be further simplified as the following:

    \displaystyle P(X=x)=\binom{x+\alpha-1}{x} \biggl(\frac{\beta}{\beta+1} \biggr)^\alpha \biggl(\frac{1}{\beta+1} \biggr)^x

where x=0,1,2,\cdots. This is one form of a negative binomial distribution. The mean is E(X)=\frac{\alpha}{\beta} and the variance is Var(X)=\frac{\alpha}{\beta} (1+\frac{1}{\beta}). The variance of the negative binomial distribution is greater than the mean. In a Poisson distribution, the mean equals the variance. Thus the unconditional claim frequency X is more dispersed than its conditional distributions. This is a characteristic of mixture distributions. The uncertainty in the parameter variable \Theta has the effect of increasing the unconditional variance of the mixture distribution of X. Recall that the variance of a mixture distribution has two components, the weighted average of the conditional variances and the variance of the conditional means. The second component represents the additional variance introduced by the uncertainty in the parameter \Theta.

More Examples

We now further illustrate the notion of mixture with a few more examples. Many familiar distributions are mixture distribution. The negative binomial distribution is a mixture of Poisson distributions with gamma mixing weight as discussed above. The Pareto distribution, more specifically Pareto Type I Lomax, is a mixture of exponential distributions with gamma mixing weight (see Example 2 below). Example 3 discusses the normal-normal mixture. Example 1 demonstrates numerical calculation involving a finite mixture.

Example 1
Suppose that the size of an auto collision claim from a large group of insured drivers is a mixture of three exponential distributions with means 5, 8 and 10 (with respective weights 0.75, 0.15 and 0.10, respectively). Discuss the mixture distribution.

The pdf and cdf are the weighted averages of the respective exponential quantities.

    \displaystyle \begin{aligned} f_X(x)&=0.75 \ (0.2 e^{-0.2x} )+0.15 \ (0.125 e^{-0.125x} )+0.10 (0.10 e^{-0.10x}) \\&\text{ } \\&=0.15 \ e^{-0.2x} +0.01875 \ e^{-0.125x}+0.01 \ e^{-0.10x} \end{aligned}

    \displaystyle \begin{aligned} F_X(x)&=0.75 \ (1- e^{-0.2x} )+0.15 \ (1- e^{-0.125x} )+0.10 (1- e^{-0.10x}) \\&\text{ } \\&=1-0.75 \ e^{-0.2x} -0.15 \ e^{-0.125x}-0.10 \ e^{-0.10x} \end{aligned}

    \displaystyle S_X(x)=0.75 \ e^{-0.2x} +0.15 \ e^{-0.125x}+0.10 \ e^{-0.10x}

For a randomly selected claim from this population of insured drivers, what is the probability that it exceeds 10? The answer is S_X(10)=0.1813. The pdf and cdf of the mixture will allow us to derive other distributional quantities such as moments and then using the moments to derive skewness and kurtosis. The moments for exponential distribution has a closed form. Then the moments of the mixture distribution is simply the weighted average of the exponential moments.

    \displaystyle E(X^k)=0.75 \ [5^k \ k!]+0.15 \ [8^k \ k!]+0.10 \ [10^k \ k!]

where k is a positive integer. The following evaluate the first four moments.

    \displaystyle E(X)=0.75 \ 5+0.15 \ 8+0.10 \ 10=5.95

    \displaystyle E(X^2)=0.75 \ (5^2 \ 2!)+0.15 \ (8^2 \ 2!)+0.10 \ (10^2 \ 2!)=76.7

    \displaystyle E(X^3)=0.75 \ (5^3 \ 3!)+0.15 \ (8^3 \ 3!)+0.10 \ (10^3 \ 3!)=1623.3

    \displaystyle E(X^4)=0.75 \ (5^4 \ 4!)+0.15 \ (8^4 \ 4!)+0.10 \ (10^4 \ 4!)=49995.6

The variance of X is Var(X)=76.7-5.95^2=41.2975. The three conditional exponential variances are 25, 64 and 100. The weighted average of these would be 38.35. Because of the uncertainty resulting from not knowing which exponential distribution the claim is from, the unconditional variance is larger than 38.35.

The skewness of a distribution is the third central moments and the kurtosis is defined as the fourth central moment. Each of them can be expressed in terms of the raw moments up to the third or fourth raw moment.

    \displaystyle \gamma=E\biggl[\biggl( \frac{X-\mu}{\sigma} \biggr)^3\biggr]=\frac{E(X^3)-3 \mu \sigma^2-\mu^3}{(\sigma^2)^{1.5}}

    \displaystyle \text{Kurt}[X]=E\biggl[\biggl( \frac{X-\mu}{\sigma} \biggr)^4\biggr]=\frac{E(X^4)-4 \mu E(X^3)+6 \mu^2 E(X^2)-3 \mu^4}{\sigma^4}

Note that \mu=E(X) and \sigma^2=Var(X). The expressions on the right hand side are in terms of the raw moments E(X^k) up to k=4. Plugging in the raw moments produces the skewness \gamma=2.5453 and kurtosis \text{Kurt}[X]=14.0097. The excess kurtosis is then 11.0097 (subtracting 3 from the kurtosis).

The skewness and excess kurtosis of an exponential distribution are always 2 and 6, respectively. One take way is that skewness and kurtosis of a mixture is not the weighted average of the conditional counterparts. In this particular case, the mixture is more skewed than the individual exponential distributions. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution (the kurtosis of a normal distribution is 3). Since the excess kurtosis for exponential distributions is 6, this mixture distribution is considered to be heavy tailed and to have higher likelihood of outliers.

Example 2 (Exponential-Gamma Mixture)
The Pareto distribution (Type I Lomax) is a mixture of exponential distributions with gamma mixing weight. Suppose X has the exponential pdf f_{X \lvert \Theta}(x \lvert \theta)=\theta \ e^{-\theta x}, where x>0, conditional on the parameter \Theta. Suppose that the pdf of \Theta has a gamma distribution with the following pdf:

    \displaystyle h_{\Theta}(\theta)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta}

Then the following gives the unconditional pdf of the random variable X.

    \displaystyle \begin{aligned} f_X(x)&=\int_0^\infty f_{X \lvert \Theta}(x \lvert \theta) \  h_{\Theta}(\theta) \ d \theta \\&=\int_0^\infty \theta \ e^{-\theta x} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ d \theta \\&= \frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^\infty \theta^{\alpha+1-1} \ e^{-(x+\beta) \theta} \ d \theta \\&= \frac{\beta^\alpha}{\Gamma(\alpha)} \frac{\Gamma(\alpha+1)}{(x+\beta)^{\alpha+1}} \int_0^\infty \frac{1}{\Gamma(\alpha+1)} \ (x+\beta)^{\alpha+1} \  \theta^{\alpha+1-1} \ e^{-(x+\beta) \theta} \ d \theta \\&=\frac{\beta^\alpha}{\Gamma(\alpha)} \frac{\Gamma(\alpha+1)}{(x+\beta)^{\alpha+1}} \\&= \frac{\alpha \ \beta^{\alpha}}{(x+\beta)^{\alpha+1}} \end{aligned}

The above is the density of the Pareto Type I Lomax distribution. Pareto distribution is discussed here. The example of exponential-gamma mixture is discussed here.

Example 3 (Normal-Normal Mixture)
Conditional on \Theta=\theta, consider a normal random variable X with mean \theta and variance v where v is known. The following is the conditional density function of X.

    \displaystyle f_{X \lvert \Theta}(x \lvert \theta)=\frac{1}{\sqrt{2 \pi v}} \ \text{exp}\biggl[-\frac{1}{2v}(x-\theta)^2 \biggr] \ \ \ -\infty<x<\infty

Suppose that the parameter \Theta is normally distributed with mean \mu and variance a (both known parameters). The following is the density function of \Theta.

    \displaystyle f_{\Theta}(\theta)=\frac{1}{\sqrt{2 \pi a}} \ \text{exp}\biggl[-\frac{1}{2a}(\theta-\mu)^2 \biggr] \ \ \ -\infty<x<\infty

Determine the unconditional pdf of X.

    \displaystyle \begin{aligned} f_X(x)&=\int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi v}} \ \text{exp}\biggl[-\frac{1}{2v}(x-\theta)^2 \biggr] \ \frac{1}{\sqrt{2 \pi a}} \ \text{exp}\biggl[-\frac{1}{2a}(\theta-\mu)^2 \biggr] \ d \theta \\&=\frac{1}{2 \pi \sqrt{va}} \int_{-\infty}^\infty \text{exp}\biggl[-\frac{1}{2v}(x-\theta)^2 -\frac{1}{2a}(\theta-\mu)^2\biggr] \ d \theta  \end{aligned}

The expression in the exponent has the following equivalent expression.

    \displaystyle \frac{(x-\theta)^2}{v}+\frac{(\theta-\mu)^2}{a}=\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 +\frac{(x-\mu)^2}{a+v}

Continuing the derivation:

    \displaystyle \begin{aligned} f_X(x)&=\frac{1}{2 \pi \sqrt{va}} \int_{-\infty}^\infty \text{exp}\biggl[-\frac{1}{2} \biggl(\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 +\frac{(x-\mu)^2}{a+v}  \biggr) \biggr] \ d \theta \\&\displaystyle =\frac{\text{exp}\biggl[\displaystyle -\frac{(x-\mu)^2}{2(a+v)} \biggr]}{2 \pi \sqrt{va}}  \int_{-\infty}^\infty  \text{exp}\biggl[\displaystyle -\frac{1}{2} \biggl(\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 \biggr) \biggr] \ d \theta \\&=\frac{\text{exp}\biggl[\displaystyle -\frac{(x-\mu)^2}{2(a+v)} \biggr]}{\sqrt{2 \pi (a+v)} }  \int_{-\infty}^\infty \frac{1}{\sqrt{2 \pi}} \sqrt{\frac{a+v}{va}} \ \text{exp}\biggl[-\frac{1}{2} \biggl(\frac{a+v}{va} \biggl[\theta-\frac{ax+v \mu}{a+v}\biggr]^2 \biggr) \biggr] \ d \theta \\&=\frac{\text{exp}\biggl[\displaystyle -\frac{(x-\mu)^2}{2(a+v)} \biggr]}{\sqrt{2 \pi (a+v)} }  \end{aligned}

Note that the integrand in the integral in the third line is the density function of a normal distribution with mean \frac{ax+v \mu}{a+v} and variance \frac{va}{a+v}. Hence the integral is 1. The last expression is the unconditional pdf of X, repeated as follows.

    \displaystyle f_X(x)=\frac{1}{\sqrt{2 \pi (a+v)}} \ \text{exp}\biggl[-\frac{(x-\mu)^2}{2(a+v)} \biggr] \ \ \ \ -\infty<x<\infty

The above is the pdf of a normal distribution with mean \mu and variance a+v. Thus the mixing normal distribution with mean \Theta and variance v with the mixing weight \Theta being normally distributed with mean \mu and variance a produces a normal distribution with mean \mu (same mean as the mixing weight) and variance a+v (sum of the conditional variance and the mixing variance).

The mean of the conditional normal distribution is uncertain. When the mean \Theta follows a normal distribution with mean \mu, the mixture is a normal distribution that centers around \mu, however, with increased variance a+v. The increased variance of the unconditional distribution reflects the uncertainty of the parameter \Theta.


Mixture distributions can be used to model a statistical population with subpopulations, where the conditional density functions are the densities on the subpopulations, and the mixing weights are the proportions of each subpopulation in the overall population. If the population can be divided into finite number of homogeneous subpopulations, then the model would be a finite mixture as in Example 1. In certain situations, continuous mixing weights may be more appropriate (e.g. Poisson-Gamma mixture).

Many other familiar distributions are mixture distributions and are discussed in the next post.

\text{ }

\text{ }

\text{ }

\copyright 2017 – Dan Ma