A catalog of parametric severity models

Various parametric continuous probability models have been presented and discussed in this blog. The number of parameters in these models ranges from one to two, and in a small number of cases three. They are all potential candidates for severity models in insurance applications and in other actuarial applications. This post highlights these models. The list presented here is not exhaustive; it is only a brief catalog. There are other models that are also suitable for actuarial applications but are not accounted for here. However, the list is a good place to begin. This post also serves as a navigation device (the table shown below contains links to the blog posts).

A Catalog

Many of the models highlighted here are related to the gamma distribution either directly or indirectly. So the catalog starts with the gamma distribution at the top and then branches out to the other related models. Mathematically, the gamma distribution is a two-parameter continuous distribution defined using the gamma function. The gamma sub family includes the exponential distribution, the Erlang distribution and the chi-squared distribution. These are gamma distributions with certain restrictions on one or both of the gamma parameters. Other distributions are obtained by raising a distribution to a power. Still others are obtained by mixing distributions.

Here’s a listing of the models. Click on the links to find out more about the distributions.

Derived From ……………………… Model

Gamma function: gamma distribution
Gamma sub families: exponential distribution, Erlang distribution, chi-squared distribution
Independent sum: Erlang distribution (sum of iid exponentials), hypoexponential distribution (sum of independent exponentials with distinct means)
Exponentiation: lognormal distribution
Raising to a power
    Raising exponential to a positive power: Weibull distribution (transformed exponential)
    Raising exponential to a power: transformed exponential, inverse exponential, inverse transformed exponential (inverse Weibull) distributions
    Raising gamma to a power: transformed gamma, inverse gamma, inverse transformed gamma distributions
    Raising Pareto to a power: Burr distribution
Burr sub families: paralogistic distribution
Mixture: Pareto distribution (exponential-gamma mixture), hyperexponential distribution
Others: other models discussed in the blog (e.g. the beta distribution)

The above table categorizes the distributions according to how they are mathematically derived. For example, the gamma distribution is derived from the gamma function. The Pareto distribution is mathematically an exponential-gamma mixture. The Burr distribution is a transformed Pareto distribution, i.e. obtained by raising a Pareto distribution to a positive power. Even though these distributions can be defined simply by giving the PDF and CDF, knowing their mathematical origins informs us of the specific mathematical properties of the distributions. Organizing according to the mathematical origin gives us a concise summary of the models.


Further Comments on the Table

From a mathematical standpoint, the gamma distribution is defined using the gamma function.

    \displaystyle \Gamma(\alpha)=\int_0^\infty t^{\alpha-1} \ e^{-t} \ dt

In the above integral, the argument \alpha is a positive number. The expression t^{\alpha-1} \ e^{-t} in the integrand is always positive. The area between the curve t^{\alpha-1} \ e^{-t} and the x-axis is \Gamma(\alpha). When this expression is normalized, i.e. divided by \Gamma(\alpha), it becomes a density function.

    \displaystyle f(t)=\frac{1}{\Gamma(\alpha)} \ t^{\alpha-1} \ e^{-t}

The above function f(t) is defined over all positive t. The integral of f(t) over all positive t is 1. Thus f(t) is a density function. It has only one parameter, \alpha, which is the shape parameter. Adding the scale parameter \theta makes it a two-parameter distribution. The result is called the gamma distribution. The following is the density function.

    \displaystyle f(x)=\frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ x>0

Both parameters \alpha and \theta are positive real numbers. The first parameter \alpha is the shape parameter and \theta is the scale parameter.

As mentioned above, many of the distributions listed in the above table are related to the gamma distribution. Some of the distributions are sub families of the gamma distribution. For example, when \alpha is a positive integer, the resulting distributions are called Erlang distributions (important in queuing theory). When \alpha=1, the result is the exponential distribution. When \alpha=\frac{k}{2} and \theta=2 where k is a positive integer, the result is the chi-squared distribution (the parameter k is referred to as the degrees of freedom). The chi-squared distribution plays an important role in statistics.

Taking the sum of n independent and identically distributed exponential random variables produces the Erlang distribution, a sub family of the gamma family. Taking the sum of n independent exponential random variables with pairwise distinct means produces the hypoexponential distribution. On the other hand, the mixture of n independent exponential random variables produces the hyperexponential distribution.

The Pareto distribution (Pareto Type II Lomax) is the mixture of exponential distributions with gamma mixing weights. Despite the connection with the gamma distribution, the Pareto distribution is a heavy tailed distribution. Thus the Pareto distribution is suitable for modeling extreme losses, e.g. in modeling rare but potentially catastrophic losses.

As mentioned earlier, raising a Pareto distribution to a positive power generates the Burr distribution. Restricting the parameters in a Burr distribution in a certain way produces the paralogistic distribution. The table indicates these relationships in a concise way. For details, see the individual blog posts.

Tail Weight

Another informative way to categorize the distributions listed in the table is by looking at their tail weight. At first glance, all the distributions may look similar; for example, all the distributions in the table are right skewed. Upon closer look, some of the distributions put more weight (probability) on the larger values. Hence some of the models are more suitable for modeling phenomena with significantly higher probabilities of large or extreme values.

When a distribution puts significantly more probability on larger values, the distribution is said to be a heavy tailed distribution (or said to have a larger tail weight). In general, tail weight is a relative concept. For example, we say model A has a larger tail weight than model B (or model A has a heavier tail than model B). However, there are several ways to assess the tail weight of a given distribution. Here are four criteria.

Tail Weight Measure ……………………… What to Look For
1. Existence of moments: The existence of more positive moments indicates a lighter tailed distribution.
2. Hazard rate function: An increasing hazard rate function indicates a lighter tailed distribution.
3. Mean excess loss function: An increasing mean excess loss function indicates a heavier tailed distribution.
4. Speed of decay of survival function: A survival function that decays rapidly to zero (as compared to another distribution) indicates a lighter tailed distribution.

Existence of moments
For a positive real number k, the moment E(X^k) is defined by the integral \int_0^\infty x^k \ f(x) \ dx where f(x) is the density function of the distribution in question. If the distribution puts significantly more probabilities in the larger values in the right tail, this integral may not exist (may not converge) for some k. Thus the existence of moments E(X^k) for all positive k is an indication that the distribution is a light tailed distribution.

In the above table, the only distributions for which all positive moments exist are gamma (including all gamma sub families such as exponential), Weibull, lognormal, hyperexponential, hypoexponential and beta. Such distributions are considered light tailed distributions.

If the positive moments exist only up to a certain value of k, that is an indication that the distribution has a heavy right tail. All the other distributions in the table are considered heavy tailed as compared to gamma, Weibull and lognormal. Consider a Pareto distribution with shape parameter \alpha and scale parameter \theta. The Pareto moment E(X^k) exists only for k < \alpha, so the existence of higher moments is capped by the shape parameter \alpha. If the Pareto distribution is to model a random loss, and if the mean is infinite (when \alpha \le 1), the risk is uninsurable! On the other hand, when \alpha \le 2, the Pareto variance does not exist. This shows that for a heavy tailed distribution, the variance may not be a good measure of risk.
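The moment criterion can be illustrated numerically. The following is a minimal Python sketch (numpy only; the values \alpha=2 and \theta=2 for a Pareto Type II are assumptions chosen for the example). The sample mean of X settles near \theta/(\alpha-1)=2, while the sample mean of X^3 keeps drifting upward as the sample grows, reflecting the fact that E(X^3) does not exist when \alpha=2.

    import numpy as np

    rng = np.random.default_rng(seed=0)
    alpha, theta = 2.0, 2.0                      # assumed Pareto Type II (Lomax) parameters

    # Inverse-CDF sampling: S(x) = (theta/(x+theta))^alpha  =>  x = theta*(u^(-1/alpha) - 1)
    def sample_pareto(size):
        u = rng.uniform(size=size)
        return theta * (u ** (-1 / alpha) - 1)

    # Sample means of X^k: they settle near E(X) = theta/(alpha-1) = 2 for k = 1,
    # but keep drifting upward for k = 3 since E(X^3) is infinite when alpha = 2.
    for n in [10_000, 1_000_000, 10_000_000]:
        x = sample_pareto(n)
        print(n, x.mean(), (x ** 3).mean())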

Hazard rate function
The hazard rate function h(x) of a random variable X is defined as the ratio of the density function and the survival function.

    \displaystyle h(x)=\frac{f(x)}{S(x)}

The hazard rate is called the force of mortality in a life contingency context and can be interpreted as the rate that a person aged x will die in the next instant. The hazard rate is called the failure rate in reliability theory and can be interpreted as the rate that a machine will fail at the next instant given that it has been functioning for x units of time.

Another indication of heavy tail weight is that the distribution has a decreasing hazard rate function. On the other hand, a distribution with an increasing hazard rate function is a light tailed distribution. If the hazard rate function is decreasing (over time if the random variable is a time variable), then the population dies off at a decreasing rate, hence a heavier tail for the distribution in question.

The Pareto distribution is a heavy tailed distribution since the hazard rate is h(x)=\alpha/x (Pareto Type I) and h(x)=\alpha/(x+\theta) (Pareto Type II Lomax). Both hazard rates are decreasing functions.

The Weibull distribution is a flexible model in that when its shape parameter is 0<\tau<1, the Weibull hazard rate is decreasing and when \tau>1, the hazard rate is increasing. When \tau=1, Weibull is the exponential distribution, which has a constant hazard rate.

The point about decreasing hazard rate as an indication of a heavy tailed distribution has a connection with the fourth criterion. The idea is that a decreasing hazard rate means that the survival function decays to zero slowly. This point is due to the fact that the hazard rate function generates the survival function through the following.

    \displaystyle S(x)=e^{\displaystyle -\int_0^x h(t) \ dt}

Thus if the hazard rate function is decreasing in x, then the survival function will decay more slowly to zero. To see this, let H(x)=\int_0^x h(t) \ dt, which is called the cumulative hazard rate function. As indicated above, S(x)=e^{-H(x)}. If h(x) is decreasing in x, H(x) has a lower rate of increase and consequently S(x)=e^{-H(x)} has a slower rate of decrease to zero.
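For example, applying this relationship to the Pareto Type II (Lomax) hazard rate h(x)=\alpha/(x+\theta) given above recovers the slowly decaying Pareto survival function.

    \displaystyle H(x)=\int_0^x \frac{\alpha}{t+\theta} \ dt=\alpha \ \text{ln} \biggl( \frac{x+\theta}{\theta} \biggr) \ \ \ \ \ \ \ \ S(x)=e^{-H(x)}=\biggl(\frac{\theta}{x+\theta}\biggr)^\alpha

The cumulative hazard grows only logarithmically, so the survival function decays at a polynomial rate rather than at an exponential rate.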

In contrast, the exponential distribution has a constant hazard rate function, making it a medium tailed distribution. As explained above, any distribution having an increasing hazard rate function is a light tailed distribution.

The mean excess loss function
The mean excess loss is the conditional expectation e_X(d)=E(X-d \lvert X>d). If the random variable X represents insurance losses, mean excess loss is the expected loss in excess of a threshold conditional on the event that the threshold has been exceeded. Suppose that the threshold d is an ordinary deductible that is part of an insurance coverage. Then e_X(d) is the expected payment made by the insurer in the event that the loss exceeds the deductible.

Whenever e_X(d) is an increasing function of the deductible d, the loss X is a heavy tailed distribution. If the mean excess loss function is a decreasing function of d, then the loss X is a lighter tailed distribution.

The Pareto distribution can also be classified as a heavy tailed distribution based on an increasing mean excess loss function. For a Pareto distribution (Type I) with shape parameter \alpha and scale parameter \theta, the mean excess loss is e_X(d)=d/(\alpha-1). The mean excess loss for Pareto Type II Lomax is e_X(d)=(d+\theta)/(\alpha-1). They are both increasing functions of the deductible d! This means that the larger the deductible, the larger the expected claim if such a large loss occurs! If the underlying distribution for a random loss is Pareto, it is a catastrophic risk situation.

In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.

Speed of decay of the survival function to zero
The survival function S(x)=P(X>x) captures the probability of the tail of a distribution. If the survival function of a distribution decays slowly to zero (equivalently, the CDF goes slowly to one), that is another indication that the distribution is heavy tailed. This point is touched on in the discussion of the hazard rate function above.

The following is a comparison of a Pareto Type II survival function and an exponential survival function. The Pareto survival function has parameters (\alpha=2 and \theta=2). The two survival functions are set to have the same 75th percentile, which is x=2. The following table is a comparison of the two survival functions.

    \displaystyle \begin{array}{llllllll} \text{ } &x &\text{ } & \text{Pareto } S_X(x) & \text{ } & \text{Exponential } S_Y(x) & \text{ } & \displaystyle \frac{S_X(x)}{S_Y(x)} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{ } &2 &\text{ } & 0.25 & \text{ } & 0.25 & \text{ } & 1  \\    \text{ } &10 &\text{ } & 0.027777778 & \text{ } & 0.000976563 & \text{ } & 28  \\  \text{ } &20 &\text{ } & 0.008264463 & \text{ } & 9.54 \times 10^{-7} & \text{ } & 8666  \\   \text{ } &30 &\text{ } & 0.00390625 & \text{ } & 9.31 \times 10^{-10} & \text{ } & 4194304  \\  \text{ } &40 &\text{ } & 0.002267574 & \text{ } & 9.09 \times 10^{-13} & \text{ } & 2.49 \times 10^{9}  \\  \text{ } &60 &\text{ } & 0.001040583 & \text{ } & 8.67 \times 10^{-19} & \text{ } & 1.20 \times 10^{15}  \\  \text{ } &80 &\text{ } & 0.000594884 & \text{ } & 8.27 \times 10^{-25} & \text{ } & 7.19 \times 10^{20}  \\  \text{ } &100 &\text{ } & 0.000384468 & \text{ } & 7.89 \times 10^{-31} & \text{ } & 4.87 \times 10^{26}  \\  \text{ } &120 &\text{ } & 0.000268745 & \text{ } & 7.52 \times 10^{-37} & \text{ } & 3.57 \times 10^{32}  \\  \text{ } &140 &\text{ } & 0.000198373 & \text{ } & 7.17 \times 10^{-43} & \text{ } & 2.76 \times 10^{38}  \\  \text{ } &160 &\text{ } & 0.000152416 & \text{ } & 6.84 \times 10^{-49} & \text{ } & 2.23 \times 10^{44}  \\  \text{ } &180 &\text{ } & 0.000120758 & \text{ } & 6.53 \times 10^{-55} & \text{ } & 1.85 \times 10^{50}  \\  \text{ } & \text{ } \\    \end{array}

Note that at the large values, the Pareto right tails retain much more probabilities. This is also confirmed by the ratio of the two survival functions, with the ratio approaching infinity. Using an exponential distribution to model a Pareto random phenomenon would be a severe modeling error even though the exponential distribution may be a good model for describing the loss up to the 75th percentile (in the above comparison). It is the large right tail that is problematic (and catastrophic)!
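The table above can be reproduced with a few lines of code. The following is a minimal Python sketch (numpy only); the exponential rate is chosen so that the exponential 75th percentile is also x=2.

    import numpy as np

    alpha, theta = 2.0, 2.0          # Pareto Type II parameters used in the table
    lam = np.log(4) / 2              # exponential rate so that S_Y(2) = 0.25

    pareto_sf = lambda x: (theta / (x + theta)) ** alpha
    expon_sf = lambda x: np.exp(-lam * x)

    for x in [2, 10, 20, 30, 40, 60, 80, 100, 120, 140, 160, 180]:
        # survival probabilities and their ratio, matching the columns of the table
        print(x, pareto_sf(x), expon_sf(x), pareto_sf(x) / expon_sf(x))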

Since the Pareto survival function and the exponential survival function have closed forms, we can also look at their ratio.

    \displaystyle \frac{\text{pareto survival}}{\text{exponential survival}}=\frac{\displaystyle \frac{\theta^\alpha}{(x+\theta)^\alpha}}{e^{-\lambda x}}=\frac{\theta^\alpha e^{\lambda x}}{(x+\theta)^\alpha} \longrightarrow \infty \ \text{ as } x \longrightarrow \infty

In the above ratio, the numerator has an exponential function with a positive quantity in the exponent, while the denominator has a polynomial in x. This ratio goes to infinity as x \rightarrow \infty.

In general, whenever the ratio of two survival functions diverges to infinity, it is an indication that the distribution in the numerator of the ratio has a heavier tail. When the ratio goes to infinity, the survival function in the numerator is said to decay slowly to zero as compared to the denominator.

It is important to examine the tail behavior of a distribution when considering it as a candidate for a model. The four criteria discussed here provide a crucial way to classify parametric models according to the tail weight.

\copyright 2017 – Dan Ma

Transformed exponential distributions

The process of creating new distributions from existing ones is an important topic in the study of probability models. Such processes expand the tool kit in the modeling process. Two examples: new distributions can be generated by taking an independent sum of old ones or by mixing distributions (the result would be called a mixture). Another way to generate distributions is by raising a distribution to a power, which is the subject of this post. Start with a random variable X (the base distribution). Raising it to a constant power generates a new distribution. In this post, the base distribution is the exponential distribution. The next post discusses transforming the gamma distribution.

______________________________________________________________________________

Raising to a Power

Let X be a random variable. Let \tau be a nonzero constant. The new distribution is generated when X is raised to the power of 1 / \tau. Thus the random variable Y=X^{1 / \tau} is the subject of the discussion in this post.

When \tau >0, the distribution for Y=X^{1 / \tau} is called transformed. When \tau=-1, the distribution for Y=X^{1 / \tau} is called inverse. When \tau <0 and \tau \ne -1, the distribution for Y=X^{1 / \tau} is called inverse transformed.

If the base distribution is exponential, then raising it to 1 / \tau would produce a transformed exponential distribution for the case of \tau >0, an inverse exponential distribution for the case of \tau=-1 and an inverse transformed exponential distribution for the case \tau <0 with \tau \ne -1. If the base distribution is a gamma distribution, the three new distributions would be transformed gamma distribution, inverse gamma distribution and inverse transformed gamma distribution.

For the case of inverse transformed, we make the random variable Y=X^{-1 / \tau} by letting \tau >0. The following summarizes the definition.

Name of Distribution …… Parameter \tau …… Random Variable
Transformed: \tau >0, \ \ Y=X^{1 / \tau}
Inverse: \tau=-1, \ \ Y=X^{1 / \tau}
Inverse Transformed: \tau >0, \ \ Y=X^{-1 / \tau}

______________________________________________________________________________

Transforming Exponential

The “transformed” distributions discussed here have two parameters, \tau and \theta (the inverse exponential has only \theta, since its shape is fixed at \tau=1). The parameter \tau is the shape parameter, which comes from the exponent 1 / \tau. The scale parameter \theta is added after raising the base distribution to a power.

Let X be the random variable for the base exponential distribution. The following shows the information on the base exponential distribution.

Base Exponential
Density Function: f_X(x)=e^{-x} \ \ \ \ \ \ \ \ \ \ x>0
CDF: F_X(x)=1-e^{-x} \ \ \ \ x>0
Survival Function: S_X(x)=e^{-x} \ \ \ \ \ \ \ \ \ \ x>0

Note that the above density function and CDF do not have the scale parameter. Once the base distribution is raised to a power, the scale parameter will be added to the newly created distribution.

The following gives the CDF and the density function of the transformed exponential distribution. The density function is obtained by taking the derivative of the CDF.

    \displaystyle \begin{aligned} F_Y(y)&=P(Y \le y) \\&=P(X^{1 / \tau} \le y) \\&=P(X \le y^\tau)\\&=F_X(y^\tau) \\&=1-e^{- y^\tau} \ \ \ \ \ \ \ \ \ \ \ \ \ y>0 \end{aligned}

    \displaystyle f_Y(y)=\tau \ y^{\tau-1} \ e^{- y^\tau} \ \ \ \ \ \ \ \ \ y>0

The following gives the CDF and the density function of the inverse exponential distribution.

    \displaystyle \begin{aligned} F_Y(y)&=P(Y \le y) \\&=P(X^{-1} \le y) \\&=P(X \ge 1/y)\\&=S_X(1/y) \\&=e^{- 1/y} \ \ \ \ \ \ \ \ \ \ \ \ \ y>0 \end{aligned}

    \displaystyle f_Y(y)=\frac{1}{y^2} \ e^{- 1/y} \ \ \ \ \ \ \ \ \ y>0

The following gives the CDF and the density function of the inverse transformed exponential distribution.

    \displaystyle \begin{aligned} F_Y(y)&=P(Y \le y) \\&=P(X^{- 1 / \tau} \le y) \\&=P(X \ge y^{- \tau})\\&=S_X(y^{- \tau}) \\&=e^{- y^{- \tau}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y>0 \end{aligned}

    \displaystyle f_Y(y)= \frac{\tau}{y^{\tau+1}} \ e^{- 1/y^\tau} \ \ \ \ \ \ \ \ \ y>0

The above derivation does not involve the scale parameter. Now it is added to the results.

Transformed Exponential
CDF: F_Y(y)=1-e^{- (y/\theta)^\tau} \ \ \ \ y>0
Survival Function: S_Y(y)=e^{- (y/\theta)^\tau} \ \ \ \ y>0
Density Function: f_Y(y)=(\tau / \theta) \ (y/\theta)^{\tau-1} \ e^{- (y/\theta)^\tau} \ \ \ \ y>0

Inverse Exponential
CDF: F_Y(y)=e^{- \theta/y} \ \ \ \ y>0
Survival Function: S_Y(y)=1-e^{- \theta/y} \ \ \ \ y>0
Density Function: f_Y(y)=\frac{\theta}{y^2} \ e^{- \theta/y} \ \ \ \ y>0

Inverse Transformed Exponential
CDF: F_Y(y)=e^{- (\theta/y)^{\tau}} \ \ \ \ y>0
Survival Function: S_Y(y)=1-e^{- (\theta/y)^{\tau}} \ \ \ \ y>0
Density Function: f_Y(y)=\tau ( \theta / y )^\tau \ (1/y) \ e^{- (\theta/y)^{\tau}} \ \ \ \ y>0

The transformed exponential distribution and the inverse transformed distribution have two parameters \tau and \theta. The inverse exponential distribution has only one parameter \theta. The parameter \theta is the scale parameter. The parameter \tau, when there is one, is the shape parameter and it comes from the exponent when the exponential is raised to a power.

The above transformation starts with the exponential distribution with mean 1 (without the scale parameter) and the scale parameter \theta is added back in at the end. We can also accomplish the same result by starting with an exponential variable X with mean (scale parameter) \theta^\tau. Then raising X to 1/\tau, -1, and -1/\tau would generate the three distributions described in the above table. In this process, the scale parameter \theta is baked into the base distribution. This makes it easier to obtain the moments of the “transformed” exponential distributions since the moments would be derived from exponential moments.
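The following is a small Python simulation sketch of this recipe (numpy only; the values \tau=2 and \theta=3 are chosen purely for illustration): draw from an exponential distribution with mean \theta^\tau, raise the draws to the power 1/\tau, and compare the empirical CDF with the transformed exponential CDF in the table above.

    import numpy as np

    rng = np.random.default_rng(seed=0)
    tau, theta = 2.0, 3.0                                   # assumed shape and scale

    x = rng.exponential(scale=theta ** tau, size=100_000)   # base exponential with mean theta^tau
    y = x ** (1 / tau)                                      # transformed exponential sample

    for t in [1.0, 3.0, 6.0]:
        empirical = np.mean(y <= t)
        theoretical = 1 - np.exp(-(t / theta) ** tau)       # F_Y(t) from the table
        print(t, round(empirical, 4), round(theoretical, 4))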
______________________________________________________________________________

Connection with Weibull Distribution

Compare the density function for the transformed exponential distribution with the density of the Weibull distribution discussed here. Note that the two are identical. Thus raising an exponential distribution to 1 / \tau where \tau >0 produces a Weibull distribution.

On the other hand, raising a Weibull distribution to -1 produces an inverse Weibull distribution (by definition). Let F_X(x)=1-e^{- x^\tau} be the CDF of the base Weibull distribution where \tau >0. Let’s find the CDF of Y=X^{-1} and then add the scale parameter.

    \displaystyle F_Y(y)=P(Y \le y)=P(X \ge 1 / y)=e^{- (1/y)^\tau}

    \displaystyle F_Y(y)=e^{- (\theta/y)^\tau} (scale parameter added)

Note that the CDF of the inverse Weibull distribution is identical to the one for the inverse transformed exponential distribution. Thus the transformed exponential distribution is identical to a Weibull distribution and the inverse transformed exponential distribution is identical to an inverse Weibull distribution.

Since the Weibull distribution is the same as the transformed exponential distribution, the previous post on the Weibull distribution can inform us on the transformed exponential distribution. For example, assuming that the Weibull distribution (or transformed exponential) is a model for the time until death of a life, varying the shape parameter \tau yields different mortality patterns. The following are two graphics from the previous post.

Figure 1 (Weibull density curves for \tau=0.5, 1, 2)

Figure 2 (Weibull failure rates for \tau=0.5, 1, 2)

Figure 1 shows the Weibull density functions for different values of the shape parameter (the scale parameter \theta is fixed at 1). The curve for \tau=1 is the exponential density curve. It is clear that the green density curve (\tau=2) approaches the x-axis at a faster rate than the other two curves and thus has a lighter tail than the other two density curves. In general, the Weibull (transformed exponential) distribution with shape parameter \tau >1 has a lighter tail than the Weibull with shape parameter 0<\tau <1.

Figure 2 shows the failure rates for the Weibull (transformed exponential) distributions with the same three values of \tau. Note that the failure rate for \tau=0.5 (blue) decreases over time and the failure rate for \tau=2 increases over time. The failure rate for \tau=1 is constant since it is the exponential distribution.

What is being displayed in Figure 2 describes a general pattern. When the shape parameter is 0<\tau<1, the failure rate decreases as time increases and the Weibull (transformed exponential) distribution is a model for infant mortality, or early-life failures. Hence these Weibull distributions have a thicker tail as shown in Figure 1.

When the shape parameter is \tau >1, the failure rate increases as time increases and the Weibull (transformed exponential) distribution is a model for wear-out failures. As time goes by, the lives are fatigued and “die off.” Hence these Weibull distributions have a lighter tail as shown in Figure 1.

When \tau=1, the resulting Weibull (transformed exponential) distribution is exponential. The failure rate is constant and it is a model for random failures (failures that are independent of age).

Thus the transformed exponential family has a great deal of flexibility for modeling the failures of objects (machines, devices).

______________________________________________________________________________

Moments and Other Distributional Quantities

The moments for the three “transformed” exponential distributions are based on the gamma function. The two inverse distributions have limited moments. Since the transformed exponential distribution is identical to Weibull, its moments are identical to that of the Weibull distribution. The moments of the “transformed” exponential distributions are E(Y)=E(X^{1 / \tau}) where X has an exponential distribution with mean (scale parameter) \theta^\tau. See here for the information on exponential moments. The following shows the moments of the “transformed” exponential distributions.

Name of Distribution …… Moment
Transformed Exponential: E(Y^k)=\theta^k \Gamma(1+k/\tau), \ \ k >- \tau
Inverse Exponential: E(Y^k)=\theta^k \Gamma(1-k), \ \ k <1
Inverse Transformed Exponential: E(Y^k)=\theta^k \Gamma(1-k/\tau), \ \ k <\tau

The function \Gamma(\cdot) is the gamma function. The transformed exponential moment E(Y^k) exists for all k >- \tau. The moments are limited for the other two distributions. The first moment E(Y) does not exist for the inverse exponential distribution. The inverse transformed exponential moment E(Y^k) exists only for k<\tau. Thus the inverse transformed exponential mean and variance exist only if the shape parameter \tau is larger than 2.
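For example, the transformed exponential entry follows directly from the exponential moments mentioned above: if X is exponential with mean \theta^\tau, then

    \displaystyle E(Y^k)=E(X^{k/\tau})=\Gamma \biggl(1+\frac{k}{\tau} \biggr) \ (\theta^\tau)^{k/\tau}=\theta^k \ \Gamma \biggl(1+\frac{k}{\tau} \biggr) \ \ \ \ \ \ \ \ k>-\tau

since the exponential moment E(X^p) exists only for p>-1, i.e. for k/\tau>-1.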

The distributional quantities that are based on moments can be calculated (e.g. variance, skewness and kurtosis) when the moments are available. For all three "transformed" exponential distributions, percentiles are easily computed since the CDFs contain only one instance of the unknown y. The following gives the mode of the three distributions.

Name of Distribution …… Mode
Transformed Exponential: \displaystyle \theta \biggl(\frac{\tau-1}{\tau} \biggr)^{1/\tau} for \tau >1, else 0
Inverse Exponential: \theta / 2
Inverse Transformed Exponential: \displaystyle \theta \biggl(\frac{\tau}{\tau+1} \biggr)^{1/\tau}


\copyright 2017 – Dan Ma

The Weibull distribution

Mathematically, the Weibull distribution has a simple definition. It is mathematically tractable. It is also a versatile model. The Weibull distribution is widely used in life data analysis, particularly in reliability engineering. In addition to analysis of fatigue data, the Weibull distribution can also be applied to other engineering problems, e.g. for modeling the so called weakest link model. This post gives an introduction to the Weibull distribution.

_______________________________________________________________________________________________

Defining the Weibull Distribution

A random variable Y is said to follow a Weibull distribution if Y has the following density function

    \displaystyle f(y)=\frac{\tau}{\lambda} \ \biggl( \frac{y}{\lambda} \biggr)^{\tau-1} \text{exp} \biggl[- \biggl(\frac{y}{\lambda}\biggr)^\tau \ \biggr] \ \ \ ;y>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

where \tau>0 and \lambda>0 are some fixed constants. The notation \text{exp}[x] refers to the exponential function e^{x}. As defined here, the Weibull distribution is a two-parameter distribution with \tau being the shape parameter and \lambda being the scale parameter. The following is the cumulative distribution function (CDF).

    \displaystyle F(y)=1- \text{exp} \biggl[- \biggl(\frac{y}{\lambda}\biggr)^\tau \ \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

_______________________________________________________________________________________________

Connection with the Exponential Distribution

The Weibull distribution can also arise naturally from the random sampling of an exponential random variable. A better way to view Weibull is through the lens of exponential. Taking an observation from an exponential distribution and raising it to a positive power will result in a Weibull observation. Specifically, the random variable \displaystyle Y=X^{\frac{1}{\tau}} has the same CDF as in (2) if X is an exponential random variable with mean \lambda^\tau. To see this, consider the following:

    \displaystyle \begin{aligned} P[Y \le y]&=P[X^{\frac{1}{\tau}} \le y] \\&=P[X \le y^{\tau}] \\&=1-e^{-\frac{y^{\tau}}{\lambda^\tau}} \\&=1-e^{-(\frac{y}{\lambda})^\tau}  \end{aligned}

_______________________________________________________________________________________________

Basic Properties

The idea of the Weibull distribution as a power of an exponential distribution simplifies certain calculation on the Weibull distribution. For example, a raw moment of the Weibull distribution is simply another raw moment of the exponential distribution. For an exponential random variable X with mean \theta, the raw moments E[X^k] are (details can be found here):

    \displaystyle  E[X^k]=\left\{ \begin{array}{ll}                     \displaystyle  \Gamma(1+k) \ \theta^k &\ k>-1 \\           \text{ } & \text{ } \\           \displaystyle  k! \ \theta^k &\ k \text{ is a positive integer}           \end{array} \right.

where \Gamma(\cdot) is the gamma function. To see how the gamma function can be evaluated in Microsoft Excel, see the last section of this post.
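If a spreadsheet is not available, the gamma function can also be evaluated with a scientific computing library. The following is a minimal Python sketch using scipy.special.gamma (the value \tau=2.2 anticipates Example 3 below).

    from scipy.special import gamma

    tau = 2.2
    print(gamma(1 + 1 / tau))   # about 0.8856, used for the Weibull mean in Example 3
    print(gamma(1 + 2 / tau))   # about 0.9649, used for the second moment in Example 3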

For the Weibull random variable Y with parameters \tau and \lambda, i.e. Y=X^{1 / \tau} where X is the exponential random variable with mean \lambda^\tau, the following shows the mean and higher moments.

    \displaystyle E[Y]=E[X^{\frac{1}{\tau}}]=\Gamma \biggl(1+\frac{1}{\tau} \biggr) \ \lambda \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

    \displaystyle E[Y^k]=E[X^{\frac{k}{\tau}}]=\Gamma \biggl(1+\frac{k}{\tau} \biggr) \ \lambda^k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

With the moments established, several other distributional quantities that are based on moments can also be established. The following shows the variance, skewness and kurtosis.

    \displaystyle Var[Y]=E[Y^2]-E[Y]^2=\biggl[ \Gamma \biggl(1+\frac{2}{\tau} \biggr)-\Gamma \biggl(1+\frac{1}{\tau} \biggr)^2 \biggr] \ \lambda^2 \ \ \ \ \ \ \ \ \ \ \ \ \ (5)

    \displaystyle \begin{aligned} \gamma_1&=\frac{E[(Y-\mu)^3]}{\sigma^3} \\&=\frac{E[Y^3]-3 \ \mu \ E[Y^2]+2 \ \mu^3}{\sigma^3} \\&\displaystyle =\frac{\Gamma \biggl(1+\frac{3}{\tau} \biggr) \lambda^3-3 \ \mu \ \Gamma \biggl(1+\frac{2}{\tau} \biggr) \lambda^2+2 \ \mu^3}{(\sigma^2)^{\frac{3}{2}}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6) \end{aligned}

    \displaystyle \begin{aligned} \gamma_2&=\frac{E[(Y-\mu)^4]}{\sigma^4} \\&=\frac{E[Y^4]-4 \ \mu \ E[Y^3]+6 \ \mu^2 \ E[Y^2]-3 \ \mu^4}{\sigma^4} \\&\displaystyle =\frac{\Gamma \biggl(1+\frac{4}{\tau} \biggr) \lambda^4-4 \ \mu \ \Gamma \biggl(1+\frac{3}{\tau} \biggr) \lambda^3+6 \ \mu^2 \ \Gamma \biggl(1+\frac{2}{\tau} \biggr) \lambda^2-3 \ \mu^4}{\sigma^4} \ \ \ \ \ (7) \end{aligned}

The notation \gamma_1 here is the skewness. The notation \gamma_2 is the kurtosis. The excess kurtosis is \gamma_2-3. In some sources, the notation \gamma_2 denotes the excess kurtosis. Of course \mu and \sigma^2 are the mean and variance, respectively.

Another calculation that is easily accessible for the Weibull distribution is that of the percentiles. It is easy to solve for y in the CDF in (2). For example, to find the median, set the CDF equal to 0.5 and solve for y, producing the following.

    \displaystyle \text{median}=\lambda \ \biggl(\text{ln}(2) \biggr)^{\frac{1}{\tau}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)
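More generally, setting the CDF in (2) equal to p and solving for y gives the 100p-th percentile; the median in (8) is the case p=0.5.

    \displaystyle y_p=\lambda \ \biggl(-\text{ln}(1-p) \biggr)^{\frac{1}{\tau}}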

Another basic and important property to examine is the failure rate. The failure rate of a distribution is the ratio of the density function to the survival function. The following is the failure rate of the Weibull distribution.

    \displaystyle \mu(t)=\frac{f(t)}{1-F(t)}=\frac{\tau}{\lambda} \ \biggl( \frac{t}{\lambda} \biggr)^{\tau-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)

See here for a discussion of the failure rate in conjunction with the exponential distribution. Suppose that the distribution in question is a lifetime distribution (time until termination or death). Then the failure rate \mu(t) can be interpreted as the rate of failure at the next instant given that the life has survived to time t.

_______________________________________________________________________________________________

When the Parameters Vary

The discussion in the previous section might give the impression that all Weibull distributions (when the parameters vary) behave in the same way. We now look at examples showing that as \tau (shape parameter) and/or \lambda (scale parameter) vary, the distribution will exhibit markedly different behavior. Note that when \tau=1, the Weibull distribution is reduced to the exponential distribution.

Example 1
The following diagram shows the PDFs of the Weibull distribution with \tau=0.5, \tau=1 and \tau=2 where \lambda=1 in all three cases.

Figure 1 (Weibull density curves for \tau=0.5, 1, 2 with \lambda=1)

Figure 1 shows the effect of the shape parameter taking on different values while keeping the scale parameter fixed. The effect is very pronounced on the skewness. All three density curves are right skewed. The PDF with \tau=0.5 (the blue curve) has a very strong right skew. The PDF with \tau=1 (the red curve) is exponential and has, by comparison, a much smaller skewness. The PDF with \tau=2 looks almost symmetric, though there is a clear and small right skew. This observation is borne out by the calculation. The skewness coefficients are \gamma_1=6.62 (blue curve), \gamma_1=2 (red curve) and \gamma_1=0.6311 (green curve).

Another clear effect of the shape parameter is the thickness of the tail (in this case the right tail). Figure 1 suggests that the PDF with \tau=0.5 (the blue curve) is higher than the other two density curves on the interval x>2. As a result, the blue curve has more probability mass in the right tail. Thus the blue curve has a thicker tail compared to the other two PDFs. For a numerical confirmation, the following table compares the probabilities in the right tail.

    \displaystyle \begin{array}{lllllll} \text{ } &\text{ } & \tau=0.5  & \text{ } & \tau=1 & \text{ } & \tau=2  \\  \text{ } & \text{ } &\text{Blue Curve} & \text{ } & \text{Red Curve} & \text{ }& \text{Green Curve} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  P[X>2] &\text{ } & 0.2431 & \text{ } & 0.1353  & \text{ } & 0.0183\\     \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  P[X>3] &\text{ } & 0.1769 & \text{ } & 0.0498 & \text{ } & 0.0003355 \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  P[X>4] &\text{ } & 0.1353 & \text{ } & 0.0183 & \text{ } & \displaystyle 1.125 \times 10^{-7} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  P[X>5] &\text{ } & 0.1069 & \text{ } & 0.00674 & \text{ } & 1.389 \times 10^{-11} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  P[X>10] &\text{ } & 0.0423 & \text{ } & 0.0000454 & \text{ } & 3.72 \times 10^{-44} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  P[X>25] &\text{ } & 0.00673 & \text{ } & 1.39 \times 10^{-11} & \text{ } & 3.68 \times 10^{-272}  \end{array}

It is clear from the above table that the Weibull distribution with the blue curve assigns more probabilities to the higher values. The mean of the distribution for the blue curve is 2. The right tail x>10 (over 5 times the mean) contains 4.23% of the probability mass (a small probability for sure but not negligible). The right tail x>25 (over 12.5 times the mean) still has a small probability of 0.00673 that cannot be totally ignored. On the other hand, the Weibull distribution for the green curve has a light tail. The mean of the distribution for the green curve is about 0.89. At x>4 (over 4.5 times its mean), the tail probability is already negligible at 1.125 \times 10^{-7}. At x>10 (over 11 times its mean), the tail probability is 3.72 \times 10^{-44}, practically zero.

Let’s compare the density curves in Figure 1 with their failure rates. The following figure shows the failure rates for these three Weibull distributions.

Figure 2 (Weibull failure rates for \tau=0.5, 1, 2)

According to the definition in (9), the following shows the failure rate function for the three Weibull distributions.

    \mu(y)=\frac{0.5}{y^{0.5}} \ \ \ \ \ y>0 \ \ \ \ \ \ \ \tau=0.5 \ \ \text{(blue curve)}

    \mu(y)=1 \ \ \ \ \ \ \ \ y>0 \ \ \ \ \ \ \ \tau=1 \ \ \text{(red curve)}

    \mu(y)=2y \ \ \ \ \ \ y>0 \ \ \ \ \ \ \ \tau=2 \ \ \text{(green curve)}

The blue curve in Figure 1 (\tau=0.5) has a decreasing failure rate as shown in Figure 2. The failure rate function is constant for the case of \tau=1 (the exponential case). It is an increasing function for the case of \tau=2. This comparison shows that the Weibull distribution is particularly useful for engineers and researchers who study the reliability of machines and devices. If the engineers believe that the failure rate is constant, then an exponential model is appropriate. If they believe that the failure rate increases with time or age, then a Weibull distribution with shape parameter \tau>1 is more appropriate. If the engineers believe that the failure rate decreases with time, then a Weibull distribution with shape parameter \tau<1 is more appropriate.

Example 2
We now compare Weibull distributions with various values for \lambda (scale parameter) while keeping the shape parameter \tau fixed. The following shows the density curves for the Weibull distributions with \lambda=1, 2, 3 while keeping \tau=2.

Figure 3 (Weibull density curves for \lambda=1, 2, 3 with \tau=2)

The effect of the scale parameter \lambda is to compress or stretch out the standard Weibull density curve, i.e. the Weibull distribution with \lambda=1. For example, the density function for \lambda=2 is obtained by stretching out the density curve for the one with \lambda=1. The same overall shape is maintained while the density curve is being stretched or compressed. According to (3), the mean of the transformed distribution is increased (stretching) or decreased (compressing). For example, as \lambda is increased from 1 to 2, the mean has a two-fold increase. As the density curve is stretched, the resulting distribution is more spread out and the peak of the density curve decreases. The overall effect of changing the scale parameter is essentially a change in the scale in the x-axis.

The next example is a computational exercise.

Example 3
The time until failure (in hours) of a semiconductor device has a Weibull distribution with shape parameter \tau=2.2 and scale parameter \lambda=400.

  • Give the density function and the survival function.
  • Determine the probability that the device will last at least 500 hours.
  • Determine the probability that the device will last at least 600 hours given that it has been running for over 500 hours.
  • Find the mean and standard deviation of the time until failure.
  • Determine the failure rate function of the Weibull time until failure.

To obtain the density function, the survival function and the failure rate, follow the relationships in (1), (2) and (9).

    \displaystyle f(y)=\frac{2.2}{400} \ \biggl( \frac{y}{400} \biggr)^{1.2} \text{exp} \biggl[- \biggl(\frac{y}{400}\biggr)^{2.2} \ \biggr]

    \displaystyle S(y)=\text{exp} \biggl[- \biggl(\frac{y}{400}\biggr)^{2.2} \biggr]

    \displaystyle \mu(y)=\frac{2.2}{400} \ \biggl( \frac{y}{400} \biggr)^{1.2}

Note that the Weibull failure rate is the ratio of the density function to the survival function. In this case, the failure rate is an increasing function of y. Since Y is measured on a time scale, this is a model for machines that wear out over time (see the next section).

The probability that the device will last over 500 hours is e^{-(\frac{500}{400})^{2.2}}=0.1952. The unconditional probability that the device will last over 600 hours is e^{-(\frac{600}{400})^{2.2}}=0.0872. The conditional probability that the device will last more than 600 hours given that it has lasted more than 500 hours is the ratio \frac{S(600)}{S(500)}=0.4465.

To find the mean and variance, we need to evaluate the gamma function. Using Excel, we obtain the following two values of the gamma function (as shown here):

    \Gamma(1+\frac{1}{2.2})=0.88562476

    \Gamma(1+\frac{2}{2.2})=0.964912489

The mean and standard deviation of the time until failure are:

    E[Y]=\Gamma(1+\frac{1}{2.2}) \times 400=354.2499042

    E[Y^2]=\Gamma(1+\frac{2}{2.2}) \times 400^2=154385.9982

    Var[Y]=E[Y^2]-E[Y]^2=28893.00363

    \sigma_Y=\sqrt{Var[Y]}=169.9794212
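The calculations in Example 3 can be checked with a few lines of code. The following is a minimal Python sketch using scipy.stats.weibull_min, whose parametrization uses the same shape \tau and scale \lambda as in (1).

    from scipy.stats import weibull_min

    tau, lam = 2.2, 400
    Y = weibull_min(tau, scale=lam)        # Weibull time until failure

    print(Y.sf(500))                        # P(Y > 500), about 0.1952
    print(Y.sf(600))                        # P(Y > 600), about 0.0872
    print(Y.sf(600) / Y.sf(500))            # P(Y > 600 | Y > 500), about 0.4465
    print(Y.mean(), Y.std())                # about 354.25 and 169.98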

_______________________________________________________________________________________________

The Weibull Failure Rates

Looking at the failure rate function indicated in (9) and looking at Figure 2, it is clear that when the shape parameter 0<\tau<1, the failure rate decreases with time (if the distribution is a model for the time until death of a life). When the shape parameter \tau=1, the failure rate is constant. When the shape parameter \tau>1, the failure rate increases with time. As a result, the Weibull family of distribution has a great deal of flexibility for modeling the failures of objects (machines, devices).

When the shape parameter 0<\tau<1, the failure rate decreases with time. Such a Weibull distribution is a model for infant mortality, or early-life failures. When the shape parameter \tau=1 (or near 1), the failure rate is constant or near constant. The resulting Weibull distribution (an exponential model) is a model for random failures (failures that are independent of age). When the shape parameter \tau>1, the failure rate increases with time. The resulting Weibull distribution is a model for wear-out failures.

In some applications, it may be necessary to model each phase of a lifetime separately, e.g. the early phase with a Weibull distribution with 0<\tau<1, the useful phase with a Weibull distribution with \tau close to 1 and the wear-out phase with a Weibull distribution with \tau>1. The resulting failure rate curve resembles a bathtub curve. The following is an idealized bathtub curve.

Figure 4 (an idealized bathtub failure rate curve)

The blue part of the bathtub curve is the early phase of the lifetime, which is characterized by a decreasing failure rate. This is the early-life period in which the defective products die off and are taken out of the study. The next period is the useful-life period, the red part of the curve, in which failures are random and independent of age. In this phase, the failure rate is constant or near constant. The green part of the bathtub curve is characterized by increasing failure rates, which is the wear-out phase of the lifetime being studied.

_______________________________________________________________________________________________

Weakest Link Model

Another attractive feature of the Weibull model is that it can be used to describe the so called weakest link model. Consider a machine or device that has multiple components. Suppose that the device dies or fails when any one of the components fails. The lifetime of such a machine or device is the time to the first failure. Such a lifetime model is called the weakest link model. It can be shown that under these conditions a Weibull distribution is a good model for the distribution of the lifetime of such a machine or device.

If the times until failure of the individual components are independent and identically distributed Weibull random variables, then it follows that the minimum of the Weibull random variables is also a Weibull random variable. To see this, let X_1,X_2,\cdots,X_n be independent and identically distributed Weibull random variables. Let \tau and \lambda be the parameters for the common Weibull distribution. Let Y=\text{min}(X_1,X_2,\cdots,X_n). The following gives the survival function of Y.

    \displaystyle \begin{aligned} P[Y >y]&=P[\text{all } X_i >y]  \\&=\biggl(e^{-(\frac{y}{\lambda})^\tau} \biggr)^n \\&=e^{-n (\frac{y}{\lambda})^\tau} \\&=e^{-(\frac{y}{\lambda_1})^\tau}  \end{aligned}

where \lambda_1=\lambda / n^{\frac{1}{\tau}}. This shows that Y has a Weibull distribution with shape parameter \tau (as before) and scale parameter \lambda_1. Under the condition that the times to failure for the multiple components are identically Weibull distributed, the lifetime of the device is also a Weibull model.
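A quick simulation makes this concrete. The following is a minimal Python sketch (using numpy and scipy; the values \tau=1.5, \lambda=10 and n=5 are chosen only for illustration) comparing the simulated device lifetime with the Weibull distribution that has scale \lambda_1=\lambda/n^{1/\tau}.

    import numpy as np
    from scipy.stats import weibull_min

    rng = np.random.default_rng(seed=1)
    tau, lam, n = 1.5, 10.0, 5                  # assumed shape, scale, number of components

    # lifetimes of the n components; the device fails at the first component failure
    samples = weibull_min(tau, scale=lam).rvs(size=(100_000, n), random_state=rng)
    device_life = samples.min(axis=1)

    lam1 = lam / n ** (1 / tau)                 # predicted scale of the minimum
    print(device_life.mean(), weibull_min(tau, scale=lam1).mean())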

_______________________________________________________________________________________________

The Moment Generating Function

The post is concluded with a comment on the moment generating function for the Weibull distribution. Note that relationship (4) indicates that all positive moments for the Weibull distribution exist. A natural question is on whether the moment generating function (MGF) exists. It turns out that MGF does not always exist for the Weibull distribution. The MGF exists for the Weibull distribution whenever the shape parameter \tau \ge 1. However the MGF cannot be expressed in terms of any familiar functions. Instead, the Weibull MGF can be expressed as a power series.

    \displaystyle M(t)=\sum \limits_{n=0}^\infty \ \frac{(\lambda t)^n}{n!} \ \Gamma \biggl(1+\frac{n}{\tau} \biggr) \ \ \ \ \ \tau \ge 1

To see this, start with the power series for e^x.

    \displaystyle e^X=\sum \limits_{n=0}^\infty \ \frac{X^n}{n!}

    \displaystyle e^{t X}=\sum \limits_{n=0}^\infty \ \frac{(t X)^n}{n!}

    \displaystyle M(t)=E[e^{t X}]=\sum \limits_{n=0}^\infty \ \frac{t^n}{n!} \ E[X^n]

    \displaystyle M(t)=\sum \limits_{n=0}^\infty \ \frac{(\lambda t)^n}{n!} \ \Gamma \biggl(1+\frac{n}{\tau} \biggr)

Since the positive moments exist for the Weibull distribution, the higher moments from (4) are plugged into the power series. It can be shown that the last series converges when \tau \ge 1.

When 0<\tau<1, the power series does not converge. For a specific example, let \tau=0.5. Then the n-th term of the series is \frac{(\lambda t)^n}{n!} \ \Gamma(1+2n)=\frac{(2n)!}{n!} \ (\lambda t)^n, and the ratio of consecutive terms simplifies to 2 \lambda t \ (2n+1), which goes to infinity as n \rightarrow \infty. Thus the Weibull distribution with shape parameter \tau=0.5 is an example of a distribution where all the positive moments exist but the MGF does not exist.

_______________________________________________________________________________________________

Further Information

Further information can be found here and here.

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}

The hyperexponential and hypoexponential distributions

This post continues with the discussion on the exponential distribution. The previous posts on the exponential distribution are an introduction, a post on the relation with the Poisson process and a post on more properties. This post discusses the hyperexponential distribution and the hypoexponential distribution.

Suppose X_1,X_2,\cdots,X_n are independent exponential random variables. If they are identically distributed where the common mean is \frac{1}{\lambda} (the common rate is \lambda), then the sum X_1+X_2+\cdots + X_n follows the gamma distribution with shape parameter n and rate parameter \lambda. Since the shape parameter in this case is an integer n, the gamma distribution is also called the Erlang distribution, a sub family of the gamma family.

What if the exponential X_1,X_2,\cdots,X_n are not identically distributed? Suppose that the rate parameter of X_i is \lambda_i such that \lambda_i \ne \lambda_j for i \ne j. Then the sum X_1+X_2+\cdots + X_n is said to follow a hypoexponential distribution.

To contrast, a hyperexponential distribution is also a “sum” of independent exponential random variables X_1,X_2,\cdots,X_n. The notion of “sum” is not the arithmetic sum but rather is the notion of a mixture. The random variable X can be one of the n independent exponential random variables X_1,X_2,\cdots,X_n such that X is X_i with probability p_i with p_1+\cdots+p_n=1. Such a random variable X is said to follow a hyperexponential distribution.

The Erlang distribution, the hypoexponential distribution and the hyperexponential distribution are special cases of phase-type distributions that are useful in queuing theory. These distributions are models for interarrival times or service times in queuing systems. They are obtained by breaking down the total time into a number of phases, each having an exponential distribution, where the parameters of the exponential distributions may be identical or may be different. Furthermore, the phases may be in series or in parallel (or both). Phases in series means that the phases are run sequentially, one at a time. The Erlang distribution is one example of a phase-type distribution where the n phases are in series and the phases have the same parameter for the exponential distributions. The hypoexponential distribution is an example of a phase-type distribution where the n phases are in series and the phases have distinct exponential parameters. The hyperexponential distribution is an example of a phase-type distribution where the phases are in parallel, which means that the system randomly selects one of the phases to process each time according to specified probabilities.

The Erlang distribution is a special case of the gamma distribution. For basic properties of the Erlang distribution, see the previous posts on the gamma distribution, starting with this post. The remainder of the post focuses on some basic properties of the hyper and hypo exponential distributions.

_______________________________________________________________________________________________

The Hyperexponential Distribution

The hyperexponential distribution is the mixture of a set of independent exponential distributions. To generate a hyperexponential distribution, let X_1,X_2,\cdots,X_n be independent exponential random variables with rate parameters \lambda_1, \lambda_2,\cdots,\lambda_n, and with weights p_1,p_2,\cdots,p_n, respectively. Then X is a hyperexponential random variable if X is X_i with probability p_i. Thus the hyperexponential distribution has a variable number of parameters. There are 2n parameters, n for the exponential rate parameters and n for their respective weights. We use the same notations in the discussion of the hyperexponential distribution that follows.

Basic facts about mixture distributions can give a great deal of information on the hyperexponential distribution. For example, many distributional quantities of a mixture distribution are obtained by taking weighted averages of the corresponding quantities of the individual components. Specifically, the density function, the cumulative distribution function (CDF), the survival function, the mean and the higher moments can all be obtained by taking weighted averages. However, the variance of a mixture is not the weighted average of the individual variances. The following shows the probability density function f_X(x), the survival function S_X(x) and the cumulative distribution function F_X(x) of the hyperexponential distribution.

    \displaystyle f_X(x)=p_1 \cdot (\lambda_1 e^{-\lambda_1 \ x})+p_2 \cdot (\lambda_2 e^{-\lambda_2 \ x})+\cdots+p_n \cdot (\lambda_n e^{-\lambda_n \ x}) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

    \displaystyle S_X(x)=p_1 \cdot e^{-\lambda_1 \ x}+p_2 \cdot e^{-\lambda_2 \ x}+\cdots+p_n \cdot e^{-\lambda_n \ x} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

    \displaystyle F_X(x)=p_1 \cdot (1- e^{-\lambda_1 \ x})+p_2 \cdot (1- e^{-\lambda_2 \ x})+\cdots+p_n \cdot (1- e^{-\lambda_n \ x}) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

The above three distributional quantities are the weighted averages of the corresponding quantities of the individual exponential distributions. The hyperexponential distribution is sometimes called a finite mixture since there are a finite number of components in the weighted average. The raw moments of the hyperexponential distribution are also weighted averages of the corresponding exponential raw moments. The following shows the mean and the higher moments E[X^k] where k is any positive integer.

    \displaystyle E[X]=p_1 \cdot \biggl(\frac{1}{\lambda_1} \biggr)+p_2 \cdot \biggl(\frac{1}{\lambda_2} \biggr)+\cdots+p_n \cdot \biggl(\frac{1}{\lambda_n} \biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

    \displaystyle E[X^k]=p_1 \cdot \biggl(\frac{k!}{\lambda_1^k} \biggr)+p_2 \cdot \biggl(\frac{k!}{\lambda_2^k} \biggr)+\cdots+p_n \cdot \biggl(\frac{k!}{\lambda_n^k} \biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)

Interestingly, the hyperexponential variance is not the weighted average of the individual variances.

    \displaystyle Var[X] \ne \sum \limits_{i=1}^n \ p_i \biggl(\frac{1}{\lambda_i^2} \biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)

The following gives the correct variance of the hyperexponential distribution.

    \displaystyle Var[X]=E[X^2]-E[X]^2=\sum \limits_{i=1}^n \ p_i \biggl(\frac{2}{\lambda_i^2} \biggr)-\biggl(\sum \limits_{i=1}^n \ p_i \frac{1}{\lambda_i} \biggr)^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (7)

It is instructive to rearrange the above variance as follows:

    \displaystyle \begin{aligned} Var[X]&=\sum \limits_{i=1}^n \ p_i \biggl(\frac{2}{\lambda_i^2} \biggr)-\biggl(\sum \limits_{i=1}^n \ p_i \frac{1}{\lambda_i} \biggr)^2 \\&=\sum \limits_{i=1}^n \ p_i \biggl(\frac{1}{\lambda_i^2} \biggr)+\biggl[ \sum \limits_{i=1}^n \ p_i \biggl(\frac{1}{\lambda_i^2} \biggr)-\biggl(\sum \limits_{i=1}^n \ p_i \frac{1}{\lambda_i} \biggr)^2 \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8) \end{aligned}

Compare (8) with the right hand side of (6). The difference is the content within the squared brackets in (8), which represents the added uncertainty in a mixture. Note that the content within the squared brackets is a positive value and is the variance of the random variable that has value \frac{1}{\lambda_i} with probability p_i. In other words, the content inside the squared brackets is the variance of the conditional exponential means (see the random variable W described below).

The derivation in (8) shows that the hyperexponential variance is more than the weighted average of the individual exponential variances. This is because the mixing of the exponential random variables has added uncertainty. There are two levels of uncertainty in a mixture – the uncertainty in each component (e.g. in each exponential distribution) and the uncertainty in not knowing which component will be realized, which is the added uncertainty. The content inside the squared brackets quantifies the added uncertainty. See here for a more detailed discussion on the variance of a mixture.
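The decomposition in (8) is easy to verify numerically. The following is a minimal Python sketch (numpy only; the weights and rates are arbitrary illustrative values) that computes the hyperexponential mean and variance from (4), (5) and (7) and checks them against the weighted average of variances plus the added-uncertainty term.

    import numpy as np

    p = np.array([0.3, 0.7])            # assumed mixing weights
    lam = np.array([1.0, 0.25])         # assumed exponential rates

    mean = np.sum(p / lam)                       # formula (4)
    second = np.sum(p * 2 / lam ** 2)            # formula (5) with k = 2
    var = second - mean ** 2                     # formula (7)

    weighted_var = np.sum(p / lam ** 2)          # weighted average of component variances
    extra = np.sum(p / lam ** 2) - mean ** 2     # variance of the conditional means (Var[W])

    print(var, weighted_var + extra)             # the two agree, per (8)
    print(np.sqrt(var) / mean)                   # coefficient of variation, always > 1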

The coefficient of variation of a probability distribution is the ratio of its standard deviation to its mean (we only consider the case where the mean is positive). It is sometimes called the relative standard deviation and is a standardized measure of dispersion of a probability distribution. For the exponential distribution, the coefficient of variation is always one. For the hyperexponential distribution, the coefficient of variation is always more than one. To see why, consider the random variable W as defined below (note that this is the same random variable that is used above to explain the added uncertainty in a mixture).

    \displaystyle  W = \left\{ \begin{array}{cc}                     \displaystyle  \frac{1}{\lambda_1} & \ \ \ \ \text{probability } p_1 \\           \text{ } & \text{ } \\           \displaystyle  \frac{1}{\lambda_2} &\ \ \ \ \text{probability } p_2 \\           \text{ } & \text{ } \\           \cdots & \cdots \\           \text{ } & \text{ } \\           \displaystyle  \frac{1}{\lambda_n} &\ \ \ \ \text{probability } p_n           \end{array} \right.

The variance of W is given by the following:

    \displaystyle Var[W]=\sum \limits_{i=1}^n \ p_i \biggl(\frac{1}{\lambda_i^2} \biggr)-\biggl(\sum \limits_{i=1}^n \ p_i \frac{1}{\lambda_i} \biggr)^2

We now rearrange Var[W] to show that the coefficient of variation of the hyperexponential distribution is more than one.

    \displaystyle \begin{aligned} &\sum \limits_{i=1}^n \ p_i \biggl(\frac{1}{\lambda_i^2} \biggr)-\biggl(\sum \limits_{i=1}^n \ p_i \frac{1}{\lambda_i} \biggr)^2 > 0 \\&\text{ } \\&\sum \limits_{i=1}^n \ p_i \biggl(\frac{2}{\lambda_i^2} \biggr)-2 \times \biggl(\sum \limits_{i=1}^n \ p_i \frac{1}{\lambda_i} \biggr)^2 > 0 \\&\text{ } \\&\sum \limits_{i=1}^n \ p_i \biggl(\frac{2}{\lambda_i^2} \biggr)-\biggl(\sum \limits_{i=1}^n \ p_i \frac{1}{\lambda_i} \biggr)^2 > \biggl(\sum \limits_{i=1}^n \ p_i \frac{1}{\lambda_i} \biggr)^2 \\&\text{ } \\&Var[X] > E[X]^2 \\&\text{ } \\& \sigma_X > E[X] \\&\text{ } \\& \frac{\sigma_X}{E[X]} > 1 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9) \end{aligned}

The variance Var[W] is identical to the content within the square brackets in (8). The above derivation shows that the variance of the hyperexponential distribution is always greater than the square of its mean, i.e. the standard deviation is always greater than the mean, hence making the coefficient of variation always more than one.

The following is the failure rate of the hyperexponential distribution.

    \displaystyle \mu_X(t)=\frac{f_X(t)}{S_X(t)}=\frac{\sum \limits_{i=1}^n \ p_i \cdot \lambda_i e^{-\lambda_i \ t}}{\sum \limits_{i=1}^n \ p_i \cdot e^{-\lambda_i \ t}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (10)

The failure rate (also called the hazard rate) can be interpreted as the rate of failure at the instant right after the life has survived to age t. If the lifetime of a system is modeled by an exponential distribution, then the failure rate is constant, which is another way to state the memoryless property of the exponential distribution. Since the exponential distribution is the only continuous distribution with the memoryless property, the failure rate in (10) is obviously not constant. However, as the life goes up in age, the failure rate approaches a constant as shown by the following:

    \displaystyle \lim \limits_{t \rightarrow \infty} \ \mu_X(t)=\lambda_k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (11)

where \lambda_k=\text{min}(\lambda_1,\cdots,\lambda_n). That is, as the life (or system or process) goes up in age, the hyperexponential failure rate approaches that of the exponential component with the smallest failure rate (the longest lifetime). This makes intuitive sense. As the system or process survives to a high age, it is more likely the hyperexponential random variable is the exponential component that lasts the longest.

To see why (11) is true, multiply both the numerator and the denominator in (10) by e^{\lambda_k t}. Then the terms with i=k become p_k \lambda_k (in the numerator) and p_k (in the denominator). As t \rightarrow \infty, the terms with i \ne k go to zero. As a result, the ratio goes to \lambda_k.

Though not discussed here, several distributional quantities that are based on moments, such as the skewness and the kurtosis, can be calculated using the moments obtained in (5).
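For readers who want to check these formulas numerically, the following is a minimal Python sketch (not part of the original discussion) that simulates a hyperexponential distribution and compares the sample mean, variance and coefficient of variation against formulas (4), (7) and (9). The mixing weights and rates are illustrative choices only.

    import numpy as np

    rng = np.random.default_rng(seed=2016)

    p = np.array([0.2, 0.5, 0.3])      # mixing weights p_i (illustrative)
    lam = np.array([0.5, 1.0, 2.0])    # exponential rates lambda_i (illustrative)
    n_sims = 10**6

    # Simulate: first pick a component, then draw from that exponential.
    component = rng.choice(len(p), size=n_sims, p=p)
    x = rng.exponential(scale=1.0 / lam[component])

    # Theoretical quantities from (4) and (7).
    mean_theory = np.sum(p / lam)
    second_moment_theory = np.sum(p * 2.0 / lam**2)
    var_theory = second_moment_theory - mean_theory**2

    print("mean:     simulated %.4f  theory %.4f" % (x.mean(), mean_theory))
    print("variance: simulated %.4f  theory %.4f" % (x.var(), var_theory))
    print("CV:       simulated %.4f  (always > 1 for a hyperexponential)" %
          (x.std() / x.mean()))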

_______________________________________________________________________________________________

The Hypoexponential Density Function

The remainder of the post discusses the basic properties of the hypoexponential distribution. We first examine the probability density function of a hypoexponential distribution. Since such a distribution is an independent sum, the concept of convolution can be used. The following example shows how it is done for the hypoexponential distribution with 2 phases. It turns out that the density function in general has a nice form.

Example 1
Suppose the rate parameters are \lambda_1 and \lambda_2 for the two independent exponential random variables X_1 and X_2 where \lambda_1 \ne \lambda_2. Determine the density function of the independent sum X=X_1+X_2.

The integral for the density function is:

    \displaystyle \begin{aligned} f_{X_1+X_2}(x)&=\int_0^x \ f_{X_1}(t) \ f_{X_2}(x-t) \ dt \\&=\int_0^x \ \lambda_1 e^{-\lambda_1 \ t} \ \lambda_2 e^{-\lambda_2 \ (x-t)} \ dt \\&=\lambda_1 \ \lambda_2 \ e^{-\lambda_2 \ x} \ \int_0^x \ e^{-(\lambda_1-\lambda_2) \ t} \ dt \\&=\lambda_1 \ \lambda_2 \ e^{-\lambda_2 \ x} \ \frac{1}{\lambda_1-\lambda_2} \ (1-e^{-(\lambda_1-\lambda_2) \ x}) \\&=\frac{\lambda_1}{\lambda_1-\lambda_2} \ (\lambda_2 \ e^{-\lambda_2 \ x}) + \frac{\lambda_2}{\lambda_2-\lambda_1} \ (\lambda_1 \ e^{-\lambda_1 \ x}) \end{aligned}

Note that the density f_{X_1+X_2}(x) appears to be a weighted average of the two individual exponential density functions, even though one of the weights is negative. \square

Example 2
Suppose that the hypoexponential distribution has three phases. The parameters of the three phases are \lambda_1, \lambda_2 and \lambda_3. Determine the density of the independent sum X=X_1+X_2+X_3.

The idea is to apply the convolution of the density obtained in Example 1 and the exponential density f_{X_3}(x)=\lambda_3 \ e^{-\lambda_3 \ x}. The convolution integral is:

    \displaystyle f_{X_1+X_2+X_3}(x)=\int_0^x \ f_{X_1+X_2}(t) \ f_{X_3}(x-t) \ dt

The result of the integral is:

    \displaystyle \begin{aligned} f_{X_1+X_2+X_3}(x)&=(\lambda_1 \ e^{-\lambda_1 \ x}) \ \biggl( \frac{\lambda_2}{\lambda_2-\lambda_1} \times \frac{\lambda_3}{\lambda_3-\lambda_1} \biggr) \\& \ \ \ +(\lambda_2 \ e^{-\lambda_2 \ x}) \ \biggl( \frac{\lambda_1}{\lambda_1-\lambda_2} \times \frac{\lambda_3}{\lambda_3-\lambda_2} \biggr) \\& \ \ \ +(\lambda_3 \ e^{-\lambda_3 \ x}) \ \biggl( \frac{\lambda_1}{\lambda_1-\lambda_3} \times \frac{\lambda_2}{\lambda_2-\lambda_3} \biggr) \end{aligned}

The density function f_{X_1+X_2+X_3}(x) also appears to be a weighted average of the individual exponential density functions with some of the weights being negative. \square

The following is the general form of the density function of a hypoexponential distribution with n phases (assuming that the rate parameters \lambda_i are distinct).

    \displaystyle f_{X_1+X_2+\cdots+X_{n}}(x)=\sum \limits_{i=1}^n \ W_{n,i} \ (\lambda_i \ e^{-\lambda_i \ x}) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (12)

where the weight W_{n,i} is defined as:

    \displaystyle W_{n,i}=\prod \limits_{j \ne i} \frac{\lambda_j}{\lambda_j-\lambda_i}

The density function in the general case of n phases can be established by an induction proof.
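As a quick numerical sanity check of (12), the following Python sketch (an illustration, not part of the original post) compares the closed-form density against an empirical density estimate from simulated sums of independent exponentials. The rates are arbitrary distinct values.

    import numpy as np

    rng = np.random.default_rng(seed=1)
    lam = np.array([0.8, 1.5, 3.0])   # distinct rates (illustrative)

    def hypoexp_pdf(x, lam):
        """Density (12): sum of W_{n,i} * lambda_i * exp(-lambda_i * x)."""
        x = np.asarray(x, dtype=float)
        pdf = np.zeros_like(x)
        for i, li in enumerate(lam):
            others = np.delete(lam, i)
            w = np.prod(others / (others - li))   # weight W_{n,i}
            pdf += w * li * np.exp(-li * x)
        return pdf

    # Simulate the independent sum X_1 + X_2 + X_3.
    n_sims = 10**6
    samples = sum(rng.exponential(scale=1.0 / li, size=n_sims) for li in lam)

    # Compare the empirical and theoretical densities on a small grid.
    width = 0.05
    for t in (0.5, 1.0, 2.0, 4.0):
        empirical = np.mean(np.abs(samples - t) < width / 2) / width
        print("x=%.1f  empirical %.4f  formula %.4f" % (t, empirical, hypoexp_pdf(t, lam)))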

_______________________________________________________________________________________________

More Hypoexponential Properties

As before, the hypoexponential random variable is the sum X=X_1+X_2+\cdots+X_n where X_1,X_2,\cdots,X_n are independent exponential random variables with \lambda_i being the rate parameter for X_i. Since it is an independent sum, the mean and variance are easily obtained.

    \displaystyle E[X]=\frac{1}{\lambda_1}+\frac{1}{\lambda_2}+\cdots+\frac{1}{\lambda_n} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (13)

    \displaystyle Var[X]=\frac{1}{\lambda_1^2}+\frac{1}{\lambda_2^2}+\cdots+\frac{1}{\lambda_n^2} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (14)

The coefficient of variation is the ratio of the standard deviation to the mean. For the exponential distribution, the coefficient of variation is always 1. For the hypoexponential distribution, the coefficient of variation is always less than 1. The following is the square of the coefficient of variation for a hypoexponential random variable.

    \displaystyle \text{CV}^2=\frac{Var[X]}{E[X]^2}=\frac{\frac{1}{\lambda_1^2}+\frac{1}{\lambda_2^2}+\cdots+\frac{1}{\lambda_n^2}}{\biggl(\frac{1}{\lambda_1}+\frac{1}{\lambda_2}+\cdots+\frac{1}{\lambda_n}\biggr)^2}<1 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (15)

    \displaystyle \text{CV}=\frac{\sigma_X}{E[X]}<1 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (16)

Note that the quantity in the denominator in (15) is always larger than the numerator, showing that \text{CV}^2<1. As a result, \text{CV}<1. In contrast, the coefficient of variation of the hyperexponential distribution is always greater than 1 (see (9)).

The survival function for the hypoexponential distribution has the following form:

    \displaystyle S_X(x)=S_{X_1+X_2+\cdots+X_{n}}(x)=\sum \limits_{i=1}^n \ W_{n,i} \ e^{-\lambda_i \ x} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (17)

where the weights W_{n,i} are defined in (12). Then the CDF is F_X(x)=1-S_X(x). As an example, the survival function for the hypoexponential distribution in Example 1 is

    \displaystyle S_X(x)=\frac{\lambda_1}{\lambda_1-\lambda_2} \ (e^{-\lambda_2 \ x}) + \frac{\lambda_2}{\lambda_2-\lambda_1} \ (e^{-\lambda_1 \ x})

With the density function in (12), several distributional quantities that are based on moments, such as the skewness and the kurtosis, can be calculated.

The following is the failure rate of the hypoexponential distribution.

    \displaystyle \mu_X(t)=\frac{f_{X}(t)}{S_X(t)}=\frac{\sum \limits_{i=1}^n \ W_{n,i} \ (\lambda_i \ e^{-\lambda_i \ t})}{\sum \limits_{i=1}^n \ W_{n,i} \ e^{-\lambda_i \ t}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (18)

The hypoexponential failure rate is obviously not a constant rate since only the exponential distribution has constant failure rate. However, as the system reaches high ages, the failure rate approaches the smallest of the exponential rate parameters that define the hypoexponential distribution. The same observation is made above in (11), that is,

    \displaystyle \lim \limits_{t \rightarrow \infty} \mu_X(t)=\lambda_k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (19)

where \lambda_k=\text{min}(\lambda_1,\cdots,\lambda_n). Suppose that the lifetime of a life or system is modeled by a hypoexponential distribution. Then when the system is in a high age, the failure rate is approximately the smallest rate parameter of the exponential distributions defining the hypoexponential lifetime. In other words, as the hypoexponentially distributed lifetime becomes large, the remaining lifetime of the system is approximately the exponential lifetime with a rate parameter equal to the smallest of the exponential rate parameters. The proof of (19) is similar to the proof for (11). Simply multiply the numerator and denominator of (18) by e^{\lambda_k t}. Then the terms with i=k become constants. As t \rightarrow \infty, the terms with i \ne k go to zero and the ratio goes to \lambda_k.
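Here is a small hedged Python sketch checking (13), (14) and (17) numerically for a hypoexponential distribution with illustrative rates: the coefficient of variation comes out below 1 and the survival function (17) matches simulation.

    import numpy as np

    rng = np.random.default_rng(seed=7)
    lam = np.array([1.0, 2.0, 4.0])            # distinct rates (illustrative)
    n_sims = 10**6

    samples = sum(rng.exponential(scale=1.0 / li, size=n_sims) for li in lam)

    mean_theory = np.sum(1.0 / lam)            # (13)
    var_theory = np.sum(1.0 / lam**2)          # (14)
    cv_theory = np.sqrt(var_theory) / mean_theory
    print("CV: simulated %.4f  theory %.4f (always < 1)" %
          (samples.std() / samples.mean(), cv_theory))

    def hypoexp_sf(x, lam):
        """Survival function (17): sum of W_{n,i} * exp(-lambda_i * x)."""
        sf = 0.0
        for i, li in enumerate(lam):
            others = np.delete(lam, i)
            w = np.prod(others / (others - li))
            sf += w * np.exp(-li * x)
        return sf

    for t in (0.5, 1.0, 2.0):
        print("S(%.1f): simulated %.4f  formula %.4f" %
              (t, np.mean(samples > t), hypoexp_sf(t, lam)))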

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}

More topics on the exponential distribution

This is a continuation of two previous posts on the exponential distribution (an introduction and a post on the connection with the Poisson process). This post presents more properties that are not discussed in the two previous posts. Of course, a serious and in-depth discussion of the exponential distribution can fill volumes. The goal here is quite modest – to present a few more properties related to the memoryless property of the exponential distribution.

_______________________________________________________________________________________________

The Failure Rate

A previous post discusses the Poisson process and its relation to the exponential distribution. Now we present another way of looking at both notions. Suppose a counting process counts the occurrences of a type of random events. Suppose that an event means a termination of a system, be it biological or manufactured. Furthermore suppose that the terminations occur according to a Poisson process at a constant rate \lambda per unit time. Then what is the meaning of the rate \lambda? It is the rate of termination (dying). It is usually called the failure rate (or hazard rate or force of mortality). The meaning of the constant rate \lambda is that the rate of dying is the same regardless of the location of the time scale (i.e. regardless how long a life has lived). This means that the lifetime (the time until death) of such a system has no memory. Since the exponential distribution is the only continuous distribution with the memoryless property, the time until the next termination inherent in the Poisson process in question must be an exponential random variable with rate \lambda or mean \frac{1}{\lambda}. So the notion of failure rate function (or hazard rate function) runs through the notions of exponential and Poisson process and further illustrates the memoryless property.

Consider a continuous random variable X that only takes on the positive real numbers. Suppose F and f are the CDF and density function of X, respectively. The survival function is S=1-F. The failure rate (or hazard rate) \mu(t) is defined as:

    \displaystyle \mu(t)=\frac{f(t)}{1-F(t)}=\frac{f(t)}{S(t)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

The function \mu(t) can be interpreted as the rate of failure at the next instant given that the life has survived to time t. Suppose that the lifetime distribution is exponential. Because of the memoryless property, the remaining lifetime of a t-year old is the same as the lifetime distribution for a new item. It is then intuitively clear that the failure rate must be constant. Indeed, it is straightforward to show that if the lifetime X is an exponential random variable with rate parameter \lambda, i.e. the density is f(t)=\lambda e^{-\lambda t}, then the failure rate is \mu(t)=\lambda. This is why the parameter \lambda in the density function f(t)=\lambda e^{-\lambda t} is called the rate parameter.

On the other hand, the failure rate or hazard rate is not constant for other lifetime distributions. However, the hazard rate function \mu(t) uniquely determines the distributional quantities such as the CDF and the survival function. The definition (1) shows that the failure rate is derived using the CDF or survival function. We show that the failure rate is enough information to derive the CDF or the survival function. From the definition, we have:

    \displaystyle \mu(t)=\frac{-\frac{d}{dt} S(t)}{S(t)} \ \ \  \text{or} \ \ \ -\mu(t)=\frac{\frac{d}{dt} S(t)}{S(t)}

Integrating both sides produces the following:

    \displaystyle - \int_0^t \ \mu(x) \ dx=\int_0^t \ \frac{\frac{d}{dx} S(x)}{S(x)} \ dx=\text{ln} S(t)

Since S(0)=1 (so that \text{ln} S(0)=0), exponentiating each side produces the following:

    \displaystyle S(t)=e^{-\int_0^t \ \mu(x) \ dx} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

Once the survival function is obtained from \mu(t), the CDF and the density function can be obtained. Interestingly, the derivation in (2) can give another proof of the fact that the exponential distribution is the only one with the memoryless property. If X is memoryless, then the failure rate must be constant. If \mu(x) is constant, then by (2), S(t) is the exponential survival function. Of course, the other way is clear: if X is exponential, then X is memoryless.

The preceding discussion shows that having a constant failure rate is another way to characterize the exponential distribution, in particular the memoryless property of the exponential distribution. Before moving onto the next topics, another example of a failure rate function is \mu(t)=\frac{\alpha}{\beta} (\frac{t}{\beta})^{\alpha-1} where both \alpha and \beta are positive constants. This is the Weibull hazard rate function. The derived survival function is S(t)=e^{-(\frac{t}{\beta})^\alpha}, which is the survival function for the Weibull distribution. This distribution is an excellent model choice for describing the life of manufactured objects. See here for an introduction to the Weibull distribution.
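The identity (2) is easy to verify numerically. The following Python sketch (an illustration only) integrates the Weibull hazard rate numerically and compares the result with the closed-form Weibull survival function; the parameter values are arbitrary.

    import numpy as np

    alpha, beta = 1.7, 2.5   # Weibull shape and scale (illustrative)

    def weibull_hazard(t):
        return (alpha / beta) * (t / beta) ** (alpha - 1)

    def survival_from_hazard(t, n_steps=100000):
        """Numerically evaluate S(t) = exp(-integral of the hazard from 0 to t), per (2)."""
        x = np.linspace(0.0, t, n_steps)
        cumulative_hazard = np.trapz(weibull_hazard(x), x)
        return np.exp(-cumulative_hazard)

    for t in (0.5, 1.0, 3.0):
        closed_form = np.exp(-(t / beta) ** alpha)
        print("t=%.1f  from hazard %.6f  closed form %.6f" %
              (t, survival_from_hazard(t), closed_form))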

_______________________________________________________________________________________________

The Minimum Statistic

The following result is about the minimum of independent exponential distributions.

    Suppose that X_1,X_2,\cdots,X_n are independent exponential random variables with rates \lambda_1,\lambda_2,\cdots,\lambda_n, respectively. Then the minimum of the sample, denoted by Y=\text{min}(X_1,X_2,\cdots,X_n), is also an exponential random variable with rate \lambda_1+\lambda_2 +\cdots+\lambda_n. \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

For the minimum to be > y, all sample items must be > y. Thus P[Y> y] is:

    \displaystyle \begin{aligned} P[Y> y]&=P[X_1> y] \times P[X_2> y] \times \cdots \times P[X_n> y] \\&=e^{-\lambda_1 y} \times e^{-\lambda_2 y} \times \cdots \times e^{-\lambda_n y}\\&=e^{-(\lambda_1+\lambda_2+\cdots+\lambda_n) y}  \end{aligned}

This means that Y=\text{min}(X_1,X_2,\cdots,X_n) has an exponential distribution with rate \lambda_1+\lambda_2+\cdots+\lambda_n. As a result, the smallest item of a sample of independent exponential observations also follows an exponential distribution with rate being the sum of all the individual exponential rates.

Example 1
Suppose that X_1,X_2,X_3 are independent exponential random variables with rates \lambda_i for i=1,2,3. Calculate E[\text{min}(X_1,X_2,X_3)] and E[\text{max}(X_1,X_2,X_3)].

Let X=\text{min}(X_1,X_2,X_3) and Y=\text{max}(X_1,X_2,X_3). Then X is an exponential distribution with rate \lambda_1+\lambda_2+\lambda_3. As a result, E[X]=\frac{1}{\lambda_1+\lambda_2+\lambda_3}. Finding the expected value of the maximum requires calculation. First calculate P[Y \le y].

    \displaystyle \begin{aligned} P[Y \le y]&=P[X_1 \le y] \times P[X_2 \le y] \times  P[X_3 \le y] \\&=(1-e^{-\lambda_1 y}) \times (1-e^{-\lambda_2 y}) \times (1-e^{-\lambda_3 y}) \\&=1-e^{-\lambda_1 y}-e^{-\lambda_2 y}-e^{-\lambda_3 y}+e^{-(\lambda_1+\lambda_2) y}+e^{-(\lambda_1+\lambda_3) y} \\& \ \ \ +e^{-(\lambda_2+\lambda_3) y}-e^{-(\lambda_1+\lambda_2+\lambda_3) y}  \end{aligned}

Differentiate F_Y(y)=P[Y \le y] to obtain the density function f_Y(y).

    \displaystyle \begin{aligned} f_Y(y)&=\lambda_1 e^{-\lambda_1 y}+\lambda_2 e^{-\lambda_2 y}+ \lambda_3 e^{-\lambda_3 y} \\& \ \ \ -(\lambda_1+\lambda_2) e^{-(\lambda_1+\lambda_2) y}-(\lambda_1+\lambda_3) e^{-(\lambda_1+\lambda_3) y}-(\lambda_2+\lambda_3)e^{-(\lambda_2+\lambda_3) y} \\& \ \ \ \ +(\lambda_1+\lambda_2+\lambda_3)e^{-(\lambda_1+\lambda_2+\lambda_3) y}  \end{aligned}

Each term in the density function f_Y(y) is by itself an exponential density. Thus the mean of the maximum is:

    \displaystyle E[Y]=\frac{1}{\lambda_1}+\frac{1}{\lambda_2}+ \frac{1}{\lambda_3} -\frac{1}{\lambda_1+\lambda_2}-\frac{1}{\lambda_1+\lambda_3}-\frac{1}{\lambda_2+\lambda_3}+\frac{1}{\lambda_1+\lambda_2+\lambda_3}

To make sense of the numbers, let \lambda_1=\lambda_2=\lambda_3=1. Then E[X]=\frac{1}{3}=\frac{2}{6} and E[Y]=\frac{11}{6}. In this case, the expected value of the maximum is 5.5 times larger than the expected value of the minimum. For \lambda_1=1, \lambda_2=2 and \lambda_3=3, E[X]=\frac{1}{6} and E[Y]=\frac{73}{60}=\frac{7.3}{6}. In the second case, the expected value of the maximum is 7.3 times larger than the expected value of the minimum. \square
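To sanity-check the closed forms for E[\text{min}] and E[\text{max}], here is a short Python simulation (illustrative only) using the rates \lambda_1=1, \lambda_2=2 and \lambda_3=3 from the second case of the example.

    import numpy as np

    rng = np.random.default_rng(seed=3)
    lam = np.array([1.0, 2.0, 3.0])
    n_sims = 10**6

    # Each row is a sample (X_1, X_2, X_3) of independent exponentials.
    x = rng.exponential(scale=1.0 / lam, size=(n_sims, 3))

    print("E[min]: simulated %.4f  theory %.4f" % (x.min(axis=1).mean(), 1 / lam.sum()))

    # Closed form for E[max] from the example (inclusion-exclusion).
    l1, l2, l3 = lam
    e_max = (1/l1 + 1/l2 + 1/l3 - 1/(l1+l2) - 1/(l1+l3) - 1/(l2+l3) + 1/(l1+l2+l3))
    print("E[max]: simulated %.4f  theory %.4f" % (x.max(axis=1).mean(), e_max))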

_______________________________________________________________________________________________

Ranking Independent Exponential Distributions

In this section, X_1,X_2,\cdots,X_n are independent exponential random variables where the rate of X_i is \lambda_i for i=1,2,\cdots,n. What is the probability of P[X_{j(1)} <X_{j(2)}<\cdots<X_{j(k)}]? Here the subscripts j(1),\cdots,j(k) are distinct integers from \left\{1,2,\cdots,n \right\}. For example, for sample size of 2, what are the probabilities P[X_1<X_2] and P[X_2<X_1]? For sample size of 3, what are P[X_1<X_2<X_3] and P[X_2<X_1<X_3]? First, the case of ranking two independent exponential random variables.

    Ranking X_1 and X_2.

      \displaystyle P[X_1<X_2]=\frac{\lambda_1}{\lambda_1+\lambda_2} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

    where \lambda_1 is the rate of X_1 and \lambda_2 is the rate of X_2. Note that this probability is the ratio of the rate of the variable required to be the smaller one (here X_1) over the total rate. The probability P[X_1<X_2] can be computed by evaluating the following integral:

      \displaystyle P[X_1<X_2]=\int_0^\infty \int_x^\infty \ \lambda_1 e^{-\lambda_1 \ x} \ \lambda_2 e^{-\lambda_2 \ y} \ dy \ dx

The natural next step is to rank three or more exponential random variables. Ranking three variables would require a triple integral and ranking more variables would require a larger multiple integral. Instead, we rely on a useful fact about the minimum statistic. First, another basic result.

    When one of the variables is the minimum:

      \displaystyle P[X_i=\text{min}(X_1,\cdots,X_n)]=\frac{\lambda_i}{\lambda_1+\lambda_2+\cdots+\lambda_n} \ \ \ \ \ \ \ \ \ \ \ \ \ (5)

    The above says that the probability that the ith random variable is the smallest is simply the ratio of the rate of the ith variable over the total rate. This follows from (4) since we are ranking the two exponential variables X_i and \text{min}(X_1,\cdots,X_{i-1},X_{i+1},\cdots,X_n).

We now consider the following theorem.

Theorem 1
Let X_1,X_2,\cdots,X_n be independent exponential random variables. Then the minimum statistic \text{min}(X_1,X_2,\cdots,X_n) and the rank ordering of X_i are independent.

The theorem basically says that the probability of a ranking is not dependent on the location of the minimum statistic. For example, if we know that the minimum is more than 3, what is the probability of X_1<X_2<X_3? The theorem is saying that conditioning on \text{min}(X_1,X_2,X_3)>3 makes no difference on the probability of the ranking. Let Y=\text{min}(X_1,X_2,\cdots,X_n). The following establishes the theorem.

    \displaystyle P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)} \ |\ \text{min}(X_1,X_2,\cdots,X_n)>t]

    \displaystyle =P[X_{j(1)}-t<X_{j(2)}-t<\cdots<X_{j(k)}-t \ |\ \text{min}(X_1,X_2,\cdots,X_n)>t]

    \displaystyle =P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)}]

The key to the proof is the step to go from the second line to the third line. Assume that each X_i is the lifetime of a machine. When \text{min}(X_1,X_2,\cdots,X_n)>t, all the lifetimes X_i>t. By the memoryless property, the remaining lifetimes X_i-t are independent and exponential with the original rates. In other words, each X_i-t has the same exponential distribution as X_i. Consequently, the second line equals the third line. To make it easier to see the step from the second line to the third line, think of the two-dimensional case with X_1 and X_2. \square

The following is a consequence of Theorem 1.

Corollary 2
Let X_1,X_2,\cdots,X_n be independent exponential random variables. Then the event X_i=\text{min}(X_1,X_2,\cdots,X_n) and the rank ordering of X_1,\cdots,X_{i-1},X_{i+1}\cdots,X_n are independent.

Another way to state the corollary is that the knowledge that X_i is the smallest in the sample has no effect on the ranking of the variables other than X_i. This is the consequence of Theorem 1. To see why, let t=X_i=\text{min}(X_1,X_2,\cdots,X_n).

    \displaystyle P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)} \ |\ t=X_i=\text{min}(X_1,X_2,\cdots,X_n)]

    \displaystyle =P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)} \ |\ \text{min}(X_1,\cdots,X_{i-1},X_{i+1}\cdots,X_n)>t]

    \displaystyle =P[X_{j(1)}<X_{j(2)}<\cdots<X_{j(k)}] \square

We now present examples demonstrating how these ideas are used.

Example 2
Suppose that a bank has three tellers for serving its customers. The random variables X_1,X_2,X_3 are independent exponential random variables where X_i is the time spent by teller i serving a customer. The rate parameter of X_i is \lambda_i where i=1,2,3. If all three tellers are busy serving customers, what is P[X_1<X_2<X_3]? If the bank has 4 tellers instead, then what is P[X_1<X_2<X_3<X_4]?

The answer is given by the following:

    \displaystyle \begin{aligned} P[X_1<X_2<X_3]&=P[X_1=\text{min}(X_1,X_2,X_3)] \\& \ \ \ \times P[X_2<X_3 | X_1=\text{min}(X_1,X_2,X_3)] \\&\text{ } \\&=P[X_1=\text{min}(X_1,X_2,X_3)] \times P[X_2<X_3] \\&\text{ } \\&=\frac{\lambda_1}{\lambda_1+\lambda_2+\lambda_3} \times \frac{\lambda_2}{\lambda_2+\lambda_3}  \end{aligned}

The derivation uses Corollary 2 and the idea in (5). The same idea can be used for P[X_1<X_2<X_3<X_4].

    \displaystyle \begin{aligned} P[X_1<X_2<X_3<X_4]&=P[X_1=\text{min}(X_1,X_2,X_3,X_4)] \\& \ \ \times P[X_2<X_3<X_4 | X_1=\text{min}(X_1,X_2,X_3,X_4)] \\&\text{ } \\&=P[X_1=\text{min}(X_1,X_2,X_3,X_4)] \times P[X_2<X_3<X_4] \\&\text{ } \\&=P[X_1=\text{min}(X_1,X_2,X_3,X_4)] \\& \ \ \ \times P[X_2=\text{min}(X_2,X_3,X_4)] \times P[X_3<X_4] \\&\text{ } \\&=\frac{\lambda_1}{\lambda_1+\lambda_2+\lambda_3+\lambda_4} \times \frac{\lambda_2}{\lambda_2+\lambda_3+\lambda_4} \times \frac{\lambda_3}{\lambda_3+\lambda_4} \end{aligned}

The above applies the independence result twice, the first time on X_1,X_2,X_3,X_4, the second time on X_2,X_3,X_4. This approach is much preferred over direct calculation, which would involve integrals that are tedious and error prone. \square
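The product formula for P[X_1<X_2<X_3] is also easy to verify by simulation. The following Python sketch (with illustrative rates, not part of the original example) compares the simulated probability with the product of rate ratios.

    import numpy as np

    rng = np.random.default_rng(seed=11)
    lam = np.array([1.0, 2.0, 3.0])       # rates for tellers 1, 2, 3 (illustrative)
    n_sims = 10**6

    x = rng.exponential(scale=1.0 / lam, size=(n_sims, 3))
    simulated = np.mean((x[:, 0] < x[:, 1]) & (x[:, 1] < x[:, 2]))

    l1, l2, l3 = lam
    theory = (l1 / (l1 + l2 + l3)) * (l2 / (l2 + l3))
    print("P[X1<X2<X3]: simulated %.4f  formula %.4f" % (simulated, theory))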

Example 3
As in Example 2, a bank has three tellers for serving its customers. The service times of the three tellers are independent exponential random variables. The mean service time for teller i is \frac{1}{\lambda_i} minutes where i=1,2,3. You walk into the bank and find that all three tellers are busy serving customers. You are the only customer waiting for an available teller. Calculate the expected amount of time you spend at the bank.

Let T be the total time you spend in the bank, which is T=W+S where W is the waiting time for a teller to become free and S is the service time of the teller helping you. When you walk into the bank, the tellers are already busy. Let X_i be the remaining service time for teller i, i=1,2,3. By the memoryless property, X_i is exponential with the original mean \frac{1}{\lambda_i}. As a result, the rate parameter of X_i is \lambda_i.

The waiting time W is simply \text{min}(X_1,X_2,X_3). Thus E[W]=\frac{1}{\lambda_1+\lambda_2+\lambda_3}. To find E[S], we need to consider three cases, depending on which teller finishes serving the current customer first.

    \displaystyle \begin{aligned} E[S]&=E[S | X_1=\text{min}(X_1,X_2,X_3)] \times P[X_1=\text{min}(X_1,X_2,X_3)] \\& \ \ + E[S | X_2=\text{min}(X_1,X_2,X_3)] \times P[X_2=\text{min}(X_1,X_2,X_3)]  \\& \ \ + E[S | X_3=\text{min}(X_1,X_2,X_3)] \times P[X_3=\text{min}(X_1,X_2,X_3)]\end{aligned}

Finishing the calculation,

    \displaystyle \begin{aligned} E[S]&=\frac{1}{\lambda_1} \times \frac{\lambda_1}{\lambda_1+\lambda_2+\lambda_3}  + \frac{1}{\lambda_2} \times \frac{\lambda_2}{\lambda_1+\lambda_2+\lambda_3}  + \frac{1}{\lambda_3} \times \frac{\lambda_3}{\lambda_1+\lambda_2+\lambda_3}   \\&=\frac{3}{\lambda_1+\lambda_2+\lambda_3} \end{aligned}

    \displaystyle E[T]=\frac{1}{\lambda_1+\lambda_2+\lambda_3}+\frac{3}{\lambda_1+\lambda_2+\lambda_3}=\frac{4}{\lambda_1+\lambda_2+\lambda_3}
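A quick Monte Carlo check of E[T]=\frac{4}{\lambda_1+\lambda_2+\lambda_3} can be done as follows (a hedged sketch with illustrative rates): simulate the remaining service times, wait for the first teller to free up, and then draw a fresh service time from that teller, which is justified by the memoryless property.

    import numpy as np

    rng = np.random.default_rng(seed=5)
    lam = np.array([1.0, 2.0, 3.0])     # service rates of the three tellers (illustrative)
    n_sims = 10**6

    remaining = rng.exponential(scale=1.0 / lam, size=(n_sims, 3))
    first_free = remaining.argmin(axis=1)                    # which teller frees up first
    wait = remaining.min(axis=1)                             # W = min(X_1, X_2, X_3)
    service = rng.exponential(scale=1.0 / lam[first_free])   # S from that teller

    total = wait + service
    print("E[T]: simulated %.4f  theory %.4f" % (total.mean(), 4.0 / lam.sum()))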

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}

The exponential distribution and the Poisson process

This post is a continuation of the previous post on the exponential distribution. The previous post discusses the basic mathematical properties of the exponential distribution including the memoryless property. This post looks at the exponential distribution from another angle by focusing on the intimate relation with the Poisson process.

_______________________________________________________________________________________________

The Poisson Process

A previous post shows that a sub family of the gamma distribution that includes the exponential distribution is derived from a Poisson process. This post gives another discussion on the Poisson process to draw out the intimate connection between the exponential distribution and the Poisson process.

Suppose a type of random events occur at the rate of \lambda events in a time interval of length 1. As the random events occur, we wish to count the occurrences. Starting at time 0, let N(t) be the number of events that occur by time t. A counting process is the collection of all the random variables N(t). That is, we are interested in the collection \left\{N(t): t \ge 0 \right\}. More specifically, we are interested in a counting process that satisfies the following three axioms:

  • The probability of having more than one occurrence in a short time interval is essentially zero.
  • The probability of the occurrence of a random event in a short time interval is proportional to the length of the time interval and does not depend on where the time interval is located. More specifically, the probability of the occurrence of the random event in a short interval of length h is approximately \lambda \ h.
  • The numbers of random events occurring in non-overlapping time intervals are independent.

Any counting process that satisfies the above three axioms is called a Poisson process with the rate parameter \lambda. Of special interest are the counting random variables N(t), which is the number of random events that occur in the interval [0,t] and N(s+t)-N(s), which is the number of events that occur in the interval [s,s+t]. There are also continuous variables that are of interest. For example, the time until the occurrence of the first event, denoted by S_1, and in general, the time until the occurrence of the nth event, denoted by S_n. We are also interested in the interarrival times, i.e. the time between the occurrences of two consecutive events. These are notated by T_1, T_2, T_3,\cdots where T_n is the time between the occurrence of the (n-1)st event and the occurrence of the nth event. Of course, S_n=T_1+T_2+\cdots+T_n.

_______________________________________________________________________________________________

Some Basic Results

Given a Poisson process \left\{N(t): t \ge 0 \right\} with rate parameter \lambda>0, we discuss the following basic results:

    Counting Variables

  • The discrete random variable N(t) is a Poisson random variable with mean \lambda \ t.
  • The discrete random variable N(s+t)-N(s) is a Poisson random variable with mean \lambda \ t.

    Waiting Time Variables

  • The interarrival times, T_n, where n=1,2,3,\cdots, are independent and identically distributed exponential random variables with rate parameter \lambda (or mean \frac{1}{\lambda}).
  • The waiting time until the nth event has a gamma distribution with shape parameter n and rate parameter \lambda.

The result that N(t) is a Poisson random variable is a consequence of the fact that the Poisson distribution is the limit of the binomial distribution. To see this, let’s say we have a Poisson process with rate \lambda>0. Note that \lambda is the rate of occurrence of the event per unit time interval. Then subdivide the interval [0,t] into n subintervals of equal length. When n is sufficiently large, we can assume that there can be only at most one event occurring in a subinterval (using the first two axioms in the Poisson process). Each subinterval is then like a Bernoulli trial (either 0 events or 1 event occurring in the subinterval). The probability of having exactly one event occurring in a subinterval is approximately \lambda (\frac{t}{n})=\frac{\lambda t}{n}. By the third criterion in the Poisson process, the n subintervals are independent Bernoulli trials. Thus the total number of events occurring in these n subintervals is a binomial random variable with n trials and with probability of success in each trial being \frac{\lambda t}{n}. It can be shown mathematically that when n \rightarrow \infty, the binomial distributions converge to the Poisson distribution with mean \lambda \ t. This fact is shown here and here. It follows that N(t) has a Poisson distribution with mean \lambda \ t.

To show that the increment N(s+t)-N(s) is a Poisson distribution, we simply count the events in the Poisson process starting at time s. It is clear that the resulting counting process is also a Poisson process with rate \lambda. We can use the same subdivision argument to derive the fact that N(s+t)-N(s) is a Poisson random variable with mean \lambda \ t. The subdividing is of course on the interval [s,s+t].

Based on the preceding discussion, given a Poisson process with rate parameter \lambda, the number of occurrences of the random events in any interval of length t has a Poisson distribution with mean \lambda \ t. Thus in a Poisson process, the number of events that occur in any interval of the same length has the same distribution. Any counting process that satisfies this property is said to possess stationary increments. On the other hand, any counting process that satisfies the third criterion in the Poisson process (the numbers of occurrences of events in disjoint intervals are independent) is said to have independent increments. Thus a Poisson process possesses independent increments and stationary increments.

Here is an interesting observation as a result of the possession of independent increments and stationary increments in a Poisson process. By independent increments, the process from any point forward is independent of what had previously occurred. By stationary increments, from any point forward, the occurrences of events follow the same distribution as in the previous phase. Starting with a Poisson process, if we count the events from some point forward (calling the new point as time zero), the resulting counting process is probabilistically the same as the original process. The probabilistic behavior of the new process from some point on is not dependent on history. The probability statements we can make about the new process from some point on can be made using the same parameter as the original process. In other words, a Poisson process has no memory.

We now discuss the continuous random variables derived from a Poisson process. The time until the first change, T_1, has an exponential distribution with mean \frac{1}{\lambda}. To see this, for T_1>t to happen, there must be no events occurring in the interval [0,t]. Thus P[T_1>t]=P[N(t)=0]=e^{-\lambda t}. This means that T_1 has an exponential distribution with rate \lambda. What about T_2 and T_3 and so on? After the first event has occurred, we can reset the counting process to count the events starting at time T_1. Then the time until the next occurrence is also an exponential random variable with rate \lambda. Consequently, all the interarrival times T_n are exponential random variables with the same rate \lambda. Furthermore, by the discussion in the preceding paragraph, the exponential interarrival times T_1, T_2, T_3,\cdots are independent.

As a consequence of the T_n being independent exponential random variables, the waiting time until the nth change S_n is a gamma random variable with shape parameter n and rate parameter \lambda. Note that S_n=T_1+\cdots+T_n and that the independent sum of n identically distributed exponential random variables has a gamma distribution with shape parameter n and rate parameter \lambda, the common exponential rate.

The connection between exponential/gamma and the Poisson process provides an expression of the CDF and survival function for the gamma distribution when the shape parameter is an integer. Specifically, the following shows the survival function and CDF of the waiting time S_n as well as the density.

    \displaystyle P(S_n>t)=\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}

    \displaystyle P(S_n \le t)=1-\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}=\sum \limits_{k=n}^{\infty} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}

    \displaystyle f_{S_n}(t)=\frac{1}{(n-1)!} \ \lambda^n \ t^{n-1} \ e^{-\lambda t}

The key in establishing the survival \displaystyle P(S_n>t) is that the waiting time S_n is intimately related to N(t), which has a Poisson distribution with mean \lambda t. For S_n>t to happen, there can be at most n-1 events occurring prior to time t, i.e. N(t) \le n-1.
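The equivalence between the gamma survival function and the Poisson sum can be checked directly, for instance with scipy (a minimal sketch; the parameter values below are arbitrary).

    from scipy.stats import gamma, poisson

    lam = 2.0    # rate of the Poisson process (illustrative)
    n = 4        # waiting for the nth event
    t = 1.5

    # P(S_n > t) computed two ways: gamma survival vs. Poisson CDF at n-1.
    gamma_sf = gamma.sf(t, a=n, scale=1.0 / lam)
    poisson_sum = poisson.cdf(n - 1, mu=lam * t)

    print("gamma survival: %.6f" % gamma_sf)
    print("Poisson sum   : %.6f" % poisson_sum)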

_______________________________________________________________________________________________

Reversing the Poisson Process

The preceding discussion shows that a Poisson process has independent exponential waiting times between any two consecutive events and gamma waiting times between any two events. Starting with a collection \left\{N(t): t \ge 0 \right\} of Poisson counting random variables that satisfies the three axioms described above, it can be shown that the interarrival times are independent exponential random variables with the same rate parameter as in the given Poisson process. Interestingly, the process can also be reversed, i.e. given a sequence of independent and identically distributed exponential random variables, each with rate \lambda, a Poisson process can be generated.

To see this, let T_1,T_2,T_3,\cdots be a sequence of independent and identically distributed exponential random variables with rate parameter \lambda. Mathematically, the T_n are just independent and identically distributed exponential random variables. Now think of them as the interarrival times between consecutive events. So the first event occurs at time S_1=T_1 and the second event occurs at time S_2=T_1+T_2 and so on. In general, the nth event occurs at time S_n=T_1+\cdots+T_n. More specifically, the counting process is \left\{N(t): t \ge 0 \right\} where N(t) is defined below:

    N(t)=\text{max} \left\{n: S_n=T_1+\cdots+T_n \le t \right\}

For N(t)=n to happen, it must be true that S_n=T_1+\cdots+T_n \le t and S_{n+1}=T_1+\cdots+T_n+T_{n+1} > t. The probability P[N(t)=n] is then

    \displaystyle \begin{aligned}P[N(t)=n]&=P[S_{n+1} > t \text{ and } S_n \le t]\\&=P[S_{n+1} > t \text{ and } (\text{ not } S_n > t)] \\&=P[S_{n+1} > t]-P[S_n > t] \\&=\sum \limits_{k=0}^{n} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}-\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!} \\&=\frac{e^{-\lambda t} \ (\lambda t)^n}{n!} \end{aligned}

The above derivation shows that the counting variable N(t) is a Poisson random variable with mean \lambda t. The derivation uses the gamma survival function derived earlier. On the other hand, if the counting of events starts at a time s rather than at time 0, the memoryless property ensures that the time from s until the next arrival is again exponential with rate \lambda and is independent of what happened before time s, while the subsequent interarrival times are unaffected. By the same argument, N(s+t)-N(s) would be Poisson with mean \lambda t.

We have just established that the resulting counting process from independent exponential interarrival times T_n has stationary increments. The resulting counting process has independent increments too. This is because the interarrival times T_n are independent and memoryless.

From a mathematical point of view, a sequence of independent and identically distributed exponential random variables leads to a Poisson counting process. On the other hand, if random events occur in such a manner that the times between two consecutive events are independent and identically and exponentially distributed, then such a process is a Poisson process. This characterization gives another way to work with Poisson processes.
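The reverse construction can be illustrated in a few lines of Python (a sketch under arbitrary parameter choices): generate iid exponential interarrival times, count the arrivals by a fixed time t, and compare the counts with what a Poisson distribution would give.

    import numpy as np

    rng = np.random.default_rng(seed=42)
    lam, t = 1.5, 4.0                    # rate and observation window (illustrative)
    n_sims, max_events = 100000, 60

    # Interarrival times T_1, T_2, ... for each simulated path.
    interarrivals = rng.exponential(scale=1.0 / lam, size=(n_sims, max_events))
    arrival_times = interarrivals.cumsum(axis=1)     # S_n = T_1 + ... + T_n
    counts = (arrival_times <= t).sum(axis=1)        # N(t) = max{n : S_n <= t}

    print("simulated mean of N(t): %.4f  (should be close to lambda*t = %.4f)"
          % (counts.mean(), lam * t))
    print("simulated variance    : %.4f  (Poisson => close to the mean)"
          % counts.var())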

_______________________________________________________________________________________________

Examples

Example 1
Suppose that the time until the next departure of a bus at a certain bus station is exponentially distributed with mean 10 minutes. Assume that the times in between consecutive departures at this bus station are independent.

  • Tom and his friend Mike are to take a bus trip together. Tom arrives at the bus station at 12:00 PM and is the first one to arrive. Mike arrives at the bus stop at 12:30 PM. They will board the first bus to depart after the arrival of Mike. What is the probability that zero buses depart from this bus station while Tom is waiting for Mike?
  • Answer the same question for one bus and for two buses.
  • What is the probability that there are at least three buses leaving the station while Tom is waiting?

Because the inter-departure times are independent and exponential with the same mean, the random events (bus departures) occur according to a Poisson process with rate \frac{1}{10} per minute, or 1 bus per 10 minutes. The number of bus departures in a 30-minute period is a Poisson random variable with mean 3 (per 30 minutes). Let N be the number of buses leaving the bus station between 12:00 PM and 12:30 PM. Thus the answers are:

    P[N=0]=e^{-3}=0.05

    P[N=1]=e^{-3} \times 3=0.149

    P[N=2]=e^{-3} \times \frac{3^2}{2}=0.224

    P[N \ge 3]=1-P[N=0]-P[N=1]-P[N=2]=0.5768 \square

Example 2
Taxis arrive at a certain street corner according to a Poisson process at the rate of two taxis per 15 minutes. Suppose that you are waiting for a taxi at this street corner and you are third in line. Assume that the people waiting for taxis do not know each other and each will take his or her own taxi. What is the probability that you will board a taxi within 30 minutes?

The number of taxi arrivals in a 30-minute period has a Poisson distribution with a mean of 4 (per 30 minutes). If at least 3 taxis arrive, then you are fine. Let N be the number of taxi arrivals in a 30-minute period. The answer is

    \displaystyle \begin{aligned}P[N \ge 3]&=1-P[N=0] -P[N=1]-P[N=2]\\&=1-e^{-4}-e^{-4} \times 4-e^{-4} \times \frac{4^2}{2} \\&=1-0.2381 \\&=0.7619 \end{aligned}
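Both of the Poisson tail probabilities in these two examples can be computed directly, for instance with scipy (a minimal sketch confirming the numbers above).

    from scipy.stats import poisson

    # Example 1: bus departures, mean 3 per 30-minute period.
    print("P[N=0,1,2]:", [round(poisson.pmf(k, mu=3), 4) for k in (0, 1, 2)])
    print("P[N>=3]   : %.4f" % poisson.sf(2, mu=3))

    # Example 2: taxi arrivals, mean 4 per 30-minute period.
    print("P[N>=3]   : %.4f" % poisson.sf(2, mu=4))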

_______________________________________________________________________________________________

Remarks

In this post, we present a view of the exponential distribution from the viewpoint of the Poisson process. Any counting process that satisfies the three axioms of a Poisson process has independent exponential waiting times between any two consecutive events and gamma waiting times between any two events. On the other hand, given a sequence of independent and identically distributed exponential interarrival times, a Poisson process can be derived. The memoryless property of the exponential distribution plays a central role in the interplay between Poisson and exponential.

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}

The exponential distribution

This post focuses on the mathematical properties of the exponential distribution. Since the exponential distribution is a special case of the gamma distribution, the starting point of the discussion is on the properties that are inherited from the gamma distribution. The discussion then switches to other intrinsic properties of the exponential distribution, e.g. the memoryless property. The next post discusses the intimate relation with the Poisson process. Additional topics on exponential distribution are discussed in this post.

The exponential distribution is highly mathematically tractable. For the exponential distribution with mean \frac{1}{\beta} (or rate parameter \beta), the density function is f(x)=\beta \ e^{-\beta x}. Thus for the exponential distribution, many distributional items have expressions in closed form. The exponential distribution can certainly be introduced by performing calculations using the density function. For the sake of completeness, the distribution is introduced here as a special case of the gamma distribution.

_______________________________________________________________________________________________

The Gamma Perspective

The exponential distribution is a special case of the gamma distribution. Recall that the gamma distribution has two parameters, the shape parameter \alpha and the rate parameter \beta. Another parametrization uses the scale parameter \theta, which is \theta=1 / \beta. For now we stick with the rate parameter \beta because of the connection with the Poisson process discussed below. The exponential distribution is simply a gamma distribution when \alpha=1. Immediately the properties of the gamma distribution discussed in the previous post can be transferred here. Suppose that X is a random variable that follows the exponential distribution with rate parameter \beta, equivalently with mean 1 / \beta. Immediately we know the following.

\displaystyle \begin{array}{lllll} \text{ } &\text{ } & \text{Definition} & \text{ } & \text{Exponential Distribution} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{PDF} &\text{ } & \text{ } & \text{ } & \displaystyle \begin{array}{ll} \displaystyle f_X(x)=\beta \ e^{-\beta x} & \ \ \displaystyle x>0  \\ \text{ } & \text{ }    \end{array} \\     \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{CDF} &\text{ } & \displaystyle F_X(x)=\int_0^x \ f_X(t) \ dt & \text{ } & \displaystyle \begin{aligned} F_X(x)&=1-e^{-\beta \ x}   \end{aligned} \\    \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{Survival Function} &\text{ } & \displaystyle S_X(x)=\int_x^\infty \ f_X(t) \ dt & \text{ } & \displaystyle \begin{aligned} S_X(x)&=e^{-\beta \ x}   \end{aligned} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{Mean} &\text{ } & \displaystyle E(X)=\int_0^\infty t \ f_X(t) \ dt & \text{ } & \displaystyle \frac{1}{\beta} \\     \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{Higher Moments} &\text{ } & \displaystyle E(X^k)=\int_0^\infty t^k \ f_X(t) \ dt & \text{ } & \displaystyle  \left\{ \begin{array}{ll}                     \displaystyle  \frac{\Gamma(1+k)}{\beta^k} &\ k>-1 \\           \text{ } & \text{ } \\           \displaystyle  \frac{k!}{\beta^k} &\ k \text{ is a positive integer}           \end{array} \right. \\   \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{Variance} &\text{ } & E(X^2)-E(X)^2 & \text{ } & \displaystyle \frac{1}{\beta^2} \\  \text{ } &\text{ } & \text{ } & \text{ } & \text{ } \\    \end{array}

\displaystyle \begin{array}{lllll}  \text{ } &\text{ } & \text{ } & \text{ } & \text{ } \\  \text{Mode} \ \ &\text{ } & \text{ } & \text{ } & \text{always } 0 \\   \text{ } &\text{ } & \text{ } & \text{ } & \text{ } \\  \text{MGF} \ \ &\text{ } &  M_X(t)=E[e^{tX}] \ \ \ \ \ \ \ \ & \text{ } & \displaystyle \begin{array}{ll}  \displaystyle\frac{\beta}{\beta- t} & \ \ \ \ \ \displaystyle t<\beta   \end{array} \\  \text{ } &\text{ } & \text{ } & \text{ } & \text{ } \\  \text{CV} \ \ &\text{ } & \displaystyle \frac{\sqrt{Var(X)}}{E(X)}=\frac{\sigma}{\mu} \ \ \ \ \ \ \ \ & \text{ } & 1 \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{Skewness} \ \ &\text{ } & \displaystyle E\biggl[\biggl(\frac{X-\mu}{\sigma}\biggr)^3\biggr] \ \ \ \ \ \ \ \ & \text{ } & 2 \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{Kurtosis} \ \ &\text{ } & \displaystyle E\biggl[\biggl(\frac{X-\mu}{\sigma}\biggr)^4\biggr] \ \ \ \ \ \ \ \ & \text{ } & 9 \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{Excess Kurtosis} \ \ &\text{ } & \displaystyle E\biggl[\biggl(\frac{X-\mu}{\sigma}\biggr)^4\biggr]-3 \ \ \ \ \ \ \ \ & \text{ } & 6   \end{array}

The above items are obtained by plugging \alpha=1 into the results in the post on gamma distribution. It is clear that exponential distribution is very mathematically tractable. The CDF has a closed form. The moments have a closed form. As a result, it is possible to derive many more properties than the ones shown here.

_______________________________________________________________________________________________

The Memoryless Property

No discussion on the exponential distribution is complete without the mentioning of the memoryless property. Suppose a random variable X has support on the interval (0, \infty). The random variable X is said to have the memoryless property (or the “no memory” property) if

    \displaystyle P(X > u+t \ | \ X > t)=P(X > u) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

holds for all positive real numbers t and u. To get an appreciation of this property, let’s think of X as the lifetime (in years) of some machine or some device. The statement in (1) says that if the device has lived at least t years, then the probability that the device will live at least u more years is the same as the probability that a brand new device lives to age u years. It is as if the device does not remember that it has already been in use for t years. If the lifetime of a device satisfies this property, it does not matter if you buy an old one or a new one. Both old and new have the same probability of living an additional u years. In other words, an old device is as good as new.

Since the following is true,

    \displaystyle \begin{aligned} P(X > u+t \ | \ X > t)&=\frac{P(X > u+t \text{ and } X > t)}{P(X > t)} \\&=\frac{P(X > u+t)}{P(X > t)}  \end{aligned}

the memoryless property (1) is equivalent to the following:

    \displaystyle P(X > u+t )=P(X > u) \times P(X > t) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The property (2) says that the survival function of this distribution is a multiplicative function. The exponential distribution satisfies this property, i.e.

    \displaystyle P(X > u+t )=e^{-\beta (u+t)}=e^{-\beta u} \ e^{-\beta t}=P(X > u) \times P(X > t)

On the other hand, any continuous function that satisfies the multiplicative property (2) must be an exponential function (see the argument at the end of the post). Thus there is only one continuous probability distribution that possesses the memoryless property. The memoryless property can also be stated in a different but equivalent way as follows:

    The conditional random variable X-t \ | \ X > t is distributed identically as the unconditional random variable X. \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

The statement (1) can be rewritten \displaystyle P(X-t > u \ | \ X > t)=P(X > u). This means the conditional variable X-t \ | \ X > t shares the same survival function as the original random variable X, hence sharing the same cumulative distribution and the same density function and so on.

The memoryless property, in any one of the three statements, is a striking property. Once again, think of X as the lifetime of a system or a device. The conditional random variable in statement (3) is the remaining lifetime for a device at age t. Statement (3) says that it does not matter how old the device is: the remaining useful life is still governed by the same probability distribution as for a brand new device, i.e. it is just as likely for an old device to live u more years as for a new device, and an old device is expected to live as long as a new device.
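The memoryless property in (1) and (3) can also be illustrated numerically. The following Python sketch (with illustrative parameter values) estimates P(X > u+t \ | \ X > t) from simulation and compares it with P(X > u).

    import numpy as np

    rng = np.random.default_rng(seed=0)
    beta = 0.5                      # rate parameter (illustrative)
    t, u = 2.0, 1.5
    x = rng.exponential(scale=1.0 / beta, size=10**6)

    survivors = x[x > t]                         # condition on X > t
    conditional = np.mean(survivors > t + u)     # estimate of P(X > u+t | X > t)
    unconditional = np.mean(x > u)               # estimate of P(X > u)

    print("P(X > u+t | X > t): %.4f" % conditional)
    print("P(X > u)          : %.4f" % unconditional)
    print("exact e^{-beta*u} : %.4f" % np.exp(-beta * u))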

_______________________________________________________________________________________________

Examples

Let’s look at a few examples.

Example 1
Suppose that a bank has two tellers serving retail customers who walk into the bank lobby. A queue is set up for customers to wait for one of the tellers. The time between the arrivals of two consecutive customers in the queue has an exponential distribution with mean 3 minutes. The time each teller spends with a customer has an exponential distribution with mean 5 minutes. Assume that the service times for the two tellers are independent. At 12:00 PM, both tellers are busy and a customer has just arrived in the queue.

  1. What is the probability that the next arrival in the queue will come before 12:05 PM? Between 12:05 and 12:10? After 12:10 PM?
  2. If no additional customers arrive before 12:05 PM, what is the probability that the next arrival will come within the next 5 minutes?
  3. If both tellers are busy serving customers at 1:00 PM, what is the probability that neither teller will finish the service before 1:05 PM? Before 1:10 PM?

Let X be the waiting time between two consecutive customers and Y be the service time of a bank teller, both in minutes. The answers for Part 1 are

    P[X \le 5]=1-e^{-\frac{5}{3}}=0.81

    P[5 < X \le 10]=e^{-\frac{5}{3}}-e^{-\frac{10}{3}}=0.1532

    P[X > 10]=e^{-\frac{10}{3}}=0.0357

Part 2 involves the memoryless property. The answer is:

    \displaystyle \begin{aligned} P[X \le 10 |X > 5]&=1-P[X > 10 |X > 5] \\&=1-P[X > 5] \\&=P[X \le 5] \\&=1-e^{-\frac{5}{3}}=0.81 \end{aligned}

Part 3 also involves the memoryless property. It does not matter how long each server has spent with the customer prior to 1:00 PM, the remaining service time after 1:00 PM still has the same exponential service time distribution. The answers are:

    \displaystyle \begin{aligned} P[\text{both service times more than 5 min}]&=P[Y > 5] \times P[Y > 5] \\&=e^{-\frac{5}{5}} \times e^{-\frac{5}{5}} \\&=e^{-2} \\&=0.1353 \end{aligned}

    \displaystyle \begin{aligned} P[\text{both service times more than 10 min}]&=P[Y > 10] \times P[Y > 10] \\&=e^{-\frac{10}{5}} \times e^{-\frac{10}{5}} \\&=e^{-4} \\&=0.0183 \end{aligned}

Example 2
Suppose that times between fatal auto accidents on a stretch of busy highway have an exponential distribution with a mean of 20 days. Suppose that an accident occurred on July 1. What is the probability that another fatal accident occurred in the same month? If the month of July were accident-free in this stretch of highway except for the accident on July 1, what is the probability that there will be another fatal accident in the following month (August)?

Let X be the time in days from July 1 to the next fatal accident on this stretch of highway. Then X is exponentially distributed with a mean of 20 days. The probability that another fatal accident will occur in the month of July is P[X \le 31], which is

    P[X \le 31]=1-e^{-\frac{31}{20}}=1-e^{-1.55}=0.7878.

Note that the month of July has 31 days and the month of August has 31 days. If the month of July is accident-free except for the accident on July 1, then the probability that an accident occurs in August is:

    \displaystyle \begin{aligned} P[X \le 62 |X > 31]&=1-P[X > 62 |X > 31] \\&=1-P[X > 31] \\&=P[X \le 31] \\&=1-e^{-\frac{31}{20}}=1-e^{-1.55}=0.7878 \end{aligned}

The memoryless property is used in the above derivation in setting P[X > 62 |X > 31]=P[X > 31]. If the occurrence of fatal accidents is a random event and, furthermore, the time between two successive accidents is exponentially distributed, then there is no “memory” in the waiting for the next fatal accident. Having a full month of no accidents has no bearing on when the next fatal accident occurs. \square

Example 3
Suppose that the amount of damage in an automobile accident follows an exponential distribution with mean 2000. An insurance coverage is available to cover such damages subject to a deductible of 1000. That is, if the damage amount is less than the deductible, the insurance pays nothing. If the damage amount is greater than the deductible, the policy pays the damage amount in excess of the deductible. Determine the mean, variance and standard deviation of the insurance payment per accident.

For clarity, the example is first discussed using \beta and d, where \frac{1}{\beta} is the mean of the exponential damage and d is the deductible. Let X be the amount of the damage of an auto accident. Let Y be the amount paid by the insurance policy per accident. Then the following is the rule for determining the amount of payment.

    \displaystyle  Y = \left\{ \begin{array}{ll}                     \displaystyle  0 & \ \ \ \ X \le d \\           \text{ } & \text{ } \\           \displaystyle   X-d &\ \ \ \ d < X           \end{array} \right.

With this payment rule, E[Y], Var[Y] and \sigma_Y can be worked out based on the exponential random variable X for the damage amount as follows:

    \displaystyle E[Y]=\int_{d}^\infty (x-d) \ \beta \ e^{-\beta x} \ dx

    \displaystyle E[Y^2]=\int_{d}^\infty (x-d)^2 \ \beta \ e^{-\beta x} \ dx

    Var[Y]=E[Y^2]-E[Y]^2

    \sigma_Y=\sqrt{Var[Y]}
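As an aside, the first of these integrals can be evaluated in closed form with the substitution u=x-d:

    \displaystyle E[Y]=\int_{d}^\infty (x-d) \ \beta \ e^{-\beta x} \ dx=e^{-\beta d} \int_{0}^\infty u \ \beta \ e^{-\beta u} \ du=\frac{1}{\beta} \ e^{-\beta d}

The remaining integral is simply the mean of an exponential distribution with rate \beta, which is \frac{1}{\beta}. The same substitution works for E[Y^2].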

Because the exponential distribution is mathematically tractable, the mean E[Y] and the variance Var[Y] are straightforward to compute from these integrals; indeed, they are an excellent exercise in working with the exponential distribution. We would like to demonstrate a different approach, however: because of the memoryless property, there is no need to calculate the above integrals at all.

The insurance payment Y is a mixture; specifically, it takes one of two forms. With probability P[X \le d]=1-e^{-\beta d}, Y=0. With probability P[X > d]=e^{-\beta d}, Y is the conditional payment X-d \ | \ X > d. Because X is an exponential random variable, the memoryless property tells us that X-d \ | \ X > d has the same distribution as the original damage amount X. Thus the mean E[Y] is the weighted average of E[Y|X \le d] and E[Y|X > d], and likewise for the second moment E[Y^2]. The following shows how to calculate the first two moments of Y.

    \displaystyle \begin{aligned} E[Y]&=0 \times P[X \le d]+E[X-d |X > d] \times P[X > d]  \\&=0 \times (1-e^{-\beta d})+E[X] \times e^{-\beta d} \\&=0 \times (1-e^{-\beta d})+\frac{1}{\beta} \times e^{-\beta d} \\&=\frac{1}{\beta} \times e^{-\beta d} \end{aligned}

    \displaystyle \begin{aligned} E[Y^2]&=0 \times P[X \le d]+E[(X-d)^2 |X > d] \times P[X > d]  \\&=0 \times (1-e^{-\beta d})+E[X^2] \times e^{-\beta d} \\&=0 \times (1-e^{-\beta d})+\frac{2}{\beta^2} \times e^{-\beta d} \\&=\frac{2}{\beta^2} \times e^{-\beta d} \end{aligned}

The second moment of the random variable X is E[X^2]=Var[X]+E[X]^2=\frac{1}{\beta^2}+\frac{1}{\beta^2}=\frac{2}{\beta^2}. The variance of the insurance payment Y is

    \displaystyle Var[Y]=\frac{2 e^{-\beta d}}{\beta^2}-\biggl( \frac{e^{-\beta d}}{\beta} \biggr)^2

In this example, \frac{1}{\beta}=2000 and d=1000. We have:

    \displaystyle E[Y]=2000 \ e^{-\frac{1000}{2000}}=2000 \ e^{-0.5}=1213.06

    \displaystyle Var[Y]=2 \ (2000^2) \ e^{-0.5}-(2000 e^{-0.5})^2=3380727.513

    \sigma_Y=\sqrt{Var[Y]}=1838.675

Using the memoryless property and the fact that the insurance payment Y is a mixture requires less calculation. If the damage amount X is not exponential, then we may have to resort to computing the above integrals directly. \square
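As a numerical check of both approaches, here is a minimal sketch (numpy and scipy assumed; purely illustrative) that computes E[Y], Var[Y] and \sigma_Y from the mixture formulas and then re-derives the first two moments by integrating the payment against the exponential density.

    import numpy as np
    from scipy.integrate import quad

    mean_damage, d = 2000.0, 1000.0      # mean damage and deductible from the example
    beta = 1.0 / mean_damage

    # Moments of the payment Y from the mixture/memoryless formulas derived above.
    ey  = np.exp(-beta * d) / beta                   # E[Y]   = 1213.06...
    ey2 = 2.0 * np.exp(-beta * d) / beta**2          # E[Y^2]
    var = ey2 - ey**2                                # Var[Y] = 3380727.5...
    print(ey, var, np.sqrt(var))

    # Direct calculation of the integrals for E[Y] and E[Y^2] as a check.
    pdf = lambda x: beta * np.exp(-beta * x)         # exponential density of the damage X
    ey_direct  = quad(lambda x: (x - d) * pdf(x), d, np.inf)[0]
    ey2_direct = quad(lambda x: (x - d) ** 2 * pdf(x), d, np.inf)[0]
    print(ey_direct, ey2_direct - ey_direct ** 2)    # should match the values above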

_______________________________________________________________________________________________

The Unique Distribution with the Memoryless Property

Now we show that the exponential distribution is the only continuous distribution with the memoryless property. First we establish the fact that any right continuous function g defined on (0,\infty) satisfying the functional relation g(s+t)=g(s) \ g(t) must be an exponential function. The statement that g is right continuous means that if x_n \rightarrow x and x<x_n for all n, then g(x_n) \rightarrow g(x).

Let g be a right continuous function defined on (0,\infty) that satisfies the functional relation g(s+t)=g(s) \ g(t). First, we establish the following:

    g(\frac{m}{n})=g(1)^{\frac{m}{n}} for any positive integers m and n.

Note that for any positive integer n, g(\frac{2}{n})=g(\frac{1}{n}+\frac{1}{n})=g(\frac{1}{n}) \ g(\frac{1}{n})=g(\frac{1}{n})^2. It follows that for any positive integer m, g(\frac{m}{n})=g(\frac{1}{n})^m. On the other hand, g(\frac{1}{n})=g(1)^\frac{1}{n}. To see this, note that g(1)=g(\frac{1}{n}+\cdots+\frac{1}{n})=g(\frac{1}{n})^n, and raising both sides to the power \frac{1}{n} gives the claim. Combining these two claims gives the fact stated above: g(r)=g(1)^r for any positive rational number r.

Next, we show that g(x)=g(1)^x for any positive real number x. To see this, let r_j be a sequence of rational numbers converging to x from the right. By right continuity, g(r_j) \rightarrow g(x). By the above fact, g(r_j)=g(1)^{r_j} for all j, and g(1)^{r_j} \rightarrow g(1)^{x}. The sequence g(r_j) thus converges to both g(x) and g(1)^{x}, so g(x)=g(1)^{x}. This means that g is an exponential function with base g(1). If the base e is desired, write g(x)=e^{a x} where a=\text{ln}(g(1)).

Suppose that X is memoryless. Let S(x)=P[X>x] be the survival function of the random variable X. Then by property (2), S(s+t)=S(s) \ S(t), so the survival function satisfies the functional relation g(s+t)=g(s) \ g(t). A survival function is always right continuous. By the fact established above, S(x) must be an exponential function, and thus X is an exponential random variable. This establishes that among the continuous distributions, the exponential distribution is the only one with the memoryless property.
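To make the exponential parameter explicit (setting aside the degenerate cases S(1)=0 and S(1)=1), note that a survival function must decrease to 0 as x \rightarrow \infty, so 0<S(1)<1 and a=\text{ln}(S(1)) is negative. Writing \theta=-\frac{1}{a}, the survival function is

    \displaystyle S(x)=P[X>x]=e^{a x}=e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ \ \theta=-\frac{1}{\text{ln}(S(1))}

which is precisely the survival function of an exponential distribution with mean \theta.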

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}