# A catalog of parametric severity models

Various parametric continuous probability models have been presented and discussed in this blog. The number of parameters in these models ranges from one to two, and in a small number of cases three. They are all potential candidates for models of severity in insurance applications and in other actuarial applications. This post highlights these models. The list presented here is not exhaustive; it is only a brief catalog. There are other models that are also suitable for actuarial applications but not accounted for here. However, the list is a good place to begin. This post also serves a navigation device (the table shown below contains links to the blog posts).

A Catalog

Many of the models highlighted here are related to gamma distribution either directly or indirectly. So the catalog starts with the gamma distribution at the top and then branches out to the other related models. Mathematically, the gamma distribution is a two-parameter continuous distribution defined using the gamma function. The gamma sub family includes the exponential distribution, Erlang distribution and chi-squared distribution. These are distributions that are gamma distributions with certain restrictions on the one or both of the gamma parameters. Other distributions are obtained by raising a distribution to a power. Others are obtained by mixing distributions.

Here’s a listing of the models. Click on the links to find out more about the distributions.

……Derived From ………………….Model
Gamma function
Gamma sub families
Independent sum of gamma
Exponentiation
Raising to a power Raising exponential to a positive power

Raising exponential to a power

Raising gamma to a power

Raising Pareto to a power

Burr sub families
Mixture
Others

The above table categorizes the distributions according to how they are mathematically derived. For example, the gamma distribution is derived from the gamma function. The Pareto distribution is mathematically an exponential-gamma mixture. The Burr distribution is a transformed Pareto distribution, i.e. obtained by raising a Pareto distribution to a positive power. Even though these distributions can be defined simply by giving the PDF and CDF, knowing how their mathematical origins informs us of the specific mathematical properties of the distributions. Organizing according to the mathematical origin gives us a concise summary of the models.

$\text{ }$

$\text{ }$

From a mathematical standpoint, the gamma distribution is defined using the gamma function.

$\displaystyle \Gamma(\alpha)=\int_0^\infty t^{\alpha-1} \ e^{-t} \ dt$

In this above integral, the argument $\alpha$ is a positive number. The expression $t^{\alpha-1} \ e^{-t}$ in the integrand is always positive. The area in between the curve $t^{\alpha-1} \ e^{-t}$ and the x-axis is $\Gamma(\alpha)$. When this expression is normalized, i.e. divided by $\Gamma(\alpha)$, it becomes a density function.

$\displaystyle f(t)=\frac{1}{\Gamma(\alpha)} \ t^{\alpha-1} \ e^{-t}$

The above function $f(t)$ is defined over all positive $t$. The integral of $f(t)$ over all positive $t$ is 1. Thus $f(t)$ is a density function. It only has one parameter, the $\alpha$, which is the shape parameter. Adding the scale parameter $\theta$ making it a two-parameter distribution. The result is called the gamma distribution. The following is the density function.

$\displaystyle f(x)=\frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ x>0$

Both parameters $\alpha$ and $\theta$ are positive real numbers. The first parameter $\alpha$ is the shape parameter and $\theta$ is the scale parameter.

As mentioned above, many of the distributions listed in the above table is related to the gamma distribution. Some of the distributions are sub families of gamma. For example, when $\alpha$ are positive integers, the resulting distributions are called Erlang distribution (important in queuing theory). When $\alpha=1$, the results are the exponential distributions. When $\alpha=\frac{k}{2}$ and $\theta=2$ where $k$ is a positive integer, the results are the chi-squared distributions (the parameter $k$ is referred to the degrees of freedom). The chi-squared distribution plays an important role in statistics.

Taking independent sum of $n$ independent and identically distributed exponential random variables produces the Erlang distribution, a sub gamma family of distribution. Taking independent sum of $n$ exponential random variables, with pairwise distinct means, produces the hypoexponential distributions. On the other hand, the mixture of $n$ independent exponential random variables produces the hyperexponential distribution.

The Pareto distribution (Pareto Type II Lomax) is the mixture of exponential distributions with gamma mixing weights. Despite the connection with the gamma distribution, the Pareto distribution is a heavy tailed distribution. Thus the Pareto distribution is suitable for modeling extreme losses, e.g. in modeling rare but potentially catastrophic losses.

As mentioned earlier, raising a Pareto distribution to a positive power generates the Burr distribution. Restricting the parameters in a Burr distribution in a certain way will produces the paralogistic distribution. The table indicates the relationships in a concise way. For details, go into the blog posts to get more information.

Tail Weight

Another informative way to categorize the distributions listed in the table is through looking at the tail weight. At first glance, all the distributions may look similar. For example, the distributions in the table are right skewed distributions. Upon closer look, some of the distributions put more weights (probabilities) on the larger values. Hence some of the models are more suitable for models of phenomena with significantly higher probabilities of large or extreme values.

When a distribution significantly puts more probabilities on larger values, the distribution is said to be a heavy tailed distribution (or said to have a larger tail weight). In general tail weight is a relative concept. For example, we say model A has a larger tail weight than model B (or model A has a heavier tail than model B). However, there are several ways to check for tail weight of a given distribution. Here are the four criteria.

Tail Weight Measure What to Look for
1 Existence of moments The existence of more positive moments indicates a lighter tailed distribution.
2 Hazard rate function An increasing hazard rate function indicates a lighter tailed distribution.
3 Mean excess loss function An increasing mean excess loss function indicates a heavier tailed distribution.
4 Speed of decay of survival function A survival function that decays rapidly to zero (as compared to another distribution) indicates a lighter tailed distribution.

Existence of moments
For a positive real number $k$, the moment $E(X^k)$ is defined by the integral $\int_0^\infty x^k \ f(x) \ dx$ where $f(x)$ is the density function of the distribution in question. If the distribution puts significantly more probabilities in the larger values in the right tail, this integral may not exist (may not converge) for some $k$. Thus the existence of moments $E(X^k)$ for all positive $k$ is an indication that the distribution is a light tailed distribution.

In the above table, the only distributions for which all positive moments exist are gamma (including all gamma sub families such as exponential), Weibull, lognormal, hyperexponential, hypoexponential and beta. Such distributions are considered light tailed distributions.

The existence of positive moments exists only up to a certain value of a positive integer $k$ is an indication that the distribution has a heavy right tail. All the other distributions in the table are considered heavy tailed distribution as compared to gamma, Weibull and lognormal. Consider a Pareto distribution with shape parameter $\alpha$ and scale parameter $\theta$. Note that the existence of the Pareto higher moments $E(X^k)$ is capped by the shape parameter $\alpha$. If the Pareto distribution is to model a random loss, and if the mean is infinite (when $\alpha=1$), the risk is uninsurable! On the other hand, when $\alpha \le 2$, the Pareto variance does not exist. This shows that for a heavy tailed distribution, the variance may not be a good measure of risk.

Hazard rate function
The hazard rate function $h(x)$ of a random variable $X$ is defined as the ratio of the density function and the survival function.

$\displaystyle h(x)=\frac{f(x)}{S(x)}$

The hazard rate is called the force of mortality in a life contingency context and can be interpreted as the rate that a person aged $x$ will die in the next instant. The hazard rate is called the failure rate in reliability theory and can be interpreted as the rate that a machine will fail at the next instant given that it has been functioning for $x$ units of time.

Another indication of heavy tail weight is that the distribution has a decreasing hazard rate function. On the other hand, a distribution with an increasing hazard rate function has a light tailed distribution. If the hazard rate function is decreasing (over time if the random variable is a time variable), then the population die off at a decreasing rate, hence a heavier tail for the distribution in question.

The Pareto distribution is a heavy tailed distribution since the hazard rate is $h(x)=\alpha/x$ (Pareto Type I) and $h(x)=\alpha/(x+\theta)$ (Pareto Type II Lomax). Both hazard rates are decreasing function.

The Weibull distribution is a flexible model in that when its shape parameter is $0<\tau<1$, the Weibull hazard rate is decreasing and when $\tau>1$, the hazard rate is increasing. When $\tau=1$, Weibull is the exponential distribution, which has a constant hazard rate.

The point about decreasing hazard rate as an indication of a heavy tailed distribution has a connection with the fourth criterion. The idea is that a decreasing hazard rate means that the survival function decays to zero slowly. This point is due to the fact that the hazard rate function generates the survival function through the following.

$\displaystyle S(x)=e^{\displaystyle -\int_0^x h(t) \ dt}$

Thus if the hazard rate function is decreasing in $x$, then the survival function will decay more slowly to zero. To see this, let $H(x)=\int_0^x h(t) \ dt$, which is called the cumulative hazard rate function. As indicated above, $S(x)=e^{-H(x)}$. If $h(x)$ is decreasing in $x$, $H(x)$ has a lower rate of increase and consequently $S(x)=e^{-H(x)}$ has a slower rate of decrease to zero.

In contrast, the exponential distribution has a constant hazard rate function, making it a medium tailed distribution. As explained above, any distribution having an increasing hazard rate function is a light tailed distribution.

The mean excess loss function
The mean excess loss is the conditional expectation $e_X(d)=E(X-d \lvert X>d)$. If the random variable $X$ represents insurance losses, mean excess loss is the expected loss in excess of a threshold conditional on the event that the threshold has been exceeded. Suppose that the threshold $d$ is an ordinary deductible that is part of an insurance coverage. Then $e_X(d)$ is the expected payment made by the insurer in the event that the loss exceeds the deductible.

Whenever $e_X(d)$ is an increasing function of the deductible $d$, the loss $X$ is a heavy tailed distribution. If the mean excess loss function is a decreasing function of $d$, then the loss $X$ is a lighter tailed distribution.

The Pareto distribution can also be classified as a heavy tailed distribution based on an increasing mean excess loss function. For a Pareto distribution (Type I) with shape parameter $\alpha$ and scale parameter $\theta$, the mean excess loss is $e(X)=d/(\alpha-1)$, which is increasing. The mean excess loss for Pareto Type II Lomax is $e(X)=(d+\theta)/(\alpha-1)$, which is also decreasing. They are both increasing functions of the deductible $d$! This means that the larger the deductible, the larger the expected claim if such a large loss occurs! If the underlying distribution for a random loss is Pareto, it is a catastrophic risk situation.

In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.

Speed of decay of the survival function to zero
The survival function $S(x)=P(X>x)$ captures the probability of the tail of a distribution. If a distribution whose survival function decays slowly to zero (equivalently the cdf goes slowly to one), it is another indication that the distribution is heavy tailed. This point is touched on when discussing hazard rate function.

The following is a comparison of a Pareto Type II survival function and an exponential survival function. The Pareto survival function has parameters ($\alpha=2$ and $\theta=2$). The two survival functions are set to have the same 75th percentile, which is $x=2$. The following table is a comparison of the two survival functions.

$\displaystyle \begin{array}{llllllll} \text{ } &x &\text{ } & \text{Pareto } S_X(x) & \text{ } & \text{Exponential } S_Y(x) & \text{ } & \displaystyle \frac{S_X(x)}{S_Y(x)} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{ } &2 &\text{ } & 0.25 & \text{ } & 0.25 & \text{ } & 1 \\ \text{ } &10 &\text{ } & 0.027777778 & \text{ } & 0.000976563 & \text{ } & 28 \\ \text{ } &20 &\text{ } & 0.008264463 & \text{ } & 9.54 \times 10^{-7} & \text{ } & 8666 \\ \text{ } &30 &\text{ } & 0.00390625 & \text{ } & 9.31 \times 10^{-10} & \text{ } & 4194304 \\ \text{ } &40 &\text{ } & 0.002267574 & \text{ } & 9.09 \times 10^{-13} & \text{ } & 2.49 \times 10^{9} \\ \text{ } &60 &\text{ } & 0.001040583 & \text{ } & 8.67 \times 10^{-19} & \text{ } & 1.20 \times 10^{15} \\ \text{ } &80 &\text{ } & 0.000594884 & \text{ } & 8.27 \times 10^{-25} & \text{ } & 7.19 \times 10^{20} \\ \text{ } &100 &\text{ } & 0.000384468 & \text{ } & 7.89 \times 10^{-31} & \text{ } & 4.87 \times 10^{26} \\ \text{ } &120 &\text{ } & 0.000268745 & \text{ } & 7.52 \times 10^{-37} & \text{ } & 3.57 \times 10^{32} \\ \text{ } &140 &\text{ } & 0.000198373 & \text{ } & 7.17 \times 10^{-43} & \text{ } & 2.76 \times 10^{38} \\ \text{ } &160 &\text{ } & 0.000152416 & \text{ } & 6.84 \times 10^{-49} & \text{ } & 2.23 \times 10^{44} \\ \text{ } &180 &\text{ } & 0.000120758 & \text{ } & 6.53 \times 10^{-55} & \text{ } & 1.85 \times 10^{50} \\ \text{ } & \text{ } \\ \end{array}$

Note that at the large values, the Pareto right tails retain much more probabilities. This is also confirmed by the ratio of the two survival functions, with the ratio approaching infinity. Using an exponential distribution to model a Pareto random phenomenon would be a severe modeling error even though the exponential distribution may be a good model for describing the loss up to the 75th percentile (in the above comparison). It is the large right tail that is problematic (and catastrophic)!

Since the Pareto survival function and the exponential survival function have closed forms, We can also look at their ratio.

$\displaystyle \frac{\text{pareto survival}}{\text{exponential survival}}=\frac{\displaystyle \frac{\theta^\alpha}{(x+\theta)^\alpha}}{e^{-\lambda x}}=\frac{\theta^\alpha e^{\lambda x}}{(x+\theta)^\alpha} \longrightarrow \infty \ \text{ as } x \longrightarrow \infty$

In the above ratio, the numerator has an exponential function with a positive quantity in the exponent, while the denominator has a polynomial in $x$. This ratio goes to infinity as $x \rightarrow \infty$.

In general, whenever the ratio of two survival functions diverges to infinity, it is an indication that the distribution in the numerator of the ratio has a heavier tail. When the ratio goes to infinity, the survival function in the numerator is said to decay slowly to zero as compared to the denominator.

It is important to examine the tail behavior of a distribution when considering it as a candidate for a model. The four criteria discussed here provide a crucial way to classify parametric models according to the tail weight.

severity models
math

Daniel Ma
mathematics

$\copyright$ 2017 – Dan Ma

# The Chi-Squared Distribution, Part 2

This is the part 2 of a 3-part series on the chi-squared distribution. In this post, we discuss several theorems, all centered around the chi-squared distribution, that play important roles in inferential statistics for the population mean and population variance of normal populations. These theorems are the basis for the test statistics used in the inferential procedures.

We first discuss the setting for the inference procedures. Then discuss the pivotal theorem (Theorem 5). We then proceed to discuss that theorems that produce the test statistics for $\mu$, the population mean of a normal population and for $\mu_1-\mu_2$, a difference of two population means from two normal populations. The discussion then shifts to the inference procedures on population variance.

_______________________________________________________________________________________________

The Settings

To facilitate the discussion, we use the notation $\mathcal{N}(\mu,\sigma^2)$ to denote the normal distribution with mean $\mu$ and variance $\sigma^2$. Whenever the random variable $X$ follows such as distribution, we use the notation $X \sim \mathcal{N}(\mu,\sigma^2)$.

The setting for making inference on one population is that we have a random sample $Y_1,Y_2,\cdots,Y_n$, drawn from a normal population $\mathcal{N}(\mu,\sigma^2)$. The sample mean $\overline{Y}$ and the sample variance $S^2$ are unbiased estimators of $\mu$ and $\sigma^2$, respectively, given by:

$\displaystyle \overline{Y}=\frac{Y_1+\cdots+Y_n}{n}=\frac{1}{n} \sum \limits_{j=1}^n Y_j \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

$\displaystyle S^2=\frac{1}{n-1} \sum \limits_{j=1}^n (Y_j-\overline{Y})^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The goal is to use the information obtained from the sample, namely $\overline{Y}$ and $S^2$, to estimate or make decisions about the unknown population parameters $\mu$ and $\sigma^2$.

Because the sample is drawn from a normal population, the sample $\overline{Y}$ has a normal distribution, more specifically $\overline{Y} \sim \mathcal{N}(\mu,\frac{\sigma^2}{n})$, which has two unknown parameters. To perform inferential procedures on the population mean $\mu$, it is preferable to have a test statistic that depends on $\mu$ only. To this end, a t-statistic is used (see Theorem 7), which has the t-distribution with $n-1$ degrees of freedom (one less than the sample size). Because the parameter $\sigma^2$ is replaced by the sample variance $S^2$, the t-statistic has only $\mu$ as the unknown parameter.

On the other hand, to perform inferential procedures on the population variance $\sigma^2$, we use a statistic that has a chi-squared distribution and that has only one unknown parameter $\sigma^2$ (see Theorem 5).

Now, the setting for performing inference on two normal populations. Let $X_1,X_2,\cdots,X_n$ be a random sample drawn from the distribution $\mathcal{N}(\mu_X,\sigma_X^2)$. Let $Y_1,Y_2,\cdots,Y_m$ be a random sample drawn from the distribution $\mathcal{N}(\mu_Y,\sigma_Y^2)$. Because the two samples are independent, the difference of the sample means $\overline{X}-\overline{Y}$ has a normal distribution. Specifically, $\overline{X}-\overline{Y} \sim \mathcal{N}(\mu_X-\mu_Y,\frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{n})$. Theorem 8 gives a t-statistic that is in terms of the difference $\mu_X-\mu_Y$ such that the two unknown population variances are replaced by the pooled sample variance. This is done with the simplifying assumption that the two population variances are identical.

On the other hand, the inference on the population variances $\sigma_X^2$ and $\sigma_Y^2$, a statistic that has the F distribution can be used (See Theorem 10). One caveat is that this test statistic is sensitive to non-normality.

_______________________________________________________________________________________________

Connection between Normal Distribution and Chi-squared Distribution

There is an intimate relation between the sample items from a normal distribution and the chi-squared distribution. This is discussed in Part 1. Let’s recall this connection. If we normalize one sample item $Y_j$ and then square it, we obtain a chi-squared random variable with df = 1. Likewise, if we normalize each sample item and then square it, the sum of the squares will be a chi-squared random variable with df = $n$. The following results are discussed in Part 1 and are restated here for clarity.

Theorem 2
Suppose that the random variable $X$ follows a standard normal distribution, i.e. the normal distribution with mean 0 and standard deviation 1. Then $Y=X^2$ follows a chi-squared distribution with 1 degree of freedom.

Corollary 3
Suppose that the random variable $X$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$. Then $Y=[(X-\mu) / \sigma]^2$ follows a chi-squared distribution with 1 degree of freedom.

Corollary 4
Suppose that $X_1,X_2,\cdots,X_n$ is a random sample drawn from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Then the following random variable follows a chi-squared distribution with $n$ degrees of freedom.

$\displaystyle \sum \limits_{j=1}^n \biggl( \frac{X_j-\mu}{\sigma} \biggr)^2=\biggl( \frac{X_1-\mu}{\sigma} \biggr)^2+\biggl( \frac{X_2-\mu}{\sigma} \biggr)^2+\cdots+\biggl( \frac{X_n-\mu}{\sigma} \biggr)^2$

_______________________________________________________________________________________________

A Pivotal Theorem

The statistic in Corollary 4 has two unknown parameters $\mu$ and $\sigma^2$. It turns out that the statistic will become more useful if $\mu$ is replaced by the sample mean $\overline{Y}$. The cost is that one degree of freedom is lost in the chi-squared distribution. The following theorem gives the details. The result is a statistic that is a function of the sample variance $S^2$ and the population variance $\sigma^2$.

Theorem 5
Let $Y_1,Y_2,\cdots,Y_n$ be a random sample drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. Then the following conditions hold.

• The sample mean $\overline{Y}$ and the sample variance $S^2$ are independent.
• The statistic $\displaystyle \frac{(n-1) S^2}{\sigma^2}=\frac{1}{\sigma^2} \sum \limits_{j=1}^n (Y_j-\overline{Y})^2$ has a chi-squared distribution with $n-1$ degrees of freedom.

Proof of Theorem 5
We do not prove the first bullet point. For a proof, see Exercise 13.93 in [2]. For the second bullet point, note that

\displaystyle \begin{aligned}\sum \limits_{j=1}^n \biggl( \frac{Y_j-\mu}{\sigma} \biggr)^2&=\sum \limits_{j=1}^n \biggl( \frac{(Y_j-\overline{Y})+(\overline{Y}-\mu)}{\sigma} \biggr)^2 \\&=\sum \limits_{j=1}^n \biggl( \frac{Y_j-\overline{Y}}{\sigma} \biggr)^2 +\frac{n (\overline{Y}-\mu)^2}{\sigma^2} \\&=\frac{(n-1) S^2}{\sigma^2} +\biggl( \frac{\overline{Y}-\mu}{\frac{\sigma}{\sqrt{n}}} \biggr)^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)\end{aligned}

Note that in expanding $[(Y_j-\overline{Y})+(\overline{Y}-\mu)]^2$, the sum of the middle terms equals to 0. Furthermore the result $(3)$ can be restated as follows:

$\displaystyle Q=\frac{(n-1) S^2}{\sigma^2}+Z^2$

where $Q=\sum \limits_{j=1}^n \biggl( \frac{Y_j-\mu}{\sigma} \biggr)^2$ and $Z=\frac{\overline{Y}-\mu}{\frac{\sigma}{\sqrt{n}}}$. Note that $Z$ is a standard normal random variable. Thus $Z^2$ has a chi-squared distribution with df = 1 (by Theorem 2). Since $Q$ is an independent sum of squares of standardized normal variables, $Q$ has a chi-squared distribution with $n$ degrees of freedom. Furthermore, since $\overline{Y}$ and $S^2$ are independent, $Z^2$ and $S^2$ are independent. Let $H=\frac{(n-1) S^2}{\sigma^2}$. As a result, $H$ and $Z^2$ are independent. The following gives the moment generating function (MGF) of $Q$.

\displaystyle \begin{aligned}E[e^{t \ Q}]&=E[e^{t \ (H+Z^2)}] \\&=E[e^{t \ H}] \ E[e^{t \ Z^2}] \end{aligned}

Since $Q$ and $Z^2$ follow chi-squared distributions, we can plug in the chi-squared MGFs to obtain the MGF of the random variable $H$.

$\displaystyle \biggl(\frac{1}{1-2t} \biggr)^{\frac{n}{2}}=E[e^{t \ H}] \ \biggl(\frac{1}{1-2t} \biggr)^{\frac{1}{2}}$

$\displaystyle E[e^{t \ H}]=\biggl(\frac{1}{1-2t} \biggr)^{\frac{n-1}{2}}$

The MGF for $H$ is that of a chi-squared distribution with $n-1$ degrees of freedom. $\square$

Remark
It is interesting to compare the following two quantities:

$\displaystyle \sum \limits_{j=1}^n \biggl( \frac{X_j-\mu}{\sigma} \biggr)^2$

$\displaystyle \frac{(n-1) S^2}{\sigma^2}=\frac{1}{\sigma^2} \sum \limits_{j=1}^n (Y_j-\overline{Y})^2=\sum \limits_{j=1}^n \biggl( \frac{X_j-\overline{Y}}{\sigma} \biggr)^2$

The first quantity is from Corollary 4 and has a chi-squared distribution with $n$ degrees of freedom. The second quantity is from Theorem 5 and has a chi-squared distribution with $n-1$ degrees of freedom. Thus the effect of Theorem 5 is that by replacing the population mean $\mu$ with the sample mean $\overline{Y}$, one degree of freedom is lost in the chi-squared distribution.

Theorem 5 is a pivotal theorem that has wide applications. For our purposes at hand, it can be used for inference on both the mean and variance. Even though one degree of freedom is lost, the statistic $\displaystyle \frac{(n-1) S^2}{\sigma^2}$ is a function of one unknown parameter, namely the population variance $\sigma^2$. Since the sampling distribution is known (chi-squared), we can make probability statement about the statistic. Hence the statistic is useful for making inference about the population variance $\sigma^2$. As we will see below, in conjunction with other statistics, the statistic in Theorem 5 can be used for inference of two population variances as well as for inference on the mean (one sample and two samples).

_______________________________________________________________________________________________

Basis for Inference on Population Mean

Inference on the population mean of a single normal population and on the difference of the means of two independent normal populations relies on the t-statistic. Theorem 6 shows how to obtain a t-statistic using a chi-squared statistic and the standard normal statistic. Theorem 7 provides the one-sample t-statistic and Theorem 8 provides the two-sample t-statistic.

Theorem 6
Let $Z$ be the standard normal random variable. Let $U$ be a random variable that has a chi-squared distribution with $r$ degrees of freedom. Then the random variable

$\displaystyle T=\frac{Z}{\sqrt{\frac{U}{r}}}$

has a t-distribution with $r$ degrees of freedom and its probability density function (PDF) is

$\displaystyle g(t)=\frac{\Gamma(\frac{r+1}{2})}{\sqrt{\pi r} \Gamma(\frac{r}{2}) \biggl(1+\frac{t^2}{r} \biggr)^{\frac{r+1}{2}}} \ \ \ \ \ \ \ \ \ -\infty

Remark
The probability density function given here is not important for the purpose at hand. For the proof of Theorem 6, see [2]. The following two theorems give two applications of Theorem 6.

Theorem 7
Let $Y_1,Y_2,\cdots,Y_n$ be a random sample drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. Let $S^2$ be the sample variance defined in $(2)$. Then the random variable

$\displaystyle T=\frac{\overline{Y}-\mu}{\frac{S}{\sqrt{n}}}$

has a t-distribution with $n-1$ degrees of freedom.

Proof of Theorem 7
Consider the following statistics.

$\displaystyle Z=\frac{\overline{Y}-\mu}{\frac{\sqrt{\sigma}}{n}}$

$\displaystyle U=\frac{(n-1) \ S^2}{\sigma^2}$

Note that $Z$ has the standard normal distribution. By Theorem 5, the quantity $U$ has a chi-square distribution with df = $n-1$. By Theorem 6, the following quantity has a t-distribution with df = $n-1$.

$\displaystyle T=\frac{Z}{\sqrt{\frac{U}{n-1}}}=\frac{\overline{Y}-\mu}{\frac{S}{\sqrt{n}}}$

The above result is obtained after performing algebraic simplification. $\square$

Theorem 8
Let $X_1,X_2,\cdots,X_n$ be a random sample drawn from a normal distribution with mean $\mu_X$ and variance $\sigma_X^2$. Let $Y_1,Y_2,\cdots,Y_m$ be a random sample drawn from a normal distribution with mean $\mu_Y$ and variance $\sigma_Y^2$. Suppose that $\sigma_X^2=\sigma_Y^2=\sigma^2$. Then the following statistic:

$\displaystyle T=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{S_p \ \sqrt{\frac{1}{n}+\frac{1}{m}}}$

has a t-distribution with df = $n+m-2$ where $\displaystyle S_p^2=\frac{(n-1) \ S_X^2+ (m-1) \ S_Y^2}{n+m-2}$.

Note that $S_p^2$ is the pooled variance of the two sample variances $S_X^2$ and $S_Y^2$.

Proof of Theorem 8
First, the sample mean $\overline{X}$ has a normal distributions with mean and variance $\mu_X$ and $\frac{\sigma_X^2}{n}$, respectively. The sample mean $\overline{Y}$ has a normal distribution with mean and variance $\mu_Y$ and $\frac{\sigma_Y^2}{n}$, respectively. Since the two samples are independent, $\overline{X}$ and $\overline{Y}$ are independent. Thus the sample difference $\overline{X}-\overline{Y}$ has a normal distribution with mean $\mu_X-\mu_Y$ and variance $\frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{n}$. The following is a standardized normal random variance:

$\displaystyle Z=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{m}}}$

On the other hand, by Theorem 5 the following quantities have chi-squared distributions with degrees of freedom $n-1$ and $m-1$, respectively.

$\displaystyle \frac{(n-1) S_X^2}{\sigma_X^2}=\frac{\sum \limits_{j=1}^n (X_j-\overline{X})^2}{\sigma_X^2}$

$\displaystyle \frac{(n-1) S_Y^2}{\sigma_Y^2}=\frac{\sum \limits_{j=1}^m (Y_j-\overline{Y})^2}{\sigma_Y^2}$

Because the two samples are independent, the two chi-squared statistics are independent. Then the following is a chi-squared statistic with $n+m-2$ degrees of freedom.

$\displaystyle U=\frac{(n-1) S_X^2}{\sigma_X^2}+\frac{(n-1) S_Y^2}{\sigma_Y^2}$

By Theorem 6, the following ratio

$\displaystyle T=\frac{Z}{\sqrt{\frac{U}{n+m-2}}}$

has a t-distribution with $n+m-2$ degrees of freedom. Here’s where the simplifying assumption of $\sigma_X^2=\sigma_Y^2=\sigma^2$ is used. Plugging in this assumption gives the following:

$\displaystyle T=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{\sqrt{\frac{(n-1) S_X^2+(m-1) S_Y^2}{n+m-2} \ (\frac{1}{n}+\frac{1}{m})}}=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{S_p \ \sqrt{\frac{1}{n}+\frac{1}{m}}}$

where $S_p^2$ is the pooled sample variance of the two samples as indicated above. $\square$

_______________________________________________________________________________________________

Basis for Inference on Population Variance

As indicated above, the statistic $\displaystyle \frac{(n-1) S^2}{\sigma^2}$ given in Theorem 5 can be used for inference on the variance of a normal population. The following theorem gives the basis for the statistic used for comparing the variances of two normal populations.

Theorem 9
Suppose that the random variables $U$ and $V$ are independent chi-squared random variables with $r_1$ and $r_2$ degrees of freedom, respectively. Then the statistic

$\displaystyle F=\frac{U / r_1}{V / r_2}$

has an F-distribution with $r_1$ and $r_2$ degrees of freedom.

Remark
The F-distribution depends on two parameters $r_1$ and $r_2$. The order they are given is important. We regard the first parameter as the degrees of freedom of the chi-squared distribution in the numerator and the second parameter as the degrees of freedom of the chi-squared distribution in the denominator.

It is not important to know the probability density functions for both the t-distribution and the F-distribution (in both Theorem 6 and Theorem 9). When doing inference procedures with these distributions, either tables or software will be used.

Given two independent normal random samples $X_1,X_2,\cdots,X_n$ and $Y_1,Y_2,\cdots,Y_n$ (as discussed in the above section on the settings of inference), the sample variance $S_X^2$ is an unbiased estimator of the population variance $\sigma_X^2$ of the first population, and the sample variance $S_Y^2$ is an unbiased estimator of the population variance $\sigma_Y^2$ of the second population. It seems to make sense that the ratio $\displaystyle \frac{S_X^2}{S_Y^2}$ can be used to make inference about the relative magnitude of $\sigma_X^2$ and $\sigma_Y^2$. The following theorem indicates that this is a valid approach.

Theorem 10
Let $X_1,X_2,\cdots,X_n$ be a random sample drawn from a normal distribution with mean $\mu_X$ and variance $\sigma_X^2$. Let $Y_1,Y_2,\cdots,Y_m$ be a random sample drawn from a normal distribution with mean $\mu_Y$ and variance $\sigma_Y^2$. Then the statistic

$\displaystyle \frac{S_X^2 \ / \ \sigma_X^2}{S_Y^2 \ / \ \sigma_Y^2}=\frac{S_X^2}{S_Y^2} \times \frac{\sigma_Y^2}{\sigma_X^2}$

has the F-distribution with degrees of freedom $n-1$ and $m-1$.

Proof of Theorem 10
By Theorem 5, $\displaystyle \frac{(n-1) S_X^2}{\sigma_X^2}$ has a chi-squared distribution with $n-1$ degrees of freedom and $\displaystyle \frac{(m-1) S_Y^2}{\sigma_Y^2}$ has a chi-squared distribution with $m-1$ degrees of freedom. By Theorem 9, the following statistic

$\displaystyle \frac{[(n-1) S_X^2 \ / \ \sigma_X^2] \ / \ (n-1)}{[(m-1) S_Y^2 \ / \ \sigma_Y^2] \ / \ (m-1)}$

has the F-distribution with $n-1$ and $m-1$ degrees of freedom. The statistic is further simplified to become the statistic as stated in the theorem. $\square$

_______________________________________________________________________________________________

Concluding Remarks

Theorem 7 and Theorem 8 produce the one-sample t-statistic and two-sample t-statistic, respectively. They are procedures for inference about one population mean and the difference of two population means, respectively. They can be used for estimation (e.g. construction of confidence intervals) or decision making (e.g. hypothesis testing). On the other hand, Theorem 5 produces a chi-squared statistic for inference about one population variance. Theorem 10 produces an F-statistic that can be used for inference about two population variances. Since the F-statistic is a ratio of the two population variances, it can be used for inference about the relative magnitude of the variances.

The purpose in this post is to highlight the important roles of the chi-squared distribution. We now discuss briefly the quality of the derived statistical procedures. The procedures discussed here (t or F) are exactly correct if the populations from which the samples are drawn are normal. Real life data usually do not exactly follow normal distributions. Thus the usefulness of these statistics in practice depends on how strongly they are affected by non-normality. In other words, if there is a significant deviation from the assumption of normal distribution, are these procedures still reliable?

A statistical inference procedure is called robust if the calculated results drawn from the procedure are insensitive to deviations of assumptions. For a non-robust procedure, the result would be distorted if there is deviation from assumptions. For example, the t procedures are not robust against outliers. The presence of outliers in the data can distort the results since the t procedures are based on the sample mean $\overline{x}$ and sample variance $S^2$, which are not resistant to outliers.

On the other hand, the t procedures for inference about means are quite robust against slight deviation of normal population assumptions. The F procedures for inference about variances are not so robust. So it must be used with care. Even if there is a slight deviation from normal assumptions, the results from the F procedures may not be reliable. For a more detailed but accessible discussion on robustness, see [1].

When the sample sizes are large, the sample mean $\overline{x}$ is close to a normal distribution (this result is called the central limit theorem). So the discussion about deviation of normality assumption is no longer important. When the sample sizes are large, simply use the Z statistic for inference about the means. On the other hand, when the sample sizes are large, the sample variance $S^2$ will be an accurate estimate of the population variance $\sigma^2$ regardless of the assumption of the population distribution. This fact is related to the law of large numbers. Thus the statistical procedures described here are for small sample sizes and for assuming normal populations.

_______________________________________________________________________________________________

Reference

1. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 7th ed., W. H. Freeman and Company, New York, 2012
2. Wackerly D. D., Mendenhall III W., Scheaffer R. L.,Mathematical Statistics with Applications, Thomson Learning, Inc, California, 2008

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$

# The Weibull distribution

Mathematically, the Weibull distribution has a simple definition. It is mathematically tractable. It is also a versatile model. The Weibull distribution is widely used in life data analysis, particularly in reliability engineering. In addition to analysis of fatigue data, the Weibull distribution can also be applied to other engineering problems, e.g. for modeling the so called weakest link model.This post gives an introduction to the Weibull distribution.

_______________________________________________________________________________________________

Defining the Weibull Distribution

A random variable $Y$ is said to follow a Weibull distribution if $Y$ has the following density function

$\displaystyle f(y)=\frac{\tau}{\lambda} \ \biggl( \frac{y}{\lambda} \biggr)^{\tau-1} \text{exp} \biggl[- \biggl(\frac{y}{\lambda}\biggr)^\tau \ \biggr] \ \ \ ;y>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

where $\tau>0$ and $\lambda>0$ are some fixed constants. The notation $\text{exp}[x]$ refers to the exponential function $e^{x}$. As defined here, the Weibull distribution is a two-parameter distribution with $\tau$ being the shape parameter and $\lambda$ being the scale parameter. The following is the cumulative distribution function (CDF).

$\displaystyle F(y)=1- \text{exp} \biggl[- \biggl(\frac{y}{\lambda}\biggr)^\tau \ \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

_______________________________________________________________________________________________

Connection with the Exponential Distribution

The Weibull distribution can also arise naturally from the random sampling of an exponential random variable. A better way to view Weibull is through the lens of exponential. Taking an observation from an exponential distribution and raising it to a positive power will result in a Weibull observation. Specifically, the random variable $\displaystyle Y=X^{\frac{1}{\tau}}$ has the same CDF as in $(2)$ if $X$ is an exponential random variable with mean $\lambda^\tau$. To see this, consider the following:

\displaystyle \begin{aligned} P[Y \le y]&=P[X^{\frac{1}{\tau}} \le y] \\&=P[X \le y^{\tau}] \\&=1-e^{-\frac{y^{\tau}}{\lambda^\tau}} \\&=1-e^{-(\frac{y}{\lambda})^\tau} \end{aligned}

_______________________________________________________________________________________________

Basic Properties

The idea of the Weibull distribution as a power of an exponential distribution simplifies certain calculation on the Weibull distribution. For example, a raw moment of the Weibull distribution is simply another raw moment of the exponential distribution. For an exponential random variable $X$ with mean $\theta$, the raw moments $E[X^k]$ are (details can be found here):

$\displaystyle E[X^k]=\left\{ \begin{array}{ll} \displaystyle \Gamma(1+k) \ \theta^k &\ k>-1 \\ \text{ } & \text{ } \\ \displaystyle k! \ \theta^k &\ k \text{ is a positive integer} \end{array} \right.$

where $\Gamma(\cdot)$ is the gamma function. To see how the gamma function can be evaluated in Microsoft Excel, see the last section of this post.

For the Weibull random variable $Y$ with parameters $\tau$ and $\lambda$, i.e. $Y=X^{1 / \tau}$ where $X$ is the exponential random variable with mean $\lambda^\tau$, the following shows the mean and higher moments.

$\displaystyle E[Y]=E[X^{\frac{1}{\tau}}]=\Gamma \biggl(1+\frac{1}{\tau} \biggr) \ \lambda \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$

$\displaystyle E[Y^k]=E[X^{\frac{k}{\tau}}]=\Gamma \biggl(1+\frac{k}{\tau} \biggr) \ \lambda^k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

With the moments established, several other distributional quantities that are based on moments can also be established. The following shows the variance, skewness and kurtosis.

$\displaystyle Var[Y]=E[Y^2]-E[Y]^2=\biggl[ \Gamma \biggl(1+\frac{2}{\tau} \biggr)-\Gamma \biggl(1+\frac{1}{\tau} \biggr)^2 \biggr] \ \lambda^2 \ \ \ \ \ \ \ \ \ \ \ \ \ (5)$

\displaystyle \begin{aligned} \gamma_1&=\frac{E[(Y-\mu)^3}{\sigma^3} \\&=\frac{E[Y^3]-3 \ \mu \ E[Y^2]+2 \ \mu^3}{\sigma^3} \\&\displaystyle =\frac{\Gamma \biggl(1+\frac{3}{\tau} \biggr) \lambda^3-3 \ \mu \ \Gamma \biggl(1+\frac{2}{\tau} \biggr) \lambda^2+2 \ \mu^3}{(\sigma^2)^{\frac{3}{2}}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6) \end{aligned}

\displaystyle \begin{aligned} \gamma_2&=\frac{E[(Y-\mu)^4}{\sigma^4} \\&=\frac{E[Y^4]-4 \ \mu \ E[Y^3]+6 \ \mu^2 \ E[Y^2]-3 \ \mu^4}{\sigma^4} \\&\displaystyle =\frac{\Gamma \biggl(1+\frac{4}{\tau} \biggr) \lambda^4-4 \ \mu \ \Gamma \biggl(1+\frac{3}{\tau} \biggr) \lambda^3+6 \ \mu^2 \ \Gamma \biggl(1+\frac{2}{\tau} \biggr) \lambda^2-3 \ \mu^4}{\sigma^4} \ \ \ \ \ (7) \end{aligned}

The notation $\gamma_1$ here is the skewness. The notation $\gamma_2$ is kurtosis. The excess kurtosis is $\gamma_2-3$. In some sources, the notation $\gamma_2$ is to denote excess kurtosis. Of course $\mu$ and $\sigma^2$ are the mean and variance, respectively.

Another calculation that is easily accessible for the Weibull distribution is that of the percentiles. It is easy to solve for $y$ in the CDF in $(2)$. For example, to find the median, set the CDF equals to 0.5 and solves for $y$, producing the following.

$\displaystyle \text{median}=\lambda \ \biggl(\text{ln}(2) \biggr)^{\frac{1}{\tau}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)$

Another basic and important property to examine is the failure rate. The failure rate of a distribution is the ratio of the density function to its survival function. The following is the failure of the Weibull distribution.

$\displaystyle \mu(t)=\frac{f(t)}{1-F(t)}=\frac{\tau}{\lambda} \ \biggl( \frac{t}{\lambda} \biggr)^{\tau-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)$

See here for a discussion of the failure rate in conjunction with the exponential distribution. Suppose that the distribution in question is a lifetime distribution (time until termination or death). Then the failure rate $\mu(t)$ can be interpreted as the rate of failure at the next instant given that the life has survived to time $t$.

_______________________________________________________________________________________________

When the Parameters Vary

The discussion in the previous section might give the impression that all Weibull distributions (when the parameters vary) behave in the same way. We now look at examples showing that as $\tau$ (shape parameter) and/or $\lambda$ (scale parameter) vary, the distribution will exhibit markedly different behavior. Note that when $\tau=1$, the Weibull distribution is reduced to the exponential distribution.

Example 1
The following diagram shows the PDFs of the Weibull distribution with $\tau=0.5$, $\tau=1$ and $\tau=2$ where $\lambda=1$ in all three cases.

Figure 1

Figure 1 shows the effect of the shape parameter taking on different values while keeping the scale parameter fixed. The effect is very pronounced on the skewness. All three density curves are right skewed. The PDF with $\tau=0.5$ (the blue curve) has a very strong right skew. The PDF with $\tau=1$ (the red curve) is exponential and has, by comparison, a much smaller skewness. The PDF with $\tau=2$ looks almost symmetric, though there is a clear and small right skew. This observation is borne out by the calculation. The skewness coefficients are $\gamma_1=6.62$ (blue curve), $\gamma_1=2$ (red curve) and $\gamma_1=0.6311$ (green curve).

Another clear effect of the shape parameter is the thickness of the tail (in this case the right tail). Figure 1 suggests that the PDF with $\tau=0.5$ (the blue curve) is higher than the other two density curves on the interval $x>2$. As a result, the blue curve has more probability mass in the right tail. Thus the blue curve has a thicker tail comparing to the other two PDFs. For a numerical confirmation, the following table compares the probabilities in the right tail.

$\displaystyle \begin{array}{lllllll} \text{ } &\text{ } & \tau=0.5 & \text{ } & \tau=1 & \text{ } & \tau=2 \\ \text{ } & \text{ } &\text{Blue Curve} & \text{ } & \text{Red Curve} & \text{ }& \text{Green Curve} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ P[X>2] &\text{ } & 0.2431 & \text{ } & 0.1353 & \text{ } & 0.0183\\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ P[X>3] &\text{ } & 0.1769 & \text{ } & 0.0498 & \text{ } & 0.0003355 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ P[X>4] &\text{ } & 0.1353 & \text{ } & 0.0183 & \text{ } & \displaystyle 1.125 \times 10^{-7} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ P[X>5] &\text{ } & 0.1069 & \text{ } & 0.00674 & \text{ } & 1.389 \times 10^{-11} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ P[X>10] &\text{ } & 0.0423 & \text{ } & 0.0000454 & \text{ } & 3.72 \times 10^{-44} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ P[X>25] &\text{ } & 0.00673 & \text{ } & 1.39 \times 10^{-11} & \text{ } & 3.68 \times 10^{-272} \end{array}$

It is clear from the above table that the Weibull distribution with the blue curve assigns more probabilities to the higher values. The mean of the distribution for blue curve is 2. The right tail $x>10$ (over 5 times the mean) contains 4.23% of the probability mass (a small probability for sure but not negligible). The right tail $x>25$ (over 12.5 times the mean) still has a small probability of 0.00673 that cannot be totally ignored. On the other hand, the Weibull distribution for the green curve has a light tail. The mean of the distribution for the green curve is about 0.89. At $x>4$ (over 4.5 times of its mean), the tail probability is already negligible at $1.125 \times 10^{-7}$. At $x>10$ (over 11 times of its mean), the tail probability is $3.72 \times 10^{-44}$, practically zero.

Let’s compare the density curves in Figure 1 with their failure rates. The following figure shows the failure rates for these three Weibull distributions.

Figure 2

According to the definition in $(9)$, the following shows the failure rate function for the three Weibull distributions.

$\mu(y)=\frac{0.5}{y^{0.5}} \ \ \ \ \ y>0 \ \ \ \ \ \ \ \tau=0.5 \ \ \text{(blue curve)}$

$\mu(y)=1 \ \ \ \ \ \ \ \ y>0 \ \ \ \ \ \ \ \tau=1 \ \ \text{(red curve)}$

$\mu(y)=2y \ \ \ \ \ \ y>0 \ \ \ \ \ \ \ \tau=2 \ \ \text{(green curve)}$

The blue curve in Figure 1 ($\tau=0.5$) has a decreasing failure rate as shown in Figure 2. The failure rate function is constant for the case of $\tau=1$ (the exponential case). It is an increasing function for the case of $\tau=2$. This comparison shows that the Weibull distribution is particularly useful for engineers and researchers who study the reliability of machines and devices. If the engineers believe that the failure rate is constant, then an exponential model is appropriate. If they believe that the failure rate increases with time or age, then a Weibull distribution with shape parameter $\tau>1$ is more appropriate. If the engineers believe that the failure rate decreases with time, then a Weibull distribution with shape parameter $\tau<1$ is more appropriate.

Example 2
We now compare Weibull distributions with various values for $\lambda$ (scale parameter) while keeping the shape parameter $\tau$ fixed. The following shows the density curves for the Weibull distributions with $\lambda=1, 2, 3$ while keeping $\tau=2$.

Figure 3

The effect of the scale parameter $\lambda$ is to compress or stretch out the standard Weibull density curve, i.e. the Weibull distribution with $\lambda=1$. For example, the density function for $\lambda=2$ is obtained by stretching out the density curve for the one with $\lambda=1$. The same overall shape is maintained while the density curve is being stretched or compressed. According to $(3)$, the mean of the transformed distribution is increased (stretching) or decreased (compressing). For example, as $\lambda$ is increased from 1 to 2, the mean has a two-fold increase. As the density curve is stretched, the resulting distribution is more spread out and the peak of the density curve decreases. The overall effect of changing the scale parameter is essentially a change in the scale in the x-axis.

The next example is a computational exercise.

Example 3
The time until failure (in months) of a semiconductor device has a Weibull distribution with shape parameter $\tau=2.2$ and scale parameter $\lambda=400$.

• Give the density function and the survival function.
• Determine the probability that the device will last at least 500 hours.
• Determine the probability that the device will last at least 600 hours given that it has been running for over 500 hours.
• Find the mean and standard deviation of the time until failure.
• Determine the failure rate function of the Weibull time until failure.

To obtain the density function, the survival function and the failure rate, follow the relationships in $(1)$, $(2)$ and $(9)$.

$\displaystyle f(y)=\frac{2.2}{400} \ \biggl( \frac{y}{400} \biggr)^{1.2} \text{exp} \biggl[- \biggl(\frac{y}{400}\biggr)^{2.2} \ \biggr]$

$\displaystyle S(y)=\text{exp} \biggl[- \biggl(\frac{y}{400}\biggr)^{2.2} \biggr]$

$\displaystyle \mu(y)=\frac{2.2}{400} \ \biggl( \frac{y}{400} \biggr)^{1.2}$

Note that the Weibull failure rate is the ratio of the density function to the survival function. In this case, the failure is an increasing function of $y$. Since $Y$ is a time scale, then this is a model for machines that wear out over time (see the next section).

The probability that the device will last over 500 hours is $e^{-(\frac{500}{400})^{2.2}}=0.1952$. The unconditional probability that the device will last over 600 hours is $e^{-(\frac{600}{400})^{2.2}}=0.0872$ The conditional probability that the device will last more than 600 hours given that it has lasted more 500 hours is the ratio $\frac{S(600)}{S(500)}=0.4465$.

To find the mean and variance, we need to evaluate the gamma function. Using Excel, we obtain the following two values of the gamma function (as shown here):

$\Gamma(1+\frac{1}{2.2})=0.88562476$

$\Gamma(1+\frac{2}{2.2})=0.964912489$

The mean and standard deviation of the time until failure are:

$E[Y]=\Gamma(1+\frac{1}{2.2}) \times 400=354.2499042$

$E[Y^2]=\Gamma(1+\frac{2}{2.2}) \times 400^2=154385.9982$

$Var[Y]=E[Y^2]-E[Y]^2=28893.00363$

$\sigma_Y=\sqrt{Var[Y]}=169.9794212$

_______________________________________________________________________________________________

The Weibull Failure Rates

Looking at the failure rate function indicated in $(9)$ and looking at Figure 2, it is clear that when the shape parameter $0<\tau<1$, the failure rate decreases with time (if the distribution is a model for the time until death of a life). When the shape parameter $\tau=1$, the failure rate is constant. When the shape parameter $\tau>1$, the failure rate increases with time. As a result, the Weibull family of distribution has a great deal of flexibility for modeling the failures of objects (machines, devices).

When the shape parameter $0<\tau<1$, the failure rate decreases with time. Such a Weibull distribution is a model for infant mortality, or early-life failures. When the shape parameter $\tau=1$ (or near 1), the failure rate is constant or near constant. The resulting Weibull distribution (an exponential model) is a model for random failures (failures that are independent of age). When the shape parameter $\tau>1$, the failure rate increases with time. The resulting Weibull distribution is a model for wear-out failures.

In some applications, it may be necessary to model each phase of a lifetime separately, e.g. the early phase with a Weibull distribution with $0<\tau<1$, the useful phase with a Weibull distribution with $\tau$ close to 1 and the wear-out phase with a Weibull distribution with $\tau>1$. The resulting failure rate curve resembles a bathtub curve. The following is an idealized bathtub curve.

Figure 4

The blue part of the bathtub curve is the early phase of the lifetime, which is characterized by decreasing failure rate. This is the early-life period in which the defective products die off and are taken out of the study. The next period is the useful-life period, the red part of the curve, in which the failures are random that are independent of age. In this phase, the failure rate is constant or near constant. The green part of the bathtub curve is characterized by increasing failure rates, which is the wear-out phase of the lifetime being studied.

_______________________________________________________________________________________________

Another attractiveness of the Weibull model is that it can be used to model the so called the weakest link model. Consider a machine or device that has multiple components. Suppose that the device dies or fails when any one of the components fails. The lifetime of such a machine or device is the time to the first failure. Such a lifetime model is called the weakest link model. It can be shown that under these conditions a Weibull distribution is a good model for the distribution of the lifetime of such a machine or device.

If the time until failure of the individual components are indpendent and identically distributed Weibull random variables, then it follows that the minimum of of the Weibull random variables is also a Weibull random variable. To see this, let $X_1,X_2,\cdots,X_n$ be independent and identically distributed Weibull random varaibles. Let $\tau$ and $\lambda$ be the parameters for the common Weibull distribution. Let $Y=\text{min}(X_1,X_2,\cdots,X_n)$. The following gives the survival function of $Y$.

\displaystyle \begin{aligned} P[Y >y]&=P[\text{all } X_i >y] \\&=\biggl(e^{-(\frac{y}{\lambda})^\tau} \biggr)^n \\&=e^{-n (\frac{y}{\lambda})^\tau} \\&=e^{-(\frac{y}{\lambda_1})^\tau} \end{aligned}

where $\lambda_1=\frac{\lambda}{t}$ and $t=n^{\frac{1}{\tau}}$. This shows that $Y$ has a Weibull distribution with shape parameter $\tau$ (as before) and scale parameter $\lambda_1$. Under the condition that the times to failure for the multiple components are indentically Weibull distributed, the lifetime of the device is also a Weibull model.

_______________________________________________________________________________________________

The Moment Generating Function

The post is concluded with a comment on the moment generating function for the Weibull distribution. Note that relationship $(4)$ indicates that all positive moments for the Weibull distribution exist. A natural question is on whether the moment generating function (MGF) exists. It turns out that MGF does not always exist for the Weibull distribution. The MGF exists for the Weibull distribution whenever the shape parameter $\tau \ge 1$. However the MGF cannot be expressed in terms of any familiar functions. Instead, the Weibull MGF can be expressed as a power series.

$\displaystyle M(t)=\sum \limits_{n=0}^\infty \ \frac{(\lambda t)^n}{n!} \ \Gamma \biggl(1+\frac{n}{\tau} \biggr) \ \ \ \ \ \tau \ge 1$

To see this, start with the power series for $e^x$.

$\displaystyle e^X=\sum \limits_{n=0}^\infty \ \frac{X^n}{n!}$

$\displaystyle e^{t X}=\sum \limits_{n=0}^\infty \ \frac{(t X)^n}{n!}$

$\displaystyle M(t)=E[e^{t X}]=\sum \limits_{n=0}^\infty \ \frac{t^n}{n!} \ E[X^n]$

$\displaystyle M(t)=\sum \limits_{n=0}^\infty \ \frac{(\lambda t)^n}{n!} \ \Gamma \biggl(1+\frac{n}{\tau} \biggr)$

Since the positive moments exist for the Weibull distribution, the higher moments from $(4)$ are plugged into the power series. It can be shown that the last series converges when $\tau \ge 1$.

When $0<\tau<1$, the power series does not converge. For a specific example, let $\tau=0.5$. Then the term in the series is simplified to $2 \lambda t (2n+1)$, when goes to infinity when $n \rightarrow \infty$. Thus the Weilbull distribution with the shape parameter $\tau=0.5$ is an example of a distribution where all the positive moments exist but the MGF does not exist.

_______________________________________________________________________________________________

Further Information

Further information can be found here and here.

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$

# Introducing the beta function

The gamma distribution is mathematically defined from the gamma function. This post gives a brief introduction to the beta function. The goal is to establish one property that is the basis for defining the beta distribution.

_______________________________________________________________________________________________

The Beta Function

For any positive constants $a$ and $b$, the beta function is defined to be the following integral:

$\displaystyle B(a,b)=\int_0^1 t^{a-1} \ (1-t)^{b-1} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)$

The beta function can be evaluated directly if the parameters $a$ and $b$ are not too large. For example, $B(3,2)$ is the integral $\displaystyle \int_0^1 t^2 (1-t) \ dt$, which is $1/12$. Evaluating $(0)$ in a case by case basis does not shed light on the beta function. Direct calculation can also be cumbersome (e.g. for large parameters that are integers) or challenging (e.g. for parameters $a$ and $b$ that are fractional). It turns out that the evaluation of the beta function $B(a,b)$ is based on the gamma function.

_______________________________________________________________________________________________

Connection to the Gamma Function

The remainder of the post is to establish the following value of the beta function:

$\displaystyle B(a,b)=\int_0^1 t^{a-1} \ (1-t)^{b-1} \ dt=\frac{\Gamma(a) \ \Gamma(b)}{\Gamma(a+b)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

To start the proof of $(1)$, let $X$ and $Y$ be two independent random variables such that $X$ follows a gamma distribution with shape parameter $a$ and rate parameter $\beta$ and that $Y$ follows a gamma distribution with shape parameter $b$ and rate parameter $\beta$. It does not matter what $\beta$ is, as long as it is the rate parameter for both $X$ and $Y$. Then the sum $S=X+Y$ has a gamma distribution with shape parameter $a+b$ and rate parameter $\beta$. The following is the density function for $S=X+Y$.

$\displaystyle f_S(s)=\frac{1}{\Gamma(a+b)} \ \beta^{a+b} \ s^{a+b-1} \ e^{-\beta s} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The density function of $S=X+Y$ can also be derived from the convolution formula using the density functions of $X$ and $Y$ as follows:

\displaystyle \begin{aligned} f_S(s)&=\int_0^s f_Y(s-x) \ f_X(x) \ dx \ \ \ \ \ \text{(convolution)} \\&=\int_0^s \frac{1}{\Gamma(b)} \ \beta^{b} \ (s-x)^{b-1} \ e^{-\beta (s-x)} \ \frac{1}{\Gamma(a)} \ \beta^{a} \ x^{a-1} \ e^{-\beta x} \ dx \\&=\frac{\beta^{a+b}}{\Gamma(a) \ \Gamma(b)} \ e^{-\beta s} \ \int_0^s x^{a-1} \ (s-x)^{b-1} \ dx \\&=\frac{\beta^{a+b}}{\Gamma(a) \ \Gamma(b)} \ e^{-\beta s} \ s^{a+b-1} \ \int_0^1 t^{a-1} \ (1-t)^{b-1} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

See here for more information on how to use the convolution formula. The last step in $(3)$ is obtained by a change of variable in the integral from the step immediately above it by letting $x=st$. The last step in $(3)$ must equal to $(2)$. Setting the two equal would produce the equality in $(1)$.

Note that if the function $t^{a-1} \ (1-t)^{b-1}$ is normalized by the value $B(a,b)$, it would be a density function, which is the beta distribution. The following is the density function of the beta distribution.

$\displaystyle f(x)=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ x^{a-1} \ (1-x)^{b-1}; \ \ \ \ \ \ \ \ 0

The beta distribution is further examined in the next post.

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$

# The gamma distribution from the point of view of a Poisson process

In the previous post, the gamma distribution is defined from the gamma function. This post shows that the gamma distribution can arise from a Poisson process.

_______________________________________________________________________________________________

The Poisson Process

Consider an experiment in which events that are of interest occur at random in a time interval. The goal here is to derive two families of random variables, one continuous and one discrete. Starting at time 0, record the time of the occurrence of the first event. Then record the time at which the second random event occurs and so on (these are the continuous random variables). Out of these measurements, we can derive discrete random variables by counting the number of random events in a fixed time interval.

The recording of times of the occurrences of the random events is like placing markings on a time line to denote the arrivals of the random events. We are interested in counting the number of markings in a fixed interval. We are also interested in measuring the length from the starting point to the first marking and to the second marking and so on. Because of this interpretation, the random process discussed here can also describe random events occurring along a spatial interval, i.e. intervals in terms of distance or volume or other spatial measurements.

A Poisson process is a random process described above in which several criteria are satisfied. We show that in a Poisson process, the number of occurrences of random events in a fixed time interval follows a Poisson distribution and the time until the $n$th random event follows a Gamma distribution.

A good example of a Poisson process is the well known experiment in radioactivity conducted by Rutherford and Geiger in 1910. In this experiment, $\alpha$-particles were emitted from a polonium source and the number of $\alpha$-particles were counted during an interval of 7.5 seconds (2,608 many such time intervals were observed). In these 2,608 intervals, a total of 10,097 particles were observed. Thus the mean count per period of 7.5 seconds is 10097 / 2608 = 3.87.

In the Rutherford and Geiger experiment in 1910, a random event is the observation of an $\alpha$-particle. The random events occur at an average of 3.87 per unit time interval (7.5 seconds).

One of the criteria in a Poisson process is that in a very short time interval, the chance of having more than one random event is essentially zero. So either one random event will occur or none will occur in a very short time interval. Considering the occurrence of a random event as a success, there is either a success or a failure in a very short time interval. So a very short time interval in a Poisson process can be regarded as a Bernoulli trial.

The second criterion is that the experiment remains constant over time. Specifically this means that the probability of a random event occurring in a given subinterval is proportional to the length of that subinterval and not on where the subinterval is in the original interval. Any counting process that satisfies this criterion is said to possess stationary increments. For example, in the 1910 radioactivity study, $\alpha$-particles were emitted at the rate of $\lambda=$ 3.87 per 7.5 seconds. So the probability of one $\alpha$-particle emitted from the radioactive source in a one-second interval is 3.87/7.5 = 0.516. Then the probability of observing one $\alpha$-particle in a half-second interval is 0.516/2 = 0.258. For a quarter-second interval, the probability is 0.258/2 = 0.129. So if we observe half as long, it will be half as likely to observe the occurrence of a random event. On the other hand, it does not matter when the quarter-second subinterval is, whether at the beginning or toward the end of the original interval of 7.5 seconds.

The third criterion is that non-overlapping subintervals are mutually independent in the sense that what happens in one subinterval (i.e. the occurrence or non-occurrence of a random event) will have no influence on the occurrence of a random event in another subinterval. Any counting process that satisfies this criterion is said to possess independent increments. In the Rutherford and Geiger experiment, the observation of one particle in one half-second period does not imply that a particle will necessarily be observed in the next half-second.

To summarize, the following are the three criteria of a Poisson process:

Suppose that on average $\lambda$ random events occur in a time interval of length 1.

1. The probability of having more than one random event occurring in a very short time interval is essentially zero.
2. For a very short subinterval of length $\frac{1}{n}$ where $n$ is a sufficiently large integer, the probability of a random event occurring in this subinterval is $\frac{\lambda}{n}$.
3. The numbers of random events occurring in non-overlapping time intervals are independent.

_______________________________________________________________________________________________

The Poisson Distribution

We are now ready to derive the Poisson distribution from a Poisson random process.

Consider random events generated in a Poisson process and let $Y$ be the number of random events observed in a unit time interval. Break up the unit time interval into $n$ non-overlapping subintervals of equal size where $n$ is a large integer. Each subinterval can have one or no random event. The probability of one random event in a subinterval is $\lambda/n$. The subintervals are independent. In other words, the three criteria of a Poisson process described above ensure that the $n$ subintervals are independent Bernoulli trials. As a result, the number of events occurring in these $n$ subintervals is a binomial distribution with $n$ trials and probability of success $\lambda/n$. This binomial distribution is an approximation of the random variable $Y$. The binomial distribution can get more and more granular. The resulting limit is a Poisson distribution, which coincides with the distribution for $Y$. The fact that the Poisson distribution is the limiting case of the binomial distribution is discussed here and here.

It follows that $Y$ follows the Poisson distribution with mean $\lambda$. The following is the probability function.

$\displaystyle P(Y=y)=\frac{e^{-\lambda} \ \lambda^y}{y!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots$

In the 1910 radioactivity study, the number of $\alpha$-particles observed in a 7.5-second period has a Poisson distribution with the mean of $\lambda=3.87$ particles per 7.5 seconds.

Sometimes it may be necessary to count the random events not in a unit time interval but in a smaller or larger time interval of length $t$. In a sense, the new unit time is $t$ and the new average rate of the Poisson process is then $\lambda t$. Then the idea of taking granular binomial distributions will lead to a Poisson distribution. Let $Y_t$ be the number of occurrences of the random events in a time interval of length $t$. Then $Y_t$ follows a Poisson distribution with mean $\lambda t$. The following is the probability function.

$\displaystyle P(Y_t=x)=\frac{e^{-\lambda t} \ (\lambda t)^y}{y!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots$

In the 1910 radioactivity study, the number of $\alpha$-particles observed in a 3.75-second period has a Poisson distribution with the mean of $\lambda=3.87/2=1.935$ particles per 3.75 seconds.

_______________________________________________________________________________________________

The Gamma Distribution as Derived from a Poisson Process

With the Poisson process and Poisson distribution properly set up and defined, we can now derive the gamma distribution. As before, we work with a Poisson process in which the random events arrive at an average rate of $\lambda$ per unit time. Let $W_1$ be the waiting time until the occurrence of the first random event, $W_2$ be the waiting time until the occurrence of the second random event and so on. First examine the random variable $W_1$.

Consider the probability $P(W_1>t)$. The event $W_1>t$ means that the first random event takes place after time $t$. This means that there must be no occurrence of the random event in question from time 0 to time $t$. It follows that

$P(W_1>t)=P(Y_t=0)=e^{-\lambda t}$

As a result, $P(W_1 \le t)=1-e^{-\lambda t}$, which is the cumulative distribution of $W_1$. Taking the derivative, the probability density function of $W_1$ is $f_{W_1}(t)=\lambda e^{-\lambda t}$. This is the density function of the exponential distribution with mean $\frac{1}{\lambda}$. Recall that $\lambda$ is the rate of the Poisson process, i.e. the random events arrive at the mean rate of $\lambda$ per unit time. Then the mean time between two consecutive events is $\frac{1}{\lambda}$.

Consider the probability $P(W_2>t)$. The event $W_2>t$ means that the second random event takes place after time $t$. This means that there can be at most one occurrence of the random events in question from time 0 to time $t$. It follows that

$P(W_2>t)=P(Y_t=0)+P(Y_t=1)=e^{-\lambda t}+\lambda t \ e^{-\lambda t}$

As a result, $P(W_2 \le t)=1-e^{-\lambda t}-\lambda t \ e^{-\lambda t}$, which is the cdf of the waiting time $W_2$. Taking the derivative, the probability density function of $W_2$ is $f_{W_2}(t)=\lambda^2 \ t \ e^{-\lambda t}$. This is the density function of the gamma distribution with shape parameter 2 and rate parameter $\lambda$

By the same reasoning, the waiting time until the $n^{th}$ random event, $W_n$, follows a gamma distribution with shape parameter $n$ and rate parameter $\lambda$. The survival function, cdf and the density function are:

$\displaystyle P(W_n>t)=\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}$

$\displaystyle P(W_n \le t)=1-\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}=\sum \limits_{k=n}^{\infty} \frac{e^{-\lambda t} \ (\lambda t)^k}{k!}$

$\displaystyle f_{W_n}(t)=\frac{1}{(n-1)!} \ \lambda^n \ t^{n-1} \ e^{-\lambda t}$

The survival function $P(W_n>t)$ is identical to $P(Y_t \le n-1)$. The equivalence is through the translation: the event $W_n>t$ is equivalent to the event that there can be at most $n-1$ random events occurring from time 0 to time $t$.

Example 1
Let’s have a quick example of calculation for the gamma distribution. In the study by Rutherford and Geiger in 1910, the average rate of arrivals of $\alpha$-particles is 3.87 per 7.5-second period, giving the average rate of 0.516 particles per second. On average, it takes $0.516^{-1}$ = 1.94 seconds to wait for the next particle. The probability that it takes more than 3 seconds of waiting time for the first particle to arrive is $e^{-0.516 (3)}$ = 0.213.

How long would it take to wait for the second particle? On average it would take $2 \cdot 0.516^{-1}$ = 3.88 seconds. The probability that it takes more than 5 seconds of waiting time for the second particle to arrive is

$e^{-0.516 (5)}+0.516 \cdot 5 e^{-0.516 (5)}=3.58 \cdot e^{-2.58}$ = 0.271

_______________________________________________________________________________________________

Remarks

The above discussion shows that the gamma distribution arises naturally from a Poisson process, a random experiment that satisfies three assumptions that deal with independence and uniformity in time. The gamma distribution derived from a Poisson process has two parameters $n$ and $\lambda$ where $n$ is a positive integer and is the shape parameter and $\lambda$ is the rate parameter. If the random variable $W$ follows this distribution, its pdf is:

$\displaystyle f_W(w)=\frac{1}{(n-1)!} \ \lambda^n \ w^{n-1} \ e^{-\lambda w} \ \ \ \ \ \ \ \ \ \ \ \ w>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

The above pdf can be interpreted as the density function for the waiting time until the arrival of the $n$th random event in a Poisson process with an average rate of arrivals at $\lambda$ per unit time. The density function may be derived from an actual Poisson process or it may be just describing some random quantity that has nothing to do with any Poisson process. But the Poisson process interpretation is still useful. One advantage of the Poisson interpretation is that the survival function and the cdf would have an expression in closed form.

$\displaystyle P(W>w)=\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda w} \ (\lambda w)^k}{k!} \ \ \ \ \ \ \ \ \ \ \ \ w>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

\displaystyle \begin{aligned} P(W \le w)&=1-\sum \limits_{k=0}^{n-1} \frac{e^{-\lambda w} \ (\lambda w)^k}{k!} \\&=\sum \limits_{k=n}^{\infty} \frac{e^{-\lambda w} \ (\lambda w)^k}{k!} \ \ \ \ \ \ \ \ \ \ \ \ w>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

In the Poisson process interpretation, $P(W>w)$ is the probability that the $n$th random event occurs after time $w$. This means that in the interval $(0, w)$, there are at most $n-1$ random events. Thus the gamma survival function is identical to the cdf of a Poisson distribution. Even when $W$ is simply a model of some random quantity that has nothing to do with a Poisson process, such interpretation can still be used to derive the survival function and the cdf of such a gamma distribution.

The gamma distribution described in the density function $(1)$ has a shape parameter that is a positive integer. This special case of the gamma distribution sometimes go by the name Erlang distribution and is important in queuing theory.

In general the shape parameter does not have to be integers; it can be any positive real number. For the more general gamma distribution, see the previous post.

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$

# Introducing the gamma distribution

The gamma distribution is a probability distribution that is useful in actuarial modeling. Due to its mathematical properties, there is considerable flexibility in the modeling process. For example, since it has two parameters (a scale parameter and a shape parameter), the gamma distribution is capable of representing a variety of distribution shapes and dispersion patterns. This post gives an account of how the distribution arises mathematically and discusses some of its mathematically properties. The next post discusses how the gamma distribution can arise naturally as the waiting time between two events in a Poisson process.

_______________________________________________________________________________________________

The Gamma Function

From a mathematically point of view, in defining the gamma distribution, the place to start is the gamma function. For any real number $\alpha>0$, define:

$\displaystyle \Gamma(\alpha)=\int_0^\infty t^{\alpha-1} \ e^{-t} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)$

The above improper integral converges for every positive $\alpha$. The proof that the improper integral converges and other basic facts can be found here.

When the integral in $(0)$ has “incomplete” limits, the resulting functions are called incomplete gamma functions. The following are called the upper incomplete gamma function and lower incomplete gamma function, respectively.

$\displaystyle \Gamma(\alpha, x)=\int_x^\infty t^{\alpha-1} \ e^{-t} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

$\displaystyle \gamma(\alpha, x)=\int_0^x t^{\alpha-1} \ e^{-t} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

_______________________________________________________________________________________________

The Gamma Probability Density Function

Notice that the integrand in $(0)$ is a positive value for every $t>0$. Thus the integrand $t^{\alpha-1} \ e^{-t}$ is a density function if it is normalized by $\Gamma(\alpha)$.

$\displaystyle f_T(t)=\frac{1}{\Gamma(\alpha)} \ t^{\alpha-1} \ e^{-t} \ \ \ \ \ \ \ t>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$

The function in $(3)$ is a probability density function since the integral is one when it is integrated over the interval $(0,\infty)$. For convenience, let $T$ be the random variable having this density function. The density function in $(3)$ only has one parameter, which is $\alpha$ (the shape parameter). To add the second parameter, transform the random variable $T$ by multiplying a constant. This can be done in two ways. The following are the probability density functions for the random variables $X=\theta T$ and $Y=\frac{1}{\beta} \ T$, respectively.

$\displaystyle f_X(x)=\frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ x>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4a)$

$\displaystyle f_Y(y)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ y^{\alpha-1} \ e^{-\beta y} \ \ \ \ \ \ \ \ \ \ \ \ y>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4b)$

A random variable $X$ is said to follow the gamma distribution with shape parameter $\alpha$ and scale parameter $\theta$ if $(4a)$ is its probability density function (pdf). A random variable $Y$ is said to follow the gamma distribution with shape parameter $\alpha$ and rate parameter $\beta$ if $(4b)$ is its pdf.

The parameter $\theta$ is the scale parameter since it is the case that the larger the value, the more spread out the distribution. The parameter $\beta$ is the rate parameter in the family of gamma distribution. The rate parameter is defined as the reciprocal of the scale parameter.

The following figure (Figure 1) demonstrates the role of the shape parameter $\alpha$. With the scale parameter $\theta$ kept at 2, the gamma distribution becomes less skewed as $\alpha$ increases.

Figure 1

The following figure (Figure 2) demonstrates the role of the scale parameter $\theta$. With the shape parameter $\alpha$ kept at 2, all the gamma distributions have the same skewness. However, the gamma distributions become more spread out as $\theta$ increases.

Figure 2

There are sevreral important subclasses of the gamma distribution. When the shape parameter $\alpha=1$, the gamma distribution becomes the exponential distribution with mean $\theta$ or $\frac{1}{\beta}$ depending on the parametrization. When the shape parameter $\alpha$ is any positive integer, the resulting subclass of gamma distribution is called the Erlang distribution. A Chi-square distribution is a gamma distribution with shape parameter $\alpha=\frac{k}{2}$ and scale parameter $\theta=2$ where $k$ is a positive integer (the degrees of freedom). The gamma density curves in Figure 1 are chi-square distributions. Their degrees of freedom are 2, 4, 6, 10 and 20. Chi-square distribution with 2 degrees of freedom would be an exponential distribution.

Between the two parametrizations presented here, the version with the scale parameter is the more appropriate model in the settings where a parameter is needed for describing the magnitude of the mean and the spread. The parametrization with $\alpha$ and $\beta$ is sometimes easier to work with. For example, it is more common in Bayesian analysis where the gamma distribution can be used as a conjugate prior distribution for a parameter that is a rate (e.g. the rate parameter of a Poisson distribution).

_______________________________________________________________________________________________

Some Distributional Quantities

The gamma distribution is a two-parameter family of distributions. Here’s some of the basic distributional quantities that are of interest.

_________________________________________
$X$ is a random variable with the gamma distribution with shape parameter $\alpha$ and scale parameter $\theta$.

$Y$ is a random variable with the gamma distribution with shape parameter $\alpha$ and rate parameter $\beta$.

_________________________________________
Probability Density Function (PDF)

$\displaystyle \begin{array}{ll} \displaystyle f_X(x)=\frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} & \ \ \ \ \ \ \ \ \ \ x>0 \\ \text{ } & \text{ } \\ \displaystyle f_Y(y)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ y^{\alpha-1} \ e^{-\beta y} & \ \ \ \ \ \ \ \ \ \ \displaystyle y>0 \end{array}$

_________________________________________
Cumulative Distribution Function (CDF)

\displaystyle \begin{aligned} F_X(x)&=\int_0^x \ \frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ t^{\alpha-1} \ e^{-\frac{t}{\theta}} \ dt \\&\text{ } \\&=\frac{\gamma(\alpha, \frac{x}{\theta})}{\Gamma(\alpha)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (\text{Alternative formulation}) \end{aligned}

\displaystyle \begin{aligned} F_Y(y)&=\int_0^y \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ t^{\alpha-1} \ e^{-\beta t} \ dt \\&\text{ } \\&=\frac{\gamma(\alpha, \beta y)}{\Gamma(\alpha)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (\text{Alternative formulation}) \end{aligned}

_________________________________________
Mean and Variance

$\displaystyle \begin{array}{ll} \displaystyle E(X)=\alpha \ \theta & \ \ \ \ \ \ \ \ \ \ Var(X)=\alpha \ \theta^2 \\ \text{ } & \text{ } \\ \displaystyle E(Y)=\frac{\alpha}{\beta} & \ \ \ \ \ \ \ \ \ \ \displaystyle Var(Y)=\frac{\alpha}{\beta^2} \end{array}$

_________________________________________
Higher Moments

$\displaystyle E(X^k) = \left\{ \begin{array}{ll} \displaystyle \frac{\theta^k \ \Gamma(\alpha+k)}{\Gamma(\alpha)} &\ \ \ \ \ \ k>-\alpha \\ \text{ } & \text{ } \\ \theta^k \ \alpha (\alpha+1) \cdots (\alpha+k-1) &\ \ \ \ \ \ k \text{ is a positive integer} \end{array} \right.$

$\displaystyle E(Y^k) = \left\{ \begin{array}{ll} \displaystyle \frac{\Gamma(\alpha+k)}{\beta^k \ \Gamma(\alpha)} &\ \ \ \ \ \ k>-\alpha \\ \text{ } & \text{ } \\ \displaystyle \frac{\alpha (\alpha+1) \cdots (\alpha+k-1)}{\beta^k} &\ \ \ \ \ \ k \text{ is a positive integer} \end{array} \right.$

_________________________________________
Moment Generating Function

$\displaystyle \begin{array}{ll} \displaystyle M_X(t)=\biggl(\frac{1}{1-\theta t} \biggr)^\alpha & \ \ \ \ \ \displaystyle t<\frac{1}{\theta} \\ \text{ } & \text{ } \\ \displaystyle M_Y(t)=\biggl(\frac{\beta}{\beta- t} \biggr)^\alpha & \ \ \ \ \ \displaystyle t<\beta \end{array}$

_________________________________________
Mode

$\displaystyle \begin{array}{ll} \displaystyle \theta \ (\alpha-1) & \ \ \ \ \ \alpha>1 \ \text{else } 0 \\ \text{ } & \text{ } \\ \displaystyle \frac{\alpha-1}{\beta} & \ \ \ \ \ \alpha>1 \ \text{else } 0 \end{array}$

_________________________________________
Coefficient of Variation

$\displaystyle CV = \frac{1}{\sqrt{\alpha}}$

_________________________________________
Coefficient of Skewness

$\displaystyle \gamma = \frac{2}{\sqrt{\alpha}}$

_________________________________________
Kurtosis

$\displaystyle \begin{array}{ll} \displaystyle 3+\frac{6}{\alpha} & \ \ \ \ \ \text{Kurtosis} \\ \text{ } & \text{ } \\ \displaystyle \frac{6}{\alpha} & \ \ \ \ \ \text{Excess Kurtosis} \end{array}$

There is no simple closed form for the cumulative distribution function, except for the case of $\alpha=1$ (i.e. the exponential distribution) and the case of $\alpha$ being a positive integer (see next post). As a result, the distributional quantities that required solving for $x$ in the CDF have no closed form, e.g. median and other percentiles. As stated above, the CDF can be expressed using the incomplete gamma function, which can be estimated numerically. For the distributional quantities with no closed form, either use numerical estimation or use software.

The calculation for some of the distributional quantities is quite straightforward. For example, to calculate any higher moment $E(X^k)$, simply adjust the integrand to be an appropriate gamma density function. Then the result will be what can be moved outside the integral, as shown in the following.

\displaystyle \begin{aligned}E(X^k)&=\int_0^\infty x^k \cdot \frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ dx=\int_0^\infty \frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha+k-1} \ e^{-\frac{x}{\theta}} \ dx \\&=\frac{\theta^k \Gamma(\alpha+k)}{\Gamma(\alpha)} \int_0^\infty \frac{1}{\Gamma(\alpha+k)} \ \biggl(\frac{1}{\theta}\biggr)^{\alpha+k} \ x^{\alpha+k-1} \ e^{-\frac{x}{\theta}} \ dx \\&=\frac{\theta^k \Gamma(\alpha+k)}{\Gamma(\alpha)} \\&=\theta^k \ \alpha (\alpha+1) \cdots (\alpha+k-1) \end{aligned}

Once the higher moments are known, some of the other calculations follow. For example, the coefficient of variation is defined by the ratio of the standard deviation to the mean. This ratio is the standardized measure of dispersion of a probability distribution. The coefficient of skewness is the ratio of the third central moment to the third power of the standard deviation, i.e. $E[(X-\mu)^3]/\sigma^3$. see here for a discussion on skewness. The kurtosis is the ratio of the fourth central moment to the fourth power of the standard deviation, i.e. $E[(X-\mu)^4]/\sigma^4$. The excess kurtosis is obtained by subtracting 3 from the kurtosis.

_______________________________________________________________________________________________

Discussion

The product of two gamma moment generating functions with the same scale parameter $\theta$ (or rate parameter $\beta$) is also an MGF for a gamma distribution. This points to the fact that the independent sum of two gamma distribution (with the same scale parameter or rate parameter) is a gamma distribution. Specifically if $X_1$ follows a gamma distribution with the shape parameter $\alpha_1$ and $X_2$ follows a gamma distribution with shape parameter $\alpha_2$ and that they are independent, then the sum $X_1+X_2$ has a gamma distribution with shape parameter $\alpha_1+\alpha_2$.

We now revisit Figure 1 and Figure 2. The skewness of a gamma distribution is driven only by the shape parameter $\alpha$. The gamma skewness is $\frac{2}{\sqrt{\alpha}}$. The higher the $\alpha$, the less skewed the gamma distribution is (or the more symmetric it looks). This is borne out by Figure 1. There is another angle via the central limit theorem that is borne out by Figure 1. The gamma densities with larger value of $\alpha$ can also be thought of as the independent sum of many gamma distributions with smaller $\alpha$ values. For example, the gamma with $\alpha=10$ and $\theta=2$ can be regarded as the independent sum of 10 exponential distributions each with mean 2. By the central limit theorem, any gamma distribution with large value of $\alpha$ will tend to look symmetric.

In Figure 2, all gamma densities have the same $\alpha=2$. Thus they all have the same skewness (about 0.707). It is clear that as the scale parameter $\theta$ increases, the densities become more spread out while remaining skewed density curves.

The support of the gamma distribution is the interval $(0,\infty)$. Thus it is plausible model for random quantities that take on positive values, e.g. insurance losses or insurance claim amounts. With the gamma density curves being positively skewed (skewed to the right), the gamma distribution is a good candidate for random quantities that are concentrated more on the lower end of the interval $(0,\infty)$.

Though the gamma distribution is positively skewed, it is considered to have a light (right) tail. The notion of having a light tail or heavy tail is a relative concept. The gamma distribution has a light right tail as compared to the Pareto distribution. The Pareto distribution significantly puts more probability on larger values (the gamma distribution with same mean and variance will put significantly less probabilities on the larger values). In terms of modeling insurance losses, the gamma distribution will be a more suitable model for losses that are not catastrophic in nature.

One tell tale sign of a distribution with a light tail is that all positive moments exist. For the gamma distribution, $E(X^k)$ exists for all positive integer $k$. In fact, the gamma distribution has a property stronger than the mere fact that all moments exist, i.e. it has a moment generating function. So even though the Gamma distribution has a right tail that is infinitely long (it extents out to infinity), the amount of probabilities is almost negligible after some limit (as compared to the Pareto distribution for example). See here for a more detailed discussion on tail weights and the Pareto distribution.

The next post discusses how the gamma distribution can arise naturally as the waiting time between two events in a Poisson process.

_______________________________________________________________________________________________

Evaluating the Gamma Function

In many calculations discussed in this blog, it is necessary at times to evaluate the gamma function. Of course if the argument is a positive integer, the gamma function is simply the factorial function. Some special values of the gamma function are:

$\Gamma(\frac{1}{2})=\sqrt{\pi}$

$\Gamma(\frac{3}{2})=\frac{1}{2} \sqrt{\pi}$

$\Gamma(\frac{5}{2})=\frac{3}{4} \sqrt{\pi}$

$\Gamma(\frac{7}{2})=\frac{15}{8} \sqrt{\pi}$

If special values are not known, it is possible to use software to evaluate the gamma function. We demonstrate how it is done using Excel. There is no dedicated function in Excel for evaluating the gamma function. However, Excel has a function for the PDF of a gamma distribution. To evaluate $\Gamma(\alpha)$, consider the density function in $(4b)$ with parameters $\alpha$, $\beta=1$ and $y=1$. We have the following:

$\displaystyle f_Y(1)=\frac{1}{\Gamma(\alpha)} \ 1^\alpha \ 1^{\alpha-1} \ e^{- 1}=\frac{1}{\Gamma(\alpha)}$

$\displaystyle \Gamma(\alpha)=\frac{e^{-1}}{f_Y(1)}$

The key to evaluate $\Gamma(\alpha)$ in Excel is to evaluate the gamma PDF with $\alpha$ and $\beta=1$ at the x-value of 1. The following shows the formula for evaluating the PDF.

=GAMMADIST(1, $\alpha$, 1, FALSE)

Then the gamma function can be evaluated by evaluating the following formula in Excel:

=EXP(-1) / GAMMADIST(1, $\alpha$, 1, FALSE)

For example, $\Gamma(3.6)=3.717023853$ which is obtained by the following formula in Excel:

=EXP(-1) / GAMMADIST(1, 3.6, 1, FALSE)

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$

# Introducing the gamma function

The gamma distribution is a probability distribution that is useful in actuarial modeling. From a mathematical point of view, the gamma function is the starting point of defining the gamma distribution. This post discusses the basic facts that are needed for defining the gamma distribution. Here’s the definition of the gamma function.

For any real number $\alpha>0$, define:

$\displaystyle \Gamma(\alpha)=\int_0^\infty t^{\alpha-1} \ e^{-t} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)$

The following is the graph of the gamma function. It has a U shape. As $x \rightarrow 0$, the graph goes up to infinity. As $x \rightarrow \infty$, the graph increases without bound. As is seen below, the gamma function coincides with the factorial function at the integers.

_______________________________________________________________________________________________

The Convergence

The gamma function is defined by the improper integral as described in $(0)$. The improper integral converges. In fact, the integral converges for all complex numbers $\alpha$ with positive real part. For our purposes at hand, we restrict $\alpha$ to be positive real numbers. Showing that the integral in $(0)$ converges is important. For example, it will be nice to know that the density function of the gamma distribution sums to 1.0. To show that the integral in $(0)$ converges, we first make the observation: for some positive real number $M$ and for all $t>M$,

$\displaystyle t^{\alpha-1}

This is saying that the quantity $\displaystyle e^{\frac{t}{2}}$ dominates the quantity $\displaystyle t^{\alpha-1}$ when $t$ is sufficiently large. This is because the exponential function $\displaystyle e^{\frac{t}{2}}$ increases at a much faster rate than the polynomial function $\displaystyle t^{\alpha-1}$. Then multiply both sides of $(1)$ by $\displaystyle e^{-t}$ to obtain the following:

$\displaystyle t^{\alpha-1} e^{-t}

Now, break up the interval in $(0)$ into two pieces:

$\displaystyle \Gamma(\alpha)=\int_0^M t^{\alpha-1} \ e^{-t} \ dt +\int_M^\infty t^{\alpha-1} \ e^{-t} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$

A basic fact. For any continuous function $h(t)$ defined over a closed interval $[a,b]$ of finite length, the integral $\displaystyle \int_a^b h(t) \ dt$ exists and has a finite value. As a result, the first integral in $(3)$ exists. So we just focus on the second integral.

$\displaystyle \int_M^\infty t^{\alpha-1} \ e^{-t} \ dt < \int_M^\infty e^{-\frac{t}{2}} \ dt=2 e^{-\frac{M}{2}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

Note that in $(4)$, the inequality in $(2)$ is used. Due to $(4)$, the second integral in $(3)$ is finite. With both integrals in $(3)$ finite, it follows that the improper integral in the definition of $\Gamma(\alpha)$ is finite.

_______________________________________________________________________________________________

Some Properties

The gamma function is a well known function in mathematics and has wide and deep implications. Here, we simply state several basic facts that are needed. In the discussion here, One useful fact is that the gamma function is the factorial function shifted down by one when the argument $\alpha$ is a positive integer. Thus the gamma function generalizes the factorial function. Another useful fact is that the gamma function satisfies a recursive relation.

• If $n$ is a positive integer, $\Gamma(n)=(n-1)!$.
• For all $\alpha>0$, $\Gamma(\alpha+1)=\alpha \Gamma(\alpha)$.

The recursive relation $\Gamma(\alpha+1)=\alpha \Gamma(\alpha)$ is derived by using integration by parts. It is clear from definition that $\Gamma(1)=1$. By an induction argument, $\Gamma(n)=(n-1)!$ for all integers greater than 1.

As result, $\Gamma(1)=0!=1$, $\Gamma(2)=1!=1$ and $\Gamma(3)=2!=2$ and so on. Given that $\Gamma(\frac{1}{2})=\sqrt{\pi}$, the recursive relation tells us that $\Gamma(\frac{3}{2})=\frac{1}{2} \sqrt{\pi}$ and $\Gamma(\frac{5}{2})=\frac{3}{4} \sqrt{\pi}$ and so on. The above recursive relation can be further extended:

• For any positive integer $k$, $\Gamma(\alpha+k)=\alpha (\alpha+1) \cdots (\alpha + k-1) \ \Gamma(\alpha)$.

When the integral in $(0)$ has “incomplete” limits, the resulting functions are called incomplete gamma functions. The following are called the upper incomplete gamma function and lower incomplete gamma function, respectively.

$\displaystyle \Gamma(\alpha, x)=\int_x^\infty t^{\alpha-1} \ e^{-t} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

$\displaystyle \gamma(\alpha, x)=\int_0^x t^{\alpha-1} \ e^{-t} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

_______________________________________________________________________________________________
The next post shows how the gamma distribution arises naturally from the gamma function.

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$