A catalog of parametric severity models

Various parametric continuous probability models have been presented and discussed in this blog. The number of parameters in these models ranges from one to two, and in a small number of cases three. They are all potential candidates for severity models in insurance applications and in other actuarial applications. This post highlights these models. The list presented here is not exhaustive; it is only a brief catalog. There are other models that are also suitable for actuarial applications but are not accounted for here. However, the list is a good place to begin. This post also serves as a navigation device (the table shown below contains links to the blog posts).

A Catalog

Many of the models highlighted here are related to the gamma distribution either directly or indirectly. So the catalog starts with the gamma distribution at the top and then branches out to the other related models. Mathematically, the gamma distribution is a two-parameter continuous distribution defined using the gamma function. The gamma sub family includes the exponential distribution, the Erlang distribution and the chi-squared distribution. These are gamma distributions with certain restrictions on one or both of the gamma parameters. Other distributions are obtained by raising a distribution to a power. Still others are obtained by mixing distributions.

Here’s a listing of the models. Click on the links to find out more about the distributions.

Derived From – Model

  • Gamma function – gamma distribution
  • Gamma sub families – exponential, Erlang, chi-squared
  • Independent sum – Erlang (sum of iid exponentials), hypoexponential (sum of independent exponentials with pairwise distinct means)
  • Exponentiation – lognormal (exponentiation of the normal distribution)
  • Raising to a power
      Raising exponential to a positive power – Weibull
      Raising exponential to a power
      Raising gamma to a power
      Raising Pareto to a power – Burr
  • Burr sub families – paralogistic
  • Mixture – Pareto (exponential-gamma), generalized Pareto (gamma-gamma), loglogistic (Weibull-exponential), hyperexponential (mixture of exponentials)
  • Others

The above table categorizes the distributions according to how they are mathematically derived. For example, the gamma distribution is derived from the gamma function. The Pareto distribution is mathematically an exponential-gamma mixture. The Burr distribution is a transformed Pareto distribution, i.e. obtained by raising a Pareto distribution to a positive power. Even though these distributions can be defined simply by giving the PDF and CDF, knowing their mathematical origins informs us of the specific mathematical properties of the distributions. Organizing according to the mathematical origin gives us a concise summary of the models.


Further Comments on the Table

From a mathematical standpoint, the gamma distribution is defined using the gamma function.

    \displaystyle \Gamma(\alpha)=\int_0^\infty t^{\alpha-1} \ e^{-t} \ dt

In the above integral, the argument \alpha is a positive number. The expression t^{\alpha-1} \ e^{-t} in the integrand is always positive. The area between the curve t^{\alpha-1} \ e^{-t} and the x-axis is \Gamma(\alpha). When this expression is normalized, i.e. divided by \Gamma(\alpha), it becomes a density function.

    \displaystyle f(t)=\frac{1}{\Gamma(\alpha)} \ t^{\alpha-1} \ e^{-t}

The above function f(t) is defined over all positive t. The integral of f(t) over all positive t is 1. Thus f(t) is a density function. It has only one parameter, \alpha, which is the shape parameter. Adding a scale parameter \theta makes it a two-parameter distribution. The result is called the gamma distribution. The following is the density function.

    \displaystyle f(x)=\frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ x>0

Both parameters \alpha and \theta are positive real numbers. The first parameter \alpha is the shape parameter and \theta is the scale parameter.
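For readers who want to experiment numerically, the following is a minimal sketch (assuming the numpy and scipy packages are available; the parameter values are chosen purely for illustration) that checks the density formula above against scipy.stats.gamma, which uses the same shape-scale parametrization.

    import numpy as np
    from math import gamma as gamma_fn
    from scipy.stats import gamma

    alpha, theta = 2.5, 3.0        # illustrative shape and scale
    x = np.linspace(0.1, 20, 5)

    # density from the formula above
    pdf_formula = (1 / gamma_fn(alpha)) * (1 / theta) ** alpha * x ** (alpha - 1) * np.exp(-x / theta)

    # density from scipy (shape a = alpha, scale = theta)
    pdf_scipy = gamma.pdf(x, a=alpha, scale=theta)

    print(np.allclose(pdf_formula, pdf_scipy))   # True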

As mentioned above, many of the distributions listed in the above table are related to the gamma distribution. Some of the distributions are sub families of the gamma distribution. For example, when \alpha is a positive integer, the resulting distributions are called Erlang distributions (important in queuing theory). When \alpha=1, the result is the exponential distribution. When \alpha=\frac{k}{2} and \theta=2 where k is a positive integer, the results are the chi-squared distributions (the parameter k is referred to as the degrees of freedom). The chi-squared distribution plays an important role in statistics.

Taking the independent sum of n independent and identically distributed exponential random variables produces the Erlang distribution, a sub family of the gamma distribution. Taking the independent sum of n independent exponential random variables with pairwise distinct means produces the hypoexponential distribution. On the other hand, a mixture of n independent exponential random variables produces the hyperexponential distribution. A simulation sketch of the first of these facts follows.
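Here is a minimal simulation sketch (assuming numpy and scipy; the values n = 3 and mean 2 are illustrative) comparing the empirical CDF of a sum of iid exponentials with the Erlang (gamma with integer shape) CDF.

    import numpy as np
    from scipy.stats import gamma

    rng = np.random.default_rng(2017)
    n_terms, mean = 3, 2.0

    # sum of 3 iid exponential random variables, each with mean 2
    sums = rng.exponential(scale=mean, size=(1_000_000, n_terms)).sum(axis=1)

    # Erlang distribution: gamma with shape 3 and scale 2
    erlang = gamma(a=n_terms, scale=mean)
    for t in (2.0, 6.0, 12.0):
        print(t, np.mean(sums <= t), erlang.cdf(t))   # the two columns agree closely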

The Pareto distribution (Pareto Type II Lomax) is the mixture of exponential distributions with gamma mixing weights. Despite the connection with the gamma distribution, the Pareto distribution is a heavy tailed distribution. Thus the Pareto distribution is suitable for modeling extreme losses, e.g. in modeling rare but potentially catastrophic losses.

As mentioned earlier, raising a Pareto distribution to a positive power generates the Burr distribution. Restricting the parameters in a Burr distribution in a certain way produces the paralogistic distribution. The table indicates the relationships in a concise way. For details, go to the blog posts to get more information.

Tail Weight

Another informative way to categorize the distributions listed in the table is to look at the tail weight. At first glance, all the distributions may look similar. For example, the distributions in the table are right skewed distributions. Upon closer look, some of the distributions put more weight (probability) on the larger values. Hence some of the models are more suitable for modeling phenomena with significantly higher probabilities of large or extreme values.

When a distribution puts significantly more probability on larger values, the distribution is said to be a heavy tailed distribution (or said to have a larger tail weight). In general, tail weight is a relative concept. For example, we say model A has a larger tail weight than model B (or model A has a heavier tail than model B). Nevertheless, there are several ways to assess the tail weight of a given distribution. Here are four criteria.

Tail weight measure – what to look for

  1. Existence of moments – the existence of more positive moments indicates a lighter tailed distribution.
  2. Hazard rate function – an increasing hazard rate function indicates a lighter tailed distribution.
  3. Mean excess loss function – an increasing mean excess loss function indicates a heavier tailed distribution.
  4. Speed of decay of the survival function – a survival function that decays rapidly to zero (as compared to another distribution) indicates a lighter tailed distribution.

Existence of moments
For a positive real number k, the moment E(X^k) is defined by the integral \int_0^\infty x^k \ f(x) \ dx where f(x) is the density function of the distribution in question. If the distribution puts significantly more probability on the larger values in the right tail, this integral may not exist (may not converge) for some k. Thus the existence of the moments E(X^k) for all positive k is an indication that the distribution is a light tailed distribution.

In the above table, the only distributions for which all positive moments exist are gamma (including all gamma sub families such as exponential), Weibull, lognormal, hyperexponential, hypoexponential and beta. Such distributions are considered light tailed distributions.

If positive moments E(X^k) exist only up to a certain value of k, it is an indication that the distribution has a heavy right tail. All the other distributions in the table are considered heavy tailed distributions as compared to the gamma, Weibull and lognormal distributions. Consider a Pareto distribution with shape parameter \alpha and scale parameter \theta. The existence of the Pareto moments E(X^k) is capped by the shape parameter \alpha, i.e. E(X^k) exists only for k<\alpha. If the Pareto distribution is to model a random loss, and if the mean is infinite (when \alpha \le 1), the risk is uninsurable! On the other hand, when \alpha \le 2, the Pareto variance does not exist. This shows that for a heavy tailed distribution, the variance may not be a good measure of risk.
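The following is a small numerical illustration (not a proof), assuming numpy and scipy, of how the Pareto moments behave. For a Pareto Type II (Lomax) density with \alpha=2 and \theta=2, the partial moments \int_0^M x^k \ f(x) \ dx settle down as M grows when k<\alpha but keep growing when k \ge \alpha.

    import numpy as np
    from scipy.integrate import quad

    alpha, theta = 2.0, 2.0    # illustrative Pareto (Lomax) parameters

    def pareto_pdf(x):
        return alpha * theta**alpha / (x + theta) ** (alpha + 1)

    for k in (1, 2, 3):                    # k = 1 is below alpha; k = 2 and 3 are not
        for M in (1e2, 1e3, 1e4):
            partial, _ = quad(lambda x: x**k * pareto_pdf(x), 0, M, limit=200)
            print(f"k={k}, M={M:.0e}, partial moment = {partial:.4f}")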

Hazard rate function
The hazard rate function h(x) of a random variable X is defined as the ratio of the density function and the survival function.

    \displaystyle h(x)=\frac{f(x)}{S(x)}

The hazard rate is called the force of mortality in a life contingency context and can be interpreted as the rate that a person aged x will die in the next instant. The hazard rate is called the failure rate in reliability theory and can be interpreted as the rate that a machine will fail at the next instant given that it has been functioning for x units of time.

Another indication of heavy tail weight is that the distribution has a decreasing hazard rate function. On the other hand, a distribution with an increasing hazard rate function is a light tailed distribution. If the hazard rate function is decreasing (over time if the random variable is a time variable), then the population dies off at a decreasing rate, hence a heavier tail for the distribution in question.

The Pareto distribution is a heavy tailed distribution since the hazard rate is h(x)=\alpha/x (Pareto Type I) and h(x)=\alpha/(x+\theta) (Pareto Type II Lomax). Both hazard rates are decreasing functions.

The Weibull distribution is a flexible model in that when its shape parameter is 0<\tau<1, the Weibull hazard rate is decreasing and when \tau>1, the hazard rate is increasing. When \tau=1, Weibull is the exponential distribution, which has a constant hazard rate.

The point about decreasing hazard rate as an indication of a heavy tailed distribution has a connection with the fourth criterion. The idea is that a decreasing hazard rate means that the survival function decays to zero slowly. This point is due to the fact that the hazard rate function generates the survival function through the following.

    \displaystyle S(x)=e^{\displaystyle -\int_0^x h(t) \ dt}

Thus if the hazard rate function is decreasing in x, then the survival function will decay more slowly to zero. To see this, let H(x)=\int_0^x h(t) \ dt, which is called the cumulative hazard rate function. As indicated above, S(x)=e^{-H(x)}. If h(x) is decreasing in x, H(x) has a lower rate of increase and consequently S(x)=e^{-H(x)} has a slower rate of decrease to zero.
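As a concrete check of the relationship S(x)=e^{-H(x)}, the following sketch (assuming numpy and scipy; the parameter values are illustrative) computes the cumulative hazard of the Pareto Type II (Lomax) hazard rate h(t)=\alpha/(t+\theta) numerically and compares e^{-H(x)} with the closed-form survival function.

    import numpy as np
    from scipy.integrate import quad

    alpha, theta = 2.0, 2.0
    hazard = lambda t: alpha / (t + theta)
    survival_closed = lambda x: (theta / (x + theta)) ** alpha

    for x in (1.0, 5.0, 20.0):
        H, _ = quad(hazard, 0, x)                    # cumulative hazard H(x)
        print(x, np.exp(-H), survival_closed(x))     # the two values agree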

In contrast, the exponential distribution has a constant hazard rate function, making it a medium tailed distribution. As explained above, any distribution having an increasing hazard rate function is a light tailed distribution.

The mean excess loss function
The mean excess loss is the conditional expectation e_X(d)=E(X-d \lvert X>d). If the random variable X represents insurance losses, mean excess loss is the expected loss in excess of a threshold conditional on the event that the threshold has been exceeded. Suppose that the threshold d is an ordinary deductible that is part of an insurance coverage. Then e_X(d) is the expected payment made by the insurer in the event that the loss exceeds the deductible.

Whenever e_X(d) is an increasing function of the deductible d, the loss X is a heavy tailed distribution. If the mean excess loss function is a decreasing function of d, then the loss X is a lighter tailed distribution.

The Pareto distribution can also be classified as a heavy tailed distribution based on an increasing mean excess loss function. For a Pareto distribution (Type I) with shape parameter \alpha and scale parameter \theta, the mean excess loss is e_X(d)=d/(\alpha-1), which is increasing. The mean excess loss for the Pareto Type II Lomax distribution is e_X(d)=(d+\theta)/(\alpha-1), which is also increasing. They are both increasing functions of the deductible d! This means that the larger the deductible, the larger the expected claim if such a large loss occurs! If the underlying distribution for a random loss is Pareto, it is a catastrophic risk situation.

In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.
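The following is a short numerical sketch (assuming numpy and scipy; the parameter values are illustrative) that computes e(d)=\int_d^\infty S(x) \ dx / S(d) for a Pareto Type II (Lomax) survival function and for an exponential survival function. The Pareto values increase with d while the exponential values stay at the mean.

    import numpy as np
    from scipy.integrate import quad

    alpha, theta = 3.0, 2.0                     # Pareto (Lomax); mean exists since alpha > 1
    S_pareto = lambda x: (theta / (x + theta)) ** alpha
    S_expon  = lambda x: np.exp(-x / 2.0)       # exponential with mean 2

    def mean_excess(S, d):
        tail, _ = quad(S, d, np.inf)
        return tail / S(d)

    for d in (0.0, 5.0, 10.0, 50.0):
        # Pareto: (d + theta)/(alpha - 1), increasing; exponential: constant 2
        print(d, mean_excess(S_pareto, d), mean_excess(S_expon, d))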

Speed of decay of the survival function to zero
The survival function S(x)=P(X>x) captures the probability of the tail of a distribution. A survival function that decays slowly to zero (equivalently, a CDF that goes slowly to one) is another indication that the distribution is heavy tailed. This point was touched on in the discussion of the hazard rate function.

The following is a comparison of a Pareto Type II survival function and an exponential survival function. The Pareto survival function has parameters (\alpha=2 and \theta=2). The two survival functions are set to have the same 75th percentile, which is x=2. The following table is a comparison of the two survival functions.

    \displaystyle \begin{array}{llllllll} \text{ } &x &\text{ } & \text{Pareto } S_X(x) & \text{ } & \text{Exponential } S_Y(x) & \text{ } & \displaystyle \frac{S_X(x)}{S_Y(x)} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{ } &2 &\text{ } & 0.25 & \text{ } & 0.25 & \text{ } & 1  \\    \text{ } &10 &\text{ } & 0.027777778 & \text{ } & 0.000976563 & \text{ } & 28  \\  \text{ } &20 &\text{ } & 0.008264463 & \text{ } & 9.54 \times 10^{-7} & \text{ } & 8666  \\   \text{ } &30 &\text{ } & 0.00390625 & \text{ } & 9.31 \times 10^{-10} & \text{ } & 4194304  \\  \text{ } &40 &\text{ } & 0.002267574 & \text{ } & 9.09 \times 10^{-13} & \text{ } & 2.49 \times 10^{9}  \\  \text{ } &60 &\text{ } & 0.001040583 & \text{ } & 8.67 \times 10^{-19} & \text{ } & 1.20 \times 10^{15}  \\  \text{ } &80 &\text{ } & 0.000594884 & \text{ } & 8.27 \times 10^{-25} & \text{ } & 7.19 \times 10^{20}  \\  \text{ } &100 &\text{ } & 0.000384468 & \text{ } & 7.89 \times 10^{-31} & \text{ } & 4.87 \times 10^{26}  \\  \text{ } &120 &\text{ } & 0.000268745 & \text{ } & 7.52 \times 10^{-37} & \text{ } & 3.57 \times 10^{32}  \\  \text{ } &140 &\text{ } & 0.000198373 & \text{ } & 7.17 \times 10^{-43} & \text{ } & 2.76 \times 10^{38}  \\  \text{ } &160 &\text{ } & 0.000152416 & \text{ } & 6.84 \times 10^{-49} & \text{ } & 2.23 \times 10^{44}  \\  \text{ } &180 &\text{ } & 0.000120758 & \text{ } & 6.53 \times 10^{-55} & \text{ } & 1.85 \times 10^{50}  \\  \text{ } & \text{ } \\    \end{array}

Note that at the large values, the Pareto right tail retains much more probability. This is also confirmed by the ratio of the two survival functions, with the ratio approaching infinity. Using an exponential distribution to model a Pareto random phenomenon would be a severe modeling error even though the exponential distribution may be a good model for describing the loss up to the 75th percentile (in the above comparison). It is the large right tail that is problematic (and catastrophic)!
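The table above can be reproduced with a few lines of code. The following sketch (assuming numpy) calibrates the exponential rate so that both survival functions equal 0.25 at x = 2 and prints a few of the rows.

    import numpy as np

    alpha, theta = 2.0, 2.0
    lam = np.log(4) / 2                 # exponential rate with S(2) = exp(-2*lam) = 0.25

    S_pareto = lambda x: (theta / (x + theta)) ** alpha
    S_expon  = lambda x: np.exp(-lam * x)

    for x in (2, 10, 20, 100):
        print(x, S_pareto(x), S_expon(x), S_pareto(x) / S_expon(x))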

Since the Pareto survival function and the exponential survival function have closed forms, we can also look at their ratio.

    \displaystyle \frac{\text{pareto survival}}{\text{exponential survival}}=\frac{\displaystyle \frac{\theta^\alpha}{(x+\theta)^\alpha}}{e^{-\lambda x}}=\frac{\theta^\alpha e^{\lambda x}}{(x+\theta)^\alpha} \longrightarrow \infty \ \text{ as } x \longrightarrow \infty

In the above ratio, the numerator has an exponential function with a positive quantity in the exponent, while the denominator has a polynomial in x. This ratio goes to infinity as x \rightarrow \infty.

In general, whenever the ratio of two survival functions diverges to infinity, it is an indication that the distribution in the numerator of the ratio has a heavier tail. When the ratio goes to infinity, the survival function in the numerator is said to decay slowly to zero as compared to the denominator.

It is important to examine the tail behavior of a distribution when considering it as a candidate for a model. The four criteria discussed here provide a crucial way to classify parametric models according to the tail weight.


\copyright 2017 – Dan Ma

Examples of mixtures

The notion of mixtures is discussed in this previous post. Many probability distributions useful for actuarial modeling are mixture distributions. The previous post touches on some examples – negative binomial distribution (a Poisson-Gamma mixture), Pareto distribution (an exponential-gamma mixture) and the normal-normal mixture. In this post we present additional examples. We discuss the following examples.

  1. Poisson-Gamma mixture = Negative Binomial.
  2. Normal-Normal mixture = Normal.
  3. Exponential-Gamma mixture = Pareto.
  4. Exponential-Inverse Gamma mixture = Pareto.
  5. Gamma-Gamma mixture = Generalized Pareto.
  6. Weibull-Exponential mixture = Loglogistic.
  7. Gamma-Geometric mixture = Exponential.
  8. Normal-Gamma mixture = Student t.

The first three examples are discussed in the previous post. We discuss the remaining examples in this post.

The Pareto Family

Examples 3 and 4 show that Pareto distributions are mixtures of exponential distributions with either gamma or inverse gamma mixing weights. In Example 3, X \lvert \Theta is an exponential distribution with \Theta being a rate parameter. When \Theta follows a gamma distribution, the resulting mixture is a Pareto Type II (Lomax) distribution. In Example 4, X \lvert \Theta is an exponential distribution with \Theta being a scale parameter. When \Theta follows an inverse gamma distribution, the resulting mixture is also a Pareto Type II (Lomax) distribution.

As a mixture, Example 5 is like Example 3, except that it is a gamma-gamma mixture resulting in a generalized Pareto distribution. Example 3 has been discussed in the previous post. We now discuss Example 4 and Example 5.

Example 4. Suppose that X \lvert \Theta has an exponential distribution where \Theta is a scale parameter.
Further suppose that the random parameter \Theta follows an inverse gamma distribution with parameters \alpha and \beta. Then the unconditional distribution for X is a Pareto Type II (Lomax) distribution with shape parameter \alpha and scale parameter \beta.

The following gives the cumulative distribution function (CDF) and survival function of the conditional random variable X \lvert \Theta.

    F(x \lvert \Theta)=1-e^{- x/\Theta}

    S(x \lvert \Theta)=e^{- x/\Theta}

The random parameter \Theta follows an inverse gamma distribution with parameters \alpha and \beta. The following is the pdf of \Theta:

    \displaystyle g(\theta)=\frac{1}{\Gamma(\alpha)} \ \biggl[\frac{\beta}{\theta}\biggr]^\alpha \ \frac{1}{\theta} \ e^{-\frac{\beta}{ \theta}} \ \ \ \ \ \theta>0

We show that the unconditional survival function for X is the survival function for the Pareto distribution with parameters \alpha (shape parameter) and \beta (scale parameter).

    \displaystyle \begin{aligned} S(x)&=\int_0^\infty S(x \lvert \theta) \ g(\theta) \ d \theta \\&=\int_0^\infty e^{- x/\theta} \ \frac{1}{\Gamma(\alpha)} \ \biggl[\frac{\beta}{\theta}\biggr]^\alpha \ \frac{1}{\theta} \ e^{-\beta / \theta} \ d \theta \\&=\int_0^\infty \frac{1}{\Gamma(\alpha)} \ \biggl[\frac{\beta}{\theta}\biggr]^\alpha \ \frac{1}{\theta} \ e^{-(x+\beta) / \theta} \ d \theta \\&=\frac{\beta^\alpha}{(x+\beta)^\alpha} \ \int_0^\infty \frac{1}{\Gamma(\alpha)} \ \biggl[\frac{x+\beta}{\theta}\biggr]^\alpha \ \frac{1}{\theta} \ e^{-(x+\beta) / \theta} \ d \theta \\&=\biggl(\frac{\beta}{x+\beta} \biggr)^\alpha \end{aligned}

Note that the integrand in the last integral is a density function for an inverse gamma distribution. Thus the integral is 1 and can be eliminated. The result that remains is the survival function for a Pareto distribution with parameters \alpha and \beta. The following gives the CDF and density function of this Pareto distribution.

    \displaystyle F(x)=1-\biggl(\frac{\beta}{x+\beta} \biggr)^\alpha

    \displaystyle f(x)=\frac{\alpha \ \beta^{\alpha}}{(x+\beta)^{\alpha+1}}

See here for further information on the Pareto Type II (Lomax) distribution.
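A quick Monte Carlo sanity check of Example 4 can be done as follows (a sketch assuming numpy and scipy; the parameter values are illustrative). The scipy inverse gamma with shape a=\alpha and scale \beta has exactly the density g(\theta) above.

    import numpy as np
    from scipy.stats import invgamma

    rng = np.random.default_rng(2017)
    alpha, beta = 3.0, 2.0
    n = 1_000_000

    theta = invgamma.rvs(a=alpha, scale=beta, size=n, random_state=rng)
    x = rng.exponential(scale=theta)              # X | Theta is exponential with scale Theta

    for t in (1.0, 5.0, 10.0):
        # empirical survival versus the Pareto closed form (beta/(t + beta))^alpha
        print(t, np.mean(x > t), (beta / (t + beta)) ** alpha)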

Example 5. Suppose that X \lvert \Theta has a gamma distribution with shape parameter k (a known constant) and rate parameter \Theta. Further suppose that the random parameter \Theta follows a gamma distribution with shape parameter \alpha and rate parameter \beta. Then the unconditional distribution for X is a generalized Pareto distribution with parameters \alpha, \beta and k.

Conditional on \Theta=\theta, the following is the density function of X.

    \displaystyle f(x \lvert \theta)=\frac{1}{\Gamma(k)} \ \theta^k \ x^{k-1} \ e^{-\theta x}  \ \ \ \ \ x>0

The following is the density function of the random parameter \Theta.

    \displaystyle g(\theta)=\frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ \ \ \ \ \ \theta>0

The following gives the unconditional density function for X.

    \displaystyle \begin{aligned} f(x)&=\int_0^\infty  f(x \lvert \theta) \ g(\theta) \ d \theta \\&=\int_0^\infty  \frac{1}{\Gamma(k)} \ \theta^k \ x^{k-1} \ e^{-\theta x} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ d \theta \\&=\int_0^\infty \frac{1}{\Gamma(k)} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ x^{k-1} \ \theta^{\alpha+k-1} \ e^{-(x+\beta) \theta} \ d \theta \\&= \frac{1}{\Gamma(k)} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ x^{k-1} \frac{\Gamma(\alpha+k)}{(x+\beta)^{\alpha+k}} \int_0^\infty \frac{1}{\Gamma(\alpha+k)} \ (x+\beta)^{\alpha+k} \ \theta^{\alpha+k-1} \ e^{-(x+\beta) \theta} \ d \theta \\&=\frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \frac{\beta^\alpha \ x^{k-1}}{(x+\beta)^{\alpha+k}} \end{aligned}

Any distribution that has a density function described above is said to be a generalized Pareto distribution with the parameters \alpha, \beta and k. Its CDF cannot be written in closed form but can be expressed using the incomplete beta function.

    \displaystyle \begin{aligned} F(x)&=\int_0^x  \frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \frac{\beta^\alpha \ t^{k-1}}{(t+\beta)^{\alpha+k}} \ dt \\&=\int_0^x  \frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \biggl(\frac{t}{t+\beta} \biggr)^{k-1} \ \biggl(\frac{\beta}{t+\beta} \biggr)^{\alpha-1} \ \frac{\beta}{(t+\beta)^2} \ dt \\&=\frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \int_0^{\frac{x}{x+\beta}} u^{k-1} \ (1-u)^{\alpha-1} \ du, \ \ \ u=\frac{t}{t+\beta} \\&=\frac{\Gamma(\alpha+k)}{\Gamma(\alpha) \ \Gamma(k)} \ \int_0^{w} t^{k-1} \ (1-t)^{\alpha-1} \ dt, \ \ \ w=\frac{x}{x+\beta}   \end{aligned}

The moments of the generalized Pareto distribution can be easily derived, but only on a limited basis. Since it is a mixture distribution, the unconditional moment E(X^w) is the weighted average of the conditional gamma moments E(X^w \lvert \theta).

    \displaystyle \begin{aligned} E(X^w)&=\int_0^\infty  E(X^w \lvert \theta) \ g(\theta) \ d \theta \\&=\int_0^\infty  \frac{\Gamma(k+w)}{\theta^w \Gamma(k)} \ \frac{1}{\Gamma(\alpha)} \ \beta^\alpha \ \theta^{\alpha-1} \ e^{-\beta \theta} \ d \theta \\&=\frac{\beta^w \ \Gamma(k+w) \ \Gamma(\alpha-w)}{\Gamma(k) \ \Gamma(\alpha)} \int_0^\infty \frac{1}{\Gamma(\alpha-w)} \ \beta^{\alpha-w} \ \theta^{\alpha-w-1} \ e^{-\beta \theta} \ d \theta \\&=\frac{\beta^w \ \Gamma(k+w) \ \Gamma(\alpha-w)}{\Gamma(k) \ \Gamma(\alpha)} \ \ \ \ -k<w<\alpha   \end{aligned}

Note that E(X) has a simple expression E(X)=\frac{k \beta}{\alpha-1} when 1<\alpha.

When the parameter k=1, the conditional distribution for X \lvert \Theta is an exponential distribution. Then the situation reverts back to Example 3, leading to a Pareto distribution. Thus the Pareto distribution is a special case of the generalized Pareto distribution. Both the Pareto distribution and the generalized Pareto distribution have thicker and longer tails than the original conditional gamma distribution.

It turns out that the F distribution is also a special case of the generalized Pareto distribution. The F distribution with r_1 and r_2 degrees of freedom is the generalized Pareto distribution with parameters k=r_1/2, \alpha=r_2/2 and \beta=r_2/r_1. As a result, the following is the density function.

    \displaystyle \begin{aligned} h(x)&=\frac{\Gamma(r_1/2 + r_2/2)}{\Gamma(r_1/2) \ \Gamma(r_2/2)} \ \frac{(r_2/r_1)^{r_2/2} \ x^{r_1/2-1}}{(x+r_2/r_1)^{r_1/2+r_2/2}} \\&=\frac{\Gamma(r_1/2 + r_2/2)}{\Gamma(r_1/2) \ \Gamma(r_2/2)} \ \frac{(r_1/r_2)^{r_1/2} \ x^{r_1/2-1}}{(1+(r_1/r_2)x)^{r_1/2+r_2/2}}  \ \ \ \ 0<x<\infty   \end{aligned}

Another way to generate the F distribution is by taking the ratio of two independent chi-squared random variables, each divided by its degrees of freedom (see Theorem 9 in this previous post). Of course, there is no need to use the explicit form of the density function of the F distribution. In a statistical application, the F distribution is accessed using tables or software.
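The following is a quick numerical check (assuming numpy and scipy; the degrees of freedom are illustrative) that the generalized Pareto density with k=r_1/2, \alpha=r_2/2 and \beta=r_2/r_1 agrees with scipy.stats.f.

    import numpy as np
    from scipy.stats import f
    from scipy.special import gamma as G

    r1, r2 = 4, 6
    k, alpha, beta = r1 / 2, r2 / 2, r2 / r1

    def gen_pareto_pdf(x):
        # generalized Pareto density derived in Example 5
        return G(alpha + k) / (G(alpha) * G(k)) * beta**alpha * x**(k - 1) / (x + beta)**(alpha + k)

    x = np.linspace(0.1, 5.0, 6)
    print(np.allclose(gen_pareto_pdf(x), f.pdf(x, r1, r2)))   # True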

The Loglogistic Distribution

The loglogistic distribution can be derived as a mixture of Weibull distributions with exponential mixing weights.

Example 6. Suppose that X \lvert \Lambda has a Weibull distribution with shape parameter \gamma (a known constant) and a parameter \Lambda such that the CDF of X \lvert \Lambda is F(x \lvert \Lambda)=1-e^{-\Lambda \ x^\gamma}. Further suppose that the random parameter \Lambda follows an exponential distribution with rate parameter \theta^{\gamma}. Then the unconditional distribution for X is a loglogistic distribution with shape parameter \gamma and scale parameter \theta.

The following gives the conditional survival function for X \lvert \Lambda and the exponential mixing weight.

    \displaystyle S(x \lvert \lambda)=e^{-\lambda \ x^\gamma}

    \displaystyle g(\lambda)=\theta^\gamma \ e^{-\theta^\gamma \ \lambda}

The following gives the unconditional survival function and CDF of X as well as the PDF.

    \displaystyle \begin{aligned} S(x)&=\int_0^\infty S(x \lvert \lambda) \ g(\lambda) \ d \lambda \\&=\int_0^\infty e^{-\lambda \ x^\gamma} \ \theta^\gamma \ e^{-\theta^\gamma \ \lambda} \ d \lambda \\&=\int_0^\infty  \theta^\gamma \ e^{-(x^\gamma+\theta^\gamma) \ \lambda} \ d \lambda \\&=\frac{\theta^\gamma}{(x^\gamma+\theta^\gamma)} \int_0^\infty   (x^\gamma+\theta^\gamma) \ e^{-(x^\gamma+\theta^\gamma) \ \lambda} \ d \lambda \\&=\frac{\theta^\gamma}{x^\gamma+\theta^\gamma} \end{aligned}

    \displaystyle \begin{aligned} F(x)&=1-S(x)=1-\frac{\theta^\gamma}{x^\gamma+\theta^\gamma} =\frac{x^\gamma}{x^\gamma+\theta^\gamma} =\frac{(x/\theta)^\gamma}{1+(x/\theta)^\gamma} \end{aligned}

    \displaystyle f(x)=\frac{d}{dx} \biggl( \frac{x^\gamma}{x^\gamma+\theta^\gamma} \biggr)=\frac{\gamma \ (x/\theta)^\gamma}{x [1+(x/\theta)^\gamma]^2}

Any distribution that has any one of the above three distributional quantities is said to be a loglogistic distribution with shape parameter \gamma and scale parameter \theta.
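Here is a Monte Carlo sketch of Example 6 (assuming numpy and scipy; the parameter values are illustrative). The loglogistic distribution is available in scipy as the Fisk distribution (scipy.stats.fisk) with shape c=\gamma and scale \theta.

    import numpy as np
    from scipy.stats import fisk

    rng = np.random.default_rng(2017)
    gamma_, theta = 3.0, 2.0
    n = 1_000_000

    lam = rng.exponential(scale=1 / theta**gamma_, size=n)   # rate theta^gamma means scale 1/theta^gamma
    u = rng.uniform(size=n)
    x = (-np.log(1 - u) / lam) ** (1 / gamma_)               # inverse CDF of the conditional Weibull

    for t in (0.5, 2.0, 5.0):
        print(t, np.mean(x <= t), fisk.cdf(t, c=gamma_, scale=theta))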

One interesting point about the loglogistic distribution is that the reciprocal of a loglogistic random variable is another loglogistic random variable. Suppose that X has a loglogistic distribution with shape parameter \gamma and scale parameter \theta. Let Y=\frac{1}{X}. Then Y has a loglogistic distribution with shape parameter \gamma and scale parameter \theta^{-1}.

    \displaystyle \begin{aligned} P[Y \le y]&=P[\frac{1}{X} \le y] =P[X \ge y^{-1}] =\frac{\theta^\gamma}{y^{-\gamma}+\theta^\gamma} \\&=\frac{\theta^\gamma \ y^\gamma}{1+\theta^\gamma \ y^\gamma} \\&=\frac{y^\gamma}{(\theta^{-1})^\gamma+y^\gamma} \end{aligned}

The above is the CDF of a loglogistic distribution with the desired parameters. Thus there is no need to specially call out the inverse loglogistic distribution.

In order to find the mean and higher moments of the loglogistic distribution, we identify the conditional Weibull moments and then weight them by the exponential mixing weights. Note that the parameter \Lambda in the conditional CDF F(x \lvert \Lambda)=1-e^{-\Lambda \ x^\gamma} is not a scale parameter. The Weibull distribution in this conditional CDF is equivalent to a Weibull distribution with shape parameter \gamma and scale parameter \Lambda^{-1/\gamma}. According to formula (4) in this previous post, the kth moment of this Weibull distribution is

    \displaystyle E[ (X \lvert \Lambda)^k]=\Gamma \biggl(1+\frac{k}{\gamma} \biggr) \Lambda^{-k/\gamma}

The following gives the unconditional kth moment of the Weibull-exponential mixture.

    \displaystyle \begin{aligned} E[X^k]&=\int_0^\infty E[ (X \lvert \Lambda)^k] \ g(\lambda) \ d \lambda \\&=\int_0^\infty \Gamma \biggl(1+\frac{k}{\gamma} \biggr) \lambda^{-k/\gamma} \ \theta^\gamma \ e^{-\theta^\gamma \ \lambda} \ d \lambda\\&=\Gamma \biggl(1+\frac{k}{\gamma} \biggr) \ \theta^\gamma \int_0^\infty  \lambda^{-k/\gamma} \ e^{-\theta^\gamma \ \lambda} \ d \lambda \\&=\theta^k \ \Gamma \biggl(1+\frac{k}{\gamma} \biggr)  \int_0^\infty  t^{-k/\gamma} \ e^{-t} \ dt \ \ \text{ where } t=\theta^\gamma \lambda \\&=\theta^k \ \Gamma \biggl(1+\frac{k}{\gamma} \biggr) \int_0^\infty  t^{[(\gamma-k)/\gamma]-1} \ e^{-t} \ dt   \\&=\theta^k \ \Gamma \biggl(1+\frac{k}{\gamma} \biggr) \ \Gamma \biggl(1-\frac{k}{\gamma} \biggr) \ \ \ \ -\gamma<k<\gamma  \end{aligned}

The range -\gamma<k<\gamma follows from the fact that the arguments of the gamma function must be positive. Thus the kth moments of the loglogistic distribution are limited by its shape parameter \gamma. If \gamma=1, then E(X) does not exist. For a larger \gamma, more moments exist, but always only a finite number of moments. This is an indication that the loglogistic distribution has a thick (right) tail. This is not surprising since mixture distributions (loglogistic in this case) tend to have thicker tails than the conditional distributions (Weibull in this case). The thicker tail is a result of the uncertainty in the random parameter of the conditional distribution (the Weibull \Lambda in this case).

Another Way to Obtain Exponential Distribution

We now consider Example 7. The following is a precise statement of the gamma-geometric mixture.

Example 7. Suppose that X \lvert \alpha has a gamma distribution with shape parameter \alpha that is a positive integer and rate parameter \beta (a known constant). Further suppose that the random parameter \alpha follows a geometric distribution with probability function P[Y=\alpha]=p (1-p)^{\alpha-1} where \alpha=1,2,3,\cdots. Then the unconditional distribution for X is an exponential distribution with rate parameter \beta p.

The conditional gamma distribution has an uncertain shape parameter \alpha that can take on positive integer values. The parameter \alpha follows a geometric distribution. Here are the ingredients that go into the mixture.

    \displaystyle f(x \lvert \alpha)=\frac{1}{(\alpha-1)!} \ \beta^\alpha \ x^{\alpha-1} \ e^{-\beta x}

    P[Y=\alpha]=p (1-p)^{\alpha-1}

The following is the unconditional probability density function of X.

    \displaystyle \begin{aligned} f(x)&=\sum \limits_{\alpha=1}^\infty f(x \lvert \alpha) \ P[Y=\alpha] \\&=\sum \limits_{\alpha=1}^\infty \frac{1}{(\alpha-1)!} \ \beta^\alpha \ x^{\alpha-1} \ e^{-\beta x} \ p (1-p)^{\alpha-1} \\&=\beta p \ e^{-\beta x} \sum \limits_{\alpha=1}^\infty \frac{[\beta(1-p) x]^{\alpha-1}}{(\alpha-1)!} \\&=\beta p \ e^{-\beta x} \sum \limits_{\alpha=0}^\infty \frac{[\beta(1-p) x]^{\alpha}}{(\alpha)!} \\&=\beta p \ e^{-\beta x} \ e^{\beta(1-p) x} \end{aligned}

The above density function is that of an exponential distribution with rate parameter \beta p.
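A simulation sketch of Example 7 (assuming numpy; the values of \beta and p are illustrative): draw the shape from a geometric distribution, then draw from the conditional gamma distribution, and compare with an exponential distribution with rate \beta p.

    import numpy as np

    rng = np.random.default_rng(2017)
    beta, p = 2.0, 0.4
    n = 1_000_000

    shape = rng.geometric(p, size=n)           # alpha = 1, 2, 3, ...
    x = rng.gamma(shape, scale=1 / beta)       # X | alpha ~ gamma with shape alpha and rate beta

    print(x.mean(), 1 / (beta * p))            # both close to 1.25
    for t in (0.5, 1.0, 3.0):
        print(t, np.mean(x > t), np.exp(-beta * p * t))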

Student t Distribution

Example 2 (discussed in the previous post) involves a normal distribution with a random mean. Example 8 involves a normal distribution with mean 0 and an uncertain variance, which follows a gamma distribution such that the two gamma parameters are related to a common parameter r, which will be the degrees of freedom of the student t distribution. The following is a precise description of the normal-gamma mixture.

Example 8. Suppose that X \lvert \Lambda has a normal distribution with mean 0 and variance 1/\Lambda. Further suppose that the random parameter \Lambda follows a gamma distribution with shape parameter \alpha and scale parameter \theta such that 2 \alpha=\frac{2}{\theta}=r is a positive integer. Then the unconditional distribution for X is a student t distribution with r degrees of freedom.

The following gives the ingredients of the normal-gamma mixture. The first item is the conditional density function of X given \Lambda. The second is the density function of the mixing weight \Lambda.

    \displaystyle f(x \lvert \lambda)=\frac{1}{\sqrt{1/\lambda} \ \sqrt{2 \pi}} \ e^{-(\lambda/2) \  x^2}=\sqrt{\frac{\lambda}{2 \pi}} \ e^{-(\lambda/2) \  x^2}

    \displaystyle g(\lambda)=\frac{1}{\Gamma(\alpha)} \biggl( \frac{1}{\theta} \biggr)^\alpha \ \lambda^{\alpha-1} \ e^{-\lambda/\theta}

The following calculation derives the unconditional density function of X.

    \displaystyle \begin{aligned} f(x)&=\int_{0}^\infty f(x \lvert \lambda) \ g(\lambda) \ d \lambda \\&=\int_{0}^\infty \sqrt{\frac{\lambda}{2 \pi}} \ e^{-(\lambda/2) \  x^2} \ \frac{1}{\Gamma(\alpha)} \biggl( \frac{1}{\theta} \biggr)^\alpha \ \lambda^{\alpha-1} \ e^{-\lambda/\theta} \ d \lambda \\&=\frac{1}{\Gamma(\alpha)} \ \biggl( \frac{1}{\theta} \biggr)^\alpha \ \frac{1}{\sqrt{2 \pi}} \int_0^\infty \lambda^{\alpha+\frac{1}{2}-1} e^{-(\frac{x^2}{2}+\frac{1}{\theta} ) \lambda} \ d \lambda \\&=\frac{\Gamma(\alpha+\frac{1}{2})}{\Gamma(\alpha)} \ \biggl( \frac{1}{\theta} \biggr)^\alpha \ \frac{1}{\sqrt{2 \pi}} \ \biggl(\frac{2 \theta}{\theta x^2+2} \biggr)^{\alpha+\frac{1}{2}} \\& \times \int_0^\infty \frac{1}{\Gamma(\alpha+\frac{1}{2})} \ \biggl(\frac{\theta x^2+2}{2 \theta} \biggr)^{\alpha+\frac{1}{2}} \lambda^{\alpha+\frac{1}{2}-1} e^{-\frac{\theta x^2+2}{2 \theta} \lambda} \ d \lambda \\&=\frac{\Gamma(\alpha+\frac{1}{2})}{\Gamma(\alpha)} \ \biggl( \frac{1}{\theta} \biggr)^\alpha \ \frac{1}{\sqrt{2 \pi}} \ \biggl(\frac{2 \theta}{\theta x^2+2} \biggr)^{\alpha+\frac{1}{2}} \ \ \ \ \ -\infty<x<\infty \end{aligned}

The above density function is in terms of the two parameters \alpha and \theta. In the assumptions, the two parameters are related to a common parameter r such that \alpha=\frac{r}{2} and \theta=\frac{2}{r}. The following derivation converts to the common r.

    \displaystyle \begin{aligned} f(x)&=\frac{\Gamma(\frac{r}{2}+\frac{1}{2})}{\Gamma(\frac{r}{2})} \ \biggl( \frac{r}{2} \biggr)^{\frac{r}{2}} \ \frac{1}{\sqrt{2 \pi}} \ \biggl(\frac{2 \frac{2}{r}}{\frac{2}{r} x^2+2} \biggr)^{\frac{r}{2}+\frac{1}{2}} \\&=\frac{\Gamma(\frac{r}{2}+\frac{1}{2})}{\Gamma(\frac{r}{2})} \ \frac{r^{r/2}}{2^{r/2}} \ \frac{1}{2^{1/2} \sqrt{\pi}} \ \biggl(\frac{2/r}{x^2/r+1} \biggr)^{(r+1)/2} \\&=\frac{\Gamma \biggl(\displaystyle \frac{r+1}{2} \biggr)}{\Gamma \biggl(\displaystyle \frac{r}{2} \biggr)} \ \frac{1}{\sqrt{\pi r}} \ \frac{1 \ \ \ \ \ }{\biggl(1+\displaystyle \frac{x^2}{r} \biggr)^{(r+1)/2}} \ \ \ \ \ -\infty<x<\infty \end{aligned}

The above density function is that of a student t distribution with r degrees of freedom. Of course, in performing tests of significance, the t distribution is accessed by using tables or software. A usual textbook definition of the student t distribution is the ratio of a standard normal random variable and the square root of an independent chi-squared random variable divided by its degrees of freedom (see Theorem 6 in this previous post).
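A simulation sketch of Example 8 (assuming numpy and scipy; r = 5 is illustrative): draw \Lambda from a gamma distribution with shape r/2 and scale 2/r, draw X \lvert \Lambda from a normal distribution with mean 0 and variance 1/\Lambda, and compare with scipy's t distribution.

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(2017)
    r = 5
    n = 1_000_000

    lam = rng.gamma(shape=r / 2, scale=2 / r, size=n)
    x = rng.normal(loc=0.0, scale=1 / np.sqrt(lam))    # X | Lambda has variance 1/Lambda

    for q in (-2.0, 0.0, 1.0, 2.5):
        print(q, np.mean(x <= q), t.cdf(q, df=r))      # empirical CDF versus t CDF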


\copyright 2017 – Dan Ma

Lognormal distribution

This post discusses the basic properties of the lognormal distribution. The lognormal distribution is a transformation of the normal distribution through exponentiation. As a result, some of the mathematical properties of the lognormal distribution can be derived from the normal distribution. The normal distribution is applicable in many situations but not in all situations. The normal density curve is a bell-shaped curve and is thus not appropriate for phenomena that are skewed to the right. In such situations, the lognormal distribution can be a good alternative to the normal distribution. In an actuarial setting, the lognormal distribution is an excellent candidate for a model of insurance claim sizes.

_______________________________________________________________________________________________

Defining the distribution

In this post, the notation log refers to the natural log function, i.e., logarithm to the base e=2.718281828459 \cdots. Thus log(2) = 0.693147181.

A random variable Y is said to follow a lognormal distribution if \log(Y) follows a normal distribution. A lognormal distribution has two parameters \mu and \sigma, which are the mean and standard deviation of the normal random variable \log(Y). To be more precise, the definition is restated as follows:

A random variable Y is said to follow a lognormal distribution with parameters \mu and \sigma if \log(Y) follows a normal distribution with mean \mu and standard deviation \sigma.

Many useful probability distributions are transformations of other known distributions. The above definition shows that a normal distribution is the transformation of a lognormal distribution under the natural logarithm: start with a lognormal distribution, and taking the natural log of it gives a normal distribution. The other direction is actually more informative, i.e., a lognormal distribution is the transformation of a normal distribution by the exponential function: start with a normal random variable X, and exponentiating it gives a lognormal random variable, i.e., Y=e^{X} is lognormal. The following summarizes these two transformations.

If the random variable Y is lognormal, then the random variable X=\log(Y) is normal.
If the random variable X is normal, then the random variable Y=e^{X} is lognormal.

_______________________________________________________________________________________________

Comparing Normal and Lognormal

Normally one of the first things to focus on is the probability density function when studying a continuous probability model. In the case of the lognormal distribution, a natural way to start is to focus on the relationship between lognormal distribution and normal distribution. In this section, we compare the following:

  • The lognormal distribution with parameters \mu = 0 and \sigma = 1 (standard lognormal distribution).
  • The normal distribution with mean 0 and standard deviation 1 (standard normal distribution).

Suppose that the random variable Y follows the standard lognormal distribution. This means that X=\log(Y) is the normal distribution with mean 0 and standard deviation 1, i.e. the standard normal distribution. Going the other direction, Y=e^X is the lognormal distribution we start with.

The standard normal X takes on values from -\infty to \infty. But about 99.7% of the time, X takes on values between -3 and 3 (in general, a normal random variable takes on values within 3 standard deviations of its mean about 99.7% of the time).

Since exponentiation always gives positive values, the standard lognormal random variable Y always takes on positive values. Since X is rarely outside of -3 and +3, the standard lognormal random variable Y takes on values between e^{-3} = 0.049787068 and e^{3} = 20.08553692 about 99.7% of the time. Thus observing a standard lognormal value over 20 would be an extremely rare event. The following figure shows the graphs of the standard normal density function and the standard lognormal density function.

    Figure 1 – normal and lognormal density curves

In Figure 1, the standard normal density curve is a symmetric bell-shaped curve, with mean and median located at x = 0. The standard lognormal density is located entirely over the positive x-axis. This is because the exponential function always gives positive values regardless of the sign of the argument. The lognormal density curve in Figure 1 is not symmetric and is a uni-modal curve skewed toward the right. All the standard normal values on the negative x-axis, when exponentiated, are in the interval (0, 1). Thus in Figure 1, the lower half of the lognormal probabilities lie in the interval from x = 0 to x = 1 (i.e. the median of this lognormal distribution is x = 1). The other half of the lognormal probabilities lie in the interval (1, \infty). Such a lopsided assignment of probabilities shows that the lognormal distribution is a positively skewed distribution (skewed to the right).

In the above paragraph, the lower half of the normal distribution on (-\infty,0) is matched with the lognormal distribution on the interval (0, 1). Such interval matching can tell us a great deal about the lognormal distribution. Another example: about 75% of the standard normal distributional values lie below x = 0.67. Thus in Figure 1, about 75% of the lognormal probabilities lie in the interval (0, 1.95) where e^{0.67} \approx 1.95. Another example: what is the probability that the lognormal distribution in Figure 1 lies between 1 and 3.5? The matching normal interval is (0, 1.25) where \log(3.5) \approx 1.25. The normal probability in this interval is 0.8944 – 0.5 = 0.3944. Thus if a value is randomly generated from the standard lognormal distribution, there is a 39.44% chance that it is between 1 and 3.5.

The interval matching idea is very useful for computing lognormal probabilities (e.g. cumulative distribution function) and for finding lognormal percentiles. This idea is discussed further below to make it work for any lognormal distribution, not just the standard lognormal distribution. The following compares the two cumulative distribution functions (CDFs).

    Figure 2 – normal and lognormal CDFs

Note that the standard normal CDF essentially reaches the level y = 1 when the x-values get close to 3.0. The lognormal CDF approaches 1.0 as well, but at a much slower rate; it does not get close to 1 until around x = 10.

Though the lognormal distribution is a skewed distribution, some lognormal distributions are less skewed than others. Lognormal distributions with a larger value of the parameter \sigma tend to be more skewed. The following is a diagram of three lognormal density curves that demonstrates this point. Note that the curve with the small \sigma of 0.25 closely resembles a bell curve.

    Figure 3 – three lognormal density curves

_______________________________________________________________________________________________

How to compute lognormal probabilities and percentiles

Let Y be a random variable that follows a lognormal distribution with parameters \mu and \sigma. Then the related normal random variable is X=\log(Y), which has mean \mu and standard deviation \sigma. If we raise e to X, we get back the lognormal Y.

Continuing the interval matching idea, the lognormal interval Y \le y will match with the normal interval X \le \log(y). Both intervals receive the same probability in their respective distributions. The following states this more clearly.

    \displaystyle P\biggl(Y \le y\biggr)=P\biggl(X=\log(Y) \le \log(y)\biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

On the other hand, the normal interval X \le x will match with the lognormal interval Y \le e^x. The same probability is assigned to both intervals in their respective distributions. This idea is stated as follows:

    \displaystyle P\biggl(X \le x\biggr)=P\biggl(Y=e^X \le e^x\biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The idea of (1) gives the cumulative distribution of the lognormal distribution (argument y), which is evaluated as the CDF of the corresponding normal distribution at \log(y). One obvious application of (2) is to have an easy way to find percentiles for the lognormal distribution. It is relatively easy to find the corresponding percentile of the normal distribution. Then the lognormal percentile is e raised to the corresponding percentile of the normal distribution. For example, the median of the normal distribution is at the mean \mu. Then the median of the lognormal distribution is e^{\mu}.

The calculation in both (1) and (2) involve finding normal probabilities, which can be obtained using software or using a table of probability values of the standard normal distribution. To do the table approach, each normal CDF is converted to the standard normal CDF as follows:

    \displaystyle P\biggl(X \le x \biggr)=P\biggl(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma} \biggr)=\Phi(z) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

where z is the z-score which is the ratio (x-\mu)/\sigma and \Phi(z) is the cumulative distribution function of the standard normal distribution, which can be looked up from a table based on the z-score. In light of this, (1) can be expressed as follows:

    \displaystyle P\biggl(Y \le y\biggr)=\Phi\biggl(\frac{\log(y)-\mu}{\sigma} \biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

A quick example to demonstrate how this works.

Example 1
If Y is lognormally distributed with parameters \mu=2.5 and \sigma=1.5,

  • what is the probability P(1.9 \le Y \le 31.34)?
  • what is the 95th percentile of this lognormal distribution?

The first answer is P(1.9 \le Y \le 31.34)=P(Y \le 31.34)-P(Y \le 1.9), which is calculated as follows:

    \displaystyle P(Y \le 31.34)=\Phi \biggl( \frac{\log(31.34)-2.5}{1.5}\biggr)=\Phi(0.63)=0.7357

    \displaystyle P(Y \le 1.9)=\Phi \biggl( \frac{\log(1.9)-2.5}{1.5}\biggr)=\Phi(-1.24)=1-0.8925=0.1075

    \displaystyle \begin{aligned} P(1.9 \le Y \le 31.34)&=P(Y \le 31.34)-P(Y \le 1.9) \\&=0.7357-0.1075 \\&=0.6282  \end{aligned}

The z-score for the 95th percentile for the standard normal distribution is z = 1.645. Then the 95th percentile for the normal distribution with mean 2.5 and standard deviation 1.5 is x = 2.5 + 1.645 (1.5) = 4.9675. Then apply the exponential function to obtain e^{4.9675} \approx 143.67, which is the desired lognormal 95th percentile. \square
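The calculation in Example 1 can be checked with a few lines of code. The following sketch (assuming numpy and scipy) applies formulas (4) and (5) directly and then repeats the computation with scipy's built-in lognormal, which is parameterized by s=\sigma and scale=e^{\mu}. The answers differ slightly from the ones above because the table-based z-scores are rounded.

    import numpy as np
    from scipy.stats import norm, lognorm

    mu, sigma = 2.5, 1.5

    # P(1.9 <= Y <= 31.34) via formula (4)
    p = norm.cdf((np.log(31.34) - mu) / sigma) - norm.cdf((np.log(1.9) - mu) / sigma)
    print(p)                                        # about 0.628

    # 95th percentile via formula (5)
    print(np.exp(mu + norm.ppf(0.95) * sigma))      # about 143.6

    # the same results using scipy's lognormal directly
    Y = lognorm(s=sigma, scale=np.exp(mu))
    print(Y.cdf(31.34) - Y.cdf(1.9), Y.ppf(0.95))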

As (2) and Example 1 suggest, to find a lognormal percentile, first find the percentile for the corresponding normal distribution. If x_p is the 100pth percentile of the normal distribution, then e^{x_p} is the 100pth percentile of the lognormal distribution. Usually, we can first find the 100pth percentile for the standard normal distribution z_p. Then the normal percentile we need is x=\mu + z_p \cdot \sigma. The lognormal percentile is then:

    \displaystyle e^{\displaystyle \mu + z_p \cdot \sigma}=\text{lognormal } 100p \text{th percentile} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)

The above discussion shows that the explicit form of the lognormal density curve is not needed in computing lognormal probabilities and percentiles. For the sake of completeness, the following shows the probability density functions of both the normal distribution and the lognormal distribution.

    Normal PDF
    \displaystyle f_X(x)=\frac{1}{\sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(x-\mu)^2}{2 \sigma^2}} \ \ \ \ \ \ \ \ \ \ \ -\infty<x<\infty \ \ \ \ \ \ \ \ \ \ (6)

    Lognormal PDF
    \displaystyle f_Y(y)=\frac{1}{y \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(y)-\mu)^2}{2 \sigma^2}} \ \ \ \ \ 0<y<\infty \ \ \ \ \ \ \ \ \ \ \ \ \ (7)

The cumulative distribution function for the lognormal distribution is then

    \displaystyle F_Y(y)=\int_0^y \frac{1}{t \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(t)-\mu)^2}{2 \sigma^2}} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)

Of course, we do not have to use (8) since the lognormal CDF can be obtained based on the corresponding normal CDF.

One application of the lognormal PDF in (7) is to use it to find the mode (by taking its derivative and finding the critical value). The mode of the lognormal distribution with parameters \mu and \sigma is \displaystyle e^{\mu - \sigma^2}.

_______________________________________________________________________________________________

How to find lognormal moments

To find the mean and higher moments of the lognormal distribution, we once again rely on basic information about the normal distribution. For any random variable T (normal or otherwise), its moment generating function, if it exists, is defined by M_T(t)=E(e^{t \ T}). The following is the moment generating function of the normal distribution with mean \mu and standard deviation \sigma.

    \displaystyle M_X(t)=\text{\Large e}^{\displaystyle \biggl[ \mu t + (1/2) \sigma^2 t^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)

As before, let Y be a random variable that follows a lognormal distribution with parameters \mu and \sigma. Then Y=e^X where X is normal with mean \mu and standard deviation \sigma. Then E(Y)=E(e^X) is simply the normal moment generating function evaluated at 1. In fact, the kth moment of Y, E(Y^k)=E(e^{k X}), is simply the normal mgf evaluated at k. Because the mgf of the normal distribution is defined at any real number, all moments for the lognormal distribution exist. The following gives the moments explicitly.

    E(Y)=\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}

    E(Y^k)=\text{\Large e}^{\displaystyle \biggl[ k \mu+(k^2/2) \sigma^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (10)

In particular, the variance and standard deviation are:

    \displaystyle \begin{aligned}Var(Y)&=\text{\Large e}^{\displaystyle \biggl[ 2 \mu+2 \sigma^2 \biggr]}-\text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \\&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (11) \end{aligned}

    \displaystyle \sigma_Y=\sqrt{\text{\Large e}^{\displaystyle \sigma^2}-1} \ \ \text{\Large e}^{\displaystyle \biggl[ \mu+\frac{1}{2} \sigma^2 \biggr]}=\sqrt{\text{\Large e}^{\displaystyle \sigma^2}-1} \ E(Y) \ \ \ \ \ \ \ \ \ \ \ \ (12)

The formulas (11) and (12) give the variance and standard deviation if the parameters \mu and \sigma are known. They do not need to be committed to memory, since they can always be generated from knowing the moments in (10). As indicated before, the lognormal median is e^{\mu}, which is always less than the mean, which is e raised to \mu+(1/2) \sigma^2. So the mean is greater than the median by a factor of e raised to (1/2) \sigma^2. The mean being greater than the median is another sign that the lognormal distribution is skewed right.
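As a quick numerical check of formula (10) and the variance formula (11), the following sketch (assuming numpy and scipy; \mu=2 and \sigma=1 are illustrative) compares them with scipy's lognormal, parameterized by s=\sigma and scale=e^{\mu}.

    import numpy as np
    from scipy.stats import lognorm

    mu, sigma = 2.0, 1.0
    Y = lognorm(s=sigma, scale=np.exp(mu))

    print(Y.mean(), np.exp(mu + 0.5 * sigma**2))                          # E(Y) from (10)
    print(Y.var(), (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2))    # Var(Y) from (11)
    print(Y.moment(3), np.exp(3 * mu + (9 / 2) * sigma**2))               # E(Y^3) from (10)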

Example 2
Suppose Y follows a lognormal distribution with mean 12.18 and variance 255.02. Determine the probability that Y is greater than its mean.

With the given information, we have:

    E(Y)=\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}=12.18

    \biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]}=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \biggl(\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}\biggr)^2=255.02

    \displaystyle \begin{aligned}Var(Y)&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \\&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \biggl(\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}\biggr)^2  \\&=  \biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) 12.18^2=255.02    \end{aligned}

From the last equation, we can solve for \sigma. The following shows the derivation.

    \displaystyle \begin{aligned} &\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr)=\frac{255.02}{12.18^2}   \\&\text{\Large e}^{\displaystyle \sigma^2}=2.719014994  \\&\sigma^2=\log(2.719014994)=1.00026968 \end{aligned}

Thus we can take \sigma=1. Then plug \sigma=1 into E(Y) to get \mu=2. The desired probability is:

    \displaystyle \begin{aligned} P(Y>12.18)&=1-\Phi \biggl(\frac{\log(12.18)-2}{1}  \biggr) \\&=1-\Phi(0.499795262)  \\&=1-\Phi(0.5)=1-0.6915=0.3085  \ \square    \end{aligned}

_______________________________________________________________________________________________

Linear transformations

For any random variable W, a linear transformation of W is the random variable aW+b where a and b are real constants. It is well known that if X follows a normal distribution, any linear transformation of X also follows a normal distribution. Does this apply to lognormal distribution? A linear transformation of a lognormal distribution may not have distributional values over the entire positive x-axis. For example, if Y is lognormal, then Y+1 is technically not lognormal since the values of Y+1 lie in (1, \infty) and not (0, \infty). Instead, we focus on the transformations cY where c>0 is a constant. We have the following fact.

    If Y has a lognormal distribution with parameters \mu and \sigma, then cY has a lognormal distribution with parameters \mu+\log(c) and \sigma.

The effect of the constant adjustment of the lognormal distribution is on the \mu parameter, which is adjusted by adding the natural log of the constant c. Note that the adjustment on \mu is addition and not multiplication. The \sigma parameter is unchanged.

One application of the transformation cY is that of inflation. For example, suppose Y represents claim amounts in a given calendar year arising from a group of insurance policies. If the insurance company expects that the claims amounts in the next year will increase by 10%, then 1.1Y is the random variable that models next year’s claim amounts. If Y is assumed to be lognormal, then the effect of the 10% inflation is on the \mu parameter as indicated above.

To see why the inflation on Y works as described, let’s look at the cumulative distribution function of T=cY.

    \displaystyle \begin{aligned} P(T \le y)&=P(Y \le \frac{y}{c})=F_Y(y/c) \\&=\int_0^{y/c} \frac{1}{t \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(t)-\mu)^2}{2 \sigma^2}} \ dt  \end{aligned}

Taking derivative of the last item above, we obtain the probability density function f_T(y).

    \displaystyle \begin{aligned} f_T(y)&=\frac{1}{y/c \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(y/c)-\mu)^2}{2 \sigma^2}} (1/c) \\&=\frac{1}{y \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{\biggl(\log(y)-(\mu+\log(c))\biggr)^2}{2 \sigma^2}}  \end{aligned}

Comparing with the density function (7), the last line is the lognormal density function with parameters \mu + \log(c) and \sigma.
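The following is a small simulation sketch (assuming numpy and scipy; the values of \mu, \sigma and c are illustrative) of the inflation fact above: the distribution of cY matches a lognormal distribution with parameters \mu+\log(c) and \sigma.

    import numpy as np
    from scipy.stats import lognorm

    rng = np.random.default_rng(2017)
    mu, sigma, c = 2.0, 1.0, 1.1

    y = c * lognorm.rvs(s=sigma, scale=np.exp(mu), size=1_000_000, random_state=rng)
    shifted = lognorm(s=sigma, scale=np.exp(mu + np.log(c)))   # parameters mu + log(c) and sigma

    for t in (5.0, 10.0, 50.0):
        print(t, np.mean(y <= t), shifted.cdf(t))              # the two columns agree closely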

_______________________________________________________________________________________________

Distributional quantities involving higher moments

As the formula (10) shows, all moments exist for the lognormal distribution. As a result, any distributional quantity that is defined using moments can be described explicitly in terms of the parameters \mu and \sigma. We highlight three such distributional quantities: coefficient of variation, coefficient of skewness and kurtosis. The following shows their definitions. The calculation is done by plugging in the moments obtained from (10).

    \displaystyle CV=\frac{\sigma_Y}{\mu_Y} \ \ \ \ \ \text{(Coefficient of variation)}

    \displaystyle \begin{aligned} \gamma_1&=\frac{E[ (Y-\mu_Y)^3 ]}{\sigma_Y^3}  \\&=\frac{E(Y^3)-3 \mu_Y \sigma_Y^2-\mu_Y^3}{(\sigma_Y^2)^{\frac{3}{2}}}  \ \ \ \ \ \text{(Coefficient of skewness)}    \end{aligned}

    \displaystyle \begin{aligned} \beta_2&=\frac{E[ (Y-\mu_Y)^4 ]}{\biggl(E[ (Y-\mu_Y)^2]\biggr)^2}  \\&=\frac{E(Y^4)-4 \ \mu_Y \ E(Y^3) + 6 \ \mu_Y^2 \ E(Y^2) - 3 \ \mu_Y^4}{(\sigma_Y^2)^{2}}  \ \ \ \ \ \text{(Kurtosis)}    \end{aligned}

The above definitions are made for any random variable Y. The notations \mu_Y and \sigma_Y are the mean and standard deviation of Y, respectively. The coefficient of variation is the ratio of the standard deviation to the mean. It is a standardized measure of dispersion of a probability distribution or frequency distribution. The coefficient of skewness is defined as the ratio of the third central moment about the mean to the cube of the standard deviation. The second line in the above definition is an equivalent form that is in terms of the mean, variance and the third raw moment, which may be easier to calculate in some circumstances. Kurtosis is defined to be the ratio of the fourth central moment about the mean to the square of the second central moment about the mean. The second line in the definition gives an equivalent form that is in terms of the mean, variance and the second, third and fourth raw moments.

The above general definitions of CV, \gamma_1 and \beta_2 can be evaluated for the lognormal distribution. The mean, the variance and the higher raw moments are obtained by using (10). Then it is a matter of plugging the relevant items into the above definitions. The following example shows how this is done.

Example 3
Determine the CV, \gamma_1 and \beta_2 of the lognormal distribution in Example 2.

The calculation in Example 2 shows that the lognormal parameters are \mu=2 and \sigma=1. Now use formula (10) to get the ingredients.

    \mu_Y=e^{2.5}

    Var(Y)=(e-1) e^5 \ \ \ \ \ \ \ \ \sigma_Y=\sqrt{e-1} \ e^{2.5}

    E(Y^2)=e^6

    E(Y^3)=e^{10.5}

    E(Y^4)=e^{16}

Right away, CV = \sqrt{e-1}=1.31. The following shows the calculation for skewness and kurtosis.

    \displaystyle \begin{aligned} \gamma_1&=\frac{E(Y^3)-3 \mu_Y \sigma_Y^2-\mu_Y^3}{(\sigma_Y^2)^{\frac{3}{2}}}  \\&=\frac{e^{10.5}-3 \ e^{2.5} \ (e-1) \ e^5-(e^{2.5})^3}{[(e-1) \ e^5]^{\frac{3}{2}}}\\&=6.1849     \end{aligned}

    \displaystyle \begin{aligned} \beta_2&=\frac{E(Y^4)-4 \ \mu_Y \ E(Y^3) + 6 \ \mu_Y^2 \ E(Y^2) - 3 \ \mu_Y^4}{(\sigma_Y^2)^{2}}  \\&=\frac{e^{16}-4 \ e^{2.5} \ e^{10.5} + 6 \ (e^{2.5})^2 \ e^{6} - 3 \ (e^{2.5})^4}{[(e-1) \ e^5]^{2}}\\&=113.9364    \end{aligned}
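
These values can be checked with a short Python calculation (an added sketch, not part of the original post). It plugs the raw moments E(Y^n)=e^{n \mu + n^2 \sigma^2/2} from (10) into the definitions of CV, \gamma_1 and \beta_2 given above.

    import numpy as np

    mu, sigma = 2.0, 1.0

    # raw moments of the lognormal distribution: E(Y^n) = exp(n*mu + n^2*sigma^2/2)
    def raw_moment(n):
        return np.exp(n * mu + 0.5 * n**2 * sigma**2)

    m1, m2, m3, m4 = (raw_moment(n) for n in (1, 2, 3, 4))
    var = m2 - m1**2

    cv = np.sqrt(var) / m1
    skew = (m3 - 3 * m1 * var - m1**3) / var**1.5
    kurt = (m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4) / var**2

    print(cv, skew, kurt)    # approximately 1.31, 6.18, 113.94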

_______________________________________________________________________________________________

Is there a moment generating function for the lognormal distribution?

Because the normal distribution has a moment generating function, all moments exist for the lognormal distribution (see formula (10) above). Does the moment generating function exist for the lognormal distribution? Whenever the mgf exists for a distribution, its moments can be derived from the mgf. What about the converse? That is, when all moments exist for a given distribution, does it mean that its moment generating function would always exist? The answer is no. It turns out that the lognormal distribution is a counterexample. We conclude this post by showing this fact.

Let Y be the standard lognormal distribution, i.e., X=\log(Y) has the standard normal distribution. We show that the expectation E(e^{tY}) diverges to infinity for any t>0.

    \displaystyle \begin{aligned} E(e^{t \ Y})&=E(e^{t \ e^X}) \\&=\frac{1}{\sqrt{2 \pi}} \ \int_{-\infty}^\infty e^{t \ e^x} \ e^{-0.5 x^2} \ dx \\&=\frac{1}{\sqrt{2 \pi}} \ \int_{-\infty}^\infty e^{t \ e^x - 0.5x^2} \ dx \\&> \frac{1}{\sqrt{2 \pi}} \ \int_{0}^\infty e^{t \ e^x - 0.5x^2} \ dx \\&> \frac{1}{\sqrt{2 \pi}} \ \int_{0}^\infty e^{\biggl(t \ ( \displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}) - 0.5x^2 \biggr)} \ dx=\infty \end{aligned}

The last integral in the above derivation diverges to infinity. Note that the Taylor series expansion of e^x is \displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\frac{x^4}{4!}+\cdots. In the last step, e^x is replaced by the smaller quantity \displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}, which is a lower bound for x \ge 0. The exponent in the last integral is then a third degree polynomial with a positive coefficient for the x^3 term. This polynomial goes to infinity as x goes to infinity, so the integrand does not decay and the integral diverges. Since the last integral diverges, the expectation E(e^{t \ Y}) is infinite as well. Thus the lognormal distribution has no moment generating function even though all of its moments exist.

_______________________________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found here.

More practice problems are available: A summary of the lognormal basic properties, more lognormal practice problems.

_______________________________________________________________________________________________
\copyright 2017 – Dan Ma

The Chi-Squared Distribution, Part 3a

This post is part 3 of a three-part series on the chi-squared distribution. In this post, we discuss the roles played by the chi-squared distribution in experiments or random phenomena that result in measurements that are categorical rather than quantitative (part 2 deals with quantitative measurements). An introduction to the chi-squared distribution is found in part 1.

The chi-squared test discussed here is also referred to as Pearson’s chi-squared test, which was formulated by Karl Pearson in 1900. It can be used to assess three types of comparison on categorical variables – goodness of fit, homogeneity, and independence. As a result, we break up the discussion into 3 parts – part 3a (goodness of fit, this post), part 3b (test of homogeneity) and part 3c (test of independence).

_______________________________________________________________________________________________

Multinomial Experiments

Let’s look at the setting for Pearson’s goodness-of-fit test. Consider a random experiment consisting of a series of independent trials each of which results in exactly one of k categories. We are interested in summarizing the counts of the trials that fall into the k distinct categories. Some examples of such random experiments are:

  • In rolling a die n times, consider the counts of the faces of the die.
  • Perform a series of experiments each of which is a toss of three coins. Summarize the experiments according to the number of heads, 0, 1, 2, and 3, that occur in each experiment.
  • Blood donors can be classified into the blood types A, B, AB and O.
  • Record the number of automobile accidents per week in a one-mile stretch of highway. Classify the weekly accident counts into the groupings 0, 1, 2, 3, 4 and 5+.
  • A group of auto insurance policies is classified into the claim frequency rates of 0, 1, 2, 3, 4+ accidents per year.
  • Auto insurance claims are classified into various claim size groupings, e.g. under 1000, 1000 to 5000, 5000 to 10000 and 10000+.
  • In auditing financial transactions in financial documents (accounting statements, expense reports etc), the leading digits of financial figures can be classified into 9 cells: 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Each of these examples can be referred to as a multinomial experiment. The characteristics of such an experiment are:

  • The experiment consists of performing n identical trials that are independent.
  • For each trial, the outcome falls into exactly one of k categories or cells.
  • The probability of the outcome of a trial falling into a particular cell is constant across all trials.

For cell j, let p_j be the probability of the outcome falling into cell j. Of course, p_1+p_2+\cdots+p_k=1. We are interested in the joint random variables Y_1,Y_2,\cdots,Y_k where Y_j is the number of trials whose outcomes fall into cell j.

If k=2 (only two categories for each trial), then the experiment is a binomial experiment. Then one of the categories can be called success (with cell probability p) and the other is called failure (with cell probability 1-p). If Y_1 is the count of the successes, then Y_1 has a binomial distribution with parameters n and p.

In general, the variables Y_1,Y_2,\cdots,Y_k have a multinomial distribution. To be a little more precise, the random variables Y_1,Y_2,\cdots,Y_{k-1} have a multinomial distribution with parameters n and p_1,p_2,\cdots,p_{k-1}. Note that the last variable Y_k is determined by the others since Y_k=n-(Y_1+\cdots+Y_{k-1}).
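
As a quick illustration of the setup (a Python sketch added here, not part of the original post), a multinomial experiment can be simulated with numpy; the trial count and cell probabilities below are hypothetical and correspond to rolling a fair die.

    import numpy as np

    rng = np.random.default_rng(0)

    n = 240                        # number of independent trials
    p = np.full(6, 1/6)            # cell probabilities for a fair die, k = 6 cells

    # counts Y_1, ..., Y_6 of outcomes falling into each cell; they always sum to n
    counts = rng.multinomial(n, p)
    print(counts, counts.sum())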

In the discussion here, the objective is to make inference on the cell probabilities p_1,p_2,\cdots,p_k. The hypotheses in the statistical test are expressed in terms of specific values of p_j, j=1,2,\cdots,k. For example, the null hypothesis may be of the following form:

    H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k

where p_{j,0} are the hypothesized values of the cell probabilities. It is cumbersome to calculate the probabilities for the multinomial distribution. As a result, it would be difficult (if not impossible) to calculate the exact level of significance, which is the probability of a type I error. Thus it is critical to use a test statistic whose sampling distribution does not have to be computed directly from the multinomial distribution. Fortunately this problem was solved by Karl Pearson. He formulated a test statistic that has an approximate chi-squared distribution.

_______________________________________________________________________________________________

Test Statistic

The random variables Y_1,Y_2,\cdots,Y_k discussed above have a multinomial distribution with parameters n and p_1,p_2,\cdots,p_k. Of course, each p_j is the probability that the outcome of a trial falls into cell j. The marginal distribution of each Y_j is a binomial distribution with parameters n and p_j, with p_j being the probability of success. Thus the expected value and the variance of Y_j are E[Y_j]=n p_j and Var[Y_j]=n p_j (1-p_j). The following is the chi-squared test statistic.

    \displaystyle \chi^2=\sum \limits_{j=1}^k \frac{(Y_j-n \ p_j)^2}{n \ p_j}=\sum \limits_{j=1}^k \frac{(Y_j-E[Y_j])^2}{E[Y_j]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

The statistic defined in (1) was proposed by Karl Pearson in 1900. It is defined by summing the squares of the difference of the observed counts Y_j and the expected counts E[Y_j] where each squared difference is normalized by the expected count (i.e. divided by the expected count). On one level, the test statistic in (1) seems intuitive since it involves all the k deviations Y_j-E[Y_j]. If the observed values Y_j are close to the expected cell counts, then the test statistic in (1) would have a small value.

The chi-squared test statistic defined in (1) has an approximate chi-squared distribution when the number of trials n is large. The proof of this fact will not be discussed here. We demonstrate with the case for k=2.

    \displaystyle \begin{aligned} \sum \limits_{j=1}^2 \frac{(Y_j-n \ p_j)^2}{n \ p_j}&=\frac{p_2 \ (Y_1-n p_1)^2+p_1 \ (Y_2-n p_2)^2}{n \ p_1 \ p_2} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+p_1 \ ((n-Y_1)-n (1-p_1))^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+ p_1 \ (Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\biggl( \frac{Y_1-n p_1}{\sqrt{n \ p_1 \ (1-p_1)}} \biggr)^2 =\biggl( \frac{Y_1-E[Y_1]}{\sqrt{Var[Y_1]}} \biggr)^2 \end{aligned}

The quantity inside the brackets in the last step is approximately standard normal according to the central limit theorem. Since the square of a standard normal distribution has a chi-squared distribution with one degree of freedom (see Part 1), the last step in the above derivation has an approximate chi-squared distribution with 1 df.
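
The quality of this approximation in the k=2 case can be checked with a quick simulation (an added Python sketch, not part of the original post); the trial count n and cell probability p_1 below are hypothetical.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    n, p1 = 200, 0.3                       # hypothetical number of trials and cell probability

    # simulate the cell count Y_1 many times and form the k = 2 chi-squared statistic
    Y1 = rng.binomial(n, p1, size=100_000)
    stat = (Y1 - n * p1)**2 / (n * p1 * (1 - p1))

    # the empirical quantiles should be close to the chi-squared(1) quantiles
    for q in (0.5, 0.9, 0.95):
        print(q, np.quantile(stat, q), chi2.ppf(q, df=1))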

In order for the chi-squared distribution to provide an adequate approximation to the test statistic in (1), a rule of thumb requires that the expected cell counts E[Y_j] are at least five. The null hypothesis to be tested is that the cell probabilities p_j are certain specified values p_{j,0} for j=1,2,\cdots,k. The following is the formal statement.

    H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The null hypothesis is to be tested against all possible alternatives. In other words, the alternative hypothesis H_1 is the statement that p_j \ne p_{j,0} for at least one j.

The chi-squared test statistic in (1) can be used for a goodness-of-fit test, i.e. to test how well a probability model fits the sample data; in other words, to test whether the observed data come from a hypothesized probability distribution. Example 2 below will test whether the Poisson model is a good fit for a set of claim frequency data.

_______________________________________________________________________________________________

Degrees of Freedom

Now that we have addressed the distribution for the test statistic in (1), we need to address two more issues. One is the direction of the hypothesis test (one-tailed or two-tailed). The second is the degrees of freedom. The direction of the test is easy to see. Note that the chi-squared test statistic in (1) is always nonnegative. If the differences between the observed cell counts and the expected cell counts are large, the resulting large value of the statistic would contradict the null hypothesis. Thus if the chi-squared statistic has a large value, we should reject the null hypothesis. So the correct test to use is the upper-tailed chi-squared test.

The number of degrees of freedom is obtained by subtracting one from the cell count k for each independent linear restriction placed on the cell probabilities. There is at least one linear restriction. The sum of all the cell probabilities must be 1. Thus the degrees of freedom must be the result of reducing k by one at least one time. This means that the degrees of freedom of the chi-squared statistic in (1) is at most k-1.

Furthermore, in the calculation of the specified cell probabilities, if there is any parameter that is unknown and has to be estimated from the data, then k-1 is further reduced by one for each parameter so estimated. If there is any unknown parameter that needs to be estimated from data, a maximum likelihood estimator (MLE) should be used. All these points will be demonstrated by the examples below.

If the value of the chi-squared statistic in (1) is “large”, we reject the null hypothesis H_0 stated in (2). By large we mean that the value of the chi-squared statistic exceeds the critical value for the desired level of significance. The critical value is the point in the chi-squared distribution (with the appropriate df) whose upper-tail area is \alpha, where \alpha is the desired level of significance (e.g. \alpha=0.05, \alpha=0.01 or some other appropriate level). Instead of using the critical value, the p-value approach can also be used. The critical value or the p-value can be looked up in a table or computed using software. For the examples below, chi-squared functions in Excel are used.

_______________________________________________________________________________________________

Examples

Example 1
Suppose that we wish to test whether a given die is a fair die. We roll the die 240 times and the following table shows the results.

    \displaystyle \begin{array} {rr} \text{Cell} & \text{Frequency} \\ 1 & 38  \\ 2 & 35 \\ 3 & 37 \\ 4 & 38  \\ 5 & 42 \\ 6 & 50 \\ \text{ } & \text{ } \\ \text{Total} & 240   \end{array}

The null hypothesis

    \displaystyle H_0: p_1=p_2=p_3=p_4=p_5=p_6=\frac{1}{6}

is tested against the alternative that at least one of the equalities is not true. This example is the simplest problem for testing cell probabilities. Since the specified values for the cell probabilities in H_0 are known, the number of degrees of freedom is one less than the cell count. Thus df = 5. The following is the chi-squared statistic based on the data and the null hypothesis.

    \displaystyle \begin{aligned} \chi^2&=\frac{(38-40)^2}{40}+\frac{(35-40)^2}{40}+\frac{(37-40)^2}{40} \\& \ \ + \frac{(38-40)^2}{40}+\frac{(42-40)^2}{40}+\frac{(50-40)^2}{40}=3.65  \end{aligned}

At \alpha=0.05 level of significance, the chi-squared critical value at df = 5 is \chi_{0.05}^2(5)=11.07049769. Since 3.65 < 11.07, the hypothesis that the die is fair is not rejected at \alpha=0.05. The p-value is P[\chi^2 > 3.65]=0.6. With such a large p-value, we also come to the conclusion that the null hypothesis is not rejected. \square

In all the examples, the critical values and the p-values are obtained by using the following functions in Excel.

    critical value
    =CHISQ.INV.RT(level of significance, df)

    p-value
    =1 - CHISQ.DIST(test statistic, df, TRUE)
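
For readers working in Python instead of Excel, the following sketch (an addition, not part of the original post) reproduces the Example 1 calculation with scipy; the function scipy.stats.chisquare computes the statistic in (1) with df equal to one less than the number of cells.

    from scipy.stats import chi2, chisquare

    observed = [38, 35, 37, 38, 42, 50]
    expected = [40] * 6

    stat, p_value = chisquare(observed, expected)    # 3.65 and a p-value of about 0.60
    critical = chi2.ppf(0.95, df=5)                  # about 11.07
    print(stat, p_value, critical)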

Example 2
We now give an example for the chi-squared goodness-of-fit test. The number of auto accident claims per year from 700 drivers are recorded by an insurance company. The claim frequency data is shown in the following table.

    \displaystyle \begin{array}{rrr} \text{Claim Count} & \text{ } & \text{Frequency} \\ 0 & \text{ } & 351  \\ 1 & \text{ } & 241 \\ 2 & \text{ } & 73 \\ 3 & \text{ } & 29  \\ 4 & \text{ } & 6 \\ 5+ & \text{ } & 0 \\ \text{ } & \text{ } & \text{ } \\ \text{Total} & \text{ } & 700   \end{array}

Test the hypothesis that the annual claim count for a driver has a Poisson distribution. Use \alpha=0.05. Assume that the claim frequencies across the drivers in question are independent.

The hypothesized distribution of the annual claim frequency is a Poisson distribution with unknown mean \lambda. The MLE of the parameter \lambda is the sample mean, which in this case is \hat{\lambda}=\frac{498}{700}=0.711428571.

Under the assumption that the claim frequency is Poisson with mean \hat{\lambda}, the cell probabilities are calculated using \hat{\lambda}.

    \displaystyle p_1=P[Y=0]=e^{-\hat{\lambda}}=0.4909

    \displaystyle p_2=P[Y=1]=\hat{\lambda} \ e^{-\hat{\lambda}}=0.3493

    \displaystyle p_3=P[Y=2]=\frac{1}{2} \ \hat{\lambda}^2 \ e^{-\hat{\lambda}}=0.1242

    \displaystyle p_4=P[Y=3]=\frac{1}{3!} \ \hat{\lambda}^3 \ e^{-\hat{\lambda}}=0.0295

    \displaystyle p_5=P[Y \ge 4]=1-P[Y=0]-P[Y=1]-P[Y=2]-P[Y=3]=0.0061

Then the null hypothesis is:

    H_0: p_1=0.4909, p_2=0.3493, p_3=0.1242, p_4=0.0295, p_5=0.0061

The null hypothesis is tested against all alternatives. The following table shows the calculation of the chi-squared statistic.

    \displaystyle \begin{array}{rrrrrrrrr}   \text{Cell} & \text{ } & \text{Claim Count} & \text{ } & \text{Cell Probability} & \text{ } & \text{Expected Count} & \text{ } & \text{Chi-Squared}   \\ 1 & \text{ } & 0 & \text{ } & 0.4909 & \text{ } & 343.63 & \text{ } & 0.15807   \\ 2 & \text{ } & 1 & \text{ } & 0.3493 & \text{ } & 244.51 & \text{ } & 0.05039   \\ 3 & \text{ } & 2 & \text{ } & 0.1242 & \text{ } & 86.94 & \text{ } & 2.23515   \\ 4 & \text{ } & 3 & \text{ } & 0.0295 & \text{ } & 20.65 & \text{ } & 3.37639    \\ 5 & \text{ } & 4+ & \text{ } & 0.0061 & \text{ } & 4.27 & \text{ } & 0.70091   \\ \text{ } & \text{ } & \text{ }   \\ \text{Total} & \text{ } & \text{ } & \text{ } & 1.0000 & \text{ } & \text{ } & \text{ } & 6.52091   \end{array}

The degrees of freedom of the chi-squared statistic is df = 5 - 1 - 1 = 3. The first reduction of one is due to the linear restriction of all cell probabilities summing to 1. The second reduction is due to the fact that one unknown parameter \lambda has to be estimated using sample data. Using Excel, the critical value is \chi_{0.05}^2(3)=7.814727903. The p-value is P[\chi^2 > 6.52091]=0.088841503. Thus the null hypothesis is not rejected at the level of significance \alpha=0.05. \square
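
The whole Example 2 calculation can be scripted. The following Python sketch (an addition for illustration, not part of the original post) estimates \lambda by the sample mean, builds the cell probabilities from the Poisson distribution, and evaluates the statistic and the p-value with df = 3; small differences from the table above come from the table rounding the cell probabilities to four decimal places.

    import numpy as np
    from scipy.stats import poisson, chi2

    counts = np.array([351, 241, 73, 29, 6])     # observed frequencies for 0, 1, 2, 3, 4+ claims
    claims = np.arange(5)
    n = counts.sum()                             # 700 drivers

    lam = (claims * counts).sum() / n            # MLE of lambda = 498/700

    # cell probabilities P(Y=0), ..., P(Y=3) and P(Y>=4) under the fitted Poisson model
    p = np.append(poisson.pmf(np.arange(4), lam), poisson.sf(3, lam))
    expected = n * p

    stat = ((counts - expected)**2 / expected).sum()    # close to the 6.52 in the table above
    df = len(counts) - 1 - 1                            # one more df lost for estimating lambda
    p_value = chi2.sf(stat, df)                         # close to 0.09
    print(stat, df, p_value)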

Example 3
For many data sets, especially data sets with numbers spread across multiple orders of magnitude, the first digits occur according to the probability distribution indicated below:

    Probability for leading digit 1 = 0.301
    Probability for leading digit 2 = 0.176
    Probability for leading digit 3 = 0.125
    Probability for leading digit 4 = 0.097
    Probability for leading digit 5 = 0.079
    Probability for leading digit 6 = 0.067
    Probability for leading digit 7 = 0.058
    Probability for leading digit 8 = 0.051
    Probability for leading digit 9 = 0.046

This probability distribution was discovered by Simon Newcomb in 1881 and was rediscovered by the physicist Frank Benford in 1938. Since then this distribution has become known as Benford’s law. Thus in many data sets, the leading digit 1 occurs about 30% of the time. The data sets for which this law is applicable include demographic data (e.g. income data of a large population, census data such as populations of cities and counties) and scientific data. The law is also applicable to certain financial data, e.g. tax data, stock exchange data, corporate disbursement and sales data. Thus Benford’s law is a useful tool for forensic accounting and auditing.

The following shows the distribution of first digits in the population counts of all 3,143 counties in the United States (from US census data).

    Count for leading digit 1 = 972
    Count for leading digit 2 = 573
    Count for leading digit 3 = 376
    Count for leading digit 4 = 325
    Count for leading digit 5 = 205
    Count for leading digit 6 = 209
    Count for leading digit 7 = 179
    Count for leading digit 8 = 155
    Count for leading digit 9 = 149

Use the chi-squared goodness-of-fit test to test the hypothesis that the leading digits in the county population data follow Benford’s law. A sketch of the calculation is given below. This example is also discussed in this blog post. \square
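
As a sketch of how the test could be carried out (an added Python example; the full discussion is in the linked post), the following computes the expected counts under Benford’s law, where the probability of leading digit d is \log_{10}(1+1/d), and evaluates the chi-squared statistic with df = 8.

    import numpy as np
    from scipy.stats import chi2

    observed = np.array([972, 573, 376, 325, 205, 209, 179, 155, 149])

    # Benford probabilities for leading digits 1 through 9: log10(1 + 1/d)
    digits = np.arange(1, 10)
    benford = np.log10(1 + 1 / digits)

    expected = observed.sum() * benford
    stat = ((observed - expected)**2 / expected).sum()
    p_value = chi2.sf(stat, df=len(digits) - 1)    # no parameters estimated, so df = 9 - 1 = 8
    print(stat, p_value)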

For further information and more examples on the chi-squared test, please see the sources listed in the reference section.

_______________________________________________________________________________________________

Reference

  1. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 7th ed., W. H. Freeman and Company, New York, 2012
  2. Wackerly D. D., Mendenhall III W., Scheaffer R. L., Mathematical Statistics with Applications, Thomson Learning, Inc, California, 2008

_______________________________________________________________________________________________
\copyright \ 2017 - \text{Dan Ma}

The Chi-Squared Distribution, Part 2

This is part 2 of a three-part series on the chi-squared distribution. In this post, we discuss several theorems, all centered around the chi-squared distribution, that play important roles in inferential statistics for the population mean and population variance of normal populations. These theorems are the basis for the test statistics used in the inferential procedures.

We first discuss the setting for the inference procedures. Then we discuss the pivotal theorem (Theorem 5). We then proceed to discuss the theorems that produce the test statistics for \mu, the population mean of a normal population, and for \mu_1-\mu_2, the difference of two population means from two normal populations. The discussion then shifts to the inference procedures on population variance.

_______________________________________________________________________________________________

The Settings

To facilitate the discussion, we use the notation \mathcal{N}(\mu,\sigma^2) to denote the normal distribution with mean \mu and variance \sigma^2. Whenever the random variable X follows such a distribution, we use the notation X \sim \mathcal{N}(\mu,\sigma^2).

The setting for making inference on one population is that we have a random sample Y_1,Y_2,\cdots,Y_n, drawn from a normal population \mathcal{N}(\mu,\sigma^2). The sample mean \overline{Y} and the sample variance S^2 are unbiased estimators of \mu and \sigma^2, respectively, given by:

    \displaystyle \overline{Y}=\frac{Y_1+\cdots+Y_n}{n}=\frac{1}{n} \sum \limits_{j=1}^n Y_j \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

    \displaystyle S^2=\frac{1}{n-1} \sum \limits_{j=1}^n (Y_j-\overline{Y})^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The goal is to use the information obtained from the sample, namely \overline{Y} and S^2, to estimate or make decisions about the unknown population parameters \mu and \sigma^2.

Because the sample is drawn from a normal population, the sample mean \overline{Y} has a normal distribution, more specifically \overline{Y} \sim \mathcal{N}(\mu,\frac{\sigma^2}{n}), which has two unknown parameters. To perform inferential procedures on the population mean \mu, it is preferable to have a test statistic that depends on \mu only. To this end, a t-statistic is used (see Theorem 7), which has the t-distribution with n-1 degrees of freedom (one less than the sample size). Because the parameter \sigma^2 is replaced by the sample variance S^2, the t-statistic has only \mu as the unknown parameter.

On the other hand, to perform inferential procedures on the population variance \sigma^2, we use a statistic that has a chi-squared distribution and that has only one unknown parameter \sigma^2 (see Theorem 5).

Now, the setting for performing inference on two normal populations. Let X_1,X_2,\cdots,X_n be a random sample drawn from the distribution \mathcal{N}(\mu_X,\sigma_X^2). Let Y_1,Y_2,\cdots,Y_m be a random sample drawn from the distribution \mathcal{N}(\mu_Y,\sigma_Y^2). Because the two samples are independent, the difference of the sample means \overline{X}-\overline{Y} has a normal distribution. Specifically, \overline{X}-\overline{Y} \sim \mathcal{N}(\mu_X-\mu_Y,\frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{m}). Theorem 8 gives a t-statistic that is in terms of the difference \mu_X-\mu_Y such that the two unknown population variances are replaced by the pooled sample variance. This is done with the simplifying assumption that the two population variances are identical.

On the other hand, for inference on the population variances \sigma_X^2 and \sigma_Y^2, a statistic that has the F-distribution can be used (see Theorem 10). One caveat is that this test statistic is sensitive to non-normality.

_______________________________________________________________________________________________

Connection between Normal Distribution and Chi-squared Distribution

There is an intimate relation between the sample items from a normal distribution and the chi-squared distribution. This is discussed in Part 1. Let’s recall this connection. If we normalize one sample item Y_j and then square it, we obtain a chi-squared random variable with df = 1. Likewise, if we normalize each of the n sample items and then square them, the sum of the squares is a chi-squared random variable with df = n. The following results are discussed in Part 1 and are restated here for clarity.

Theorem 2
Suppose that the random variable X follows a standard normal distribution, i.e. the normal distribution with mean 0 and standard deviation 1. Then Y=X^2 follows a chi-squared distribution with 1 degree of freedom.

Corollary 3
Suppose that the random variable X follows a normal distribution with mean \mu and standard deviation \sigma. Then Y=[(X-\mu) / \sigma]^2 follows a chi-squared distribution with 1 degree of freedom.

Corollary 4
Suppose that X_1,X_2,\cdots,X_n is a random sample drawn from a normal distribution with mean \mu and standard deviation \sigma. Then the following random variable follows a chi-squared distribution with n degrees of freedom.

    \displaystyle \sum \limits_{j=1}^n \biggl( \frac{X_j-\mu}{\sigma} \biggr)^2=\biggl( \frac{X_1-\mu}{\sigma} \biggr)^2+\biggl( \frac{X_2-\mu}{\sigma} \biggr)^2+\cdots+\biggl( \frac{X_n-\mu}{\sigma} \biggr)^2

_______________________________________________________________________________________________

A Pivotal Theorem

The statistic in Corollary 4 has two unknown parameters \mu and \sigma^2. It turns out that the statistic will become more useful if \mu is replaced by the sample mean \overline{Y}. The cost is that one degree of freedom is lost in the chi-squared distribution. The following theorem gives the details. The result is a statistic that is a function of the sample variance S^2 and the population variance \sigma^2.

Theorem 5
Let Y_1,Y_2,\cdots,Y_n be a random sample drawn from a normal distribution with mean \mu and variance \sigma^2. Then the following conditions hold.

  • The sample mean \overline{Y} and the sample variance S^2 are independent.
  • The statistic \displaystyle \frac{(n-1) S^2}{\sigma^2}=\frac{1}{\sigma^2} \sum \limits_{j=1}^n (Y_j-\overline{Y})^2 has a chi-squared distribution with n-1 degrees of freedom.

Proof of Theorem 5
We do not prove the first bullet point. For a proof, see Exercise 13.93 in [2]. For the second bullet point, note that

    \displaystyle \begin{aligned}\sum \limits_{j=1}^n \biggl( \frac{Y_j-\mu}{\sigma} \biggr)^2&=\sum \limits_{j=1}^n \biggl( \frac{(Y_j-\overline{Y})+(\overline{Y}-\mu)}{\sigma} \biggr)^2 \\&=\sum \limits_{j=1}^n \biggl( \frac{Y_j-\overline{Y}}{\sigma} \biggr)^2 +\frac{n (\overline{Y}-\mu)^2}{\sigma^2} \\&=\frac{(n-1) S^2}{\sigma^2} +\biggl( \frac{\overline{Y}-\mu}{\frac{\sigma}{\sqrt{n}}} \biggr)^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)\end{aligned}

Note that in expanding [(Y_j-\overline{Y})+(\overline{Y}-\mu)]^2, the sum of the middle terms equals 0. Furthermore the result (3) can be restated as follows:

    \displaystyle Q=\frac{(n-1) S^2}{\sigma^2}+Z^2

where Q=\sum \limits_{j=1}^n \biggl( \frac{Y_j-\mu}{\sigma} \biggr)^2 and Z=\frac{\overline{Y}-\mu}{\frac{\sigma}{\sqrt{n}}}. Note that Z is a standard normal random variable. Thus Z^2 has a chi-squared distribution with df = 1 (by Theorem 2). Since Q is an independent sum of squares of standardized normal variables, Q has a chi-squared distribution with n degrees of freedom. Furthermore, since \overline{Y} and S^2 are independent, Z^2 and S^2 are independent. Let H=\frac{(n-1) S^2}{\sigma^2}. As a result, H and Z^2 are independent. The following gives the moment generating function (MGF) of Q.

    \displaystyle \begin{aligned}E[e^{t \ Q}]&=E[e^{t \ (H+Z^2)}] \\&=E[e^{t \ H}] \ E[e^{t \ Z^2}]  \end{aligned}

Since Q and Z^2 follow chi-squared distributions, we can plug in the chi-squared MGFs to obtain the MGF of the random variable H.

    \displaystyle \biggl(\frac{1}{1-2t} \biggr)^{\frac{n}{2}}=E[e^{t \ H}] \ \biggl(\frac{1}{1-2t} \biggr)^{\frac{1}{2}}

    \displaystyle E[e^{t \ H}]=\biggl(\frac{1}{1-2t} \biggr)^{\frac{n-1}{2}}

The MGF for H is that of a chi-squared distribution with n-1 degrees of freedom. \square

Remark
It is interesting to compare the following two quantities:

    \displaystyle \sum \limits_{j=1}^n \biggl( \frac{Y_j-\mu}{\sigma} \biggr)^2

    \displaystyle \frac{(n-1) S^2}{\sigma^2}=\frac{1}{\sigma^2} \sum \limits_{j=1}^n (Y_j-\overline{Y})^2=\sum \limits_{j=1}^n \biggl( \frac{Y_j-\overline{Y}}{\sigma} \biggr)^2

The first quantity is from Corollary 4 and has a chi-squared distribution with n degrees of freedom. The second quantity is from Theorem 5 and has a chi-squared distribution with n-1 degrees of freedom. Thus the effect of Theorem 5 is that by replacing the population mean \mu with the sample mean \overline{Y}, one degree of freedom is lost in the chi-squared distribution.

Theorem 5 is a pivotal theorem that has wide applications. For our purposes at hand, it can be used for inference on both the mean and the variance. Even though one degree of freedom is lost, the statistic \displaystyle \frac{(n-1) S^2}{\sigma^2} is a function of one unknown parameter, namely the population variance \sigma^2. Since the sampling distribution is known (chi-squared), we can make probability statements about the statistic. Hence the statistic is useful for making inference about the population variance \sigma^2. As we will see below, in conjunction with other statistics, the statistic in Theorem 5 can be used for inference on two population variances as well as for inference on the mean (one sample and two samples).
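
Theorem 5 can also be illustrated by simulation. The following is a minimal Python sketch (an addition, not part of the original post); the population parameters and the sample size are hypothetical.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    mu, sigma, n = 10.0, 2.0, 8            # hypothetical normal population and sample size

    # draw many samples of size n and compute (n-1)S^2/sigma^2 for each sample
    samples = rng.normal(mu, sigma, size=(100_000, n))
    S2 = samples.var(axis=1, ddof=1)       # sample variance S^2
    stat = (n - 1) * S2 / sigma**2

    # the empirical mean and variance should be close to n-1 and 2(n-1)
    print(stat.mean(), stat.var(), chi2.mean(n - 1), chi2.var(n - 1))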

_______________________________________________________________________________________________

Basis for Inference on Population Mean

Inference on the population mean of a single normal population and on the difference of the means of two independent normal populations relies on the t-statistic. Theorem 6 shows how to obtain a t-statistic using a chi-squared statistic and the standard normal statistic. Theorem 7 provides the one-sample t-statistic and Theorem 8 provides the two-sample t-statistic.

Theorem 6
Let Z be the standard normal random variable. Let U be a random variable that has a chi-squared distribution with r degrees of freedom. Then the random variable

    \displaystyle T=\frac{Z}{\sqrt{\frac{U}{r}}}

has a t-distribution with r degrees of freedom and its probability density function (PDF) is

    \displaystyle g(t)=\frac{\Gamma(\frac{r+1}{2})}{\sqrt{\pi r} \Gamma(\frac{r}{2}) \biggl(1+\frac{t^2}{r} \biggr)^{\frac{r+1}{2}}} \ \ \ \ \ \ \ \ \ -\infty <t< \infty

Remark
The probability density function given here is not important for the purpose at hand. For the proof of Theorem 6, see [2]. The following two theorems give two applications of Theorem 6.

Theorem 7
Let Y_1,Y_2,\cdots,Y_n be a random sample drawn from a normal distribution with mean \mu and variance \sigma^2. Let S^2 be the sample variance defined in (2). Then the random variable

    \displaystyle T=\frac{\overline{Y}-\mu}{\frac{S}{\sqrt{n}}}

has a t-distribution with n-1 degrees of freedom.

Proof of Theorem 7
Consider the following statistics.

    \displaystyle Z=\frac{\overline{Y}-\mu}{\frac{\sigma}{\sqrt{n}}}

    \displaystyle U=\frac{(n-1) \ S^2}{\sigma^2}

Note that Z has the standard normal distribution. By Theorem 5, the quantity U has a chi-squared distribution with df = n-1. By Theorem 6, the following quantity has a t-distribution with df = n-1.

    \displaystyle T=\frac{Z}{\sqrt{\frac{U}{n-1}}}=\frac{\overline{Y}-\mu}{\frac{S}{\sqrt{n}}}

The above result is obtained after performing algebraic simplification. \square

Theorem 8
Let X_1,X_2,\cdots,X_n be a random sample drawn from a normal distribution with mean \mu_X and variance \sigma_X^2. Let Y_1,Y_2,\cdots,Y_m be a random sample drawn from a normal distribution with mean \mu_Y and variance \sigma_Y^2. Suppose that \sigma_X^2=\sigma_Y^2=\sigma^2. Then the following statistic:

    \displaystyle T=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{S_p \ \sqrt{\frac{1}{n}+\frac{1}{m}}}

has a t-distribution with df = n+m-2 where \displaystyle S_p^2=\frac{(n-1) \ S_X^2+ (m-1) \ S_Y^2}{n+m-2}.

Note that S_p^2 is the pooled variance of the two sample variances S_X^2 and S_Y^2.

Proof of Theorem 8
First, the sample mean \overline{X} has a normal distribution with mean \mu_X and variance \frac{\sigma_X^2}{n}. The sample mean \overline{Y} has a normal distribution with mean \mu_Y and variance \frac{\sigma_Y^2}{m}. Since the two samples are independent, \overline{X} and \overline{Y} are independent. Thus the difference \overline{X}-\overline{Y} has a normal distribution with mean \mu_X-\mu_Y and variance \frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{m}. The following is a standardized normal random variable:

    \displaystyle Z=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{m}}}

On the other hand, by Theorem 5 the following quantities have chi-squared distributions with degrees of freedom n-1 and m-1, respectively.

    \displaystyle \frac{(n-1) S_X^2}{\sigma_X^2}=\frac{\sum \limits_{j=1}^n (X_j-\overline{X})^2}{\sigma_X^2}

    \displaystyle \frac{(m-1) S_Y^2}{\sigma_Y^2}=\frac{\sum \limits_{j=1}^m (Y_j-\overline{Y})^2}{\sigma_Y^2}

Because the two samples are independent, the two chi-squared statistics are independent. Then the following is a chi-squared statistic with n+m-2 degrees of freedom.

    \displaystyle U=\frac{(n-1) S_X^2}{\sigma_X^2}+\frac{(m-1) S_Y^2}{\sigma_Y^2}

By Theorem 6, the following ratio

    \displaystyle T=\frac{Z}{\sqrt{\frac{U}{n+m-2}}}

has a t-distribution with n+m-2 degrees of freedom. Here’s where the simplifying assumption of \sigma_X^2=\sigma_Y^2=\sigma^2 is used. Plugging in this assumption gives the following:

    \displaystyle T=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{\sqrt{\frac{(n-1) S_X^2+(m-1) S_Y^2}{n+m-2} \ (\frac{1}{n}+\frac{1}{m})}}=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{S_p \ \sqrt{\frac{1}{n}+\frac{1}{m}}}

where S_p^2 is the pooled sample variance of the two samples as indicated above. \square
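
The pooled t-statistic of Theorem 8 is straightforward to compute. The following Python sketch (an addition, not from the original post) uses two hypothetical samples and checks the hand-rolled statistic against scipy’s equal-variance two-sample t-test.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    x = rng.normal(5.0, 2.0, size=12)      # hypothetical sample of size n = 12
    y = rng.normal(5.0, 2.0, size=15)      # hypothetical sample of size m = 15
    n, m = len(x), len(y)

    # pooled sample variance S_p^2 and the t-statistic under H_0: mu_X = mu_Y
    Sp2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)
    t_stat = (x.mean() - y.mean()) / np.sqrt(Sp2 * (1/n + 1/m))

    # scipy's equal-variance two-sample t-test gives the same statistic
    print(t_stat, ttest_ind(x, y, equal_var=True).statistic)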

_______________________________________________________________________________________________

Basis for Inference on Population Variance

As indicated above, the statistic \displaystyle \frac{(n-1) S^2}{\sigma^2} given in Theorem 5 can be used for inference on the variance of a normal population. The following theorem gives the basis for the statistic used for comparing the variances of two normal populations.

Theorem 9
Suppose that the random variables U and V are independent chi-squared random variables with r_1 and r_2 degrees of freedom, respectively. Then the statistic

    \displaystyle F=\frac{U / r_1}{V / r_2}

has an F-distribution with r_1 and r_2 degrees of freedom.

Remark
The F-distribution depends on two parameters r_1 and r_2. The order they are given is important. We regard the first parameter as the degrees of freedom of the chi-squared distribution in the numerator and the second parameter as the degrees of freedom of the chi-squared distribution in the denominator.

It is not important to know the probability density functions for both the t-distribution and the F-distribution (in both Theorem 6 and Theorem 9). When doing inference procedures with these distributions, either tables or software will be used.

Given two independent normal random samples X_1,X_2,\cdots,X_n and Y_1,Y_2,\cdots,Y_m (as discussed in the above section on the settings of inference), the sample variance S_X^2 is an unbiased estimator of the population variance \sigma_X^2 of the first population, and the sample variance S_Y^2 is an unbiased estimator of the population variance \sigma_Y^2 of the second population. It seems to make sense that the ratio \displaystyle \frac{S_X^2}{S_Y^2} can be used to make inference about the relative magnitude of \sigma_X^2 and \sigma_Y^2. The following theorem indicates that this is a valid approach.

Theorem 10
Let X_1,X_2,\cdots,X_n be a random sample drawn from a normal distribution with mean \mu_X and variance \sigma_X^2. Let Y_1,Y_2,\cdots,Y_m be a random sample drawn from a normal distribution with mean \mu_Y and variance \sigma_Y^2. Then the statistic

    \displaystyle \frac{S_X^2 \ / \ \sigma_X^2}{S_Y^2 \ / \ \sigma_Y^2}=\frac{S_X^2}{S_Y^2} \times \frac{\sigma_Y^2}{\sigma_X^2}

has the F-distribution with degrees of freedom n-1 and m-1.

Proof of Theorem 10
By Theorem 5, \displaystyle \frac{(n-1) S_X^2}{\sigma_X^2} has a chi-squared distribution with n-1 degrees of freedom and \displaystyle \frac{(m-1) S_Y^2}{\sigma_Y^2} has a chi-squared distribution with m-1 degrees of freedom. By Theorem 9, the following statistic

    \displaystyle \frac{[(n-1) S_X^2 \ / \ \sigma_X^2] \ / \ (n-1)}{[(m-1) S_Y^2 \ / \ \sigma_Y^2] \ / \ (m-1)}

has the F-distribution with n-1 and m-1 degrees of freedom. The statistic is further simplified to become the statistic as stated in the theorem. \square
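
A minimal Python sketch of the Theorem 10 statistic (an addition, not part of the original post): under H_0: \sigma_X^2=\sigma_Y^2, the ratio S_X^2/S_Y^2 is referred to the F-distribution with n-1 and m-1 degrees of freedom. The samples below are hypothetical.

    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 3.0, size=10)      # hypothetical sample, n = 10
    y = rng.normal(0.0, 3.0, size=13)      # hypothetical sample, m = 13
    n, m = len(x), len(y)

    # under H_0: sigma_X^2 = sigma_Y^2, the ratio of sample variances is an F statistic
    F_stat = x.var(ddof=1) / y.var(ddof=1)

    # two-sided p-value based on the F-distribution with n-1 and m-1 degrees of freedom
    p_right = f.sf(F_stat, n - 1, m - 1)
    p_value = 2 * min(p_right, 1 - p_right)
    print(F_stat, p_value)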

_______________________________________________________________________________________________

Concluding Remarks

Theorem 7 and Theorem 8 produce the one-sample t-statistic and two-sample t-statistic, respectively. They are the basis of procedures for inference about one population mean and the difference of two population means, respectively. They can be used for estimation (e.g. construction of confidence intervals) or decision making (e.g. hypothesis testing). On the other hand, Theorem 5 produces a chi-squared statistic for inference about one population variance. Theorem 10 produces an F-statistic that can be used for inference about two population variances. Since the F-statistic involves the ratio of the two sample variances, it can be used for inference about the relative magnitude of the two population variances.

The purpose in this post is to highlight the important roles of the chi-squared distribution. We now discuss briefly the quality of the derived statistical procedures. The procedures discussed here (t or F) are exactly correct if the populations from which the samples are drawn are normal. Real life data usually do not exactly follow normal distributions. Thus the usefulness of these statistics in practice depends on how strongly they are affected by non-normality. In other words, if there is a significant deviation from the assumption of normal distribution, are these procedures still reliable?

A statistical inference procedure is called robust if the calculated results drawn from the procedure are insensitive to deviations of assumptions. For a non-robust procedure, the result would be distorted if there is deviation from assumptions. For example, the t procedures are not robust against outliers. The presence of outliers in the data can distort the results since the t procedures are based on the sample mean \overline{x} and sample variance S^2, which are not resistant to outliers.

On the other hand, the t procedures for inference about means are quite robust against slight deviations from the normal population assumption. The F procedures for inference about variances are not so robust, so they must be used with care. Even if there is a slight deviation from the normal assumptions, the results from the F procedures may not be reliable. For a more detailed but accessible discussion on robustness, see [1].

When the sample sizes are large, the sampling distribution of the sample mean \overline{x} is close to normal (this result is the central limit theorem), so the concern about deviation from the normality assumption is less important and the Z statistic can simply be used for inference about the means. Likewise, when the sample sizes are large, the sample variance S^2 is an accurate estimate of the population variance \sigma^2 regardless of the population distribution (a fact related to the law of large numbers). Thus the statistical procedures described here are intended for small samples drawn from populations that can be assumed to be normal.

_______________________________________________________________________________________________

Reference

  1. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 7th ed., W. H. Freeman and Company, New York, 2012
  2. Wackerly D. D., Mendenhall III W., Scheaffer R. L., Mathematical Statistics with Applications, Thomson Learning, Inc, California, 2008

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}

The Chi-Squared Distribution, Part 1

The chi-squared distribution has a simple definition from a mathematical standpoint and yet plays an important role in statistical sampling theory. This post is the first post in a three-part series that gives a mathematical story of the chi-squared distribution.

This post is an introduction which highlights the fact that the chi-squared distribution arises mathematically from the gamma distribution and that the chi-squared distribution has an intimate connection with the normal distribution. This post lays the groundwork for the subsequent posts.

The next post (Part 2) describes the roles played by the chi-squared distribution in forming the various sampling distributions related to the normal distribution. These sampling distributions are used for making inference about the population from which the sample is taken. The population parameters of interest here are the population mean, variance, and standard deviation. The population from which the sample is taken is assumed to be modeled adequately by a normal distribution.

Part 3 describes the chi-squared test, which is used for making inference on categorical data (versus quantitative data).

These three parts only scratch the surface with respect to the roles played by the chi-squared distribution in statistics. Thus the discussion in this series serves only as an introduction to the chi-squared distribution.

_______________________________________________________________________________________________

Defining the Chi-Squared Distribution

A random variable Y is said to follow the chi-squared distribution with k degrees of freedom if the following is the density function of Y.

    \displaystyle f_Y(y)=\frac{1}{\Gamma(\frac{k}{2}) \ 2^{\frac{k}{2}}}  \ y^{\frac{k}{2}-1} \ e^{-\frac{y}{2}} \ \ \ \ \ \ \ y>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

where k is a positive integer. In some sources, the distribution is named the \chi^2 distribution. Essentially the distribution defined in (1) is a gamma distribution with shape parameter \frac{k}{2} and scale parameter 2 (or rate parameter \frac{1}{2}). Note that the chi-squared distribution with 2 degrees of freedom (when k=2) is simply an exponential distribution with mean 2. The following figure shows the chi-squared density functions for degrees of freedom 1, 2, 3, 5 and 10.

Figure 1 – Chi-squared Density Curves

Just from the gamma connection, the mean and variance are E[Y]=k and Var[Y]=2k. In other words, the mean of a chi-squared distribution is the same as the degrees of freedom and its variance is always twice the degrees of freedom. As a gamma distribution, the higher moments E[Y^n] are also known. Consequently the properties that depend on E[Y^n] can be easily computed. See here for the basic properties of the gamma distribution. The following gives the mean, variance and the moment generating function (MGF) for the chi-squared random variable Y with k degrees of freedom.

    E[Y]=k

    Var[Y]=2 k

    \displaystyle M_Y(t)=\biggl( \frac{1}{1-2t} \biggr)^{\frac{k}{2}} \ \ \ \ \ \ t<\frac{1}{2}

_______________________________________________________________________________________________

Independent Sum of Chi-Squared Distributions

In general, the MGF of an independent sum Y_1+Y_2+\cdots+Y_n is simply the product of the MGFs of the individual random variables Y_i. Note that the product of chi-squared MGFs is also a chi-squared MGF, with the exponent being the sum of the individual exponents. This brings up another point that is important for the subsequent discussion, i.e. the independent sum of chi-squared distributions is also a chi-squared distribution. The following theorem states this fact more precisely.

Theorem 1
If Y_1,Y_2,\cdots,Y_n are chi-squared random variables with degrees of freedom k_1,k_2,\cdots,k_n, respectively, then the independent sum Y_1+Y_2+\cdots+Y_n has a chi-squared distribution with k_1+k_2+\cdots+k_n degrees of freedom.

Thus the result of summing independent chi-squared distributions is another chi-squared distribution with degree of freedom being the total of all degrees of freedom. This follows from the fact that if the gamma distributions have identical scale parameter, then the independent sum is a gamma distribution with the shape parameter being the sum of the shape parameters. This point is discussed in more details here.
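
Both facts, the gamma connection and Theorem 1, can be checked numerically. The following is a small Python sketch (an addition, not part of the original post); the degrees of freedom used are arbitrary.

    import numpy as np
    from scipy.stats import chi2, gamma

    # the chi-squared(k) density agrees with the gamma density with shape k/2 and scale 2
    k = 5
    x = np.linspace(0.1, 20, 50)
    print(np.allclose(chi2.pdf(x, df=k), gamma.pdf(x, a=k/2, scale=2)))    # True

    # independent sum: chi-squared(3) + chi-squared(4) behaves like chi-squared(7)
    rng = np.random.default_rng(0)
    total = (chi2.rvs(3, size=100_000, random_state=rng)
             + chi2.rvs(4, size=100_000, random_state=rng))
    print(total.mean(), total.var(), chi2.mean(7), chi2.var(7))            # both near 7 and 14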

_______________________________________________________________________________________________

The Connection with Normal Distributions

As shown in the above section, the chi-squared distribution is simple from a mathematical standpoint. Since it is a gamma distribution, it possesses all the properties that are associated with the gamma family. Of course, the gamma connection is far from the whole story. One important fact is that the chi-squared distribution is naturally obtained from sampling from a normal distribution.

Theorem 2
Suppose that the random variable X follows a standard normal distribution, i.e. the normal distribution with mean 0 and standard deviation 1. Then Y=X^2 follows a chi-squared distribution with 1 degree of freedom.

Proof
By definition, the following is the cumulative distribution function (CDF) of Y=X^2.

    \displaystyle \begin{aligned}F_Y(y)=P[Y \le y] &=P[X^2 \le y]=P[-\sqrt{y} \le X \le \sqrt{y}]=2 \ P[0 \le X \le \sqrt{y}] \\&=2 \ \int_0^{\sqrt{y}} \frac{1}{\sqrt{2 \pi}} \ e^{-\frac{x^2}{2}} \ dx  \end{aligned}

Upon differentiating F_Y(y), the density function is obtained.

    \displaystyle \begin{aligned}f_Y(y)=\frac{d}{dy}F_Y(y) &=2 \ \frac{d}{dy} \int_0^{\sqrt{y}} \frac{1}{\sqrt{2 \pi}} \ e^{-\frac{x^2}{2}} \ dx  \\ &=2 \ \frac{1}{\sqrt{2 \pi}} \ e^{-\frac{y}{2}} \ \frac{1}{2 \sqrt{y}} \\ &=\frac{1}{\sqrt{2 \pi}} \ y^{\frac{1}{2}-1} \ e^{-\frac{y}{2}} \\ &=\frac{1}{\Gamma(\frac{1}{2}) \ 2^{\frac{1}{2}}}  \ y^{\frac{1}{2}-1} \ e^{-\frac{y}{2}}\end{aligned}

Note that the second line follows from the fundamental theorem of calculus together with the chain rule, and the last line uses \Gamma(\frac{1}{2})=\sqrt{\pi}. The resulting density is that of a chi-squared distribution with 1 degree of freedom. \square

With Theorem 2 and the basic result in Theorem 1, there are more ways to obtain chi-squared distributions from sampling from normal distributions. For example, normalizing a sample item from a normal sample and then squaring it produces a chi-squared observation with 1 degree of freedom. Similarly, normalizing each item in a normal sample, squaring each normalized item and then summing the squares produces a chi-squared distribution with n degrees of freedom. These are made more precise in the following corollaries.

Corollary 3
Suppose that the random variable X follows a normal distribution with mean \mu and standard deviation \sigma. Then Y=[(X-\mu) / \sigma]^2 follows a chi-squared distribution with 1 degree of freedom.

Corollary 4
Suppose that X_1,X_2,\cdots,X_n is a random sample drawn from a normal distribution with mean \mu and standard deviation \sigma. Then the following random variable follows a chi-squared distribution with n degrees of freedom.

    \displaystyle \sum \limits_{j=1}^n \biggl( \frac{X_j-\mu}{\sigma} \biggr)^2=\biggl( \frac{X_1-\mu}{\sigma} \biggr)^2+\biggl( \frac{X_2-\mu}{\sigma} \biggr)^2+\cdots+\biggl( \frac{X_n-\mu}{\sigma} \biggr)^2

_______________________________________________________________________________________________

Calculating Chi-Squared Probabilities

In working with the chi-squared distribution, it is necessary to evaluate the cumulative distribution function (CDF). In hypothesis testing, it is necessary to calculate the p-value given the value of the chi-squared statistic. In confidence interval estimation, it is necessary to determine the critical value at a given confidence level. The standard procedure at one time was to use a chi-squared table. A typical chi-squared table can be found here. We demonstrate how to find chi-squared probabilities first using the table approach and subsequently using software (Excel in particular).

The table gives probabilities on the right tail. The table in the link given above gives the chi-squared value (on the x-axis) \chi_{\alpha}^2 for a given right-tail area \alpha and a given df. This table lookup is illustrated in the diagram below.

Figure 2 – Right Tail of Chi-squared Distribution

For df = 1, \chi_{0.1}^2=2.706, thus P[\chi^2 > 2.706]=0.1 and P[\chi^2 < 2.706]=0.9. So for df = 1, the 90th percentile of the chi-squared distribution is 2.706. The following shows more table lookup.

    df = 2, \chi_{0.01}^2=9.210.
    P[\chi^2 > 9.210]=0.01 and P[\chi^2 < 9.210]=0.99
    The 99th percentile of the chi-squared distribution with df = 2 is 9.210.

    df = 15, \chi_{0.9}^2=8.547.
    P[\chi^2 > 8.547]=0.9 and P[\chi^2 < 8.547]=0.1
    The 10th percentile of the chi-squared distribution with df = 15 is 8.547.

The choices for \alpha in the table are limited. Using software will have more selection for \alpha and will give more precise values. For example, Microsoft Excel provides the following two functions.

    =CHISQ.DIST(x, degree_freedom, cumulative)

    =CHISQ.INV(probability, degree_freedom)

The two functions in Excel give information about the left-tail of the chi-squared distribution. The function CHISQ.DIST returns the left-tailed probability of the chi-squared distribution. The parameter cumulative is either TRUE or FALSE, with TRUE meaning that the result is the cumulative distribution function and FALSE meaning that the result is the probability density function. On the other hand, the function CHISQ.INV returns the inverse of the left-tailed probability of the chi-squared distribution.

If the goal is to find probability given an x-value, use the function CHISQ.DIST. On the other hand, if the goal is to look for the x-value given the left-tailed value (probability), then use the function CHISQ.INV. In the table approach, once the value \chi_{\alpha}^2=x is found, the interplay between the probability (\alpha) and x-value is clear. In the case of Excel, one must choose the function first depending on the goal. The following gives the equivalent results for the table lookup presented above.

    =CHISQ.DIST(2.706, 1, TRUE) = 0.900028622
    =CHISQ.INV(0.9, 1) = 2.705543454

    =CHISQ.DIST(9.21, 2, TRUE) = 0.989998298
    =CHISQ.INV(0.99, 2) = 9.210340372

    =CHISQ.DIST(8.547, 15, TRUE) = 0.100011427
    =CHISQ.INV(0.1, 15) = 8.546756242
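
For readers working in Python rather than Excel, scipy.stats.chi2 provides the same lookups (an added sketch, not part of the original post): chi2.cdf plays the role of CHISQ.DIST with cumulative set to TRUE, and chi2.ppf plays the role of CHISQ.INV.

    from scipy.stats import chi2

    print(chi2.cdf(2.706, df=1))     # about 0.9000, matches CHISQ.DIST(2.706, 1, TRUE)
    print(chi2.ppf(0.9, df=1))       # about 2.7055, matches CHISQ.INV(0.9, 1)

    print(chi2.cdf(9.21, df=2))      # about 0.9900
    print(chi2.ppf(0.99, df=2))      # about 9.2103

    print(chi2.cdf(8.547, df=15))    # about 0.1000
    print(chi2.ppf(0.1, df=15))      # about 8.5468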

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}