# A catalog of parametric severity models

Various parametric continuous probability models have been presented and discussed in this blog. The number of parameters in these models ranges from one to two, and in a small number of cases three. They are all potential candidates for models of severity in insurance applications and in other actuarial applications. This post highlights these models. The list presented here is not exhaustive; it is only a brief catalog. There are other models that are also suitable for actuarial applications but not accounted for here. However, the list is a good place to begin. This post also serves a navigation device (the table shown below contains links to the blog posts).

A Catalog

Many of the models highlighted here are related to gamma distribution either directly or indirectly. So the catalog starts with the gamma distribution at the top and then branches out to the other related models. Mathematically, the gamma distribution is a two-parameter continuous distribution defined using the gamma function. The gamma sub family includes the exponential distribution, Erlang distribution and chi-squared distribution. These are distributions that are gamma distributions with certain restrictions on the one or both of the gamma parameters. Other distributions are obtained by raising a distribution to a power. Others are obtained by mixing distributions.

Here’s a listing of the models. Click on the links to find out more about the distributions.

……Derived From ………………….Model
Gamma function
Gamma sub families
Independent sum of gamma
Exponentiation
Raising to a power Raising exponential to a positive power

Raising exponential to a power

Raising gamma to a power

Raising Pareto to a power

Burr sub families
Mixture
Others

The above table categorizes the distributions according to how they are mathematically derived. For example, the gamma distribution is derived from the gamma function. The Pareto distribution is mathematically an exponential-gamma mixture. The Burr distribution is a transformed Pareto distribution, i.e. obtained by raising a Pareto distribution to a positive power. Even though these distributions can be defined simply by giving the PDF and CDF, knowing how their mathematical origins informs us of the specific mathematical properties of the distributions. Organizing according to the mathematical origin gives us a concise summary of the models.

$\text{ }$

$\text{ }$

From a mathematical standpoint, the gamma distribution is defined using the gamma function.

$\displaystyle \Gamma(\alpha)=\int_0^\infty t^{\alpha-1} \ e^{-t} \ dt$

In this above integral, the argument $\alpha$ is a positive number. The expression $t^{\alpha-1} \ e^{-t}$ in the integrand is always positive. The area in between the curve $t^{\alpha-1} \ e^{-t}$ and the x-axis is $\Gamma(\alpha)$. When this expression is normalized, i.e. divided by $\Gamma(\alpha)$, it becomes a density function.

$\displaystyle f(t)=\frac{1}{\Gamma(\alpha)} \ t^{\alpha-1} \ e^{-t}$

The above function $f(t)$ is defined over all positive $t$. The integral of $f(t)$ over all positive $t$ is 1. Thus $f(t)$ is a density function. It only has one parameter, the $\alpha$, which is the shape parameter. Adding the scale parameter $\theta$ making it a two-parameter distribution. The result is called the gamma distribution. The following is the density function.

$\displaystyle f(x)=\frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ x>0$

Both parameters $\alpha$ and $\theta$ are positive real numbers. The first parameter $\alpha$ is the shape parameter and $\theta$ is the scale parameter.

As mentioned above, many of the distributions listed in the above table is related to the gamma distribution. Some of the distributions are sub families of gamma. For example, when $\alpha$ are positive integers, the resulting distributions are called Erlang distribution (important in queuing theory). When $\alpha=1$, the results are the exponential distributions. When $\alpha=\frac{k}{2}$ and $\theta=2$ where $k$ is a positive integer, the results are the chi-squared distributions (the parameter $k$ is referred to the degrees of freedom). The chi-squared distribution plays an important role in statistics.

Taking independent sum of $n$ independent and identically distributed exponential random variables produces the Erlang distribution, a sub gamma family of distribution. Taking independent sum of $n$ exponential random variables, with pairwise distinct means, produces the hypoexponential distributions. On the other hand, the mixture of $n$ independent exponential random variables produces the hyperexponential distribution.

The Pareto distribution (Pareto Type II Lomax) is the mixture of exponential distributions with gamma mixing weights. Despite the connection with the gamma distribution, the Pareto distribution is a heavy tailed distribution. Thus the Pareto distribution is suitable for modeling extreme losses, e.g. in modeling rare but potentially catastrophic losses.

As mentioned earlier, raising a Pareto distribution to a positive power generates the Burr distribution. Restricting the parameters in a Burr distribution in a certain way will produces the paralogistic distribution. The table indicates the relationships in a concise way. For details, go into the blog posts to get more information.

Tail Weight

Another informative way to categorize the distributions listed in the table is through looking at the tail weight. At first glance, all the distributions may look similar. For example, the distributions in the table are right skewed distributions. Upon closer look, some of the distributions put more weights (probabilities) on the larger values. Hence some of the models are more suitable for models of phenomena with significantly higher probabilities of large or extreme values.

When a distribution significantly puts more probabilities on larger values, the distribution is said to be a heavy tailed distribution (or said to have a larger tail weight). In general tail weight is a relative concept. For example, we say model A has a larger tail weight than model B (or model A has a heavier tail than model B). However, there are several ways to check for tail weight of a given distribution. Here are the four criteria.

Tail Weight Measure What to Look for
1 Existence of moments The existence of more positive moments indicates a lighter tailed distribution.
2 Hazard rate function An increasing hazard rate function indicates a lighter tailed distribution.
3 Mean excess loss function An increasing mean excess loss function indicates a heavier tailed distribution.
4 Speed of decay of survival function A survival function that decays rapidly to zero (as compared to another distribution) indicates a lighter tailed distribution.

Existence of moments
For a positive real number $k$, the moment $E(X^k)$ is defined by the integral $\int_0^\infty x^k \ f(x) \ dx$ where $f(x)$ is the density function of the distribution in question. If the distribution puts significantly more probabilities in the larger values in the right tail, this integral may not exist (may not converge) for some $k$. Thus the existence of moments $E(X^k)$ for all positive $k$ is an indication that the distribution is a light tailed distribution.

In the above table, the only distributions for which all positive moments exist are gamma (including all gamma sub families such as exponential), Weibull, lognormal, hyperexponential, hypoexponential and beta. Such distributions are considered light tailed distributions.

The existence of positive moments exists only up to a certain value of a positive integer $k$ is an indication that the distribution has a heavy right tail. All the other distributions in the table are considered heavy tailed distribution as compared to gamma, Weibull and lognormal. Consider a Pareto distribution with shape parameter $\alpha$ and scale parameter $\theta$. Note that the existence of the Pareto higher moments $E(X^k)$ is capped by the shape parameter $\alpha$. If the Pareto distribution is to model a random loss, and if the mean is infinite (when $\alpha=1$), the risk is uninsurable! On the other hand, when $\alpha \le 2$, the Pareto variance does not exist. This shows that for a heavy tailed distribution, the variance may not be a good measure of risk.

Hazard rate function
The hazard rate function $h(x)$ of a random variable $X$ is defined as the ratio of the density function and the survival function.

$\displaystyle h(x)=\frac{f(x)}{S(x)}$

The hazard rate is called the force of mortality in a life contingency context and can be interpreted as the rate that a person aged $x$ will die in the next instant. The hazard rate is called the failure rate in reliability theory and can be interpreted as the rate that a machine will fail at the next instant given that it has been functioning for $x$ units of time.

Another indication of heavy tail weight is that the distribution has a decreasing hazard rate function. On the other hand, a distribution with an increasing hazard rate function has a light tailed distribution. If the hazard rate function is decreasing (over time if the random variable is a time variable), then the population die off at a decreasing rate, hence a heavier tail for the distribution in question.

The Pareto distribution is a heavy tailed distribution since the hazard rate is $h(x)=\alpha/x$ (Pareto Type I) and $h(x)=\alpha/(x+\theta)$ (Pareto Type II Lomax). Both hazard rates are decreasing function.

The Weibull distribution is a flexible model in that when its shape parameter is $0<\tau<1$, the Weibull hazard rate is decreasing and when $\tau>1$, the hazard rate is increasing. When $\tau=1$, Weibull is the exponential distribution, which has a constant hazard rate.

The point about decreasing hazard rate as an indication of a heavy tailed distribution has a connection with the fourth criterion. The idea is that a decreasing hazard rate means that the survival function decays to zero slowly. This point is due to the fact that the hazard rate function generates the survival function through the following.

$\displaystyle S(x)=e^{\displaystyle -\int_0^x h(t) \ dt}$

Thus if the hazard rate function is decreasing in $x$, then the survival function will decay more slowly to zero. To see this, let $H(x)=\int_0^x h(t) \ dt$, which is called the cumulative hazard rate function. As indicated above, $S(x)=e^{-H(x)}$. If $h(x)$ is decreasing in $x$, $H(x)$ has a lower rate of increase and consequently $S(x)=e^{-H(x)}$ has a slower rate of decrease to zero.

In contrast, the exponential distribution has a constant hazard rate function, making it a medium tailed distribution. As explained above, any distribution having an increasing hazard rate function is a light tailed distribution.

The mean excess loss function
The mean excess loss is the conditional expectation $e_X(d)=E(X-d \lvert X>d)$. If the random variable $X$ represents insurance losses, mean excess loss is the expected loss in excess of a threshold conditional on the event that the threshold has been exceeded. Suppose that the threshold $d$ is an ordinary deductible that is part of an insurance coverage. Then $e_X(d)$ is the expected payment made by the insurer in the event that the loss exceeds the deductible.

Whenever $e_X(d)$ is an increasing function of the deductible $d$, the loss $X$ is a heavy tailed distribution. If the mean excess loss function is a decreasing function of $d$, then the loss $X$ is a lighter tailed distribution.

The Pareto distribution can also be classified as a heavy tailed distribution based on an increasing mean excess loss function. For a Pareto distribution (Type I) with shape parameter $\alpha$ and scale parameter $\theta$, the mean excess loss is $e(X)=d/(\alpha-1)$, which is increasing. The mean excess loss for Pareto Type II Lomax is $e(X)=(d+\theta)/(\alpha-1)$, which is also decreasing. They are both increasing functions of the deductible $d$! This means that the larger the deductible, the larger the expected claim if such a large loss occurs! If the underlying distribution for a random loss is Pareto, it is a catastrophic risk situation.

In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.

Speed of decay of the survival function to zero
The survival function $S(x)=P(X>x)$ captures the probability of the tail of a distribution. If a distribution whose survival function decays slowly to zero (equivalently the cdf goes slowly to one), it is another indication that the distribution is heavy tailed. This point is touched on when discussing hazard rate function.

The following is a comparison of a Pareto Type II survival function and an exponential survival function. The Pareto survival function has parameters ($\alpha=2$ and $\theta=2$). The two survival functions are set to have the same 75th percentile, which is $x=2$. The following table is a comparison of the two survival functions.

$\displaystyle \begin{array}{llllllll} \text{ } &x &\text{ } & \text{Pareto } S_X(x) & \text{ } & \text{Exponential } S_Y(x) & \text{ } & \displaystyle \frac{S_X(x)}{S_Y(x)} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{ } &2 &\text{ } & 0.25 & \text{ } & 0.25 & \text{ } & 1 \\ \text{ } &10 &\text{ } & 0.027777778 & \text{ } & 0.000976563 & \text{ } & 28 \\ \text{ } &20 &\text{ } & 0.008264463 & \text{ } & 9.54 \times 10^{-7} & \text{ } & 8666 \\ \text{ } &30 &\text{ } & 0.00390625 & \text{ } & 9.31 \times 10^{-10} & \text{ } & 4194304 \\ \text{ } &40 &\text{ } & 0.002267574 & \text{ } & 9.09 \times 10^{-13} & \text{ } & 2.49 \times 10^{9} \\ \text{ } &60 &\text{ } & 0.001040583 & \text{ } & 8.67 \times 10^{-19} & \text{ } & 1.20 \times 10^{15} \\ \text{ } &80 &\text{ } & 0.000594884 & \text{ } & 8.27 \times 10^{-25} & \text{ } & 7.19 \times 10^{20} \\ \text{ } &100 &\text{ } & 0.000384468 & \text{ } & 7.89 \times 10^{-31} & \text{ } & 4.87 \times 10^{26} \\ \text{ } &120 &\text{ } & 0.000268745 & \text{ } & 7.52 \times 10^{-37} & \text{ } & 3.57 \times 10^{32} \\ \text{ } &140 &\text{ } & 0.000198373 & \text{ } & 7.17 \times 10^{-43} & \text{ } & 2.76 \times 10^{38} \\ \text{ } &160 &\text{ } & 0.000152416 & \text{ } & 6.84 \times 10^{-49} & \text{ } & 2.23 \times 10^{44} \\ \text{ } &180 &\text{ } & 0.000120758 & \text{ } & 6.53 \times 10^{-55} & \text{ } & 1.85 \times 10^{50} \\ \text{ } & \text{ } \\ \end{array}$

Note that at the large values, the Pareto right tails retain much more probabilities. This is also confirmed by the ratio of the two survival functions, with the ratio approaching infinity. Using an exponential distribution to model a Pareto random phenomenon would be a severe modeling error even though the exponential distribution may be a good model for describing the loss up to the 75th percentile (in the above comparison). It is the large right tail that is problematic (and catastrophic)!

Since the Pareto survival function and the exponential survival function have closed forms, We can also look at their ratio.

$\displaystyle \frac{\text{pareto survival}}{\text{exponential survival}}=\frac{\displaystyle \frac{\theta^\alpha}{(x+\theta)^\alpha}}{e^{-\lambda x}}=\frac{\theta^\alpha e^{\lambda x}}{(x+\theta)^\alpha} \longrightarrow \infty \ \text{ as } x \longrightarrow \infty$

In the above ratio, the numerator has an exponential function with a positive quantity in the exponent, while the denominator has a polynomial in $x$. This ratio goes to infinity as $x \rightarrow \infty$.

In general, whenever the ratio of two survival functions diverges to infinity, it is an indication that the distribution in the numerator of the ratio has a heavier tail. When the ratio goes to infinity, the survival function in the numerator is said to decay slowly to zero as compared to the denominator.

It is important to examine the tail behavior of a distribution when considering it as a candidate for a model. The four criteria discussed here provide a crucial way to classify parametric models according to the tail weight.

severity models
math

Daniel Ma
mathematics

$\copyright$ 2017 – Dan Ma

# Generalized beta distribution

The beta distribution is defined using the beta function. The beta distribution can also be naturally generated as order statistics by sampling from the uniform distribution. This post presents a generalization of the standard beta distribution.

There are many generalized beta distributions. This post defines a “basic” generalized beta distribution that has four parameters. Recall that the standard beta distribution has two parameters $a$ and $b$. Both $a$ and $b$ drive the shape of the beta distribution, e.g. its skewness is driven by the magnitude of $b-a$. The generalized beta distribution defined here has four parameters $a$, $b$, $\tau$ and $\theta$. The value $\tau$ is an exponent parameter and the parameter $\theta$ is a scale parameter to translate the distribution to an interval other than the unit interval. The generalized beta distribution discussed here is called the generalized beta distribution of the first kind (see the paper listed in the reference section).

The role of the parameter $\tau$ is interesting in that it affects the shape of the new distribution, e.g. making the distribution more skewed or less skewed. Yet it is not a strictly a shape parameter. But it can greatly accentuate or reduce the skewness of the starting beta distribution depending on the value of $\tau$ (see the last section below).

_______________________________________________________________________________________________

Example 1

Before formally define the distribution, let’s look at the effect of the two additional parameters $\tau$ and $\theta$ through an example. The parameter $\theta$ is a translation parameter. First, look at the effect of adding $\tau$.

Let $X$ be a random variable that follows the beta distribution with parameters $a=3$ and $b=7$. The following is the density function of $X$.

$\displaystyle f(x)=252 \ x^{3-1} \ (1-x)^{7-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0

Consider the following random variables:

$\displaystyle Y=X^{\frac{1}{\tau}} \ \ \ \ \ \ \ \ \ \ \ \ \tau =\frac{1}{2}$

$\displaystyle T=X^{\frac{1}{\tau}} \ \ \ \ \ \ \ \ \ \ \ \ \tau =2$

$\displaystyle Y_1=5 \ Y$

$\displaystyle T_1=5 \ T$

Essentially $Y$ is the square of $X$ and $T$ is the square root of $X$. To know more about the random variables $Y$ and $T$, let’s look at the graphs of the density functions. The following diagram shows the density functions of $X$ (blue curve), $Y$ (tall red curve) and $T$ (black curve).

Figure 1

The standard beta density curve for the random variable $X$ has moderate right skewness (the blue curve). Squaring $X$ produces a density curve with much more pronounced skewness (the red curve). This is because the action of squaring puts more probabilities on the smaller numbers. Squaring tends to shift the data closer to the origin. For example, 0.9 becomes 0.81, 0.5 becomes 0.25, 0.1 becomes 0.01, 0.001 becomes 0.000001 and so on.

Yet taking the square root on the standard beta has the opposite effect. The effect is to push the data toward 1.0, producing a density curve (the black curve) that is slightly negatively skewed (it looks almost symmetric).

Even though the random variable $Y$ is obtained by squaring the beta $X$, the density function of $Y$ is via a square root. On the other hand, while the random variable $T$ is obtained by taking square root of $X$, the density function of $T$ is obtained via squaring. The following are the density functions of $Y$ and $T$.

\displaystyle \begin{aligned}f_Y(y)&=f_X(y^{1/2}) \ \frac{d}{dy} y^{1/2} \\&\text{ } \\&=252 \ [y^{1/2} ]^2 \ [1-y^{1/2} ]^6 \frac{1}{2 y^{1/2}} \\&\text{ } \\&=126 \ y^{1/2} \ [1-y^{1/2} ]^6 \ \ \ \ \ \ \ \ \ \ \ 0

$\text{ }$

\displaystyle \begin{aligned}f_T(t)&=f_X(t^2) \ \frac{d}{dy} t^2 \\&\text{ } \\&=252 \ [t^2 ]^2 \ [1-t^2 ]^6 \ 2t \\&\text{ } \\&=504 \ t^5 \ [1-t^2 ]^6 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0

Because $Y$ and $T$ are defined by raising $X$ to a power, the properties involving moments can be derived from the beta distribution on $X$, via $E(Y^k)=E(X^{2k})$ and $E(T^k)=E(X^{k/2})$. The following table shows the first four moments of $Y$ and $T$ (using the formula for beta moments found here).

$\displaystyle \begin{array}{lllll} \text{ } &\text{ } & E(Y^k)=E(X^{2k}) & \text{ } & E(T^k)=E(X^{k/2}) \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ k=1 &\text{ } & \displaystyle E(X^2)=\frac{6}{55} & \text{ } & \displaystyle E(X^{1/2})=\frac{24576}{46189} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ k=2 &\text{ } & \displaystyle E(X^4)=\frac{3}{143} & \text{ } & \displaystyle E(X)=\frac{3}{10} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ k=3 &\text{ } & \displaystyle E(X^6)=\frac{4}{715} & \text{ } & \displaystyle E(X^{3/2})=\frac{57344}{323323} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ k=4 &\text{ } & \displaystyle E(X^8)=\frac{9}{4862} & \text{ } & \displaystyle E(X^2)=\frac{6}{55} \\ \end{array}$

The following table shows a comparison of the three random variables.

$\displaystyle \begin{array}{lllllll} \text{ } &\text{ } & X & \text{ } & Y=X^2& \text{ } & T=X^{1/2} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Mean} &\text{ } & \displaystyle E(X)=\frac{3}{10}=0.3 & \text{ } & \displaystyle E(Y)=\frac{6}{55}=0.11 & \text{ } & \displaystyle E(T)=\frac{24576}{46189}=0.53\\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Variance} &\text{ } & \displaystyle Var(X)=\frac{21}{1100} & \text{ } & \displaystyle Var(Y)=0.009078195 & \text{ } & \displaystyle Var(T)=0.016896475 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Std Dev} &\text{ } & \displaystyle \sigma_X=0.1382 & \text{ } & \displaystyle \sigma_Y=0.09528 & \text{ } & \displaystyle \sigma_T=0.1300 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{CV} &\text{ } & \displaystyle \frac{\sigma_X}{\mu_X}=0.4606 & \text{ } & \displaystyle \frac{\sigma_Y}{\mu_Y}=0.8734 & \text{ } & \displaystyle \frac{\sigma_T}{\mu_T}=0.2443 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Skewness} &\text{ } & \displaystyle \gamma_X=0.4825 & \text{ } & \displaystyle \gamma_Y=1.5320 & \text{ } & \displaystyle \gamma_T=-0.1113 \\ \end{array}$

The variance is calculated by letting the second moment subtracting the square of the first moment. The standard deviation is the square root of the variance. CV stands for coefficient of variation, which is the ratio of the standard deviation to the mean. The skewness is the third central moment divided by the cube of the standard deviation.

Note that the skewness calculation confirms what we see in the three density curves in Figure 1. The skewness of the beta distribution is moderate (right skewed). Squaring the beta distribution has the effect of pushing the data to the origin, hence the standard deviation is smaller and the right skew is more pronounced (3 times as strong). Taking the square root of the beta distribution goes the opposite direction, leading to a slightly left skewed distribution.

The above discussion only focuses on the effect of the parameter $\tau$ (the effect of raising the base distribution to a power). The other parameter $\theta$ is a scale parameter that translates the transformed distribution from the interval $(0,1)$ to the interval $(0, \theta)$. The following diagrams show the density function of $Y_1=5 \ Y$ (Figure 2) and the density function of $T_1=5 \ T$ (Figure 3).

Figure 2

Figure 3

Multiplying by 5 certainly affects the mean and variance. The CV and skewness remain the same. Thus the scale parameter does not change the shape.

_______________________________________________________________________________________________

Basic Properties

Let $a$, $b$, $\tau$ and $\theta$ be some fixed positive real numbers. A random variable $T$ follows the generalized beta distribution with parameters $a$, $b$, $\tau$ and $\theta$ if

$\displaystyle T=\theta \ X^{\frac{1}{\tau}}$

where $X$ is a random variable that follows the beta distribution with parameters $a$ and $b$. In other words, if we start with a standard beta distribution, raising it to a power and then multiplying a scale parameter would produce a generalized beta distribution. On the other hand, if we start with a generalized beta distribution, dividing it by a parameter called $\theta$ and then raising it to a power would produce a standard beta distribution. In this post, we prefer to work with the first progression – defining the generalized beta from the standard beta.

We first derive the density function of the random variable $\displaystyle Y=X^{1/\tau}$, i.e. the generalized beta distribution without the parameter $\theta$. The new random variable is obtained by raising the old to $1/\tau$. As a result, the new density function is obtained by plugging $x^\tau$ into the old density function and multiplying the derivative of $x^\tau$.

$\displaystyle f_X(x)=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ x^{a-1} \ (1-x)^{b-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0

\displaystyle \begin{aligned}f_Y(x)&=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ (x^\tau)^{a-1} \ (1-x^\tau)^{b-1} \ \tau x^{\tau-1} \\&=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ (x^\tau)^{a} \ (1-x^\tau)^{b-1} \ \frac{\tau}{x} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0

Now add the scale parameter. The effect is that $x/ \theta$ is plugged into the density function of $Y$ and that the density function is multiplied by $1/ \theta$ (the derivative).

\displaystyle \begin{aligned}f_T(x)&=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ \biggl[\biggl(\frac{x}{\theta}\biggr)^\tau \biggr]^{a-1} \ \biggl[1-\biggl(\frac{x}{\theta}\biggr)^\tau \biggr]^{b-1} \ \tau \biggl(\frac{x}{\theta}\biggr)^{\tau-1} \ \frac{1}{\theta} \\&=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ \biggl[\biggl(\frac{x}{\theta}\biggr)^\tau \biggr]^{a} \ \biggl[1-\biggl(\frac{x}{\theta}\biggr)^\tau \biggr]^{b-1} \ \frac{\tau}{x} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0

Recall that the cumulative distribution function of the standard beta $X$ can be expressed using the incomplete beta function.

$\displaystyle F_X(x)=\frac{B(a,b,x)}{B(a,b)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0

$\displaystyle B(a,b,x)=\int_0^x \ t^{a-1} \ (1-t)^{b-1} \ dt$

$\displaystyle B(a,b)=\int_0^1 \ t^{a-1} \ (1-t)^{b-1} \ dt$

The CDF in $(4)$ has no closed form. Since $Y$ and $T$ are obtained by raising $X$ to a power, their CDFs can still be expressed using the incomplete beta function.

$\displaystyle F_Y(x)=\frac{B(a,b,x^\tau)}{B(a,b)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0

$\displaystyle F_T(x)=\frac{B(a,b,(x/ \theta )^\tau )}{B(a,b)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0

For the standard beta distribution, all positive moments exist, i.e. $E(X^k)$ is defined for all positive real numbers $k$. As a result, all positive moments exist for the generalized $Y$ and $T$ as well.

$\displaystyle E(X^k) = \left\{ \begin{array}{ll} \displaystyle \frac{\Gamma(a+b)}{\Gamma(a)} \ \frac{\Gamma(a+k)}{\Gamma(a+b+k)} &\ \ \ \ \ \ k>-a \\ \text{ } & \text{ } \\ \displaystyle \frac{a (a+1) \cdots (a+k-1)}{(a+b) (a+b+1) \cdots (a+b+k-1)} &\ \ \ \ \ \ k \text{ is a positive integer} \end{array} \right.$

$\displaystyle E(Y^k)=E(X^{k / \tau})=\displaystyle \frac{\Gamma(a+b)}{\Gamma(a)} \ \frac{\Gamma(a+k/ \tau)}{\Gamma(a+b+k / \tau)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ k>-a \tau \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (7)$

$\displaystyle E(T^k)=E(\theta^k \ X^{k / \tau})=\displaystyle \theta^k \ \frac{\Gamma(a+b)}{\Gamma(a)} \ \frac{\Gamma(a+k/ \tau)}{\Gamma(a+b+k / \tau)} \ \ \ \ \ \ \ \ \ \ \ k>-a \tau \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)$

Once the moments are known, distributional quantities such as variance, coefficient of variation and skewness and kurtosis can be routinely calculated.

_______________________________________________________________________________________________

How the Shape Can Change by the Parameter $\tau$

The parameters $a$ and $b$ are both shape parameters for the standard beta distribution. the larger the one of them (in relation to the other), the more stronger the skewness. The direction of the skew depends on which one is larger. The beta distribution has a right skew if $b$ is larger (the parameter associated with the $1-x$ term in the beta density function) and has a left skew if $a$ is larger (the parameter associated with the $x$ term in the beta density). The additional parameter $\tau$ can further tweak the skewness of the beta distribution.

In the example discussed above, the starting beta distribution with $a=3$ and $b = 7$ is a right skewed distribution. Squaring it ($\tau=0.5$) produces a stronger skewness to the right (see Figure 1). Taking a square root ($\tau=2$) produces a weaker skewness to the right (in fact a slight skewness to the left).

Consider the case $\tau<1$. Raising the beta $X$ to the power of $1/ \tau$ has the effect of pushing the data toward the origin. As a result, this action makes the random variable $X^{1 / \tau}$ to become right skewed. If the standard beta is already right skewed, raising it to the power of $1/ \tau$ will make the right skew stronger. If the standard beta is symmetric, raising it to the power of $1/ \tau$ will produce a moderate right skew. If the standard beta is left skewed, raising it to the power of $1/ \tau$ will reduce the magnitude of the left skew (possibly producing a slight right skew).

Now consider that case that $\tau>1$. Raising the beta $X$ to the power of $1/ \tau$ has the effect of pushing the data toward the end point of the interval at 1. As a result, this action makes the random variable $X^{1 / \tau}$ to become left skewed. If the standard beta is already left skewed, raising it to the power of $1/ \tau$ will make the left skew stronger. If the standard beta is symmetric, raising it to the power of $1/ \tau$ will produce a moderate left skew. If the standard beta is right skewed, raising it to the power of $1/ \tau$ will reduce the magnitude of the right skew (possibly producing a slight left skew).

The following example illustrates the idea of alternating the skewness by the parameter $\tau$. The calculation is left as an exercise.

$\displaystyle \begin{array}{ccccccc} \text{Beta } X&\text{ } & \text{Skewness of } X& \text{ } & \text{Skewness of } X^{1 / \tau}& \text{ } & \text{Skewness of } X^{1 / \tau}\\ \text{ } &\text{ } & \text{ } & \text{ } & \tau=0.5 <1& \text{ } & \tau=2 >1 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Beta 7, 3} &\text{ } & \displaystyle -0.4825 & \text{ } & \displaystyle -0.01066 & \text{ } & \displaystyle -0.7859\\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Beta 5, 5} &\text{ } & \displaystyle 0 & \text{ } & \displaystyle 0.67285 & \text{ } & \displaystyle -0.3930 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Beta 3, 7} &\text{ } & \displaystyle 0.4825 & \text{ } & \displaystyle 1.53195 & \text{ } & \displaystyle -0.11135 \\ \end{array}$

The Beta with $a=7$ and $b=3$ has a left skew since $a$ dominates. Raising it to $1 / \tau$ with $\tau<1$ pushes the data to the origin and thus reducing the left skew greatly. On the other hand, Raising it to $1 / \tau$ with $\tau>1$ pushes the data to 1.0 and as a result making the left skew even stronger.

Starting with the symmetric Beta with $a=5$ and $b=5$, the case of $\tau<1$ produces a moderate right skew (pushing the data to the origin) and the case of $\tau>1$ produces a moderate left skew (pushing the data to 1.0).

The right skewed Beta with $a=3$ and $b=7$ has the opposite dynamics as for the beta with $a=7$ and $b=3$ and is illustrated in Figure 1 above.

_______________________________________________________________________________________________

Reference

1. McDonald, J. B., Some generalization functions for the size distribution of income, Econometrica, 52, 3, 647-663 (1984).

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$

# The beta distribution as order statistics

The beta distribution is mathematically defined using the beta function as discussed in the previous post. The beta distribution can also arise naturally from random sampling from the uniform distribution. This natural generation of the beta distribution leads to an interesting discussion of order statistics and non-parametric inference.

_______________________________________________________________________________________________

Sampling from the Uniform Distribution

Suppose $X_1,X_2,\cdots,X_n$ is a random sample from a continuous distribution. Rank the sample in ascending order: $Y_1. The statistic $Y_1$ is the least sample item (the minimum statistic). The statistic $Y_2$ is the second smallest sample item (the second order statistic), and so on. Of course, $Y_n$ is the maximum statistic. Since we are sampling from a continuous distribution, assume that there is no chance for a tie among the sample items $X_i$ or the order statistics $Y_i$. Let $F(x)$ and $f(x)$ be the cumulative distribution function and the density function of the continuous distribution from which the random sample is drawn. The following is the density function of the order statistic $Y_i$ where $1 \le i \le n$.

$\displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! \ (n-i)!} \ F(y)^{i-1} \ f(y) \ [1-F(y)]^{n-i} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

To see how $(1)$ is derived, see the discussion here and here.

When sampling from the uniform distribution on $(0,1)$, the order statistics have the beta distribution. Suppose that the distribution from which the random sample is drawn is the uniform distribution on the unit interval $(0,1)$. Then $F(x)=x$ and $f(x)=1$ for all $0. Then the density function in $(1)$ becomes the following:

$\displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! \ (n-i)!} \ y^{i-1} \ [1-y]^{n-i} \ \ \ \ \ \ 0

The density in $(2)$ is a beta density function with $a=i$ and $b=n-i+1$. The following is the mean of the beta distribution described in $(2)$.

$\displaystyle E(Y_i)=\frac{a}{a+b}=\frac{i}{n+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$

To summarize, in a random sample of size $n$ drawn from a uniform distribution on $(0,1)$, the $i$th order statistic $Y_i$ has a beta distribution with parameters $a=i$ and $b=n-i+1$. The following table shows the information when $n=7$.

$\displaystyle \begin{array}{ccccccc} \text{ } &\text{ } & \text{beta parameter } a & \text{ } & \text{beta parameter } b& \text{ } & E(Y_i) \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_1 &\text{ } & 1 & \text{ } & 7 & \text{ } & \displaystyle \frac{1}{8}\\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_2 &\text{ } & 2 & \text{ } & 6 & \text{ } & \displaystyle \frac{2}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_3 &\text{ } & 3 & \text{ } & 5 & \text{ } & \displaystyle \frac{3}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_4 &\text{ } & 4 & \text{ } & 4 & \text{ } & \displaystyle \frac{4}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_5 &\text{ } & 5 & \text{ } & 3 & \text{ } & \displaystyle \frac{5}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_6 &\text{ } & 6 & \text{ } & 2 & \text{ } & \displaystyle \frac{6}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_7 &\text{ } & 7 & \text{ } & 1 & \text{ } & \displaystyle \frac{7}{8} \\ \end{array}$

See here for a fuller discussion on the beta distribution.

_______________________________________________________________________________________________

The Non-Parametric Angle

In descriptive statistics, the sample statistics are used as point estimates for population parameters. For example, the sample mean can be used as an estimate for the population mean and the sample median is used as an estimate of the population median and so on. Such techniques are part of non-parametric statistics since there are no assumptions made about the probability distributions of the variables being assessed.

Let’s focus on estimation of percentiles. We do not need to assume a population probability distribution. Simply generate the sample data from the population. Then rank the sample data from the smallest to the largest. Use the sample item in the “middle” to estimate the population median. Likewise, the population 75th percentile is estimated by the sample item that ranks higher than approximately 75% of the sample items and ranks below 25% of the sample items. And so on. In other words, order statistics can be used as estimates of population percentiles. This makes intuitive sense. The estimate will always be correct from the perspective of the sample. For example, the estimate of the 75th percentile is chosen to rank approximately above 75% of the sample. But will the estimate chosen this way rank above 75% of the population? The remainder of the post shows that the answer is yes. On average the estimate will rank above the appropriate percentage of the population. Thus the non-parametric approach of using order statistics as estimate of population percentiles makes mathematical sense as well. The estimate is “unbiased” in the sense that it is expected to be rank correctly among the population values.

As before, $X_1,X_2,\cdots,X_n$ is the random sample and $Y_1 is the resulting ordered sample. We do not know the distribution from which the sample is obtained. To make the argument clear, let $f(x)$ be the density function and $F(x)$ be the CDF of the unknown population, respectively. We show that

regardless of the probability distribution from which the sample is generated, the expected area under the density curve $f(x)$ and to the left of $Y_i$ is $\displaystyle \frac{i}{n+1}$.

Thus the order statistic $Y_i$ is expected to be greater than $\displaystyle \biggl(100 \times \frac{i}{n+1}\biggr)$% of the population. This shows that it is mathematically justified to use the order statistics $Y_1 as estimates of the population percentiles.

For example, if the sample size $n$ is 11, then the middle sample item is $Y_6$. Then the area under the density curve of the unknown population distribution and to the left of $Y_6$ is expected to be 6/12 = 0.5. So $Y_6$ is expected to be greater than 50% of the distribution, even though the form of the distribution is unknown.

Note that $F(X)$ has a uniform distribution on the interval $(0,1)$ for any continuous random variable $X$. So $F(X_1),F(X_2),\cdots,F(X_n)$ is like a random sample drawn from the uniform distribution. Furthermore $F(Y_1) is an order sample from the uniform distribution since $F(x)$ as a CDF is an increasing function. Thus each item $F(Y_i)$ is an order statistic. Recall the result described in $(2)$. The order statistic $F(Y_i)$ has a beta distribution with $a=i$ and $b=n-i+1$. Then the mean of $F(Y_i)$ is

$\displaystyle E[F(Y_i)]=\frac{a}{a+b}=\frac{i}{n+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

, which is the same as the mean in $(3)$. Note that the area under the density curve $f(x)$ and to the left of $Y_i$ can be conceptually expressed as $F(Y_i)$. The expected value of this area is the ratio indicated in $(4)$. This concludes the argument that when the order statistic $Y_i$ is used as an estimate of a certain population percentile, it is on average a correct estimate in that it is expected to rank above the correct percentage of the population.

_______________________________________________________________________________________________

Remarks

A subclass of the beta distribution can be naturally generated from random sampling of the uniform distribution, as described in $(2)$. The non-parametric approach is not only for producing point estimates . It can be used as an inference procedure as well. For example, a wildlife biologist may be interested in estimating the median weight of black bear in a certain region in Alaska. The procedure is to capture a sample of bears, sedate the bears and then take the weight measurements. Rank the sample. The middle sample item is then estimate of the population median weight of black bears. The ordered sample can also be used to form a confidence interval for the median bear weight. For an explanation on how to form such distribution-free confidence intervals, see here.

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$

# Introducing the beta distribution

This post is an introduction on the standard beta distribution, i.e. the one whose support is the unit interval $(0, 1)$ and is mathematically defined from the beta function. There are many generalizations of the standard beta distribution. The most straightforward one is to resize the standard beta so that the support is an interval other than $(0, 1)$, which is discussed at the end of this post. There are many forms of generalized beta distribution. One subsequent post will define one such version.

The beta distribution and its many generalizations are important in economic modeling (e.g. modeling of income distribution). It can also be applied in actuarial modeling, e.g. for modeling insurances losses. In Bayesian analysis, the beta distribution can be used as a conjugate prior for the binomial model. The beta distribution has an interesting and important connection with order statistics and non-parametric inference, which is the subject of the next post.

_______________________________________________________________________________________________

The Beta Density Function

Let $a$ and $b$ be positive real numbers. The previous post establishes the following fact about the beta function.

$\displaystyle B(a,b)=\int_0^1 t^{a-1} \ (1-t)^{b-1} \ dt=\frac{\Gamma(a) \ \Gamma(b)}{\Gamma(a+b)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

As a result, the following is a density function.

$\displaystyle f(x)=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ x^{a-1} \ (1-x)^{b-1}; \ \ \ \ \ \ \ \ 0

A random variable $X$ is said to follow the beta distribution with parameters $a$ and $b$ if its pdf is $(1)$. In particular, if $a$ and $b$ are positive integers, then the beta density function is:

$\displaystyle f(x)=\frac{(a+b-1)!}{(a-1)! \ (b-1)!} \ x^{a-1} \ (1-x)^{b-1}; \ \ \ \ \ \ \ \ 0

It is instructive to look at the role of the parameters $a$ and $b$ in the beta distribution. It turns out that both parameters $a$ and $b$ play a role in determining the shape of the distribution. If $a$ and $b$ are less than 1, the beta density is a U shape where the density curve goes up to infinity as $x \rightarrow 0$ and $x \rightarrow 1$ (see the U shaped red curve in Figure 1).

If $a=b$, then the beta density curves are symmetric about the vertical line at $x=0.5$. When $a=b$, the value of the density curve at $x$ is identical to the value at $1-x$ (see Figure 1). Note that when $a=b=1$, the beta distribution is the uniform distribution on $(0,1)$.

Figure 1

If $a$ and $b$ are greater than 1 and if $a, then the beta density curve is skewed to the right (positively skewed). This means that a short portion of the left side of $(0,1)$ gets assigned more probabilities and the side to the right gets assigned less probabilities. More specifically, when $a, the random quantity that is described by the beta distribution is concentrated more on the lower end of the interval $(0, 1)$ (see Figure 2). The greater the $b$ as compared to $a$, the more pronounced the skewness.

Figure 2

If $a$ and $b$ are greater than 1 and if $a>b$, the shape is the opposite (see Figure 3). The skewness is to the left (negatively skewed). Most of the probabilities are concentrated on a relatively short segment on the right side of $(0, 1)$. Then the longer side to the left is assigned much less probabilities. Hence the distribution has a longer left tail.

Figure 3

_______________________________________________________________________________________________

Distributional Quantities Based on Moments

As the above figures of density curves show, both parameters $a$ and $b$ drive the shape of the beta distribution. Several of the distributional quantities are also easily derived from $a$ and $b$. For example, the moments of the beta distribution are easily derived. As a result, many distributional quantities that are based on moments can be calculated in a straightforward fashion, e.g. coefficient of variation, skewness, kurtosis. These in turn can shed more light on the beta distribution.

\displaystyle \begin{aligned} E(X^k)&=\int_0^1 x^k \ \frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ x^{a-1} \ (1-x)^{b-1} \ dx \\&=\int_0^1 \frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ x^{a+k-1} \ (1-x)^{b-1} \ dx \\&=\frac{\Gamma(a+k)}{\Gamma(a)} \frac{\Gamma(a+b)}{\Gamma(a+b+k)} \ \int_0^1 \frac{\Gamma(a+b+k)}{\Gamma(a+k) \ \Gamma(b)} \ x^{a+k-1} \ (1-x)^{b-1} \ dx\\&=\frac{\Gamma(a+k)}{\Gamma(a)} \frac{\Gamma(a+b)}{\Gamma(a+b+k)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ k \text{ is a real number with } k>-a \\&=\frac{a(a+1) \cdots (a+k-1)}{(a+b)(a+b+1) \cdots (a+b+k-1)} \ \ \ \ \ \ \ \ \ k \text{ is a positive integer} \ \ \ \ \ \ \ \ \ \ \ \ (4) \end{aligned}

The trick is to make the integrand the density of a beta distribution (in this case beta parameters $a+k$ and $b$). Then the contents that can be taken out of the integral would be the answer. To get the final expression of $E(X^k)$, we rely on a fact of the gamma function, which is that $\Gamma(\alpha+k)=\Gamma(\alpha) \ \alpha (\alpha+1) \cdots (\alpha+k-1)$. Thus the first four moments of the beta distribution are:

$\displaystyle E(X)=\frac{a}{a+b}$

$\displaystyle E(X^2)=\frac{a(a+1)}{(a+b)(a+b+1)}$

$\displaystyle E(X^3)=\frac{a(a+1)(a+2)}{(a+b)(a+b+1)(a+b+2)}$

$\displaystyle E(X^4)=\frac{a(a+1)(a+2)(a+3)}{(a+b)(a+b+1)(a+b+2)(a+b+3)}$

Here are some calculations that can be made using the first four moments established above.

$\displaystyle \begin{array}{lllll} \text{ } &\text{ } & \text{Definition} & \text{ } & \text{Standard Beta Distribution} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Mean} &\text{ } & \text{ } & \text{ } & \displaystyle \frac{a}{a+b} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Variance} &\text{ } & E(X^2)-E(X)^2 & \text{ } & \displaystyle \frac{ab}{(a+b)^2 \ (a+b+1)} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Coefficient of Variation} &\text{ } & \displaystyle \frac{\sqrt{Var(X)}}{E(X)}=\frac{\sigma}{\mu} & \text{ } & \displaystyle \sqrt{\frac{b}{a(a+b+1)}} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Skewness} &\text{ } & \displaystyle E\biggl[\biggl(\frac{X-\mu}{\sigma}\biggr)^3\biggr] & \text{ } & \displaystyle \frac{2(b-a) \sqrt{a+b+1}}{(a+b+2) \sqrt{ab}} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Kurtosis} &\text{ } & \displaystyle E\biggl[\biggl(\frac{X-\mu}{\sigma}\biggr)^4\biggr] & \text{ } & \displaystyle \frac{6[(a-b)^2 (a+b+1)-ab(a+b+2)]}{ab(a+b+2)(a+b+3)} +3 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Excess Kurtosis} &\text{ } & \displaystyle E\biggl[\biggl(\frac{X-\mu}{\sigma}\biggr)^4\biggr]-3 & \text{ } & \displaystyle \frac{6[(a-b)^2 (a+b+1)-ab(a+b+2)]}{ab(a+b+2)(a+b+3)} \end{array}$

Note the skewness coefficient agrees with the diagrams shown above. When the parameters $a$ and $b$ are equal, there is no skewness. When $b$ is greater than $a$, the skewness is positive (skewed to the right). The skewness is more pronounced when the parameter $b$ is substantially greater than the parameter $a$. Similarly, when $b$ is smaller than $a$, the skewness is negative.

Another interesting observation is about the variance. If one of the parameters $a$ and $b$ is fixed (it does not matter which one) while the other goes to infinity, the variance will go to zero. To see this, let $a$ be fixed. In the expression for $Var(X)$, divide the numerator and denominator by $b$ and obtain:

$\displaystyle Var(X)=\frac{a}{(a+b)^2 \ (\frac{a}{b}+1+\frac{1}{b})}$

The quantity $\frac{a}{b}+1+\frac{1}{b}$ in the denominator approaches 1 and the quantity $(a+b)^2$ approaches infinity as $b \rightarrow \infty$. This means that the spread is driven by the parameters $a$ and $b$ too. The spread is narrower and narrower as one of the parameters becomes larger and larger. This phenomenon is also reflected in the above diagrams of beta density curves.

_______________________________________________________________________________________________

Other Distributional Quantities

The following is the cumulative distribution function of the beta distribution.

$\displaystyle F(x)=\int_0^x \frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ t^{a-1} \ (1-t)^{b-1} \ dt \ \ \ \ \ \ \ \ \ 0

The CDF can also be expressed as $\displaystyle F(x)=\frac{B(a,b,x)}{B(a,b)}$ where $B(a,b,x)$ is the incomplete beta function defined by:

$\displaystyle B(a,b,x)=\int_0^x t^{a-1} \ (1-t)^{b-1} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)$

The CDF as exhibited in $(5)$ has no closed form. As a result, the values of the CDF will have to be obtained thru numerical approximation or using software. Likewise, percentiles of the beta distribution can only be approximated. For the distributional quantities not discussed here (e.g. moment generating function), see the Wikepedia entry on the beta function.

_______________________________________________________________________________________________

Extending the Support

The standard beta distribution takes on values in the unit interval $(0,1)$. If the random quantity to be modeled can extend beyond the unit interval, the beta distribution can also be transformed to match the situation. Let $X$ be the standard beta distribution with parameters $a$ and $b$. Let $\theta$ be a positive number. Let $Y=\theta \ X$. The support of $Y$ is then the interval $(0, \theta)$. As a constant multiple of the standard beta $X$, the random variable $Y$ would have similar shapes and properties. The following are the density function and CDF.

\displaystyle \begin{aligned} f_Y(y)&=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ \biggl(\frac{y}{\theta}\biggr)^{a-1} \ \biggl[1-\biggl(\frac{y}{\theta}\biggr)\biggr]^{b-1} \ \frac{1}{\theta} \ \ \ \ \ \ \ \ 0

$\displaystyle F_Y(y)=\int_0^{\frac{y}{\theta}} \frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ t^{a-1} \ (1-t)^{b-1} \ dt \ \ \ \ \ \ \ \ \ 0

$\displaystyle F_Y(y)=\frac{B(a,b,\frac{y}{\theta})}{B(a,b)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8a)$

Note that $(8a)$ expresses the CDF of $Y$ using the incomplete beta function defined in $(6)$. All the distributional quantities of $X$ discussed above can be obtained for $Y$ by applying the appropriate multiplier. For example, $E(Y^k)=\theta^k E(X)$. As a result, a similar table can be obtained for the transformed beta distribution.

$\displaystyle \begin{array}{lllll} \text{ } &\text{ } & \text{Definition} & \text{ } & \text{Transformed Beta } Y=\theta X \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Mean} &\text{ } & \text{ } & \text{ } & \displaystyle \theta \ \frac{a}{a+b} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Variance} &\text{ } & E(Y^2)-E(Y)^2 & \text{ } & \displaystyle \theta^2 \ \frac{ab}{(a+b)^2 \ (a+b+1)} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Coefficient of Variation} &\text{ } & \displaystyle \frac{\sqrt{Var(Y)}}{E(Y)}=\frac{\sigma_Y}{\mu_Y} & \text{ } & \displaystyle \sqrt{\frac{b}{a(a+b+1)}} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Skewness} &\text{ } & \displaystyle E\biggl[\biggl(\frac{Y-\mu_Y}{\sigma_Y}\biggr)^3\biggr] & \text{ } & \displaystyle \frac{2(b-a) \sqrt{a+b+1}}{(a+b+2) \sqrt{ab}} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Kurtosis} &\text{ } & \displaystyle E\biggl[\biggl(\frac{Y-\mu_Y}{\sigma_Y}\biggr)^4\biggr] & \text{ } & \displaystyle \frac{6[(a-b)^2 (a+b+1)-ab(a+b+2)]}{ab(a+b+2)(a+b+3)} +3 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Excess Kurtosis} &\text{ } & \displaystyle E\biggl[\biggl(\frac{Y-\mu_Y}{\sigma_Y}\biggr)^4\biggr]-3 & \text{ } & \displaystyle \frac{6[(a-b)^2 (a+b+1)-ab(a+b+2)]}{ab(a+b+2)(a+b+3)} \end{array}$

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$

# Introducing the beta function

The gamma distribution is mathematically defined from the gamma function. This post gives a brief introduction to the beta function. The goal is to establish one property that is the basis for defining the beta distribution.

_______________________________________________________________________________________________

The Beta Function

For any positive constants $a$ and $b$, the beta function is defined to be the following integral:

$\displaystyle B(a,b)=\int_0^1 t^{a-1} \ (1-t)^{b-1} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)$

The beta function can be evaluated directly if the parameters $a$ and $b$ are not too large. For example, $B(3,2)$ is the integral $\displaystyle \int_0^1 t^2 (1-t) \ dt$, which is $1/12$. Evaluating $(0)$ in a case by case basis does not shed light on the beta function. Direct calculation can also be cumbersome (e.g. for large parameters that are integers) or challenging (e.g. for parameters $a$ and $b$ that are fractional). It turns out that the evaluation of the beta function $B(a,b)$ is based on the gamma function.

_______________________________________________________________________________________________

Connection to the Gamma Function

The remainder of the post is to establish the following value of the beta function:

$\displaystyle B(a,b)=\int_0^1 t^{a-1} \ (1-t)^{b-1} \ dt=\frac{\Gamma(a) \ \Gamma(b)}{\Gamma(a+b)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

To start the proof of $(1)$, let $X$ and $Y$ be two independent random variables such that $X$ follows a gamma distribution with shape parameter $a$ and rate parameter $\beta$ and that $Y$ follows a gamma distribution with shape parameter $b$ and rate parameter $\beta$. It does not matter what $\beta$ is, as long as it is the rate parameter for both $X$ and $Y$. Then the sum $S=X+Y$ has a gamma distribution with shape parameter $a+b$ and rate parameter $\beta$. The following is the density function for $S=X+Y$.

$\displaystyle f_S(s)=\frac{1}{\Gamma(a+b)} \ \beta^{a+b} \ s^{a+b-1} \ e^{-\beta s} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The density function of $S=X+Y$ can also be derived from the convolution formula using the density functions of $X$ and $Y$ as follows:

\displaystyle \begin{aligned} f_S(s)&=\int_0^s f_Y(s-x) \ f_X(x) \ dx \ \ \ \ \ \text{(convolution)} \\&=\int_0^s \frac{1}{\Gamma(b)} \ \beta^{b} \ (s-x)^{b-1} \ e^{-\beta (s-x)} \ \frac{1}{\Gamma(a)} \ \beta^{a} \ x^{a-1} \ e^{-\beta x} \ dx \\&=\frac{\beta^{a+b}}{\Gamma(a) \ \Gamma(b)} \ e^{-\beta s} \ \int_0^s x^{a-1} \ (s-x)^{b-1} \ dx \\&=\frac{\beta^{a+b}}{\Gamma(a) \ \Gamma(b)} \ e^{-\beta s} \ s^{a+b-1} \ \int_0^1 t^{a-1} \ (1-t)^{b-1} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

See here for more information on how to use the convolution formula. The last step in $(3)$ is obtained by a change of variable in the integral from the step immediately above it by letting $x=st$. The last step in $(3)$ must equal to $(2)$. Setting the two equal would produce the equality in $(1)$.

Note that if the function $t^{a-1} \ (1-t)^{b-1}$ is normalized by the value $B(a,b)$, it would be a density function, which is the beta distribution. The following is the density function of the beta distribution.

$\displaystyle f(x)=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ x^{a-1} \ (1-x)^{b-1}; \ \ \ \ \ \ \ \ 0

The beta distribution is further examined in the next post.

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$