# The beta distribution as order statistics

The beta distribution is mathematically defined using the beta function as discussed in the previous post. The beta distribution can also arise naturally from random sampling from the uniform distribution. This natural generation of the beta distribution leads to an interesting discussion of order statistics and non-parametric inference.

_______________________________________________________________________________________________

Sampling from the Uniform Distribution

Suppose $X_1,X_2,\cdots,X_n$ is a random sample from a continuous distribution. Rank the sample in ascending order: $Y_1<Y_2<\cdots<Y_n$. The statistic $Y_1$ is the least sample item (the minimum statistic). The statistic $Y_2$ is the second smallest sample item (the second order statistic), and so on. Of course, $Y_n$ is the maximum statistic. Since we are sampling from a continuous distribution, assume that there is no chance for a tie among the sample items $X_i$ or the order statistics $Y_i$. Let $F(x)$ and $f(x)$ be the cumulative distribution function and the density function of the continuous distribution from which the random sample is drawn. The following is the density function of the order statistic $Y_i$ where $1 \le i \le n$.

$\displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! \ (n-i)!} \ F(y)^{i-1} \ f(y) \ [1-F(y)]^{n-i} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

To see how $(1)$ is derived, see the discussion here and here.
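As a quick sanity check on $(1)$, the following is a minimal sketch that evaluates the formula for a given CDF and density and confirms numerically that the result integrates to 1. The function names, the midpoint-rule integrator and the choice of an exponential population with rate 1 are assumptions made for illustration only; they are not part of the original discussion.

```python
import math

def order_stat_density(y, i, n, cdf, pdf):
    """Density of the i-th order statistic at y, following formula (1)."""
    # n! / ((i-1)! (n-i)!) is always an integer, so integer division is exact.
    coeff = math.factorial(n) // (math.factorial(i - 1) * math.factorial(n - i))
    return coeff * cdf(y) ** (i - 1) * pdf(y) * (1.0 - cdf(y)) ** (n - i)

def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over (lo, hi)."""
    width = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * width) for k in range(steps)) * width

# Hypothetical example population: exponential distribution with rate 1.
exp_cdf = lambda x: 1.0 - math.exp(-x)
exp_pdf = lambda x: math.exp(-x)

n, i = 7, 3
# The upper limit 40 stands in for infinity; the tail beyond it is negligible.
total = integrate(lambda y: order_stat_density(y, i, n, exp_cdf, exp_pdf), 0.0, 40.0)
print(total)  # should be very close to 1
```

Passing `cdf=lambda x: x` and `pdf=lambda x: 1.0` gives the uniform case discussed next.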

When sampling from the uniform distribution on $(0,1)$, the order statistics have the beta distribution. Suppose that the distribution from which the random sample is drawn is the uniform distribution on the unit interval $(0,1)$. Then $F(x)=x$ and $f(x)=1$ for all $0<x<1$. Then the density function in $(1)$ becomes the following:

$\displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! \ (n-i)!} \ y^{i-1} \ [1-y]^{n-i} \ \ \ \ \ \ 0<y<1 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The density in $(2)$ is a beta density function with $a=i$ and $b=n-i+1$. The following is the mean of the beta distribution described in $(2)$.

$\displaystyle E(Y_i)=\frac{a}{a+b}=\frac{i}{n+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$
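The identification with the beta density and the mean in $(3)$ can be checked numerically. The sketch below is only an illustration: it compares the order-statistic density for the uniform case with a beta density written in terms of the gamma function, and approximates the mean by a midpoint rule. The helper names, the evaluation point 0.3 and the step count are arbitrary assumptions.

```python
import math

def uniform_order_stat_density(y, i, n):
    """Density of the i-th order statistic of n uniform(0,1) draws."""
    coeff = math.factorial(n) // (math.factorial(i - 1) * math.factorial(n - i))
    return coeff * y ** (i - 1) * (1.0 - y) ** (n - i)

def beta_density(y, a, b):
    """Beta(a, b) density written with the gamma function."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * y ** (a - 1) * (1.0 - y) ** (b - 1)

def mean_by_midpoint(i, n, steps=20000):
    """Approximate E(Y_i) as the integral of y * density over (0, 1)."""
    width = 1.0 / steps
    mids = ((k + 0.5) * width for k in range(steps))
    return sum(y * uniform_order_stat_density(y, i, n) for y in mids) * width

n = 7
for i in range(1, n + 1):
    a, b = i, n - i + 1
    # The two densities should agree at any point in (0, 1); 0.3 is arbitrary.
    same = math.isclose(uniform_order_stat_density(0.3, i, n), beta_density(0.3, a, b))
    print(i, same, mean_by_midpoint(i, n), i / (n + 1))
```

For each $i$, the last two printed numbers should agree to several decimal places.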

To summarize, in a random sample of size $n$ drawn from a uniform distribution on $(0,1)$, the $i$th order statistic $Y_i$ has a beta distribution with parameters $a=i$ and $b=n-i+1$. The following table shows the information when $n=7$.

$\displaystyle \begin{array}{ccccccc} \text{ } &\text{ } & \text{beta parameter } a & \text{ } & \text{beta parameter } b& \text{ } & E(Y_i) \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_1 &\text{ } & 1 & \text{ } & 7 & \text{ } & \displaystyle \frac{1}{8}\\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_2 &\text{ } & 2 & \text{ } & 6 & \text{ } & \displaystyle \frac{2}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_3 &\text{ } & 3 & \text{ } & 5 & \text{ } & \displaystyle \frac{3}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_4 &\text{ } & 4 & \text{ } & 4 & \text{ } & \displaystyle \frac{4}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_5 &\text{ } & 5 & \text{ } & 3 & \text{ } & \displaystyle \frac{5}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_6 &\text{ } & 6 & \text{ } & 2 & \text{ } & \displaystyle \frac{6}{8} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ Y_7 &\text{ } & 7 & \text{ } & 1 & \text{ } & \displaystyle \frac{7}{8} \\ \end{array}$
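The last column of the table can also be approximated by simulation. The following is a rough sketch, assuming 20,000 simulated samples and a fixed seed (both arbitrary choices): it draws samples of size 7 from the uniform distribution, sorts each one, and averages each order statistic.

```python
import random

def simulate_order_stat_means(n, trials, seed=0):
    """Average each order statistic over many uniform(0,1) samples of size n."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(trials):
        ordered = sorted(rng.random() for _ in range(n))
        for idx, value in enumerate(ordered):
            totals[idx] += value
    return [t / trials for t in totals]

n = 7
means = simulate_order_stat_means(n, trials=20000)
for i, m in enumerate(means, start=1):
    print(f"Y_{i}: simulated mean {m:.3f}, i/(n+1) = {i / (n + 1):.3f}")
```

The simulated means should land close to $1/8, 2/8, \cdots, 7/8$, with small differences due to sampling noise.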

See here for a fuller discussion on the beta distribution.

_______________________________________________________________________________________________

The Non-Parametric Angle

In descriptive statistics, sample statistics are used as point estimates for population parameters. For example, the sample mean is used as an estimate of the population mean, the sample median as an estimate of the population median, and so on. Such techniques are part of non-parametric statistics since no assumptions are made about the probability distributions of the variables being assessed.

Let’s focus on estimation of percentiles. We do not need to assume a population probability distribution. Simply draw a sample from the population. Then rank the sample data from the smallest to the largest. Use the sample item in the “middle” to estimate the population median. Likewise, the population 75th percentile is estimated by the sample item that ranks above approximately 75% of the sample items and below approximately 25% of the sample items. And so on. In other words, order statistics can be used as estimates of population percentiles. This makes intuitive sense. The estimate will always be correct from the perspective of the sample. For example, the estimate of the 75th percentile is chosen to rank above approximately 75% of the sample. But will the estimate chosen this way rank above 75% of the population? The remainder of the post shows that the answer is yes. On average the estimate will rank above the appropriate percentage of the population. Thus the non-parametric approach of using order statistics as estimates of population percentiles makes mathematical sense as well. The estimate is “unbiased” in the sense that it is expected to rank correctly among the population values.

As before, $X_1,X_2,\cdots,X_n$ is the random sample and $Y_1<Y_2<\cdots<Y_n$ is the resulting ordered sample. We do not know the distribution from which the sample is obtained. To make the argument clear, let $f(x)$ be the density function and $F(x)$ be the CDF of the unknown population. We show that

regardless of the probability distribution from which the sample is generated, the expected area under the density curve $f(x)$ and to the left of $Y_i$ is $\displaystyle \frac{i}{n+1}$.

Thus the order statistic $Y_i$ is expected to be greater than $\displaystyle \biggl(100 \times \frac{i}{n+1}\biggr)$% of the population. This shows that it is mathematically justified to use the order statistics $Y_1<Y_2<\cdots<Y_n$ as estimates of the population percentiles.

For example, if the sample size $n$ is 11, then the middle sample item is $Y_6$. Then the area under the density curve of the unknown population distribution and to the left of $Y_6$ is expected to be 6/12 = 0.5. So $Y_6$ is expected to be greater than 50% of the distribution, even though the form of the distribution is unknown.

Note that $F(X)$ has a uniform distribution on the interval $(0,1)$ for any continuous random variable $X$ with CDF $F$. So $F(X_1),F(X_2),\cdots,F(X_n)$ behaves as a random sample drawn from the uniform distribution on $(0,1)$. Furthermore $F(Y_1)<F(Y_2)<\cdots<F(Y_n)$ is the corresponding ordered sample, since $F(x)$ as a CDF is an increasing function. Thus each item $F(Y_i)$ is an order statistic of a uniform sample. Recall the result described in $(2)$: the order statistic $F(Y_i)$ has a beta distribution with $a=i$ and $b=n-i+1$. Then the mean of $F(Y_i)$ is

$\displaystyle E[F(Y_i)]=\frac{a}{a+b}=\frac{i}{n+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

This is the same as the mean in $(3)$. Note that the area under the density curve $f(x)$ and to the left of $Y_i$ is exactly $F(Y_i)$. The expected value of this area is the ratio indicated in $(4)$. This concludes the argument that when the order statistic $Y_i$ is used as an estimate of a certain population percentile, it is on average a correct estimate in that it is expected to rank above the correct percentage of the population.
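The argument above can be illustrated with a population that is not uniform. The sketch below is only a demonstration under stated assumptions: the population is taken to be exponential with rate 1 (so that $F(x)=1-e^{-x}$ is known to the simulation, even though the statistician would not know it), the sample size is 11 as in the earlier example, and the number of trials and the seed are arbitrary.

```python
import math
import random

def mean_area_left_of_order_stats(n, trials, sampler, cdf, seed=0):
    """Average F(Y_i), the population area to the left of each order statistic."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(trials):
        ordered = sorted(sampler(rng) for _ in range(n))
        for idx, y in enumerate(ordered):
            totals[idx] += cdf(y)
    return [t / trials for t in totals]

# Hypothetical population: exponential with rate 1.
sampler = lambda rng: rng.expovariate(1.0)
cdf = lambda x: 1.0 - math.exp(-x)

n = 11
areas = mean_area_left_of_order_stats(n, 20000, sampler, cdf)
for i, area in enumerate(areas, start=1):
    print(f"Y_{i}: average area {area:.3f}, i/(n+1) = {i / (n + 1):.3f}")
```

The average area to the left of $Y_6$ should come out near 0.5, in line with the sample-size-11 example. Swapping in a different sampler and its matching CDF should give the same pattern.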

_______________________________________________________________________________________________

Remarks

A subclass of the beta distribution can be naturally generated from random sampling of the uniform distribution, as described in $(2)$. The non-parametric approach is not only for producing point estimates. It can be used as an inference procedure as well. For example, a wildlife biologist may be interested in estimating the median weight of black bears in a certain region in Alaska. The procedure is to capture a sample of bears, sedate the bears and then take the weight measurements. Rank the sample. The middle sample item is then the estimate of the population median weight of black bears. The ordered sample can also be used to form a confidence interval for the median bear weight. For an explanation on how to form such distribution-free confidence intervals, see here.
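To give a flavor of such an interval, the following is a minimal sketch based on a standard fact that is not derived in this post: for a continuous population, the median falls between $Y_j$ and $Y_k$ exactly when the number of sample items below the median is at least $j$ and at most $k-1$, and that number has a binomial distribution with parameters $n$ and $1/2$. The function names and the choice of a symmetric pair of ranks are assumptions for illustration; the linked discussion may construct the interval differently.

```python
from math import comb

def median_interval_coverage(n, j, k):
    """P(Y_j < median < Y_k) for a continuous population, with 1 <= j < k <= n."""
    return sum(comb(n, r) for r in range(j, k)) / 2 ** n

def symmetric_median_interval(n, level=0.95):
    """Narrowest symmetric rank pair (j, n-j+1) with coverage >= level, else None."""
    best = None
    for j in range(1, n // 2 + 1):
        k = n - j + 1
        # Coverage shrinks as j grows, so the last pair that passes is the narrowest.
        if j < k and median_interval_coverage(n, j, k) >= level:
            best = (j, k)
    return best

n = 11
ranks = symmetric_median_interval(n)
print(ranks, median_interval_coverage(n, *ranks))
```

With a sample of 11 bears, this sketch picks the ranks 2 and 10, so the interval from the second smallest to the second largest weight covers the population median with probability 2024/2048, or about 0.988.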

_______________________________________________________________________________________________
$\copyright \ 2016 - \text{Dan Ma}$