The beta distribution as order statistics

The beta distribution is mathematically defined using the beta function as discussed in the previous post. The beta distribution can also arise naturally from random sampling from the uniform distribution. This natural generation of the beta distribution leads to an interesting discussion of order statistics and non-parametric inference.

_______________________________________________________________________________________________

Sampling from the Uniform Distribution

Suppose X_1,X_2,\cdots,X_n is a random sample from a continuous distribution. Rank the sample in ascending order: Y_1<Y_2< \cdots <Y_n. The statistic Y_1 is the least sample item (the minimum statistic). The statistic Y_2 is the second smallest sample item (the second order statistic), and so on. Of course, Y_n is the maximum statistic. Since we are sampling from a continuous distribution, assume that there is no chance for a tie among the sample items X_i or the order statistics Y_i. Let F(x) and f(x) be the cumulative distribution function and the density function of the continuous distribution from which the random sample is drawn. The following is the density function of the order statistic Y_i where 1 \le i \le n.

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! \ (n-i)!} \ F(y)^{i-1} \ f(y) \ [1-F(y)]^{n-i} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

To see how (1) is derived, see the discussion here and here.

When sampling from the uniform distribution on (0,1), the order statistics have the beta distribution. Suppose that the distribution from which the random sample is drawn is the uniform distribution on the unit interval (0,1). Then F(x)=x and f(x)=1 for all 0<x<1. Then the density function in (1) becomes the following:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! \ (n-i)!} \ y^{i-1} \ [1-y]^{n-i} \ \ \ \ \ \ 0<y<1 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The density in (2) is a beta density function with a=i and b=n-i+1. The following is the mean of the beta distribution described in (2).

    \displaystyle E(Y_i)=\frac{a}{a+b}=\frac{i}{n+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)
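The match between (2) and the beta density, and the mean in (3), can be checked numerically. The following is a minimal sketch using only the Python standard library. The function names, the choice n=7 and i=3, the seed, and the number of trials are all illustrative assumptions and are not part of the original discussion.

```python
import math
import random


def order_stat_density(y, i, n):
    """Density (2): i-th order statistic of n draws from uniform(0,1)."""
    coeff = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
    return coeff * y ** (i - 1) * (1 - y) ** (n - i)


def beta_density(y, a, b):
    """Beta(a, b) density written with gamma functions."""
    coeff = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coeff * y ** (a - 1) * (1 - y) ** (b - 1)


def simulated_mean(i, n, trials, rng):
    """Average of the i-th smallest value over repeated uniform samples."""
    total = 0.0
    for _ in range(trials):
        ordered = sorted(rng.random() for _ in range(n))
        total += ordered[i - 1]  # i is 1-based, list index is 0-based
    return total / trials


n, i = 7, 3                # illustrative choice
rng = random.Random(0)     # fixed seed so the run is repeatable

# Largest gap between (2) and the beta density with a=i, b=n-i+1 on a grid.
density_gap = max(
    abs(order_stat_density(k / 100, i, n) - beta_density(k / 100, i, n - i + 1))
    for k in range(1, 100)
)
mean_estimate = simulated_mean(i, n, 20000, rng)

print("largest density gap:", density_gap)
print("simulated mean:", mean_estimate, "theoretical mean:", i / (n + 1))
```

The density gap should be at the level of floating-point rounding, and the simulated mean should land close to i/(n+1) = 3/8.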

To summarize, in a random sample of size n drawn from a uniform distribution on (0,1), the ith order statistic Y_i has a beta distribution with parameters a=i and b=n-i+1. The following table shows the information when n=7.

    \displaystyle \begin{array}{cccc}
    \text{order statistic} & \text{beta parameter } a & \text{beta parameter } b & E(Y_i) \\[6pt]
    Y_1 & 1 & 7 & \displaystyle \frac{1}{8} \\[6pt]
    Y_2 & 2 & 6 & \displaystyle \frac{2}{8} \\[6pt]
    Y_3 & 3 & 5 & \displaystyle \frac{3}{8} \\[6pt]
    Y_4 & 4 & 4 & \displaystyle \frac{4}{8} \\[6pt]
    Y_5 & 5 & 3 & \displaystyle \frac{5}{8} \\[6pt]
    Y_6 & 6 & 2 & \displaystyle \frac{6}{8} \\[6pt]
    Y_7 & 7 & 1 & \displaystyle \frac{7}{8}
    \end{array}
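The rows of the table follow mechanically from a=i and b=n-i+1, so the same table can be produced for any sample size. Here is a small sketch, where the helper name order_stat_table is hypothetical. Note that Fraction reduces each mean to lowest terms, so 2/8 is displayed as 1/4.

```python
from fractions import Fraction


def order_stat_table(n):
    """Rows (i, a, b, mean) for the order statistics of n uniform(0,1) draws."""
    rows = []
    for i in range(1, n + 1):
        a, b = i, n - i + 1
        rows.append((i, a, b, Fraction(a, a + b)))  # mean a/(a+b) = i/(n+1)
    return rows


table = order_stat_table(7)
for i, a, b, mean in table:
    print(f"Y_{i}: a={a}, b={b}, E(Y_{i})={mean}")
```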

See here for a fuller discussion on the beta distribution.

_______________________________________________________________________________________________

The Non-Parametric Angle

In descriptive statistics, sample statistics are used as point estimates of population parameters. For example, the sample mean can be used as an estimate of the population mean, the sample median as an estimate of the population median, and so on. Such techniques are part of non-parametric statistics since no assumptions are made about the probability distributions of the variables being assessed.

Let’s focus on the estimation of percentiles. We do not need to assume a population probability distribution. Simply draw the sample data from the population and rank the data from smallest to largest. Use the sample item in the “middle” to estimate the population median. Likewise, the population 75th percentile is estimated by the sample item that ranks above approximately 75% of the sample items and below approximately 25% of them, and so on. In other words, order statistics can be used as estimates of population percentiles. This makes intuitive sense. The estimate is always correct from the perspective of the sample. For example, the estimate of the 75th percentile is chosen to rank above approximately 75% of the sample. But will the estimate chosen this way rank above 75% of the population? The remainder of the post shows that the answer is yes on average: the estimate is expected to rank above the appropriate percentage of the population. Thus the non-parametric approach of using order statistics as estimates of population percentiles makes mathematical sense as well. The estimate is “unbiased” in the sense that it is expected to rank correctly among the population values.

As before, X_1,X_2,\cdots,X_n is the random sample and Y_1<Y_2< \cdots <Y_n is the resulting ordered sample. We do not know the distribution from which the sample is obtained. To make the argument clear, let f(x) and F(x) be the density function and the CDF of the unknown population, respectively. We show that

    regardless of the probability distribution from which the sample is generated, the expected area under the density curve f(x) and to the left of Y_i is \displaystyle \frac{i}{n+1}.

Thus the order statistic Y_i is expected to be greater than \displaystyle \biggl(100 \times \frac{i}{n+1}\biggr)% of the population. This shows that it is mathematically justified to use the order statistics Y_1<Y_2< \cdots <Y_n as estimates of the population percentiles.

For example, if the sample size n is 11, then the middle sample item is Y_6. Then the area under the density curve of the unknown population distribution and to the left of Y_6 is expected to be 6/12 = 0.5. So Y_6 is expected to be greater than 50% of the distribution, even though the form of the distribution is unknown.
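The n=11 example can be checked by simulation. The sketch below draws samples of size 11 from two populations, evaluates the population CDF at Y_6, and averages the result. The two populations (exponential with rate 1 and standard normal), the seed, and the number of trials are illustrative assumptions. In practice the CDF is unknown, and it is used here only to verify the claim.

```python
import math
import random


def expected_area_left(i, n, draw, cdf, trials, rng):
    """Monte Carlo estimate of E[F(Y_i)] for samples of size n."""
    total = 0.0
    for _ in range(trials):
        ordered = sorted(draw(rng) for _ in range(n))
        total += cdf(ordered[i - 1])  # area to the left of the i-th order statistic
    return total / trials


def draw_exponential(rng):
    return rng.expovariate(1.0)


def exponential_cdf(x):
    return 1.0 - math.exp(-x)


def draw_normal(rng):
    return rng.gauss(0.0, 1.0)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


rng = random.Random(1)  # fixed seed so the run is repeatable
area_exponential = expected_area_left(6, 11, draw_exponential, exponential_cdf, 20000, rng)
area_normal = expected_area_left(6, 11, draw_normal, normal_cdf, 20000, rng)

print("exponential population:", area_exponential)
print("normal population:", area_normal)
```

Both averages should come out close to 6/12 = 0.5, even though the two populations have very different shapes.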

Note that F(X) has a uniform distribution on the interval (0,1) for any continuous random variable X with CDF F. So F(X_1),F(X_2),\cdots,F(X_n) is a random sample from the uniform distribution on (0,1). Furthermore, F(Y_1)<F(Y_2)< \cdots <F(Y_n) is the corresponding ordered sample since F(x), as a CDF, is an increasing function. Thus each item F(Y_i) is the ith order statistic of a uniform sample. By the result in (2), F(Y_i) has a beta distribution with a=i and b=n-i+1. Then the mean of F(Y_i) is

    \displaystyle E[F(Y_i)]=\frac{a}{a+b}=\frac{i}{n+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

This is the same as the mean in (3). Note that the area under the density curve f(x) and to the left of Y_i is exactly F(Y_i). The expected value of this area is the ratio indicated in (4). This concludes the argument: when the order statistic Y_i is used as an estimate of a population percentile, it is correct on average in that it is expected to rank above the correct percentage of the population.
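The argument rests on the fact that F(X) is uniform on (0,1). A short derivation follows, under the added assumption (not stated above) that F is continuous and strictly increasing on the support of X, so that the inverse F^{-1} exists. For 0<u<1,

    \displaystyle P[F(X) \le u]=P[X \le F^{-1}(u)]=F(F^{-1}(u))=u

which is the CDF of the uniform distribution on (0,1).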

_______________________________________________________________________________________________

Remarks

A subclass of the beta distribution (the one with positive integer parameters a and b) can be naturally generated from random sampling of the uniform distribution, as described in (2). The non-parametric approach is not only for producing point estimates. It can be used as an inference procedure as well. For example, a wildlife biologist may be interested in estimating the median weight of black bears in a certain region of Alaska. The procedure is to capture a sample of bears, sedate them and take weight measurements. Rank the sample. The middle sample item is then the estimate of the population median weight of black bears. The ordered sample can also be used to form a confidence interval for the median bear weight. For an explanation on how to form such distribution-free confidence intervals, see here.
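As a rough illustration of the interval idea, the number of sample items falling below the population median is binomial with parameters n and 1/2 when the population is continuous. The median lies between Y_i and Y_j exactly when at least i and at most j-1 items fall below it, so the coverage of the interval (Y_i, Y_j) is a binomial sum. The sketch below is only an assumption-laden example and not the procedure from the linked discussion. The function names are hypothetical, and the sample size n=11 is borrowed from the earlier example rather than from any bear data.

```python
from math import comb


def median_interval_coverage(n, i, j):
    """P(Y_i < median < Y_j) for a continuous population, with 1 <= i < j <= n."""
    # The count of items below the median is Binomial(n, 1/2);
    # the event requires that count to be between i and j-1 inclusive.
    return sum(comb(n, k) for k in range(i, j)) / 2 ** n


def narrowest_symmetric_interval(n, level):
    """Narrowest pair (i, n+1-i) whose coverage is at least `level`, or None."""
    best = None
    for i in range(1, n // 2 + 1):
        j = n + 1 - i
        if i < j and median_interval_coverage(n, i, j) >= level:
            best = (i, j)  # larger i gives a narrower interval
    return best


n = 11  # illustrative sample size
pair = narrowest_symmetric_interval(n, 0.95)
coverage = median_interval_coverage(n, *pair)
print("interval (Y_i, Y_j) with (i, j) =", pair, "coverage =", coverage)
```

With n=11 and a target level of 0.95, the search settles on (Y_2, Y_10), whose coverage is 2024/2048, about 0.988. The coverage cannot be tuned to hit 0.95 exactly because only finitely many pairs of order statistics are available.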

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}
