The beta distribution is mathematically defined using the beta function as discussed in the previous post. The beta distribution can also arise naturally from random sampling from the uniform distribution. This natural generation of the beta distribution leads to an interesting discussion of order statistics and non-parametric inference.
Sampling from the Uniform Distribution
Suppose $X_1, X_2, \cdots, X_n$ is a random sample from a continuous distribution. Rank the sample in ascending order: $Y_1 < Y_2 < \cdots < Y_n$. The statistic $Y_1$ is the least sample item (the minimum statistic). The statistic $Y_2$ is the second smallest sample item (the second order statistic), and so on. Of course, $Y_n$ is the maximum statistic. Since we are sampling from a continuous distribution, assume that there is no chance for a tie among the sample items $X_1, \cdots, X_n$ or the order statistics $Y_1, \cdots, Y_n$. Let $F(x)$ and $f(x)$ be the cumulative distribution function and the density function of the continuous distribution from which the random sample is drawn. The following is the density function of the order statistic $Y_i$ where $1 \le i \le n$.

$$f_{Y_i}(y) = \frac{n!}{(i-1)!\,(n-i)!}\; F(y)^{i-1}\, [1-F(y)]^{n-i}\, f(y) \qquad (1)$$
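One way to see where the density of an order statistic comes from is a heuristic counting argument, sketched here rather than proved rigorously. For $Y_i$ to land in a small interval $(y, y+dy)$, exactly $i-1$ of the $n$ sample items must fall below $y$, one must fall in the interval, and the remaining $n-i$ must fall above it. The multinomial coefficient counts the ways to assign the sample items to these three groups. The notation $f_{Y_i}$ for the density of $Y_i$ is an assumption of this sketch.

```latex
\begin{aligned}
P(y < Y_i \le y + dy)
  &\approx \frac{n!}{(i-1)!\,1!\,(n-i)!}\;
     F(y)^{i-1}\,\bigl[f(y)\,dy\bigr]\,\bigl[1-F(y)\bigr]^{n-i} \\[4pt]
\Longrightarrow\quad
f_{Y_i}(y)
  &= \frac{n!}{(i-1)!\,(n-i)!}\;
     F(y)^{i-1}\,\bigl[1-F(y)\bigr]^{n-i}\, f(y)
\end{aligned}
```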
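When sampling from the uniform distribution on $(0,1)$, the order statistics have the beta distribution. Suppose that the distribution from which the random sample is drawn is the uniform distribution on the unit interval $(0,1)$. Then $F(x) = x$ and $f(x) = 1$ for all $0 < x < 1$, and the density function of $Y_i$ given above becomes the following:

$$f_{Y_i}(y) = \frac{n!}{(i-1)!\,(n-i)!}\; y^{i-1}\, (1-y)^{n-i}, \qquad 0 < y < 1 \qquad (2)$$

The density in (2) is a beta density function with $a = i$ and $b = n-i+1$. The following is the mean of the beta distribution described in (2).

$$E[Y_i] = \frac{a}{a+b} = \frac{i}{n+1} \qquad (3)$$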
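The claim that uniform order statistics follow a beta distribution can be checked numerically. The sketch below uses only the Python standard library. The function names, the sample size $n = 5$, the rank $i = 2$, the number of trials and the seed are all assumptions chosen for illustration. It integrates the order-statistic density to confirm that it has total area 1 and mean $i/(n+1)$, then compares that with a simulated mean.

```python
import random
from math import factorial


def order_statistic_density(y, i, n):
    """Density of the i-th order statistic of n uniform(0,1) draws (a Beta(i, n-i+1) density)."""
    coeff = factorial(n) / (factorial(i - 1) * factorial(n - i))
    return coeff * y ** (i - 1) * (1 - y) ** (n - i)


def integrate_unit_interval(func, steps=10000):
    """Midpoint-rule approximation of the integral of func over (0, 1)."""
    h = 1.0 / steps
    return sum(func((k + 0.5) * h) for k in range(steps)) * h


def simulated_mean(i, n, trials=20000, seed=0):
    """Average the i-th smallest of n uniform(0,1) draws over many samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        ordered = sorted(rng.random() for _ in range(n))
        total += ordered[i - 1]  # i is 1-based, list indices are 0-based
    return total / trials


# Illustrative values (assumptions, not taken from the post).
n, i = 5, 2
area = integrate_unit_interval(lambda y: order_statistic_density(y, i, n))
mean_from_density = integrate_unit_interval(
    lambda y: y * order_statistic_density(y, i, n)
)
mean_from_simulation = simulated_mean(i, n)

print("area under density:   ", area)
print("mean from density:    ", mean_from_density)
print("mean from simulation: ", mean_from_simulation)
print("i / (n + 1):          ", i / (n + 1))
```

The first two numbers should agree with 1 and $i/(n+1)$ up to integration error, while the simulated mean should only agree up to sampling noise.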
To summarize, in a random sample of size $n$ drawn from a uniform distribution on $(0,1)$, the $i$th order statistic $Y_i$ has a beta distribution with parameters $a = i$ and $b = n-i+1$. The following table shows the information when .
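The beta parameters and the mean of each order statistic can be tabulated for any sample size. The sketch below does so for $n = 5$, a value assumed here purely for illustration and not necessarily the sample size used in the post. The function name `beta_parameters_table` is likewise hypothetical.

```python
from fractions import Fraction


def beta_parameters_table(n):
    """Return rows (i, a, b, mean) for the order statistics of n uniform(0,1) draws."""
    rows = []
    for i in range(1, n + 1):
        a, b = i, n - i + 1          # beta parameters of the i-th order statistic
        rows.append((i, a, b, Fraction(a, a + b)))  # mean a/(a+b) = i/(n+1)
    return rows


n = 5  # assumed sample size, for illustration only
table = beta_parameters_table(n)

print(f"{'i':>2} {'a':>2} {'b':>2} {'E[Y_i]':>7}")
for i, a, b, mean in table:
    print(f"{i:>2} {a:>2} {b:>2} {str(mean):>7}")
```

Exact fractions are used so that each mean prints as a reduced ratio rather than a rounded decimal.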
See here for a fuller discussion on the beta distribution.
The Non-Parametric Angle
In descriptive statistics, sample statistics are used as point estimates for population parameters. For example, the sample mean can be used as an estimate of the population mean, the sample median as an estimate of the population median, and so on. Such techniques are part of non-parametric statistics since no assumptions are made about the probability distributions of the variables being assessed.
Let’s focus on estimation of percentiles. We do not need to assume a population probability distribution. Simply draw the sample data from the population. Then rank the sample data from the smallest to the largest. Use the sample item in the “middle” to estimate the population median. Likewise, the population 75th percentile is estimated by the sample item that ranks above approximately 75% of the sample items and below approximately 25% of them. And so on. In other words, order statistics can be used as estimates of population percentiles. This makes intuitive sense. The estimate will always be correct from the perspective of the sample. For example, the estimate of the 75th percentile is chosen to rank above approximately 75% of the sample. But will the estimate chosen this way rank above 75% of the population? The remainder of the post shows that the answer is yes: on average, the estimate will rank above the appropriate percentage of the population. Thus the non-parametric approach of using order statistics as estimates of population percentiles makes mathematical sense as well. The estimate is “unbiased” in the sense that it is expected to rank correctly among the population values.
As before, $X_1, X_2, \cdots, X_n$ is the random sample and $Y_1 < Y_2 < \cdots < Y_n$ is the resulting ordered sample. We do not know the distribution from which the sample is obtained. To make the argument clear, let $f(x)$ be the density function and $F(x)$ be the CDF of the unknown population, respectively. We show that
regardless of the probability distribution from which the sample is generated, the expected area under the density curve $f(x)$ and to the left of $Y_i$ is $\frac{i}{n+1}$.
Thus the order statistic $Y_i$ is expected to be greater than $100 \cdot \frac{i}{n+1}$% of the population. This shows that it is mathematically justified to use the order statistics as estimates of the population percentiles.
For example, if the sample size is 11, then the middle sample item is $Y_6$. Then the area under the density curve of the unknown population distribution and to the left of $Y_6$ is expected to be 6/12 = 0.5. So $Y_6$ is expected to be greater than 50% of the distribution, even though the form of the distribution is unknown.
Note that $F(X)$ has a uniform distribution on the interval $(0,1)$ for any continuous random variable $X$ with CDF $F$. So $F(X_1), F(X_2), \cdots, F(X_n)$ is like a random sample drawn from the uniform distribution. Furthermore, $F(Y_1) < F(Y_2) < \cdots < F(Y_n)$ is an ordered sample from the uniform distribution, since $F$, as a CDF, is an increasing function. Thus each item $F(Y_i)$ is an order statistic of a uniform sample. Recall the earlier result for sampling from the uniform distribution on $(0,1)$: the $i$th order statistic has a beta distribution with $a = i$ and $b = n-i+1$. Hence $F(Y_i)$ has this beta distribution, and the mean of $F(Y_i)$ is

$$E[F(Y_i)] = \frac{a}{a+b} = \frac{i}{n+1},$$

which is the same as the mean of the $i$th order statistic of a uniform sample. Note that the area under the density curve $f(x)$ and to the left of $Y_i$ can be conceptually expressed as $F(Y_i)$. The expected value of this area is the ratio $\frac{i}{n+1}$ just obtained. This concludes the argument that when the order statistic $Y_i$ is used as an estimate of a certain population percentile, it is on average a correct estimate in that it is expected to rank above the correct percentage of the population.
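The distribution-free nature of this result can be illustrated by simulation. The sketch below draws samples of size 11 from an exponential distribution, which stands in for the "unknown" population. The exponential choice, the function names, the number of trials and the seed are all assumptions made for illustration. For each order statistic $Y_i$ it averages the area $F(Y_i)$ to its left and compares the average with $i/(n+1)$.

```python
import math
import random


def exponential_cdf(x, rate=1.0):
    """CDF of the exponential distribution, used here as the stand-in population."""
    return 1.0 - math.exp(-rate * x)


def mean_area_left_of_order_statistics(n, trials=20000, seed=1, rate=1.0):
    """Estimate E[F(Y_i)] for i = 1..n by averaging over many simulated samples."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(trials):
        ordered = sorted(rng.expovariate(rate) for _ in range(n))
        for idx, y in enumerate(ordered):
            totals[idx] += exponential_cdf(y, rate)  # area to the left of Y_{idx+1}
    return [total / trials for total in totals]


n = 11  # same sample size as the median example in the text
areas = mean_area_left_of_order_statistics(n)

for i, area in enumerate(areas, start=1):
    print(f"i={i:2d}  simulated E[F(Y_i)]={area:.4f}  i/(n+1)={i / (n + 1):.4f}")
```

Swapping in another continuous distribution, together with its CDF, should leave the simulated averages close to the same ratios, since the argument never uses the form of $F$.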
A subclass of the beta distribution can be naturally generated from random sampling of the uniform distribution, as described above. The non-parametric approach is not only for producing point estimates. It can be used as an inference procedure as well. For example, a wildlife biologist may be interested in estimating the median weight of black bears in a certain region in Alaska. The procedure is to capture a sample of bears, sedate the bears and then take the weight measurements. Rank the sample. The middle sample item is then the estimate of the population median weight of black bears. The ordered sample can also be used to form a confidence interval for the median bear weight. For an explanation on how to form such distribution-free confidence intervals, see here.
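As a rough sketch of how such a distribution-free interval can be formed, the code below uses the standard fact that, for a continuous population, the number of sample items below the median is binomial with parameters $n$ and $1/2$, so that $P(Y_j < m < Y_k) = \sum_{r=j}^{k-1} \binom{n}{r} (1/2)^n$. This is one common construction and not necessarily the one in the linked post. The function names, the symmetric choice $k = n+1-j$ and the 95% level are assumptions made for illustration.

```python
from math import comb


def coverage(n, j, k):
    """P(Y_j < median < Y_k) for a sample of size n from any continuous distribution."""
    # The median lies between Y_j and Y_k exactly when the number of sample
    # items below the median is at least j and at most k - 1.
    return sum(comb(n, r) for r in range(j, k)) / 2 ** n


def symmetric_median_interval(n, level=0.95):
    """Return (j, k, coverage) for the narrowest pair (Y_j, Y_{n+1-j}) whose coverage reaches `level`."""
    best = None
    for j in range(1, n // 2 + 1):
        k = n + 1 - j
        c = coverage(n, j, k)
        if c >= level:
            best = (j, k, c)  # coverage shrinks as j grows, so the last hit is the narrowest
    return best


n = 11  # same sample size as the median example in the text
j, k, c = symmetric_median_interval(n)
print(f"Use (Y_{j}, Y_{k}) as the interval; coverage = {c:.4f}")
```

Because the sample is discrete, the achieved coverage generally exceeds the requested level rather than matching it exactly.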