A catalog of parametric severity models

Various parametric continuous probability models have been presented and discussed in this blog. The number of parameters in these models ranges from one to two, and in a small number of cases, three. They are all potential candidates for severity models in insurance applications and in other actuarial applications. This post highlights these models. The list presented here is not exhaustive; it is only a brief catalog. There are other models that are also suitable for actuarial applications but are not accounted for here. However, the list is a good place to begin. This post also serves as a navigation device (the table shown below contains links to the blog posts).

A Catalog

Many of the models highlighted here are related to the gamma distribution either directly or indirectly. So the catalog starts with the gamma distribution at the top and then branches out to the other related models. Mathematically, the gamma distribution is a two-parameter continuous distribution defined using the gamma function. The gamma sub family includes the exponential distribution, the Erlang distribution and the chi-squared distribution. These are gamma distributions with certain restrictions on one or both of the gamma parameters. Other distributions are obtained by raising a distribution to a power. Others are obtained by mixing distributions.

Here’s a listing of the models. Click on the links to find out more about the distributions.

Derived From: Model

Gamma function: gamma distribution
Gamma sub families: exponential, Erlang, chi-squared
Independent sum of gamma: Erlang (sum of i.i.d. exponentials), hypoexponential (sum of exponentials with distinct means)
Exponentiation: lognormal
Raising to a power: raising exponential to a positive power (Weibull), raising exponential to a power, raising gamma to a power, raising Pareto to a power (Burr)
Burr sub families: paralogistic
Mixture: Pareto (exponential-gamma mixture), hyperexponential
Others: see the linked posts

The above table categorizes the distributions according to how they are mathematically derived. For example, the gamma distribution is derived from the gamma function. The Pareto distribution is mathematically an exponential-gamma mixture. The Burr distribution is a transformed Pareto distribution, i.e. it is obtained by raising a Pareto distribution to a positive power. Even though these distributions can be defined simply by giving the PDF and CDF, knowing their mathematical origins informs us of the specific mathematical properties of the distributions. Organizing according to the mathematical origin gives us a concise summary of the models.


Further Comments on the Table

From a mathematical standpoint, the gamma distribution is defined using the gamma function.

    \displaystyle \Gamma(\alpha)=\int_0^\infty t^{\alpha-1} \ e^{-t} \ dt

In the above integral, the argument \alpha is a positive number. The expression t^{\alpha-1} \ e^{-t} in the integrand is always positive. The area between the curve t^{\alpha-1} \ e^{-t} and the x-axis is \Gamma(\alpha). When this expression is normalized, i.e. divided by \Gamma(\alpha), it becomes a density function.

    \displaystyle f(t)=\frac{1}{\Gamma(\alpha)} \ t^{\alpha-1} \ e^{-t}

The above function f(t) is defined over all positive t. The integral of f(t) over all positive t is 1. Thus f(t) is a density function. It has only one parameter, \alpha, which is the shape parameter. Adding the scale parameter \theta makes it a two-parameter distribution. The result is called the gamma distribution. The following is the density function.

    \displaystyle f(x)=\frac{1}{\Gamma(\alpha)} \ \biggl(\frac{1}{\theta}\biggr)^\alpha \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ x>0

Both parameters \alpha and \theta are positive real numbers. The first parameter \alpha is the shape parameter and \theta is the scale parameter.
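As a quick numerical check, the following is a minimal Python sketch (using scipy, with an arbitrarily chosen shape and scale) that evaluates this density and verifies that it integrates to 1.

    # Minimal sketch: evaluate the two-parameter gamma density and check
    # that it integrates to 1. The shape and scale below are illustrative.
    from scipy.integrate import quad
    from scipy.stats import gamma

    alpha, theta = 2.5, 10.0   # shape and scale, chosen for illustration

    # scipy parameterizes the gamma distribution with a = shape and scale = theta
    pdf = lambda x: gamma.pdf(x, a=alpha, scale=theta)

    total, _ = quad(pdf, 0, float("inf"))
    print(total)               # prints approximately 1.0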

As mentioned above, many of the distributions listed in the above table are related to the gamma distribution. Some of the distributions are subfamilies of the gamma distribution. For example, when \alpha is a positive integer, the resulting distributions are called Erlang distributions (important in queuing theory). When \alpha=1, the result is the exponential distribution. When \alpha=\frac{k}{2} and \theta=2 where k is a positive integer, the result is the chi-squared distribution (the parameter k is referred to as the degrees of freedom). The chi-squared distribution plays an important role in statistics.

Taking the sum of n independent and identically distributed exponential random variables produces the Erlang distribution, a subfamily of the gamma family. Taking the sum of n independent exponential random variables with pairwise distinct means produces the hypoexponential distribution. On the other hand, a mixture of n exponential distributions produces the hyperexponential distribution.

The Pareto distribution (Pareto Type II Lomax) is the mixture of exponential distributions with gamma mixing weights. Despite the connection with the gamma distribution, the Pareto distribution is a heavy tailed distribution. Thus the Pareto distribution is suitable for modeling extreme losses, e.g. in modeling rare but potentially catastrophic losses.

As mentioned earlier, raising a Pareto distribution to a positive power generates the Burr distribution. Restricting the parameters in a Burr distribution in a certain way produces the paralogistic distribution. The table indicates these relationships in a concise way. For details, see the blog posts.

Tail Weight

Another informative way to categorize the distributions listed in the table is to look at tail weight. At first glance, all the distributions may look similar; for example, the distributions in the table are all right skewed. Upon closer look, some of the distributions put more weight (probability) on larger values. Hence some of the models are more suitable for modeling phenomena with significantly higher probabilities of large or extreme values.

When a distribution puts significantly more probability on larger values, the distribution is said to be a heavy tailed distribution (or to have a larger tail weight). In general, tail weight is a relative concept. For example, we say model A has a larger tail weight than model B (or model A has a heavier tail than model B). However, there are several ways to assess the tail weight of a given distribution. Here are four criteria.

Tail Weight Measure: What to Look For

1. Existence of moments: the existence of more positive moments indicates a lighter tailed distribution.
2. Hazard rate function: an increasing hazard rate function indicates a lighter tailed distribution.
3. Mean excess loss function: an increasing mean excess loss function indicates a heavier tailed distribution.
4. Speed of decay of survival function: a survival function that decays rapidly to zero (as compared to another distribution) indicates a lighter tailed distribution.

Existence of moments
For a positive real number k, the moment E(X^k) is defined by the integral \int_0^\infty x^k \ f(x) \ dx where f(x) is the density function of the distribution in question. If the distribution puts significantly more probability on the larger values in the right tail, this integral may not exist (may not converge) for some k. Thus the existence of the moments E(X^k) for all positive k is an indication that the distribution is light tailed.

In the above table, the only distributions for which all positive moments exist are gamma (including all gamma sub families such as exponential), Weibull, lognormal, hyperexponential, hypoexponential and beta. Such distributions are considered light tailed distributions.

If positive moments exist only up to a certain value of k, it is an indication that the distribution has a heavy right tail. All the other distributions in the table are considered heavy tailed distributions as compared to gamma, Weibull and lognormal. Consider a Pareto distribution with shape parameter \alpha and scale parameter \theta. Note that the existence of the Pareto higher moments E(X^k) is capped by the shape parameter \alpha. If the Pareto distribution is to model a random loss, and if the mean is infinite (when \alpha \le 1), the risk is uninsurable! On the other hand, when \alpha \le 2, the Pareto variance does not exist. This shows that for a heavy tailed distribution, the variance may not be a good measure of risk.
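To make the cap explicit, the following is the standard formula for the positive integer moments of the Pareto Type II (Lomax) distribution with shape parameter \alpha and scale parameter \theta (stated here without derivation).

    \displaystyle E(X^k)=\frac{\theta^k \ k!}{(\alpha-1)(\alpha-2) \cdots (\alpha-k)} \ \ \ \ \ \ \ \ k<\alpha

Thus the mean \theta/(\alpha-1) requires \alpha>1 and the second moment requires \alpha>2, which is the reason the variance fails to exist when \alpha \le 2.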

Hazard rate function
The hazard rate function h(x) of a random variable X is defined as the ratio of the density function and the survival function.

    \displaystyle h(x)=\frac{f(x)}{S(x)}

The hazard rate is called the force of mortality in a life contingency context and can be interpreted as the rate that a person aged x will die in the next instant. The hazard rate is called the failure rate in reliability theory and can be interpreted as the rate that a machine will fail at the next instant given that it has been functioning for x units of time.

Another indication of heavy tail weight is that the distribution has a decreasing hazard rate function. On the other hand, a distribution with an increasing hazard rate function is a light tailed distribution. If the hazard rate function is decreasing (over time if the random variable is a time variable), then the population dies off at a decreasing rate, hence a heavier tail for the distribution in question.

The Pareto distribution is a heavy tailed distribution since its hazard rate is h(x)=\alpha/x (Pareto Type I) or h(x)=\alpha/(x+\theta) (Pareto Type II Lomax). Both hazard rates are decreasing functions of x.

The Weibull distribution is a flexible model: when its shape parameter \tau satisfies 0<\tau<1, the Weibull hazard rate is decreasing, and when \tau>1, the hazard rate is increasing. When \tau=1, the Weibull distribution is the exponential distribution, which has a constant hazard rate.
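For reference, with shape parameter \tau and scale parameter \theta, the Weibull hazard rate takes the standard form

    \displaystyle h(x)=\frac{\tau}{\theta} \ \biggl(\frac{x}{\theta}\biggr)^{\tau-1}

which is decreasing for 0<\tau<1, constant for \tau=1 (the exponential case) and increasing for \tau>1.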

The point about decreasing hazard rate as an indication of a heavy tailed distribution has a connection with the fourth criterion. The idea is that a decreasing hazard rate means that the survival function decays to zero slowly. This point is due to the fact that the hazard rate function generates the survival function through the following.

    \displaystyle S(x)=e^{\displaystyle -\int_0^x h(t) \ dt}

Thus if the hazard rate function is decreasing in x, then the survival function will decay more slowly to zero. To see this, let H(x)=\int_0^x h(t) \ dt, which is called the cumulative hazard rate function. As indicated above, S(x)=e^{-H(x)}. If h(x) is decreasing in x, H(x) has a lower rate of increase and consequently S(x)=e^{-H(x)} has a slower rate of decrease to zero.
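As a quick worked example, applying this relationship to the Pareto Type II (Lomax) hazard rate h(x)=\alpha/(x+\theta) recovers the Pareto survival function.

    \displaystyle H(x)=\int_0^x \frac{\alpha}{t+\theta} \ dt=\alpha \ \ln \biggl(\frac{x+\theta}{\theta}\biggr) \ \ \ \ \ \ \ \ S(x)=e^{-H(x)}=\biggl(\frac{\theta}{x+\theta}\biggr)^\alpha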

In contrast, the exponential distribution has a constant hazard rate function, making it a medium tailed distribution. As explained above, any distribution having an increasing hazard rate function is a light tailed distribution.

The mean excess loss function
The mean excess loss is the conditional expectation e_X(d)=E(X-d \lvert X>d). If the random variable X represents insurance losses, mean excess loss is the expected loss in excess of a threshold conditional on the event that the threshold has been exceeded. Suppose that the threshold d is an ordinary deductible that is part of an insurance coverage. Then e_X(d) is the expected payment made by the insurer in the event that the loss exceeds the deductible.

Whenever e_X(d) is an increasing function of the deductible d, the loss X is a heavy tailed distribution. If the mean excess loss function is a decreasing function of d, then the loss X is a lighter tailed distribution.

The Pareto distribution can also be classified as a heavy tailed distribution based on an increasing mean excess loss function. For a Pareto distribution (Type I) with shape parameter \alpha and scale parameter \theta, the mean excess loss is e_X(d)=d/(\alpha-1), which is increasing in d. The mean excess loss for the Pareto Type II (Lomax) distribution is e_X(d)=(d+\theta)/(\alpha-1), which is also increasing in d. This means that the larger the deductible, the larger the expected claim if such a large loss occurs! If the underlying distribution for a random loss is Pareto, it is a catastrophic risk situation.

In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.
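As a quick check of the last statement, the memoryless property of the exponential distribution gives a constant mean excess loss: for an exponential loss with mean \theta,

    \displaystyle e_X(d)=E(X-d \lvert X>d)=E(X)=\theta

regardless of the deductible d.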

Speed of decay of the survival function to zero
The survival function S(x)=P(X>x) captures the probability of the tail of a distribution. If the survival function of a distribution decays slowly to zero (equivalently, the cdf goes slowly to one), it is another indication that the distribution is heavy tailed. This point was touched on in the discussion of the hazard rate function.

The following is a comparison of a Pareto Type II survival function and an exponential survival function. The Pareto survival function has parameters \alpha=2 and \theta=2. The two survival functions are set to have the same 75th percentile, which is x=2. The following table compares the two survival functions at selected values of x.

    \displaystyle \begin{array}{llllllll} \text{ } &x &\text{ } & \text{Pareto } S_X(x) & \text{ } & \text{Exponential } S_Y(x) & \text{ } & \displaystyle \frac{S_X(x)}{S_Y(x)} \\  \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\  \text{ } &2 &\text{ } & 0.25 & \text{ } & 0.25 & \text{ } & 1  \\    \text{ } &10 &\text{ } & 0.027777778 & \text{ } & 0.000976563 & \text{ } & 28  \\  \text{ } &20 &\text{ } & 0.008264463 & \text{ } & 9.54 \times 10^{-7} & \text{ } & 8666  \\   \text{ } &30 &\text{ } & 0.00390625 & \text{ } & 9.31 \times 10^{-10} & \text{ } & 4194304  \\  \text{ } &40 &\text{ } & 0.002267574 & \text{ } & 9.09 \times 10^{-13} & \text{ } & 2.49 \times 10^{9}  \\  \text{ } &60 &\text{ } & 0.001040583 & \text{ } & 8.67 \times 10^{-19} & \text{ } & 1.20 \times 10^{15}  \\  \text{ } &80 &\text{ } & 0.000594884 & \text{ } & 8.27 \times 10^{-25} & \text{ } & 7.19 \times 10^{20}  \\  \text{ } &100 &\text{ } & 0.000384468 & \text{ } & 7.89 \times 10^{-31} & \text{ } & 4.87 \times 10^{26}  \\  \text{ } &120 &\text{ } & 0.000268745 & \text{ } & 7.52 \times 10^{-37} & \text{ } & 3.57 \times 10^{32}  \\  \text{ } &140 &\text{ } & 0.000198373 & \text{ } & 7.17 \times 10^{-43} & \text{ } & 2.76 \times 10^{38}  \\  \text{ } &160 &\text{ } & 0.000152416 & \text{ } & 6.84 \times 10^{-49} & \text{ } & 2.23 \times 10^{44}  \\  \text{ } &180 &\text{ } & 0.000120758 & \text{ } & 6.53 \times 10^{-55} & \text{ } & 1.85 \times 10^{50}  \\  \text{ } & \text{ } \\    \end{array}

Note that at the large values, the Pareto right tail retains much more probability. This is also confirmed by the ratio of the two survival functions, which approaches infinity. Using an exponential distribution to model a Pareto random phenomenon would be a severe modeling error even though the exponential distribution may be a good model for describing the loss up to the 75th percentile (in the above comparison). It is the large right tail that is problematic (and catastrophic)!
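The comparison in the table can be reproduced numerically. The following is a minimal Python sketch that calibrates both survival functions to the same 75th percentile at x = 2 (as described above) and prints the ratio at a few of the points.

    # Minimal sketch: compare the Pareto Type II (alpha = 2, theta = 2) and
    # exponential survival functions calibrated to the same 75th percentile x = 2.
    import math

    alpha, theta = 2.0, 2.0
    lam = math.log(4) / 2            # exponential rate so that S_Y(2) = 0.25

    def pareto_sf(x):
        return (theta / (x + theta)) ** alpha

    def exp_sf(x):
        return math.exp(-lam * x)

    for x in [2, 10, 20, 30, 40]:
        sx, sy = pareto_sf(x), exp_sf(x)
        print(x, sx, sy, sx / sy)    # the ratio grows without bound as x increases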

Since the Pareto survival function and the exponential survival function have closed forms, we can also look at their ratio directly.

    \displaystyle \frac{\text{pareto survival}}{\text{exponential survival}}=\frac{\displaystyle \frac{\theta^\alpha}{(x+\theta)^\alpha}}{e^{-\lambda x}}=\frac{\theta^\alpha e^{\lambda x}}{(x+\theta)^\alpha} \longrightarrow \infty \ \text{ as } x \longrightarrow \infty

In the above ratio, the numerator has an exponential function with a positive quantity in the exponent, while the denominator has a polynomial in x. This ratio goes to infinity as x \rightarrow \infty.

In general, whenever the ratio of two survival functions diverges to infinity, it is an indication that the distribution in the numerator of the ratio has a heavier tail. When the ratio goes to infinity, the survival function in the numerator is said to decay slowly to zero as compared to the denominator.

It is important to examine the tail behavior of a distribution when considering it as a candidate for a model. The four criteria discussed here provide a crucial way to classify parametric models according to the tail weight.

\copyright 2017 – Dan Ma

Chi-squared test

The chi-squared test is a useful and versatile test. There are several interpretations of the chi-squared test, which are discussed in three previous posts. The different uses of the same test can be confusing to students. This post attempts to connect the ideas in the three previous posts and to supplement the previous discussions.

The chi-squared test is based on the chi-squared statistic, which is a measure of the magnitude of the differences between the observed counts and the expected counts in an experimental design that involves one or more categorical variables. The null hypothesis is the assumption that the observed counts and the expected counts agree (any differences are due to chance). A large value of the chi-squared statistic gives evidence for the rejection of the null hypothesis.

The chi-squared test is also simple to use. The chi-squared statistic has an approximate chi-squared distribution, which makes it easy to evaluate the sample data. The chi-squared test is included in various software packages. For applications with a small number of categories, the calculation can even be done with a hand-held calculator.

_______________________________________________________________________________________________

The Goodness-of-Fit Test and the Test of Homogeneity

The three interpretations of the chi-squared test have been discussed in these posts: goodness-of-fit test, test of homogeneity and test of independence.

The three different uses of the test as discussed in the three previous posts can be kept straight by having a firm understanding of the underlying experimental design.

For the goodness-of-fit test, there is only one population involved. The experiment measures one categorical variable on one population. Thus only one sample is used in applying the chi-squared test. The one-sample data produce the observed counts for the categorical variable in question. Let's say the variable has k cells. Then there are k observed counts. The expected counts for the k cells come from a hypothesized distribution of the categorical variable. The chi-squared statistic is then the sum over the k cells of the squared differences between the observed and expected counts, each normalized by dividing by the expected count. Essentially the hypothesized distribution is the null hypothesis. More specifically, the null hypothesis is the statement that the cell probabilities are derived from the hypothesized distribution.

As a quick example, we may want to answer the question of whether a given die is a fair die. We observe n rolls of the die and classify the rolls into 6 cells (the values 1 to 6). The null hypothesis is that the values of the die follow a uniform distribution. Another way to state the hypothesis is that each cell probability is 1/6. Another example is testing the hypothesis that the claim frequency of a group of insured drivers follows a Poisson distribution; the cell probabilities are then calculated based on the assumption of a Poisson distribution. In short, the goodness-of-fit test is used to test whether the observed counts for one categorical variable come from (or fit) a hypothesized distribution. See Example 1 and Example 2 in the post on the goodness-of-fit test.
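The die example can be carried out in a few lines of Python. The following is a minimal sketch using scipy.stats.chisquare; the observed counts are hypothetical and chosen purely for illustration.

    # Minimal sketch: chi-squared goodness-of-fit test for a fair die.
    # The observed counts below are hypothetical, for illustration only.
    from scipy.stats import chisquare

    observed = [95, 110, 88, 104, 99, 104]      # hypothetical counts from 600 rolls
    expected = [sum(observed) / 6] * 6          # fair die: each cell probability 1/6

    chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
    print(chi2, p_value)    # a large p-value gives no evidence against a fair die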

In the test of homogeneity, the focus is to compare two or more populations (or two or more subpopulations of a population) on the same categorical variable, i.e. whether the categorical variable in question follows the same distribution across the different populations. For example, do two different groups of insured drivers exhibit the same claim frequency rates? Do adults with different educational attainment levels have the same proportions of current smokers/former smokers/never smokers? Are political affiliations similar across racial/ethnic groups? In this test, the goal is to determine whether the cells of the categorical variable have the same proportions across the populations, hence the name test of homogeneity. In the experiment, researchers sample each population (or group) separately on the categorical variable in question. Thus there are multiple samples (one for each group) and the samples are independent.

In the test of homogeneity, the calculation of the chi-squared statistic would involve adding up the squared differences of the observed counts and expected counts for the multiple samples. For illustration, see Example 1 and Example 2 in the post on test of homogeneity.

_______________________________________________________________________________________________

Test of Independence

The test of independence can be confused with the test of homogeneity. It is possible that the objectives for both tests are similar. For example, a test of hypothesis might seek to determine whether the proportions of smoking statuses (current smoker, former smoker and never smoker) are the same across the groups with different education levels. This sounds like a test of homogeneity since it seeks to determine whether the distribution of smoking status is the same across the different groups (levels of educational attainment). However, a test of independence can also have this same objective.

The difference between the test of homogeneity and the test of independence is one of experimental design. In the test of homogeneity, the researchers sample each group (or population) separately. For example, they would sample individuals from groups with various levels of education separately and classify the individuals in each group by smoking status. The chi-squared test to use in this case is the test of homogeneity. In this experimental design, the experimenter might sample 1,000 individuals who are not high school graduates, 1,000 individuals who are high school graduates, 1,000 individuals who have some college and so on. Then the experimenter would compare the distributions of smoking status across the different samples.

An experimenter using a test of independence might try to answer the same question but is proceeding in a different way. The experimenter would sample the individuals from a given population and observe two categorical variables (e.g. level of education and smoking status) for the same individual.

Then the researchers would classify each individual into a cell in a two-way table. See Table 3b in the previous post on the test of independence. The values of the level of education go across the columns of the table (the column variable). The values of the smoking status go down the rows (the row variable). Each individual in the sample belongs to one cell in the table according to the values of the row and column variables. The two-way table helps determine whether the row variable and the column variable are associated in the given population. In other words, the experimenter is interested in finding out whether one variable explains the other (or one variable affects the other).

For the sake of ease in the discussion, let’s say the column variable (level of education) is the explanatory variable. The experimenter would then be interested in whether the conditional distribution of the row variable (smoking status) is similar or different across the columns. If the conclusion is that it is similar, the column variable does not affect the row variable (the two variables are not associated). This would also mean that the distributions of smoking status are the same across the different levels of education (a conclusion of homogeneity).

If the conclusion is that the conditional distribution of the row variable (smoking status) is different across the columns, then the column variable does affect the row variable (the two variables are associated). This would also mean that the distributions of smoking status are different across the different levels of education (a conclusion of non-homogeneity).

The test of independence and the test of homogeneity are based on two different experimental designs. Hence their implementations of the chi-squared statistic are different. However, each design can be structured to answer similar questions.

_______________________________________________________________________________________________
\copyright 2017 – Dan Ma

The Chi-Squared Distribution, Part 3c

This post is part of a series of posts on chi-squared distribution. Three of the posts, this one and the previous two posts, deal with inference involving categorical variables. This post discusses the chi-squared test of independence. The previous two posts are on chi-squared goodness of fit test (part 3a) and on chi-squared test of homogeneity (part 3b).

The first post in the series is an introduction on chi-squared distribution. The second post is on several inference procedures that are based on chi-squared distribution that deal with quantitative measurements.

The three-part discussion in part 3a, part 3b and part 3c presents three different interpretations of the chi-squared test. Refer to the other two posts for the other two interpretations. We also make remarks below on the three chi-squared tests.

_______________________________________________________________________________________________

Two-Way Tables

In certain analyses of count data on categorical variables, we are interested in whether two categorical variables are associated with (or related to) one another. In such an analysis, it is useful to represent the count data in a two-way table or contingency table. The following gives two examples using survival data from the ocean liner Titanic.

    Table 1 – Survival status of the passengers in the Titanic by gender group
    \displaystyle \begin{array} {ccccccccc} \text{Survival} & \text{ } & \text{Women}  & \text{ } & \text{Children}  & \text{ } & \text{Men} & \text{ } & \text{ } \\  \text{Status} & \text{ } & \text{ }  & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Yes} & \text{ } & 304  & \text{ } & 56 & \text{ } & 130  & \text{ } & \text{ }      \\ \text{No} & \text{ } & 112  & \text{ } & 56 & \text{ } & 638  & \text{ } & \text{ }       \end{array}

    Table 2 – Survival status of the passengers in the Titanic by passenger class
    \displaystyle \begin{array} {ccccccccc} \text{Survival} & \text{ } & \text{First}  & \text{ } & \text{Second}  & \text{ } & \text{Third} & \text{ } & \text{ } \\  \text{Status} & \text{ } & \text{Class}  & \text{ } & \text{Class}  & \text{ } & \text{Class} & \text{ } & \text{ }    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Yes} & \text{ } & 200  & \text{ } & 117 & \text{ } & 172  & \text{ } & \text{ }      \\ \text{No} & \text{ } & 119  & \text{ } & 152 & \text{ } & 527  & \text{ } & \text{ }       \end{array}

Table 1 shows the count data for the survival status (survived or not survived) and gender of the passengers in the one and only voyage of the passenger liner Titanic. Table 2 shows the survival status and the passenger class of the passengers of Titanic. Both tables are contingency tables (or two-way tables) since each table relates two categorical variables – survival status and gender in Table 1 and survival status and passenger class in Table 2. Each table summarizes the categorical data by counting the number of observations that fall into each group for the two variables. For example, Table 1 shows that there were 304 women passengers who survived. In both tables, the survival status is the row variable. The column variable is gender (Table 1) or passenger class (Table 2).

It is clear from both tables that most of the deaths were either men or third class passengers. This observation is not surprising because of the mentality of “Women and Children First” and the fact that first class passengers were better treated than the other classes. Thus we can say that there is an association between gender and survival and an association between passenger class and survival in the sinking of Titanic. More specifically, the survival rates for women and children were much higher than for men, and the survival rate for first class passengers was much higher than for the other two classes.

When a study measures two categorical variables on each individual in a random sample, the results can be summarized in a two-way table, which can then be used for studying the relationship between the two variables. As a first step, the joint distribution, marginal distributions and conditional distributions are analyzed. Table 1 is analyzed here. Table 2 is analyzed here. Though the Titanic survival data show a clear association between survival and gender (and passenger class), the discussion of the Titanic survival data in these two previous posts is still very useful. These two posts demonstrate how to analyze the relationship between two categorical variables by looking at the marginal distributions and conditional distributions.

This post goes one step further by analyzing the relationship in a two-way table using the chi-squared test. The method discussed here is called the chi-squared test of independence; the test determines whether there is a relationship between the two categorical variables displayed in a two-way table.

_______________________________________________________________________________________________

Test of Independence

We demonstrate the test of independence by working through the following example (Example 1). When describing how the method works in general, the two-way table has r rows and c columns (not including the total row and the total column).

Example 1
The following table shows the smoking status and the level of education of residents (aged 25 or over) of a medium size city on the East coast of the United States based on a survey conducted on a random sample of 1,078 adults aged 25 or older.

    Table 3 – Smoking Status and Level of Education
    \displaystyle \begin{array} {lllllllllll} \text{Smoking} & \text{ } & \text{Did Not}  & \text{ } & \text{High}  & \text{ } & \text{Some} & \text{ } & \text{Bachelor's} \\  \text{Status} & \text{ } & \text{Finish}  & \text{ } & \text{School}  & \text{ } & \text{College or} & \text{ } & \text{Degree}   \\ \text{ } & \text{ } & \text{High}  & \text{ } & \text{Graduate} & \text{ } & \text{Associate}  & \text{ } & \text{or}  \\ \text{ } & \text{ } & \text{School}  & \text{ } & \text{No College} & \text{ } & \text{Degree}  & \text{ } & \text{Higher}      \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Current} & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }  \\ \text{Smoker} & \text{ } & 177  & \text{ } & 141 & \text{ } & 48  & \text{ } & 35    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Former} & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }  \\ \text{Smoker} & \text{ } & 89  & \text{ } & 70 & \text{ } & 26  & \text{ } & 36    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Never} & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }  \\ \text{Smoker} & \text{ } & 210  & \text{ } & 146 & \text{ } & 47  & \text{ } & 53         \end{array}

The researcher is interested in finding out whether smoking status is associated with the level of education among the adults in this city. Do the data in Table 3 provide sufficient evidence to indicate that smoking status is affected by the level of education among the adults in this city?

The two categorical variables in this example are smoking status (current, former and never smoker) and level of education (the 4 categories listed in the columns in Table 3). The researcher views level of education as an explanatory variable and smoking status as the response variable. Table 3 has 3 rows and 4 columns. Thus there are 12 cells in the table. It is helpful to obtain the total for each row and the total for each column.

    Table 3a – Smoking Status and Level of Education
    \displaystyle \begin{array} {lllllllllll} \text{Smoking} & \text{ } & \text{Did Not}  & \text{ } & \text{High}  & \text{ } & \text{Some} & \text{ } & \text{Bachelor's} & \text{ } & \text{Total}\\  \text{Status} & \text{ } & \text{Finish}  & \text{ } & \text{School}  & \text{ } & \text{College or} & \text{ } & \text{Degree}   \\ \text{ } & \text{ } & \text{High}  & \text{ } & \text{Graduate} & \text{ } & \text{Associate}  & \text{ } & \text{or}  \\ \text{ } & \text{ } & \text{School}  & \text{ } & \text{No College} & \text{ } & \text{Degree}  & \text{ } & \text{Higher}    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Current} & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }  \\ \text{Smoker} & \text{ } & 177  & \text{ } & 141 & \text{ } & 48  & \text{ } & 35 & \text{ } & 401      \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Former} & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }  \\ \text{Smoker} & \text{ } & 89  & \text{ } & 70 & \text{ } & 26  & \text{ } & 36 & \text{ } & 221      \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Never} & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }  \\ \text{Smoker} & \text{ } & 210  & \text{ } & 146 & \text{ } & 47  & \text{ } & 53 & \text{ } & 456    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }  \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }  \\ \text{Total} & \text{ } & 476  & \text{ } & 357 & \text{ } & 121  & \text{ } & 124 & \text{ } & 1078         \end{array}

The null hypothesis H_0 in a two-way table is the statement that there is no association between the row variable and the column variable, i.e., the row variable and the column variable are independent. The alternative hypothesis H_1 states that there is an association between the two variables. For Table 3, the null hypothesis is that there is no association between the smoking status and the level of education among the adults in this city.

In Table 3 or 3a, each column is a distribution of smoking status (one for each level of education). Another way to state the null hypothesis is that the distributions of smoking status are the same across the four levels of education. The alternative hypothesis is that the distributions are not all the same.

Our goal is to use the chi-squared statistic to evaluate the data in the two-way table. The chi-squared statistic is based on the squared differences between the observed counts in Table 3a and the expected counts derived under the null hypothesis. The following shows how to calculate the expected count for each cell assuming the null hypothesis.

    \displaystyle \text{Expected Cell Count}=\frac{\text{Row Total } \times \text{ Column Total}}{n}

The n in the denominator is the total number of observations in the two-way table. For Table 3a, n= 1078. For the cell of “Current Smoker” and “Did not Finish High School”, the expected count would be 476 x 401 / 1078 = 177.06. The other expected counts are calculated accordingly and are shown in Table 3b with the expected counts in parentheses.

    Table 3b – Smoking Status and Level of Education
    \displaystyle \begin{array} {lllllllllll} \text{Smoking} & \text{ } & \text{Did Not}  & \text{ } & \text{High}  & \text{ } & \text{Some} & \text{ } & \text{Bachelor's} & \text{ } & \text{Total}\\  \text{Status} & \text{ } & \text{Finish}  & \text{ } & \text{School}  & \text{ } & \text{College or} & \text{ } & \text{Degree}   \\ \text{ } & \text{ } & \text{High}  & \text{ } & \text{Graduate} & \text{ } & \text{Associate}  & \text{ } & \text{or}  \\ \text{ } & \text{ } & \text{School}  & \text{ } & \text{No College} & \text{ } & \text{Degree}  & \text{ } & \text{Higher}    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Current} & \text{ } & 177  & \text{ } & 141 & \text{ } & 48  & \text{ } & 35 & \text{ } & 401  \\ \text{Smoker} & \text{ } & (177.06)  & \text{ } & (132.80) & \text{ } & (45.01)  & \text{ } & (46.13) & \text{ } & (401)      \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Former} & \text{ } & 89  & \text{ } & 70 & \text{ } & 26  & \text{ } & 36 & \text{ } & 221  \\ \text{Smoker} & \text{ } & (97.58)  & \text{ } & (73.19) & \text{ } & (24.81)  & \text{ } & (25.42) & \text{ } & (221)      \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Never} & \text{ } & 210  & \text{ } & 146 & \text{ } & 47  & \text{ } & 53 & \text{ } & 456  \\ \text{Smoker} & \text{ } & (201.35)  & \text{ } & (151.01) & \text{ } & (51.18)  & \text{ } & (52.45) & \text{ } & (456)    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }  \\ \text{Total} & \text{ } & 476  & \text{ } & 357 & \text{ } & 121  & \text{ } & 124 & \text{ } & 1078  \\ \text{ } & \text{ } & (476)  & \text{ } & (357) & \text{ } & (121)  & \text{ } & (124) & \text{ } & (1078)       \end{array}

How do we know if the formula for the expected cell count is correct? Look at the right margin of Table 3a or 3b (Total column). The counts 401, 221 and 456 as percentages of the total 1078 are 37.20%, 20.50% and 42.30%. If the null hypothesis that there is no relation between level of education and smoking status is true, we would expect these overall percentages to apply to each level of education. For example, there are 476 adults in the sample who did not complete high school. We would expect 37.20% of them to be current smokers, 20.50% of them to be former smokers and 42.30% of them to be never smokers if indeed smoker status is not affected by level of education. In particular, 37.20% x 476 = 177.072 (the same as 177.06 ignoring the rounding difference). Note that 37.20% is the fraction 401/1078. As a result, 37.20% x 476 is identical to 476 x 401 / 1078. This confirms the formula stated above.

We now compute the chi-squared statistic. From a two-way table perspective, the chi-squared statistic is a measure of how much the observed cell counts deviate from the expected cell counts. The following formula makes this idea more explicit.

    \displaystyle \chi^2=\sum \frac{(\text{Observed Count}-\text{Expected Count})^2}{\text{Expected Count}}

The sum in the formula is over all r \times c cells in the 2-way table where r is the number of rows and c is the number of columns. Note that the calculation of the chi-squared statistic is based on the expected counts discussed above. The expected counts are based on the assumption of the null hypothesis. Thus the chi-squared statistic is based on assuming the null hypothesis.

When the observed counts and the expected counts are very different, the value of the chi-squared statistic will be large. Thus large values of the chi-squared statistic provide evidence against the null hypothesis. In order to evaluate the observed data as captured in the chi-squared statistic, we need to have information about the sampling distribution of the chi-squared statistic as defined here.

If the null hypothesis H_0 is true, the chi-squared statistic defined above has an approximate chi-squared distribution with (r-1)(c-1) degrees of freedom. Recall that r is the number of rows and c is the number of columns in the two-way table (not counting the total row and total column).

Thus the null hypothesis H_0 is rejected if the value of the chi-squared statistic exceeds the critical value, which is the point of the chi-squared distribution (with the appropriate df) whose upper tail area is \alpha (the level of significance). The p-value approach can also be used. The p-value is the probability that a chi-squared random variable (with the appropriate df) is more extreme than the observed chi-squared statistic, assuming H_0.

The calculation of the chi-squared statistic for Table 3b is best done in software. Performing the calculation in Excel gives \chi^2= 9.62834 with df = (3-1) x (4-1) = 6. At level of significance \alpha= 0.05, the critical value is 12.59. Thus the chi-squared statistic is not large enough to reject the null hypothesis. So the sample results do not provide enough evidence to conclude that smoking status is affected by level of education.

The p-value is 0.1412. Since this is a large p-value, we have the same conclusion that there is not sufficient evidence to reject the null hypothesis.

Both the critical value and the p-value are evaluated using the following functions in Excel.

    Critical Value = CHISQ.INV.RT(level of significance, df)
    p-value = 1 – CHISQ.DIST(test statistic, df, TRUE)
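The same calculation can be done in Python. The following minimal sketch applies scipy.stats.chi2_contingency to the observed counts in Table 3a; the statistic, degrees of freedom, p-value and expected counts should agree with the values reported above.

    # Minimal sketch: chi-squared test of independence for Table 3a.
    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([
        [177, 141, 48, 35],   # current smoker
        [ 89,  70, 26, 36],   # former smoker
        [210, 146, 47, 53],   # never smoker
    ])

    chi2, p_value, df, expected = chi2_contingency(observed)
    print(chi2, df, p_value)  # approximately 9.63 with df = 6 and p-value 0.14
    print(expected)           # matches the expected counts shown in Table 3b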

_______________________________________________________________________________________________

Another Example

Example 2
A researcher wanted to determine whether race/ethnicity is associated with political affiliations among the residents in a medium size city in the Eastern United States. The following table represents the ethnicity and the political affiliations from a random sample of adults in this city.

    Table 4 – Race/Ethnicity and Political Affiliations
    \displaystyle \begin{array} {ccccccccc} \text{Political} & \text{ } & \text{White}  & \text{ } & \text{Black}  & \text{ } & \text{Hispanic} & \text{ } & \text{Asian} \\  \text{Affiliation} & \text{ } & \text{ }  & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Independent} & \text{ } & 142  & \text{ } & 39 & \text{ } & 36  & \text{ } & 120      \\ \text{Democratic} & \text{ } & 130  & \text{ } & 62 & \text{ } & 46  & \text{ } & 116    \\ \text{Republican} & \text{ } & 165  & \text{ } & 38 & \text{ } & 29  & \text{ } & 95           \end{array}

Use the chi-squared test as described above to test whether political affiliation is affected by race and ethnicity.

The following table shows the calculation of the expected counts (in parentheses) and the total counts.

    Table 4a – Race/Ethnicity and Political Affiliations
    \displaystyle \begin{array} {ccccccccccc} \text{Political} & \text{ } & \text{White}  & \text{ } & \text{Black}  & \text{ } & \text{Hispanic} & \text{ } & \text{Asian} & \text{ } & \text{Total}\\  \text{Affiliation} & \text{ } & \text{ }  & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Independent} & \text{ } & 142  & \text{ } & 39 & \text{ } & 36  & \text{ } & 120 & \text{ } & 337  \\ \text{ } & \text{ } & (144.67)  & \text{ } & (46.01) & \text{ } & (36.75)  & \text{ } & (109.57) & \text{ } & (337)    \\ \text{Democratic} & \text{ } & 130  & \text{ } & 62 & \text{ } & 46  & \text{ } & 116 & \text{ } & 354  \\ \text{ } & \text{ } & (151.96)  & \text{ } & (48.34) & \text{ } & (38.60)  & \text{ } & (115.10) & \text{ } & (354)    \\ \text{Republican} & \text{ } & 165  & \text{ } & 38 & \text{ } & 29  & \text{ } & 95 & \text{ } & 327  \\ \text{ } & \text{ } & (140.37)  & \text{ } & (44.65) & \text{ } & (35.66)  & \text{ } & (106.32) & \text{ } & (327)    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ \text{Total} & \text{ } & 437  & \text{ } & 139 & \text{ } & 111  & \text{ } & 331 & \text{ } & 1018  \\ \text{ } & \text{ } & (437)  & \text{ } & (139) & \text{ } & (111)  & \text{ } & (331) & \text{ } & (1018)       \end{array}

The value of the chi-squared statistic, computed in Excel, is 18.35, with df = (3-1) x (4-1) = 6. The critical value at level of significance 0.01 is 16.81. Thus we reject the null hypothesis that there is no relation between race/ethnicity and political party affiliation. The two-way table provides evidence that political affiliation is affected by race/ethnicity.

The p-value is 0.0054. This is the probability of obtaining a calculated value of the chi-squared statistic that is 18.35 or greater (assuming the null hypothesis). Since this probability is so small, it is unlikely that the large chi-squared value of 18.35 occurred by chance alone.

One interesting point that should be made is that the chi-squared test of independence does not provide insight into the nature of the association between the row variable and the column variable. To help clarify the association, it will be helpful to conduct analysis using marginal distributions and conditional distributions (as discussed here and here for the Titanic survival data).

_______________________________________________________________________________________________

Remarks

Some students may confuse the test of independence discussed here with the chi-squared test of homogeneity discussed in this previous post. Both tests can be used to test whether the distributions of the row variable are the same across the columns.

Bear in mind that the test of independence as discussed here is a way to test whether two categorical variables (one on the rows and one on the columns) are associated with one another in a population. We discuss two examples here – level of education and smoking status in Example 1 and race/ethnicity and political affiliation in Example 2. In both cases, we want to see if one of the variables is affected by the other.

The test of homogeneity is a way to test whether two or more subgroups in a population follow the same distribution of a categorical variable. For example, do adults with different educational attainment levels have the same proportions of current smokers/former smokers/never smokers? For example, do adults in different racial groups have different proportions of independents, Democrats and Republicans?

The examples cited for the test of homogeneity may seem to be the same examples we worked through for the test of independence. However, the two tests are indeed different. The difference is subtle; it lies basically in the way the study is designed.

For the test of independence to be used, the observational units are collected at random from a population and two categorical variables are observed for each unit. Hence the results can be summarized in a two-way table. For the test of homogeneity, the data are collected by random sampling from each subgroup separately. If Example 2 were to use a test of homogeneity, the study would have to sample each racial group separately (say 1,000 whites, 1,000 blacks and so on) and then compare the proportions of party affiliations across the racial groups. For Example 2 to work as a test of independence as discussed here, the study would have to observe a random sample of adults and record the race/ethnicity and party affiliation of each unit.

Another chi-squared test is the goodness-of-fit test, discussed here. This test is a way of testing whether an observed categorical dataset comes from a hypothesized distribution (e.g. a Poisson distribution).

All three tests use the same chi-squared statistic, but they are not the same test.

_______________________________________________________________________________________________

\copyright \ 2017 - \text{Dan Ma}

The Chi-Squared Distribution, Part 3b

This post is a continuation of the previous post (Part 3a) on the chi-squared test and is also part of a series of posts on the chi-squared distribution. The first post (Part 1) is an introduction to the chi-squared distribution. The second post (Part 2) is on the chi-squared distribution as a mathematical tool for inference involving quantitative variables. Part 3, which focuses on inference on categorical variables using Pearson's chi-squared statistic, is broken up into three posts. Part 3a is an introduction to the chi-squared test statistic and explains how to perform the chi-squared goodness-of-fit test. Part 3b (this post) focuses on using the chi-squared statistic to compare several populations (test of homogeneity). Part 3c (the next post) focuses on the test of independence.

_______________________________________________________________________________________________

Comparing Two Distributions

The interpretation of the chi-squared test discussed in this post is called the chi-squared test of homogeneity. In this post, we show how the chi-squared statistic is employed to test whether the cell probabilities for certain categories are identical across several populations. We start by examining the two-population case, which extends fairly easily to the case of more than two populations.

Suppose that a multinomial experiment can result in k distinct outcomes. Suppose that the experiment is performed two times with the two samples drawn from two different populations. Let p_{1,j} be the probability that the outcome in the first experiment falls into the jth category (or cell j) and let p_{2,j} be the probability that the outcome in the second experiment falls into the jth category (or cell j) where j=1,2,\cdots,k. Furthermore, suppose that there are n_1 and n_2 independent multinomial trials in the first experiment and the second experiment, respectively.

We are interested in the random variables Y_{1,1}, Y_{1,2}, \cdots, Y_{1,k} and the random variables Y_{2,1}, Y_{2,2}, \cdots, Y_{2,k} where Y_{1,j} is the number of trials in the first experiment whose outcomes fall into cell j and Y_{2,j} is the number of trials in the second experiment whose outcomes fall into cell j. Then the sampling distribution of each of the following

    \displaystyle \sum \limits_{j=1}^k \frac{(Y_{1,j}-n_1 \ p_{1,j})^2}{n_1 \ p_{1,j}}

    \displaystyle \sum \limits_{j=1}^k \frac{(Y_{2,j}-n_2 \ p_{2,j})^2}{n_2 \ p_{2,j}}

has an approximate chi-squared distribution with k-1 degrees of freedom (discussed here). Because the two experiments are independent, the following sum

    \displaystyle \sum \limits_{j=1}^k \frac{(Y_{1,j}-n_1 \ p_{1,j})^2}{n_1 \ p_{1,j}}+\sum \limits_{j=1}^k \frac{(Y_{2,j}-n_2 \ p_{2,j})^2}{n_2 \ p_{2,j}}

has an approximate chi-squared distribution with k-1+k-1=2k-2 degrees of freedom. When p_{1,j} and p_{2,j}, j=1,2,\cdots,k, are unknown, we wish to test the following hypothesis.

    H_0: p_{1,1}=p_{2,1}, p_{1,2}=p_{2,2}, \cdots, p_{1,j}=p_{2,j}, \cdots, p_{1,k}=p_{2,k}

In other words, we wish to test the hypothesis that the cell probabilities associated with the two independent experiments are equal. Since the cell probabilities are generally unknown, we can use sample data to estimate p_{1,j} and p_{2,j}. How do we do that? If the null hypothesis H_0 is true, then the two independent experiments can be viewed as one combined experiment. Then the following ratio

    \displaystyle \hat{p}_j=\frac{Y_{1,j}+Y_{2,j}}{n_1+n_2}

is the sample frequency of the event corresponding to cell j, j=1,2,\cdots,k. Furthermore, we only have to estimate p_{1,j} and p_{2,j} using \hat{p}_j for j=1,2,\cdots,k-1 since the estimator of p_{1,k} and p_{2,k} is 1-\hat{p}_1-\hat{p}_2-\cdots-\hat{p}_{k-1}. With all these in mind, the following is the test statistic we will need.

    \displaystyle \sum \limits_{j=1}^k \frac{(Y_{1,j}-n_1 \ \hat{p}_j)^2}{n_1 \ \hat{p}_j}+\sum \limits_{j=1}^k \frac{(Y_{2,j}-n_2 \ \hat{p}_j)^2}{n_2 \ \hat{p}_j}

Since k-1 parameters are estimated, the degrees of freedom of this test statistic is obtained by subtracting k-1 from 2k-2. Thus the degrees of freedom is 2k-2-(k-1)=k-1. We test the null hypothesis H_0 against all alternatives using the upper tailed chi-squared test. We use two examples to demonstrate how this procedure is done.

Example 1
A Million Random Digits with 100,000 Normal Deviates is a book of random numbers published by the RAND Corporation in 1955. It contains 1,000,000 random digits, was an important work in statistics and was used extensively for random number generation in the 20th century. A typical way to pick random numbers from the book is to randomly select a page and then randomly select a point on that page (row and column), then read off the random digits from that point (going down the column and then continuing with the next columns) until the desired number of digits is obtained. We selected 1,000 random digits in this manner from the book and compared them with 1,000 random digits generated in Excel using the Rand() function. The following table shows the frequency distributions of the digits from the two sources. In the following table, MRD = Million Random Digits. Test whether the distributions of digits are the same between MRD and Excel.

    \displaystyle \begin{array} {rrrrr} \text{Digit} & \text{ } & \text{Frequency (MRD)}  & \text{ } & \text{Frequency (Excel)}    \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ }  \\ 0 & \text{ } & 90  & \text{ } & 93  \\ 1 & \text{ } & 112  & \text{ } & 96  \\ 2 & \text{ } & 104  & \text{ } & 105  \\ 3 & \text{ } & 102  & \text{ } & 95  \\ 4 & \text{ } & 84  & \text{ } & 112  \\ 5 & \text{ } & 110  & \text{ } & 103  \\ 6 & \text{ } & 106  & \text{ } & 98  \\ 7 & \text{ } & 101  & \text{ } & 114  \\ 8 & \text{ } & 101  & \text{ } & 106  \\ 9 & \text{ } & 90  & \text{ } & 78  \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ }  \\ \text{Total} & \text{ } & 1000 & \text{ } & 1000     \end{array}

The frequencies of the digits are for the most part similar between MRD and Excel except for digits 1 and 4. The null hypothesis H_0 is that the frequencies or probabilities for the digits are the same between the two populations. The following is a precise statement of the null hypothesis.

    H_0: p_{1,j}=p_{2,j}

where j=0,1,\cdots,9 and p_{1,j} is the probability that a random MRD digit is j and p_{2,j} is the probability that a random digit from Excel is j. Under H_0, an estimate of p_{1,j}=p_{2,j} is the ratio \frac{Y_{1,j}+Y_{2,j}}{2000}. For digits 0 and 1, they are (90+93)/2000 = 0.0915, (112+96)/2000 = 0.104, respectively. The following two tables show the calculation for the chi-squared procedure.

    Chi-Squared Statistic (MRD)
    \displaystyle \begin{array} {ccccccccc} \text{Digit} & \text{ } & \text{Frequency}  & \text{ } & \text{Estimate}  & \text{ } & \text{Frequency} & \text{ } & \text{Chi-Squared} \\  \text{ } & \text{ } & \text{Observed}  & \text{ } & \text{Cell Probability}  & \text{ } & \text{Expected} & \text{ } & \text{ }    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ 0 & \text{ } & 90  & \text{ } & 0.0915 & \text{ } & 91.5  & \text{ } & \frac{(90-91.5)^2}{91.5}      \\ 1 & \text{ } & 112  & \text{ } & 0.1040 & \text{ } & 104  & \text{ } & \frac{(112-104)^2}{104}    \\ 2 & \text{ } & 104  & \text{ } & 0.1045 & \text{ } & 104.5  & \text{ } & \frac{(104-104.5)^2}{104.5}      \\ 3 & \text{ } & 102  & \text{ } & 0.0985 & \text{ } & 98.5  & \text{ } & \frac{(102-98.5)^2}{98.5}      \\ 4 & \text{ } & 84  & \text{ } & 0.0980 & \text{ } & 98  & \text{ } & \frac{(84-98)^2}{98}    \\ 5 & \text{ } & 110  & \text{ } & 0.1065 & \text{ } & 106.5  & \text{ } & \frac{(110-106.5)^2}{106.5}      \\ 6 & \text{ } & 106  & \text{ } & 0.1020 & \text{ } & 102  & \text{ } & \frac{(106-102)^2}{102}      \\ 7 & \text{ } & 101  & \text{ } & 0.1075 & \text{ } & 107.5  & \text{ } & \frac{(101-107.5)^2}{107.5}      \\ 8 & \text{ } & 101  & \text{ } & 0.1035 & \text{ } & 103.5  & \text{ } & \frac{(101-103.5)^2}{103.5}      \\ 9 & \text{ } & 90  & \text{ } & 0.0840 & \text{ } & 84  & \text{ } & \frac{(90-84)^2}{84}  \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }      \\ \text{Total} & \text{ } & 1000 & \text{ } & 1.0 & \text{ } & 1000 & \text{ } & 3.920599983     \end{array}

    Chi-Squared Statistic (Excel)
    \displaystyle \begin{array} {ccccccccc} \text{Digit} & \text{ } & \text{Frequency}  & \text{ } & \text{Estimate}  & \text{ } & \text{Frequency} & \text{ } & \text{Chi-Squared} \\  \text{ } & \text{ } & \text{Observed}  & \text{ } & \text{Cell Probability}  & \text{ } & \text{Expected} & \text{ } & \text{ }    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ 0 & \text{ } & 93  & \text{ } & 0.0915 & \text{ } & 91.5  & \text{ } & \frac{(93-91.5)^2}{91.5}      \\ 1 & \text{ } & 96  & \text{ } & 0.1040 & \text{ } & 104  & \text{ } & \frac{(96-104)^2}{104}    \\ 2 & \text{ } & 105  & \text{ } & 0.1045 & \text{ } & 104.5  & \text{ } & \frac{(105-104.5)^2}{104.5}      \\ 3 & \text{ } & 95  & \text{ } & 0.0985 & \text{ } & 98.5  & \text{ } & \frac{(95-98.5)^2}{98.5}      \\ 4 & \text{ } & 112  & \text{ } & 0.0980 & \text{ } & 98  & \text{ } & \frac{(112-98)^2}{98}    \\ 5 & \text{ } & 103  & \text{ } & 0.1065 & \text{ } & 106.5  & \text{ } & \frac{(103-106.5)^2}{106.5}      \\ 6 & \text{ } & 98  & \text{ } & 0.1020 & \text{ } & 102  & \text{ } & \frac{(98-102)^2}{102}      \\ 7 & \text{ } & 114  & \text{ } & 0.1075 & \text{ } & 107.5  & \text{ } & \frac{(114-107.5)^2}{107.5}      \\ 8 & \text{ } & 106  & \text{ } & 0.1035 & \text{ } & 103.5  & \text{ } & \frac{(106-103.5)^2}{103.5}      \\ 9 & \text{ } & 78  & \text{ } & 0.0840 & \text{ } & 84  & \text{ } & \frac{(78-84)^2}{84}  \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }      \\ \text{Total} & \text{ } & 1000 & \text{ } & 1.0 & \text{ } & 1000 & \text{ } & 3.920599983     \end{array}

The value of the chi-squared statistic is 7.841199966, the sum of the two individual statistics. The degrees of freedom of the chi-squared statistic is (2-1)(10-1) = 9. At level of significance \alpha=0.05, the critical value (the point with an upper-tail area of 0.05 under the chi-squared density curve with df = 9) is 16.919. Thus we do not reject the null hypothesis that the distributions of digits in these two sources of random numbers are the same. Given the value of the chi-squared statistic (7.84), the p-value is 0.55. Since the p-value is large, there is no reason to believe that the digit distributions are different between the two sources. \square
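As a quick check, the entire calculation can be reproduced with software. The following is a minimal sketch (not part of the original post) that assumes Python with scipy is available; scipy.stats.chi2_contingency carries out the same test of homogeneity, building the expected counts from the pooled digit proportions and using df = (2-1)(10-1) = 9.

    # Reproduce the Example 1 test of homogeneity with scipy
    from scipy.stats import chi2_contingency

    mrd   = [90, 112, 104, 102, 84, 110, 106, 101, 101, 90]
    excel = [93,  96, 105,  95, 112, 103,  98, 114, 106, 78]

    stat, p_value, df, expected = chi2_contingency([mrd, excel])
    print(stat, df, p_value)   # roughly 7.84, df = 9, p-value about 0.55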

Example 2
Two groups of drivers (500 drivers in each group) are observed for a 3-year period. The frequencies of accidents of the two groups are shown below. Test whether the accident frequencies are the same between the two groups of drivers.

    \displaystyle \begin{array} {rrrrr} \text{Number of Accidents} & \text{ } & \text{Frequency (Group 1)}  & \text{ } & \text{Frequency (Group 2)}    \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ }  \\ 0 & \text{ } & 193  & \text{ } & 154  \\ 1 & \text{ } & 185  & \text{ } & 191  \\ 2 & \text{ } & 88  & \text{ } & 97  \\ 3 & \text{ } & 29  & \text{ } & 38  \\ 4 & \text{ } & 4  & \text{ } & 17  \\ 5 & \text{ } & 1  & \text{ } & 3    \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ }  \\ \text{Total} & \text{ } & 500 & \text{ } & 500     \end{array}

To ensure that the expected count in the last cell is not too small, we collapse two cells (4 and 5 accidents) into one. The following two tables show the calculation for the chi-squared procedure.

    Chi-Squared Statistic (Group 1)
    \displaystyle \begin{array} {ccccccccc} \text{Number of} & \text{ } & \text{Frequency}  & \text{ } & \text{Estimate}  & \text{ } & \text{Frequency} & \text{ } & \text{Chi-Squared} \\  \text{Accidents} & \text{ } & \text{Observed}  & \text{ } & \text{Cell Probability}  & \text{ } & \text{Expected} & \text{ } & \text{ }    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ 0 & \text{ } & 193  & \text{ } & 0.347 & \text{ } & 173.5  & \text{ } & \frac{(193-173.5)^2}{173.5}      \\ 1 & \text{ } & 185  & \text{ } &0.376  & \text{ } & 188  & \text{ } & \frac{(185-188)^2}{188}    \\ 2 & \text{ } & 88  & \text{ } &0.185  & \text{ } & 92.5  & \text{ } & \frac{(88-92.5)^2}{92.5}      \\ 3 & \text{ } & 29  & \text{ } & 0.067 & \text{ } & 33.5  & \text{ } & \frac{(29-33.5)^2}{33.5}      \\ 4+ & \text{ } & 5  & \text{ } & 0.025 & \text{ } & 12.5  & \text{ } & \frac{(5-12.5)^2}{12.5}      \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }      \\ \text{Total} & \text{ } & 500 & \text{ } & 1.0 & \text{ } & 500 & \text{ } & 7.562911523     \end{array}

    Chi-Squared Statistic (Group 2)
    \displaystyle \begin{array} {ccccccccc} \text{Number of} & \text{ } & \text{Frequency}  & \text{ } & \text{Estimate}  & \text{ } & \text{Frequency} & \text{ } & \text{Chi-Squared} \\  \text{Accidents} & \text{ } & \text{Observed}  & \text{ } & \text{Cell Probability}  & \text{ } & \text{Expected} & \text{ } & \text{ }    \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }    \\ 0 & \text{ } & 154  & \text{ } & 0.347 & \text{ } & 173.5  & \text{ } & \frac{(154-173.5)^2}{173.5}      \\ 1 & \text{ } & 191  & \text{ } &0.376  & \text{ } & 188  & \text{ } & \frac{(191-188)^2}{188}    \\ 2 & \text{ } & 97  & \text{ } &0.185  & \text{ } & 92.5  & \text{ } & \frac{(97-92.5)^2}{92.5}      \\ 3 & \text{ } & 38  & \text{ } & 0.067 & \text{ } & 33.5  & \text{ } & \frac{(38-33.5)^2}{33.5}      \\ 4+ & \text{ } & 20  & \text{ } & 0.025 & \text{ } & 12.5  & \text{ } & \frac{(20-12.5)^2}{12.5}      \\ \text{ } & \text{ } & \text{ }  & \text{ } & \text{ } & \text{ } & \text{ }  & \text{ } & \text{ }      \\ \text{Total} & \text{ } & 500 & \text{ } & 1.0 & \text{ } & 500 & \text{ } & 7.562911523     \end{array}

The total value of the chi-squared statistic is 15.1258 with df = (2-1)(5-1) = 4. At level of significance \alpha=0.01, the critical value is 13.2767. Since the chi-squared statistic is larger than 13.2767, we reject the null hypothesis that the accident frequencies are the same between the two groups of drivers. We reach the same conclusion by looking at the p-value. The p-value of the chi-squared statistic of 15.1258 is 0.004447242, which is quite small. Thus the value of the statistic is too large to be explained by random fluctuation alone, and we have reason to believe that the two groups have different accident rates. \square
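The same sketch (again assuming Python with scipy) applies to this example; the only extra step is collapsing the 4 and 5 cells into a single 4+ cell before running the test.

    # Reproduce the Example 2 test after collapsing the last two cells
    from scipy.stats import chi2_contingency

    group1 = [193, 185, 88, 29, 4 + 1]     # last cell is 4+ accidents
    group2 = [154, 191, 97, 38, 17 + 3]

    stat, p_value, df, _ = chi2_contingency([group1, group2])
    print(stat, df, p_value)   # roughly 15.13, df = 4, p-value about 0.0044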

_______________________________________________________________________________________________

Comparing Two or More Distributions

The procedure demonstrated in the previous section can be easily extended to handle more than two distributions. Suppose that the focus of interest is a certain multinomial experiment that results in k distinct outcomes. Suppose that the experiment is performed r times, each time with a sample drawn from a different population, and that the r iterations of the experiment are independent. Note the following quantities.

    p_{i,j} is the probability that the outcome in the ith experiment falls into the jth cell where i=1,2,\cdots,r and j=1,2,\cdots,k.

    n_i is the number of times the ith experiment is performed.

    Y_{i,j} is the number of trials in the ith experiment whose outcomes fall into cell j where i=1,2,\cdots,r and j=1,2,\cdots,k.

With these in mind, consider the following chi-squared statistics, one for each population i:

    \displaystyle \sum \limits_{j=1}^k \frac{(Y_{i,j}-n_i \ p_{i,j})^2}{n_i \ p_{i,j}}

where i=1,2,\cdots,r. Each of the above r statistics has an approximate chi-squared distribution with k-1 degrees of freedom. Since the experiments are independent, the sum of all these chi-squared statistics

    \displaystyle \sum \limits_{i=1}^r \sum \limits_{j=1}^k \frac{(Y_{i,j}-n_i \ p_{i,j})^2}{n_i \ p_{i,j}}

has an approximate chi-squared distribution with df = r(k-1). The null hypothesis is that the cell probabilities are the same across all populations. The following is the formal statement.

    H_0: p_{1,j}=p_{2,j}=p_{3,j}=\cdots =p_{r,j}

where j=1,2,\cdots,k. The unknown cell probabilities are to be estimated using sample data as follows:

    \displaystyle \hat{p}_j=\frac{Y_{1,j}+Y_{2,j}+\cdots+Y_{r,j}}{n_1+n_2+\cdots+n_r}

where j=1,2,\cdots,k. The reasoning behind \hat{p}_j is that if H_0 is true, then the r iterations of the experiment are just one large combined experiment. Then Y_{1,j}+Y_{2,j}+\cdots+Y_{r,j} is simply the number of observations in the combined experiment that fall into cell j. Thus \hat{p}_j is an estimate of the common cell probability p_{1,j}=p_{2,j}=p_{3,j}=\cdots =p_{r,j} for each j. Since the cell probabilities sum to 1, only k-1 of them need to be estimated.

The next step is to replace the cell probabilities by the estimates \hat{p}_j to obtain the following chi-squared statistic.

    \displaystyle \sum \limits_{i=1}^r \sum \limits_{j=1}^k \frac{(Y_{i,j}-n_i \ \hat{p}_j)^2}{n_i \ \hat{p}_j}

Since only k-1 cell probabilities have to be estimated, the degrees of freedom of the above statistic is r(k-1)-(k-1)=(r-1)(k-1). In other words, the degrees of freedom is the number of populations minus one, multiplied by the number of cells minus one.

Once all the components are in place, we compare the statistic with the critical value of the chi-squared distribution with the df indicated above at an appropriate level of significance to decide whether to reject the null hypothesis. The p-value approach can also be used.
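For concreteness, the following is a sketch of the general procedure in Python (the function name homogeneity_test is just an illustrative label; NumPy and scipy are assumed to be available). The input is an r x k array of observed counts, one row per population; the function computes the pooled estimates \hat{p}_j, the expected counts n_i \hat{p}_j, the chi-squared statistic, and the df (r-1)(k-1).

    import numpy as np
    from scipy.stats import chi2

    def homogeneity_test(counts):
        counts = np.asarray(counts, dtype=float)
        r, k = counts.shape
        n_i = counts.sum(axis=1)                    # sample size of each population
        p_hat = counts.sum(axis=0) / counts.sum()   # pooled estimates of the cell probabilities
        expected = np.outer(n_i, p_hat)             # expected counts n_i * p_hat_j
        stat = ((counts - expected) ** 2 / expected).sum()
        df = (r - 1) * (k - 1)
        return stat, df, chi2.sf(stat, df)          # sf gives the upper-tail p-value

    # Example 1 data (MRD vs Excel) gives roughly (7.84, 9, 0.55)
    print(homogeneity_test([[90, 112, 104, 102, 84, 110, 106, 101, 101, 90],
                            [93,  96, 105,  95, 112, 103,  98, 114, 106, 78]]))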

_______________________________________________________________________________________________

Reference

  1. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 7th ed., W. H. Freeman and Company, New York, 2012
  2. Wackerly D. D., Mendenhall III W., Scheaffer R. L., Mathematical Statistics with Applications, Thomson Learning, Inc., California, 2008

_______________________________________________________________________________________________
\copyright \ 2017 - \text{Dan Ma}

The Chi-Squared Distribution, Part 3a

This post is part 3 of a three-part series on the chi-squared distribution. In this post, we discuss the roles played by the chi-squared distribution in experiments or random phenomena that result in measurements that are categorical rather than quantitative (part 2 deals with quantitative measurements). An introduction to the chi-squared distribution is found in part 1.

The chi-squared test discussed here is also referred to as Pearson’s chi-squared test, which was formulated by Karl Pearson in 1900. It can be used to assess three types of comparison on categorical variables – goodness of fit, homogeneity, and independence. As a result, we break up the discussion into 3 parts – part 3a (goodness of fit, this post), part 3b (test of homogeneity) and part 3c (test of independence).

_______________________________________________________________________________________________

Multinomial Experiments

Let’s look at the setting for Pearson’s goodness-of-fit test. Consider a random experiment consisting of a series of independent trials each of which results in exactly one of k categories. We are interested in summarizing the counts of the trials that fall into the k distinct categories. Some examples of such random experiments are:

  • In rolling a die n times, consider the counts of the faces of the die.
  • Perform a series of experiments each of which is a toss of three coins. Summarize the experiments according to the number of heads, 0, 1, 2, and 3, that occur in each experiment.
  • Blood donors can be classified into the blood types A, B, AB and O.
  • Record the number of automobile accidents per week in a one-mile stretch of highway. Classify the weekly accident counts into the groupings 0, 1, 2, 3, 4 and 5+.
  • A group of auto insurance policies are classified into the claim frequency rates of 0, 1, 2, 3, 4+ accidents per year.
  • Auto insurance claims are classified into various claim size groupings, e.g. under 1000, 1000 to 5000, 5000 to 10000 and 10000+.
  • In auditing financial transactions in financial documents (accounting statements, expense reports etc), the leading digits of financial figures can be classified into 9 cells: 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Each of these examples can be referred to as a multinomial experiment. The characteristics of such an experiment are

  • The experiment consists of performing n identical trials that are independent.
  • For each trial, the outcome falls into exactly one of k categories or cells.
  • The probability of the outcome of a trial falling into a particular cell is constant across all trials.

For cell j, let p_j be the probability of the outcome falling into cell j. Of course, p_1+p_2+\cdots+p_k=1. We are interested in the joint random variables Y_1,Y_2,\cdots,Y_k where Y_j is the number of trials whose outcomes fall into cell j.

If k=2 (only two categories for each trial), then the experiment is a binomial experiment. Then one of the categories can be called success (with cell probability p) and the other is called failure (with cell probability 1-p). If Y_1 is the count of the successes, then Y_1 has a binomial distribution with parameters n and p.

In general, the variables Y_1,Y_2,\cdots,Y_k have a multinomial distribution. To be a little more precise, the random variables Y_1,Y_2,\cdots,Y_{k-1} have a multinomial distribution with parameters n and p_1,p_2,\cdots,p_{k-1}. Note that the last variable Y_k is determined by the others since Y_k=n-(Y_1+\cdots+Y_{k-1}).

In the discussion here, the objective is to make inference on the cell probabilities p_1,p_2,\cdots,p_k. The hypotheses in the statistical test are expressed in terms of specific values of p_j, j=1,2,\cdots,k. For example, the null hypothesis may be of the following form:

    H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k

where p_{j,0} are the hypothesized values of the cell probabilities. It is cumbersome to calculate probabilities for the multinomial distribution. As a result, it would be difficult (if not impossible) to calculate the exact level of significance, which is the probability of type I error. Thus it is critical to use a test statistic whose sampling distribution does not require working directly with the multinomial distribution. Fortunately this problem was solved by Karl Pearson, who formulated a test statistic that has an approximate chi-squared distribution.

_______________________________________________________________________________________________

Test Statistic

The random variables Y_1,Y_2,\cdots,Y_k discussed above have a multinomial distribution with parameters n and p_1,p_2,\cdots,p_k. Of course, each p_j is the probability that the outcome of a trial falls into cell j. The marginal distribution of each Y_j is binomial with parameters n and p_j, with p_j being the probability of success. Thus the expected value and the variance of Y_j are E[Y_j]=n p_j and Var[Y_j]=n p_j (1-p_j). The following is the chi-squared test statistic.

    \displaystyle \chi^2=\sum \limits_{j=1}^k \frac{(Y_j-n \ p_j)^2}{n \ p_j}=\sum \limits_{j=1}^k \frac{(Y_j-E[Y_j])^2}{E[Y_j]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

The statistic defined in (1) was proposed by Karl Pearson in 1900. It is defined by summing the squares of the difference of the observed counts Y_j and the expected counts E[Y_j] where each squared difference is normalized by the expected count (i.e. divided by the expected count). On one level, the test statistic in (1) seems intuitive since it involves all the k deviations Y_j-E[Y_j]. If the observed values Y_j are close to the expected cell counts, then the test statistic in (1) would have a small value.

The chi-squared test statistic defined in (1) has an approximate chi-squared distribution when the number of trials n is large. The proof of this fact will not be discussed here. We demonstrate with the case for k=2.

    \displaystyle \begin{aligned} \sum \limits_{j=1}^2 \frac{(Y_j-n \ p_j)^2}{n \ p_j}&=\frac{p_2 \ (Y_1-n p_1)^2+p_1 \ (Y_2-n p_2)^2}{n \ p_1 \ p_2} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+p_1 \ ((n-Y_1)-n (1-p_1))^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+ p_1 \ (Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\biggl( \frac{Y_1-n p_1}{\sqrt{n \ p_1 \ (1-p_1)}} \biggr)^2 =\biggl( \frac{Y_1-E[Y_1]}{\sqrt{Var[Y_1]}} \biggr)^2 \end{aligned}

The quantity inside the brackets in the last step is approximately standard normal according to the central limit theorem. Since the square of a standard normal random variable has a chi-squared distribution with one degree of freedom (see Part 1), the last step in the above derivation has an approximate chi-squared distribution with 1 df.

In order for the chi-squared distribution to provide an adequate approximation to the test statistic in (1), a rule of thumb requires that the expected cell counts E[Y_j] are at least five. The null hypothesis to be tested is that the cell probabilities p_j are certain specified values p_{j,0} for j=1,2,\cdots,k. The following is the formal statement.

    H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The null hypothesis is to be tested against all possible alternatives. In other words, the alternative hypothesis H_1 is the statement that p_j \ne p_{j,0} for at least one j.

The chi-squared test statistic in (1) can be used for a goodness-of-fit test, i.e. to test how well a probability model fits the sample data, in other words, to test whether the observed data come from a hypothesized probability distribution. Example 2 below will test whether the Poisson model is a good fit for a set of claim frequency data.

_______________________________________________________________________________________________

Degrees of Freedom

Now that we have addressed the distribution for the test statistic in (1), we need to address two more issues. One is the direction of the hypothesis test (one-tailed or two-tailed). The second is the degrees of freedom. The direction of the test is easy to see. Note that the chi-squared test statistic in (1) is always a positive value. On the other hand, if the difference between observed cell counts and expected cell counts is large, the large difference would contradict the null hypothesis. Thus if the chi-squared statistic has a large value, we should reject the null hypothesis. So the correct test to use is the upper tailed chi-squared test.

The number of degrees of freedom is obtained by subtracting one from the cell count k for each independent linear restriction placed on the cell probabilities. There is at least one linear restriction. The sum of all the cell probabilities must be 1. Thus the degrees of freedom must be the result of reducing k by one at least one time. This means that the degrees of freedom of the chi-squared statistic in (1) is at most k-1.

Furthermore, if any of the specified cell probabilities involves an unknown parameter that must be estimated from the data, then k-1 is reduced further by one for each such parameter. Any unknown parameter should be estimated by its maximum likelihood estimator (MLE). All these points are demonstrated in the examples below.

If the value of the chi-squared statistic in (1) is “large”, we reject the null hypothesis H_0 stated in (2). By large we mean that the value of the chi-squared statistic exceeds the critical value for the desired level of significance, i.e. the point that cuts off an upper-tail area of \alpha in the chi-squared distribution (with the appropriate df), where \alpha is the desired level of significance (e.g. \alpha=0.05, \alpha=0.01 or some other appropriate level). Instead of using the critical value, the p-value approach can also be used. The critical value or p-value can be looked up in a table or computed using software. For the examples below, chi-squared functions in Excel are used.
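The following is a small Python sketch (not part of the original post; the helper name pearson_chi2 is just illustrative) of the statistic in (1) together with the degrees-of-freedom rule just described: one df is lost for the restriction that the probabilities sum to 1, and one more for each parameter estimated from the data.

    import numpy as np
    from scipy.stats import chi2

    def pearson_chi2(observed, p0, n_estimated_params=0):
        observed = np.asarray(observed, dtype=float)
        p0 = np.asarray(p0, dtype=float)
        expected = observed.sum() * p0                 # expected cell counts n * p_j
        stat = ((observed - expected) ** 2 / expected).sum()
        df = len(observed) - 1 - n_estimated_params    # k - 1, minus one per estimated parameter
        return stat, df, chi2.sf(stat, df)             # upper-tail p-value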

_______________________________________________________________________________________________

Examples

Example 1
Suppose that we wish to test whether a given die is a fair die. We roll the die 240 times and the following table shows the results.

    \displaystyle \begin{array} {rr} \text{Cell} & \text{Frequency} \\ 1 & 38  \\ 2 & 35 \\ 3 & 37 \\ 4 & 38  \\ 5 & 42 \\ 6 & 50 \\ \text{ } & \text{ } \\ \text{Total} & 240   \end{array}

The null hypothesis

    \displaystyle H_0: p_1=p_2=p_3=p_4=p_5=p_6=\frac{1}{6}

is tested against the alternative that at least one of the equalities is not true. This example is the simplest type of problem for testing cell probabilities. Since the specified values for the cell probabilities in H_0 are known constants, the degrees of freedom is one less than the cell count. Thus df = 5. The following is the chi-squared statistic based on the data and the null hypothesis.

    \displaystyle \begin{aligned} \chi^2&=\frac{(38-40)^2}{40}+\frac{(35-40)^2}{40}+\frac{(37-40)^2}{40} \\& \ \ + \frac{(38-40)^2}{40}+\frac{(42-40)^2}{40}+\frac{(50-40)^2}{40}=3.65  \end{aligned}

At \alpha=0.05 level of significance, the chi-squared critical value at df = 5 is \chi_{0.05}^2(5)=11.07049769. Since 3.65 < 11.07, the hypothesis that the die is fair is not rejected at \alpha=0.05. The p-value is P[\chi^2 > 3.65]=0.6. With such a large p-value, we also come to the conclusion that the null hypothesis is not rejected. \square
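As a quick check (a sketch outside the original post, assuming Python with scipy), the same numbers can be reproduced with scipy.stats.chisquare, which by default uses equal expected counts, i.e. a fair die.

    from scipy.stats import chisquare

    observed = [38, 35, 37, 38, 42, 50]
    stat, p_value = chisquare(observed)
    print(stat, p_value)   # roughly 3.65 and 0.60, matching the values above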

In all the examples, the critical values and the p-values are obtained by using the following functions in Excel.

    critical value
    =CHISQ.INV.RT(level of significance, df)

    p-value
=1 - CHISQ.DIST(test statistic, df, TRUE)

Example 2
We now give an example of the chi-squared goodness-of-fit test. The numbers of auto accident claims per year from 700 drivers are recorded by an insurance company. The claim frequency data is shown in the following table.

    \displaystyle \begin{array}{rrr} \text{Claim Count} & \text{ } & \text{Frequency} \\ 0 & \text{ } & 351  \\ 1 & \text{ } & 241 \\ 2 & \text{ } & 73 \\ 3 & \text{ } & 29  \\ 4 & \text{ } & 6 \\ 5+ & \text{ } & 0 \\ \text{ } & \text{ } & \text{ } \\ \text{Total} & \text{ } & 700   \end{array}

Test the hypothesis that the annual claim count for a driver has a Poisson distribution. Use \alpha=0.05. Assume that the claim frequencies across the drivers in question are independent.

The hypothesized distribution of the annual claim frequency is a Poisson distribution with unknown mean \lambda. The MLE of the parameter \lambda is the sample mean, which in this case is \hat{\lambda}=\frac{498}{700}=0.711428571.

Under the assumption that the claim frequency is Poisson with mean \hat{\lambda}, the cell probabilities are calculated using \hat{\lambda}.

    \displaystyle p_1=P[Y=0]=e^{-\hat{\lambda}}=0.4909

    \displaystyle p_2=P[Y=1]=\hat{\lambda} \ e^{-\hat{\lambda}}=0.3493

    \displaystyle p_3=P[Y=2]=\frac{1}{2} \ \hat{\lambda}^2 \ e^{-\hat{\lambda}}=0.1242

    \displaystyle p_4=P[Y=3]=\frac{1}{3!} \ \hat{\lambda}^3 \ e^{-\hat{\lambda}}=0.0295

    \displaystyle p_5=P[Y \ge 4]=1-P[Y=0]-P[Y=1]-P[Y=2]-P[Y=3]=0.0061

Then the null hypothesis is:

    H_0: p_1=0.4909, p_2=0.3493, p_3=0.1242, p_4=0.0295, p_5=0.0061

The null hypothesis is tested against all alternatives. The following table shows the calculation of the chi-squared statistic.

    \displaystyle \begin{array}{rrrrrrrrr}   \text{Cell} & \text{ } & \text{Claim Count} & \text{ } & \text{Cell Probability} & \text{ } & \text{Expected Count} & \text{ } & \text{Chi-Squared}   \\ 1 & \text{ } & 0 & \text{ } & 0.4909 & \text{ } & 343.63 & \text{ } & 0.15807   \\ 2 & \text{ } & 1 & \text{ } & 0.3493 & \text{ } & 244.51 & \text{ } & 0.05039   \\ 3 & \text{ } & 2 & \text{ } & 0.1242 & \text{ } & 86.94 & \text{ } & 2.23515   \\ 4 & \text{ } & 3 & \text{ } & 0.0295 & \text{ } & 20.65 & \text{ } & 3.37639    \\ 5 & \text{ } & 4+ & \text{ } & 0.0061 & \text{ } & 4.27 & \text{ } & 0.70091   \\ \text{ } & \text{ } & \text{ }   \\ \text{Total} & \text{ } & \text{ } & \text{ } & 1.0000 & \text{ } & \text{ } & \text{ } & 6.52091   \end{array}

The degrees of freedom of the chi-squared statistic is df = 5 - 1 - 1 = 3. The first reduction of one is due to the linear restriction that all cell probabilities sum to 1. The second reduction is due to the fact that one unknown parameter \lambda has to be estimated using sample data. Using Excel, the critical value is \chi_{0.05}^2(3)=7.814727903. The p-value is P[\chi^2 > 6.52091]=0.088841503. Thus the null hypothesis is not rejected at the level of significance \alpha=0.05. \square
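The following sketch (again assuming Python with scipy) retraces the calculation: estimate \lambda by the sample mean, compute the Poisson cell probabilities, and subtract one extra degree of freedom for the estimated parameter. Because the table above rounds the cell probabilities to four decimals, the exact computation gives a slightly different statistic (still around 6.5 to 6.6) and a p-value still close to 0.09.

    import numpy as np
    from scipy.stats import poisson, chi2

    counts = np.array([351, 241, 73, 29, 6])    # claim counts 0, 1, 2, 3, 4+
    claims = np.array([0, 1, 2, 3, 4])
    n = counts.sum()                            # 700
    lam = (claims * counts).sum() / n           # MLE of lambda, 498/700

    p = poisson.pmf([0, 1, 2, 3], lam)
    p = np.append(p, 1 - p.sum())               # last cell is P[Y >= 4]
    expected = n * p

    stat = ((counts - expected) ** 2 / expected).sum()
    df = len(counts) - 1 - 1                    # 5 - 1 - 1 = 3
    print(stat, df, chi2.sf(stat, df))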

Example 3
For many data sets, especially data sets with numbers that spread across multiple orders of magnitude, the first digits occur according to the probability distribution indicated below:

    Probability for leading digit 1 = 0.301
    Probability for leading digit 2 = 0.176
    Probability for leading digit 3 = 0.125
    Probability for leading digit 4 = 0.097
    Probability for leading digit 5 = 0.079
    Probability for leading digit 6 = 0.067
    Probability for leading digit 7 = 0.058
    Probability for leading digit 8 = 0.051
    Probability for leading digit 9 = 0.046

This probability distribution was discovered by Simon Newcomb in 1881 and was rediscovered by the physicist Frank Benford in 1938. Since then this distribution has become known as Benford’s law. Thus in many data sets, the leading digit 1 occurs about 30% of the time. The data sets for which this law is applicable include demographic data (e.g. income data of a large population, census data such as populations of cities and counties) and scientific data. The law is also applicable to certain financial data, e.g. tax data, stock exchange data, corporate disbursement and sales data. Thus Benford’s law is a great tool for forensic accounting and auditing.

The following shows the distribution of first digits in the population counts of all 3,143 counties in the United States (from US census data).

    Count for leading digit 1 = 972
    Count for leading digit 2 = 573
    Count for leading digit 3 = 376
    Count for leading digit 4 = 325
    Count for leading digit 5 = 205
    Count for leading digit 6 = 209
    Count for leading digit 7 = 179
    Count for leading digit 8 = 155
    Count for leading digit 9 = 149

Use the chi-squared goodness-of-fit test to test the hypothesis that the leading digits in the county population data follow Benford’s law. This example is also discussed in this blog post. \square
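A sketch of how the test could be set up (assuming Python with scipy; the full worked solution is in the linked post) is the following. The Benford probabilities and the county counts come from the lists above.

    from scipy.stats import chisquare

    benford = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
    counts = [972, 573, 376, 325, 205, 209, 179, 155, 149]
    n = sum(counts)                                            # 3,143 counties

    stat, p_value = chisquare(counts, f_exp=[n * p for p in benford])
    print(stat, p_value)   # df = 9 - 1 = 8 since no parameter is estimated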

For further information and more examples on the chi-squared test, please see the sources listed in the reference section.

_______________________________________________________________________________________________

Reference

  1. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 7th ed., W. H. Freeman and Company, New York, 2012
  2. Wackerly D. D., Mendenhall III W., Scheaffer R. L., Mathematical Statistics with Applications, Thomson Learning, Inc., California, 2008

_______________________________________________________________________________________________
\copyright \ 2017 - \text{Dan Ma}

The Chi-Squared Distribution, Part 2

This is part 2 of a 3-part series on the chi-squared distribution. In this post, we discuss several theorems, all centered around the chi-squared distribution, that play important roles in inferential statistics for the population mean and population variance of normal populations. These theorems are the basis for the test statistics used in the inferential procedures.

We first discuss the setting for the inference procedures. Then we discuss the pivotal theorem (Theorem 5). We then proceed to discuss the theorems that produce the test statistics for \mu, the population mean of a normal population, and for \mu_1-\mu_2, a difference of two population means from two normal populations. The discussion then shifts to the inference procedures on population variance.

_______________________________________________________________________________________________

The Settings

To facilitate the discussion, we use the notation \mathcal{N}(\mu,\sigma^2) to denote the normal distribution with mean \mu and variance \sigma^2. Whenever the random variable X follows such a distribution, we write X \sim \mathcal{N}(\mu,\sigma^2).

The setting for making inference on one population is that we have a random sample Y_1,Y_2,\cdots,Y_n, drawn from a normal population \mathcal{N}(\mu,\sigma^2). The sample mean \overline{Y} and the sample variance S^2 are unbiased estimators of \mu and \sigma^2, respectively, given by:

    \displaystyle \overline{Y}=\frac{Y_1+\cdots+Y_n}{n}=\frac{1}{n} \sum \limits_{j=1}^n Y_j \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

    \displaystyle S^2=\frac{1}{n-1} \sum \limits_{j=1}^n (Y_j-\overline{Y})^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The goal is to use the information obtained from the sample, namely \overline{Y} and S^2, to estimate or make decisions about the unknown population parameters \mu and \sigma^2.

Because the sample is drawn from a normal population, the sample mean \overline{Y} has a normal distribution, more specifically \overline{Y} \sim \mathcal{N}(\mu,\frac{\sigma^2}{n}), which involves two unknown parameters. To perform inferential procedures on the population mean \mu, it is preferable to have a test statistic that depends on \mu only. To this end, a t-statistic is used (see Theorem 7), which has the t-distribution with n-1 degrees of freedom (one less than the sample size). Because the parameter \sigma^2 is replaced by the sample variance S^2, the t-statistic has only \mu as the unknown parameter.

On the other hand, to perform inferential procedures on the population variance \sigma^2, we use a statistic that has a chi-squared distribution and that has only one unknown parameter \sigma^2 (see Theorem 5).

Now, the setting for performing inference on two normal populations. Let X_1,X_2,\cdots,X_n be a random sample drawn from the distribution \mathcal{N}(\mu_X,\sigma_X^2). Let Y_1,Y_2,\cdots,Y_m be a random sample drawn from the distribution \mathcal{N}(\mu_Y,\sigma_Y^2). Because the two samples are independent, the difference of the sample means \overline{X}-\overline{Y} has a normal distribution. Specifically, \overline{X}-\overline{Y} \sim \mathcal{N}(\mu_X-\mu_Y,\frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{m}). Theorem 8 gives a t-statistic that is in terms of the difference \mu_X-\mu_Y such that the two unknown population variances are replaced by the pooled sample variance. This is done with the simplifying assumption that the two population variances are identical.

On the other hand, for inference on the population variances \sigma_X^2 and \sigma_Y^2, a statistic that has the F-distribution can be used (see Theorem 10). One caveat is that this test statistic is sensitive to non-normality.

_______________________________________________________________________________________________

Connection between Normal Distribution and Chi-squared Distribution

There is an intimate relation between samples from a normal distribution and the chi-squared distribution. This is discussed in Part 1. Let’s recall this connection. If we normalize one sample item Y_j and then square it, we obtain a chi-squared random variable with df = 1. Likewise, if we normalize each of the n sample items and square them, the sum of the squares is a chi-squared random variable with df = n. The following results are discussed in Part 1 and are restated here for clarity.

Theorem 2
Suppose that the random variable X follows a standard normal distribution, i.e. the normal distribution with mean 0 and standard deviation 1. Then Y=X^2 follows a chi-squared distribution with 1 degree of freedom.

Corollary 3
Suppose that the random variable X follows a normal distribution with mean \mu and standard deviation \sigma. Then Y=[(X-\mu) / \sigma]^2 follows a chi-squared distribution with 1 degree of freedom.

Corollary 4
Suppose that X_1,X_2,\cdots,X_n is a random sample drawn from a normal distribution with mean \mu and standard deviation \sigma. Then the following random variable follows a chi-squared distribution with n degrees of freedom.

    \displaystyle \sum \limits_{j=1}^n \biggl( \frac{X_j-\mu}{\sigma} \biggr)^2=\biggl( \frac{X_1-\mu}{\sigma} \biggr)^2+\biggl( \frac{X_2-\mu}{\sigma} \biggr)^2+\cdots+\biggl( \frac{X_n-\mu}{\sigma} \biggr)^2

_______________________________________________________________________________________________

A Pivotal Theorem

The statistic in Corollary 4 has two unknown parameters \mu and \sigma^2. It turns out that the statistic will become more useful if \mu is replaced by the sample mean \overline{Y}. The cost is that one degree of freedom is lost in the chi-squared distribution. The following theorem gives the details. The result is a statistic that is a function of the sample variance S^2 and the population variance \sigma^2.

Theorem 5
Let Y_1,Y_2,\cdots,Y_n be a random sample drawn from a normal distribution with mean \mu and variance \sigma^2. Then the following conditions hold.

  • The sample mean \overline{Y} and the sample variance S^2 are independent.
  • The statistic \displaystyle \frac{(n-1) S^2}{\sigma^2}=\frac{1}{\sigma^2} \sum \limits_{j=1}^n (Y_j-\overline{Y})^2 has a chi-squared distribution with n-1 degrees of freedom.

Proof of Theorem 5
We do not prove the first bullet point. For a proof, see Exercise 13.93 in [2]. For the second bullet point, note that

    \displaystyle \begin{aligned}\sum \limits_{j=1}^n \biggl( \frac{Y_j-\mu}{\sigma} \biggr)^2&=\sum \limits_{j=1}^n \biggl( \frac{(Y_j-\overline{Y})+(\overline{Y}-\mu)}{\sigma} \biggr)^2 \\&=\sum \limits_{j=1}^n \biggl( \frac{Y_j-\overline{Y}}{\sigma} \biggr)^2 +\frac{n (\overline{Y}-\mu)^2}{\sigma^2} \\&=\frac{(n-1) S^2}{\sigma^2} +\biggl( \frac{\overline{Y}-\mu}{\frac{\sigma}{\sqrt{n}}} \biggr)^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)\end{aligned}

Note that in expanding [(Y_j-\overline{Y})+(\overline{Y}-\mu)]^2, the sum of the middle terms equals 0. Furthermore the result (3) can be restated as follows:

    \displaystyle Q=\frac{(n-1) S^2}{\sigma^2}+Z^2

where Q=\sum \limits_{j=1}^n \biggl( \frac{Y_j-\mu}{\sigma} \biggr)^2 and Z=\frac{\overline{Y}-\mu}{\frac{\sigma}{\sqrt{n}}}. Note that Z is a standard normal random variable. Thus Z^2 has a chi-squared distribution with df = 1 (by Theorem 2). Since Q is an independent sum of squares of standardized normal variables, Q has a chi-squared distribution with n degrees of freedom. Furthermore, since \overline{Y} and S^2 are independent, Z^2 and S^2 are independent. Let H=\frac{(n-1) S^2}{\sigma^2}. As a result, H and Z^2 are independent. The following gives the moment generating function (MGF) of Q.

    \displaystyle \begin{aligned}E[e^{t \ Q}]&=E[e^{t \ (H+Z^2)}] \\&=E[e^{t \ H}] \ E[e^{t \ Z^2}]  \end{aligned}

Since Q and Z^2 follow chi-squared distributions, we can plug in the chi-squared MGFs to obtain the MGF of the random variable H.

    \displaystyle \biggl(\frac{1}{1-2t} \biggr)^{\frac{n}{2}}=E[e^{t \ H}] \ \biggl(\frac{1}{1-2t} \biggr)^{\frac{1}{2}}

    \displaystyle E[e^{t \ H}]=\biggl(\frac{1}{1-2t} \biggr)^{\frac{n-1}{2}}

The MGF for H is that of a chi-squared distribution with n-1 degrees of freedom. \square

Remark
It is interesting to compare the following two quantities:

    \displaystyle \sum \limits_{j=1}^n \biggl( \frac{Y_j-\mu}{\sigma} \biggr)^2

    \displaystyle \frac{(n-1) S^2}{\sigma^2}=\frac{1}{\sigma^2} \sum \limits_{j=1}^n (Y_j-\overline{Y})^2=\sum \limits_{j=1}^n \biggl( \frac{Y_j-\overline{Y}}{\sigma} \biggr)^2

The first quantity is from Corollary 4 and has a chi-squared distribution with n degrees of freedom. The second quantity is from Theorem 5 and has a chi-squared distribution with n-1 degrees of freedom. Thus the effect of Theorem 5 is that by replacing the population mean \mu with the sample mean \overline{Y}, one degree of freedom is lost in the chi-squared distribution.

Theorem 5 is a pivotal theorem that has wide applications. For our purposes at hand, it can be used for inference on both the mean and the variance. Even though one degree of freedom is lost, the statistic \displaystyle \frac{(n-1) S^2}{\sigma^2} is a function of only one unknown parameter, namely the population variance \sigma^2. Since its sampling distribution is known (chi-squared), we can make probability statements about the statistic. Hence the statistic is useful for making inference about the population variance \sigma^2. As we will see below, in conjunction with other statistics, the statistic in Theorem 5 can be used for inference on two population variances as well as for inference on the mean (one sample and two samples).
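A quick Monte Carlo check of Theorem 5 (a sketch, not part of the original post, assuming Python with NumPy) is the following: for repeated samples of size n from a normal population, the scaled sample variance (n-1)S^2/\sigma^2 should have mean close to n-1 and variance close to 2(n-1).

    import numpy as np

    rng = np.random.default_rng(0)
    n, mu, sigma = 10, 5.0, 2.0

    samples = rng.normal(mu, sigma, size=(100000, n))   # 100,000 samples of size n
    s2 = samples.var(axis=1, ddof=1)                    # sample variances
    h = (n - 1) * s2 / sigma**2

    print(h.mean(), h.var())   # close to 9 and 18, the chi-squared(9) mean and variance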

_______________________________________________________________________________________________

Basis for Inference on Population Mean

Inference on the population mean of a single normal population and on the difference of the means of two independent normal populations relies on the t-statistic. Theorem 6 shows how to obtain a t-statistic using a chi-squared statistic and the standard normal statistic. Theorem 7 provides the one-sample t-statistic and Theorem 8 provides the two-sample t-statistic.

Theorem 6
Let Z be the standard normal random variable. Let U be a random variable that has a chi-squared distribution with r degrees of freedom. Then the random variable

    \displaystyle T=\frac{Z}{\sqrt{\frac{U}{r}}}

has a t-distribution with r degrees of freedom and its probability density function (PDF) is

    \displaystyle g(t)=\frac{\Gamma(\frac{r+1}{2})}{\sqrt{\pi r} \Gamma(\frac{r}{2}) \biggl(1+\frac{t^2}{r} \biggr)^{\frac{r+1}{2}}} \ \ \ \ \ \ \ \ \ -\infty <t< \infty

Remark
The probability density function given here is not important for the purpose at hand. For the proof of Theorem 6, see [2]. The following two theorems give two applications of Theorem 6.

Theorem 7
Let Y_1,Y_2,\cdots,Y_n be a random sample drawn from a normal distribution with mean \mu and variance \sigma^2. Let S^2 be the sample variance defined in (2). Then the random variable

    \displaystyle T=\frac{\overline{Y}-\mu}{\frac{S}{\sqrt{n}}}

has a t-distribution with n-1 degrees of freedom.

Proof of Theorem 7
Consider the following statistics.

    \displaystyle Z=\frac{\overline{Y}-\mu}{\frac{\sigma}{\sqrt{n}}}

    \displaystyle U=\frac{(n-1) \ S^2}{\sigma^2}

Note that Z has the standard normal distribution. By Theorem 5, the quantity U has a chi-squared distribution with df = n-1. By Theorem 6, the following quantity has a t-distribution with df = n-1.

    \displaystyle T=\frac{Z}{\sqrt{\frac{U}{n-1}}}=\frac{\overline{Y}-\mu}{\frac{S}{\sqrt{n}}}

The above result is obtained after performing algebraic simplification. \square

Theorem 8
Let X_1,X_2,\cdots,X_n be a random sample drawn from a normal distribution with mean \mu_X and variance \sigma_X^2. Let Y_1,Y_2,\cdots,Y_m be a random sample drawn from a normal distribution with mean \mu_Y and variance \sigma_Y^2. Suppose that \sigma_X^2=\sigma_Y^2=\sigma^2. Then the following statistic:

    \displaystyle T=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{S_p \ \sqrt{\frac{1}{n}+\frac{1}{m}}}

has a t-distribution with df = n+m-2 where \displaystyle S_p^2=\frac{(n-1) \ S_X^2+ (m-1) \ S_Y^2}{n+m-2}.

Note that S_p^2 is the pooled variance of the two sample variances S_X^2 and S_Y^2.

Proof of Theorem 8
First, the sample mean \overline{X} has a normal distribution with mean \mu_X and variance \frac{\sigma_X^2}{n}. The sample mean \overline{Y} has a normal distribution with mean \mu_Y and variance \frac{\sigma_Y^2}{m}. Since the two samples are independent, \overline{X} and \overline{Y} are independent. Thus the difference \overline{X}-\overline{Y} has a normal distribution with mean \mu_X-\mu_Y and variance \frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{m}. The following is a standardized normal random variable:

    \displaystyle Z=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{m}}}

On the other hand, by Theorem 5 the following quantities have chi-squared distributions with degrees of freedom n-1 and m-1, respectively.

    \displaystyle \frac{(n-1) S_X^2}{\sigma_X^2}=\frac{\sum \limits_{j=1}^n (X_j-\overline{X})^2}{\sigma_X^2}

    \displaystyle \frac{(m-1) S_Y^2}{\sigma_Y^2}=\frac{\sum \limits_{j=1}^m (Y_j-\overline{Y})^2}{\sigma_Y^2}

Because the two samples are independent, the two chi-squared statistics are independent. Then the following is a chi-squared statistic with n+m-2 degrees of freedom.

    \displaystyle U=\frac{(n-1) S_X^2}{\sigma_X^2}+\frac{(m-1) S_Y^2}{\sigma_Y^2}

By Theorem 6, the following ratio

    \displaystyle T=\frac{Z}{\sqrt{\frac{U}{n+m-2}}}

has a t-distribution with n+m-2 degrees of freedom. Here’s where the simplifying assumption of \sigma_X^2=\sigma_Y^2=\sigma^2 is used. Plugging in this assumption gives the following:

    \displaystyle T=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{\sqrt{\frac{(n-1) S_X^2+(m-1) S_Y^2}{n+m-2} \ (\frac{1}{n}+\frac{1}{m})}}=\frac{\overline{X}-\overline{Y}-(\mu_X-\mu_Y)}{S_p \ \sqrt{\frac{1}{n}+\frac{1}{m}}}

where S_p^2 is the pooled sample variance of the two samples as indicated above. \square
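To illustrate Theorem 8 (a sketch with made-up data, assuming Python with NumPy and scipy), the pooled t-statistic can be built directly from the formula and compared with scipy.stats.ttest_ind, which implements the same equal-variance two-sample procedure.

    import numpy as np
    from scipy.stats import ttest_ind

    x = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.8])
    y = np.array([9.5, 10.1, 9.2, 9.9, 9.4])
    n, m = len(x), len(y)

    # pooled sample variance and the t statistic under H0: mu_X = mu_Y
    sp2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * (1/n + 1/m))

    print(t)
    print(ttest_ind(x, y, equal_var=True))   # same t statistic, df = n + m - 2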

_______________________________________________________________________________________________

Basis for Inference on Population Variance

As indicated above, the statistic \displaystyle \frac{(n-1) S^2}{\sigma^2} given in Theorem 5 can be used for inference on the variance of a normal population. The following theorem gives the basis for the statistic used for comparing the variances of two normal populations.

Theorem 9
Suppose that the random variables U and V are independent chi-squared random variables with r_1 and r_2 degrees of freedom, respectively. Then the statistic

    \displaystyle F=\frac{U / r_1}{V / r_2}

has an F-distribution with r_1 and r_2 degrees of freedom.

Remark
The F-distribution depends on two parameters r_1 and r_2. The order they are given is important. We regard the first parameter as the degrees of freedom of the chi-squared distribution in the numerator and the second parameter as the degrees of freedom of the chi-squared distribution in the denominator.

It is not important to know the probability density functions for both the t-distribution and the F-distribution (in both Theorem 6 and Theorem 9). When doing inference procedures with these distributions, either tables or software will be used.

Given two independent normal random samples X_1,X_2,\cdots,X_n and Y_1,Y_2,\cdots,Y_m (as discussed in the above section on the settings of inference), the sample variance S_X^2 is an unbiased estimator of the population variance \sigma_X^2 of the first population, and the sample variance S_Y^2 is an unbiased estimator of the population variance \sigma_Y^2 of the second population. It seems to make sense that the ratio \displaystyle \frac{S_X^2}{S_Y^2} can be used to make inference about the relative magnitude of \sigma_X^2 and \sigma_Y^2. The following theorem indicates that this is a valid approach.

Theorem 10
Let X_1,X_2,\cdots,X_n be a random sample drawn from a normal distribution with mean \mu_X and variance \sigma_X^2. Let Y_1,Y_2,\cdots,Y_m be a random sample drawn from a normal distribution with mean \mu_Y and variance \sigma_Y^2. Then the statistic

    \displaystyle \frac{S_X^2 \ / \ \sigma_X^2}{S_Y^2 \ / \ \sigma_Y^2}=\frac{S_X^2}{S_Y^2} \times \frac{\sigma_Y^2}{\sigma_X^2}

has the F-distribution with degrees of freedom n-1 and m-1.

Proof of Theorem 10
By Theorem 5, \displaystyle \frac{(n-1) S_X^2}{\sigma_X^2} has a chi-squared distribution with n-1 degrees of freedom and \displaystyle \frac{(m-1) S_Y^2}{\sigma_Y^2} has a chi-squared distribution with m-1 degrees of freedom. By Theorem 9, the following statistic

    \displaystyle \frac{[(n-1) S_X^2 \ / \ \sigma_X^2] \ / \ (n-1)}{[(m-1) S_Y^2 \ / \ \sigma_Y^2] \ / \ (m-1)}

has the F-distribution with n-1 and m-1 degrees of freedom. The statistic is further simplified to become the statistic as stated in the theorem. \square
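A sketch of the resulting variance-ratio test (with made-up data, assuming Python with NumPy and scipy; the two-sided p-value shown uses one common convention of doubling the smaller tail) is the following.

    import numpy as np
    from scipy.stats import f

    x = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.8])
    y = np.array([9.5, 10.1, 9.2, 9.9, 9.4])
    n, m = len(x), len(y)

    # under H0: sigma_X^2 = sigma_Y^2, the ratio has an F-distribution with n-1 and m-1 df
    ratio = x.var(ddof=1) / y.var(ddof=1)
    p_value = 2 * min(f.sf(ratio, n - 1, m - 1), f.cdf(ratio, n - 1, m - 1))
    print(ratio, p_value)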

_______________________________________________________________________________________________

Concluding Remarks

Theorem 7 and Theorem 8 produce the one-sample t-statistic and the two-sample t-statistic, respectively. They are the basis for inference about one population mean and the difference of two population means, respectively. They can be used for estimation (e.g. construction of confidence intervals) or decision making (e.g. hypothesis testing). On the other hand, Theorem 5 produces a chi-squared statistic for inference about one population variance. Theorem 10 produces an F-statistic that can be used for inference about two population variances. Since the F-statistic is based on the ratio of the two sample variances, it can be used for inference about the relative magnitude of the two population variances.

The purpose in this post is to highlight the important roles of the chi-squared distribution. We now discuss briefly the quality of the derived statistical procedures. The procedures discussed here (t or F) are exactly correct if the populations from which the samples are drawn are normal. Real life data usually do not exactly follow normal distributions. Thus the usefulness of these statistics in practice depends on how strongly they are affected by non-normality. In other words, if there is a significant deviation from the assumption of normal distribution, are these procedures still reliable?

A statistical inference procedure is called robust if the calculated results drawn from the procedure are insensitive to deviations of assumptions. For a non-robust procedure, the result would be distorted if there is deviation from assumptions. For example, the t procedures are not robust against outliers. The presence of outliers in the data can distort the results since the t procedures are based on the sample mean \overline{x} and sample variance S^2, which are not resistant to outliers.

On the other hand, the t procedures for inference about means are quite robust against slight deviations from the normal population assumption. The F procedures for inference about variances are not so robust, so they must be used with care. Even a slight deviation from the normality assumption can make the results from the F procedures unreliable. For a more detailed but accessible discussion on robustness, see [1].

When the sample sizes are large, the sample mean \overline{x} is close to a normal distribution (this result is the central limit theorem), so the concern about deviation from the normality assumption becomes less important. When the sample sizes are large, we can simply use the Z statistic for inference about the means. Likewise, when the sample sizes are large, the sample variance S^2 is an accurate estimate of the population variance \sigma^2 regardless of the population distribution. This fact is related to the law of large numbers. Thus the statistical procedures described here are primarily for small samples drawn from populations that are assumed to be normal.

_______________________________________________________________________________________________

Reference

  1. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 7th ed., W. H. Freeman and Company, New York, 2012
  2. Wackerly D. D., Mendenhall III W., Scheaffer R. L., Mathematical Statistics with Applications, Thomson Learning, Inc., California, 2008

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}

The Chi-Squared Distribution, Part 1

The chi-squared distribution has a simple definition from a mathematical standpoint and yet plays an important role in statistical sampling theory. This post is the first post in a three-part series that gives a mathematical story of the chi-squared distribution.

This post is an introduction which highlights the fact that mathematically the chi-squared distribution arises from the gamma distribution and that the chi-squared distribution has an intimate connection with the normal distribution. This post lays the groundwork for the subsequent posts.

The next post (Part 2) describes the roles played by the chi-squared distribution in forming the various sampling distributions related to the normal distribution. These sampling distributions are used for making inference about the population from which the sample is taken. The population parameters of interest here are the population mean, variance, and standard deviation. The population from which the sample is taken is assumed to be modeled adequately by a normal distribution.

Part 3 describes the chi-squared test, which is used for making inference on categorical data (versus quantitative data).

These three parts only scratch the surface with respect to the roles played by the chi-squared distribution in statistics. Thus the discussion in this series only serves as an introduction to the chi-squared distribution.

_______________________________________________________________________________________________

Defining the Chi-Squared Distribution

A random variable Y is said to follow the chi-squared distribution with k degrees of freedom if the following is the density function of Y.

    \displaystyle f_Y(y)=\frac{1}{\Gamma(\frac{k}{2}) \ 2^{\frac{k}{2}}}  \ y^{\frac{k}{2}-1} \ e^{-\frac{y}{2}} \ \ \ \ \ \ \ y>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

where k is a positive integer. In some sources, the distribution is simply called the \chi^2 distribution. Essentially the distribution defined in (1) is a gamma distribution with shape parameter \frac{k}{2} and scale parameter 2 (or rate parameter \frac{1}{2}). Note that the chi-squared distribution with 2 degrees of freedom (when k=2) is simply an exponential distribution with mean 2. The following figure shows the chi-squared density functions for degrees of freedom 1, 2, 3, 5 and 10.

Figure 1 – Chi-squared Density Curves

Just from the gamma connection, the mean and variance are E[Y]=k and Var[Y]=2k. In other words, the mean of a chi-squared distribution is the same as its degrees of freedom and its variance is always twice the degrees of freedom. Since the chi-squared distribution is a gamma distribution, the higher moments E[Y^n] are also known. Consequently the properties that depend on E[Y^n] can be easily computed. See here for the basic properties of the gamma distribution. The following gives the mean, variance and the moment generating function (MGF) for the chi-squared random variable Y with k degrees of freedom.

    E[Y]=k

    Var[Y]=2 k

    \displaystyle M_Y(t)=\biggl( \frac{1}{1-2t} \biggr)^{\frac{k}{2}} \ \ \ \ \ \ t<\frac{1}{2}
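These facts can be checked numerically (a sketch assuming Python with scipy): the chi-squared distribution with k degrees of freedom is the gamma distribution with shape k/2 and scale 2, so the two density functions agree and the mean and variance are k and 2k.

    from scipy.stats import chi2, gamma

    k = 5
    print(chi2.mean(k), chi2.var(k))                          # 5.0 and 10.0
    print(chi2.pdf(3.7, k), gamma.pdf(3.7, a=k/2, scale=2))   # identical density values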

_______________________________________________________________________________________________

Independent Sum of Chi-Squared Distributions

In general, the MGF of an independent sum Y_1+Y_2+\cdots+Y_n is simply the product of the MGFs of the individual random variables Y_i. Note that the product of chi-squared MGFs is also a chi-squared MGF, with the exponent being the sum of the individual exponents. This brings up another point that is important for the subsequent discussion, i.e. the independent sum of chi-squared distributions is also a chi-squared distribution. The following theorem states this fact more precisely.

Theorem 1
If Y_1,Y_2,\cdots,Y_n are chi-squared random variables with degrees of freedom k_1,k_2,\cdots,k_n, respectively, then the independent sum Y_1+Y_2+\cdots+Y_n has a chi-squared distribution with k_1+k_2+\cdots+k_n degrees of freedom.

Thus the result of summing independent chi-squared distributions is another chi-squared distribution whose degrees of freedom is the total of all the individual degrees of freedom. This follows from the fact that if gamma distributions have identical scale parameter, then their independent sum is a gamma distribution with the shape parameter being the sum of the shape parameters. This point is discussed in more detail here.

_______________________________________________________________________________________________

The Connection with Normal Distributions

As shown in the above section, the chi-squared distribution is simple from a mathematical standpoint. Since it is a gamma distribution, it possesses all the properties that are associated with the gamma family. Of course, the gamma connection is far from the whole story. One important fact is that the chi-squared distribution is naturally obtained from sampling from a normal distribution.

Theorem 2
Suppose that the random variable X follows a standard normal distribution, i.e. the normal distribution with mean 0 and standard deviation 1. Then Y=X^2 follows a chi-squared distribution with 1 degree of freedom.

Proof
By definition, the following is the cumulative distribution function (CDF) of Y=X^2.

    \displaystyle \begin{aligned}F_Y(y)=P[Y \le y] &=P[X^2 \le y]=P[-\sqrt{y} \le X \le \sqrt{y}]=2 \ P[0 \le X \le \sqrt{y}] \\&=2 \ \int_0^{\sqrt{y}} \frac{1}{\sqrt{2 \pi}} \ e^{-\frac{x^2}{2}} \ dx  \end{aligned}

Upon differentiating F_Y(y), the density function is obtained.

    \displaystyle \begin{aligned}f_Y(y)=\frac{d}{dy}F_Y(y) &=2 \ \frac{d}{dy} \int_0^{\sqrt{y}} \frac{1}{\sqrt{2 \pi}} \ e^{-\frac{x^2}{2}} \ dx  \\ &=\frac{2}{\sqrt{2 \pi}} \ e^{-\frac{y}{2}} \ \frac{1}{2 \sqrt{y}} \\ &=\frac{1}{\Gamma(\frac{1}{2}) \ 2^{\frac{1}{2}}}  \ y^{\frac{1}{2}-1} \ e^{-\frac{y}{2}}\end{aligned}

The last equality uses the fact that \Gamma(\frac{1}{2})=\sqrt{\pi}. Note that the density is that of a chi-squared distribution with 1 degree of freedom. \square

With Theorem 2 and the basic result in Theorem 1, there are more ways to obtain chi-squared distributions from sampling from normal distributions. For example, normalizing a sample item from a normal sample and then squaring it produces a chi-squared observation with 1 degree of freedom. Similarly, normalizing each item in a normal sample and squaring each normalized observation produces a sum that has a chi-squared distribution. These facts are made precise in the following corollaries.

Corollary 3
Suppose that the random variable X follows a normal distribution with mean \mu and standard deviation \sigma. Then Y=[(X-\mu) / \sigma]^2 follows a chi-squared distribution with 1 degree of freedom.

Corollary 4
Suppose that X_1,X_2,\cdots,X_n is a random sample drawn from a normal distribution with mean \mu and standard deviation \sigma. Then the following random variable follows a chi-squared distribution with n degrees of freedom.

    \displaystyle \sum \limits_{j=1}^n \biggl( \frac{X_j-\mu}{\sigma} \biggr)^2=\biggl( \frac{X_1-\mu}{\sigma} \biggr)^2+\biggl( \frac{X_2-\mu}{\sigma} \biggr)^2+\cdots+\biggl( \frac{X_n-\mu}{\sigma} \biggr)^2

_______________________________________________________________________________________________

Calculating Chi-Squared Probabilities

In working with the chi-squared distribution, it is necessary to evaluate the cumulative distribution function (CDF). In hypothesis testing, it is necessary to calculate the p-value given the value of the chi-squared statistic. In confidence interval estimation, it is necessary to determine the critical value at a given confidence level. The standard procedure at one time was to use a chi-squared table. A typical chi-squared table can be found here. We demonstrate how to find chi-squared probabilities first using the table approach and subsequently using software (Excel in particular).

The table gives right-tail probabilities. For each df, the table in the link given above gives the chi-squared value (on the x-axis) \chi_{\alpha}^2 for a given area \alpha of the right tail. This table lookup is illustrated in the diagram below.

Figure 2 – Right Tail of Chi-squared Distribution

For df = 1, \chi_{0.1}^2=2.706, thus P[\chi^2 > 2.706]=0.1 and P[\chi^2 < 2.706]=0.9. So for df = 1, the 90th percentile of the chi-squared distribution is 2.706. The following shows more table lookups.

    df = 2, \chi_{0.01}^2=9.210.
    P[\chi^2 > 9.210]=0.01 and P[\chi^2 < 9.210]=0.99
    The 99th percentile of the chi-squared distribution with df = 2 is 9.210.

    df = 15, \chi_{0.9}^2=8.547.
    P[\chi^2 > 8.547]=0.9 and P[\chi^2 < 8.547]=0.1
    The 10th percentile of the chi-squared distribution with df = 15 is 8.547.

The choices for \alpha in the table are limited. Using software allows more choices for \alpha and gives more precise values. For example, Microsoft Excel provides the following two functions.

    =CHISQ.DIST(x, degree_freedom, cumulative)

    =CHISQ.INV(probability, degree_freedom)

The two functions in Excel give information about the left-tail of the chi-squared distribution. The function CHISQ.DIST returns the left-tailed probability of the chi-squared distribution. The parameter cumulative is either TRUE or FALSE, with TRUE meaning that the result is the cumulative distribution function and FALSE meaning that the result is the probability density function. On the other hand, the function CHISQ.INV returns the inverse of the left-tailed probability of the chi-squared distribution.

If the goal is to find probability given an x-value, use the function CHISQ.DIST. On the other hand, if the goal is to look for the x-value given the left-tailed value (probability), then use the function CHISQ.INV. In the table approach, once the value \chi_{\alpha}^2=x is found, the interplay between the probability (\alpha) and x-value is clear. In the case of Excel, one must choose the function first depending on the goal. The following gives the equivalent results for the table lookup presented above.

    =CHISQ.DIST(2.706, 1, TRUE) = 0.900028622
    =CHISQ.INV(0.9, 1) = 2.705543454

    =CHISQ.DIST(9.21, 2, TRUE) = 0.989998298
    =CHISQ.INV(0.99, 2) = 9.210340372

    =CHISQ.DIST(8.547, 15, TRUE) = 0.100011427
    =CHISQ.INV(0.1, 15) = 8.546756242
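The same lookups can be done outside of Excel; for example, a sketch using Python with scipy, where chi2.cdf plays the role of CHISQ.DIST(x, df, TRUE) and chi2.ppf plays the role of CHISQ.INV(probability, df).

    from scipy.stats import chi2

    print(chi2.cdf(2.706, 1), chi2.ppf(0.9, 1))     # about 0.9000 and 2.7055
    print(chi2.cdf(9.21, 2), chi2.ppf(0.99, 2))     # about 0.9900 and 9.2103
    print(chi2.cdf(8.547, 15), chi2.ppf(0.1, 15))   # about 0.1000 and 8.5468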

_______________________________________________________________________________________________
\copyright \ 2016 - \text{Dan Ma}