Chi-squared test

The chi-squared test is a useful and versatile test. There are several interpretations of the chi-squared test, which are discussed in three previous posts. The different uses of the same test can be confusing to students. This post connects the ideas in the three previous posts and supplements the previous discussions.

The chi-squared test is based on the chi-squared statistic, which is a measure of the magnitude of the difference between the observed counts and the expected counts in an experimental design that involves one or more categorical variables. The null hypothesis is the assumption that the observed counts and the expected counts agree. A large value of the chi-squared statistic gives evidence for the rejection of the null hypothesis.

The chi-squared test is also simple to use. The chi-squared statistic has an approximate chi-squared distribution, which makes it easy to evaluate the sample data. The chi-squared test is included in various software packages. For applications with a small number of categories, the calculation can even be done with a hand-held calculator.

_______________________________________________________________________________________________

The Goodness-of-Fit Test and the Test of Homogeneity

The three interpretations of the chi-squared test have been discussed in these posts: goodness-of-fit test, test of homogeneity and test of independence.

The three different uses of the test as discussed in the three previous posts can be kept straight by having a firm understanding of the underlying experimental design.

For the goodness-of-fit test, there is only one population involved. The experiment is to measure one categorical variable on one population. Thus only one sample is used in applying the chi-squared test. The one-sample data would produce the observed counts for the categorical variable in question. Let’s say the variable has k cells. Then there would be k observed counts. The expected counts for the k cells would come from a hypothesized distribution of the categorical variable. The chi-squared statistic is then the sum of the k squared differences of the observed and expected counts (each normalized by dividing by the expected count). Essentially the hypothesized distribution is the null hypothesis. More specifically, the null hypothesis would be the statement that the cell probabilities are derived from the hypothesized distribution.

As a quick example, we may want to answer the question of whether a given die is a fair die. We then observe n rolls of the die and classify the rolls into 6 cells (the values 1 to 6). The null hypothesis is that the values of the die follow a uniform distribution. Another way to state the hypothesis is that each cell probability is 1/6. Another example is testing whether the claim frequency of a group of insured drivers follows a Poisson distribution. The cell probabilities are then calculated based on the assumption of a Poisson distribution. In short, the goodness-of-fit test is to test whether the observed counts for one categorical variable come from (or fit) a hypothesized distribution. See Example 1 and Example 2 in the post on the goodness-of-fit test.

In the test of homogeneity, the focus is to compare two or more populations (or two or more subpopulations of a population) on the same categorical variable, i.e. whether the categorical variable in question follows the same distribution across the different populations. For example, do two different groups of insured drivers exhibit the same claim frequency rates? Do adults with different educational attainment levels have the same proportions of current smokers/former smokers/never smokers? Are political affiliations similar across racial/ethnic groups? In this test, the goal is to determine whether the cells of the categorical variable have the same proportions across the populations, hence the name test of homogeneity. In the experiment, researchers would sample each population (or group) separately on the categorical variable in question. Thus there will be multiple samples (one for each group) and the samples are independent.

In the test of homogeneity, the calculation of the chi-squared statistic would involve adding up the squared differences of the observed counts and expected counts for the multiple samples. For illustration, see Example 1 and Example 2 in the post on test of homogeneity.
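
To make the mechanics concrete, here is a minimal Python sketch of a test of homogeneity using hypothetical counts (not data from the previous posts): two groups of drivers, each sampled separately, classified by annual claim count. The function scipy.stats.chi2_contingency computes the expected counts under the null hypothesis, the chi-squared statistic and the p-value directly from the table of observed counts.

    # A minimal sketch of the chi-squared test of homogeneity (hypothetical data).
    # Each row is an independently drawn sample: counts of 0, 1 and 2+ claims per year.
    from scipy.stats import chi2_contingency

    observed = [[420, 130, 50],    # group 1 of insured drivers (hypothetical counts)
                [380, 160, 60]]    # group 2 of insured drivers (hypothetical counts)

    statistic, p_value, df, expected = chi2_contingency(observed)
    print(statistic, p_value, df)  # reject homogeneity when p_value < alpha
    print(expected)                # expected counts computed under the null hypothesis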

_______________________________________________________________________________________________

Test of Independence

The test of independence can be confused with the test of homogeneity. The objectives of the two tests can be similar. For example, a test of hypothesis might seek to determine whether the proportions of smoking statuses (current smoker, former smoker and never smoker) are the same across the groups with different education levels. This sounds like a test of homogeneity since it seeks to determine whether the distribution of smoking status is the same across the different groups (levels of educational attainment). However, a test of independence can also have this same objective.

The difference between the test of homogeneity and the test of independence is one of experimental design. In the test of homogeneity, the researchers sample each group (or population) separately. For example, they would sample individuals from groups with various levels of education separately and classify the individuals in each group by smoking status. The chi-squared test to use in this case is the test of homogeneity. In this experimental design, the experimenter might sample 1,000 individuals who are not high school graduates, 1,000 individuals who are high school graduates, 1,000 individuals who have some college and so on. Then the experimenter would compare the distribution of smoking status across the different samples.

An experimenter using a test of independence might try to answer the same question but proceed in a different way. The experimenter would sample individuals from a given population and observe two categorical variables (e.g. level of education and smoking status) for the same individual.

Then the researchers would classify each individual into a cell in a two-way table. See Table 3b in the previous post on the test of independence. The values of the level of education go across the columns of the table (the column variable). The values of the smoking status go down the rows (the row variable). Each individual in the sample would belong to one cell in the table according to the values of the row and column variables. The two-way table helps determine whether the row variable and the column variable are associated in the given population. In other words, the experimenter is interested in finding out whether one variable explains the other (or one variable affects the other).

For ease of discussion, let’s say the column variable (level of education) is the explanatory variable. The experimenter would then be interested in whether the conditional distribution of the row variable (smoking status) is similar or different across the columns. If the conclusion is that it is similar, it means that the column variable does not affect the row variable (or the two variables are not associated). This would also mean that the distribution of smoking status is the same across the different levels of education (a conclusion of homogeneity).

If the conclusion is that the conditional distribution of the row variable (smoking status) is different across the columns, then the column variable does affect the row variable (or the two variables are associated). This would also mean that the distribution of smoking status differs across the different levels of education (a conclusion of non-homogeneity).

The test of independence and the test of homogeneity are based on two different experimental designs. Hence their implementations of the chi-squared statistic are different. However, each design can be structured to answer similar questions.
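
To highlight the difference in design, the following Python sketch (with hypothetical raw observations, not data from the previous post) records two categorical variables on each individual in a single sample, cross-tabulates them into a two-way table using pandas, and then applies the same chi-squared machinery.

    # A minimal sketch of the chi-squared test of independence (hypothetical data).
    # One sample; two categorical variables are recorded for each individual.
    import pandas as pd
    from scipy.stats import chi2_contingency

    sample = pd.DataFrame({
        "education": ["no HS", "HS", "college", "HS", "college", "no HS", "college", "HS"],
        "smoking":   ["current", "never", "never", "former", "never", "current", "former", "never"],
    })  # hypothetical observations; a real study would have far more rows

    table = pd.crosstab(sample["smoking"], sample["education"])  # two-way table of observed counts
    statistic, p_value, df, expected = chi2_contingency(table)
    print(table)
    print(statistic, p_value, df)  # a small p-value suggests the two variables are associated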

_______________________________________________________________________________________________
\copyright 2017 – Dan Ma

The Chi-Squared Distribution, Part 3a

This post is part 3 of a three-part series on the chi-squared distribution. In this post, we discuss the role played by the chi-squared distribution in experiments or random phenomena that produce measurements that are categorical rather than quantitative (part 2 deals with quantitative measurements). An introduction to the chi-squared distribution is found in part 1.

The chi-squared test discussed here is also referred to as Pearson’s chi-squared test, which was formulated by Karl Pearson in 1900. It can be used to assess three types of comparison on categorical variables – goodness of fit, homogeneity, and independence. As a result, we break up the discussion into 3 parts – part 3a (goodness of fit, this post), part 3b (test of homogeneity) and part 3c (test of independence).

_______________________________________________________________________________________________

Multinomial Experiments

Let’s look at the setting for Pearson’s goodness-of-fit test. Consider a random experiment consisting of a series of independent trials each of which results in exactly one of k categories. We are interested in summarizing the counts of the trials that fall into the k distinct categories. Some examples of such random experiments are:

  • In rolling a die n times, consider the counts of the faces of the die.
  • Perform a series of experiments each of which is a toss of three coins. Summarize the experiments according to the number of heads, 0, 1, 2, and 3, that occur in each experiment.
  • Blood donors can be classified into the blood types A, B, AB and O.
  • Record the number of automobile accidents per week in a one-mile stretch of highway. Classify the weekly accident counts into the groupings 0, 1, 2, 3, 4 and 5+.
  • A group of auto insurance policies are classified into the claim frequency rates of 0, 1, 2, 3, 4+ accidents per year.
  • Auto insurance claims are classified into various claim size groupings, e.g. under 1000, 1000 to 5000, 5000 to 10000 and 10000+.
  • In auditing financial transactions in financial documents (accounting statements, expense reports etc), the leading digits of financial figures can be classified into 9 cells: 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Each of these examples can be referred to as a multinomial experiment. The characteristics of such an experiment are:

  • The experiment consists of performing n identical trials that are independent.
  • For each trial, the outcome falls into exactly one of k categories or cells.
  • The probability of the outcome of a trial falling into a particular cell is constant across all trials.

For cell j, let p_j be the probability of the outcome falling into cell j. Of course, p_1+p_2+\cdots+p_k=1. We are interested in the joint random variables Y_1,Y_2,\cdots,Y_k where Y_j is the number of trials whose outcomes fall into cell j.

If k=2 (only two categories for each trial), then the experiment is a binomial experiment. Then one of the categories can be called success (with cell probability p) and the other is called failure (with cell probability 1-p). If Y_1 is the count of the successes, then Y_1 has a binomial distribution with parameters n and p.

In general, the variables Y_1,Y_2,\cdots,Y_k have a multinomial distribution. To be a little more precise, the random variables Y_1,Y_2,\cdots,Y_{k-1} have a multinomial distribution with parameters n and p_1,p_2,\cdots,p_{k-1}. Note that the last variable Y_k is deterministic since Y_k=n-(Y_1+\cdots+Y_{k-1}).
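
The joint counts Y_1,\cdots,Y_k can be simulated directly. The following Python sketch (using the fair-die probabilities purely for illustration) draws one realization of the cell counts of a multinomial experiment.

    # Simulate the cell counts of a multinomial experiment: a fair die rolled n times.
    import numpy as np

    rng = np.random.default_rng(seed=1234)   # the seed is arbitrary
    n = 240
    p = [1/6] * 6                            # cell probabilities, summing to 1
    counts = rng.multinomial(n, p)           # one draw of (Y_1, ..., Y_6)
    print(counts, counts.sum())              # the cell counts always sum to n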

In the discussion here, the objective is to make inference on the cell probabilities p_1,p_2,\cdots,p_k. The hypotheses in the statistical test are expressed in terms of specific values of p_j, j=1,2,\cdots,k. For example, the null hypothesis may be of the following form:

    H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k

where p_{j,0} are the hypothesized values of the cell probabilities. It is cumbersome to calculate probabilities for the multinomial distribution. As a result, it would be difficult (if not impossible) to calculate the exact level of significance, which is the probability of a type I error. Thus it is critical to use a test statistic whose distribution does not require working with the multinomial distribution directly. Fortunately this problem was solved by Karl Pearson. He formulated a test statistic that has an approximate chi-squared distribution.

_______________________________________________________________________________________________

Test Statistic

The random variables Y_1,Y_2,\cdots,Y_k discussed above have a multinomial distribution with parameters n and p_1,p_2,\cdots,p_k. Of course, each p_j is the probability that the outcome of a trial falls into cell j. The marginal distribution of each Y_j is a binomial distribution with parameters n and p_j, with p_j being the probability of success. Thus the expected value and the variance of Y_j are E[Y_j]=n p_j and Var[Y_j]=n p_j (1-p_j). The following is the chi-squared test statistic.

    \displaystyle \chi^2=\sum \limits_{j=1}^k \frac{(Y_j-n \ p_j)^2}{n \ p_j}=\sum \limits_{j=1}^k \frac{(Y_j-E[Y_j])^2}{E[Y_j]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

The statistic defined in (1) was proposed by Karl Pearson in 1900. It is defined by summing the squares of the difference of the observed counts Y_j and the expected counts E[Y_j] where each squared difference is normalized by the expected count (i.e. divided by the expected count). On one level, the test statistic in (1) seems intuitive since it involves all the k deviations Y_j-E[Y_j]. If the observed values Y_j are close to the expected cell counts, then the test statistic in (1) would have a small value.
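
As a concrete rendering of (1), the short Python function below computes the Pearson statistic from the observed counts and the hypothesized cell probabilities. This is only a sketch; scipy.stats.chisquare performs the same calculation.

    # Pearson's chi-squared statistic (1): the sum of (observed - expected)^2 / expected.
    def pearson_statistic(observed, probs):
        n = sum(observed)
        expected = [n * p for p in probs]    # expected cell counts n * p_j
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))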

The chi-squared test statistic defined in (1) has an approximate chi-squared distribution when the number of trials n is large. The proof of this fact will not be discussed here. We demonstrate with the case for k=2.

    \displaystyle \begin{aligned} \sum \limits_{j=1}^2 \frac{(Y_j-n \ p_j)^2}{n \ p_j}&=\frac{p_2 \ (Y_1-n p_1)^2+p_1 \ (Y_2-n p_2)^2}{n \ p_1 \ p_2} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+p_1 \ ((n-Y_1)-n (1-p_1))^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+ p_1 \ (Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\biggl( \frac{Y_1-n p_1}{\sqrt{n \ p_1 \ (1-p_1)}} \biggr)^2 =\biggl( \frac{Y_1-E[Y_1]}{\sqrt{Var[Y_1]}} \biggr)^2 \end{aligned}

The quantity inside the brackets in the last step is approximately standard normal according to the central limit theorem. Since the square of a standard normal random variable has a chi-squared distribution with one degree of freedom (see Part 1), the last step in the above derivation has an approximate chi-squared distribution with 1 df.
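
The identity derived above can also be checked numerically. In the sketch below (with arbitrary illustrative values of n, p_1 and Y_1), the two-cell Pearson statistic agrees exactly with the squared standardized binomial count.

    # Check the k = 2 identity: the two-cell Pearson statistic equals
    # (Y_1 - n p_1)^2 / (n p_1 (1 - p_1)), the squared standardized binomial count.
    import math

    n, p1, y1 = 100, 0.3, 37    # arbitrary illustrative values
    pearson = (y1 - n * p1) ** 2 / (n * p1) + ((n - y1) - n * (1 - p1)) ** 2 / (n * (1 - p1))
    z_squared = ((y1 - n * p1) / math.sqrt(n * p1 * (1 - p1))) ** 2
    print(pearson, z_squared)   # both print 2.333... for these values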

In order for the chi-squared distribution to provide an adequate approximation to the test statistic in (1), a rule of thumb requires that each expected cell count E[Y_j] be at least five. The null hypothesis to be tested is that the cell probabilities p_j are certain specified values p_{j,0} for j=1,2,\cdots,k. The following is the formal statement.

    H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The null hypothesis is to be tested against all possible alternatives. In other words, the alternative hypothesis H_1 is the statement that p_j \ne p_{j,0} for at least one j.

The chi-squared test statistic in (1) can be used for a goodness-of-fit test, i.e. to test how well a probability model fits the sample data; in other words, to test whether the observed data come from a hypothesized probability distribution. Example 2 below will test whether the Poisson model is a good fit for a set of claim frequency data.

_______________________________________________________________________________________________

Degrees of Freedom

Now that we have addressed the distribution for the test statistic in (1), we need to address two more issues. One is the direction of the hypothesis test (one-tailed or two-tailed). The second is the degrees of freedom. The direction of the test is easy to see. Note that the chi-squared test statistic in (1) is always a positive value. On the other hand, if the difference between observed cell counts and expected cell counts is large, the large difference would contradict the null hypothesis. Thus if the chi-squared statistic has a large value, we should reject the null hypothesis. So the correct test to use is the upper tailed chi-squared test.

The number of degrees of freedom is obtained by subtracting one from the cell count k for each independent linear restriction placed on the cell probabilities. There is at least one linear restriction. The sum of all the cell probabilities must be 1. Thus the degrees of freedom must be the result of reducing k by one at least one time. This means that the degrees of freedom of the chi-squared statistic in (1) is at most k-1.

Furthermore, if any parameter in the calculation of the specified cell probabilities is unknown and must be estimated from the data, then k-1 is reduced further by one for each such parameter. Any unknown parameter that needs to be estimated from data should be estimated with a maximum likelihood estimator (MLE). All these points will be demonstrated by the examples below.

If the value of the chi-squared statistic in (1) is “large”, we reject the null hypothesis H_0 stated in (2). By large we mean the value of the chi-squared statistic exceeds the critical value for the desired level of significance. The critical value is the point in the chi-squared distribution (with the appropriate df) whose upper-tail area is \alpha, where \alpha is the desired level of significance (e.g. \alpha=0.05 or \alpha=0.01 or some other appropriate level). Instead of using the critical value, the p-value approach can also be used. The critical value or the p-value can be looked up in a table or computed using software. For the examples below, chi-squared functions in Excel are used.

_______________________________________________________________________________________________

Examples

Example 1
Suppose that we wish to test whether a given die is a fair die. We roll the die 240 times and the following table shows the results.

    \displaystyle \begin{array} {rr} \text{Cell} & \text{Frequency} \\ 1 & 38  \\ 2 & 35 \\ 3 & 37 \\ 4 & 38  \\ 5 & 42 \\ 6 & 50 \\ \text{ } & \text{ } \\ \text{Total} & 240   \end{array}

The null hypothesis

    \displaystyle H_0: p_1=p_2=p_3=p_4=p_5=p_6=\frac{1}{6}

is tested against the alternative that at least one of the equalities is not true. This example is the simplest problem for testing cell probabilities. Since the specified values for the cell probabilities in H_0 are known, the degrees of freedom is one less than the cell count. Thus df = 5. The following is the chi-squared statistic based on the data and the null hypothesis.

    \displaystyle \begin{aligned} \chi^2&=\frac{(38-40)^2}{40}+\frac{(35-40)^2}{40}+\frac{(37-40)^2}{40} \\& \ \ + \frac{(38-40)^2}{40}+\frac{(42-40)^2}{40}+\frac{(50-40)^2}{40}=3.65  \end{aligned}

At \alpha=0.05 level of significance, the chi-squared critical value at df = 5 is \chi_{0.05}^2(5)=11.07049769. Since 3.65 < 11.07, the hypothesis that the die is fair is not rejected at \alpha=0.05. The p-value is P[\chi^2 > 3.65]=0.6. With such a large p-value, we also come to the conclusion that the null hypothesis is not rejected. \square
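
The arithmetic of Example 1 can be reproduced with a few lines of Python; the following sketch uses scipy.stats.chisquare, which computes the statistic in (1) and its p-value in one call.

    # Example 1 in Python: observed die counts against the fair-die hypothesis.
    from scipy.stats import chisquare

    observed = [38, 35, 37, 38, 42, 50]
    expected = [40] * 6                  # 240 rolls times 1/6 per face
    statistic, p_value = chisquare(observed, f_exp=expected)
    print(statistic, p_value)            # about 3.65 and 0.60, matching the calculation above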

In all the examples, the critical values and the p-values are obtained by using the following functions in Excel.

    critical value
    =CHISQ.INV.RT(level of significance, df)

    p-value
    =1 - CHISQ.DIST(test statistic, df, TRUE)
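
For readers working outside Excel, the same two quantities can be obtained from the chi-squared distribution in scipy. This is a sketch of the equivalents, not part of the original examples.

    # scipy equivalents of the Excel functions above (illustrated with Example 1's values).
    from scipy.stats import chi2

    alpha, df, statistic = 0.05, 5, 3.65
    critical_value = chi2.ppf(1 - alpha, df)   # same as CHISQ.INV.RT(alpha, df)
    p_value = chi2.sf(statistic, df)           # same as 1 - CHISQ.DIST(statistic, df, TRUE)
    print(critical_value, p_value)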

Example 2
We now give an example of the chi-squared goodness-of-fit test. The annual numbers of auto accident claims for 700 drivers are recorded by an insurance company. The claim frequency data is shown in the following table.

    \displaystyle \begin{array}{rrr} \text{Claim Count} & \text{ } & \text{Frequency} \\ 0 & \text{ } & 351  \\ 1 & \text{ } & 241 \\ 2 & \text{ } & 73 \\ 3 & \text{ } & 29  \\ 4 & \text{ } & 6 \\ 5+ & \text{ } & 0 \\ \text{ } & \text{ } & \text{ } \\ \text{Total} & \text{ } & 700   \end{array}

Test the hypothesis that the annual claim count for a driver has a Poisson distribution. Use \alpha=0.05. Assume that the claim frequencies of the drivers in question are independent.

The hypothesized distribution of the annual claim frequency is a Poisson distribution with unknown mean \lambda. The MLE of the parameter \lambda is the sample mean, which in this case is \hat{\lambda}=\frac{498}{700}=0.711428571.

Under the assumption that the claim frequency is Poisson with mean \hat{\lambda}, the cell probabilities are calculated using \hat{\lambda}.

    \displaystyle p_1=P[Y=0]=e^{-\hat{\lambda}}=0.4909

    \displaystyle p_2=P[Y=1]=\hat{\lambda} \ e^{-\hat{\lambda}}=0.3493

    \displaystyle p_3=P[Y=2]=\frac{1}{2} \ \hat{\lambda}^2 \ e^{-\hat{\lambda}}=0.1242

    \displaystyle p_4=P[Y=3]=\frac{1}{3!} \ \hat{\lambda}^3 \ e^{-\hat{\lambda}}=0.0295

    \displaystyle p_5=P[Y \ge 4]=1-P[Y=0]-P[Y=1]-P[Y=2]-P[Y=3]=0.0061

Then the null hypothesis is:

    H_0: p_1=0.4909, p_2=0.3493, p_3=0.1242, p_4=0.0295, p_5=0.0061

The null hypothesis is tested against all alternatives. The following table shows the calculation of the chi-squared statistic.

    \displaystyle \begin{array}{rrrrrrrrr}   \text{Cell} & \text{ } & \text{Claim Count} & \text{ } & \text{Cell Probability} & \text{ } & \text{Expected Count} & \text{ } & \text{Chi-Squared}   \\ 1 & \text{ } & 0 & \text{ } & 0.4909 & \text{ } & 343.63 & \text{ } & 0.15807   \\ 2 & \text{ } & 1 & \text{ } & 0.3493 & \text{ } & 244.51 & \text{ } & 0.05039   \\ 3 & \text{ } & 2 & \text{ } & 0.1242 & \text{ } & 86.94 & \text{ } & 2.23515   \\ 4 & \text{ } & 3 & \text{ } & 0.0295 & \text{ } & 20.65 & \text{ } & 3.37639    \\ 5 & \text{ } & 4+ & \text{ } & 0.0061 & \text{ } & 4.27 & \text{ } & 0.70091   \\ \text{ } & \text{ } & \text{ }   \\ \text{Total} & \text{ } & \text{ } & \text{ } & 1.0000 & \text{ } & \text{ } & \text{ } & 6.52091   \end{array}

The degrees of freedom of the chi-squared statistic is df = 5 - 1 - 1 = 3. The first reduction of one is due to the linear restriction that all cell probabilities sum to 1. The second reduction is due to the fact that one unknown parameter \lambda has to be estimated using sample data. Using Excel, the critical value is \chi_{0.05}^2(3)=7.814727903. The p-value is P[\chi^2 > 6.52091]=0.088841503. Thus the null hypothesis is not rejected at the level of significance \alpha=0.05. \square
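
Example 2 can be reproduced in Python as well. The sketch below recomputes the MLE of \lambda, the five cell probabilities and the chi-squared statistic; the ddof argument of scipy.stats.chisquare imposes the extra reduction in degrees of freedom for the estimated parameter. The result differs slightly from the table above because the table rounds the cell probabilities to four decimal places.

    # Example 2 in Python: goodness of fit of a Poisson model to the claim count data.
    from scipy.stats import chisquare, poisson

    observed = [351, 241, 73, 29, 6]                       # counts for 0, 1, 2, 3, 4+ claims
    n = sum(observed)                                      # 700 drivers
    lam = sum(k * c for k, c in enumerate(observed)) / n   # MLE of lambda, about 0.7114

    probs = [poisson.pmf(k, lam) for k in range(4)]        # P[Y = 0], ..., P[Y = 3]
    probs.append(1 - sum(probs))                           # P[Y >= 4]
    expected = [n * p for p in probs]

    # ddof = 1 for the estimated parameter, so df = 5 - 1 - 1 = 3.
    statistic, p_value = chisquare(observed, f_exp=expected, ddof=1)
    print(statistic, p_value)                              # close to the 6.52 and 0.089 shown above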

Example 3
For many data sets, especially data sets with numbers that span multiple orders of magnitude, the first digits occur according to the probability distribution indicated below:

    Probability for leading digit 1 = 0.301
    Probability for leading digit 2 = 0.176
    Probability for leading digit 3 = 0.125
    Probability for leading digit 4 = 0.097
    Probability for leading digit 5 = 0.079
    Probability for leading digit 6 = 0.067
    Probability for leading digit 7 = 0.058
    Probability for leading digit 8 = 0.051
    Probability for leading digit 9 = 0.046

This probability distribution was discovered by Simon Newcomb in 1881 and was rediscovered by the physicist Frank Benford in 1938. Since then this distribution has become known as Benford’s law. Thus in many data sets, the leading digit 1 occurs about 30% of the time. Data sets for which this law is applicable include demographic data (e.g. income data of a large population, census data such as populations of cities and counties) and scientific data. The law is also applicable to certain financial data, e.g. tax data, stock exchange data, corporate disbursement and sales data. Thus Benford’s law is a great tool for forensic accounting and auditing.

The following shows the distribution of first digits in the population counts of all 3,143 counties in the United States (from US census data).

    Count for leading digit 1 = 972
    Count for leading digit 2 = 573
    Count for leading digit 3 = 376
    Count for leading digit 4 = 325
    Count for leading digit 5 = 205
    Count for leading digit 6 = 209
    Count for leading digit 7 = 179
    Count for leading digit 8 = 155
    Count for leading digit 9 = 149

Use the chi-squared goodness-of-fit test to test the hypothesis that the leading digits in the county population data follow Benford’s law. This example is also discussed in this blog post. \square
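
The following Python sketch carries out the requested computation on the county counts above. Since the nine Benford probabilities are fully specified, df = 9 - 1 = 8. The approximate values in the comments are rough hand checks, not figures taken from the linked post.

    # Example 3 in Python: do the leading digits of county populations follow Benford's law?
    from scipy.stats import chisquare, chi2

    observed = [972, 573, 376, 325, 205, 209, 179, 155, 149]   # leading digits 1 through 9
    benford = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
    n = sum(observed)                                           # 3,143 counties
    expected = [n * p for p in benford]

    statistic, p_value = chisquare(observed, f_exp=expected)    # df = 9 - 1 = 8
    critical_value = chi2.ppf(0.95, df=8)                       # roughly 15.5
    print(statistic, p_value, critical_value)                   # statistic roughly 11.4, below the critical value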

For further information and more examples on chi-squared test, please see the sources listed in the reference section.

_______________________________________________________________________________________________

Reference

  1. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 7th ed., W. H. Freeman and Company, New York, 2012.
  2. Wackerly D. D., Mendenhall III W., Scheaffer R. L., Mathematical Statistics with Applications, Thomson Learning, Inc., California, 2008.

_______________________________________________________________________________________________
\copyright \ 2017 - \text{Dan Ma}