# The Chi-Squared Distribution, Part 3a

This post is part 3 of a three-part series on the chi-squared distribution. Here we discuss the role the chi-squared distribution plays in experiments or random phenomena whose measurements are categorical rather than quantitative (part 2 deals with quantitative measurements). An introduction to the chi-squared distribution is found in part 1.

The chi-squared test discussed here is also referred to as Pearson’s chi-squared test, which was formulated by Karl Pearson in 1900. It can be used to assess three types of comparison on categorical variables – goodness of fit, homogeneity, and independence. As a result, we break up the discussion into 3 parts – part 3a (goodness of fit, this post), part 3b (test of homogeneity) and part 3c (test of independence).

_______________________________________________________________________________________________

Multinomial Experiments

Let’s look at the setting for Pearson’s goodness-of-fit test. Consider a random experiment consisting of a series of independent trials each of which results in exactly one of $k$ categories. We are interested in summarizing the counts of the trials that fall into the $k$ distinct categories. Some examples of such random experiments are:

• In rolling a die $n$ times, consider the counts of the faces of the die.
• Perform a series of experiments each of which is a toss of three coins. Summarize the experiments according to the number of heads, 0, 1, 2, and 3, that occur in each experiment.
• Blood donors can be classified into the blood types A, B, AB and O.
• Record the number of automobile accidents per week in a one-mile stretch of highway. Classify the weekly accident counts into the groupings 0, 1, 2, 3, 4 and 5+.
• A group of auto insurance policies is classified into the claim frequency rates of 0, 1, 2, 3, 4+ accidents per year.
• Auto insurance claims are classified into various claim size groupings, e.g. under 1000, 1000 to 5000, 5000 to 10000 and 10000+.
• In auditing financial transactions in financial documents (accounting statements, expense reports etc), the leading digits of financial figures can be classified into 9 cells: 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Each of these examples can be referred to as a multinomial experiment. The characteristics of such an experiment are

• The experiment consists of performing $n$ identical trials that are independent.
• For each trial, the outcome falls into exactly one of $k$ categories or cells.
• The probability of the outcome of a trial falling into a particular cell is constant across all trials.

For cell $j$, let $p_j$ be the probability of the outcome falling into cell $j$. Of course, $p_1+p_2+\cdots+p_k=1$. We are interested in the joint random variables $Y_1,Y_2,\cdots,Y_k$ where $Y_j$ is the number of trials whose outcomes fall into cell $j$.

If $k=2$ (only two categories for each trial), then the experiment is a binomial experiment. Then one of the categories can be called success (with cell probability $p$) and the other is called failure (with cell probability $1-p$). If $Y_1$ is the count of the successes, then $Y_1$ has a binomial distribution with parameters $n$ and $p$.

In general, the variables $Y_1,Y_2,\cdots,Y_k$ have a multinomial distribution. To be a little more precise, the random variables $Y_1,Y_2,\cdots,Y_{k-1}$ have a multinomial distribution with parameters $n$ and $p_1,p_2,\cdots,p_{k-1}$. Note that the last variable $Y_k$ is deterministic since $Y_k=n-(Y_1+\cdots+Y_{k-1})$.
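As a rough illustration of the setting described above, the following Python sketch simulates a multinomial experiment using only the standard library. The function name `simulate_counts`, the fixed seed and the choice of 240 rolls of a fair die are assumptions made for this illustration.

```python
import random
from collections import Counter

def simulate_counts(n, probs, seed=0):
    """Run n independent trials and return the counts Y_1, ..., Y_k of the k cells."""
    rng = random.Random(seed)                # fixed seed so the run is repeatable
    cells = list(range(1, len(probs) + 1))   # cells are labeled 1, 2, ..., k
    outcomes = rng.choices(cells, weights=probs, k=n)
    tally = Counter(outcomes)
    return [tally[j] for j in cells]

# Hypothetical run: 240 rolls of a fair die (k = 6 cells).
counts = simulate_counts(240, [1 / 6] * 6)
print(counts)

# The last count is determined by the others: Y_k = n - (Y_1 + ... + Y_{k-1}).
assert counts[-1] == 240 - sum(counts[:-1])
```

The counts change with the seed, but they always sum to $n$, which is why only $k-1$ of them are free to vary.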

In the discussion here, the objective is to make inference on the cell probabilities $p_1,p_2,\cdots,p_k$. The hypotheses in the statistical test are expressed in terms of specific values of $p_j$, $j=1,2,\cdots,k$. For example, the null hypothesis may be of the following form: $H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k$

where $p_{j,0}$ are the hypothesized values of the cell probabilities. It is cumbersome to calculate the probabilities for the multinomial distribution. As a result, it would be difficult (if not impossible) to calculate the exact level of significance, which is the probability of type I error. Thus it is important to have a test statistic whose distribution can be approximated without computing multinomial probabilities. Fortunately this problem was solved by Karl Pearson. He formulated a test statistic that has an approximate chi-squared distribution.

_______________________________________________________________________________________________

Test Statistic

The random variables $Y_1,Y_2,\cdots,Y_k$ discussed above have a multinomial distribution with parameters $n$ and $p_1,p_2,\cdots,p_k$, where each $p_j$ is the probability that the outcome of a trial falls into cell $j$. The marginal distribution of each $Y_j$ is binomial with parameters $n$ and $p_j$, with $p_j$ being the probability of success. Thus the expected value and the variance of $Y_j$ are $E[Y_j]=n p_j$ and $Var[Y_j]=n p_j (1-p_j)$. The following is the chi-squared test statistic. $\displaystyle \chi^2=\sum \limits_{j=1}^k \frac{(Y_j-n \ p_j)^2}{n \ p_j}=\sum \limits_{j=1}^k \frac{(Y_j-E[Y_j])^2}{E[Y_j]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

The statistic defined in (1) was proposed by Karl Pearson in 1900. It is defined by summing the squared differences between the observed counts $Y_j$ and the expected counts $E[Y_j]$, where each squared difference is normalized by the expected count (i.e. divided by the expected count). On one level, the test statistic in (1) seems intuitive since it involves all the $k$ deviations $Y_j-E[Y_j]$. If the observed values $Y_j$ are close to the expected cell counts, then the test statistic in (1) would have a small value.
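The statistic in (1) is short enough to compute directly. The following is a minimal Python sketch; the function name `pearson_statistic` is an assumption of this illustration, and the data are the die counts from Example 1 further down.

```python
def pearson_statistic(observed, probs):
    """Statistic (1): sum over cells of (Y_j - n p_j)^2 / (n p_j)."""
    n = sum(observed)                 # total number of trials
    total = 0.0
    for y, p in zip(observed, probs):
        expected = n * p              # E[Y_j] = n p_j
        total += (y - expected) ** 2 / expected
    return total

# Die counts from Example 1, tested against equal cell probabilities of 1/6.
die_counts = [38, 35, 37, 38, 42, 50]
stat = pearson_statistic(die_counts, [1 / 6] * 6)
print(round(stat, 2))   # 3.65
```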

The chi-squared test statistic defined in (1) has an approximate chi-squared distribution when the number of trials $n$ is large. The proof of this fact will not be discussed here. We demonstrate with the case for $k=2$. $\displaystyle \begin{aligned} \sum \limits_{j=1}^2 \frac{(Y_j-n \ p_j)^2}{n \ p_j}&=\frac{p_2 \ (Y_1-n p_1)^2+p_1 \ (Y_2-n p_2)^2}{n \ p_1 \ p_2} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+p_1 \ ((n-Y_1)-n (1-p_1))^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(1-p_1) \ (Y_1-n p_1)^2+ p_1 \ (Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\frac{(Y_1-n p_1)^2}{n \ p_1 \ (1-p_1)} \\&=\biggl( \frac{Y_1-n p_1}{\sqrt{n \ p_1 \ (1-p_1)}} \biggr)^2 =\biggl( \frac{Y_1-E[Y_1]}{\sqrt{Var[Y_1]}} \biggr)^2 \end{aligned}$

The quantity inside the brackets in the last step is approximately standard normal according to the central limit theorem. Since the square of a standard normal random variable has a chi-squared distribution with one degree of freedom (see Part 1), the last step in the above derivation has an approximate chi-squared distribution with 1 df.
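The algebra for $k=2$ can also be checked numerically. The sketch below compares the two-cell statistic with the squared standardized count. The values $n=100$ and $p_1=0.25$, and both function names, are hypothetical choices for the check.

```python
import math

def pearson_two_cells(y1, n, p1):
    """Statistic (1) with k = 2, where Y_2 = n - Y_1 and p_2 = 1 - p_1."""
    y2, p2 = n - y1, 1 - p1
    return (y1 - n * p1) ** 2 / (n * p1) + (y2 - n * p2) ** 2 / (n * p2)

def squared_z(y1, n, p1):
    """Square of (Y_1 - E[Y_1]) / sqrt(Var[Y_1]) for a binomial count Y_1."""
    return (y1 - n * p1) ** 2 / (n * p1 * (1 - p1))

n, p1 = 100, 0.25                     # hypothetical values for the check
for y1 in (20, 25, 31):
    assert math.isclose(pearson_two_cells(y1, n, p1), squared_z(y1, n, p1))
```

The two expressions agree for every count tried, as the derivation says they must.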

In order for the chi-squared distribution to provide an adequate approximation to the test statistic in (1), a rule of thumb requires that the expected cell counts $E[Y_j]$ are at least five. The null hypothesis to be tested is that the cell probabilities $p_j$ are certain specified values $p_{j,0}$ for $j=1,2,\cdots,k$. The following is the formal statement. $H_0: p_j=p_{j,0} \text{ for } j=1,2,3,\cdots,k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The null hypothesis is to be tested against all possible alternatives. In other words, the alternative hypothesis $H_1$ is the statement that $p_j \ne p_{j,0}$ for at least one $j$.

The chi-squared test statistic in (1) can be used for a goodness-of-fit test, i.e. to test how well a probability model fits the sample data. In other words, it tests whether the observed data come from a hypothesized probability distribution. Example 2 below will test whether the Poisson model is a good fit for a set of claim frequency data.

_______________________________________________________________________________________________

Degrees of Freedom

Now that we have addressed the distribution of the test statistic in (1), two more issues remain. One is the direction of the hypothesis test (one-tailed or two-tailed). The other is the degrees of freedom. The direction of the test is easy to see. The chi-squared test statistic in (1) is never negative, and it is close to zero when the observed cell counts are close to the expected cell counts. A large difference between observed and expected cell counts, on the other hand, contradicts the null hypothesis. Thus if the chi-squared statistic has a large value, we should reject the null hypothesis. So the correct test to use is the upper-tailed chi-squared test.

The number of degrees of freedom is obtained by subtracting one from the cell count $k$ for each independent linear restriction placed on the cell probabilities. There is at least one linear restriction. The sum of all the cell probabilities must be 1. Thus the degrees of freedom must be the result of reducing $k$ by one at least one time. This means that the degrees of freedom of the chi-squared statistic in (1) is at most $k-1$.

Furthermore, if the specified cell probabilities depend on any unknown parameter that must be estimated from the data, the degrees of freedom are reduced from $k-1$ by one more for each estimated parameter. Any such parameter should be estimated with a maximum likelihood estimator (MLE). All these points will be demonstrated by the examples below.

If the value of the chi-squared statistic in (1) is “large”, we reject the null hypothesis $H_0$ stated in (2). By large we mean that the value of the chi-squared statistic exceeds the critical value for the desired level of significance. The critical value is the point that cuts off an upper tail of area $\alpha$ in the chi-squared distribution (with the appropriate df), where $\alpha$ is the desired level of significance (e.g. $\alpha=0.05$, $\alpha=0.01$ or some other appropriate level). Instead of using the critical value, the p-value approach can also be used. The critical value or the p-value can be looked up in a table or computed using software. For the examples below, chi-squared functions in Excel are used.

_______________________________________________________________________________________________

Examples

Example 1
Suppose that we wish to test whether a given die is a fair die. We roll the die 240 times and the following table shows the results. $\displaystyle \begin{array} {rr} \text{Cell} & \text{Frequency} \\ 1 & 38 \\ 2 & 35 \\ 3 & 37 \\ 4 & 38 \\ 5 & 42 \\ 6 & 50 \\ \text{ } & \text{ } \\ \text{Total} & 240 \end{array}$

The null hypothesis $\displaystyle H_0: p_1=p_2=p_3=p_4=p_5=p_6=\frac{1}{6}$

is tested against the alternative that at least one of the equalities is not true. This example is the simplest problem for testing cell probabilities. Since the specified values for the cell probabilities in $H_0$ are known, the number of degrees of freedom is one less than the cell count. Thus df = 5. The following is the chi-squared statistic based on the data and the null hypothesis. $\displaystyle \begin{aligned} \chi^2&=\frac{(38-40)^2}{40}+\frac{(35-40)^2}{40}+\frac{(37-40)^2}{40} \\& \ \ + \frac{(38-40)^2}{40}+\frac{(42-40)^2}{40}+\frac{(50-40)^2}{40}=3.65 \end{aligned}$

At the $\alpha=0.05$ level of significance, the chi-squared critical value at df = 5 is $\chi_{0.05}^2(5)=11.07049769$. Since 3.65 < 11.07, the hypothesis that the die is fair is not rejected at $\alpha=0.05$. The p-value is $P[\chi^2 > 3.65] \approx 0.60$. With such a large p-value, we also come to the conclusion that the null hypothesis is not rejected. $\square$

In all the examples, the critical values and the p-values are obtained by using the following functions in Excel.

critical value
=CHISQ.INV.RT(level of significance, df)

p-value
=1 - CHISQ.DIST(test statistic, df, TRUE)
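For readers without Excel, the same two quantities can be approximated with a short Python sketch that uses only the standard library. This is not what the examples in this post use. The function names are assumptions of the sketch. The distribution function is computed from the standard series for the regularized lower incomplete gamma function, and the critical value is found by bisection.

```python
import math

def chi2_cdf(x, df):
    """P[X <= x] for X chi-squared with df degrees of freedom (series expansion)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a                    # n = 0 term of sum z^n / (a (a+1) ... (a+n))
    total = term
    n = 0
    while term > 1e-16 * total:       # stop once further terms are negligible
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_p_value(stat, df):
    """Upper-tail probability P[X > stat]."""
    return 1.0 - chi2_cdf(stat, df)

def chi2_critical_value(alpha, df):
    """Point with upper-tail area alpha, located by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_p_value(hi, df) > alpha:   # widen the bracket until it contains the answer
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_p_value(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(chi2_critical_value(0.05, 5))   # close to 11.07, as in Example 1
print(chi2_p_value(3.65, 5))          # close to 0.60, as in Example 1
```

If SciPy is installed, `scipy.stats.chi2.isf(alpha, df)` and `scipy.stats.chi2.sf(stat, df)` should return the same values and are likely the better choice in practice.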

Example 2
We now give an example for the chi-squared goodness-of-fit test. The numbers of auto accident claims per year for 700 drivers are recorded by an insurance company. The claim frequency data are shown in the following table. $\displaystyle \begin{array}{rrr} \text{Claim Count} & \text{ } & \text{Frequency} \\ 0 & \text{ } & 351 \\ 1 & \text{ } & 241 \\ 2 & \text{ } & 73 \\ 3 & \text{ } & 29 \\ 4 & \text{ } & 6 \\ 5+ & \text{ } & 0 \\ \text{ } & \text{ } & \text{ } \\ \text{Total} & \text{ } & 700 \end{array}$

Test the hypothesis that the annual claim count for a driver has a Poisson distribution. Use $\alpha=0.05$. Assume that the claim frequencies of the drivers in question are independent.

The hypothesized distribution of the annual claim frequency is a Poisson distribution with unknown mean $\lambda$. The MLE of the parameter $\lambda$ is the sample mean, which in this case is $\hat{\lambda}=\frac{498}{700}=0.711428571$.

Under the assumption that the claim frequency is Poisson with mean $\hat{\lambda}$, the cell probabilities are calculated using $\hat{\lambda}$. The claim counts 4 and 5+ are combined into one cell (4 or more claims), so there are five cells. $\displaystyle p_1=P[Y=0]=e^{-\hat{\lambda}}=0.4909$ $\displaystyle p_2=P[Y=1]=\hat{\lambda} \ e^{-\hat{\lambda}}=0.3493$ $\displaystyle p_3=P[Y=2]=\frac{1}{2} \ \hat{\lambda}^2 \ e^{-\hat{\lambda}}=0.1242$ $\displaystyle p_4=P[Y=3]=\frac{1}{3!} \ \hat{\lambda}^3 \ e^{-\hat{\lambda}}=0.0295$ $\displaystyle p_5=P[Y \ge 4]=1-P[Y=0]-P[Y=1]-P[Y=2]-P[Y=3]=0.0061$

Then the null hypothesis is: $H_0: p_1=0.4909, p_2=0.3493, p_3=0.1242, p_4=0.0295, p_5=0.0061$

The null hypothesis is tested against all alternatives. The following table shows the calculation of the chi-squared statistic. The last column is each cell's contribution to the statistic, i.e. the squared difference between the observed and expected counts divided by the expected count. $\displaystyle \begin{array}{rrrrrrrrrrr} \text{Cell} & \text{ } & \text{Claim Count} & \text{ } & \text{Observed Count} & \text{ } & \text{Cell Probability} & \text{ } & \text{Expected Count} & \text{ } & \text{Chi-Squared} \\ 1 & \text{ } & 0 & \text{ } & 351 & \text{ } & 0.4909 & \text{ } & 343.63 & \text{ } & 0.15807 \\ 2 & \text{ } & 1 & \text{ } & 241 & \text{ } & 0.3493 & \text{ } & 244.51 & \text{ } & 0.05039 \\ 3 & \text{ } & 2 & \text{ } & 73 & \text{ } & 0.1242 & \text{ } & 86.94 & \text{ } & 2.23515 \\ 4 & \text{ } & 3 & \text{ } & 29 & \text{ } & 0.0295 & \text{ } & 20.65 & \text{ } & 3.37639 \\ 5 & \text{ } & 4+ & \text{ } & 6 & \text{ } & 0.0061 & \text{ } & 4.27 & \text{ } & 0.70091 \\ \text{ } & \text{ } & \text{ } \\ \text{Total} & \text{ } & \text{ } & \text{ } & 700 & \text{ } & 1.0000 & \text{ } & \text{ } & \text{ } & 6.52091 \end{array}$

The number of degrees of freedom of the chi-squared statistic is df = 5 - 1 - 1 = 3. The first reduction of one is due to the linear restriction of all cell probabilities summing to 1. The second reduction is due to the fact that one unknown parameter $\lambda$ has to be estimated using sample data. Using Excel, the critical value is $\chi_{0.05}^2(3)=7.814727903$. The p-value is $P[\chi^2 > 6.52091]=0.088841503$. Since 6.52 < 7.81 (equivalently, the p-value exceeds 0.05), the null hypothesis is not rejected at the level of significance $\alpha=0.05$. $\square$
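The steps of Example 2 can be retraced with a short Python sketch. The variable names are assumptions of this illustration. The sketch keeps the cell probabilities unrounded, so its statistic may differ slightly from the 6.52091 obtained above with probabilities rounded to four decimal places. The conclusion at $\alpha=0.05$ is the same.

```python
import math

claim_counts = [0, 1, 2, 3, 4]
observed = [351, 241, 73, 29, 6]      # last cell is "4 or more"; no driver had 5+
n = sum(observed)                     # 700 drivers

# MLE of lambda is the sample mean: 498 claims over 700 drivers.
lam = sum(c * f for c, f in zip(claim_counts, observed)) / n

# Poisson probabilities for 0, 1, 2, 3 claims; the last cell takes the remainder.
probs = [math.exp(-lam) * lam ** c / math.factorial(c) for c in range(4)]
probs.append(1.0 - sum(probs))

expected = [n * p for p in probs]
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# One df lost to the sum-to-one restriction, one more to estimating lambda.
df = len(observed) - 1 - 1

critical_value = 7.814727903          # from the text: upper 5% point at df = 3
print(lam, stat, df, stat > critical_value)
```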

Example 3
For many data sets, especially data sets whose numbers are distributed across multiple orders of magnitude, the first digits occur according to the probability distribution indicated below:

Probability for leading digit 1 = 0.301
Probability for leading digit 2 = 0.176
Probability for leading digit 3 = 0.125
Probability for leading digit 4 = 0.097
Probability for leading digit 5 = 0.079
Probability for leading digit 6 = 0.067
Probability for leading digit 7 = 0.058
Probability for leading digit 8 = 0.051
Probability for leading digit 9 = 0.046

This probability distribution was discovered by Simon Newcomb in 1881 and was rediscovered by physicist Frank Benford in 1938. Since then this distribution has become known as Benford’s law. Thus in many data sets, the leading digit 1 occurs about 30% of the time. The data sets for which this law is applicable include demographic data (e.g. income data of a large population, census data such as populations of cities and counties) and scientific data. The law is also applicable to certain financial data, e.g. tax data, stock exchange data, corporate disbursement and sales data. Thus Benford’s law is a useful tool for forensic accounting and auditing.

The following shows the distribution of first digits in the population counts of all 3,143 counties in the United States (from US census data).

Count for leading digit 1 = 972
Count for leading digit 2 = 573
Count for leading digit 3 = 376
Count for leading digit 4 = 325
Count for leading digit 5 = 205
Count for leading digit 6 = 209
Count for leading digit 7 = 179
Count for leading digit 8 = 155
Count for leading digit 9 = 149

Use the chi-squared goodness-of-fit test to test the hypothesis that the leading digits in the county population data follow Benford’s law. This example is also discussed in this blog post. $\square$
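One way to carry out the calculation for Example 3 is sketched below in Python. The variable names are assumptions of this sketch. The critical value 15.507 for df = 8 at $\alpha=0.05$ is taken from a standard chi-squared table rather than from this post. No parameter is estimated from the data, so df = 9 - 1 = 8.

```python
# Benford probabilities for leading digits 1 through 9, as listed above.
benford = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]

# Observed leading-digit counts for the county populations, as listed above.
observed = [972, 573, 376, 325, 205, 209, 179, 155, 149]
n = sum(observed)                     # 3143 counties

expected = [n * p for p in benford]
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
stat = sum(contributions)

df = len(observed) - 1                # cell probabilities fully specified by H_0
critical_value = 15.507               # assumed: table value for alpha = 0.05, df = 8

for digit, (o, e, c) in enumerate(zip(observed, expected, contributions), start=1):
    print(digit, o, round(e, 2), round(c, 5))
print("chi-squared:", round(stat, 2), "df:", df, "reject H0:", stat > critical_value)
```

With these inputs the statistic works out to roughly 11.4, which is below the critical value, so the hypothesis that the leading digits follow Benford’s law would not be rejected at $\alpha=0.05$. The largest single contribution comes from leading digit 5.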

For further information and more examples on the chi-squared test, please see the sources listed in the reference section.

_______________________________________________________________________________________________

Reference

1. Moore D. S., McCabe G. P., Craig B. A., Introduction to the Practice of Statistics, 7th ed., W. H. Freeman and Company, New York, 2012
2. Wackerly D. D., Mendenhall III W., Scheaffer R. L., Mathematical Statistics with Applications, Thomson Learning, Inc, California, 2008

_______________________________________________________________________________________________ $\copyright \ 2017 - \text{Dan Ma}$