Testing Categorical Data


Introduction

Chi-square tests are generally applied to categorical data, yet the chi-square distribution describes a continuous random variable. How can this be? The answer lies in the history of statistical methods.

Just as de Moivre and Laplace sought and found the normal approximation to the binomial, Karl Pearson sought and found a multivariate normal approximation to the multinomial distribution. A categorical data set with three or more categories arises from a multinomial experiment with an underlying multinomial (discrete) distribution.

Pearson showed that the chi-square distribution, which describes a sum of squared independent standard normal variables, provides a good approximation here: under appropriate conditions, the chi-square statistic, X^2, computed from a categorical data set has an approximate chi-square distribution.

Goodness of Fit

A critical question is often whether an observed pattern of categorical data, f_i, fits some given (historical or theoretical) distribution. A perfect fit cannot be expected, so the investigator must examine the discrepancies and judge the goodness of fit. The test statistic of choice is the following.

            \[ X^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} \]

The distribution of X^2 is well approximated by the chi-square (χ^2) distribution as long as none of the expected counts, e_i, falls below 5.

In this test, single-sample observations on a single random variable are compared to a population model. Sometimes the population model can be a specific discrete or continuous distribution.
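As a concrete illustration of the statistic above, here is a minimal Python sketch that computes X^2 and the corresponding p-value for a small set of hypothetical counts under an assumed uniform null model; the counts, the model, and the use of SciPy are illustrative assumptions, not part of the original discussion.

    import numpy as np
    from scipy import stats

    # Hypothetical observed counts f_i for k = 4 categories (illustrative only)
    f = np.array([18, 22, 29, 31])
    n = f.sum()

    # Assumed null model: all four categories equally likely, so e_i = n * p_i
    p = np.full(4, 0.25)
    e = n * p                        # every e_i is 25, comfortably above 5

    # X^2 = sum over i of (f_i - e_i)^2 / e_i
    x2 = ((f - e) ** 2 / e).sum()

    # Compare to the chi-square distribution with k - 1 degrees of freedom
    p_value = stats.chi2.sf(x2, df=len(f) - 1)
    print(x2, p_value)               # X^2 = 4.4, p is roughly 0.22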

General method

1. Use a given population model, or determine which distribution is a good model by examining the conditions that apply to the observed data.

2. Set the significance level.

3. State the hypotheses, H0 versus Ha.

4. Calculate the cell frequencies expected under H0: e_i = np_i, where n is the sample size (these expected counts are used in the sketch after this list).

5. Combine any expected frequencies so that none are less than 5.

6. Find the degrees of freedom, k - 1, where k is the number of categories.

7. Calculate the test statistic, X^2, appropriate for the goodness-of-fit test.

8. Use the p-value or classical method to determine whether the test statistic value is significant.

9. Draw the appropriate conclusion and interpret it in the context of the original problem.
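The following Python sketch works through steps 1-9 on hypothetical data; the null proportions, the observed counts, and the 0.05 significance level are all assumed for illustration.

    import numpy as np
    from scipy import stats

    # Steps 1 and 3: assumed null model p_i (H0: the data follow this model,
    # Ha: they do not); the proportions and counts here are hypothetical.
    p = np.array([0.40, 0.30, 0.20, 0.06, 0.04])
    f = np.array([50, 30, 25, 8, 2])            # observed counts
    alpha = 0.05                                # step 2: significance level
    n = f.sum()

    # Step 4: expected counts e_i = n * p_i
    e = n * p

    # Step 5: pool the last two categories because one expected count is below 5
    f = np.append(f[:3], f[3:].sum())
    e = np.append(e[:3], e[3:].sum())

    # Steps 6 and 7: degrees of freedom k - 1 and the test statistic X^2
    k = len(f)
    x2 = ((f - e) ** 2 / e).sum()

    # Step 8: p-value method and classical (critical-value) method
    p_value = stats.chi2.sf(x2, df=k - 1)
    critical = stats.chi2.ppf(1 - alpha, df=k - 1)

    # Step 9: decision, to be interpreted in the context of the problem
    print(f"X^2 = {x2:.3f}, p = {p_value:.3f}, critical value = {critical:.3f}")
    print("Reject H0" if x2 > critical else "Fail to reject H0")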


Independence

In many real-world problems we want to analyze observed data without any prior assumption about an expected distribution. Is there an association (dependence) between two of the variables involved, or not?

The observed data are arranged in a contingency table: a single sample classified on two variables.

The test uses another X^2 statistic:

            \[ X^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

where O_{ij} is the observed count and E_{ij} the expected count in row i, column j of the table.

Analysis involves calculating a table of expected values under the null hypothesis of independence, with E_{ij} = (row i total)(column j total)/n. We then compare these expected values with those in the observed contingency table.
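As a sketch of that calculation, the following Python example builds the expected table from the row and column totals of a hypothetical contingency table and compares it with the observed counts; scipy.stats.chi2_contingency, shown at the end, performs the same test in one call. The data are invented for illustration.

    import numpy as np
    from scipy import stats

    # Hypothetical 2 x 3 contingency table: one sample classified on two variables
    observed = np.array([[20, 30, 25],
                         [30, 20, 25]])
    n = observed.sum()

    # Expected counts under H0 (independence): row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

    # X^2 compares observed and expected counts cell by cell
    x2 = ((observed - expected) ** 2 / expected).sum()
    df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    p_value = stats.chi2.sf(x2, df)
    print(x2, df, p_value)

    # The same test via SciPy (no continuity correction, to match the formula)
    chi2, pval, dof, exp = stats.chi2_contingency(observed, correction=False)
    print(chi2, dof, pval)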

No matter what happens, we cannot claim a causal relationship.

Homogeneity

The test of homogeneity is a single-variable procedure for comparing samples drawn from two or more populations: it asks whether the distribution of the categorical variable is the same (homogeneous) in each population.
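A minimal sketch, assuming invented counts: each row below is a separately drawn sample from one population, and the same chi-square machinery (here via scipy.stats.chi2_contingency) tests H0 that the category proportions are the same in every population.

    import numpy as np
    from scipy import stats

    # Hypothetical samples: one categorical variable observed in two populations
    # (each row is one population's sample; the counts are invented)
    counts = np.array([[40, 35, 25],    # sample from population 1
                       [30, 45, 25]])   # sample from population 2

    # Same computation as the independence test; only the sampling design and
    # the wording of H0 (equal category proportions across populations) differ
    chi2, p_value, df, expected = stats.chi2_contingency(counts)
    print(chi2, df, p_value)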