Statistics is often introduced using quantitative data such as height, weight, and waiting time. But there is a wide, wide world of very important data of a different type: qualitative, or categorical, data. As the name suggests, categorical data is expressed, as a random variable, in terms of categories or cells: two or more information classes, often identified by text. Examples of this data type include gender, race, blood type, brand preference, mode of transportation, shoe size, and type of occupation.
In the past, most categorical data originated in the social sciences, while engineering, technology, and the pure sciences were concerned mainly with quantitative data. This schism between disciplines no longer prevails: much of the statistics done in medicine, for instance, involves categorical data. It is even helpful, at times, to convert quantitative variables to categorical form. Thus a set of blood cholesterol measurements, originally quantitative, could be assigned to three categories: low, medium, and high.
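The conversion from quantitative to categorical form can be sketched in a few lines of Python. This is only an illustration: the function name `cholesterol_category`, the cutoff values 200 and 240, and the sample readings are all assumptions made for the example, not values taken from the text.

```python
def cholesterol_category(reading, low_cutoff=200, high_cutoff=240):
    """Assign a quantitative reading to one of three ordered categories.

    The cutoffs are hypothetical and chosen only for illustration.
    """
    if reading < low_cutoff:
        return "low"
    if reading < high_cutoff:
        return "medium"
    return "high"


# Made-up measurements, used only to exercise the function.
readings = [172, 215, 198, 251, 239, 240]
categories = [cholesterol_category(r) for r in readings]
print(categories)
```

Once the readings are binned, the exact numeric values are discarded and only the category labels remain, which is the trade-off made whenever a quantitative variable is converted to categorical form.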
Sometimes the categories take the form of integers: for instance, the number of goals per match, or the number of 5’s when a fair die is tossed repeatedly. Thus the domain of a categorical variable may be either discrete numeric or non-numeric.
Categorical data comes “pre-classified”: each category is a unique data class with a specific count associated with it. These counts are, in fact, data frequencies. Therefore the frequency distribution of a categorical data set is often given to us directly in the form of a table.
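When the data arrives as raw responses rather than as a ready-made table, the category counts can be tallied directly. The following is a minimal sketch using Python's standard `collections.Counter`; the list of blood types is invented for illustration and does not come from any survey in the text.

```python
from collections import Counter

# Hypothetical raw responses, one category label per individual.
responses = ["A", "O", "B", "O", "AB", "O", "A", "O"]

# Counter tallies how often each category occurs: the frequency distribution.
frequencies = Counter(responses)
total = sum(frequencies.values())

# Print each category with its count and relative frequency.
for category, count in frequencies.most_common():
    print(f"{category:<4}{count:>3}{count / total:>8.1%}")
```

Each row printed is one cell of the frequency table: a category, its count, and the share of all responses that fall in it.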
The students at Degrassi High were surveyed as to their preferred flavoured drink. The results are shown in the following table.
Categorical data may appear in either nominal or ordinal form. Nominal data consists of two or more categories which cannot be ordered. An example is the type of car in the table below.
The simplest type of nominal data consists of exactly two categories or levels. It is called binomial or binary data.
With ordinal data, the categories have a definite order; we can find a minimum, maximum, and usually in-between categories within the whole. Consider, for instance, the following data on shoe size.
When a single variable is involved, categorical data is called univariate. A pie chart or bar chart is the usual graphic display format for such data.
Often categorical data, even from a single sample, is analyzed in terms of two variables. A contingency table is used to examine the data in this way.
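To make the idea concrete, here is a rough sketch of how a contingency table could be assembled from paired observations. The two variables (occupation and mode of transportation) and every record below are hypothetical, chosen only to show the mechanics of cross-classifying one sample by two variables.

```python
from collections import Counter

# Hypothetical sample: each record pairs two categorical variables
# (occupation, mode of transportation) for one individual.
records = [
    ("student", "bus"), ("student", "bicycle"), ("teacher", "car"),
    ("student", "bus"), ("teacher", "bus"), ("teacher", "car"),
]

cell_counts = Counter(records)            # count of each (row, column) pair
rows = sorted({r for r, _ in records})    # levels of the first variable
cols = sorted({c for _, c in records})    # levels of the second variable

# Counter returns 0 for pairs that never occur, so empty cells are filled in.
table = {r: {c: cell_counts[(r, c)] for c in cols} for r in rows}
row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}

# Print the table with marginal totals.
print(f"{'':<10}" + "".join(f"{c:>9}" for c in cols) + f"{'total':>9}")
for r in rows:
    cells = "".join(f"{table[r][c]:>9}" for c in cols)
    print(f"{r:<10}" + cells + f"{row_totals[r]:>9}")
footer = "".join(f"{col_totals[c]:>9}" for c in cols)
print(f"{'total':<10}" + footer + f"{len(records):>9}")
```

The row and column totals are the univariate frequency distributions of the two variables taken separately, while the interior cells show how the two variables occur together.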
A study was done of how many students in a college were left-handed and how many were right-handed. The gender of each student was recorded along with handedness. The results were as follows.