So far in our course, we have discussed only measurements taken on one variable for each sampling unit. This is referred to as univariate data. In this lesson, we are going to talk about measurements taken on two variables for each sampling unit. This is referred to as bivariate data.
Often when there are two measurements taken on the same sampling unit, one variable is the response variable and the other is the explanatory variable. The explanatory variable can be seen as the indicator of which population the sampling unit comes from. It helps to be able to identify which is the response and which is the explanatory variable.
In this lesson, here are some of the cases we will consider:
Categorical - taken from two distinct groups
If the measurements are categorical and taken from two distinct groups, the analysis will involve comparing two independent proportions.
Sex and smoking status
Consider a case where we record each person's sex and whether that person smokes. In this case, the response variable (smoking status) is categorical, and the explanatory variable (sex) is also categorical.
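As a rough illustration of what comparing two independent proportions can look like, the sketch below computes a pooled two-proportion z statistic and a two-sided p-value. The pooled z-test is one common approach and is assumed here, since this introduction does not yet specify a method. The function name `two_proportion_z` and the counts are hypothetical and used only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic and two-sided p-value for H0: p1 - p2 = 0."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Under H0 the two groups share one proportion, so pool the counts.
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Hypothetical counts (not real data): smokers among 100 men and 120 women.
z, p_value = two_proportion_z(x1=30, n1=100, x2=45, n2=120)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

The normal approximation used here is only reasonable when each group has enough successes and failures, which is the kind of condition the lesson asks us to check before using the distribution.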
Quantitative - taken from two distinct groups
If the measurements are quantitative and taken from two distinct groups, the analysis will involve comparing two independent means.
GPA and the current degree level of a student
In this case, the response variable (GPA) is quantitative, and the explanatory variable (degree level) is categorical.
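To make the comparison of two independent means concrete, here is a minimal sketch that computes a two-sample t statistic without assuming equal variances (Welch's version), together with its approximate degrees of freedom. Choosing Welch's form is an assumption made for this sketch; the lesson may also cover a pooled-variance version. The function name `welch_t` and the GPA values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample1, sample2):
    """Welch t statistic and approximate df for H0: mu1 - mu2 = 0."""
    n1, n2 = len(sample1), len(sample2)
    # Estimated variance of each sample mean (sample variance over n).
    v1 = variance(sample1) / n1
    v2 = variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical GPAs (not real data) for two degree levels.
undergraduate = [3.1, 2.8, 3.5, 3.0, 3.3, 2.9]
graduate = [3.6, 3.4, 3.8, 3.5, 3.7]
t, df = welch_t(undergraduate, graduate)
print(f"t = {t:.3f}, df = {df:.1f}")
```

The statistic would then be compared with a t distribution on `df` degrees of freedom, using either a table or statistical software, since Python's standard library does not include the t distribution.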
Quantitative - taken twice from each subject (paired)
If the measurements are quantitative and taken twice from each subject, the analysis will involve comparing two dependent means.
A participant's weight before and after dieting
In this case, the response is quantitative, and we will show later why there is no explanatory variable.
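One common way to analyze paired quantitative data is to reduce each subject's two measurements to a single difference and then treat the differences as one sample. The sketch below does this and computes the resulting t statistic. The function name `paired_t` and the weights are hypothetical and used only for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """t statistic and df for H0: mean difference = 0, using paired data."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Each subject contributes one difference: before minus after.
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical weights in pounds (not real data) for five participants.
before = [182.0, 165.5, 201.0, 174.0, 190.5]
after = [178.5, 166.0, 195.0, 171.5, 186.0]
t, df = paired_t(before, after)
print(f"t = {t:.3f}, df = {df}")
```

Because the calculation works on a single column of differences, it has the same form as a one-mean procedure applied to those differences.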
Categorical - taken twice from each subject (paired)
Finally, if the measurements are categorical and taken twice from each subject, the analysis will involve comparing two dependent proportions. However, we will not discuss this last situation.
To begin, just as we did previously, you first have to decide whether the problem you are investigating requires the analysis of categorical or quantitative data. In other words, you need to identify your response variable and determine its type. Next, you have to determine whether the two measurements come from independent samples or dependent samples.
You will find that much of what we discuss will be an extension of our previous lessons on confidence intervals and hypothesis testing for one proportion and one mean. We will want to check the necessary conditions in order to use the distributions as before. If the conditions are satisfied, we calculate the specific test statistic and again either compare it to a critical value (rejection region approach) or find the probability of observing this test statistic or one more extreme (p-value approach). The decision process will be the same as well: if the test statistic falls in the rejection region, we will reject the null hypothesis; if the p-value is less than the preset level of significance, we will reject the null hypothesis. The interpretation of confidence intervals in support of the hypothesis decision will also be familiar.
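The two decision approaches described above can be sketched side by side. The example assumes a two-sided test whose statistic follows a standard normal distribution under the null hypothesis; the function name `decide_two_sided` and the default significance level of 0.05 are assumptions made for illustration.

```python
from statistics import NormalDist


def decide_two_sided(z, alpha=0.05):
    """Return (reject by rejection region, reject by p-value) for a z statistic."""
    std_normal = NormalDist()
    # Rejection region approach: compare |z| with the critical value.
    critical = std_normal.inv_cdf(1 - alpha / 2)
    reject_by_region = abs(z) > critical
    # p-value approach: probability of a statistic at least this extreme.
    p_value = 2 * std_normal.cdf(-abs(z))
    reject_by_p_value = p_value < alpha
    return reject_by_region, reject_by_p_value


for z in (1.0, 2.5):
    print(z, decide_two_sided(z))
```

For a given test and significance level, both approaches lead to the same decision, which is why either can be used.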
One departure we will take from our previous lesson on hypothesis testing is how we will treat the null value. In the previous lesson, the null value could vary. In this lesson, when comparing two proportions or two means, we will use a null value of 0 (i.e., "no difference").
For example, \(\mu_1-\mu_2=0\) would mean that \(\mu_1=\mu_2\), and there would be no difference between the two population parameters. The same holds for two population proportions: \(p_1-p_2=0\) would mean that \(p_1=p_2\).
Although we focus on the difference equalling zero, it is possible to test for specific values of the difference using the methods presented. However, most applications test only whether there is a difference in the parameters (i.e., whether the difference is less than, greater than, or not equal to zero).
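Written out for two population means, the hypotheses described above take the following form. The symbol \(\Delta_0\) for a general null value is notation assumed here for illustration; it does not appear elsewhere in the lesson, which uses \(\Delta_0=0\).

\[
H_0\colon \mu_1-\mu_2=\Delta_0
\qquad\text{versus}\qquad
H_a\colon \mu_1-\mu_2<\Delta_0,\quad \mu_1-\mu_2>\Delta_0,\quad\text{or}\quad \mu_1-\mu_2\neq\Delta_0
\]

Setting \(\Delta_0=0\) gives the "no difference" null hypothesis \(H_0\colon \mu_1-\mu_2=0\), and the three alternatives become left-tailed, right-tailed, and two-tailed tests for a difference.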
We will start by comparing two independent population proportions, move on to comparing two independent population means, then turn to paired population means, and end with the comparison of two independent population variances.