15 Multiple Logistic Regression

Logistic regression analysis addresses the need, common to many domains, to predict a categorical binary response (a binary response takes values 0 or 1: for example, right or wrong answer, survival or death, buying or not buying) based on two or more predictors. For example, for a response (criterion) variable with two possible values (e.g., Yes or No), logistic regression makes it possible to attach probability values to the responses given a set of predictors. That is, logistic regression helps us understand how multiple predictor variables, together, predict membership in one or the other of the two categories of the response (criterion) variable.

The dichotomous nature of the response variable prevents the prediction of a numerical value, as is the case with ordinary linear regression. Instead, logistic regression relies on binomial probability theory, with only two values to predict, and uses the maximum likelihood method to generate a best-fitting equation that classifies cases into the appropriate category based on the regression coefficients.

The basic formula for logistic regression, with a single predictor, is similar to the one used in linear regression:

\[ logit(p) = a + b \cdot X \]

For multiple predictors, the formula changes to:

\[ logit(p) = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + ... + \beta_n \cdot X_n \]

In the equation above, p is the probability that the outcome or characteristic of interest is attained, \(X_i\) are the predictors, and \(\beta_i\) represent the relative contributions of these predictors.

The dependent variable in Logistic Regression is a logit, the natural logarithm of the odds:

\[ logit(p) = \ln(odds) = \ln\left(\frac{p}{1-p}\right) \]

where p is the probability of predicting a 1 (attaining the outcome of interest).

In the end, what we are interested in is the probability (p) that the desired outcome occurs. Getting to it requires a little bit of simple algebra (while the steps may look scary at first sight, the computations are simple and will be explained in context later in this chapter):

\[ \begin{array}{c} \ln\left(\frac{p}{1-p}\right) = a + bX\\ \frac{p}{1-p} = e^{a+bX}\\ p = \frac{e^{a+bX}}{1+e^{a+bX}} \end{array} \]
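As a quick numeric illustration, these steps can be carried out directly in R (the values of a, b, and X below are made up for the example):

a <- -1; b <- 0.5; X <- 2           # illustrative, made-up values
logit <- a + b * X                  # logit(p) = a + bX
p <- exp(logit) / (1 + exp(logit))  # back-transform to a probability; same as plogis(logit)
p                                   # 0.5 for these values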

15.1 The Study

To understand a bit better where and how logistic regression is useful, let’s look at an example study designed to investigate the effects of self-explanation (an explanation a learner generates on his or her own, as opposed to explanations provided by an external source such as a book, instructor, or peer) on learners’ performance on causal reasoning tasks. Specifically, this study was designed as a completely randomized, two group (control and treatment), between-subjects experiment: participants were randomly assigned to one of the two experimental groups, and each participant was a member of only one group. The experiment used self-explanation to elicit causal mechanism explanations when reasoning about causally linked events. The target field was medicine, a domain that relies heavily on the understanding and use of extensive and complex causal processes. The overarching research question was:

In the medical field, when learners are reasoning causally, does using self-explanation to elicit an explanation of the causal mechanism(s) improve, on average, learners’ performance on tasks involving such reasoning processes?

Based on the existing literature, the study used prompts to train learners to self-explain before answering a question, that is, to think about and attempt to formalize in writing the principle(s) involved in solving the practice problems presented to them. For the purpose of this experiment, the participants were randomly assigned to one of the two groups, control or treatment. Participants in both groups answered the same multiple choice questions in the training stage, with the difference that participants in the treatment group were asked, before choosing an answer, to formally explain the causal mechanism behind the problem posed in the question. Participants in the control group were only prompted to choose an answer, without being prompted to explain the mechanism first. It was hypothesized that participants who had a chance to practice self-explanation (those in the treatment group) would perform better, on average, than the control group on a subsequent similar problem, for which the prompt was removed and all participants, in both groups, performed the same task.

Learners’ performance was measured using a single multiple-choice question with only one correct answer, for which learners first selected the answer they believed to be correct and then explained the mechanism that supported their choice. After submitting their explanation, participants were offered an opportunity to change their answer, along with a request to explain why the new answer was better than the previous one. The instrument also assessed participants’ prior knowledge of the topic used in testing, both through self-assessment and, more objectively, through a set of multiple choice questions. In addition, age group, gender, income group, undergraduate major, and intended medical specialty were collected as demographic variables.

About 350 first- and second-year medical students were invited to participate in the study, and processes were set in motion to recruit enough students to at least meet the minimum sample size of about 100. (A base sample size of 88 was computed following recommendations from Keppel (1991, p. 74) of at least 44 participants per group for a medium 0.6 effect size, a power of 0.8, and an alpha level of 0.05. Recommendations from other authors ranged from 40 to 60 per group, so 50 participants per group was considered an acceptable value, making 100 participants the minimum sample size for the experiment.) In the end, the recruitment efforts generated a sample of 117 valid responses.

This example covers only the portion of the full study that was answered using logistic regression and uses a curated data set; the data manipulation and transformation procedures used to generate this data set are not covered. The example focuses on the following research question:

Does the practice of self-explanation as a causal mechanism elicitation technique affect, on average, learners’ performance on causal reasoning tasks?

The variables included in the model are:

  • Categorical Performance Score (criterion/response, nominal scale) - calculated by assigning a value of 0 to a wrong answer choice and a value of 1 to the correct answer choice.
  • Experiment Group (predictor, nominal scale) - determined by the group (control or treatment) to which the participant was randomly assigned.
  • Year of Study (covariate) - introduced to control for potential differences in performance due to where the student is situated on the progression timeline in medical school (first or second year). This attempts to account for additional knowledge, experience, and other skills that may help performance on causal reasoning tasks.

15.2 Assumptions

As with all other statistical tests, logistic regression has some requirements that must be met:

  • The response (criterion) variable has to be dichotomous (has only two values). For this example, the response variable is dichotomous by design. Therefore, this assumption is verified.
  • The groups (categories) are mutually exclusive, meaning that one case can only be in one of the groups. The random assignment of participants to one of the treatment groups, either control or treatment, and only one group, verifies this assumption.

15.3 Analysis

The data file has been prepared beforehand to include, from the more than 60 variables in the raw data set, only those variables that may be relevant to this analysis. So, first, let’s familiarize ourselves with the data. Table 15.1 shows the first few data rows.

Table 15.1: Logistic regression data

ids  case  year  group  pk  score
  7     1     1      1   6      0
  9     1     1      1  12      1
 10     1     1      2   8      0
 11     1     1      1  11      0
 12     0     1      2   6      0
 14     1     1      2  14      0

The variables of interest are group, score, and year. A first step in the analysis is to convert these variables into factors, so that R treats them as categorical predictors in the logistic regression. For this purpose two new variables, groupf and yearf, will be added to the data set, representing group and year as factors. Once this step has been performed, it is time to define the model. The research question being investigated here asks whether the treatment (practice of self-explanation) affects, on average, performance on causal reasoning tasks. Therefore, the model will look at how performance (the DV, represented by the variable score) is related to the predictor group. The variable year is introduced to account for the potential effect of the year in medical school (first or second year). A summary shows the count of records (frequencies) for each category of the predictors converted to factors.
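A minimal sketch of this step (the data frame name logReg is taken from the model call shown later in the output; the table() calls reproduce the frequency counts shown below):

logReg$groupf <- factor(logReg$group)  # 1 = control group, 2 = treatment group
logReg$yearf  <- factor(logReg$year)   # 1 = first year, 2 = second year

table(logReg$groupf)  # record counts per experiment group
table(logReg$yearf)   # record counts per year of study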

Group (1=Control Group, 2=Treatment Group)
 1  2 
61 56 
Year of Study (1=First Year Students, 2=Second Year Students)
 1  2 
44 73 

For the analysis, R (other statistical packages, such as SPSS, perform a similar conversion) converts the predictors (factors) to values of 0 and 1. This recoding is sometimes known as dummy coding: the process of recoding a categorical variable with two or more levels into binary variables (categorical variables with only two levels, 0 and 1) known as dummy variables. For the current analysis and data set, the recoding is performed as shown in Table 15.2.

Table 15.2: Logistic regression predictor (factor) recoding

Variable         Value  Description
Treatment Group  0      Control group (coded as 1 in the original data set)
                 1      Treatment group (coded as 2 in the original data set)
Year of Study    0      First year medical students (coded as 1 in the original data set)
                 1      Second year medical students (coded as 2 in the original data set)

The logistic regression equation for the model we start this analysis with is:

\[logit(p)=\beta_{0}+\beta_{1}\cdot group+\beta_{2}\cdot year+\beta_{3}\cdot group \times year\]

In this model, group and year are the main effects (the effect of each variable, taken individually, on the response variable), while the term \(group \times year\) represents the interaction effect (the combined, simultaneous effect of the two variables taken together). Let’s look at a summary of this model.
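The model fit that produces the summary below can be sketched as follows (the formula and data frame name come from the Call line in the output; the object name fullModel is illustrative):

fullModel <- glm(score ~ groupf + yearf + groupf * yearf,
                 family = binomial, data = logReg)
summary(fullModel)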


Call:
glm(formula = score ~ groupf + yearf + groupf * yearf, family = binomial, 
    data = logReg)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.394  -0.992  -0.781   0.975   1.634  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)     -0.9445     0.4454   -2.12    0.034 *
groupf2         -0.0852     0.6854   -0.12    0.901  
yearf2           0.4925     0.5615    0.88    0.380  
groupf2:yearf2   1.0336     0.8376    1.23    0.217  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 159.10  on 116  degrees of freedom
Residual deviance: 148.74  on 113  degrees of freedom
AIC: 156.7

Number of Fisher Scoring iterations: 4

Looking at the output above, it can be noted that the name of each variable in the left column is accompanied by a value. That is because R (other statistical analysis applications perform the same conversion) converts the predictors in a process known as dummy or treatment coding. It creates a set of dichotomous variables (variables with only two values; see Table 15.2) in which each level of the predictor is contrasted with a predefined reference level, chosen from among the values of the respective predictor variable. In this analysis the variables have only two levels, so the process consists of choosing one of the two values as the reference level. In this case the value 1 (representing the control group) is selected as the reference level. The value 2 in the output indicates that the treatment group (represented by the value 2 in the original data set) is contrasted with the reference level, the control group. (In the output, R uses the values of the original variables, 1 and 2, and not the internal 0/1 values it uses for the analysis; while the analysis uses numbers, the output can use string labels to provide more information, if the data was collected and entered with such labels.) If a predictor has more than two levels, one level is chosen as the reference and two or more dichotomous variables are generated for the remaining levels, each contrasting one level with the reference. Each of these appears as a separate row in the output.

The analysis of the full model suggests that there are no significant main effects or interaction effects. Nevertheless, further analysis can be conducted to learn whether a model using only a subset of the variables shows significance. For this purpose a stepwise logistic regression can be conducted.

Similar to multiple linear regression, the stepwise analysis can be conducted either forward or backward. The forward approach starts with a blank model, enters terms one at a time, computes the model, and compares it against the previous one. The process continues as long as the difference in predictive power between the more complex model (the one with more variables) and its predecessor is significant; once the gain in predictive power becomes insignificant, the process stops. The backward approach looks at things in reverse: it starts with the full model and removes, at each step, the variable that contributes least to the model.

For this specific case, considering that we started with the full model (known as the enter method, in which all terms of the model are entered at the beginning), we’ll use the backward approach. The output of this model suggests that a more parsimonious model exists. It includes only the main effects and shows significance for year and for the model’s constant (the (Intercept) line of the output). An ANOVA analysis conducted between the competing models shows which factor(s) were eliminated.
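A sketch of the backward elimination that produces the trace and model summary below, using step() from base R (object names carried over from the earlier sketches):

reducedModel <- step(fullModel, direction = "backward")
summary(reducedModel)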

Start:  AIC=156.7
score ~ groupf + yearf + groupf * yearf

               Df Deviance AIC
- groupf:yearf  1      150 156
<none>                 149 157

Step:  AIC=156.3
score ~ groupf + yearf

         Df Deviance AIC
<none>           150 156
- groupf  1      153 157
- yearf   1      156 160

Call:
glm(formula = score ~ groupf + yearf, family = binomial, data = logReg)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.319  -1.061  -0.705   1.042   1.739  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   -1.264      0.393   -3.22   0.0013 **
groupf2        0.608      0.389    1.56   0.1181   
yearf2         0.984      0.416    2.37   0.0180 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 159.10  on 116  degrees of freedom
Residual deviance: 150.28  on 114  degrees of freedom
AIC: 156.3

Number of Fisher Scoring iterations: 4

So let’s look at what the ANOVA output tells us.
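The step() result stores this model comparison in its anova component; a sketch, assuming the reducedModel object from the earlier snippet:

reducedModel$anova  # the elimination trace shown below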

            Step Df Deviance Resid. Df Resid. Dev   AIC
1                NA       NA       113      148.7 156.7
2 - groupf:yearf  1    1.539       114      150.3 156.3

The ANOVA analysis shows that when the interaction term was eliminated, the model’s AIC improved slightly, effectively making the more parsimonious model a better predictor of the response variable than the full model. To interpret the results, we look at the equation of the selected model, which retains only the main effects:

\[logit(p)=\beta_{0}+\beta_{1}\cdot group+\beta_{2}\cdot year\]

From the logistic regression output, the \(\beta\) coefficients in the logistic regression equation are found in the Estimate column. Therefore, with values, the equation becomes:

\[logit(p)=-1.264+0.6084\cdot group+0.9836\cdot year\]

The odds ratio computed for each parameter indicates the factor by which the odds of performing better (answering correctly) versus performing worse (answering incorrectly) are increased or decreased. The direction is provided by the sign of the raw estimated \(\beta\) coefficient: if the coefficient is negative, the odds are decreased by the computed factor, while if the coefficient is positive, the odds are increased by it.
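In R, the odds ratios can be obtained by exponentiating the raw coefficients; a sketch, again assuming the reducedModel object:

exp(coef(reducedModel))  # odds ratios for the intercept and each predictor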

The odds ratio for group:

\[group\,odds\,ratio=e^{0.6084}=1.84\]

Therefore, holding the year of study constant, being in the treatment group (indicated by the 2 at the end of the variable name in the output; participants in this group were prompted to use self-explanation) increases the odds of performing better rather than worse by a factor of 1.84. That is, being in the treatment group increases the odds of a better performance score by 84% compared to being in the control group.

The odds ratio for year:

\[year\,odds\,ratio=e^{0.9836}=2.67\]

This suggests that, holding the treatment constant, being a second year medical student (indicated by the 2 at the end of the variable name in the output) increases the odds of performing better rather than worse by a factor of 2.67, meaning second year medical students see a 167% increase in the odds of a better performance score.

This concludes the logistic regression analysis. Nevertheless, every statistical test is run in the context of a study, and once the results are known, they should be interpreted in that context. The next section, while not essential to the application of logistic regression itself, is intended to offer insights into how the results of the analysis may be interpreted in the context of the study.

15.4 Additional Analysis - TL/DR

The results of the analysis so far are mixed, showing that the treatment itself, while still included in the equation, does not show a significant main effect in the overall sample. Let’s look at some of the elements that may have impacted the results, additional information about the study’s design, and how these affect data analysis.

First, given the population of students at the medical school was relatively small and considering the expected percentage of respondents, second year medical students offered an insufficient participant pool. Therefore, based on the timeline of the study and the curricula at the medical school, which ensured that the participants had sufficient knowledge of relevant domains, a decision was made to include first year medical students as well.

Second, the literature and prior pilot studies suggested that prior knowledge in domains relevant to the practice and test questions matters. Therefore, the study included both a subjective measure of prior knowledge, as a self-evaluation assessment reported by the participants, and a more objective, though brief, evaluation of the participants’ prior knowledge using multiple choice questions. Including first year medical students offered a chance to better understand the effects of prior knowledge, as this prior knowledge is expected to be less extensive than that of second year medical students.

With this new knowledge, the model presented thus far can be extended to account for the effects of prior knowledge while controlling for the year of study. When prior knowledge was introduced in the regression equation, the selected model included an interaction between prior knowledge and treatment group, which suggests that the treatment works differently for different levels of prior knowledge.

The existence of an interaction term is relevant because the interpretation can no longer be conducted for each individual predictor while holding the others constant, as explained above for a logistic regression with only main effects. In this case, the interpretation covers multiple simple regression equations, one for each level of the predictors that are part of the interaction term. For example, consider the following regression equation (where pk stands for prior knowledge):

\[logit(p)=\beta_{0}+\beta_{1}\cdot group+\beta_{2}\cdot pk+\beta_{3}\cdot group\times pk+\beta_{4}\cdot year\]
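A sketch of how such an extended model might be fit (it assumes pk is the numeric prior knowledge score already present in the data set; the object name extModel is illustrative):

extModel <- glm(score ~ groupf * pk + yearf, family = binomial, data = logReg)
summary(extModel)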

The interaction term is represented by \(group \times pk\). For this equation, the analysis can be conducted for the two levels of group (0 = control group, 1 = treatment group) by entering the value 0 or 1 into the equation. This produces the following two regression equations, which include only main effects and can be interpreted as described before.

For group = 0 (control group):

\[logit(p)=\beta_{0}+\beta_{2}\cdot pk+\beta_{4}\cdot year\]

The interpretation will now discuss the odds ratio of prior knowledge to affect the response variable for the participants in the control group only.

For group = 1 (treatment group):

\[logit(p)=\beta_{0}+\beta_{1}+\beta_{2}\cdot pk+\beta_{3}\cdot pk+\beta_{4}\cdot year\]

Which can be further reduced to:

\[logit(p)=(\beta_{0}+\beta_{1})+(\beta_{2}+\beta_{3})\cdot pk+\beta_{4}\cdot year\]

The resulting equation is interpreted only in the context of the treatment group and can be used to look at the odds ratios with which the levels of the prior knowledge predictor influence the response variable.
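To make such group-specific interpretations concrete, predicted probabilities can be computed directly from the fitted model; a minimal sketch, assuming the hypothetical extModel above and an arbitrary prior knowledge score of 10:

# Predicted probability of a correct answer for a first year student with
# pk = 10, in the control group and the treatment group respectively
newData <- data.frame(groupf = factor(c(1, 2), levels = c(1, 2)),
                      pk     = 10,
                      yearf  = factor(1, levels = c(1, 2)))
predict(extModel, newdata = newData, type = "response")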