To understand the world around us we use models we construct in our minds. Rooted in our education and our experience, these models may or may not be true representations of reality, and we often do not know how close the representation is. While for most people these models are “good enough” for their everyday needs, researchers take them one step further and attempt to devise ways to determine whether they are indeed true representations of reality. For this purpose, researchers formalize these models and design studies to test their validity. These formal models are based on the shared understanding of the phenomena under study at the time of the research and on the researcher’s own insights (which is why a thorough literature review is necessary for building the most relevant model). That is, scientific models are a representation of our evolving understanding of the physical world.
A research study is designed with the intent of finding or supporting a model representing a phenomenon or construct as defined by the relations between the various variables involved. The theoretical framework underlying this model is a major determinant for the choice of analytic technique, for how this technique is applied, and for how the results are interpreted. That is, the potential for success in a quantitative research study is determined, among other things, by the use of analytic techniques appropriate for the model involved.
When designing a research study one should always be aware of what a statistically significant finding means: if there were nothing to be found (that is, if the null hypothesis were true), results at least as extreme as those observed would be improbable. Therefore, a sound research design is based on a chain of decisions about the effect size that makes the relationships sought substantively meaningful for the study, the level of significance and the power of the statistical test, and a calculation of sample size. This approach helps avoid pitfalls such as findings that are meaningful but not statistically significant or findings that are statistically significant but not meaningful. That is, the focus of the research process should be on the meaning of the findings from the perspective of the theory and existing research. (This meaning, such as relations between variables or differences between means, cannot be established in the absence of other research.)
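The chain of decisions described above (effect size, significance level, power, sample size) can be sketched numerically. The example below is a minimal illustration, not a substitute for a proper power analysis: it assumes a two-sided comparison of two group means, a standardized effect size (Cohen's d), and the normal approximation n = 2((z_{1-α/2} + z_{power}) / d)² per group. The function name and default values are chosen here for illustration only.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per group (normal approximation).

    Assumes a two-sided test comparing two independent group means and a
    standardized effect size (difference in means / common std. deviation).
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = standard_normal.inv_cdf(power)          # quantile for desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller effects that still matter to the study require far larger samples.
n_medium = sample_size_per_group(0.5)
n_small = sample_size_per_group(0.2)
print(f"d = 0.5: {n_medium} per group; d = 0.2: {n_small} per group")
```

Because the calculation uses the normal rather than the t distribution, it slightly underestimates the sample size needed for small studies; dedicated power-analysis software should be used for actual planning.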
4.1 Occam’s Razor
Also known as the Principle of Parsimony, it states that simple explanations are better than more complicated ones. In statistics for example, using Occam’s Razor means that an explanatory model with fewer variables is better than one with a larger number of variables.
Attributed to the English philosopher William of Ockham, the Principle of Parsimony states: Given a set of equally valid equivalent explanatory models, the best explanation is the simplest one. In statistical terms, it means:
- The model should have as few variables as possible.
- Linear models should be preferred to non-linear models.
- Models that rely on fewer assumptions should be preferred to those that rely on many.
- Simpler explanations should be preferred to more complicated ones.
Used in the process of simplifying statistical models, the Principle of Parsimony advises that a variable should be excluded from the model when removing it does not produce a significant increase in deviance. (Deviance is a statistic used to compare models. If the two models, with and without the variable in question, are not significantly different, the simpler model, the one without the variable, should be selected.)
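As a rough illustration of this rule, the sketch below compares a model with a single predictor against the simpler intercept-only model. It assumes a Gaussian linear model, for which the change in deviance reduces to a likelihood-ratio statistic n·ln(RSS_simple / RSS_full), and it uses the asymptotic chi-square distribution with one degree of freedom to obtain a p-value. The data are invented, and with only eight observations the chi-square approximation is crude, so the numbers are for demonstration only.

```python
from math import erfc, log, sqrt


def rss_intercept_only(y):
    """Residual sum of squares of the model that predicts the mean of y."""
    mean_y = sum(y) / len(y)
    return sum((value - mean_y) ** 2 for value in y)


def rss_with_predictor(x, y):
    """Residual sum of squares of the least-squares line y = a + b*x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    return syy - sxy ** 2 / sxx


def drop_variable_p_value(x, y):
    """p-value for the increase in deviance when x is removed from the model."""
    statistic = len(y) * log(rss_intercept_only(y) / rss_with_predictor(x, y))
    statistic = max(statistic, 0.0)  # guard against rounding below zero
    return erfc(sqrt(statistic / 2))  # chi-square survival function, 1 df


# Hypothetical data: the outcome tracks one variable and ignores the other.
outcome = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2]
relevant = [1, 2, 3, 4, 5, 6, 7, 8]
irrelevant = [1, -1, -1, 1, 1, -1, -1, 1]

p_relevant = drop_variable_p_value(relevant, outcome)
p_irrelevant = drop_variable_p_value(irrelevant, outcome)
print(f"removing relevant variable:   p = {p_relevant:.3g}")
print(f"removing irrelevant variable: p = {p_irrelevant:.3g}")
```

Removing the relevant variable increases deviance significantly, so it stays in the model; removing the irrelevant one does not, so parsimony favors the simpler model.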
OK, but what does this mean for a study? In essence, the Principle of Parsimony advises you to keep your research model as simple as possible. Many researchers, myself included, have at times a tendency to overcomplicate their studies (by including as many variables as can be fit and collected, for example), effectively overlooking this principle.
If you have ever worked in one of the social science fields you probably know how difficult it can be to access and recruit participants for a study. Therefore, when the opportunity arises, one might tend to include as many data collection items as feasibly possible. While having a lot of data is not a bad situation to be in, it also makes it tempting to fit it all in the model. This can also happen when a researcher tries to justify collecting more data just because access to participants is readily available.
Don’t do it. That is not to say that you shouldn’t collect the data if possible. By all means, if the opportunity presents itself, do so. Nevertheless, do not use more data than you actually need in your analysis. That is, let the Principle of Parsimony guide your decisions and ask yourself the following questions:
- Are all variables (data) collected valuable and relevant to the study?
- From a theoretical perspective, would the inclusion of a variable make a difference?
The answer may not be obvious at first. To find out which model is more parsimonious, you should first start looking at the existing literature. Attempt to find similar studies and study the models they proposed as well as the data they used. You can then analyze each piece of data (variable) individually to assess if it, theoretically, adds value to your study, from both the perspective of prior research and based on your own understanding of the problem under study.
When you decide how complex to make your model and how much data you need to collect to validate it, think of the downsides increased complexity brings about. Here are a few:
- Increased difficulty in analyzing and interpreting data.
- Longer and more complicated research instruments. This could significantly increase the time participants need to complete the task which, in turn, could increase the chance that participants leave the study early or decline to participate at all. That is, the longer an instrument is, the less likely participants are to complete it.
- The time you will need to complete the study can also increase significantly. If, for example, you are working towards a doctoral dissertation, increased complexity could delay completion and graduation.
4.2 Experimental Versus Non-Experimental Research
A cursory search will show the wide variety of quantitative designs researchers use. Nevertheless, all of them can be included in one of two types: experimental or non-experimental studies. The most significant difference between the two types is the level of control researchers have over the environment in which the research is conducted.
Experimental research studies are designed to provide the researchers with the highest level of control possible over the experimental conditions. The intent of an experiment is to discover the relationships between the variables of interest while attempting to hold all other variables constant (or control them; random assignment, for example, is one way to control for differences between subjects in research involving human subjects). For this purpose the experimenter usually manipulates a condition (the treatment) and attempts to assess its impact on one or more variables of interest. Because of the ability to manipulate the experimental conditions and the restrictive design aimed at controlling other extraneous variables as much as possible, experimental research offers the best chance of finding causal relationships between variables. Nevertheless, the controlled nature of these studies makes them less capable of reflecting reality. Given the restrictive nature of the design, experiments offer a high level of reliability and control. (In STEM fields, studies conducted in the laboratory employ experimental designs. In non-STEM fields experimental design is used, for example, to understand the differences between groups of participants (people) subjected to different experimental conditions, such as different visual stimuli or different learning environments.)
Non-Experimental research studies are designed to look at phenomena and contexts the researcher does not have control over. In this case the researchers cannot manipulate the conditions or variables of interest and they have to rely on observations and measurements of variables available to them and use those to seek an answer to the research question they pose. This lack of control renders non-experimental studies less capable of identifying causal relationships due to the large number of variables that usually accompany real-life contexts. Therefore, while a relationship may be observed and inferences can be made, the researcher’s ability to strongly suggest causality is limited by the potential interference of other variables that were not accounted for. (The context in a non-experimental social sciences research study is so complex that the researchers cannot capture and measure every variable that may influence the phenomena they are studying. For example, in educational studies, prior knowledge has significant influence on how well a learner understands new concepts. In real-life situations, such as during classroom instruction, there may not be time or capabilities to assess the learners’ prior knowledge of a concept or construct.) Because non-experimental studies look at phenomena in their natural environment, they tend to have a higher level of external validity, which makes them much easier to generalize to larger populations. (These types of studies are frequently encountered in the social sciences fields, where researchers attempt to study phenomena as they unfold in their normal environments.)
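Random assignment, mentioned above as one way to control for differences between subjects, can be sketched in a few lines. This is a minimal illustration assuming two equally sized groups; the participant identifiers, the function name, and the fixed seed (used only to make the example reproducible) are hypothetical.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into treatment and control groups."""
    rng = random.Random(seed)      # local generator, so the seed affects only this call
    shuffled = list(participants)  # copy, leaving the original list untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical participant identifiers P01 ... P20.
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomly_assign(participants, seed=42)

for name, members in groups.items():
    print(name, members)
```

Because chance alone decides group membership, pre-existing differences between participants are expected to be spread evenly across the two groups rather than concentrated in one of them.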
Going a bit further, quantitative research studies can be grouped into descriptive, correlational, quasi-experimental, and experimental.
Descriptive studies are designed to describe the status of a phenomenon using mostly observational type data. They do not have hypotheses, though one may be developed after the data is examined.
Correlational studies use mostly observational data to explore relationships between variables without looking at cause-effect relationships.
Quasi-experimental studies are designed to recognize cause-effect relationships between variables in situations when no groups are assigned beforehand and no variables are manipulated to elicit a desired outcome. The groups for which variable statistical summary data are compared are identified after the data has been collected.
Experimental studies follow the guidelines of the scientific method and are specifically designed to verify the existence of a cause-effect relationship between variables describing a phenomenon. For this purpose all efforts should be made to control for as many variables as possible while manipulating the variable(s) of interest.
4.3 Between-Subjects vs. Within-Subjects Designs
The between-subjects and within-subjects research designs are differentiated by the number of measurements taken for every subject. In between-subjects designs only one measurement is performed for each participant, while in within-subjects designs there are multiple, successive measurements, which is why within-subjects studies are often called repeated measures studies. (Strictly, repeated measures studies are only a subgroup, probably the largest, of the broader category of within-subjects designs.)
In essence, the between-subjects research design allows researchers to study the differences between groups of participants at a given point in time. Such studies usually involve comparing the groups on one or more summary or central tendency measures (e.g., mean or median). The participants are part of only one of the research groups and are exposed to only one intervention.
In within-subjects designs all participants are members of the same group and all are exposed to all treatments. The comparison usually happens between the successive values of central tendency measures of the same variable. These types of designs tend to have more power than between-subjects designs and make it possible to observe change over time, but tend to suffer from confounding issues. Confounding in within-subjects designs can be mitigated by counterbalancing. For example, participants can be grouped together in small groups and the order in which they are subjected to the various treatment conditions can be randomized across these groups. Alternatively, the order in which the treatments are applied can be randomized for each individual subject.
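The claim that within-subjects designs tend to have more power can be illustrated with a small numerical sketch. The scores below are invented. The same numbers are analyzed twice: once as if they came from two independent groups (a Welch t statistic) and once as two measurements of the same six participants (a paired t statistic). Both formulas are standard, but the function names are chosen here for illustration.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Welch t statistic for two independent groups (between-subjects)."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_b) - mean(group_a)) / standard_error


def paired_t(before, after):
    """Paired t statistic for two measurements per subject (within-subjects)."""
    differences = [y - x for x, y in zip(before, after)]
    return mean(differences) / sqrt(variance(differences) / len(differences))


# Hypothetical scores for six participants measured before and after a treatment.
before = [10, 12, 9, 14, 11, 13]
after = [12, 13, 11, 15, 13, 14]

t_between = independent_t(before, after)
t_within = paired_t(before, after)
print(f"treated as independent groups: t = {t_between:.3f}")
print(f"treated as repeated measures:  t = {t_within:.3f}")
```

The paired statistic is much larger because each participant serves as their own control, which removes the stable differences between individuals from the error term.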
4.4 Variables
Variable: A specific characteristic that can be measured and can assume different values.
Continuous or Quantitative variables: variables that have numerical values, such as test scores, lengths, durations, etc.
Classification or Categorical variables: represent categories, usually used as grouping variables, such as gender or race.
4.4.1 Measurement Scales
One of the major ways of understanding variables is to look at how they are measured and their scale of measurement. From this perspective variables can be nominal, ordinal, interval, or ratio. This classification is important because the statistical procedure to be used depends on the scale of measurement.
Nominal Scale: Classifies in mutually exclusive categories. The variable becomes a classification variable.
Ordinal Scale: Rank order with respect to the variable being assessed. The values represent a hierarchy of levels. Provides limited information because equal steps in the scale values do not necessarily have an equal real-life quantitative meaning. (As an example, letter grades are ordinal because how much better an A is than a B cannot be known. For a score range between 0 and 100, an A is between 90 and 100 and a B between 80 and 89.9, so the difference between an A and a B can be anywhere between 0.1 and 20 points. An F covers any score below 50, a range unlike the others. Recoding to a scale of 1 to 5 is misleading because the numerical difference between 1 and 2 is not the same as the difference between F and D.)
Interval Scale: Provides more information than the ordinal scale because equal differences between values have an equal real-life meaning/counterpart. The downside of the interval scale is that it has no true zero point. (A true zero point means that a value of zero on the scale corresponds to a zero quantity of the variable being assessed. For example, the Celsius scale does not have a true zero point because the value of 0 does not mean that there is absolutely no heat present.)
Ratio Scales: Are similar to interval scales in that equal differences between scale values have equal real-life quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional property: with ratio scales, it is possible to make meaningful statements about the ratios between scale values. (For example, the system of inches used with a common ruler is a ratio scale: zero inches does, in fact, indicate a complete absence of length. The Kelvin scale, as opposed to the Celsius and Fahrenheit scales, is a ratio scale because 0 kelvin means, by design, that there is absolutely no heat present.)
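The difference between interval and ratio scales can be checked with a short calculation. The sketch below uses two arbitrary temperatures: differences are the same on the Celsius and Kelvin scales, but the ratio computed from Celsius values is an artifact of where that scale places its zero.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to kelvin (0 K is a true zero point)."""
    return celsius + 273.15


warm_c, cool_c = 20.0, 10.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

# Differences are meaningful on both scales (interval property).
difference_c = warm_c - cool_c
difference_k = warm_k - cool_k

# Ratios are meaningful only on the scale with a true zero (ratio property).
ratio_c = warm_c / cool_c  # suggests "twice as hot", which is not physically true
ratio_k = warm_k / cool_k  # the physically meaningful ratio

print(f"difference: {difference_c:.1f} degrees C, {difference_k:.1f} K")
print(f"ratio from Celsius values: {ratio_c:.3f}")
print(f"ratio from Kelvin values:  {ratio_k:.3f}")
```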
4.4.2 Independent vs. Dependent Variables
Many research studies are primarily focused on two categories of variables: dependent and independent. This categorization is based on what the variable measures and how it is intended to be used in the analysis. To understand the difference, let’s look at an experimental study design.
An experimental study can be designed to find, for example, if two interventions or treatments offer different or similar outcomes. This implies the study needs two groups of participants, with one of the groups being subject to one of the treatments, while the second group is subject to the other. If the groups come from the same population and are homogeneous enough, the experiment should be able to recognize the effects of the treatments. To recognize the effects, the design should also include ways to measure them, such as test grades or scores.
In terms of variables, the one measuring the effects or outcomes is the dependent variable (DV), while the one that places the participants in groups is the independent variable (IV). The study will attempt to determine the influence of the independent variable (treatment group membership) on the dependent variable.
The dependent variables are those that measure an observed effect. An example of a dependent variable is a test score used as a proxy for students’ performance on a task.
The independent variables reflect either conditions specified by design to help single out the effect or determined by conditions that are outside the researcher’s and, potentially, the participant’s control. Examples of such variables are grouping variables (e.g., treatment vs. control groups) or demographic variables (e.g., age, gender, etc.).
With this knowledge let’s look at an example. Consider a study in which the participants are assigned to one of two groups. This is under the researcher’s control. More exactly, these groups have been formed by design to represent two different conditions or interventions. In the analysis, they will be represented by the independent variable. On the other side, the researcher has devised a way to measure the effects of each of the two conditions or interventions (e.g., test scores). As, by design, the researcher expects these test scores to be, on average, different for the two groups, one could say that they “depend” on which group the participants were assigned to. In the analysis, this measure translates into the dependent variable. All else equal, the effects (measured by the dependent variable) depend on the group to which the participants were assigned (represented by the independent variable). Let’s consider a more concrete example.
Let’s say that we are studying the influence of skill and drill practice on mathematics performance in high school students. For this purpose we select two groups of students. One group of students will do math as usual and will not engage in any skill and drill practice. The other group will continue to do math as usual but, in addition, will have a few extra sessions during which they will do math drills. In this case, the first group is the control group and the second group is the treatment group. The variable that defines the group a student is a member of is the independent variable. The next step is to find a way to assess the students’ performance. Let’s say that we choose a specific math test to be administered to all students after the treatment has ended. (Our expectation might be, for example, for the students in the skill and drill group to perform better on this test than the students who did not do any math drills.) The score a student obtains on this test can be considered a measure of their performance. Therefore, for further analysis, this score is considered to be the dependent variable. That is, the score a student obtains depends on which group they were a member of and, therefore, on the type of training (treatment) they engaged in.
The variables in a research design, while serving the same purpose and representing the same components of a relationship, may be found under different names. For example, in experimental research studies the terms independent variable (the presumed cause) and dependent variable (the studied effect) are preferred. For non-experimental studies, the preference is for predictor variable (the presumed cause) and criterion or response variable (the studied effect).
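In a data set, the skill and drill example translates into one column for the independent variable (group membership) and one for the dependent variable (test score). The sketch below uses invented scores purely to show that layout and the group means an analysis would go on to compare; none of the numbers come from an actual study.

```python
from statistics import mean

# Each record pairs the independent variable (group) with the dependent
# variable (math test score). All values are hypothetical.
records = [
    ("control", 70), ("control", 75), ("control", 80), ("control", 65), ("control", 72),
    ("treatment", 78), ("treatment", 82), ("treatment", 85), ("treatment", 74), ("treatment", 80),
]


def group_means(rows):
    """Average the dependent variable separately for each level of the IV."""
    scores_by_group = {}
    for group, score in rows:
        scores_by_group.setdefault(group, []).append(score)
    return {group: mean(scores) for group, scores in scores_by_group.items()}


means = group_means(records)
for group, average in means.items():
    print(f"{group}: mean score = {average}")
```

Whether a difference between the two means is large enough to support an inference is a question for a statistical test, not for the raw means alone.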
4.5 Descriptive vs. Inferential Designs
Descriptive statistical analysis is focused on measuring population characteristics. For the purpose of these analyses a population is defined as the entire collection of subjects or things that are being studied (for example, all the students in a course).
Inferential analysis relies on statistics: numerical values calculated using a sample (or subset) of people, objects, events, etc. A statistic can be used to describe the characteristics of the sample (for example, the mean of some value) and/or to make inferences or estimates about the population from which the sample was drawn.
Most statistical tests included in this resource will probably fall in the inferential category. Overall, inferential tests can be categorized in two basic types: tests of group differences and tests of association.
Tests of group differences - are designed to help determine if there are differences in the mean scores of one or more dependent variables (or criterion variables in non-experimental studies) between two or more populations. One of the best known examples is the t-test.
Tests of association - are designed to determine, for a single population, if there is a relationship between two or more variables that describe this population. The best known example is the correlation coefficient.
A third, more involved, class of inferential analyses makes it possible to study whether the association between two variables is the same across two or more populations. An example of this type of analysis is ANCOVA.
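As a small illustration of a test of association, the sketch below computes the Pearson correlation coefficient and, by applying the same formula to ranks, the Spearman coefficient. The data are invented and chosen so that the relationship is perfectly monotonic but not linear. The ranking helper assumes there are no tied values, which keeps the example short; a real analysis would need to handle ties.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman coefficient: the Pearson coefficient of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: y grows with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.4f}, Spearman rho = {rho:.4f}")
```

Spearman's coefficient reaches 1 because the ordering of the two variables agrees perfectly, while Pearson's coefficient stays below 1 because the relationship is curved.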
4.6 Research Questions
In quantitative studies research questions ask, in essence, if a relationship exists between two events. In most cases this relationship is causal in nature. That is, a research question asks if the onset of an event has an impact on some other event. Let’s use the well known butterfly effect (a term used in chaos theory, coined by Edward Lorenz) as an example.
Would the flap of the wings of a butterfly in the Amazonian jungle influence the number of hurricanes in Japan?
A closer look shows that this is indeed a question. The first part introduces the originating event or the cause (butterfly flaps wings). The second part describes the effect (number of hurricanes in Japan). Further analysis shows that this question only asks if a relationship exists, but does not include any indication of how strong the relationship is and in which direction the effect will be. This is called a non-directional research question.
If the literature supports or suggests a direction for the causal relation, then a directional research question would be more appropriate as it includes an indication of how the relation is thought to behave. Let’s transform the question above to a directional research question.
Would the flap of the wings of a butterfly in the Amazonian jungle significantly increase the number of hurricanes in Japan?
This time the question suggests both the direction and the strength of the relationship, which is achieved by replacing the word influence with the words significantly increase. In this case, significantly is an indication of the strength of the relation and increase is an indication of the direction. (If the existing literature cannot provide any guidance as to what the strength of the relation may be, removing significantly will not affect the type of question.)
4.7 Hypotheses
Hypotheses are questions worded as statements to be tested using statistical tests. They are derived from the study’s research questions and describe the causal relation(s) between events and/or variables. In the most basic format hypotheses are bi-variate in that they state the influence of one independent variable (IV) on one dependent variable (DV).
Null Hypothesis (H0) - statement saying that nothing is different. For studying group differences the null hypothesis states that there are no differences between group means of some variable of interest. For the study of association the null hypothesis states that there are no relationships between the variables of interest.
Alternative Hypothesis (H1) - is the opposite of the null hypothesis and states that there is a significant difference between the means or that a relationship exists between the variables. (Some difference between the sample means is likely to exist; the question is whether that difference is significant so that an inference can be made based on the results.)
Alternative hypotheses can be further classified into non-directional and directional.
Non-directional alternative hypotheses - predict that the means of the populations differ significantly but do not make a specific prediction about the direction of the difference (which one is higher or lower). These types of hypotheses can be answered using two-sided (two-tailed) statistical tests. (If the statistic computed by the test follows a symmetrical distribution, there are three possible alternatives for defining hypotheses to test: two for one-sided tests and one for a two-sided test. A one-sided test looks at only one side, left or right, of the distribution curve of the statistic, effectively testing for one direction of the relationship while ignoring the other. A two-sided test examines both tails of the distribution, but with less resolution. Two-sided statistical tests are considered less powerful than one-sided tests, which test the directionality of the hypothesis in addition to the significance of the difference.)
Directional alternative hypotheses - are more specific in that, in addition to predicting that the means of the groups differ on some variable, they also predict which of the means will be higher and which lower. These hypotheses can be tested using more powerful one-sided (one-tailed) statistical tests.
Because the one-sided statistical tests are more powerful than the two-sided variety, using directional alternative hypotheses is preferred to the use of non-directional alternative hypotheses.
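The difference between one-sided and two-sided tests can be seen in a short calculation. The sketch below assumes a test statistic that follows the standard normal distribution under the null hypothesis (a z statistic); the value 1.8 is arbitrary and chosen only because it falls between the one-sided and two-sided 0.05 cutoffs.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """One-sided (upper tail) and two-sided p-values for a z statistic."""
    standard_normal = NormalDist()
    upper_tail = 1 - standard_normal.cdf(z)
    both_tails = 2 * (1 - standard_normal.cdf(abs(z)))
    return {"one_sided_greater": upper_tail, "two_sided": both_tails}


# Same size of effect, observed in the predicted and in the opposite direction.
predicted_direction = p_values_from_z(1.8)
opposite_direction = p_values_from_z(-1.8)

print("z = +1.8:", {k: round(v, 4) for k, v in predicted_direction.items()})
print("z = -1.8:", {k: round(v, 4) for k, v in opposite_direction.items()})
```

When the effect goes in the predicted direction, the one-sided p-value is half the two-sided one, which is the source of the extra power. When the effect goes in the opposite direction, the one-sided test cannot detect it at all, which is the price paid for committing to a direction in advance.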
A good hypothesis includes three elements:
- A clear statement of the causal relationship to be tested;
- A clear indication of the direction of that causal relationship, if known;
- A clear indication of the variables between which the causal relation occurs.
4.8 Hypothesis Testing
Testing a hypothesis means determining if the null hypothesis (H0) can be rejected with (acceptable) confidence. For this reason, statistical tests compute the p-value as the probability of obtaining a value of the test statistic at least as extreme as the one actually computed, if the null hypothesis is true. (Important: p does NOT provide the probability that the null hypothesis is true.) Therefore, if the value of p is very small, the null hypothesis (H0) can be rejected in favor of the alternative hypothesis (H1).
One commonly accepted cutoff value for p is 0.05. If the computed p-value is greater than 0.05, the test fails to reject the null hypothesis; this does not show that the null hypothesis is true, only that the data do not provide enough evidence against it. If the computed p-value is less than 0.05, the null hypothesis may be rejected in favor of the alternative hypothesis, indicating that the differences or relationships found seem to be statistically significant.
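One way to see what a p-value measures is an exact permutation test, sketched below with invented scores for two small groups. If the null hypothesis is true, group labels are interchangeable, so the sketch enumerates every way of splitting the ten scores into two groups of five and counts how often the difference in means is at least as extreme as the one observed. This is a swapped-in illustration of the definition, not the procedure behind any particular test named in this resource.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    for chosen in combinations(range(len(pooled)), len(group_a)):
        chosen_set = set(chosen)
        relabeled_a = [pooled[i] for i in chosen_set]
        relabeled_b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        # Small tolerance so ties with the observed value count as "as extreme".
        if abs(mean(relabeled_a) - mean(relabeled_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical test scores for two groups of five participants.
treatment = [85, 90, 88, 92, 95]
control = [78, 82, 80, 85, 79]

p_value = permutation_p_value(treatment, control)
print(f"p = {p_value:.4f}")
print("reject H0 at the 0.05 level" if p_value < 0.05 else "fail to reject H0")
```

Only a handful of the 252 possible splits produce a difference as large as the observed one, so the p-value is small. It describes how unusual the data would be under the null hypothesis, not the probability that the null hypothesis is true.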
The reference p-value is selected for each analysis individually based on the level of confidence in the predictive power of the test necessary to generalize the findings from the sample to the population from which the sample was drawn.
4.9 The p-Value Controversy
In recent years scientists have voiced concerns about potential misuse of the p-value in research (Amrhein, Greenland, and McShane 2019. “Retire Statistical Significance.” Nature 567: 305–7. https://doi.org/10.1038/d41586-019-00857-9). There seem to be a few more widely accepted explanations for this: misunderstanding of what the p-value is, tradition as reflected by the education researchers receive in their formative years, and journal reliance on p-values in accepting submissions for publication.
According to the American Statistical Association (ASA), an informal explanation of the p-value is (Wasserstein and Lazar 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108):
“A p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.”
Acknowledging this controversy, the American Statistical Association (ASA) recommends researchers consider and follow a few guiding principles in designing, conducting, and reporting their studies (Wasserstein and Lazar 2016):
- The smaller the p-value is, the more incompatible the data are with the null hypothesis, provided a set of underlying assumptions holds true. That is, p-values are an indication of how compatible or incompatible the data are with a hypothesized statistical model.
- The p-value is a statement about how the data relate to a hypothetical explanation, not about the explanation itself. That is, the p-value does not represent the probability of the hypothesis being true or false.
- The results of an analysis should not be interpreted as a hard yes or no as a statistical finding is not automatically true or false depending on where it falls related to the p-value threshold. That is, scientific conclusions or business decisions should not be based on the p-value alone.
- P-hacking or cherry-picking the results tends to generate a body of research skewed towards significant findings. This can be avoided through transparency and open and full reporting of a study and its findings.
- Even weak treatments can produce small p-values if the sample is large enough. Conversely, strong treatments may produce unimpressive (large) p-values if the sample is too small or the measurements are imprecise. That is, the p-value cannot measure the size of an effect because statistical significance is not the same thing as scientific or human significance.
- Because it provides limited information, the p-value requires a context in which to be interpreted. That is, the p-value is irrelevant by itself.
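The point that sample size, and not only the strength of a treatment, drives the p-value can be illustrated with a short calculation. The sketch assumes two equally sized groups, a standardized effect size (difference in means divided by the common standard deviation), and a z statistic of d·sqrt(n/2); the effect sizes and sample sizes are arbitrary.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value of a z-test when the observed standardized
    difference between two equally sized groups equals effect_size."""
    z = effect_size * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A weak effect in a very large sample versus a strong effect in a tiny one.
p_weak_large_sample = two_sided_p(effect_size=0.05, n_per_group=20_000)
p_weak_small_sample = two_sided_p(effect_size=0.05, n_per_group=50)
p_strong_tiny_sample = two_sided_p(effect_size=0.8, n_per_group=5)

print(f"weak effect, n = 20000 per group: p = {p_weak_large_sample:.2g}")
print(f"weak effect, n = 50 per group:    p = {p_weak_small_sample:.2g}")
print(f"strong effect, n = 5 per group:   p = {p_strong_tiny_sample:.2g}")
```

The weak effect becomes "significant" only because of the sample size, while the strong effect does not reach the 0.05 threshold in the tiny sample, which is why the p-value should be reported alongside the effect size.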
The proposed solution for the issues raised by the inadequate use of p-values in research studies is to use other approaches instead of or in addition to them. Because they are easier to reason about, the most common suggestion is to use confidence intervals instead of the p-value. (Confidence intervals describe the variability surrounding the sample point estimate. The wider the interval, the less confident one can be about the estimate of the population mean. In general, the larger the sample size, the more precise the estimate is.)
Despite all these issues scientists raise, the p-value remains a valuable tool in the researcher’s toolbox. It just needs to be used with care, not treated as a binary, definitive answer, and the research using it should observe the appropriate guidelines for design, collection, and reporting. This resource uses the p-value in its traditional sense and attempts to follow, as much as possible, the above-mentioned guiding principles.
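A confidence interval for a mean can be sketched as follows. The example assumes the normal approximation (mean ± z × standard error), which is reasonable for large samples; for small samples a t-based interval would be somewhat wider. The scores are invented, and the larger sample simply repeats the same values to show how the interval narrows as the sample grows.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    center = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    return center - z * standard_error, center + z * standard_error


# Hypothetical scores; the larger sample repeats the same values four times.
small_sample = [72, 75, 78, 81, 84]
large_sample = small_sample * 4

low_small, high_small = mean_confidence_interval(small_sample)
low_large, high_large = mean_confidence_interval(large_sample)

print(f"n = {len(small_sample)}:  [{low_small:.2f}, {high_small:.2f}]")
print(f"n = {len(large_sample)}: [{low_large:.2f}, {high_large:.2f}]")
```

Both intervals are centered on the same sample mean, but the one from the larger sample is narrower, reflecting a more precise estimate of the population mean.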
4.10 How to Choose the Appropriate Statistical Test
Two simple criteria, the scale of each variable and the number of variables of each type (dependent and independent), can be used as the starting point for determining which statistical analysis would be more appropriate. (The MyReLab website (https://www.myrelab.com) offers a tool that helps with the selection of an appropriate statistical test.) Of course, once a possible analysis is selected, it should be carefully considered, as not all analyses work in all instances. If we were to build a table or diagram of all possible alternatives, for each specific case, it would quickly become unusable. Therefore, Tables 4.1, 4.2, and 4.3 offer some guidelines for what general type of analysis one should start from (adapted from Hatcher and Stepanski 1994. A Step-by-Step Approach to Using the SAS System for Univariate and Multivariate Statistics. SAS Institute).
Multiple types of analyses can be applied for the same combination of variables/scales. The final selection depends on the specifics of the analysis as it applies to the actual data. For example, Table 4.1 shows that three statistical tests can be used for an analysis with one nominal independent variable and an interval or ratio dependent variable. The Kruskal-Wallis test is usually used for ordinal dependent variables, but can be used with interval/ratio dependent variables when these show significant departures from normality. Similarly, the t-Test is applicable only if the independent variable has only two possible values. Therefore, when deciding which analysis to use, the requirements and assumptions of each statistical test should be carefully considered.
The advice on when to apply the most commonly used statistical analyses presented in Tables 4.1, 4.2, and 4.3 has been adapted from Hatcher and Stepanski (1994).
Table 4.1: ONE DV x ONE IV
| ONE Independent Variable | ONE Dependent Variable | Statistical Analysis |
| --- | --- | --- |
| Nominal | Interval/Ratio | t-Test, One-Way ANOVA |
| Ordinal/Interval/Ratio | Ordinal/Interval/Ratio | Spearman Correlation Coefficient |
| Interval/Ratio | Interval/Ratio | Pearson Correlation Coefficient |
Table 4.2: ONE DV x MANY IVs
| MANY Independent Variables | ONE Dependent Variable | Statistical Analysis |
| --- | --- | --- |
| Nominal/Interval/Ratio | Interval/Ratio | ANCOVA, Multiple Regression |
Table 4.3: MANY DV x ONE or MANY IVs
| Independent Variable(s) | MANY Dependent Variables | Statistical Analysis |
| --- | --- | --- |
| Nominal (ONE) | Interval/Ratio | One-Way MANOVA |
| Nominal (MANY) | Interval/Ratio | Factorial MANOVA |
| Interval/Ratio (MANY) | Interval/Ratio | Canonical Correlation |
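For the single dependent variable case, the starting-point rules in Tables 4.1 and 4.2 can be expressed as a small lookup function. This is a hypothetical helper written only to make the logic of the tables explicit; its name, arguments, and return format are assumptions, and it covers only the rows shown in those two tables. As the text notes, the assumptions of each suggested test still have to be checked against the actual data.

```python
INTERVAL_OR_RATIO = {"interval", "ratio"}
ORDINAL_OR_HIGHER = {"ordinal", "interval", "ratio"}


def suggest_analyses(iv_scales, dv_scale):
    """Suggest starting-point analyses for ONE dependent variable.

    iv_scales: list with the measurement scale of each independent variable.
    dv_scale:  measurement scale of the single dependent variable.
    Returns an empty list when no row of Tables 4.1 and 4.2 applies.
    """
    if len(iv_scales) == 1:  # Table 4.1: one IV, one DV
        iv_scale = iv_scales[0]
        suggestions = []
        if iv_scale == "nominal" and dv_scale in INTERVAL_OR_RATIO:
            suggestions += ["t-Test", "One-Way ANOVA"]
        if iv_scale in INTERVAL_OR_RATIO and dv_scale in INTERVAL_OR_RATIO:
            suggestions.append("Pearson Correlation Coefficient")
        if iv_scale in ORDINAL_OR_HIGHER and dv_scale in ORDINAL_OR_HIGHER:
            suggestions.append("Spearman Correlation Coefficient")
        return suggestions
    # Table 4.2: many IVs, one DV
    allowed_iv_scales = {"nominal"} | INTERVAL_OR_RATIO
    if (len(iv_scales) > 1 and set(iv_scales) <= allowed_iv_scales
            and dv_scale in INTERVAL_OR_RATIO):
        return ["ANCOVA", "Multiple Regression"]
    return []


print(suggest_analyses(["nominal"], "interval"))
print(suggest_analyses(["interval"], "ratio"))
print(suggest_analyses(["nominal", "interval"], "ratio"))
```

Returning every matching row, rather than a single answer, mirrors the point that several analyses can apply to the same combination of scales and that the final choice depends on the data.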
There are many resources available to help with deciding what statistical test to use for data analysis. For example, Bruce Frey’s book There’s a Stat for That! What to Do and When to Do It (Frey 2016; Sage Publications, Inc.) provides a thorough overview and guides the reader through the selection process, which makes it a worthy addition to any researcher’s toolbox.