Statistics in Psychological Research
One common recommendation for overcoming some of the difficulties in interpreting isolated p values is to make wider use of a statistical tool called a confidence interval (Cumming, 2012). In null hypothesis significance testing, a sample statistic like the mean is a point estimate, a single value used to approximate a parameter of some population. Alternatively, we could generate an interval estimate, a range of values within which we believe the parameter lies. Often news reports of polls and surveys include both a point estimate and an interval estimate called the margin of error. By far, the most commonly used interval estimate in psychological research is the confidence interval, introduced and discussed in the earlier Confidence Intervals video.
The confidence interval is described in terms of a percentage, such as a 95% confidence interval or a 99% confidence interval. The value of the percentage is termed the confidence level. The margins of error we see in news reports are typically 95% confidence intervals. The 95% confidence interval is probably the most common in behavioral research, but whenever a confidence interval or margin of error is reported, it is good practice to check for the exact level.
The procedure for calculating a confidence interval is to compute a lower limit and an upper limit of an interval around a sample mean, or mean difference. If samples were taken repeatedly from a population, then the confidence interval would contain the population parameter as often as indicated by the confidence level (e.g., 95% or 99% of the time).
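For readers who want to see the arithmetic, here is a minimal Python sketch of the procedure, assuming a small hypothetical sample and a known population standard deviation (so the normal, or z, critical value applies); the numbers are invented for illustration, not taken from the examples in this section.

```python
# Minimal sketch: a 95% confidence interval for a sample mean when the population
# standard deviation (sigma) is treated as known, so the normal (z) critical value applies.
from math import sqrt
from statistics import mean, NormalDist

sample = [78.0, 77.5, 79.5, 77.0]   # hypothetical sample of n = 4 heights (inches)
sigma = 3.5                          # assumed known population standard deviation

confidence = 0.95
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96

m = mean(sample)
margin = z_star * sigma / sqrt(len(sample))
print(f"95% CI: [{m - margin:.2f}, {m + margin:.2f}]")
# If sigma were estimated from the sample, a t critical value with n - 1
# degrees of freedom would replace z_star.
```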
The 95% confidence interval for the mean of the population from which this sample was drawn is 77.56 inches to 78.94 inches. Notice that the population mean of 79 does not fall within this confidence interval. What does it mean that the population mean does not fall within this 95% confidence interval? Confidence intervals can be understood as null hypothesis significance testing in disguise: the sample mean, which is the midpoint of the confidence interval, is statistically significantly different (at the corresponding alpha level) from any value that falls outside the interval. However, confidence intervals offer more information than a significance test alone, because they provide an interval estimate that reflects the size of the sample and the variability in the data.
The theory of confidence intervals, again, is that we use a procedure that, if repeated a large number of times, will generate an interval that includes the population parameter of interest x% of the time, x being the confidence level. When we use the common .05 value for alpha, we know that in the long run, 95% of our confidence intervals will include the population mean, and that 5%, the intervals that correspond to Type I errors, will not include the population mean. In practice, we only generate one interval. But using a known population and computer simulation, we can do in actuality what the confidence interval procedure normally assumes only in theory: generate a large number of confidence intervals and visualize what happens in the long run (sometimes called the dance of the confidence intervals; see Cumming, 2012). The 95% Confidence Intervals for n = 4 figure illustrates the "long run" based on the basketball player height example. Note that the intervals contain the population parameter about 95% of the time. When you have a particular interval, it either contains the population parameter (p = 1) or it does not (p = 0). The difficulty, of course, is that you do not know which of these is the case.
Note. The figure represents 95% confidence intervals for 100 random samples of n = 4 players, drawn from the list of 482 players on NBA team rosters in the season that ended in 2014, presented as a row of vertical lines. The mean for the population is 79 inches. The point at the center of each interval represents the sample mean, and the endpoints of the lines represent the lower and upper limits of the confidence interval. Confidence intervals that did not include the true population mean of 79 inches are colored red and marked with asterisks at their mean points. Six of the confidence intervals did not include the population mean, and 94 did.
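The simulation behind the figure can be sketched in a few lines of Python. This is an illustrative stand-in, not the original demonstration: it samples from a hypothetical normal population with mean 79 inches and an assumed standard deviation, whereas the figure sampled from the actual roster list.

```python
# Sketch of the "dance of the confidence intervals": draw many samples from a known
# population, build a 95% CI from each, and count how often the intervals capture
# the true mean.
import random
from math import sqrt
from statistics import mean, NormalDist

random.seed(1)
mu, sigma, n, reps = 79.0, 3.5, 4, 100        # assumed population values and design
z_star = NormalDist().inv_cdf(0.975)

hits = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    margin = z_star * sigma / sqrt(n)
    lo, hi = mean(sample) - margin, mean(sample) + margin
    hits += lo <= mu <= hi                     # does this interval contain mu?

print(f"{hits} of {reps} intervals contained the population mean")  # about 95
```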
It is important to note that reasoning about the meaning of a confidence interval, like reasoning about null hypothesis significance tests, contains many potential missteps. For instance, before we collect the data and calculate the confidence interval, it is correct to say that the probability that the confidence interval will include the population mean matches the confidence level, so that if the confidence level is 95%, then the probability that the confidence interval we create will contain the population mean is .95. However, it is not correct to say, after a confidence interval is calculated, that the probability that the interval obtained contains the population mean is .95. Remember that the probability that a confidence interval will include the population mean is a statement about the procedure. The procedure is designed to produce intervals that include the population mean a certain proportion of the time. This is different from a claim that the probability that a population mean lies within a particular confidence interval equals the confidence level.
We will use the following example to review appropriate applications of statistical significance, effect size, and confidence intervals.
Example: Researchers wanted to investigate whether viewing images of attractive people (such as celebrities or models on social media) for 30 minutes would lower self-esteem compared to the average self-esteem value for adults in the United States. They collected data from 500 people and found (with α = .05) that average self-esteem scores were significantly lower for the group that viewed the attractive images, p = .01.
What might you conclude based on the results provided?
Remember, a statistically significant effect is not always an important effect. Very small differences can be statistically significant yet not meaningful, no matter how large the test statistic or how small the p value. Effect size statistics, on the other hand, are distinct from p values and cannot be used to judge statistical significance. Instead, they quantify the extent to which sample statistics diverge from the null hypothesis and answer the question: How big is the effect?
A common effect size measure, Cohen's d, expresses the size of the effect, often the size of a difference or mean difference, in standard deviation units. For example, we could expand the example's previous conclusion by adding an effect size statistic, such as d = −.52. This would indicate that the mean self-esteem for the group viewing images of attractive people for 30 minutes was .52 standard deviations smaller than the population mean.
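As a concrete illustration, the one-sample version of Cohen's d is simply the difference between the sample mean and the comparison mean, divided by the sample standard deviation. The scores below are made up for the sketch and are not the study's data, so the resulting d does not match the value reported in the example.

```python
# Minimal sketch of a one-sample Cohen's d: the standardized distance between a
# sample mean and a comparison (population) mean.  All values are hypothetical.
from statistics import mean, stdev

population_mean = 22.0                       # assumed population self-esteem score
scores = [18, 23, 20, 17, 21, 19, 22, 16]    # hypothetical sample of self-esteem scores

d = (mean(scores) - population_mean) / stdev(scores)
print(f"Cohen's d = {d:.2f}")   # about -1.02: this made-up sample mean sits about 1 SD below 22
```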
The effect size statistic can then be used to judge the importance of an effect. If mean self-esteem scores drop by 1 standard deviation after viewing images of attractive people, then that seems meaningful. However, if viewing images of attractive people for a longer period had an effect of d = −1.3, then that effect is both larger and more important (because a decrease in self-esteem was the outcome of interest).
In most cases, a larger effect size (whether in standardized or unstandardized units) is more meaningful or more important. Yet, there are important exceptions. In some cases, small effects can be important. For instance, if you are measuring a health outcome, then a small effect size could indicate only a slight decrease in symptoms, yet this small change could drastically improve a patient's quality of life. In the same way, a large effect does not guarantee importance. Effect size should always be considered in the context of the research scenario.
What additional information is provided by including a confidence interval in our results' summary? A confidence interval provides an estimated range of values within which a population parameter is believed to lie, with a certain confidence level expressed as a percentage. For example, a 95% confidence interval around a sample mean indicates that, before the data are collected, the probability is .95 that the procedure will produce an interval containing the population mean. Remember, only inferences that depend on the procedure are valid: Intervals will contain the population parameter about 95% of the time, but about 5% of the time, they will not. Inferences that depend on an isolated result are invalid. When you have one particular interval, it either contains the population parameter (p = 1) or it does not (p = 0). The difficulty is that you do not know which of these is the case (because the population parameter is unknown).
For the self-esteem example, creating a 95% confidence interval of the mean produces the best available estimate of a range of values that contains the population mean. The range of 20.87 to 23.13 can be used as an approximate marker of the population mean self-esteem score. Yet, it is still possible that the true population mean is not included in this interval.
View the results summary below that includes the p value, effect size statistic, and confidence interval of the mean, and consider how each of these pieces provides distinct information about the mean comparison.
Results summary: They collected data from 500 people and found (with α = .05) that average self-esteem scores were significantly lower for the group that viewed the attractive images, p = .01, d = −.52, 95% CI [20.87, 23.13].
One way to understand statistical power is to graph its relationship to Type I error and Type II error. We will use the example of a one-tailed, one-sample z-test to illustrate these relationships. In the Null and Type I Error figure, a null distribution (H0) is shown, with alpha marking the rejection region. Alpha also indicates the probability of a Type I error. Review the Null and Type I Error figure and notice how the H0 distribution aligns with the decision table.
Note. A null distribution, H0, is marked with a vertical line, the critical z value, over the right tail. The critical z value for the null distribution divides the null distribution (H0) into two regions, α (Type I error) to the right and the region of correct null acceptance on the left.
You may have noticed that the null distribution (H0) is divided into two regions: the area in the right tail above the critical value, the critical region that represents α, or the Type I error rate, and the remainder of the curve, which represents the probability of correctly retaining the null hypothesis if it is true. The area represented by α, the Type I error, is complementary to the area to the left of the critical value, which is 1 − α. If 1 − α is 95%, then this tells us that, if a sample is from the reference population (e.g., the null distribution), then there is less than a 5% chance the sample statistic will be more extreme than the critical value. However, remember that this is only the case when the null hypothesis is true and the sample is drawn from the reference population.
Now, review the next figure, Power and Type II Error, and consider the area under the alternative distribution (H1). This curve represents one possible distribution of values for a case in which the sample statistic is not from the reference population. Notice, again, that the distribution is divided into two regions. The area in the left tail of the alternative distribution, to the left of the critical value, represents β, or the Type II error rate, and the area to the right of the critical value represents the power of the analysis (1 − β). In this example, the Type II error rate is .20, or 20%, which means that power is .80, or 80%. Describing a Type II error in everyday language, we would say that there is a 20% probability of "missing" a difference that exists; in other words, not deciding the sample is different from the reference population when it actually came from a different distribution (i.e., our decision is a false negative). Because of the complementary relationship between Type II error and power, we can also say that the probability of obtaining a sample that leads us to reject the null hypothesis is 80%, which means that the statistical power of the test is 80%.
Note. The null distribution, H0, is on the left and the alternative distribution, H1, is on the right, with a vertical black line marking the critical z value over the right tail of the null distribution. The critical z value divides the alternative distribution (H1), the right-most distribution, into two regions: β (Type II error) to the left and power to the right.
Notice that the relationship between Type II error (β) and power (1 − β) is affected by the critical value (α); by the distance between the null and alternative distributions, which reflects the effect size (for example, Cohen's d); and by the overlap of the distributions, which is influenced by sample size (n). Each of these factors is important in determining the power of an analysis.
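For the one-tailed, one-sample z-test used in these figures, the three factors combine in a simple formula: under the alternative, the test statistic is centered at d times the square root of n, so power is the area of that distribution beyond the critical value. The Python sketch below illustrates the calculation with assumed values (d = 0.5, n = 25, α = .05), which are not taken from the figures themselves.

```python
# Sketch of how alpha, effect size (d), and sample size (n) combine to give the
# power of a one-tailed, one-sample z-test.
from math import sqrt
from statistics import NormalDist

def ztest_power(d: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-tailed, one-sample z-test for effect size d and sample size n."""
    z_crit = NormalDist().inv_cdf(1 - alpha)           # critical value under H0
    return 1 - NormalDist().cdf(z_crit - d * sqrt(n))  # area of H1 beyond the critical value

# Assumed values: d = 0.5, n = 25, alpha = .05 give power of about .80,
# so beta (the Type II error rate) is about .20.
print(f"power = {ztest_power(0.5, 25):.2f}")
```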
Power is perhaps most relevant to a researcher before a study is conducted, when the resulting decision regarding the null hypothesis is not yet known. At this stage of the research process, the researcher wants to know how to achieve enough power to detect a true difference between groups if one exists. If power for the analysis is too low, then a Type II error (a false negative) is more likely to be committed, wasting money and participants' time when a real difference is not found. However, if the analysis is overpowered, then it is also likely that money and participants' time will be wasted by collecting more data than is necessary to make an accurate decision. In other words, an important question for a researcher is how to achieve enough power to optimize the chance of detecting a true effect. To understand how sufficient power is achieved, we must understand the relationships among four important features of a study that affect the power of an analysis. The features are the:
sample size (n).
Type I error rate (α).
effect size (d).
Type II error rate (β).
As an introduction to this concept, we will use Cohen's d as our measure of effect size, although other measures of effect size may be more appropriate for different research questions. The Determining Statistical Power video that follows this page demonstrates how the Type I error rate, sample size, and effect size each contribute to determining power.
In practice, researchers commonly use statistical power analyses to determine the appropriate sample size (n) to achieve the desired power for an analysis:
Given a probable effect size, what sample size will be needed?
For this question, researchers might propose a study that is a follow-up to a line of research where the methods are similar and prior results are available. For example, suppose a researcher wanted to replicate a classroom intervention for teaching fractions to third-grade students, but this time with a new population of students, perhaps second-language learners. For such a study, the effect size from the prior study is known, so the researcher would determine the sample size needed to achieve the desired power. If many more participants are needed than is practical, then the study would need to be redesigned or its merit reconsidered. As sample size increases, statistical power increases; yet it is important to sample only the number of participants necessary to achieve the desired level of power, to avoid wasting resources such as time, supplies, and money during the research process.
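A minimal sketch of this kind of power analysis, assuming a one-tailed, one-sample z-test: rearranging the power formula gives the required sample size n = ((z_alpha + z_power) / d)^2. The effect sizes below are assumed for illustration, not taken from an actual fractions study.

```python
# Required sample size for a one-tailed, one-sample z-test, given an effect size d,
# alpha, and a target power.
from math import ceil
from statistics import NormalDist

def required_n(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-tailed critical value
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) / d) ** 2)

print(required_n(0.5))   # a medium-sized assumed effect needs about 25 participants
print(required_n(0.2))   # a small assumed effect needs far more, about 155
```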
Consider the changes in power (e.g., the area of the H1 distribution that is to the right of the critical value) illustrated in the Sample Size and Power figure. Notice that with the means of the distributions fixed, as sample size increases, the spread of the distributions decreases, which leads to less overlap between the distributions. The decrease in the overlap is associated with a decrease in the Type II error rate (β) and a complementary increase in power (1 − β). Therefore, by increasing the sample size of an analysis, a researcher can also increase its power. In the same way, even when sample size is constant, samples with smaller standard deviations (e.g., less spread from the mean) can be expected to have more power than samples with larger standard deviations (e.g., more spread from the mean). Some researchers may be able to decrease variability with careful methodology and techniques that reduce measurement error.
Note. Three distribution pairs are represented by graphs. The first has a sample size of 5 with the two distributions wide but short. The second has a sample size of 15 with the distributions more narrow but higher than those with a sample size of 5. The third has a sample size of 25 with the two distributions the most narrow but the highest of the three sets. As sample size increases, the spread of the distribution decreases, which leads to less overlap between the tails of the distributions. The decrease in the overlap between the tails is associated with a decrease in the Type II error rate and a complementary increase in the power.
For each of the scenarios considered, alpha, the Type I error rate, is set (most commonly at .05) and denoted by the vertical line. Consider the effect of a change in the alpha level on power. In the Alpha and Power figure, compare the middle set of distributions with alpha = .05 to the other two sets of H0 and H1 distributions. Notice that when alpha = .01, the portion of the alternative distribution to the right of the critical value, which corresponds to power, is smaller. In other words, when the rejection region is smaller (1% instead of 5%), the likelihood of a Type II error is larger and the corresponding power is smaller. When alpha increases to .10, on the other hand, the portion of the alternative distribution to the right of the critical value is larger, indicating greater power.
Note. Three distribution pairs are represented by graphs. The first has an alpha of .01, or 1%. The second has an alpha of .05, or 5%, which has a larger area in the tail. The third has an alpha of .10, or 10%, which has the largest tail of the three sets. The height and spread of the distributions are not different between the three pairs. As the Type I error rate, α, increases, the critical value shifts to decrease the Type II error rate. These changes do not change the overlap of the null and alternative distributions, only the areas of the Type I error rate and Type II error rate.
Sample size (n), α, d, and β are interdependent in determining the design of a study that will achieve a sufficient level of statistical power. Notice that increases in sample size (n), Type I error rate (α), or effect size (Cohen's d) will lead to an increase in statistical power. The Factors That Influence Power table summarizes how these factors lead to an increase in power.
Note. A causal chain is listed for each of the primary factors that affect power, which are sample size, effect size, and alpha value. As sample size increases, variability decreases, the overlap between the null and alternative distributions decreases, and power increases. As effect size increases, the mean distance between the null and alternative distributions increases, the overlap between the null and alternative distributions decreases, and power increases. As alpha increases, the Type I error rate increases, the Type II error rate decreases, and power increases.
Statistical power is defined as the ability to detect a difference, or relationship, if indeed a difference exists. You can also think of statistical power as:
making a correct decision to reject the null hypothesis when it is false.
avoiding a Type II error when the null hypothesis is false.
We will review the concepts of Type I error and Type II error to understand statistical power.
To explore the concept of statistical power, consider the simple scenario of a one-sample test, such as whether MCAT test-takers who have completed a pre-med track perform better than the population of everyone who takes the MCAT. There are four possible outcomes when investigating if there is a difference between these two groups. Use the Statistical Decisions table to understand these four possible outcomes.
Note. Statistical decisions are illustrated as a two by two grid. In the top left quadrant, the null is true and retaining the null represents a correct decision, or 1 − alpha. In the top right quadrant, the null is false and retaining the null represents a Type II error, which is beta and often called a false negative. In the bottom left quadrant, the null is true and rejecting the null represents a Type I error, which is alpha and often called a false positive. In the bottom right quadrant, the null is false and rejecting the null represents a correct decision, or 1 − beta.
Suppose that we are comparing a sample of pre-med MCAT test-takers and a population of all MCAT test-takers, but we do not know if the sample is from the comparison population or if it is from a different population. These two situations are represented by the two columns in the Statistical Decisions table. We will walk through each one of them in turn.
The left column represents the possible decisions if the sample is from the reference population, which is to say that the null is true and the pre-med MCAT test-takers are no different from the total population of MCAT test-takers. There are two possible statistical decisions based on the sample data, located in the rows: retain the null or reject the null. In the box labeled 1, the decision is to retain the null, which is to decide the sample is from the reference population; this is a correct decision with the probability equal to 1 minus alpha. In the box labeled 3, the decision is to reject the null; this is an incorrect decision, known as a false positive or Type I error, which has a probability equal to alpha. In this example, a false positive would be to decide the pre-med MCAT test-takers are different from the MCAT population when, in reality (e.g., the "truth" in the population), they are not different.
Notice that in the left column in which the null is true, the probability of the correct decision is 1 − α (where α is typically .05) and the probability of the incorrect decision known as a Type I error is α; these two probabilities add to 1, or 100%, as they represent all possible outcomes when the null is true.
The right column represents the possible decisions if the sample is not from the reference population, which is to say that the null is false and pre-med MCAT test-takers perform differently from the population of MCAT test-takers. Again, there are two possible statistical decisions based on the sample data, located in the rows: retain the null or reject the null. In the box labeled 2, the decision is to retain the null, which is to decide the sample is from the reference population; this is an incorrect decision, known as a false negative or Type II error, which has a probability equal to beta. In this example, a false negative would be to decide the pre-med MCAT test-takers are not different from the MCAT population when, in reality (e.g., the "truth" in the population), they are different. In other words, the test "missed" this difference. In the box labeled 4, the decision is to reject the null, which is to decide the sample is not from the reference population; this is a correct decision with the probability equal to 1 minus beta.
In the right column in which the null is false, the probability of the incorrect decision, known as a Type II error, is β and the probability of the correct decision, known as power, is 1 − β; once again, these two probabilities add to 1, or 100%, as they represent all possible outcomes when the null is false.
Note that correct decisions and incorrect decisions are tied together. For example, the probability of making an error is directly tied to the probability of making a correct decision. We say these decisions are complementary because we can make one or the other but not both. The implication of this complementary relationship is that when the truth is that there is no difference between groups, and we set the probability of a Type I error to 5% (i.e., α), then we also set the probability of making a correct decision about a true null hypothesis to 95% (i.e., 1 − α). As with the relationship between Type I error and a correct decision not to reject the null, there is also a complementary relationship between Type II error (i.e., β) and power (i.e., 1 − β). To illustrate these points, consider the Type I Error, Type II Error, and Power figure.
Note. Two versions of the null and alternative distribution pair are shown. The left panel illustrates how the critical value divides the null hypothesis into the Type I error region on the right and the region for correctly retaining the null hypothesis on the left. The right panel illustrates how the critical value divides the alternative hypothesis into the Type II error region on the left and the region for correctly rejecting the null hypothesis on the right. The four possible decisions are illustrated in the decision table on the far right.
The false positive rate, α, is typically set at .05, which means that we accept that no more than 5% of the time we will decide something is different when it is not. We may also set the false negative rate, β, at .20, which is equivalent to a power of 80%, because in the social sciences we generally accept .80 as the minimum tolerable power (1 − β) for an analysis. But how do we determine if our study meets the desired level of power? The simple answer is that we design the study to achieve the desired level of power before the study is conducted. To design a study to meet a target level of power, we need to know two important values: n and d.
With values for α, n, and d, the value of β can be calculated, which allows us to easily calculate power (1 − β). Fortunately, a few of these values are known, or can be estimated, at the beginning of a study.
The power of an analysis is directly related to the degree of overlap between the Type I error region of the null distribution and the Type II error region of the alternative distribution. This relationship is illustrated in the Alpha, Power, and Type II Error figure. The boundary that marks the lower value for the Type I error region and the upper value for the Type II error region is the critical value. Although it is a common practice to set the Type I error rate at .05, in some instances it is appropriate to change this value to reduce the probability of making a false positive. For example, reducing the Type I error rate from .05 to .01 moves the critical value so that the area of the Type I error region shrinks and the area of the Type II error region grows. Because the Type II error rate and power are complementary, power decreases as the Type II error rate increases when the Type I error rate is lowered.
Note. Two sets of two distributions are shown, each with the same sample size for the null and alternative distributions. The set on the left has the baseline configuration used in all of the preceding figures. The set on the right uses the same distributions, but the critical value has shifted to the right, reducing the area associated with alpha in the null distribution and increasing the area associated with Type II error in the alternative distribution. The null distribution, H0, is marked with a vertical line, the critical z value, over the right tail. The alternative distribution, H1, is to the right of the null distribution. The critical z value for the null distribution divides the alternative distribution into two regions. The region to the left of the critical z value represents the Type II error and the region to the right of the critical z value represents statistical power. The two sets differ in the position of the critical value in the null distribution. As the critical z value moves to the right in the null distribution, alpha decreases, which corresponds to increasing Type II error and decreasing power.
Increasing alpha (e.g., using α = .10 instead of α = .05), on the other hand, would expand the rejection region and increase power. However, the size of α is typically limited to .05 to avoid an increased likelihood of a Type I error.
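A quick numeric illustration of this trade-off, using the same one-tailed z-test setup as before with assumed values of d = 0.5 and n = 25 (the specific power values depend on those assumptions, but the direction of the change does not):

```python
# Lowering alpha shrinks the rejection region and power; raising alpha does the opposite.
from math import sqrt
from statistics import NormalDist

power = lambda d, n, alpha: 1 - NormalDist().cdf(NormalDist().inv_cdf(1 - alpha) - d * sqrt(n))

for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f} -> power = {power(0.5, 25, alpha):.2f}")
# alpha = 0.01 -> power = 0.57; alpha = 0.05 -> 0.80; alpha = 0.10 -> 0.89
```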
In psychological research, β is generally set at .20 (20%) or lower. This means that power (1 − β) would be at .80 (80%) or higher.
This leaves two unknowns, sample size, or n, and effect size, or d. So, the question becomes what values of sample size and effect size are needed to achieve the desired power?
Of the two unknowns, only the sample size (n) is under the researcherâs direct control. As an illustration of how changes in sample size affect the variability of the distributions, review the next three figures. Notice that as sample size increases, the spread of the distributions decreases, but the means of the distributions do not change.
In the Power with n = 5 figure, notice that with a small sample size (n = 5), the spread is wide and power is low. For the null distribution, the Type I error, α, is .05. Notice that in this configuration, the Type II error, β, is larger than the power, 1 − β.
Note. The null distribution, H0, is marked with a vertical line, the critical z value, over the right tail. The alternative distribution, H1, is to the right of the null distribution. The critical z value for the null distribution divides the alternative distribution into two regions. The region to the left of the critical z value represents the Type II error and the region to the right of the critical z value represents statistical power. With n = 5, sample size is small, the spread is wide and the power is low.
Next, in the Power with n = 15 figure, the sample size has increased to 15 and the spread has decreased. The Type I error, α, for the null distribution is still .05. Again, compare the size of the Type II error, β, region to that for power, 1 − β. Notice that, along with the smaller spread, the power has increased compared to the Power with n = 5 figure.
Note. A null distribution, H0, is marked with a vertical line, the critical z value, over the right tail. An alternative distribution, H1, is to the right of the null distribution. The critical z value for the null distribution divides the alternative distribution into two regions. The region to the left of the critical z value represents the Type II error and the region to the right of the critical z value represents statistical power. With n = 15, spread is smaller than when n = 5 and power is larger.
Finally, in the Power with n = 25 figure, sample size has increased to 25 and the spread has decreased even further. The spread is now the smallest and power is the largest compared to the Power with n = 5 and Power with n = 15 figures. However, again, the Type I error, α, for the null distribution is still .05, which tells us that the change in power that we see is due to the change in overlap from the reduced spread rather than a change in the means of the distributions or a change in the Type I error rate.
Note. The null distribution, H0, is marked with a vertical line, the critical z value, over the right tail. The alternative distribution, H1, is to the right of the null distribution. The critical z value for the null distribution divides the alternative distribution into two regions. The region to the left of the critical z value represents the Type II error and the region to the right of the critical z value represents statistical power. As sample size increases, the spread of the sampling distributions decreases, causing less overlap between the null and alternative distributions. With less overlap, beta decreases and power increases. With n = 25, spread is small and power is large.
In summary, as sample size increases, the spread of the sampling distribution decreases, which leads to less overlap between the tails of the distributions. The decrease in the overlap between the tails is associated with a decrease in the Type II error rate and a complementary increase in the power.
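The same pattern can be shown numerically for the n = 5, 15, and 25 figures, assuming a one-tailed z-test with d = 0.5 and α = .05; the figures do not state their exact effect size, so treat the specific power values as illustrative.

```python
# Power of a one-tailed, one-sample z-test as sample size grows (assumed d = 0.5, alpha = .05).
from math import sqrt
from statistics import NormalDist

power = lambda d, n, alpha=0.05: 1 - NormalDist().cdf(NormalDist().inv_cdf(1 - alpha) - d * sqrt(n))

for n in (5, 15, 25):
    print(f"n = {n:2d} -> power = {power(0.5, n):.2f}")
# n =  5 -> power = 0.30; n = 15 -> 0.61; n = 25 -> 0.80
```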
While effect size is not under the direct control of researchers, prior studies similar to a researcher's study may have clues about the effect size (d) that is probable for the proposed study.
In the Power with d = 0.2 figure, the effect size is 0.2. Remember that Cohen's d expresses a difference in standard deviation units, which means that the distance between the null and alternative distributions is 0.2 standard deviations. The other thing to note about the figure is that the Type I error rate, α, is .05. Generally, a Cohen's d of 0.2 is considered to be a small effect, as represented in the figure. Compare the size of the Type II error region to that for power. Notice that for distributions this close together, there is a high chance of making a false negative (Type II) error.
Note. A null distribution, H0, is marked with a vertical line, the critical z value, over the right tail. An alternative distribution, H1, is to the right of the null distribution. The critical z value for the null distribution divides the alternative distribution into two regions. The region to the left of the critical z value represents the Type II error and the region to the right of the critical z value represents statistical power. When effect size is small, the distance between the distributions is small and the power is low. The effect size is d = 0.2, which is the distance between the two distributions in standard deviation units.
Next, in the Power with d = 0.5 figure, the effect size is 0.5, which is half of a standard deviation. An increase in effect size represents an increase in the distance between the distributions. Compare the mean and spread of both distributions to that from the previous figure and you will see that although the means have changed, the spread has not. Now, as you did for the previous figure, compare the relative size of the region for Type II error to that of power and notice that with an increased effect size, power has also increased. As before, the Type I error rate has not changed. Therefore, this change is due to a reduction in the overlap between the two distributions, because they are farther apart, rather than a change in the Type I error.
Note. A null distribution, H0, is marked with a vertical line, the critical z value, over the right tail. An alternative distribution, H1, is to the right of the null distribution. The critical z value for the null distribution divides the alternative distribution into two regions. The region to the left of the critical z value represents the Type II error and the region to the right of the critical z value represents statistical power. The effect size is d = 0.5, which is the distance between the two distributions in standard deviation units. An effect size of 0.5 is considered to be a medium effect. Cohen's d is in units of standard deviation, so an effect size of 0.5 means that the two distributions are separated by half of a standard deviation. Compare these curves to those illustrating an effect size of 0.2 and notice that power has increased.
Finally, in the Power with d = 0.8 figure, effect size has increased even further and the distance between the null and alternative distributions is now 0.8 standard deviations. Yet again, the distance between the two distributions has increased but the spread of the distributions has not changed compared to the Power with d = 0.2 and Power with d = 0.5 figures. Of the three figures, this one has the least overlap between the distributions, which means that the region for Type II error is now substantially smaller than the region for power.
Note. A null distribution, H0, is marked with a vertical line, the critical z value, over the right tail. An alternative distribution, H1, is to the right of the null distribution. The critical z value for the null distribution divides the alternative distribution into two regions. The region to the left of the critical z value represents the Type II error and the region to the right of the critical z value represents statistical power. The effect size is d = 0.8, which is the distance between the two distributions in standard deviation units. An effect size of 0.8 is considered to be a large effect. Cohen's d is in units of standard deviation, so an effect size of 0.8 means that the two distributions are separated by 0.8 of a standard deviation. Compare these curves to those illustrating an effect size of 0.2 and an effect size of 0.5. Notice that power has increased.
Overall, notice that as effect size increases, the distributions get farther apart, which means the overlap between their adjacent tails decreases. The decreased overlap leads to increased power. Therefore, changes in effect size have a direct effect on the power of an analysis, which is caused by changes in the overlap.
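As a rough numeric check, here are power values for d = 0.2, 0.5, and 0.8 under an assumed one-tailed z-test with n = 25 and α = .05; the exact numbers depend on those assumptions, but the pattern does not.

```python
# Power of a one-tailed, one-sample z-test as effect size grows (assumed n = 25, alpha = .05).
from math import sqrt
from statistics import NormalDist

power = lambda d, n=25, alpha=0.05: 1 - NormalDist().cdf(NormalDist().inv_cdf(1 - alpha) - d * sqrt(n))

for d in (0.2, 0.5, 0.8):
    print(f"d = {d:.1f} -> power = {power(d):.2f}")
# d = 0.2 -> power = 0.26; d = 0.5 -> 0.80; d = 0.8 -> 0.99
```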
With this basic framework outlined, researchers most often focus on one of the following questions when seeking to understand how to achieve the desired level of power, which is typically set at .80 or 80%:
Given a probable effect size, what sample size will be needed?
Given an available sample size, what effect size might be observed?
For the first question, the researcher knows the Type I error rate (α) and has an estimate of the effect size (d), allowing for the determination of the needed sample size (n). For the second question, the researcher has the Type I error rate (α) and the available sample size (n), allowing for the determination of the smallest effect size (d) that could be detected with the desired power.
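For the second question, the same z-test formula can be rearranged to give the smallest effect size detectable with the desired power, d = (z_alpha + z_power) / sqrt(n). The sketch below uses assumed sample sizes for illustration.

```python
# Smallest detectable effect size for a one-tailed, one-sample z-test at the given n,
# alpha, and target power.
from math import sqrt
from statistics import NormalDist

def detectable_d(n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) / sqrt(n)

for n in (25, 100, 400):
    print(f"n = {n:3d} -> smallest detectable d ~ {detectable_d(n):.2f}")
# n =  25 -> ~0.50; n = 100 -> ~0.25; n = 400 -> ~0.12
```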
These and the many other papers that have raised objections are too multifaceted and too complex to summarize completely here; instead, we briefly introduce some of the recurring themes.
Sample Sizes Are Smaller: Ioannidis's (2005) analyses showed that as sample size increases, the probability that a research report is false decreases. Although larger samples are more expensive, there is a benefit to the cost: fewer false positive reports, which can themselves be expensive in lost time spent following up misleading reports, and so on.
Effect Sizes Are Smaller: Ioannidis's (2005) analyses showed that as effect size increases, the probability that a research report is false decreases. Effect size is often not something the researcher can control, but it is wise to be more skeptical about smaller effects that reach statistical significance.
The Number of Statistical Tests Is Larger: It is well known that the probability of making at least one Type I error increases as the number of statistical tests conducted increases, just as the probability that a coin will land heads up at least once increases with the number of flips. When many tests are conducted in a scattershot way across a large number of variables, the risk of a false research report increases.
Flexibility Is Greater: When there is greater flexibility in the choice of research designs, the operational definitions of variables, and the choice of statistical test, there is more opportunity for bias to creep into the process, making a false research report more likely. In their quest for interesting, noteworthy findings, researchers may knowingly or unknowingly make decisions that increase the chance of a false report.
Incentives for Positive Findings Are Greater: Science is a human process, subject to human foibles. Career success in science often depends on the number of publications, media attention, and grant funding, all of which depend on successful research projects, where success has often depended on p being less than .05. In their quest for interesting, noteworthy findings, researchers may knowingly or unknowingly make decisions that increase the chance of a false report.
The Research Topic Is a Hot One: If more researchers are working on a topic, then there is an increased risk of false reports simply because more tests are being conducted. The more times you roll two dice, the greater the chance that at least one roll will total seven.
Other proposals are more complex, and none has yet gained universal acceptance in the psychological research community. Again, our treatment of this topic will not be exhaustive, but will instead briefly describe four of the more commonly proposed alternatives to NHST methods: Bayesian inference, estimation methods, meta-analysis, and modeling.
The foundations of NHST methods depend on a conception of probability that focuses on the long-run frequencies of the set of events that can occur when a process or experiment is repeated many, many times. This is sometimes called the frequentist conception of probability and statistics. Alternatively, in Bayesian inference, probability can be thought of as a degree of belief. Watch the Bayesian Inference video for an intuitive introduction to this method, after which we will explore the proposed alternatives to the frequentist-based NHST methods.
In the frequentist way of thinking, the value of a particular parameter, say a population mean, is fixed. We take samples to estimate that value. In the Bayesian way of thinking, the value of a parameter is represented as a probability distribution and we take samples to update and improve our estimate of that probability distribution.
As in the demonstration depicted in the Bayesian Inference video, we begin our understanding of a situation with a set of prior beliefs, a probability distribution referred to simply as the priors in Bayesian analysis. We have gathered no evidence, so anything is possible and our probability distribution might be a uniform distribution over all possible answers. But then, as we gather evidence, some answers become more likely, others less so. And as we gather still more evidence, our view of how probable each hypothesis is (in the video, each possible location) is updated accordingly. The updated beliefs are represented in a probability distribution called the posterior probability distribution. So existing beliefs (the priors) are updated by new data, resulting in the posterior set of beliefs.
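A minimal sketch of this updating process, using a conjugate Beta-Binomial model for a single proportion; this is a deliberately simple case, and the counts are hypothetical rather than taken from the video.

```python
# Bayesian updating for a proportion: a uniform Beta(1, 1) prior ("anything is possible")
# plus observed counts yields a Beta posterior concentrated where the data point.
successes, failures = 14, 6          # hypothetical new evidence (e.g., 14 of 20 "hits")

prior_a, prior_b = 1, 1              # uniform prior: no evidence yet
post_a, post_b = prior_a + successes, prior_b + failures   # conjugate update

posterior_mean = post_a / (post_a + post_b)
print(f"posterior Beta({post_a}, {post_b}), mean = {posterior_mean:.2f}")
# Gathering more data would update Beta(15, 7) again, making it the new prior.
```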
They crowd out scientific judgment: The general concern is that by using a rote procedure for making a binary decision (significant vs. nonsignificant), NHST methods discourage deeper, more reasoned consideration of the results of a study and what they mean.
They use confused logic: The general concern is that by focusing on the hypothetical situation in which the null hypothesis is assumed to be true and basing p values on that situation, people are apt to mistakenly interpret the p value as the general probability the null is true or false, instead of the probability of the test statistic under the assumption that the null hypothesis is true.
They test a hypothesis no one believes: It is highly unlikely that the null hypothesis is true, even if there is no systematic, causal reason for a departure from the null result. Random processes seldom produce exactly the result predicted by the null hypothesis. For example, if we flip a fair coin 1000 times, then the chance that the 1000 flips result in exactly 500 heads and 500 tails is small (see the short calculation after this list). With a large enough sample, it is possible to reject the null hypothesis for even a meaningless difference such as this.
They distort the scientific process: Because at some level, all a researcher needs to get a result published is to reach statistical significance, NHST processes discourage testing larger samples and taking other such steps to reduce the chance of a false positive result.
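As a side note on the coin example in the third point above, the exact probability is easy to compute; this short calculation is only an illustration of that claim.

```python
# Probability that a fair coin lands heads exactly 500 times in 1000 flips:
# even "no difference at all" is an unlikely outcome for random data.
from math import comb

p_exactly_500 = comb(1000, 500) / 2 ** 1000
print(f"P(exactly 500 heads in 1000 flips) = {p_exactly_500:.4f}")   # about 0.0252
```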
Use Bayesian inference: A different view of probability, identified with Thomas Bayes and known as Bayesian inference, involves starting with a distribution of probabilities (the prior probabilities) assigned to possible hypotheses, collecting new evidence, and using the new evidence to update the probability distribution (creating a posterior probability distribution). In contrast, NHST methods depend on a frequentist view of probability, which depends on thinking about a hypothetical distribution of outcomes that occur in a hypothetical large number of repetitions of a research process.
Use estimation methods: These procedures change the focus from testing whether the null hypothesis is true to trying to estimate the size of the effect in which the researcher is interested.
Use meta-analysis: These procedures broaden the base of data on which scientific judgments are made by collecting together all the existing estimates of a given effect size and averaging them together to create an overarching estimate of an effect size. Various statistical and graphical techniques (such as the forest plot, in which a series of individual effect sizes are plotted together with the overall average effect size) are used to combine and summarize the collected estimates.
Focus on modeling: A model is a mathematical expression of the relevant variables and their relationships that can then be tested against new data and against other models so that, as this process is repeated, increasingly accurate models of a phenomenon may be constructed.
Bayesian inference: Bayesian inference more naturally reflects the scientific process, in that scientists start with existing knowledge, reason toward a research question, acquire new data, and update their understanding. Proponents of Bayesian inference emphasize that it avoids the reasoning about a hypothetical large number of samples from a hypothetical null distribution to determine the probability of a test statistic, assuming that the null is true. Instead, Bayesian inference directly estimates the probabilities that the null and alternative hypotheses are true. Bayesian inference associates the prior and posterior probabilities to the full range of possible hypotheses, thus not focusing on one point (zero) along an entire range of possible hypotheses. Because Bayesian methods do not focus on a single possible hypothesis, but on the entire range of possible hypotheses, the focus can be more on determining the most probable hypothesis instead of trying to achieve statistical significance.
Estimation methods: By focusing on the size of the effect, estimation methods would take the emphasis away from the mechanical, binary decision of whether a result is statistically significant or not, and shift it to the size of the effect and what it means. The logic of estimation is much more direct than the logic of NHST. In estimating an effect size with estimation methods, we are simply trying to determine as accurately as possible how big the effect is. Whether or not the effect is different from zero (the null effect) is secondary to estimating the effect size. Estimation methods focus on the size of an effect, not on a single possible effect size (zero) like NHST methods. By focusing on finding the best possible estimate of a particular effect size, estimation methods put the incentives on accuracy and the quality of the estimate, rather than on trying to achieve a statistically significant result.
Meta-analysis: By focusing on the size of the effect, meta-analytic methods would take the emphasis away from the mechanical, binary decision of whether a result is statistically significant or not, and shift it to the size of the effect and what it means. The logic of estimation is much more direct than the logic of NHST. In estimating an effect size with meta-analysis, we are simply trying to determine as accurately as possible how big the effect is. Whether or not the effect is different from zero (the null effect) is secondary to estimating the effect size. Meta-analytic methods focus on the size of an effect, not on a single possible effect size (zero) like NHST methods. By focusing on finding the best possible estimate of a particular effect size, meta-analytic methods put the incentives on accuracy and the quality of the estimate, rather than on trying to achieve a statistically significant result.
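To make the idea concrete, here is a minimal sketch of one common form of meta-analytic averaging, a fixed-effect pooling of several studies' effect sizes with inverse-variance weights; the effect sizes and standard errors are invented for illustration.

```python
# Fixed-effect meta-analysis sketch: weight each study's effect size by the inverse of
# its variance, so more precise studies count for more in the pooled estimate.
studies = [  # (Cohen's d, standard error of d) -- hypothetical studies
    (0.42, 0.20),
    (0.55, 0.15),
    (0.30, 0.25),
    (0.61, 0.10),
]

weights = [1 / se ** 2 for _, se in studies]
pooled_d = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5
print(f"pooled d = {pooled_d:.2f}, 95% CI ~ [{pooled_d - 1.96 * pooled_se:.2f}, "
      f"{pooled_d + 1.96 * pooled_se:.2f}]")
```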
Modeling: By focusing on the specification of relevant variables and the relationships among them, methods focused on modeling would shift the emphasis away from the mechanical, binary decision of whether a result is statistically significant or not, and put it on the variables and the relationships among them. The logic of building and testing a model is seen as much more clear and less confusing than the logic of NHST. Researchers are simply focused on using existing knowledge to build a model that can make predictions about new data. Models are intended to focus on relevant variables and the relationships among them, rather than on a single difference or strength of relationship and whether those are equal to zero. In the process of developing and testing models, the focus is on building the best possible model, that is, the best possible understanding of the relevant variables and their relationships, rather than on achieving a single statistically significant result.
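As a toy illustration of the modeling mindset (not any specific method endorsed in this section), the sketch below specifies a simple straight-line model for made-up data and compares its fit to a mean-only model, shifting attention from "is the effect zero?" to "how well does the model describe the data?"

```python
# Fit a straight-line model by least squares and compare its residual error to a
# mean-only model.  All data are made up for illustration.
from statistics import mean

hours  = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 70, 74, 79]     # hypothetical test scores

xbar, ybar = mean(hours), mean(scores)
b = sum((x - xbar) * (y - ybar) for x, y in zip(hours, scores)) / sum((x - xbar) ** 2 for x in hours)
a = ybar - b * xbar                            # scores = a + b * hours

sse_line = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, scores))
sse_mean = sum((y - ybar) ** 2 for y in scores)
print(f"line: score = {a:.1f} + {b:.1f} * hours; "
      f"error drops from {sse_mean:.0f} to {sse_line:.0f}")
```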
Key Takeaways: Beyond Null Hypothesis Significance Testing
Statistical significance (often phrased as "p < .05") has a very narrow, specific meaning and lets us answer the question, "How likely is this result, assuming that the null hypothesis is true?" Because the interpretations of p values have a limited scope, it is important to consider additional statistical information in your analyses, such as effect size and confidence intervals.
Effect size is a measure of the magnitude of the effect, which is not dependent on sample size (unlike test statistics and p values). Effect size, along with information about the study and the variables, conveys information about the importance of a difference and can help a researcher determine if the difference, or relationship, is meaningful in a practical sense.
A common standardized measure of effect size is Cohen's d, which measures the distance between two means in standard deviation units.
Confidence intervals are used to overcome some of the difficulties of hypothesis testing by providing an interval estimate instead of only a point estimate.
The confidence interval is described in terms of a percentage, such as a 95% confidence interval or a 99% confidence interval, which is the confidence level. The confidence interval procedure, if repeated a large number of times, will generate an interval that includes the population parameter of interest x% of the time, x being the confidence level.
Remember that a 95% confidence interval will contain the population parameter only about 95% of the time. When you have a particular interval (after it is generated), it either contains the population parameter (p = 1) or it does not (p = 0). Yet we cannot be sure which is the case.
The effect size and confidence interval values are typically listed following the p value in an APA-style summary of the results of a null hypothesis significance test.
Statistical power is most relevant to a researcher before a study is conducted; a researcher often wants to know how to achieve enough power to detect a true difference between groups if one exists.
Calculating statistical power depends on the sample size (n), Type I error rate (α), and effect size (d).
As sample size increases (and the spread of the distributions decreases), power increases.
As Type I error increases (i.e., there is a larger rejection region), power increases.
As effect size increases (i.e., there is a larger mean difference between the groups), power increases.
Power can also be calculated as 1 − β.
Even from the time when psychologists and other scientists were only just beginning to use null hypothesis significance testing (NHST) methods consistently, other psychologists and scientists were objecting that these methods had numerous potential problems.
Among the potential problems were concerns that NHST methods
are a poor substitute for scientific judgment, turning a continuous dimension like probability (p) into a mechanical decision of significant vs. not significant;
use bad logic that is confusing and that leads to frequent misunderstandings and misinterpretations;
test a hypothesis that no one believes anyway, namely that there is literally zero difference between conditions or zero relationship between variables; and
distort the scientific process.
Among the ways that NHST methods distort the scientific process (the fourth bullet point above), they
allow smaller samples to have outsized influence by being more likely to produce false positive results;
similarly allow small true effects to produce misleading positive results;
allow scientists to use a scattershot approach, conducting a large number of statistical tests without sufficiently accounting for the fact that at least one false report becomes increasingly likely as the number of tests increases;
allow scientists to take advantage of flexibility in the definition of variables and choice of statistical methods in such a way that makes false reports more likely;
are susceptible to outside influences and biases so that false positive findings are more likely to occur; and
are even more susceptible to these outside influences and biases if the research topic is one that draws greater interest from the scientific community.
In response to concerns about NHST methods, a number of alternatives have been proposed, including
Bayesian inference, which depends on a view of probability in which probabilities represent how existing beliefs (priors) are updated as new information becomes available, resulting in new beliefs (posterior probabilities), instead of as the long-run relative frequencies of members of the set of possible events (termed the frequentist view);
estimation methods and estimation graphics, which use NHST methods but shift the focus from testing the null hypothesis to estimating the size of an effect;
meta-analysis methods, which summarize, often in the form of a weighted average, all the available studies of a given effect size, employing forest plots and other specialized methods to summarize the relevant studies; and
model building, which emphasizes the specification of the most relevant variables and the mathematical relationships between them to create an equation or equations that define those relationships.
Compare this situation to an honest referee flipping a fair coin. Before the flip, we would say the probability of the coin landing heads up is .5. After the flip, we know the probability that the coin landed heads up: It is 1.0 if the referee said "heads" or 0 if the referee said "tails." A confidence interval is the same. It either contains the population parameter, or it does not. The difference is that there is no referee to tell us which it is! In short, there is a big difference between, on the one hand, saying that 95% of the confidence intervals generated by a procedure repeated a large number of times will contain the population mean, and, on the other hand, saying that the probability that one particular interval holds the population mean is .95.
Beginning around the 1930s, one set of statistical methods developed by Sir Ronald A. Fisher merged over time with a different set of methods developed by Jerzy Neyman and Egon S. Pearson to eventually become what we now know as null hypothesis significance testing (NHST; see The Empire of Chance by Gigerenzer et al. for a detailed recounting of this history). Neither Fisher nor Neyman and Pearson approved of the way the methods were combined. As Gigerenzer et al. put it (p. 107), "[k]ey concepts from the Neyman-Pearson theory such as power are introduced along with Fisher's significance testing, without mentioning that both parties viewed these ideas as irreconcilable." Throughout this history, there has been vocal opposition to NHST procedures and their forerunners, both in psychology and in the broader scientific community. Papers with titles like the following give a flavor of the objections:
Boring (1919): Mathematical vs. Scientific Significance
Rozeboom (1960): The Fallacy of the Null-Hypothesis Significance Test
Cohen (1994): The Earth Is Round (p < .05)
Ioannidis (2005): Why Most Published Research Findings Are False
McShane et al. (2019): Abandon Statistical Significance
By dividing the world of possible research results into the realms "statistically significant" and "not statistically significant," we make an artificial and arbitrary distinction. Is there really much difference between a p of .049 and a p of .051? Among other things, the concept of statistical significance makes it possible to make a decision about a data set by looking at a single number, and substitutes that decision for more careful, considered scientific judgment. It also creates the illusion of objectivity when, in reality, any sample is a flawed representation of the greater population. As one early critic wrote, "given only approximate control of experimental conditions, only approximate results can be achieved." He concluded that a process involving such mechanical decision making "is one of many where statistical ability, divorced from a scientific intimacy with the fundamental observations, leads nowhere" (p. 334).
Many students find the logic of NHST methods rather strange and confusing. Numerous studies have shown that even well-trained scientists and statisticians get confused about what the probability p represents and how to interpret it (e.g., Haller & Krauss, 2002). Many people think that a small p value, of say .005, means that the probability that the null hypothesis is true is also .005, but this is not correct. Many also believe that the probability that a replication of the study would also find a statistically significant result is 1 − .005; this is not true either. The p value only tells you the probability of finding a result as extreme as (or more extreme than) the result you obtained if we first assume that the null hypothesis is true. If the p value is small (less than .05), then we infer that the null hypothesis is false because this result would be relatively improbable if the null hypothesis were true.
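To make that definition concrete, here is a small sketch of a one-tailed, one-sample z-test p value; all of the numbers (null mean, standard deviation, sample size, observed mean) are hypothetical.

```python
# A p value is computed from the null distribution: the probability of a test statistic
# at least as extreme as the one observed, assuming H0 is true.
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 100.0, 15.0, 36      # assumed null mean, known SD, sample size
sample_mean = 105.5                  # hypothetical observed mean

z = (sample_mean - mu0) / (sigma / sqrt(n))
p_value = 1 - NormalDist().cdf(z)    # P(Z >= z | H0 true)
print(f"z = {z:.2f}, one-tailed p = {p_value:.4f}")
# Note: this is NOT the probability that H0 is true; it conditions on H0 being true.
```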
The null hypothesis is that there is exactly zero difference, or exactly zero relationship. How likely is it that this is exactly true? One researcher described an analysis of data from a sample of about 57,000 people in which 105 chi-square tests of association between variables were conducted. None of the comparisons was motivated by a research question; the tests simply used the variables available in the data set. All of the tests were statistically significant, 101 of them with p < .000001. If we know before we begin a study that the null hypothesis is false, why do we bother to test it?
An influential paper by epidemiologist John Ioannidis (2005) used a probability model of the accuracy of reported research findings to argue that more than half of published research findings are false, despite the widespread use of a relatively small Type I error rate (α level = .05). Although critics have challenged some aspects of Ioannidis's model, they have also shared his concern about the influences of various research practices and biases on the general accuracy of research reports. More generally, Ioannidis's concerns have converged with many of the concerns prompted by the well-publicized replication crisis in psychology and other sciences like medicine and nutrition. Ioannidis's list of practices and biases is summarized in the following list:
Along with the numerous objections to NHST methods have come many proposals for alternatives. Some are as straightforward as the encouragement for researchers to rely more on replications of their own and others' studies as a way of assessing the reliability of a phenomenon (Roediger, 2012). We do not need statistics to tell us that implicit memory effects or false memory effects are real; they have been replicated hundreds of times.
Researchers who favor the Bayesian way of thinking believe that it better captures the way knowledge develops in science. The frequentist way of thinking better captures situations like rolling dice and spinning roulette wheels, which are nothing like science. To take just one example, the Bayesian analogue of a confidence interval is usually called a credible interval, and it does mean that there is a 95% chance that the value of interest (e.g., population mean, effect size, etc.) falls within that interval (noting that the interval depends on the prior probabilities). Recall the correct interpretation of a confidence interval: It is the long-run probability that intervals calculated according to the confidence interval procedure will include the population mean if we draw a very large number of samples. That understanding of the confidence interval procedure does not allow the inference that there is a 95% chance that the value of interest falls within any one particular interval.
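A brief sketch of the contrast, reusing the hypothetical Beta posterior from the earlier Bayesian sketch: the central 95% of the posterior distribution is a credible interval, and it carries the direct probability interpretation described above, given the prior and the data. The sketch approximates the interval by Monte Carlo sampling with Python's standard library.

```python
# 95% credible interval for a proportion from a Beta posterior (uniform prior plus
# hypothetical data of 14 successes in 20 trials), approximated by Monte Carlo.
import random

random.seed(1)
post_a, post_b = 1 + 14, 1 + 6                      # Beta posterior from the earlier sketch
draws = sorted(random.betavariate(post_a, post_b) for _ in range(100_000))

lo, hi = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]
print(f"95% credible interval for the proportion: [{lo:.2f}, {hi:.2f}]")
```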