Statistics in Psychological Research
Variables are measured on one of four types of scales, or levels of measurement, which include nominal, ordinal, interval, and ratio scales. A mnemonic for remembering the scales in order is NOIR (as in film noir). The levels of measurement are useful in helping us remember that measurements can have different properties. The properties of each scale allow researchers to choose statistical techniques that are appropriate to those properties. As we move through the levels, each measurement has all the properties of the preceding level(s), plus an additional property. So, an interval-level measurement has all the properties of nominal- and ordinal-level measurements, in addition to another property, in this case equal spacing between the increments.
An important feature of any data distribution is its central tendency, a property that represents a typical score in the distribution, or the point in the middle or center of the data set. Three forms of central tendency, the mean, the median, and the mode, each convey different information about a data set, but they each indicate one trend in the data, namely a central point around which the data tend to cluster. Among other factors, the type of variable we measure (i.e., the scale of measurement) will determine which measure of central tendency is most appropriate to use for a given purpose.
Another important feature of any data distribution is its variability, which represents how the scores in a distribution are spread out over the possible values of the variable. Three measures of variability, the range, variance, and standard deviation, each give different information about a data set, but all describe one aspect of the data, their dispersion. The range may be used for ordinal-, interval-, or ratio-level data, but the latter two measures may only be used with interval- or ratio-level data.
Sometimes there are two scores that tie, or nearly tie, for the most frequently occurring score. Such a distribution is said to be bimodal. If a distribution of scores has three or more modes, it is called multimodal.
Often the mode for a distribution of scores is near the center of the distribution, but this is not always the case. A set of scores might have the most frequent score near the high end or the low end of the scale. Imagine that you give a sample of 12-year-olds a reading test meant for 8-year-olds. The mode for that set of scores likely would be at the high end.
A negative skew occurs when a sample of research participants score very high on a measure and are at the top of the possible scores. Conversely, a positive skew occurs when a sample of research participants score very low on a measure and are close to the lowest point of the possible scores. Now, imagine that you give a sample of 12-year-olds a reading test meant for 15-year-olds. The mode for that set of scores likely would be at the low end. If the data are skewed, then the mode will not be in the center of the distribution.
One of the most straightforward ways of measuring variability is called the range. The range is often described by the actual low and high values; it is also the difference between the highest score and the lowest score. If, over 10 years, the highest daily high temperature on a given date in October was 82° and the lowest daily high temperature was 48°, then the temperatures ranged from 48 to 82, for a range of 34.
The variance is the mean-squared deviation score. To calculate variance, start by calculating the deviation scores and then squaring all the deviation scores. A deviation score is the difference between a score and the mean of all scores. Squaring the deviation scores, like taking the absolute value, makes all the scores positive. Then calculate the mean of these squared deviation scores, which is the population variance. However, if we have a sample of values that does not include all members of the population, then the sum of the squared deviation scores is divided by one less than the number of scores, which is the sample variance.
The standard deviation is the square root of the variance. Using the standard deviation brings the score back to the original unsquared units. The standard deviation is quite commonly reported. However, standard deviation can only be used with interval- or ratio-level data; you must be able to calculate a meaningful mean to calculate the standard deviation for a set of scores.
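To make these steps concrete, here is a minimal sketch in Python that computes the population variance, the sample variance, and the standard deviation for a small set of scores; the numbers are hypothetical and chosen only for illustration.

```python
import statistics

scores = [4, 8, 6, 5, 3, 7]            # hypothetical sample of scores
mean = statistics.mean(scores)

# Population variance: the mean of the squared deviation scores (divide by N).
pop_var = sum((x - mean) ** 2 for x in scores) / len(scores)

# Sample variance: divide the sum of squared deviations by N - 1 instead.
samp_var = sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)

print(pop_var, statistics.pvariance(scores))   # the two values should match
print(samp_var, statistics.variance(scores))   # likewise for the sample version

# The standard deviation is the square root of the variance, which returns
# the measure to the original (unsquared) units of the variable.
print(statistics.stdev(scores))                # sample standard deviation
```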
Is there a relationship between scores on a college aptitude test, such as the SAT, and college GPA? Questions such as this are investigated frequently in the behavioral sciences. In this case, we are interested in the correlation, or relationship, between these two variables. Does knowing the value of one variable for an observation tell us something about the value of the other variable for that same observation? In statistical terminology, a correlation refers to a particular quantitative measurement of the direction and strength of the relationship between those variables.
When measuring a correlation quantitatively, a correlation coefficient is the statistic of interest. The most common of these is the Pearson correlation coefficient, or more completely, the Pearson product-moment correlation coefficient. A Pearson correlation coefficient, which is denoted by the letter r and ranges from -1 to +1, quantifies the linear relationship between two variables and conveys information about both the direction and the strength of the correlation.
The direction of the correlation is conveyed by the sign of the correlation coefficient and can be positive, negative, or zero. When there is a positive correlation, the two variables change in the same direction; if one increases, so does the other, and if one decreases, so does the other. For example, sales of cold beverages are higher on summer days with higher temperatures and lower on summer days with lower temperatures. When there is a negative correlation, the variables change in opposite ways; as one variable increases, the other decreases. This is also known as an inverse correlation. For example, sales of hot beverages, such as coffee or tea, are lower on summer days with higher temperatures and higher on summer days with lower temperatures. When there is zero correlation, the changes in the values of one variable are not related to changes in the value of the other variable.
A good way to understand the direction of a correlation coefficient is to plot the data with a scatterplot, which displays the individual data points in a sample for two interval- or ratio-level variables, one plotted on the x-axis, the other on the y-axis. Scatterplots are used to verify that there is a linear relationship in a correlational analysis and can help us visualize the relationship between the variables. In the Three Types of Correlations figure, notice how the points in the positive correlation increase from the lower left side to the upper right side. The opposite is true for a negative correlation; observe how the data points decrease from the upper left side to the lower right side. In a zero correlation, the points form a horizontal line.
Note. The first scatterplot shows a positive correlation between temperature on the x-axis and cold beverage sales on the y-axis, with the data points moving in a straight diagonal line from the lower-left corner to the upper-right corner of the graph. The second scatterplot shows a negative correlation between temperature on the x-axis and hot beverage sales on the y-axis, with the data points moving in a straight diagonal line from the upper-left corner to the lower-right corner of the graph. The last scatterplot shows a zero correlation between temperature on the x-axis and brussels sprouts sales on the y-axis, with the data points moving in a straight horizontal line across the middle of the graph.
The strength of the correlation coefficient, also referred to as the size or magnitude of the correlation, is conveyed by the absolute value of the statistic r, with an upper bound of 1 and a lower bound of 0. A value of either positive or negative 1 indicates a perfect relationship. Values of r closer to -1 or +1 indicate stronger relationships, while values closer to 0 indicate weaker relationships. Despite the difference in direction, a correlation of +.6 is just as strong as a correlation of -.6.
The strength of a correlation can be observed in a scatterplot by looking at how closely the individual data points form a straight line. The Three Types of Correlations figure presents exact correlations of +1, -1, and 0. Notice that for the positive and negative correlations, the points form a perfect diagonal line. For each one-unit change in temperature there is a corresponding change in beverage sales. For the zero correlation, there is no correspondence at all between temperature and beverage sales. Examples of more typical correlations are shown in the Strong and Weak Correlations figure. These correlations are weaker than the idealized versions presented in the previous figure, although trends are still readily evident. Compare the scatterplots of strong and weak correlations to each other and to the plots of perfect correlations.
Describing Bivariate Data
Bivariate graphs plot one variable against another, usually with one on the x-axis, the other on the y-axis. One type of bivariate plot is the line graph, which displays data for interval- or ratio-level dependent variables (on the y-axis), as a function of interval- or ratio-level independent variables (on the x-axis), as shown in the TV Watching and Age example.
Note. The variable on the x-axis is decades of life, ranging from 2 to 9, and the variable on the y-axis is average hours of TV watching per day. The trend for both sexes is generally positive. Data derived from the General Social Survey (Smith, et al., 2018).
A second type of bivariate plot is the scatterplot, which displays the individual data points in a sample for two interval- or ratio-level variables, one plotted on the x-axis, the other on the y-axis, as illustrated in the GDP and Life Expectancy figure.
Note. A scatterplot with log per capita gross domestic product on the x-axis, ranging from about 6 to 12, and healthy life expectancy on the y-axis, ranging from about 45 to 80 years. The scattered data has a positive slope. Data derived from Helliwell et al. (2021).
The Pearson correlation coefficient, or r, is a standardized measure bounded at -1 and +1. It provides a measure of the linear relationship between two variables that can be interpreted based on direction and strength. Imagine researchers are interested in the relationship between average weekly minutes of exercise and number of illnesses per year. We may anticipate a negative correlation, such that as minutes of weekly exercise increase, the number of illnesses per year decreases. If r = -.72 for exercise and illness, how would you interpret this value? The sign indicates that the direction of the relationship is negative and the variables are moving in opposite directions. We can also consider the strength of the relationship. The closer the value moves to r = ±1, the stronger the relationship, and the closer the value moves to 0, the weaker the relationship. Based on r = -.72, there is a strong, negative relationship between exercise and illness.
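As a rough illustration of how such a coefficient might be computed, the sketch below uses SciPy's pearsonr on a small, invented data set of weekly exercise minutes and yearly illness counts; the values are hypothetical and are not meant to reproduce r = -.72 exactly.

```python
from scipy import stats

exercise_minutes = [30, 60, 90, 120, 150, 180, 210, 240]   # hypothetical data
illnesses_per_year = [6, 5, 5, 4, 3, 3, 2, 1]               # hypothetical data

r, p_value = stats.pearsonr(exercise_minutes, illnesses_per_year)
print(f"r = {r:.2f}")   # a strong negative value: more exercise, fewer illnesses
```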
Note. The points begin in the lower-left quadrant and move along a diagonal path toward the upper-right quadrant. Examine how tightly the points cluster around a diagonal line. Based on data from the World Happiness Report (Helliwell et al., 2021).
Compare your conclusions to those listed here.
Direction: There is a positive relationship, such that as GDP increases, life expectancy also increases.
Strength: The correlation coefficient is .79, which is close to 1, and the data points closely approximate a diagonal line, which both indicate a strong relationship between the variables.
It is important to consider a common misstep when interpreting a correlation coefficient. We cannot conclude that the relationship is causal. While it is possible that more GDP leads to a longer life expectancy, other scenarios are equally possible. For instance, perhaps individuals who live longer are contributing more to their country's GDP. Or, another variable might lead to both increased GDP and longer life expectancy. Because the variables are only measured in a correlation and not manipulated, we cannot make a causal claim based on a correlation.
Variables can be classified as categorical (emotions), discrete (number of pets), or continuous (commute time in minutes).
Notice that discrete variables have no intermediate values (e.g., we cannot have 1.5 dogs), while continuous variables can have additional observations between any two values (e.g., commute time can be measured as 5.5 minutes).
Variables can also be classified as qualitative (also known as categorical) or quantitative (can be discrete or continuous).
Qualitative variables answer the question of "what kind?" (e.g., what kind of emotion, for the variable emotion).
Quantitative variables answer the questions of "how much?" or "how many?" (e.g., how many siblings, for the variable siblings).
A variableās level of measurement can be classified as nominal (type of food), ordinal (size of pizza: small, medium, large, etc.), interval (temperature in degrees Fahrenheit), or ratio (number of questions missed on a quiz).
While interval-level and ratio-level measurements both have equally spaced increments, only ratio-level measurements include a meaningful 0. For the ratio measurement number of questions missed on a quiz, a 0 value would indicate no questions missed. A 0 value for the interval measurement temperature in degrees Fahrenheit, however, does not indicate an absence of temperature.
Variables can also be described based on their use in a research study.
In an experimental design, an independent variable, such as the type of study technique, is manipulated and a dependent variable, such as quiz score, is measured. The independent variable can be thought of as the grouping variable and the dependent variable can be thought of as the outcome variable.
In some research studies, there is no manipulated variable, and instead researchers want to analyze how one measured variable, known as a predictor variable, predicts another measured variable, known as an outcome variable. Notice that the terms dependent variable and outcome variable both refer to a measured outcome variable.
The three measures of central tendency, the mean, median, and mode, each provide a way to express information about the middle of a distribution.
The mean is often referred to as the "average" and is the most common form of central tendency. However, when data are skewed or outliers are present, the mean is pulled in the direction of the tail of the distribution or an outlier, because it relies on the size of each value, and will not represent the middle value in a data set. The mean can only be used with interval- or ratio-level variables.
The median is the midpoint of a set of ordered scores. Because it is not pulled toward outliers or the tail of skewed distributions, it is a good choice to describe the central tendency of data that are positively skewed (e.g., tail toward the right side of distribution) or negatively skewed (e.g., tail toward left side of distribution). The median can be used with ordinal-, interval-, or ratio-level variables.
The mode is the most frequently occurring score in a set of data, but is used least often. An advantage of the mode is that it can be used for data with any level of measurement, and is the only measure of central tendency that can be used to summarize nominal-level measurements (e.g., you can describe the most frequent eye color in a group, but you could not calculate the mean or median eye color). In a distribution of scores, the mode will be seen as the peak of the distribution.
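The small Python sketch below, using a hypothetical set of scores that includes an outlier, shows how the three measures can differ: the mean is pulled toward the outlier while the median and mode are not.

```python
import statistics

scores = [2, 3, 3, 4, 5, 5, 5, 6, 7, 30]   # hypothetical scores; 30 is an outlier

print(statistics.mean(scores))     # 7.0, pulled toward the outlier
print(statistics.median(scores))   # 5.0, the midpoint of the ordered scores
print(statistics.mode(scores))     # 5, the most frequently occurring score
```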
Variability, or dispersion, is used to express the spread in a set of scores, and is commonly expressed through the range, variance, and standard deviation.
The range is a measure of the difference between the minimum and maximum values in a data set and is a quick way to judge the overall spread of the data. It is heavily influenced by outliers, however, so it is less useful when outliers are present.
Variance is the average squared deviation from the mean for a set of values and provides a way to consider spread from the mean of a set of data. Interpretation of the variance is difficult, however, because it is in squared units.
Standard deviation indicates how narrowly or broadly the sample scores deviate from the mean. It is used more commonly than variance because the value of standard deviation is in the units of the variable, which makes it much easier to interpret.
Although numeric values, such as a measure of central tendency or variability, are useful in describing the data, it is often helpful to graph data so that patterns can be visualized.
A frequency distribution represents every observation in the data set and provides a comprehensive picture of the sample results for a variable. Bar graphs, histograms, and frequency polygons are examples of frequency distributions where a nominal-, ordinal-, interval-, or ratio-level variable is plotted on the x-axis and the frequency of that variable in a data set is plotted on the y-axis.
A bar graph is a frequency distribution for nominal- and ordinal-level variables. The values of the variable of interest are arranged on the x-axis (the horizontal line) and the frequency of occurrence of that value in the set of data on the y-axis (the vertical line). With nominal-level variables, the order of the categories on the x-axis is arbitrary. With ordinal-level variables, the order of the values on the x-axis follows the ranking of the values of the variable, but because the values are not separated by equally spaced intervals, the width and spacing of the bars are arbitrary. Because of the arbitrary order of the values on the x-axis of a bar graph, the particular shape of the distribution is not meaningful.
A histogram is a frequency distribution for interval- or ratio-level variables. For a histogram, the shape of the distribution is meaningful because the values on the x-axis are not arbitrary; they are ordered and separated by equally spaced intervals. If the variable is continuous (or if there is a large number of discrete values), then each bar might represent a range of values, or bin, rather than a single value. Histogram bars typically touch one another unless there are no observations in a given bin.
A frequency polygon is a frequency distribution of an interval- or ratio-level variable. It is very similar to a histogram but is constructed simply by marking a point at the center of each bin that corresponds to the frequency for that bin, and then drawing a line to connect the points. The frequency polygon may provide an even clearer sense of the shape of the frequency distribution, and also has the advantage that multiple distributions can be plotted on a single set of axes.
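As a rough sketch of how a histogram and a frequency polygon relate, the following example plots both for the same simulated interval-level variable using matplotlib; the data and the bin count are arbitrary choices made here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scores = rng.normal(loc=75, scale=10, size=200)   # simulated quiz scores

# Histogram: bars span equal-width bins, with counts on the y-axis.
counts, bin_edges, _ = plt.hist(scores, bins=10, alpha=0.5, label="histogram")

# Frequency polygon: mark each bin's midpoint at its count and connect the points.
midpoints = (bin_edges[:-1] + bin_edges[1:]) / 2
plt.plot(midpoints, counts, marker="o", label="frequency polygon")

plt.xlabel("Score")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```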
Unlike a frequency distribution, which plots variables on only the x-axis, bivariate graphs plot variables on both the x- and y-axis. These include scatterplots and line graphs.
A scatterplot, where individual data points represent two variables, one plotted on the x-axis, the other on the y-axis, is commonly used with two interval- or ratio-level variables. The distributions of two variables are plotted together so that the relationship, if any, between the two variables can be easily visualized.
A line graph is a data representation that displays data for interval- or ratio-level dependent variables (on the y-axis) as a function of interval- or ratio-level independent variables (on the x-axis). A line graph may also be used to plot summary data (on the y-axis) for groups (on the x-axis) to show change over time, make predictions with a line of best fit, or allow the reader to search for trends in the data.
A correlation refers to a particular quantitative measurement (e.g., correlation coefficient) of the direction and strength of the relationship between two or more measured variables.
A Pearson correlation coefficient, which is denoted by the letter r and ranges from -1 to +1, quantifies the linear relationship between two variables in terms of the direction and strength.
The direction of the correlation is conveyed by the sign of the correlation coefficient and can be a positive correlation (variables change in the same direction), a negative correlation (variables change in opposite directions), or a zero correlation (values of one variable are not related to changes in the value of the other variable).
The strength of the correlation coefficient is conveyed by the absolute value of the statistic r, with the upper bound of 1 and the lower bound of 0. Values of r closer to -1 or +1 indicate stronger relationships; values closer to 0 indicate weaker relationships.
A correlation only allows us to conclude that two variables covary to some degree. Correlation does not indicate causation.
When interpreting data from a sample, do we need more than descriptive statistics? Why not just look at the data, compute a mean and a standard deviation, make a graph, and draw conclusions? The short answer is that descriptive statistics only tell us about the sample, but if we wish to make inferences about the population from which the sample was collected, then we need inferential statistics. Inferential statistics, which are based on probability and sampling theory, are necessary because using samples to make inferences about populations adds uncertainty, known as sampling error, to our conclusions. In this section, we will learn about the fundamentals of inferential statistics.
How do we use samples to understand the population from which they were collected? As an example, imagine that you have been asked to determine whether students at your college study the same average number of hours as students at all other colleges in the United States. In other words, do the students at your college represent the typical student in the United States, or are they different?
We use inferential statistics for a sample to draw conclusions about a population. The challenge in drawing conclusions involves the problem of sampling error. Sampling error occurs when sample statistics depart from the corresponding population parameters by some amount due to measurement errors and random variability, which is why probability is important. As such, the mean and standard deviation of the sample is unlikely to be identical to the mean and standard deviation of the population. However, inferential statistical techniques allow us to make inferences about the population based on probability and the sample.
Sometimes we are interested in a single probability, for example, the probability of rain on a weekend afternoon, or the probability that a new car will require an expensive repair before its warranty lapses. Other times we are interested in a range of probabilities, in which case we refer to a probability distribution, which is simply a representation of a set of probabilities. For example, we may wish to know the probability that the score for our next exam will be 80% or greater. Notice that our concern is not about the probability of scoring exactly 80% on the exam but rather about scoring from 80% to 100%.
For continuous variables, probabilities are determined by the area under a curve of the probability distribution defined by a mathematical function. The mathematical function that is used depends on the statistical test being used. (Don't worry. We will cover the concept and leave the calculus to the theoreticians.) Such distributions are theoretical distributions, because they are based on mathematical functions, rather than empirical frequency distributions that are based on numerical samples.
There are several common probability distributions, and we could define any number of others. Some of the common distributions encountered in statistics are shown in the Continuous Probability Distributions figure. In each case, the x-axis lists the possible range of values and the y-axis lists the resulting value from the relevant mathematical function, which varies depending on the statistical test. But, what we are more interested in is the area under the curve for a range of x-values.
Note. From left to right, an example of a rectangular distribution, one member of the family of t distributions, three members of the family of F distributions, and three members of the family of χ2 (i.e., chi-square) distributions. The specific shapes of the distributions are less important than the idea that we can identify a region under any of these curves and describe what proportion that region is of the whole area under the curve as a way of estimating the probability of observing any of the x-axis values within the specified range.
In the first panel, the rectangular, or uniform, distribution shows that every value is equally likely over some range. In this example, any randomly selected x-value between 5 and 10 is equally likely (i.e., it yields the same y-value), but to calculate the probability of a range of x-values, such as from 7.5 to 10, we must determine the area under the curve that corresponds to the desired range. The total area under a probability distribution is 100%, so for our example, the range from 7.5 to 10 is 50% of the area, corresponding to a probability of .50.
In the second panel, a t distribution, which is only one example from a family of distributions, illustrates how some distributions are not uniform. For example, the probability of a value between -1 and 1, bracketing the mean of the distribution, is far more likely (i.e., has a greater area) than the probability of a value that is less than -1 or that is greater than 1, which are the tails of the curve.
The third panel and fourth panel show examples from families of curves for F distributions and χ2 (i.e., chi-square) distributions, which illustrate how the shape of the distribution may vary depending on the statistical test as well as the characteristics of the study being conducted.
The key point is that the specific shape of the distribution is less important than the idea that we can determine the proportion of the whole area under the curve for any range of values. This area represents the probability of observing any of the x-axis values within the specific range being investigated. Look again at the uniform distribution and identify the area under the curve between 6 and 7. As intuition suggests, the probability of randomly selecting a value from this distribution and observing a value that is between 6 and 7 is .2, which is 20% of the total area under the curve. Of course the math is not as easy for determining the probability under some distributions, but there are tools that help us determine the area, or probabilities, under any curve.
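The uniform-distribution example above can be checked directly with a short sketch: using SciPy's uniform distribution over the range 5 to 10, the area under the curve between 6 and 7 comes out to .2.

```python
from scipy import stats

dist = stats.uniform(loc=5, scale=5)   # uniform distribution over 5 to 10
p = dist.cdf(7) - dist.cdf(6)          # area under the curve between 6 and 7
print(p)                               # 0.2, i.e., 20% of the total area
```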
Inferential statistics are tools for making inferences, based on probability and sampling theory, about a population using a sample of that population. In contrast, descriptive statistics, such as measures of central tendency and variability, are used to describe a sample but not to infer how the sample relates to the population from which it was drawn.
Probability is defined as the expected relative frequency that a random process will yield a particular event. Probability is reported as a value between 0 and 1, inclusive. A probability of exactly 0 means that an event will never occur, and a probability of exactly 1 means that an event will certainly occur. So an event with a probability of .3 is less likely than one with a probability of .4. Probability can also be reported as a percentage, by multiplying the probability by 100, or as a fraction by listing the expected outcome in the numerator and the total possible outcomes in the denominator. For example, a probability of .4 is equal to a percentage of 40% and a ratio of four out of 10 or 4/10ths.
It is not uncommon to refer to the probability of discrete events, such as a flip of a coin to land heads up (50%) or a roll of two dice to result in a count of seven (about 16.7%). These values are calculated by first determining the number of outcome events that fit the target description and dividing that value by the total number of possible outcome events. For example, flipping heads when the only other possible outcome is tails is one out of two or 50%. In this same way, as illustrated by the Discrete Probability Distribution figure, there are six possible ways to roll two dice that result in a count of seven, out of a total of 36 possible outcomes, which is six out of 36 or about 16.7%.
Note. The pattern of probabilities is symmetric around the total of seven.

Total   Probability   Outcomes (first die-second die)
2       1/36          1-1
3       2/36          1-2, 2-1
4       3/36          1-3, 2-2, 3-1
5       4/36          1-4, 2-3, 3-2, 4-1
6       5/36          1-5, 2-4, 3-3, 4-2, 5-1
7       6/36          1-6, 2-5, 3-4, 4-3, 5-2, 6-1
8       5/36          2-6, 3-5, 4-4, 5-3, 6-2
9       4/36          3-6, 4-5, 5-4, 6-3
10      3/36          4-6, 5-5, 6-4
11      2/36          5-6, 6-5
12      1/36          6-6
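These dice probabilities can also be reproduced with a few lines of code that enumerate all 36 equally likely outcomes and count how many produce each total, as in the sketch below.

```python
from collections import Counter

# Enumerate all 36 equally likely outcomes of rolling two dice.
totals = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))

for total in range(2, 13):
    print(f"P(total = {total:2d}) = {totals[total]}/36 = {totals[total] / 36:.3f}")
```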
Flipping coins and rolling dice are examples where we are interested in discrete events, but what if we are interested in the probability of a range of outcomes that is not easily defined by discrete events? For example, we may wish to know the probability that the inches of annual rainfall in the state of Florida will be greater than the state's historic average. Notice that our concern is not about the probability of a single value but rather about a range of values (any amount above the average).
Unlike discrete probability distributions, continuous probability distributions are defined by a mathematical function and probabilities are determined by the area under a curve defined by the mathematical function. The specific shape of the distribution is less important than the idea that the area represents the probability of observing any of the x-axis values within the specific range being investigated. For statistics, important probability distributions include the family of distributions for t-tests, the family of distributions for ANOVA tests, distributions for the chi-square (χ2) test, and many more, including one of the most important probability distributions, the normal distribution.
The normal distribution, sometimes called a "bell curve" (i.e., it has a strong central peak and tapers off symmetrically away from the peak, but note that there are other bell-shaped distributions that are not examples of a normal distribution), is derived from a mathematical formula, so it is a theoretical distribution (in contrast to an empirical frequency distribution). Though we are seldom able to measure an entire population, natural variables like height, weight, IQ, and many others are assumed to be approximately normally distributed. The fact that the normal distribution can be observed in so many situations means that it can be especially useful in hypothesis testing, which is a critically important topic in statistics.
As shown in the Common Features of Normal Distributions figure, all normal distributions share two features.
Normal distributions are symmetrical around a center point equal to the mean, median, and mode of the distribution.
Normal distributions have identical proportions of area in corresponding regions on either side of the curve, with fixed percentages for those areas. The areas under all regions of the curve always total 1, or 100% probability.
Note. A normal distribution illustrates that they are symmetrical around the center point with an equal mean, median, and mode of the distribution. A normal distribution illustrates that the regions under the curve are proportional. When divided in units of standard deviation, the regions within +1 or -1 standard deviation of the mean are each about 0.341 of the total. Each area between 1 and 2 (and between -1 and -2) standard deviations accounts for about 0.136 of the total. Each area between 2 and 3 (and between -2 and -3) standard deviations accounts for about 0.021 of the total. Each area between 3 and 4 (and between -3 and -4) standard deviations accounts for about 0.001 of the total. The area above 4 or below -4 standard deviations accounts for the balance so that the total area under the curve equals 1.
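A short sketch using SciPy's standard normal distribution reproduces the proportions described in the figure note directly from the cumulative distribution function.

```python
from scipy import stats

z = stats.norm(loc=0, scale=1)   # standard normal distribution

print(z.cdf(1) - z.cdf(0))    # ~0.341: mean to +1 standard deviation
print(z.cdf(2) - z.cdf(1))    # ~0.136: +1 to +2 standard deviations
print(z.cdf(3) - z.cdf(2))    # ~0.021: +2 to +3 standard deviations
print(z.cdf(1) - z.cdf(-1))   # ~0.682: within 1 standard deviation of the mean
print(z.cdf(2) - z.cdf(-2))   # ~0.954: within 2 standard deviations of the mean
```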
Statistical tests are generally divided into two categories based on whether the parameters of the population distribution are normally distributed. If the parameters are normally distributed, then a parametric test offers more statistical power. If the parameters are not normally distributed, then a nonparametric test is appropriate. A nonparametric test is also appropriate when data are measured at the ordinal or categorical level or when the data includes outliers that are not easily removed.
Sampling methods are commonplace in science because it is usually impossible to measure every unit in a population. But sampling raises a critical question: Is our sample representative of the population, or not? Random variability and measurement error might lead to sample statistics that are very different from the population parameters. We cannot know the population values, so how do we decide if our samples are representative?
Even when there is nothing more than random variability in a sample, it is likely that small trends will appear, that is, that there will be some deviation from perfect randomness. How do we know when to conclude that an apparent trend is real, or when to conclude that it is the result of randomness? For example, as shown in the Roll of a Die figure, we know that if we roll a fair die many times (X), then we can expect an approximately equal number of rolls of each number on the die (i.e., X ÷ 6), about 150 in this case. But suppose we are testing dice for a game-making company: At what point would we decide that a die is defective? In 90 rolls, dice A and B yield frequency distributions close to the one we would expect by chance. Die C is a little more discrepant, with 22 sixes and 18 threes but only 13 ones and 14 fours. Die D is more discrepant still. Is die D defective? Is die C? How about dice A or B?
Die     Rolls    1      2      3      4      5      6      Total Discrepancy
Fair    X        X÷6    X÷6    X÷6    X÷6    X÷6    X÷6    0
A       90       16     15     15     14     15     15     2
B       90       14     16     16     14     14     16     6
C       90       13     16     18     14     17     22     16
D       90       10     25     18     14     17     16     22
Note. Summary of die-rolling example described in the text.
The hypothetical fair die in this situation is an example of a null result, which is an outcome in which there are no differences between conditions, no relationships between variables, or, more generally, no departures from the result that would be expected based on the presence of only random variation in the data. In the die-rolling example, the null result is that when rolled many times, the die will land face up on each possible number an equal number of times.
Null hypothesis significance testing (NHST) procedures are designed to address the uncertainty that arises from sampling. Knowing that, due to randomness, a sample will almost always show at least some small departure from a purely null result, we need to decide at what point a result departs so much from the null result that we conclude the data-generating process (like the process of rolling a die 90 times in the previous example, or whatever else it might be) was not purely random. In the die-rolling example, we would ask, at what point does the frequency distribution depart so much from our expectation that we decide the die is not fair?
Ask, "what do we expect if there is only random variation with respect to the research question?" That is, what are the null and alternative hypotheses? Null hypothesis: The correlation between the measures is zero. Alternative hypothesis: The correlation is nonzero.
What is the probability distribution if the null hypothesis is true, and what proportion of that distribution (α) will be considered too unlikely to have been sampled at random? The relevant probability distribution is called a t distribution, which for a large sample is roughly equivalent to the normal distribution. Values from the far 2.5% of both the left and right tails of the distribution will be considered too unlikely to have been sampled at random.
What is the probability (p) of the observed result? The study by Noftle and Robins (2007) suggests that even with a very large sample of more than 10,000 students, p > .05.
If p is less than α, then we reject the null hypothesis in favor of the alternative; if not, then we retain the null hypothesis. Because p is greater than α, we retain the null hypothesis. In other words, the observed result is relatively likely to be randomly sampled from the null distribution; it is fairly likely that we would observe this value if the null hypothesis is true. In such cases, we decide to retain (or fail to reject) the null hypothesis.
Consider one final example, this one hypothetical, but representative of a very common research design. The imaginary research question asks, is a new drug effective for improving memory? Imagine a 1-month, double-blind, placebo-controlled trial of 100 people, with 50 each randomly assigned to take either the placebo or the drug each day. At the end of the month, all the participants read a chapter in a textbook and then take a 50-question, 4-alternative, multiple-choice test over that material.
Ask, "what do we expect if there is only random variation with respect to the research question?" That is, what are the null and alternative hypotheses? Null hypothesis: There will be only random variation in the data if there is no difference between the placebo and the drug, and we would expect there to be no difference between the groups in performance on the multiple-choice test. Alternative hypothesis: The drug has some effect on performance, so there will be some difference between the test performance of the two groups.
What is the probability distribution if the null hypothesis is true, and what proportion of that distribution (α) will be considered too unlikely to have been sampled at random? We need a probability distribution that describes the distribution of differences between means when the null hypothesis is true, that is, when both groups were sampled from the same population. A family of distributions called t distributions meets this need. For any sample, we can calculate the observed value of t and the corresponding probability of observing a t that far or farther from the mean of the distribution.
What is the probability (p) of the observed result? Suppose that p is less than .001.
If p is less than α, then we reject the null hypothesis in favor of the alternative; if not, then we retain the null hypothesis. In this case, p is less than α, so we conclude that the null hypothesis is false and that the drug had an influence on learning the textbook chapter. In other words, the observed t value is not one that we would be likely to draw from the null distribution at random, so we will conclude that it came from some other (alternative) distribution.
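A hypothetical sketch of how such an analysis might look in Python appears below; the group means, standard deviation, and random seed are invented, and SciPy's independent-samples t-test stands in for whichever specific procedure a real analysis would use.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
placebo = rng.normal(loc=30, scale=5, size=50)   # simulated scores, placebo group
drug = rng.normal(loc=34, scale=5, size=50)      # simulated scores, drug group

t_stat, p_value = stats.ttest_ind(drug, placebo)
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the group means differ.")
else:
    print("Retain the null hypothesis.")
```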
The Statistical Decisions figure summarizes again the possible correct decisions and statistical errors that can occur with null hypothesis significance testing procedures. It is important to remember that the possibilities depend on whether the null is true or false. Of course, this is not known, but if the null is true, then the only two possible outcomes are to correctly retain the null, or to commit a Type I error. If the null is false, then the only two possible outcomes are to correctly reject the null, or to commit a Type II error. There is no situation in which both a Type I and a Type II error are possible.
Sometimes people have difficulty remembering which error is which. It may help to think that the J in rejected comes first in the alphabet, before the T in retained. So you can make a Type I error by incorrectly reJecting the null, and a Type II error by incorrectly retaining it.
Note. The table shows the probabilities of making the two correct decisions (i.e., retaining the null when it is true, with probability one minus alpha, and rejecting the null when it is false, with probability one minus beta) and of making the two incorrect decisions (i.e., retaining the null when it is false, with probability beta, and rejecting the null when it is true, with probability alpha). The statistical decision you make and the value of the ātruthā in the population intersect to create correct decisions or incorrect decisions. For incorrect decisions, a Type I error is a false positive in which the null hypothesis is rejected when it is true. A Type II error is a false negative in which the null hypothesis is retained when it is actually false.
Review of Null Hypothesis Significance Testing
Recall that null hypothesis significance testing (NHST) procedures employ a sequence of four steps:
Ask, "what do we expect if there is only random variation with respect to the research question?" That is, what are the null and alternative hypotheses? We begin by asking what we expect the results to look like if there is only random variation in the data set (that is the null result) and what we might expect if there is some systematic variation in the data (alternative hypothesis).
What is the probability distribution if the null hypothesis is true, and what proportion of that distribution (α) will be considered too unlikely to have been sampled at random? To frame our decision, we identify a relevant test statistic, the null distribution of that statistic, and the part of that distribution that contains results we will decide are so unlikely that they probably did not occur by chance alone. The proportion of the distribution that contains those unlikely results represents the α value, which is usually .05.
What is the probability (p) of the observed result? Using the null distribution, we calculate the test statistic and identify the associated probability of randomly selecting that value of the test statistic (or a value more extreme than that) from the null distribution; that probability is our p value.
If p is less than α, then we reject the null hypothesis in favor of the alternative; if not, then we retain the null hypothesis. Finally, we simply decide if our result meets our definition of unlikely, which is that p is less than α. If so, we conclude that the results are so unlikely that they probably were not randomly selected from the null distribution, that they were instead selected from some alternative distribution. If p is not less than α, then we retain the hypothesis that our result was sampled from the null distribution.
              Dementia    No Dementia    Total
Surgery       215         821            1036
No Surgery    638         1364           2002
Total         853         2185           3038
Is this pattern of results due merely to random variation, or might it indicate a real trend?
Ask, "what do we expect if there is only random variation with respect to the research question?" That is, what are the null and alternative hypotheses? Null hypothesis: The chance of developing dementia does not depend on surgery status. Alternative hypothesis: The chance of developing dementia does depend on surgery status.
What is the probability distribution if the null hypothesis is true, and what proportion of that distribution (α) will be considered too unlikely to have been sampled at random? A statistic called χ2 (chi-square) can be used to assess the differences between the frequencies observed in a contingency table such as the one shown in this example and the frequencies that would be expected if there are no dependencies. The greater those differences, the larger the observed value of χ2 will be. There is a family of χ2 probability distributions; the 5% at the upper end of the distribution is regarded as the region that would be too unlikely to occur by random sampling alone.
What is the probability (p) of the observed result? The p value is about .000000000103, or about 1 in 10 billion.
If p is less than α, then we reject the null hypothesis in favor of the alternative; if not, then we retain the null hypothesis. One ten-billionth is less than .05, so we would reject the null hypothesis in favor of the hypothesis that this is a real trend in the data.
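For readers who want to check the arithmetic, the sketch below runs a chi-square test of independence on the contingency table above using SciPy; turning off the default continuity correction is an assumption made here so the statistic matches the uncorrected textbook-style calculation, and the p value comes out on the order of one in ten billion.

```python
from scipy import stats

observed = [[215, 821],    # surgery: dementia, no dementia
            [638, 1364]]   # no surgery: dementia, no dementia

# correction=False skips the Yates continuity correction (an assumption made
# here so the result matches the uncorrected calculation described in the text).
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.2e}")
```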
Of course, it is possible to make incorrect statistical decisions with the null hypothesis significance testing process. The null hypothesis could be true and, simply by chance, we might still sample a test statistic in the region of the null distribution that we have decided is unlikely (but not impossible) to occur by chance. This would be an example of a Type I error, where the null hypothesis is true but our sample leads us to reject it (i.e., a false positive). Similarly, we can make an error when the null hypothesis is false. It might be that there is a real trend to be detected, but just by random sampling, we obtain a result that is not sufficiently rare that we would conclude that it was unlikely to have come from the null distribution. In such a case, we would commit a Type II error, failing to reject the null hypothesis when it is false (i.e., a false negative).
The appropriate null hypothesis significance test can be selected for any research question by answering questions to find the appropriate path through a decision tree, beginning with the question "What is the type of analysis?" The first question is the most broad and asks whether the purpose of the study design is to analyze differences among means or relationships between variables. The answer to each question guides you to the next question as you move closer to selecting a hypothesis test.
Note. This decision tree can be used to select an appropriate statistical test based on the design criteria for a study. The question "What is the type of analysis?" is the first to answer, following the arrows to each new question until a statistical test is reached. Step down the rows of the column titled Study Criteria until the design criteria are met, then step to the column titled Statistical Test to identify an appropriate statistical test.
If the study involves comparisons of means, then subsequent decisions may concern the number of sample means, the number of independent variables, and whether the sample means are related or not. These decision points are illustrated in the Selecting a Statistical Test for Variable Means figure. Depending on these decisions, the path will lead to one of the t-tests or ANOVAs.
A t-test is a null hypothesis significance test used when the population standard deviation is unknown and tests whether one mean differs from a known value, requiring a one-sample t-test, or two means differ from each other, requiring either a paired-samples t-test or an independent-samples t-test. For t-tests, the dependent variable (the measured or outcome variable) must be measured at the interval or ratio level; if there are two means, then the independent variable (the manipulated or predictor variable) is measured at the nominal or ordinal level. Because the population standard deviation is unknown, we estimate it from the sample standard deviation. In such cases, the normal distribution is not an accurate representation of the sampling distribution of the mean. Instead, t-distributions are the correct probability distributions for these inferential tests. There is a family of t-distributions, with a specific distribution for every possible sample size of n = 2 and larger. (Note that if the population standard deviation is known, which is not common, then a different test, the z-test, is the appropriate choice.)
An analysis of variance (ANOVA) is a null hypothesis significance test that requires one dependent variable measured at the interval or ratio level, similar to t-tests, but there may be one or more independent variables defined at the nominal level. In addition, each independent variable may have more than two levels, or conditions, so that differences among three or more means may be tested. When an ANOVA has one independent variable, it is referred to as a one-way ANOVA. Similarly, when there are two independent variables, it can be referred to as a two-way ANOVA, or more generally as a factorial ANOVA. As with t-tests, the levels of the independent variables may be related (i.e., a within-groups design) or unrelated (i.e., a between-groups design). It is also possible to have an ANOVA with one or more within-groups independent variables and one or more between-groups independent variables, which we refer to as a mixed-groups design.
Factorial designs are useful when one is interested in multiple factors and how those factors interact. When using factorial designs, the main effects of the individual independent variables are considered separately from any interactions between two or more of the independent variables. The interaction would test if the effect of either independent variable changed depending on the level of the other independent variable.
A correlational analysis may be used to test whether a correlation coefficient, r, is significantly different from zero. A correlation coefficient is a numeric value that describes both the strength and direction of the relationship between two variables. Pearson's correlation coefficient is a common correlation statistic that measures the relationship between two scale variables (e.g., interval or ratio level), and it ranges between -1 and +1. If two variables are strongly correlated, with r closer to either -1 or +1, then knowing the value of either variable tells us something about the value of the other. If the two variables are weakly correlated, with r closer to 0, then knowing the value of either variable tells us little about the value of the other variable.
A regression analysis is used to investigate how one or more predictor variables can be used in an equation that yields the value of an outcome variable. Predictor and outcome variables play roles similar to independent and dependent variables, respectively. An important difference between predictor and outcome variables on one hand, and independent and dependent variables on the other, is that an independent variable is under the control of the researcher and is manipulated in an experimental context, but a predictor variable is measured in observational research, along with the outcome variable. A bivariate linear regression is a common form of regression analysis in which the linear relationship between a single interval- or ratio-level predictor variable is used to predict the value of an interval- or ratio-level outcome variable. Multiple regression procedures with more than one predictor variable and nonlinear regression procedures that estimate more complex functions also exist, as do regression procedures for nominal- and ordinal-level variables, but bivariate linear regression is the most basic procedure.
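The following sketch illustrates a bivariate linear regression with SciPy's linregress; the predictor (study hours) and outcome (quiz scores) values are hypothetical and chosen only to show how the slope, intercept, and a prediction are obtained.

```python
from scipy import stats

study_hours = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical predictor values
quiz_scores = [52, 55, 61, 60, 68, 71, 75, 80]  # hypothetical outcome values

result = stats.linregress(study_hours, quiz_scores)
print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}")
print(f"r = {result.rvalue:.2f}, p = {result.pvalue:.4f}")

# Predicted outcome for a new value of the predictor:
predicted = result.intercept + result.slope * 5.5
print(f"predicted score for 5.5 hours of study: {predicted:.1f}")
```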
Depending on the characteristics of the study, a parametric test may or may not be the appropriate test. If the measured variable (i.e., dependent variable or outcome variable) is measured at either the ratio or interval level (i.e., it is a scale variable) and the assumption of normality is met, then the selected parametric test is appropriate. However, if the measured variable is assessed at a nominal or ordinal level, or if the assumption of normality is not met, then a nonparametric test should be considered. The Nonparametric Alternatives video offers a list of common nonparametric tests that may be appropriate. For each parametric test, a nonparametric alternative is listed, although this list is not exhaustive and some tests may be appropriate for multiple designs.
Note. A table of nonparametric test alternatives. The parametric and nonparametric alternatives include the one-sample t-test and the nonparametric one-sample Wilcoxon signed-rank test; the paired-samples t-test and the nonparametric Wilcoxon signed-rank test; the independent-samples t-test and the nonparametric Mann-Whitney U test; the one-way within-groups ANOVA and the nonparametric Friedman test; the one-way between-groups ANOVA and the nonparametric Kruskal-Wallis test; the factorial within-groups ANOVA and the nonparametric aligned rank transform (ART) ANOVA; the factorial between-groups ANOVA and the nonparametric aligned rank transform ANOVA; the factorial mixed ANOVA and the aligned rank transform ANOVA; the Pearson correlation and the nonparametric Spearman's rank correlation; the bivariate linear regression and the nonparametric binary logistic regression; the multivariate linear regression and the nonparametric multinomial logistic regression.
Tests of Relationships in Frequency of Occurrence
If the study compares frequencies of nominal variables, then decisions may concern the number of variables in the study. These decision points are illustrated in the Selecting a Statistical Test for Variable Frequencies figure. Depending on these decisions, the path will lead to one of the chi-square analyses. Use this decision tree to select a statistical test for a study that compares frequencies of variables.
Note. This decision tree can be used to select an appropriate statistical test based on the design criteria for a study when the study compares frequencies of variables. The question "What is the type of analysis?" is the first to answer, following the arrows to each new question until a statistical test is reached. Step down the rows of the column titled Study Criteria until the design criteria are met, then step to the column titled Statistical Test to identify an appropriate statistical test.
A chi-square goodness-of-fit test is a nonparametric null hypothesis significance test to determine if the frequency distribution of nominal-level data matches that of a set of expected frequencies. By contrast, a chi-square test of independence is a nonparametric null hypothesis significance test to determine if the categories of two nominal-level variables are related. Notice that each type of chi-square test includes one or more measured variables at the nominal level.
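As a sketch of the goodness-of-fit idea, the example below tests whether a set of hypothetical die-roll counts departs from the equal frequencies a fair die should produce; the counts are invented, and the equal expected frequencies follow from the null hypothesis of a fair die.

```python
from scipy import stats

observed = [10, 22, 17, 12, 15, 14]    # hypothetical face counts from 90 rolls
expected = [sum(observed) / 6] * 6     # a fair die: equal expected counts

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
# A p value below the alpha level (e.g., .05) would suggest the die is not fair.
```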
Inferential statistics are used to make inferences about the population from which a sample was collected, which are based on probability and sampling theory. Using samples to make inferences about populations adds uncertainty, known as sampling error, to a statistical conclusion.
The mean and standard deviation of the sample are unlikely to be identical to the mean and standard deviation of the population. Sampling error occurs when sample statistics depart from the corresponding population parameters by some amount due to chance, which is why probability is important. Inferential statistical techniques allow us to make inferences about the population based on probability and the sample.
Probability theory tells us what it means to draw a random sample and lets us estimate how likely it is that we would observe a particular value of a variable or statistic in our sample. A probability of exactly 0 means that an event will never occur, and a probability of exactly 1 means that an event will certainly occur.
A single probability, for example, might be the probability of landing on your feet after a backflip or the probability that you will catch COVID after attending a music festival. But other times, we are interested in a range of probabilities, in which case we refer to a probability distribution, which is simply a representation of a set of probabilities.
For continuous variables, probabilities are determined by the area under a curve of the probability distribution defined by a mathematical function. The mathematical function that is used depends on the statistical test being used, such as the functions listed in the Continuous Probability Distributions figure. Such distributions are theoretical distributions, because they are based on mathematical functions, rather than empirical frequency distributions, which are based on numerical samples. For probability distributions, the x-axis lists the possible range of values and the y-axis lists the resulting value from the relevant mathematical function, which varies depending on the statistical test. But what we are more interested in is the area under the curve for a range of x-values.
The key point is that the specific shape of the distribution is less important than the idea that we can determine the proportion of the whole area under the curve for any range of values. This area represents the probability of observing any of the x-axis values within the specific range being investigated.
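As a brief sketch of this idea, the area under a continuous probability curve between two x-values can be found from the distribution's cumulative distribution function. The example below assumes a standard normal distribution (mean 0, standard deviation 1) purely for illustration; other tests use other mathematical functions.

```python
from scipy.stats import norm

# Probability of observing a value between x = -1 and x = +1
# on a standard normal distribution (mean 0, standard deviation 1)
lower, upper = -1, 1
area = norm.cdf(upper) - norm.cdf(lower)

print(f"P({lower} < x < {upper}) = {area:.3f}")  # about 0.682 of the total area
```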
The normal distribution is a special type of theoretical distribution that is useful for making predictions. The normal distribution can be observed in many situations, which makes it useful in hypothesis testing.
Normal distributions are symmetrical around a center point equal to the mean, median, and mode of the distribution. Because they are symmetrical, each half mirrors the other, which means that corresponding regions on either side of the center always cover identical proportions of the whole. The total area under the curve equals a proportion of 1.
The region between the mean and 1 standard deviation above the mean has an area that makes up about 34.1% of the total area under the distribution. The same proportion of area lies under the curve between the mean and 1 standard deviation below the mean.
All normal distributions will have the same proportion of the area between 1 standard deviation above and below the mean (about 68.2%), 2 standard deviations above and below the mean (about 95.4%), and 3 standard deviations above and below the mean (about 99.7%). A distribution that is not symmetrical and that does not have this pattern of proportions is not a normal distribution. The Common Features of Normal Distributions figure illustrates these important features.
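A short sketch like the one below reproduces these proportions for any normal distribution; the mean of 100 and standard deviation of 15 are arbitrary values chosen for illustration.

```python
from scipy.stats import norm

mean, sd = 100, 15  # arbitrary values; the proportions are the same for any normal curve

for k in (1, 2, 3):
    area = (norm.cdf(mean + k * sd, loc=mean, scale=sd)
            - norm.cdf(mean - k * sd, loc=mean, scale=sd))
    print(f"Area within {k} SD of the mean: {area:.3f}")
# Prints roughly 0.68, 0.95, and 0.997
```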
If the observed sample data approximate a normal distribution, then the normal distribution serves as a good model of the population, and any data that approximate a normal distribution can be analyzed in this way.
The normal curve is an important tool for inferential statistics because it forms the basis of many statistical tests. We refer to these tests as parametric tests because they rely on assumptions about the population distribution from which the sample was taken, such as the assumption that it is normal. When this assumption is not met, a different category of statistical tests, nonparametric tests, can be used.
Null hypothesis significance testing (NHST) compares sample data to a null result (i.e., a result in which there is no difference or no relationship).
The first step of NHST is to set up the null hypothesis (H0). This hypothesis assumes there is only random variability in the sample and that there is no difference between groups or relationship between variables.
The alternative hypothesis (with notation H1), on the other hand, indicates that the data do reflect some difference between groups or relationship between variables (i.e., there is more than random variation in the data).
The second step of NHST is to specify which parts of the null probability distribution are unlikely (i.e., have a low probability). The probability associated with the unlikely areas in the null distribution is called α (alpha level), and is usually set at or near .05.
The third step of NHST is to calculate the sample statistic and the associated probability, p, of encountering that statistic (or one more extreme) at random from the null distribution. This probability is known as the p value.
The fourth step of NHST is making a statistical decision to reject the null hypothesis or fail to reject the null hypothesis. If p < α (i.e., p < .05, when α = .05), then the decision is to reject the null hypothesis, which indicates that the result is statistically significant and provides support for the alternative hypothesis. If p > α, then the decision is to fail to reject the null hypothesis, and the result is not statistically significant.
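The following sketch walks through these four steps for a one-sample t-test. The sample scores and the comparison value of 100 are made up for illustration.

```python
from scipy import stats

# Step 1: H0 assumes the population mean equals 100 (hypothetical comparison value)
sample = [104, 98, 110, 101, 95, 108, 112, 99, 103, 107]

# Step 2: choose the alpha level before looking at the data
alpha = 0.05

# Step 3: compute the sample statistic and its p value under the null distribution
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

# Step 4: make the statistical decision
if p_value < alpha:
    decision = "reject the null hypothesis (statistically significant)"
else:
    decision = "fail to reject the null hypothesis (not statistically significant)"

print(f"t = {t_stat:.2f}, p = {p_value:.4f}; decision: {decision}")
```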
It is possible that the decisions in null hypothesis significance testing are not correct. These statistical errors are known as Type I and Type II errors.
A Type I error is a false positive (e.g., a false alarm), in which the statistical decision is to reject the null hypothesis, but the null is actually true (i.e., there is no actual difference). We limit the likelihood of a Type I error by setting the alpha level before a study begins to a small portion of the distribution, such as .05. The probability of a Type I error is equal to the alpha level.
A Type II error is a false negative, in which the statistical decision is to retain the null hypothesis, but the null is actually false (i.e., there is an actual difference). The likelihood of a Type II error is equal to β (beta).
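One way to see why the Type I error rate equals the alpha level is a small simulation: when the null hypothesis is actually true, roughly alpha (here .05) of repeated tests will still reject it by chance. The simulation settings below (sample sizes, means, number of experiments) are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Two samples drawn from the SAME population, so the null hypothesis is true
    a = rng.normal(loc=50, scale=10, size=30)
    b = rng.normal(loc=50, scale=10, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1  # rejecting a true null is a Type I error

print(f"Observed Type I error rate: {false_positives / n_experiments:.3f}")  # close to 0.05
```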
An important step for selecting a significance test includes assessing whether the purpose of the study is to analyze differences among means, relationships between variables, or the distribution of frequencies over the values of one or more variables.
Significance tests that analyze differences among means include t-tests and analysis of variance (ANOVA) tests.
Significance tests that analyze relationships between variables include correlation analyses and regression analyses.
Significance tests that analyze the distribution frequencies of variables include chi-square analyses.
If the study involves analyzing differences among means, then whether a t-test, an ANOVA, or a different test is appropriate will depend on the number of sample means, the number of independent variables, whether the samples are related or not, and whether the statistical assumptions for the test are met.
A t-test is used when the population standard deviation is unknown. It tests whether one mean differs from a known value (a one-sample t-test) or whether two means differ from each other (a paired-samples t-test when the conditions are related, or an independent-samples t-test when the conditions are unrelated).
An ANOVA requires one dependent variable measured at the interval or ratio level and one or more independent variables defined at the nominal level. An ANOVA with one independent variable is referred to as a one-way ANOVA; with more than one independent variable, it is a factorial ANOVA. When the samples for the conditions are related, the ANOVA is referred to as a within-groups design; when they are unrelated, it is a between-groups design.
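A minimal sketch of a one-way between-groups ANOVA appears below, with made-up scores for three unrelated groups. SciPy's f_oneway handles only this between-groups case; within-groups and factorial designs would need other tools.

```python
from scipy import stats

# Made-up scores for three unrelated groups of one nominal independent variable
group_a = [23, 25, 21, 27, 24]
group_b = [30, 28, 33, 31, 29]
group_c = [26, 24, 27, 25, 28]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"One-way between-groups ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
```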
t-Tests and ANOVAs are types of parametric tests and only two of many types of significance test. With each, there are statistical assumptions that should be met if the test is to yield reliable results. A common assumption for parametric tests is the assumption of normality. If this assumption is not met, then one option is to use a nonparametric test alternative.
If the study involves analyzing relationships between variables, then whether a correlation analysis or a regression analysis is appropriate will depend on whether the study is designed to test if a predictor variable, x, predicts an outcome variable, y, and on the number of predictor variables in the study.
A correlation analysis may be used to test whether a correlation coefficient, r, is significantly different from zero. Pearson's correlation coefficient is a common correlation statistic that measures the relationship between two variables measured at the interval or ratio level, and it ranges between −1 and +1. When the variables are highly correlated, r is close to −1 or +1; when they are weakly correlated, r is close to 0.
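A short sketch of computing and testing Pearson's r follows; the paired interval-level measurements (study hours and exam scores) are made up for illustration.

```python
from scipy import stats

# Made-up paired measurements of two interval-level variables
hours_studied = [2, 4, 5, 7, 8, 10, 12]
exam_score    = [55, 60, 62, 70, 74, 80, 85]

r, p_value = stats.pearsonr(hours_studied, exam_score)
print(f"r = {r:.2f}, p = {p_value:.4f}")  # r near +1 indicates a strong positive relationship
```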
A regression analysis is used to investigate how one or more predictor variables can be used in an equation that yields the value of an outcome variable. A bivariate linear regression is used to investigate the linear relationship between a single interval- or ratio-level predictor variable and an interval- or ratio-level outcome variable. A multiple regression is used to investigate the linear relationship between two or more interval- or ratio-level predictor variables and an interval- or ratio-level outcome variable. Nonparametric regression procedures also exist for nominal- and ordinal-level variables.
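As a sketch of a bivariate linear regression on the same kind of made-up data, SciPy's linregress returns the slope and intercept of the prediction equation along with a significance test of the slope.

```python
from scipy import stats

# Made-up predictor (x) and outcome (y) values, both interval level
x = [2, 4, 5, 7, 8, 10, 12]
y = [55, 60, 62, 70, 74, 80, 85]

result = stats.linregress(x, y)
print(f"Prediction equation: y = {result.intercept:.2f} + {result.slope:.2f} * x")
print(f"r = {result.rvalue:.2f}, p = {result.pvalue:.4f}")
```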
If the study involves analyzing the distribution of frequencies of variables, then which chi-square test is selected will depend on the number of variables in the study.
A chi-square goodness-of-fit test is a nonparametric null hypothesis significance test to determine if the frequency distribution of nominal-level data matches that of a set of expected frequencies.
A chi-square test of independence is a nonparametric null hypothesis significance test to determine if the categories of two nominal-level variables are related, that is, whether the frequency distribution of one variable depends on the categories of the other.
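A sketch of the goodness-of-fit version appears below, with made-up counts of 100 participants choosing among three categories; the test of independence is sketched with the cataract example later in this section.

```python
from scipy import stats

# Made-up observed counts for three categories (100 participants total)
observed = [60, 25, 15]
# Expected frequencies under the null hypothesis of equal preference
expected = [100 / 3, 100 / 3, 100 / 3]

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"Goodness of fit: chi2 = {chi2:.2f}, p = {p_value:.4f}")
```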
Consider the correlation between GDP (Gross Domestic Product) and life expectancy depicted in the GDP and Life Expectancy Scatterplot figure, which has r = .79 (). Can you interpret both the direction and strength of this correlation?
Consider still another example of the use of NHST, this one from a published research study (). The research question was whether the degree of extraversion is related to GPA in a sample of college students. In reality, the researchers collected several samples, some quite large, and used a variety of measures of extraversion. For our purposes, we will imagine a single, large sample with measures of both GPA and extraversion.
Consider the research question, are later-in-life vision problems like cataracts related to the development of dementia? The following table gives the results (based approximately on ) of a hypothetical sample of 3,038 people diagnosed with cataracts, some who opted for surgery and some who did not. The participants were then tested regularly over the ensuing years by their healthcare provider for symptoms of dementia. The table shows that about 21% (215 ÷ 1,036) of the people who opted for surgery subsequently developed dementia, whereas about 32% of the people who declined surgery went on to develop dementia.
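To see how such a table could be analyzed, here is a sketch of a chi-square test of independence using counts reconstructed approximately from the figures quoted above: 215 of 1,036 people in the surgery group, and roughly 32% of the remaining 2,002 people (about 640) in the no-surgery group. These reconstructed counts are approximations for illustration, not the study's actual figures.

```python
from scipy import stats

# Rows: opted for surgery vs. declined surgery
# Columns: developed dementia vs. did not develop dementia
# Counts approximated from the percentages described above
table = [[215, 1036 - 215],   # surgery group
         [640, 2002 - 640]]   # no-surgery group (approximate)

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"Test of independence: chi2 = {chi2:.2f}, p = {p_value:.4f}")
```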