Chapter 11
When you hear the term “bivariate,” focus on the part of the word that says “bi.” This prefix also comes from Latin and refers to a condition in which two things are present (e.g., binocular, bicameral). In practice evaluation, we see bivariate statistics used to compare groups against one another, or to compare one group of people before and after a social work intervention.
For example, if we think back to our child welfare social worker who is running parenting education groups, a report on people’s child abuse potential scores before and after a parenting course is a piece of bivariate data. We look at the group’s average score before the intervention compared to their average score after the parenting intervention.
With bivariate statistics, you see the inclusion of information about statistical significance. Statistical significance helps us tell whether there is a real mathematical difference between the before and after scores. Another way of saying that is that statistical significance tells us the percent chance that any given finding is due to chance. But before we go there for more explanation, let’s start by orienting ourselves to what is in Table 10.2, step by step.
Table 10.2: Parenting program evaluation data

| Measures | Before intervention (N=20) | One month after intervention (N=15) | Test |
| --- | --- | --- | --- |
|  | M (SD) or % (n) | M (SD) or % (n) |  |
| Child abuse potential score | 56.0 (4.2) | 38.0 (2.3) | t=3.22*** |
| Meeting all reunification goals | 35% (7) | 73% (11) | X2=8.54* |

*p<.05; **p<.01; ***p<.001; NS = no statistically significant difference between groups
We can see from the top row that this table is reporting on parenting education evaluation data. In the left-hand column, we see the two measures used by the evaluators: the child abuse potential score (from a standardized instrument) and whether parents are meeting their family reunification goals (a yes/no question). The second and third columns are where pre data (before the intervention) and post data (after the intervention) are reported. The final column is where statistical test data are reported.
Let’s start with the child abuse potential score, which can range anywhere between 1 (no child abuse potential at all) and 100 (high child abuse potential). Before the intervention, as a group, parents scored on average 56.0, plus or minus 4.2 points: 56.0 is the mean and 4.2 is the standard deviation. After the parenting education intervention, as a group, parents scored on average 38.0, plus or minus 2.3 points. In this case, 38.0 is the mean and 2.3 is the standard deviation.
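To make the “M (SD)” notation concrete, here is a minimal sketch of how those two summary numbers are computed, using Python’s standard library. The scores are invented for illustration only (the table reports the summaries, not individual scores):

```python
import statistics

# Hypothetical pre-intervention scores, invented for illustration only
scores_before = [52, 61, 58, 55, 49, 60, 57, 54, 59, 56,
                 53, 62, 55, 58, 51, 57, 60, 54, 56, 53]

mean = statistics.mean(scores_before)   # the "M" in M (SD)
sd = statistics.stdev(scores_before)    # the "SD", the sample standard deviation
print(f"{mean:.1f} ({sd:.1f})")
```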
On the face of it, this is about a 20-point drop in child abuse potential score. However, we don’t know if this is a statistically significant or mathematically meaningful difference. In order to understand whether this is a real difference, we need statistical data, which we have in the fourth column, where it is noted as “t=3.22***”.
Let’s take this one step at a time, beginning with the “t.” Every statistical test produces a result that is coded with a letter. The “t” (always lower case) refers to the Student’s t-test, a calculation that compares two values in a before and after situation (such as ours) or in a group comparison situation (such as comparing the results of two different therapeutic groups). When you see the “t,” this means that the measure you are looking at is continuous. A continuous variable is one for which you can calculate an average, which is exactly what we are looking at here. It is important to note that you can’t calculate an average on a yes/no question such as “completed treatment/didn’t complete treatment.” As a critical consumer of research, it is important to check whether the correct statistical test was used with any given measure in order to determine whether the results are legitimate. Believe it or not, I’ve seen the wrong test used more often than I care to relate!
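For readers who want to see what produces a result like “t=3.22***,” here is a minimal sketch using the scipy library and made-up scores (the table only reports summary statistics, so these numbers are invented). Because the before and after groups have different sizes here, an independent-samples t-test is shown; a design with matched pre and post scores for the same people would use `stats.ttest_rel` instead:

```python
from scipy import stats

# Invented scores for illustration, chosen to be roughly in the table's range
pre_scores = [52, 61, 58, 55, 49, 60, 57, 54, 59, 56,
              53, 62, 55, 58, 51, 57, 60, 54, 56, 53]   # N=20 before
post_scores = [36, 40, 38, 35, 41, 37, 39, 36, 40, 38,
               37, 39, 36, 41, 37]                      # N=15 after

t_stat, p_value = stats.ttest_ind(pre_scores, post_scores)
print(f"t={t_stat:.2f}, p={p_value:.3f}")
```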
Next, we have the number 3.22 listed next to the “t.” This number is not interpreted directly. It is a relic of a time before computer calculations, when we had to look values up in printed tables to determine whether our findings were statistically significant. So, we can skip right over this number and get to the “***” part of the report. Asterisks, or stars, are commonly used to indicate the presence of statistical significance.
Generally, if you see three stars, this means that there is a 99.9% chance that the finding you are looking at is not due to chance. Two stars mean that there is a 99% chance that the finding is not due to chance. One star means that there is a 95% chance that the finding is not due to chance. Classically, evaluators only accept results as statistically significant when their statistical data show at least a 95% chance that the finding is not due to chance.
However, more recently, statisticians have questioned this rule of thumb, suggesting that lower percentages might be considered as well. In order to understand this, it helps to look at the entire range of possible percentages, shown in Table 10.3.
In Table 10.3, statistical significance levels are reported as probability values, or p values. A p value of <.99 translates into 1%, for example, indicating there is less than a one percent chance the finding is not due to chance. To get from .99 to 1%, you just have to remember some fourth-grade math: subtract .99 from 1 to get .01, or one percent. The three groupings presented in Table 10.3 reflect classical thinking about statistical significance but are not a hard and fast rule.
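As a quick worked version of that arithmetic, using this chapter’s framing of the confidence level as one minus the p value:

$$\text{confidence level} = (1 - p) \times 100\%, \qquad \text{e.g., } (1 - 0.05) \times 100\% = 95\%.$$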
Table 10.3: Range of statistical significance levels

| Not statistically significant | Approaching significance | Statistically significant |
| --- | --- | --- |
| p<.99, p<.80, p<.70, p<.60, p<.50, p<.40, p<.30, p<.20, p<.10 | p<.09, p<.08, p<.07, p<.06 | p<.05*, p<.04*, p<.03*, p<.02*, p<.01**, p<.001*** |
So, we’ve talked a lot about statistical significance, but there is another factor we have to consider: clinical significance (sometimes also called clinical meaningfulness). To do this, let’s go back to our example. As you will recall from Table 10.2, before the intervention, parents’ average child abuse potential score was 56.0 (4.2), whereas it was 38.0 (2.3) one month after the intervention ended.
While statistical significance tells us the chance a finding is due to chance, clinical significance is what we, individually, as social workers think about the difference between the numbers being compared. These are two different concepts. In order to think about clinical significance, we have to start with statistical significance. If something is found to be statistically significantly different, then we go on to consider clinical significance. If something is found *not* to be statistically significantly different, we stop there, with the understanding that mathematically the scores are equal. In that case, rather than comparing the two scores to one another, we would just consider where they fall on whatever spectrum of scores we are considering.
To use our statistically significant example from Table 10.2, we have to consider whether 56 and 38 are meaningfully different from one another. This involves going back to the range of scores on the child abuse potential scale, 1-100. I’m willing to bet that most people would consider a roughly 20-point difference, with statistical significance, a meaningful difference.
However, what if you had a statistically significant difference, but the score difference was only 10 points? Or 5 points? Would you feel the finding was clinically significant then? What would it mean, for example, if the pre-test score was 56, but the post-test score was only 51? Would you think that the intervention assisted parents in lowering their child abuse potential score? In this situation, even though there was a mathematical difference (per statistical significance), there was not a clinically meaningful change in score, suggesting that the intervention might not be effective.
Now, we’ve been spending a great deal of energy to interpret the child abuse potential score, but what about the other measure in Table 10.2, “meeting all reunification goals?” As you will recall, our child welfare social worker is trying to assist all parents on her caseload to reunify with their children who are living in foster care. In this situation, each parent works with a child welfare social work team to establish goals they need to accomplish in order to reunify with their children.
One of the measures in our child welfare social worker’s evaluation of her parenting education intervention is whether group participants are meeting all of their reunification goals. The hope is that participation in the parenting education intervention will help parents to achieve all of their goals.
As we can see in Table 10.2, only 35 percent of parents (a total of 7 people) had achieved their reunification goals before the intervention started. This is contrasted by the fact that 73 percent of the intervention participants (11 people) had achieved their reunification goals one month after the end of the parenting education intervention. On the face of it, that is a big difference, but is it a statistically significant difference? In order to determine that, we look in the fourth column for the statistical data, and we see X2=8.54*.
Let’s take this one step at a time, beginning with the “X2.” Every statistical test produces a result that is coded with a letter. The “X2” (pronounced “kai squared”) refers to the chi-square test, a calculation that compares two values in a before and after situation (such as ours) or in a group comparison situation (such as comparing the results of two different therapeutic groups). When you see the “X2,” this means that the measure you are looking at is nominal. A nominal variable is one for which you *cannot* calculate an average, which is the kind of measure we are looking at here. Nominal variables are often structured as a yes/no question such as “achieved reunification goals/didn’t achieve reunification goals.” Remember, as a critical consumer of research, it is important to check whether the correct statistical test was used with any given measure in order to determine whether the results are legitimate.
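For the curious, here is a minimal sketch of the kind of calculation behind “X2=8.54*,” using the scipy library. The counts come from Table 10.2, but note that this simple two-by-two setup is only an approximation of whatever exact procedure the evaluators used:

```python
from scipy import stats

# 2x2 table of counts: rows are timepoints, columns are met goals / did not
observed = [[7, 13],   # before intervention (N=20)
            [11, 4]]   # one month after intervention (N=15)

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"X2={chi2:.2f}, p={p_value:.3f}")
```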
Next, we have the number 8.54 listed next to the “X2.” As with the Student’s t-test, this number is not interpreted directly; it is a relic of a time before computer calculations, when we had to look values up in printed tables to determine whether our findings were statistically significant. So, we can skip right over this number and get to the “*” part of the report. As you will recall, a single star tells us that the finding is statistically significant, and that there is a 95% chance that our test result is not due to chance.
So, we’ve talked a lot about statistical significance, but now we have to consider clinical significance. To do this, let’s go back to our example. As you will recall from Table 10.2, before the intervention, 35 percent of parents had achieved their reunification goals, whereas 73 percent (almost three quarters of the group) had achieved their reunification goals one month after the intervention ended. This difference of roughly forty percentage points between the pre- and post-tests would be considered clinically significant by most people.
However, if there had been no statistically significant difference between pre and post, we would have to consider the scores statistically equal, in which case clinical meaningfulness would take on a different meaning. Specifically, the meaningfulness would not be about the difference in scores, as the statistical tests showed the scores to be equal, which would suggest that the intervention was not effective.
We’ve taken our time going through Table 10.2. Now, let’s ask the question “How is this evaluation information helpful?” From the data we have reviewed, our evaluation tells us that parents participating in the parenting education intervention are making progress, and we have evidence to support this statement, evidence backed up by results from a Student’s t-test and a Chi-squared test.
So far, you have learned how to interpret a Student’s t-test and a chi-square test (two of the most common evaluation statistics), but we also have to talk about the public health and medical statistic that is increasingly used in social work: the odds ratio. An odds ratio compares two groups, such as a treatment group and a control group. You always need to know which group you are comparing to the other in order to correctly interpret the data that the test gives you. The data that you interpret are about the group you are focused on, reported in comparison to the other group, known as the “referent.” Usually the treatment group is the group you are focused on and the control group is the referent. Let’s say we are comparing the treatment group and the control group on their likelihood of maintaining sobriety for 90 days post treatment. You will have data about what percentage of each group maintained sobriety for that timeframe, but you will want to know whether there is a statistically significant difference. Odds ratio scores also help you to know whether there is a clinically meaningful difference. More on that in a minute.
While we don’t interpret the scores that are given to us by t-tests and chi-squares, we do interpret the number that is given to us with an odds ratio. An odds ratio of exactly 1.0 means that the treatment and control groups are exactly equal. An odds ratio of 2.3 (p<.001) means that the treatment group is 2.3 times more likely to maintain sobriety than the control group. Now let’s say we have an odds ratio of 2.3 (p<.99). While odds ratios are still reported when there is no statistical significance, we don’t interpret them, because the two groups are statistically equally likely to have the outcome.
Now, let’s say that our odds ratio had been 0.23 instead of 2.3. In this situation, we subtract 0.23 from 1 and get .77, which we interpret as a percent. This would tell us that the treatment group was 77% less likely to maintain sobriety for 90 days (meaning there’s a problem with our program). When our odds ratios are greater than 1, we talk about “times more likely,” and when they are less than 1 (meaning they start with a zero), we talk about “percent less likely.” So, an odds ratio of 0-point-anything is always about lower likelihood.
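Here is a minimal sketch of that arithmetic in Python, using invented counts for the sobriety example (none of these numbers come from a real study):

```python
# Hypothetical counts: treatment group vs. control group (the referent)
sober_tx, not_sober_tx = 30, 10     # treatment: 30 sober, 10 not
sober_ctl, not_sober_ctl = 15, 25   # control: 15 sober, 25 not

odds_tx = sober_tx / not_sober_tx      # 3.0
odds_ctl = sober_ctl / not_sober_ctl   # 0.6
odds_ratio = odds_tx / odds_ctl        # 5.0

if odds_ratio >= 1:
    print(f"Treatment group is {odds_ratio:.1f} times more likely to stay sober.")
else:
    print(f"Treatment group is {(1 - odds_ratio) * 100:.0f}% less likely to stay sober.")
```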
Let’s talk about the way that odds ratio scores help us to determine clinical meaningfulness. We only start paying attention to odds ratios as meaningful at a certain cutoff point. In research-as-a-second-language terms, we talk about this as an “effect size.” As Chen, Cohen, and Chen (2010) note, “the odds ratio (OR) is probably the most widely used index of effect size in epidemiological studies” (p. 860). Further, these authors suggest that odds ratios of 1.68, 3.47, and 6.71 are equivalent to Cohen’s d effect sizes of 0.2 (small), 0.5 (medium), and 0.8 (large), respectively (p. 860). So unless your odds ratio is above 1.68, you shouldn’t really consider it to be clinically meaningful. That’s a good rule of thumb.
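As a small illustrative helper (mine, not the chapter’s), those cutoffs can be written as a quick lookup:

```python
def odds_ratio_effect_label(odds_ratio: float) -> str:
    """Label an odds ratio using the Chen, Cohen, and Chen (2010) cutoffs."""
    # By symmetry, an odds ratio below 1 can be judged by its reciprocal,
    # e.g., 0.23 has the same magnitude as 1 / 0.23 = 4.3 (medium).
    if odds_ratio < 1:
        odds_ratio = 1 / odds_ratio
    if odds_ratio >= 6.71:
        return "large"
    if odds_ratio >= 3.47:
        return "medium"
    if odds_ratio >= 1.68:
        return "small"
    return "below the 'small' cutoff"

print(odds_ratio_effect_label(2.3))  # -> small
```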
It is important to note that depending on the structure of the outcome measure/variable used in an evaluation, a t-test, chi-square test or odds ratio test could be used in a pre-post test research design, as discussed in chapter 4.
Now that you have learned all of the classic statistics for comparing two groups, there is one more statistical test that evaluators use all the time: the ANOVA, or analysis of variance. Don’t let the evaluation-as-a-second-language jargon deter you. The ANOVA test can compare two or more points in time (or groups) but is classically used to compare three points in time (or groups). ANOVA tests compare continuous variables (reported as means, with standard deviations) across timeframes or groups. As discussed in chapter 4, the ANOVA statistical test is often used in evaluation to compare pretest data to post-test data and then aftercare data, for example.
Let’s look at some examples, starting with Table 10.4, which reports on data gathered by our child welfare social worker before the parenting education intervention, one month after the intervention, and one year after the intervention. We have already reviewed a bivariate analysis that compared the average score before the intervention and one month after the intervention. Now, we are adding a third point in time, the one-year mark after the intervention. We can see that on average, the child abuse potential score was at its lowest one month after the intervention, and that it crept up slightly at the one-year mark.
As with all statistical tests, the ANOVA presents a letter along with test results; in this case it is a capital F. The “F statistic” reads as F=5.23**. As with the Student’s t-test and chi-square, we do not interpret the number itself, but we do interpret the stars to indicate the presence of statistical significance, in this case suggesting a 99% chance that the differences in scores across timeframes are not due to chance.
Table 10.4: Parenting education evaluation data across three timeframes

| Measures | Before intervention (N=20) | One month after intervention (N=15) | One-year post test (N=15) | ANOVA or X2 |
| --- | --- | --- | --- | --- |
|  | M (SD) or % (n) | M (SD) or % (n) | M (SD) or % (n) |  |
| Child abuse potential score | 56.0 (4.2) | 38.0 (2.3) | 41.2 (3.7) | F=5.23** |

*p<.05; **p<.01; ***p<.001; NS = no statistically significant difference between groups
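Here is a minimal sketch of a one-way ANOVA like the one reported in Table 10.4, using the scipy library and invented scores (chosen only to be roughly in the table’s range, not to reproduce its exact statistics):

```python
from scipy import stats

# Invented individual scores for the three timeframes, for illustration only
before = [52, 61, 58, 55, 49, 60, 57, 54, 59, 56,
          53, 62, 55, 58, 51, 57, 60, 54, 56, 53]    # N=20
one_month = [36, 40, 38, 35, 41, 37, 39, 36, 40, 38,
             37, 39, 36, 41, 37]                     # N=15
one_year = [39, 44, 41, 38, 45, 40, 42, 39, 43, 41,
            40, 42, 39, 45, 40]                      # N=15

f_stat, p_value = stats.f_oneway(before, one_month, one_year)
print(f"F={f_stat:.2f}, p={p_value:.3f}")
```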
In order to determine where the specific differences lie between the three timeframes, a post-hoc test is conducted. All that means is that a test is done after the ANOVA is completed in order to compare the pre-test to each post-test individually, as well as to compare the post-tests to one another. This test provides p values for each pair of timeframes so we can see where the statistically significant differences lie.
In Table 10.5, if we start by finding the p value linking the “before intervention” timeframe and the “one-year post-test” timeframe, we see a value of p<.01, indicating a 99% chance that there is a real difference between scores across these timeframes.
Table 10.5: ANOVA post-hoc test results for parenting education intervention evaluation

| Timeframes | Before intervention | One month after intervention | One-year post test |
| --- | --- | --- | --- |
| Before intervention | — | p<.03 | p<.01 |
| One month after intervention | p<.03 | — | p<.86 |
| One-year post test | p<.01 | p<.86 | — |
Moving on, we might be curious about whether there is a statistically significant difference between scores at the “one month after intervention” timeframe and the “one-year post-test” timeframe. Looking at the value that links the two, we see p<.86. As we can see in Table 10.3, this is not a statistically significant value. This score means that there is only a 14% chance that this finding is not due to chance.
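For readers curious how a table like Table 10.5 might be produced, here is a minimal sketch using the statsmodels library’s Tukey HSD test on invented scores. Tukey’s is one of several common post-hoc tests; the chapter does not say which one the evaluators actually used:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Invented scores for the three timeframes, ten per group for brevity
scores = np.array([52, 61, 58, 55, 49, 60, 57, 54, 59, 56,   # before
                   36, 40, 38, 35, 41, 37, 39, 36, 40, 38,   # one month after
                   39, 44, 41, 38, 45, 40, 42, 39, 43, 41])  # one year after
labels = ["before"] * 10 + ["one_month"] * 10 + ["one_year"] * 10

# Compares every pair of timeframes and reports an adjusted p value for each
print(pairwise_tukeyhsd(scores, labels))
```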
As we mentioned above, the ANOVA test can also be used to compare groups, something that is also relevant for practice evaluation. For example, let’s say our child welfare social worker was running three intervention groups, and we wanted to compare the outcomes. See Table 10.6.
Table 10.6: Parenting education evaluation data across groups at one month after intervention

| Measure | Group 1 (N=15) | Group 2 (N=22) | Group 3 (N=17) | ANOVA |
| --- | --- | --- | --- | --- |
|  | M (SD) | M (SD) | M (SD) |  |
| Child abuse potential score | 38.0 (2.3) | 42.6 (5.1) | 33.5 (1.6) | F=7.45 |

*p<.05; **p<.01; ***p<.001; NS = no statistically significant difference between groups
Taking our time to orient to the table, we would start by grounding ourselves in the title of the table. Looking at the “measure” column we would see that the child abuse potential score is being reported. The three columns in the middle show us information about each of the three intervention groups.
The last column shows us which statistical test was used, an ANOVA, and reports the F statistic. As we learned above, we skip over the “F statistic” itself in order to see whether there is a statistically significant difference between the three groups. In this case, there is no star or asterisk present, indicating that there is no statistically significant difference between these groups with respect to the outcome measure of child abuse potential. As a reminder, if there had been a statistically significant difference and we wanted to know *exactly* where it was – between groups 1 and 2, 2 and 3, or 1 and 3 – we would run a post-hoc analysis. In a post-hoc analysis, a table of p values is reported for each of those comparisons; see Table 10.5 for an example.
How is this evaluation information helpful? We can take a look at these data several ways. First, we know that there is no statistically significant difference between the groups, which means that mathematically, they are doing equally well. This suggests that there may not be a difference in who is in each group or in how each group is conducted.
Second, if we look at the score range, we can see that the child abuse potential scores fall roughly between 33 and 43. Given that the child abuse potential score range is between 1 and 100, parents in the groups are at the lower end of the range. In terms of clinical significance, we would focus on how the data look all together – no concerning differences between groups, all scores on the lower end of the range. We would not compare the scores in each group for clinical significance because there was no statistical significance, indicating that the groups are mathematically the same.
You may not understand ALL of what is on a statistical table, but you will be able to pick out what is important for application to social work practice. In order to do this work, you need to start with a DEEP BREATH and an OPEN MIND before following these simple steps:
- ORIENTATION: Read the title of the table in order to orient yourself. Many people skip this step due to their anxiety. The title will tell you a lot that can help you to decipher the rest of the table. Some of the questions you can ask of the title are:
- What was the purpose of the table? Summarization? Comparison? Correlation? Prediction?
- Which statistical test was used? Mean & standard deviation? Percentage? Student’s t test? Odds ratio? Chi-square? ANOVA? Go to your stats cheat sheet to remember what these are about.
- How many groups are reported on? In one-group situations we will most often be looking at mean & standard deviation as well as percentages and sometimes OLS regression or logit regression. In some OLS regression and logit regression tables, we will be comparing groups.
- INTERPRETATION OF SCORES/PERCENTAGES: Now take a look at the numbers, slowly and methodically in order to increase understanding and reduce anxiety.
- If one group is reported on: What are the scores or percentages for that group? Once you have interpreted these for the group in question, you are done.
- If multiple groups are reported on: What are the scores or percentages for that group? If you are dealing with a multiple group comparison, you will need to interpret statistical and clinical significance, see below.
- INTERPRETATION OF STATISTICAL SIGNIFICANCE: While we can see that there may be numerical differences between groups, reading those numbers alone will not tell us if there is a statistical or mathematical difference between them. Statistical significance tells us whether mathematically, there is a real difference between the groups or whether the difference noted is due to chance.
- INTERPRETATION OF CLINICAL SIGNIFICANCE: Statistical significance doesn’t tell us everything. We also have to use our clinical minds in order to think about the meaningfulness of the data. Something may have a statistically significant difference, but not a clinically significant difference. For example, two therapy groups may score within 10 points of one another on an outcome measure, with a statistically significant difference noted. Even though one group is a little higher than the other on the outcome measure, this may not be clinically meaningful if we are using a scale of 1-100. If there is no statistical significance, we cannot detect a clinically significant difference between groups, however, as the values are statistically equal. There is still clinical meaningfulness, though, about the overall score of both groups on the range of scores being considered.
Discussion questions for chapter 11
- Thinking of your current internship or work placement, how could you use bivariate statistics to inform your work?
- What are the differences between univariate, bivariate, and multivariate statistics?
- What is statistical significance as compared to clinical significance?
- If there is no statistically significant difference in evaluation outcome measures between two groups, is there clinical significance?
Chen, H., Cohen, P., & Chen, S. (2010). How big is a big odds ratio? Interpreting the magnitudes of odds ratios in epidemiological studies. Communications in Statistics – Simulation and Computation, 39(4), 860–864.