Appendix C: Highlighting Vs. Hiding Results

Overview

Over the past several years, reproducibility has become a more prominent issue in the social sciences, as well as within cognitive neuroscience. As far back as 2005, John Ioannidis published a paper arguing that most published research findings are probably false, using simulations of variables such as bias and pre-study odds. Although the paper dealt with hypothetical scenarios, those scenarios were later shown to have played out across many published findings in the social sciences: Brian Nosek and colleagues, in a paper published in Science in 2015, attempted to replicate the effects of 100 studies and were able to replicate only around a third of them, as judged by whether the replications passed the conventional p=0.05 threshold. This raised the question of whether practices such as p-hacking and selective reporting were leading to inflated effect sizes and spurious results.
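The core of Ioannidis's argument can be written as a single formula for the positive predictive value (PPV) of a significant finding - the probability that a result which crosses the significance threshold reflects a true effect. In its simplest form (a single team, no bias), with R the pre-study odds that the tested effect is real, 1 - β the statistical power, and α the significance level:

```latex
\mathrm{PPV} = \frac{(1-\beta)\,R}{(1-\beta)\,R + \alpha}
```

When the pre-study odds are low and power is modest, the PPV can fall below one half even at α = 0.05 - which is the sense in which most significant findings in such a field would be false - and the paper's additional terms for bias and multiple competing teams only lower it further.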

Since then, many studies have also been carried out in cognitive neuroscience to find out whether effects are generally reproducible, and if not, what practices may be hindering reproducibility. For example, a study by Botvinik-Nezer et al. (2020) provided the same dataset to several dozen independent research teams and asked them to test for given effects, such as whether there was a significant effect of viewing the potential Gain of a gamble compared to its potential Loss. Each team ran its own whole-brain analysis and then judged whether there were significant effects in a priori regions of interest (ROIs), such as the ventral striatum, amygdala, and ventromedial prefrontal cortex - all regions that typically show BOLD activity in response to gains and losses.

Once the analyses were completed, each team was asked to decide whether the effect existed in a given region or not: the team could either reject the null hypothesis and report that there was probably an effect in that region, or fail to reject the null hypothesis. In the final paper, around 80% of the teams reported a significant effect for one of the nine hypotheses, while for most of the other hypotheses only around 20%-40% of teams reported a significant effect; in addition, a figure of Spearman correlation values between each team's unthresholded maps illustrated how much overlap there was between the groups.
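To give a sense of how such a correlation figure could be assembled, here is a minimal Python sketch. The file names are hypothetical stand-ins for each team's unthresholded group map, assumed to have been resampled to a common grid.

```python
import numpy as np
import nibabel as nib
from scipy.stats import spearmanr

# Hypothetical filenames: one unthresholded group map per team, all on the same grid.
team_files = ["team01_zstat.nii.gz", "team02_zstat.nii.gz", "team03_zstat.nii.gz"]
maps = [nib.load(f).get_fdata().ravel() for f in team_files]

# Pairwise Spearman correlations between the unthresholded maps.
n_teams = len(maps)
corr = np.ones((n_teams, n_teams))
for i in range(n_teams):
    for j in range(i + 1, n_teams):
        rho, _ = spearmanr(maps[i], maps[j])
        corr[i, j] = corr[j, i] = rho

print(np.round(corr, 2))
```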

Benefits of the Highlighting Approach

At first glance, it may look as though there was less overlap of significant effects between the teams than might be expected, given that the effects of Gain and Loss in regions such as the ventromedial prefrontal cortex and ventral striatum are consistently reported in the literature. Given this discrepancy, two hypotheses come to mind: Either the reported effects in the literature are in fact spurious or inflated due to the practices listed above, or many of the teams in the Botvinik-Nezer study were not analyzing the data correctly. Neither hypothesis is encouraging for those who want to believe that most experiments should be reproducible, and that reproducibility is a strong indicator of the truth of an effect.

However, a recent paper by Taylor et al. (2023) proposes a third hypothesis: It may be that the vast majority of researchers are in fact doing the right analyses, and that the effects across independent teams are more similar than we think. The problem may lie not with the analyses themselves, but with the reliance on the conventional p=0.05 threshold for determining statistical significance. If we focus on effect sizes instead of statistical significance alone, we may find that there is less of a reproducibility crisis than we think.

Consider, for example, how most results are displayed. After preprocessing the data and analyzing it with a general linear model, the resulting statistical maps are thresholded at any of a range of values, most commonly a corrected p=0.05, whether the correction is voxel-wise or cluster-wise. This often leaves only a few dozen or a few hundred voxels passing the threshold, out of the tens or even hundreds of thousands of voxels that were originally collected. It may be that these remaining voxels are indeed the only ones that are “active”, but this mindset of voxels being either “on” or “off” is at best an incomplete picture of the biological reality those voxels represent, and at worst a distortion. The binary approach is further constrained by masks, which restrict the analysis to the brain alone, or to just the grey matter. Although there are reasons for this - ease of visualization being one, and uniformity in applying thresholds being another - such practices likely give the reader a skewed perception of the data.
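To make concrete how much data a hard threshold can discard, here is a minimal Python sketch; the filename and the cutoff value are placeholders, standing in for whatever statistic file and corrected threshold a given study actually uses.

```python
import numpy as np
import nibabel as nib

# Load a map of t-statistics (hypothetical filename).
t_img  = nib.load("stats.nii.gz")
t_data = t_img.get_fdata()

# A hard voxel-wise cutoff, e.g. |t| > 3.1; the exact value is study-specific.
cutoff = 3.1
suprathreshold = np.abs(t_data) > cutoff

print(f"Voxels in the volume:      {t_data.size}")
print(f"Voxels passing the cutoff: {suprathreshold.sum()}")

# The usual "hiding" display: everything below the cutoff is zeroed out.
hidden = np.where(suprathreshold, t_data, 0)
nib.save(nib.Nifti1Image(hidden, t_img.affine), "stats_thresholded.nii.gz")

# The "highlighting" alternative keeps the full map and only marks the
# suprathreshold voxels, e.g. so they can be outlined in a figure.
nib.save(nib.Nifti1Image(suprathreshold.astype(np.int16), t_img.affine),
         "stats_suprathreshold_mask.nii.gz")
```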

Note

These concerns have been raised several times over the past two decades, notably by Jernigan et al. (2003), who recommended mapping different ranges of t-statistics in different colors, and by Allen et al. (2012). The latter paper focused on figures more generally - for example, using violin plots instead of histograms - and gave particular attention to neuroimaging figures, including the use of translucency to display all of the effects and of a solid boundary to highlight the voxels and clusters that pass a significance threshold. (The Allen et al. paper in particular merits a closer look from any neuroimaging researcher who wants to make figures that are attractive, informative, and deliberately composed.)

To illustrate this, the authors present a typical neuroimaging result using the Flanker task. If we apply the traditional p=0.05 cluster-wise threshold to the data, we are left with a few small clusters in the bilateral intraparietal sulcus - not an uncommon result, but hardly the whole picture that we usually see in other neuroimaging studies of Incongruency, which often include significant clusters in the pre-Supplementary Motor Area. By highlighting the results and displaying a wider range of effects, we see not only the expected cluster in the dorsal anterior cingulate / pre-SMA area, but also negative effects in occipital and frontal areas. While these may be spurious effects, it could also be that they do in fact exist but do not pass the significance threshold for a variety of reasons - a relatively low N, and by extension relatively low power, being one of the most likely explanations.
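The display style itself is easy to mock up outside of any neuroimaging package. The following Python/matplotlib sketch is purely schematic - simulated values on a toy 2D grid rather than real data - but it shows the two ingredients of the highlighting approach: translucency scaled by the size of the statistic, and a solid outline around the voxels that pass an (arbitrary) cutoff.

```python
import numpy as np
import matplotlib.pyplot as plt

# A toy 2D "slice" of t-statistics, standing in for real data.
rng = np.random.default_rng(0)
t_slice = rng.normal(0, 1, size=(60, 60))
t_slice[20:30, 25:40] += 4.0          # a strong "cluster"
t_slice[40:50, 10:20] -= 2.5          # a weaker, subthreshold effect

cutoff = 3.1                          # arbitrary voxel-wise cutoff

fig, ax = plt.subplots()
ax.imshow(np.zeros_like(t_slice), cmap="gray")   # stand-in "underlay"

# Translucency scaled by the statistic: strong effects opaque, weak ones faint.
alpha = np.clip(np.abs(t_slice) / cutoff, 0, 1)
ax.imshow(t_slice, cmap="RdBu_r", vmin=-5, vmax=5, alpha=alpha)

# Solid outline around the voxels that pass the cutoff.
ax.contour(np.abs(t_slice) > cutoff, levels=[0.5], colors="black", linewidths=1.5)

ax.set_title("Highlighting: translucent effects, outlined where |t| > cutoff")
plt.show()
```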

Applying the Highlighting Approach to the NARPS Dataset

Returning to the study we opened with, how would those results look if we applied the highlighting approach instead of the more common null hypothesis significance testing (NHST) approach? The Botvinik-Nezer et al. (2020) study - also known as the Neuroimaging Analysis Replication and Prediction Study, or NARPS - asked the research teams to create a thresholded map the way they usually would, and then determine whether an effect was localized in a given region or not. If we look at the effect of Gain, for example, and apply the usual statistical thresholds, we are left with a few small clusters scattered throughout the brain - similar to what we saw with the Incongruency effect above. If we apply the highlighting approach, on the other hand, we see a wider range of effects, including some that are clearly bilateral. Furthermore, we can remove the whole-brain mask and lower our statistical threshold even further, in order to see whether there are any issues with the mask or other artifacts that would not be visible at our original threshold.

Even if there are no suprathreshold effects, viewing the results with a translucent display can indicate how strong the subthreshold effects are: Whether they are truly weak and effectively non-existent, or whether they are close to the threshold and may simply be limited by statistical power. By zooming in on the orbitofrontal cortex, a region known for susceptibility artifacts and signal dropout, the authors were able to examine whether the absence of any suprathreshold voxels was due to weak effects, or instead to signal dropout and an overly aggressive mask. Without the highlighting approach, it is impossible to distinguish between these possibilities.

This approach also extends to reporting the results in tables. Tables usually entail a loss of information by funneling an entire cluster into a single entry, typically its peak voxel, but they can be expanded to provide a richer set of information for the reader. For example, instead of reporting a single coordinate for the peak of the cluster and a single region that it is localized to, we can report the size of the cluster and what percentage of its voxels overlap with each region of a given atlas. This better represents the uncertainty about where the effect is localized, and more accurately reflects the noise and the inherent spatial smoothness of the data.
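As a sketch of how such a table entry could be computed, assume a binary cluster mask and an integer-labeled atlas that are already on the same grid; both file names below are hypothetical.

```python
import numpy as np
import nibabel as nib

# Hypothetical inputs: a binary cluster mask and an integer-labeled atlas.
cluster = nib.load("cluster_mask.nii.gz").get_fdata() > 0
atlas   = nib.load("atlas_labels.nii.gz").get_fdata().astype(int)

n_cluster = cluster.sum()
print(f"Cluster size: {n_cluster} voxels")

# Percentage of the cluster's voxels falling in each atlas region.
labels_in_cluster = atlas[cluster]
for label in np.unique(labels_in_cluster):
    if label == 0:          # 0 is assumed to mean "unlabeled"
        continue
    pct = 100.0 * (labels_in_cluster == label).sum() / n_cluster
    print(f"  region {label}: {pct:.1f}% of cluster voxels")
```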

Lastly, if we look across a range of studies - in this case, treating each research team's results as an independent study, even though all of the teams were analyzing the same data - we observe much greater overlap between the maps using the highlighting approach than with a binary approach. The overlap points to a greater agreement between the groups about both the direction and the magnitude of the effects; without this perspective, we would be led to believe that there is much more variability in the results reported by each of the teams than there is in reality.
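The contrast between binary overlap and continuous agreement can also be quantified directly. The sketch below - with hypothetical file names and an arbitrary cutoff - compares the Dice overlap of two teams' thresholded maps with the Spearman correlation of the same maps left unthresholded; the latter is typically far more forgiving of small differences around the threshold.

```python
import numpy as np
import nibabel as nib
from scipy.stats import spearmanr

# Unthresholded group maps from two hypothetical teams, on the same grid.
map_a = nib.load("team_A_zstat.nii.gz").get_fdata().ravel()
map_b = nib.load("team_B_zstat.nii.gz").get_fdata().ravel()

# Binary comparison: Dice overlap of the suprathreshold voxels.
cutoff = 3.1
bin_a, bin_b = np.abs(map_a) > cutoff, np.abs(map_b) > cutoff
dice = 2 * np.logical_and(bin_a, bin_b).sum() / (bin_a.sum() + bin_b.sum())

# Continuous comparison: rank correlation of the full, unthresholded maps.
rho, _ = spearmanr(map_a, map_b)

print(f"Dice of thresholded maps:         {dice:.2f}")
print(f"Spearman r of unthresholded maps: {rho:.2f}")
```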

Implications for Reproducibility

By using a highlighting approach, you gain several advantages: First, you display a much wider range of the data to the audience, providing a fuller picture of the results. Second, the highlighted maps are more likely to be faithful to the underlying BOLD dynamics, which probably do not behave in a binary fashion, but rather vary continuously across the brain. And lastly, this can reduce the incentive to use any of the practices above, such as p-hacking: you can still apply the statistical threshold of your choice, but doing so no longer means having to hide most or all of the data.

These unthresholded maps are also important for meta-analyses, which can otherwise be biased by single peaks that are influenced by a range of variables - including differences in sample sizes, signal quality, and experimental design.

Thresholding Results in AFNI

Displaying translucent effects is easy: After you have overlaid your results on a template brain and thresholded them appropriately, click on the A button above the threshold slider. This will display all of the effects on a continuous scale of translucency, with stronger effects appearing more opaque and weaker effects fading into the underlay. Next, you can highlight the results that pass your significance threshold by clicking on the B button next to the A button, which draws a solid boundary around the suprathreshold clusters. After that, save the figures as you normally would, and include them in your publication.