A new paper claims that APOE ε4 carriers, a group previously described as having poorer long-term memory, actually have better visual working memory. On Twitter, Prof Dorothy Bishop raised the issue of multiple comparisons, or rather a lack thereof… Should the paper be #cancelled?!
There is a lot of research on the apolipoprotein E (APOE) gene’s three main alleles: ε2, ε3, and ε4. Out of these, ε3 is basic, you don’t really hear a lot about ε2, and ε4 is a bit extra: it is associated with a higher risk of Alzheimer’s, and with earlier dementia onset. This also means that the APOE ε4 carriers who will end up being diagnosed with dementia are more likely to have subtle cognitive deficits during the decades before their diagnosis. (Their brains are already impacted, but not to a high enough degree to notice or to meet diagnostic criteria.)
Some people have thus asked: “C’mon, evolution, why have you not gotten rid of APOE ε4 if it’s such a bad allele?” One idea is that APOE ε4 also confers certain benefits that could outweigh some of its downsides. In line with this approach, Kirsty Lu and co-authors attempted to look for cognitive benefits in APOE ε4 carriers, and reported on it in an article in Nature Aging (see the full reference at the bottom of this post).
Why is the APOE e4 allele still present in the general population when it confers an increased risk of #Alzheimers and worse episodic / long-term memory? Perhaps the answer is that it is actually associated with superior working / short-term memory. https://t.co/fXxjA0PWGq…
— Masud Husain (@MasudHusain) October 8, 2021
Prof Masud Husain (5th out of 23 authors, if I counted correctly) tweeted about the new paper (see above), and Twitter was quick to respond (see below). Two main pieces of criticism were offered: 1) How is APOE ε4 actually selected against? 2) You didn’t do a correction for making many statistical comparisons!
Full disclosure: Masud was my PhD supervisor, and this caught my attention because it’s not often that you see your PhD supervisor get cancelled on Twitter!
Even absent any benefits, what’s the selection pressure against it?
— Huw (@Huwtube) October 8, 2021
The first is easy: Poorer long-term memory is an obvious downside that can subtly affect survival. (For example through forgetting where the good bits in your environment are, like food and shelter. Think how good it is for elephants to have an amazing memory for these things.) People on Twitter seemed to think that it didn’t matter in old age. This glosses over potential earlier downsides, and it is reminiscent of the “grandmother hypothesis”. This is an answer to the question “Why don’t women just die after they’re no longer fertile?” (Astute observers will note that the existence of this question is an excellent argument for why we need more women in evolutionary psychology.) Turns out, having grandparents around increases the survival chances of their offspring and their offspring’s offspring. Thus, grandparents who are not dead, suffering from dementia, or just very forgetful could ultimately confer an evolutionary benefit to their own offspring’s survival.
How happy are you with not doing any corrections for multiple comparisons? There’s an awful lot of statistical tests.
— Dorothy Bishop (@deevybee) October 8, 2021
Now, the second point is more complicated. Prof Dorothy Bishop pointed out that no corrections for multiple comparisons were done. This refers to the fact that if you run many tests, you expect some of them to come out positive by chance alone. Hence, if you run many tests, you need to correct their outcomes to keep the false positive rate in check. This is a very sensible argument, and it is particularly important if you just keep throwing tests at your data.
The authors note: “We did not apply a correction for multiple comparisons, following recommendations in the statistical literature (57,58), because this was a hypothesis-driven study motivated by previous literature.” (Methods section, under the heading “Statistical Analysis” on page 6.) What they’re essentially saying is this: “Look, we predicted some stuff a priori, without seeing the data, so we can be a bit more lax about false positives, OK?” That argument works to the extent that pre-registering your analytical plan limits the number of tests that you run, and thus, like a multiple comparisons correction would, reduces your false positive rate.
However, this does not necessarily reduce the false positive rate of your planned tests. So it might still be appropriate to correct!
Note that the cited references, 57 and 58, are to the following two papers. Judge for yourself if you feel that they make a compelling argument for not correcting for multiple comparisons:
- 57. Armstrong, R. A. When to use the Bonferroni correction. Ophthalmic Physiol. Opt. 34, 502–508 (2014).
- 58. Althouse, A. D. Adjust for multiple comparisons? It’s not that simple. Ann. Thorac. Surg. 101, 1644–1645 (2016).
Let’s do the math
The paper helpfully lists the tests that are crucial to their argument in Table 1, including the values we need for doing a quick Holm-Bonferroni correction for multiple comparisons. This correction is a great middle ground between being too liberal (no corrections, so high false positive rate) and too conservative (super stringent corrections, so high false negative rate). It works as follows:
- Rank order all p values, from low to high.
- Set “significance” thresholds starting at α/n, α/(n-1), α/(n-2), etc. all the way to α/1; where α is the significance level (usually 0.05), and n the number of tests within a family.
- Run along the p values, and see which is the first to be larger than the threshold.
- All p values below the first-larger-than-threshold are considered “statistically significant”, and all others are not.
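The steps above can be sketched in a few lines of Python. (This is my own illustration; the function name and structure are not from the paper.)

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm-Bonferroni: return a list of booleans, True for
    each p value that remains "significant" after correction."""
    n = len(p_values)
    # Rank order the p values from low to high, remembering positions.
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, i in enumerate(order):
        # Thresholds run from alpha/n (smallest p) up to alpha/1.
        if p_values[i] > alpha / (n - rank):
            break  # first p above its threshold: it and all larger ps fail
        significant[i] = True
    return significant
```

Note the step-down logic: as soon as one p value misses its threshold, everything above it is declared non-significant too, even if a later p value would happen to clear its own (more lenient) threshold.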
The main finding of the current paper is that APOE ε4 carriers are better at visual short-term memory than non-carriers. This hinges on the two primary outcomes of “identification error” and “localisation error”. The participants in this study did a memory test in which several fractal patterns were shown. After these had been off-screen for a bit, participants first had to identify which fractal they had seen before (wrong identification = “identification error”), and then they had to drag that fractal to its original place (distance between the original and response location = “localisation error”).
The outcomes are analysed as part of multivariable regressions that include not just APOE ε4 status, but also task characteristics (number of fractals and how long they were off-screen for) and the participants’ β-amyloid status, white matter hyperintensity volume, and hippocampal volume. These count as the predictors, and they are used to predict identification and localisation errors in separate regressions. We’ll consider each such multivariable regression a “family”, and thus correct within them.
Because there are 6 predictors, the Holm-Bonferroni significance thresholds are: 0.0083 (0.05/6), 0.0100 (0.05/5), 0.0125 (0.05/4), 0.0167 (0.05/3), 0.0250 (0.05/2), and 0.0500 (0.05/1).
Let’s check the p values for identification errors: <0.001, 0.026 (APOE ε4 status), 0.029, 0.320, 0.690, and 0.770. Only the very first is smaller than its threshold; the second, 0.026, already exceeds its threshold of 0.0100, so only the first value is “significant” after correction, and the rest are not. This means that there is no significant effect of APOE ε4 status on identification errors.
Let’s do the same for p values for localisation errors: <0.001, <0.001, 0.007 (APOE ε4 status), 0.410, 0.580, and 0.670. The first three values are lower than their thresholds, and thus “statistically significant” after correction. This means that there is an effect of APOE ε4 status on localisation errors.
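For anyone who wants to check my arithmetic, here is the same calculation in a short Python snippet. The p values are those reported in Table 1 of the paper; “<0.001” is entered as 0.001 purely for illustration (the paper only reports it as a bound, and 0.001 clears even the strictest threshold anyway).

```python
alpha = 0.05

# p values from Table 1, smallest first, per regression "family".
families = {
    "identification": [0.001, 0.026, 0.029, 0.320, 0.690, 0.770],
    "localisation": [0.001, 0.001, 0.007, 0.410, 0.580, 0.670],
}

results = {}
for name, ps in families.items():
    n = len(ps)
    survivors = 0
    for rank, p in enumerate(ps):
        # Holm-Bonferroni thresholds: alpha/n, alpha/(n-1), ..., alpha/1.
        if p > alpha / (n - rank):
            break  # step-down: stop at the first p above its threshold
        survivors += 1
    results[name] = survivors
    print(f"{name}: {survivors} of {n} tests survive correction")
```

Running this prints 1 surviving test for identification errors and 3 for localisation errors, matching the tallies above.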
A pre-emptive note on “families” of tests
What constitutes a “family of tests” is hard, subjective, and somewhat arbitrary. You could argue that ALL the tests here are in the same family; but you could also argue that only those tests that predict the same outcome data are (which I did here).
I suspect that comments on my little reassessment will home in on this. Perhaps at this point it is good to remember that the authors elected to use a less conservative approach because they were testing a priori hypotheses. Being very stringent is good when authors are digging through a dataset to see what they can find, as that is a situation with a high chance of false positives. When you do the same for a study with a pre-defined scope, where the risk of false positives is generally lower, you risk false negatives. This is a trade-off, and what the “right” approach is remains subjective and context-dependent.
Another pre-emptive note
This post is obviously written in a reasonably light tone, and it playfully uses the word “cancelled” to parody its use in certain corners of Twitter and the media. I have no doubt that everyone involved behaved in good faith: the first author is an early-career researcher who delivered a great piece of work with their co-authors, Prof Bishop asked an appropriate question, and the rest of the people probing on Twitter seemed to ask questions out of genuine interest. Nobody is a bad person here.
- Lu, K., Nicholas, J.M., Pertzov, Y. et al. (2021). Dissociable effects of APOE ε4 and β-amyloid pathology on visual working memory. Nature Aging. doi:10.1038/s43587-021-00117-4