Better visual working memory for APOE ε4 carriers: #cancelled? We do the math!

A new paper claims that APOE ε4 carriers, a group previously described as having poorer long-term memory, actually have better visual working memory. On Twitter, Prof Dorothy Bishop raised the issue of corrections for multiple comparisons, or rather the lack thereof… Should the paper be #cancelled?!

TL;DR: No.


There is a lot of research on the apolipoprotein E (APOE) gene’s three main alleles: ε2, ε3, and ε4. Out of these, ε3 is basic, you don’t really hear a lot about ε2, and ε4 is a bit extra: it is associated with a higher risk of Alzheimer’s, and with earlier dementia onset. This also means that the APOE ε4 carriers who will end up being diagnosed with dementia are more likely to have subtle cognitive deficits during the decades before their diagnosis. (Their brains are already impacted, but not to a high enough degree to notice or meet diagnostic criteria.)

Some people have thus asked: “C’mon, evolution, why have you not gotten rid of APOE ε4 if it’s such a bad allele?” One idea is that APOE ε4 also confers certain benefits that could outweigh some of its downsides. In line with this approach, Kirsty Lu and co-authors attempted to look for cognitive benefits in APOE ε4 carriers, and reported on it in an article in Nature Aging (see the full reference at the bottom of this post).


Prof Masud Husain (5th out of 23 authors, if I counted correctly) tweeted about the new paper, and Twitter was quick to respond. Two main pieces of criticism were offered: 1) How is APOE ε4 actually selected against? 2) You didn’t do a correction for making many statistical comparisons!

Full disclosure: Masud was my PhD supervisor, and this caught my attention because it’s not often that you see your PhD supervisor get cancelled on Twitter!

The first is easy: Poorer long-term memory is an obvious downside that can subtly affect survival. (For example through forgetting where the good bits in your environment are, like food and shelter. Think how good it is for elephants to have an amazing memory for these things.) People on Twitter seemed to think that it didn’t matter in old age. This glosses over potential earlier downsides, and it is reminiscent of the “grandmother hypothesis”. This is an answer to the question “Why don’t women just die after they’re no longer fertile?” (Astute observers will note that the existence of this question is an excellent argument for why we need more women in evolutionary psychology.) Turns out, having grandparents around increases the survival chances of their offspring and their offspring’s offspring. Thus, grandparents who are not dead, suffering from dementia, or just very forgetful could ultimately confer an evolutionary benefit to their own offspring’s survival.

Now, the second point is more complicated. Prof Dorothy Bishop pointed out that no multiple comparisons corrections were done. This refers to the fact that if you do many tests, you expect some of them to come out positive by chance. Hence, if you do many tests, you need to correct their outcomes to avoid false positives. This is a very sensible argument, and particularly important if you just keep throwing tests at your data.
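To get a feel for how quickly chance positives pile up, here is a minimal sketch. It assumes that all tests are independent and that every null hypothesis is true, which is a simplification rather than a description of the paper's data. The function name is my own.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests,
    assuming every null hypothesis is true (a simplifying assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Uncorrected tests at alpha = 0.05, for a few family sizes.
for n in (1, 6, 12):
    print(f"{n:2d} tests: {familywise_error_rate(0.05, n):.3f}")
```

With six uncorrected tests the chance of at least one false positive is already roughly one in four under these assumptions.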

The authors note: “We did not apply a correction for multiple comparisons, following recommendations in the statistical literature (57,58), because this was a hypothesis-driven study motivated by previous literature.” (Methods section, under the heading “Statistical Analysis” on page 6.) What they’re essentially saying is this: “Look, we predicted some stuff a priori, without seeing the data, so we can be a bit more lax about false positives, OK?” That argument works to the extent that fixing your analysis plan in advance should, like a multiple comparisons correction, reduce your false positive rate; in this case by reducing the number of tests that you run.

However, this does not necessarily reduce the false positive rate of your planned tests. So it might still be appropriate to correct!

Note that the cited references, 57 and 58, are to the following two papers. Judge for yourself if you feel that they make a compelling argument for not correcting for multiple comparisons:

  • 57. Armstrong, R. A. When to use the Bonferroni correction. Ophthalmic Physiol. Opt. 34, 502–508 (2014).
  • 58. Althouse, A. D. Adjust for multiple comparisons? It’s not that simple. Ann. Thorac. Surg. 101, 1644–1645 (2016).

Let’s do the math

The paper helpfully lists the tests that are crucial to their argument in Table 1, including the values we need for doing a quick Holm-Bonferroni correction for multiple comparisons. This correction is a great middle ground between being too liberal (no corrections, so high false positive rate) and too conservative (super stringent corrections, so high false negative rate). It works as follows:

  1. Rank order all p values, from low to high.
  2. Set “significance” thresholds starting at α/n, α/(n-1), α/(n-2), etc. all the way to α/1; where α is the significance level (usually 0.05), and n the number of tests within a family.
  3. Run along the ranked p values, comparing each with its own threshold, and find the first one that is larger than its threshold.
  4. All p values ranked before that first failure are considered “statistically significant”, and all others are not (even if a later one happens to fall below its own threshold).
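The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name and the example p values are made up for the purpose.

```python
from typing import List


def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """For each p value (in input order), report whether it survives correction."""
    n = len(p_values)
    # Step 1: indices that rank the p values from low to high.
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, idx in enumerate(order):
        # Step 2: thresholds run alpha/n, alpha/(n-1), ..., alpha/1.
        threshold = alpha / (n - rank)
        # Step 3: stop at the first p value that is larger than its threshold.
        if p_values[idx] > threshold:
            break
        # Step 4: everything ranked before that first failure is significant.
        significant[idx] = True
    return significant


# Made-up p values, purely for illustration (not from the paper).
print(holm_bonferroni([0.04, 0.001, 0.03, 0.012]))
# -> [False, True, False, True]
```

Note that 0.04 is below the final threshold of 0.05, but it still does not count: the procedure already stopped at 0.03, which exceeded its threshold of 0.025.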

The main finding of the current paper is that APOE ε4 carriers are better on visual short-term memory than non-carriers. This hinges on the specific primary outcomes of “identification error” and “localisation error”. The participants in this study did a memory test, in which several fractal patterns were shown. After these were off-screen for a bit, participants had to first identify which fractal they had seen before (wrong identification = “identification error”), and then they had to drag that fractal to its original place (distance between the original and response location = “localisation error”).

The outcomes are analysed as part of multivariable regressions that include not just APOE ε4 status, but also task characteristics (number of fractals and how long they were off-screen for) and the participants’ β-amyloid status, white matter hyperintensity volume, and hippocampal volume. These count as the predictors, and they are used to predict identification and localisation errors in separate regressions. We’ll consider each such multivariable regression a “family”, and thus correct within them.

Because there are 6 predictors, the Holm-Bonferroni significance thresholds are: 0.0083 (0.05/6), 0.0100 (0.05/5), 0.0125 (0.05/4), 0.0167 (0.05/3), 0.0250 (0.05/2), and 0.05 (0.05/1).
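If you want to check the thresholds yourself, they are just α divided by the number of tests still "in play" at each rank. A quick sketch (variable names are mine):

```python
alpha = 0.05
n_tests = 6  # six predictors in each regression

# Thresholds for rank 1 (smallest p value) through rank 6 (largest).
thresholds = [alpha / (n_tests - rank) for rank in range(n_tests)]
print([round(t, 4) for t in thresholds])
# -> [0.0083, 0.01, 0.0125, 0.0167, 0.025, 0.05]
```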

Let’s check the p values for identification errors: <0.001, 0.026 (APOE ε4 status), 0.029, 0.320, 0.690, and 0.770. Only the very first is smaller than its threshold, so only the first value is “significant” after correction, and the rest are not. This means that there is no significant effect of APOE ε4 status on identification errors.

Let’s do the same for p values for localisation errors: <0.001, <0.001, 0.007 (APOE ε4 status), 0.410, 0.580, and 0.670. The first three values are lower than their thresholds, and thus “statistically significant” after correction. This means that there is an effect of APOE ε4 status on localisation errors.
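Both checks can be reproduced with a short script. This is a sketch under one assumption: the paper reports some values only as “<0.001”, so I use 0.001 as a stand-in upper bound (the true values are smaller, which cannot change the outcome here). The helper names are my own.

```python
def holm_survivors(p_values, alpha=0.05):
    """Count how many ranked p values pass before the first failed threshold."""
    ranked = sorted(p_values)
    n = len(ranked)
    survivors = 0
    for rank, p in enumerate(ranked):
        if p > alpha / (n - rank):
            break
        survivors += 1
    return survivors


def survives(p, p_values, alpha=0.05):
    """True if p is ranked before the first failed threshold in its family."""
    return sorted(p_values).index(p) < holm_survivors(p_values, alpha)


# 0.001 stands in for values reported as "<0.001" (an assumption, see above).
identification = [0.001, 0.026, 0.029, 0.320, 0.690, 0.770]
localisation = [0.001, 0.001, 0.007, 0.410, 0.580, 0.670]

print("identification:", holm_survivors(identification), survives(0.026, identification))
print("localisation:  ", holm_survivors(localisation), survives(0.007, localisation))
# -> identification: 1 False
# -> localisation:   3 True
```

The 0.026 and 0.007 entries are the APOE ε4 status p values for the two outcomes.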

A pre-emptive note on “families” of tests

Deciding what constitutes a “family of tests” is hard, subjective, and somewhat arbitrary. You could argue that ALL the tests here are in the same family; but you could also argue that only those tests that predict the same outcome data are (which I did here).
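To illustrate how much this choice matters, here is a sketch of the stricter alternative: pooling all 12 tests into a single family. As before, 0.001 stands in for values reported as “<0.001”, and the function name is mine. This is one possible family definition, not necessarily the right one.

```python
ALPHA = 0.05

# 0.001 stands in for values reported as "<0.001" (an assumption).
identification = [0.001, 0.026, 0.029, 0.320, 0.690, 0.770]
localisation = [0.001, 0.001, 0.007, 0.410, 0.580, 0.670]


def holm_table(p_values, alpha=ALPHA):
    """Return (p, threshold, survives) rows in ascending order of p."""
    ranked = sorted(p_values)
    n = len(ranked)
    rows, still_passing = [], True
    for rank, p in enumerate(ranked):
        threshold = alpha / (n - rank)
        # Once one p value fails, nothing ranked after it can survive.
        still_passing = still_passing and p <= threshold
        rows.append((p, threshold, still_passing))
    return rows


pooled = holm_table(identification + localisation)
for p, threshold, survives in pooled:
    print(f"p={p:.3f}  threshold={threshold:.4f}  survives={survives}")
```

With 12 tests the strictest threshold drops from 0.0083 to about 0.0042, and the fourth-ranked value (0.007, APOE ε4 status on localisation errors) has to beat 0.05/9 ≈ 0.0056, which it does not. So the conclusion really does depend on what you count as a family.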

I suspect that comments on my little reassessment will home in on this. Perhaps at this point it is good to remember that the authors elected to use a less conservative approach because they were testing a priori hypotheses. Being very stringent in a context of authors digging through a dataset to see what they can find is good, as that is a situation with a high chance of false positives. When doing the same for a study with a pre-defined scope, where the risk of false positives is generally lower, you risk false negatives. This is a trade-off, and what the “right” approach is, is subjective and context-dependent.

Another pre-emptive note

This post is obviously written in a reasonably light tone, and playfully uses the word “cancelled” to parody its use in certain corners of Twitter and the media. I have no doubt that everyone involved behaved in good faith: The first author is an early-career researcher who delivered a great piece of work with their co-authors, Prof Bishop asked an appropriate question, and the rest of the people probing on Twitter seemed to ask questions out of genuine interest. Nobody is a bad person here.


  • Lu, K., Nicholas, J.M., Pertzov, Y. et al. (2021). Dissociable effects of APOE ε4 and β-amyloid pathology on visual working memory. Nature Aging. doi:10.1038/s43587-021-00117-4


  1. Thanks for the rapid response.
    I don’t like to be snarky, especially to my colleagues, but I have a v. low prior for associations between common variants & cognition, for reasons explained here:

    I note that previous papers found effects with same task, but it’s hard to compare across papers: e.g. one paper found in men only, another found interaction with duration. Of course, it is OK to omit multiple comparison correction if you have a specific prediction, but it wasn’t too clear to me from prior literature which dependent variable(s) were predicted to be superior in the APOE e4 group. Or whether an e4 dosage effect was predicted.
    A preregistered prediction would make me much less sceptical.

    Re evolution, I’m also dubious about the grandmother hypothesis: do you know if anyone has modelled this? My intuition is that the impact of a grandmother on survival in our ancestors would need to be implausibly large to affect persistence, or not, of APOE e4. But I’m willing to be proved wrong on this.

    • Thanks for your reply!

      I completely agree with your prior on common variants and cognition. I also hope that it was clear enough from the post that I think your question was a very valid one! (I do enjoy writing these things with a bit of extra spice, to parody the polemic dynamic that Twitter often forces dialogue into; there is a little disclaimer on the end of the post to avoid people accidentally taking it too seriously!)

      I’m no expert on APOE e4, and really just wanted to do what you suggested the authors do: Apply a correction for multiple comparisons on their main finding. It serves as a nice check (that I almost always do when reviewing papers), and in this case it offers a nice practical example of how people can do a quick back-of-the-envelope Holm-Bonferroni correction on published findings.

      As for the grandmother hypothesis: I have complex feelings about it. On the one hand, I strongly dislike the thought process behind it, which seems to be “Women’s only function is to create babies; so why do they not drop dead after that function is no longer served?” There is obvious misogyny behind asking this but not asking the same about men, and there is thinly veiled misogyny behind the fact that the age-related decline of men’s fertility is highly under-researched because we just assume they’re fine. Regardless, there is a long history of research on the question, and there is mixed evidence. You asked about simulations, so I’m highlighting one of those with somewhat mixed-results: Kachel and colleagues (2010) find that the presence of grandparents could help reduce interbirth intervals, but do note that this isn’t unique to grandparental help. They’re not convinced it massively elongates life expectancy, but do highlight other potential effects.

      In general, I think evolution and the role of social structures is extremely complex, and that our simulations on the matter are limited because our picture of it is. Your linked blogpost highlights the notion that things that happen after successful reproduction can’t impact selection, but this can’t be true if one can still impact the survival chances of those children (and their children). An extreme example: Imagine an allele that turns carriers into a hungry gremlin when their fertility runs out: They would eat all their children well after they were born, and thus severely limit their chances of passing on the hungry-gremlin alleles. I imagine people making APOE e4 arguments have something similar in mind: carriers have a higher chance to turn demented and a poorer memory, and could thus help their offspring less than non-carriers would. It’s not quite the same as eating their offspring (and their offspring’s offspring), but ultimately the logic is the same.

  2. i mean sorry but since when are people asking important questions about the paper and data equal to being cancelled? sounds a little bit of an overly dramatic qualification to me for something that is just a regular part of scientific dialogue
