Model, Quick Fix

When QALYs Are Wrong – Thoughts on the Gates Foundation

Every year, I check in to see if we’ve eradicated polio or guinea worm yet. Disease eradications are a big deal. We’ve only ever successfully eradicated one human disease – smallpox – so being so close to wiping out two more is very exciting.

Still, when I looked at how many resources were committed to polio eradication (especially by the Gates Foundation), I noticed that they seemed out of proportion to its effects. No polio eradication effort can be found among GiveWell’s top charities, because it is currently rather expensive to prevent polio. The number of quality-adjusted life years (QALYs, a common measure of charity effectiveness used in the Effective Altruism community) you can save with a donation to malaria prevention is simply higher than for polio.

I briefly wondered whether all of the effort going to polio eradication might be better spent on anti-malaria programs. After thinking some more, I’ve decided that this would be a grave mistake. Since I haven’t seen the reasoning explained anywhere else, I figured I’d share my thinking, so that anyone else having the same thought can see it.

A while back, it was much cheaper to buy QALYs with polio vaccines. As recently as 1988, there were more than 350,000 cases of polio every year. It’s a testament to the excellent work of the World Health Organization and its partners that polio has become so much rarer – and therefore that each new case has become so much more expensive to prevent. After all, when there are few new cases, you can’t prevent thousands.

It is obviously very good that there are few cases of polio. If we decided that this was good enough and diverted resources towards treating other diseases, we might quickly find that this would no longer be the case. Polio could once again become a source of easy QALY improvements – because it would be running rampant in unvaccinated populations. When phrased this way, I hope it’s clear that polio becoming a source of cheap QALY improvements isn’t a good thing; the existence of cheap QALY improvements means that we’ve dropped the ball on a potentially stoppable disease.

If polio is eradicated for good, we can stop putting any effort into fighting it. We won’t need any more polio vaccines or any more polio monitoring. It’s for this reason that we’re much better off if we finish the eradication effort.

What I hadn’t realized was that a simple focus on present QALYs obscures the potential effects our actions can have on future QALYs. Abandoning diseases until they once again become cheap sources of QALYs might look good for our short-term effectiveness, but in the long term, the greatest gains come from following through with our eradication efforts, so that we can repurpose all of the resources from an eradicated disease to the fight against another, forever.

Model, Philosophy

Against Novelty Culture

So, there’s this thing that happens in certain intellectual communities, like (to give a totally random example) social psychology. This thing is that novel takes are rewarded. New insights are rewarded. Figuring out things that no one has before is rewarded. The high-status people in such a community are the ones who come up with and disseminate many new insights.

On the face of it, this is good! New insights are how we get penicillin and flight and Pad Thai burritos. But there’s one itty bitty little problem with building a culture around it.

Good (and correct!) new ideas are a finite resource.

This isn’t news. Back in 2005, John Ioannidis laid out the case for “most published research findings” being false. It turns out that when you have only a small chance of coming up with a correct idea, even the statistical tests we use to screen out false positives can break down.

A quick example. There are approximately 25,000 genes in the human genome. Imagine you are searching for genes that increase the risk of schizophrenia (chosen for this example because it is a complex condition believed to be linked to many genes). If there are 100 genes involved in schizophrenia, the odds of any given gene chosen at random being involved are 1 in 250. You, the investigating scientist, decide that you want about an 80% chance of finding some genes that are linked (this is called study power, and 80% is a common value). You run a bunch of tests, analyze a bunch of DNA, and think you have a candidate. This gene has been “proven” to be associated with schizophrenia at the p = 0.05 significance level.

(A p-value is the probability of observing an event at least as extreme as the observed one, if the null hypothesis is true. This means that if the gene isn’t associated with schizophrenia, there is only a 1 in 20 chance – 5% – that we’d see a result as extreme or more extreme than the one we observed.)

At the start, we had a 1 in 250 chance of any given gene being involved. Now that we have a candidate gene, we think there’s a 19 in 20 chance that it’s actually partially responsible for schizophrenia (technically, if we looked at multiple candidates, we should do something slightly different here, but many scientists still don’t, so the example still stands). Which probability do we trust?

There’s actually an equation to figure it out. It’s called Bayes’ Rule, and statisticians and scientists use it to update probabilities in response to new information. It goes like this:
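P(A|B) = P(B|A) × P(A) / P(B)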

(You can sing this to the tune of Hallelujah; take P of A when given B / times P of A a priori / divide the whole thing by B’s expectation / new evidence you may soon find / but you will not be in a bind / for you can add it to your calculation.)

In plain language, it means that the probability of something being true after an observation (P(A|B)) is equal to the probability of it being true absent any observations (P(A), 1 in 250 here), times the probability of the observation happening if it is true (P(B|A), 0.8 here), divided by the baseline probability of the observation (P(B), 1 in 20 here).

With these numbers from our example, we can see that the probability of a gene actually being associated with schizophrenia when it clears the p = 0.05 threshold is… 6.4%.

I took this long detour to illustrate a very important point: one of the strongest determinants of how likely something is to actually be true is the base chance it has of being true. If we expected 1000 genes to be associated with schizophrenia, then the base chance would be 1 in 25, and the probability our gene actually plays a role would jump up to 64%.
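To make the arithmetic concrete, here is a minimal sketch (in Python, since the post itself has no code) of the calculation above. It uses the example’s own numbers and, like the example, treats the p = 0.05 cutoff as a stand-in for P(B):

```python
def posterior(prior, power, p_cutoff):
    # Bayes' Rule: P(A|B) = P(B|A) * P(A) / P(B)
    #   prior    -- P(A), the base chance the gene is truly involved
    #   power    -- P(B|A), the chance of detecting the gene if it is involved
    #   p_cutoff -- P(B), approximated by the p = 0.05 threshold, as in the example
    return power * prior / p_cutoff

# 100 schizophrenia genes out of ~25,000: prior of 1 in 250
print(posterior(1 / 250, 0.8, 0.05))  # 0.064 -> 6.4%

# 1,000 schizophrenia genes out of ~25,000: prior of 1 in 25
print(posterior(1 / 25, 0.8, 0.05))   # 0.64 -> 64%
```

Nothing changes between the two calls except the prior, which is exactly the point: the base rate does most of the work.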

To have ten times the chance of getting a study right, you can be ten times more selective (which probably requires much more than ten times the effort)… or you can investigate something ten times as likely to actually be true. Base rates can be more powerful than statistics, more powerful than arguments, and more powerful than common sense.

This suggests that any community that bases status around producing novel insights will mostly become a community based around producing novel-seeming (but false!) insights once it exhausts all of the available true (and easily attainable) insights it could discover. There isn’t a harsh dividing line, just a gradual trend towards plausible nonsense as the underlying vein of truth is mined out, but the studies and blog posts continue.

Except the reality is probably even worse, because any competition for status in such a community (tenure, page views) will become an iterative process that rewards those best able to come up with plausible sounding wrappers on unfortunately false information.

When this happens, we have people publishing studies with terrible analyses but highly sharable titles (anyone remember the himmicanes paper?), with the people at the top calling anyone who questions their shoddy research “methodological terrorists”.

I know I have at least one friend who is rolling their eyes right now, because I always make fun of the reproducibility crisis in psychology.

But I’m just using that because it’s a convenient example. What I’m really worried about is the Effective Altruism community.

(Effective Altruism is a movement that attempts to maximize the good that charitable donations can do by encouraging donation to the charities that have the highest positive impact per dollar spent. One list of highly effective charities can be found on GiveWell; GiveWell has demonstrated a notable trend away from novelty, such that I believe this post does not apply to them.)

We are a group of people with countless forums and blogs, as well as several organizations devoted to analyzing the evidence around charity effectiveness. We have conventional organizations, like GiveWell, coexisting with less conventional alternatives, like Wild-Animal Suffering Research.

All of these organizations need to justify their existence somehow. All of these blogs need to get shares and upvotes from someone.

If you believe (like I do) that the number of good charity recommendations might be quite small, then it follows that a large intellectual ecosystem will quickly exhaust these possibilities and begin finding plausible sounding alternatives.

I find it hard to believe that this isn’t already happening. We have people claiming that giving your friends cash or buying pizza for community events is the most effective charity. We have discussions of whether there is suffering in the fundamental particles of physics.

Effective Altruism is as much a philosophy movement as an empirical one. It isn’t always the case that we’ll be using p-values and statistics in our assessment. Sometimes, arguments are purely moral (like arguments about how much weight we should give to insect suffering). But both types of arguments can eventually drift into plausible sounding nonsense if we exhaust all of the real content.

There is no reason to expect that we should be able to tell when this happens. Certainly, experimental psychology wasn’t able to until several years after much-hyped studies more-or-less stopped replicating, despite a population that many people would have previously described as full of serious-minded empiricists. Many psychology researchers still won’t admit that much of the past work needs to be revisited and potentially binned.

This is a problem of incentives, but I don’t know how to make the incentives any better. As a blogger (albeit one who largely summarizes and connects ideas first broached by others), I can tell you that many of the people who blog do it because they can’t not write. There are always going to be people competing to get their ideas heard, and the people who most consistently provide satisfying insights will most often end up with more views.

Therefore, I suggest caution. We do not know how many true insights we should expect, so we cannot tell how likely anything that feels insightful actually is to be true. Against this, the best defense is highly developed scepticism. Always remember to ask for the implications of new insights and to determine what information would falsify them. Always assume new insights have a low chance of being true. Notice when there seems to be pressure to produce novel insights long after the low-hanging fruit is gone, and be wary of anyone in that ecosystem.

We might not be able to change novelty culture, but we can do our best to guard against it.

[Special thanks to Cody Wild for coming up with most of the lyrics to Bayesian Hallelujah.]