Model, Philosophy

Against Novelty Culture

So, there’s this thing that happens in certain intellectual communities, like (to give a totally random example) social psychology. This thing is that novel takes are rewarded. New insights are rewarded. Figuring out things that no one has before is rewarded. The high-status people in such a community are the ones who come up with and disseminate many new insights.

On the face of it, this is good! New insights are how we get penicillin and flight and Pad Thai burritos. But there’s one itty bitty little problem with building a culture around it.

Good (and correct!) new ideas are a finite resource.

This isn’t news. Back in 2005, John Ioannidis laid out the case for “most published research findings” being false. It turns out that when you have only a small chance of coming up with a correct idea, even statistical tests designed to screen out false positives can break down.

A quick example. There are approximately 25,000 genes in the human genome. Imagine you are searching for genes that increase the risk of schizophrenia (chosen for this example because it is a complex condition believed to be linked to many genes). If there are 100 genes involved in schizophrenia, the odds of any given gene chosen at random being involved are 1 in 250. You, the investigating scientist, decide that you want about an 80% chance of finding some genes that are linked (this is called study power and 80% is a common value). You run a bunch of tests, analyze a bunch of DNA, and think you have a candidate. This gene has been “proven” to be associated with schizophrenia at a p=0.05 significance level.

(A p-value is the probability of observing an event at least as extreme as the observed one, if the null hypothesis is true. This means that if the gene isn’t associated with schizophrenia, there is only a 1 in 20 chance – 5% – we’d see a result as extreme or more extreme than the one we observed.)

At the start, we had a 1 in 250 chance of finding a gene. Now that we have a gene, we think there’s a 19 in 20 chance that it’s actually partially responsible for schizophrenia (technically, if we looked at multiple candidates, we should do something slightly different here, but many scientists still don’t, making this still a valid example). Which probability do we trust?

There’s actually an equation to figure it out. It’s called Bayes’ Rule and statisticians and scientists use it to update probabilities in response to new information. It goes like this:

P(A|B) = P(B|A) × P(A) / P(B)

(You can sing this to the tune of Hallelujah; take P of B when given A / times P of A a priori / divide the whole thing by B’s expectation / new evidence you may soon find / but you will not be in a bind / for you can add it to your calculation.)

In plain language, it means that the probability of something being true after an observation (P(A|B)) is equal to the probability of it being true absent any observations (P(A), 1 in 250 here), times the probability of the observation happening if it is true (P(B|A), 0.8 here), divided by the baseline probability of the observation (P(B), roughly 1 in 20 here, since nearly all positive results come from genes that aren’t involved).

With these numbers from our example, we can see that the probability of a gene actually being associated with schizophrenia when its result is significant at the 0.05 level is… 6.4%.

I took this long detour to illustrate a very important point: one of the strongest determinants of how likely something is to actually be true is the base chance it has of being true. If we expected 1000 genes to be associated with schizophrenia, then the base chance would be 1 in 25, and the probability our gene actually plays a role would jump up to 64%.
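The arithmetic above can be checked with a short sketch. The function names are my own, and the "exact" version is an addition rather than something from the example: it expands P(B) with the law of total probability instead of using the 1-in-20 shortcut. Under that assumption the shortcut holds up well when the base rate is tiny, but it overstates the posterior as true positives become a larger share of all positives. The point about base rates survives either way.

```python
def posterior_shortcut(prior, power, alpha):
    """Bayes' Rule with P(B) approximated by the false positive rate alpha."""
    return power * prior / alpha


def posterior_exact(prior, power, alpha):
    """Bayes' Rule with P(B) = P(B|A)P(A) + P(B|not A)P(not A)."""
    p_b = power * prior + alpha * (1 - prior)
    return power * prior / p_b


TOTAL_GENES = 25_000  # approximate gene count used in the example
POWER = 0.8           # P(B|A): chance of a positive result for a real gene
ALPHA = 0.05          # chance of a positive result for an uninvolved gene

for involved in (100, 1000):
    prior = involved / TOTAL_GENES
    shortcut = posterior_shortcut(prior, POWER, ALPHA)
    exact = posterior_exact(prior, POWER, ALPHA)
    print(f"{involved} genes: shortcut {shortcut:.1%}, exact {exact:.1%}")
# 100 genes: shortcut 6.4%, exact 6.0%
# 1000 genes: shortcut 64.0%, exact 40.0%
```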

To have ten times the chance of getting a study right, you can be 10 times more selective (which probably requires much more than ten times the effort)… or you can investigate something ten times as likely to actually occur. Base rates can be more powerful than statistics, more powerful than arguments, and more powerful than common sense.

This suggests that any community that bases status around producing novel insights will mostly become a community based around producing novel-seeming (but false!) insights once it exhausts all of the available true (and easily attainable) insights it could discover. There isn’t a harsh dividing line, just a gradual trend towards plausible nonsense as the underlying vein of truth is mined out, but the studies and blog posts continue.

Except the reality is probably even worse, because any competition for status in such a community (tenure, page views) will become an iterative process that rewards those best able to come up with plausible sounding wrappers on unfortunately false information.

When this happens, we have people publishing studies with terrible analyses but highly sharable titles (anyone remember the himmicanes paper?), with the people at the top calling anyone who questions their shoddy research “methodological terrorists”.

I know I have at least one friend who is rolling their eyes right now, because I always make fun of the reproducibility crisis in psychology.

But I’m just using that because it’s a convenient example. What I’m really worried about is the Effective Altruism community.

(Effective Altruism is a movement that attempts to maximize the good that charitable donations can do by encouraging donation to the charities that have the highest positive impact per dollar spent. One list of highly effective charities can be found on GiveWell; GiveWell has demonstrated a noted trend away from novelty such that I believe this post does not apply to them.)

We are a group of people with countless forums and blogs, as well as several organizations devoted to analyzing the evidence around charity effectiveness. We have conventional organizations, like GiveWell, coexisting with less conventional alternatives, like Wild-Animal Suffering Research.

All of these organizations need to justify their existence somehow. All of these blogs need to get shares and upvotes from someone.

If you believe (like I do) that the number of good charity recommendations might be quite small, then it follows that a large intellectual ecosystem will quickly exhaust these possibilities and begin finding plausible sounding alternatives.

I find it hard to believe that this isn’t already happening. We have people claiming that giving your friends cash or buying pizza for community events is the most effective charity. We have discussions of whether there is suffering in the fundamental particles of physics.

Effective Altruism is as much a philosophy movement as an empirical one. It isn’t always the case that we’ll be using P-values and statistics in our assessment. Sometimes, arguments are purely moral (like arguments about how much weight we should give to insect suffering). But both types of arguments can eventually drift into plausible sounding nonsense if we exhaust all of the real content.

There is no reason to expect that we should be able to tell when this happens. Certainly, experimental psychology wasn’t able to until several years after much-hyped studies more-or-less stopped replicating, despite a population that many people would have previously described as full of serious-minded empiricists. Many psychology researchers still won’t admit that much of the past work needs to be revisited and potentially binned.

This is a problem of incentives, but I don’t know how to make the incentives any better. As a blogger (albeit one who largely summarizes and connects ideas first broached by others), I can tell you that many of the people who blog do it because they can’t not write. There are always going to be people competing to get their ideas heard, and the people who most consistently provide satisfying insights will most often end up with more views.

Therefore, I suggest caution. We do not know how many true insights we should expect, so we cannot tell how likely to be true anything that feels insightful actually is. Against this, the best defense is highly developed scepticism. Always remember to ask for implications of new insights and to determine what information would falsify them. Always assume new insights have a low chance of being true. Notice when there seems to be a pressure to produce novel insights long after the low hanging fruit is gone and be wary of anyone in that ecosystem.

We might not be able to change novelty culture, but we can do our best to guard against it.

[Special thanks to Cody Wild for coming up with most of the lyrics to Bayesian Hallelujah.]

Advice, Model

Improvement Without Superstition

[7 minute read]

When you make continuous, incremental improvements to something, one of two things can happen. You can improve it a lot, or you can fall into superstition. I’m not talking about black cats or broken mirrors, but rather humans becoming addicted to whichever steps were last seen to work, instead of whichever steps produce their goal.

I’ve seen superstition develop first hand. It happened in one of the places you might least expect it – in a biochemistry lab. In the summer of 2015, I found myself trying to understand which mutants of a certain protein were more stable than the wildtype. Because science is perpetually underfunded, the computer that drove the equipment we were using was ancient and frequently crashed. Each crash wiped out an hour or two of painstaking, hurried labour and meant we had less time to use the instrument to collect actual data. We really wanted to avoid crashes! Therefore, over the course of that summer, we came up with about 12 different things to do before each experiment (in sequence) to prevent them from happening.

We were sure that 10 out of the 12 things were probably useless; we just didn’t know which ten. There may have been no good reason that opening the instrument, closing it, then opening it again to load our sample would prevent computer crashes, but as far as we could tell, when we did that, the machine crashed far less. It was the same for the other eleven. More self-aware than I, the graduate student I worked with joked to me: “this is how superstitions get started” and I laughed along. Until I read two articles in The New Yorker.

In The Score (How Childbirth Went Industrial), Dr. Atul Gawande talks about the influence of the Apgar score on childbirth. Through a process of continuous competition and optimization, doctors have found out ways to increase the Apgar scores of infants in their first five minutes of life – and how to deal with difficult births in ways that maximize their Apgar scores. The result of this has been a shocking (six-fold) decrease in infant mortality. And all of this is despite the fact that according to Gawande, “[in] a ranking of medical specialties according to their use of hard evidence from randomized clinical trials, obstetrics came in last. Obstetricians did few randomized trials, and when they did they ignored the results.”

Similarly, in The Bell Curve (What happens when patients find out how good their doctors really are), Gawande found that the differences between the best CF (cystic fibrosis) treatment centres and the rest turned out to hinge on how closely each centre stuck to the guidelines established by big clinical trials, though not in the direction you might expect. That is to say, those that followed the accepted standard of care to the letter had much lower survival rates than those that hared off after any potentially lifesaving idea.

It seems that obstetricians and CF specialists were able to get incredible results without too much in the way of superstitions. Even things that look at first glance to be minor superstitions often turned out not to be. For example, when Gawande looked deeper into a series of studies that showed forceps were as good as or better than Caesarian sections, he was told by an experienced obstetrician (who was himself quite skilled with forceps) that these trials probably benefitted from serious selection effects (in general, only doctors particularly confident in their forceps skills volunteer for studies of them). If forceps were used on the same industrial scale as Caesarian sections, that doctor suspected that they’d end up worse.

But I don’t want to give the impression that there’s something about medicine as a field that allows doctors to make these sorts of improvements without superstition. In The Emperor of all Maladies, Dr. Siddhartha Mukherjee spends some time talking about the now discontinued practices of “super-radical” mastectomy and “radical” chemotherapy. In both treatments, doctors believed that if some amount of a treatment was good, more must be better. And for a while, it seemed better. Cancer survival rates improved after these procedures were introduced.

But randomized controlled trials showed that there was no benefit to those invasive, destructive procedures beyond that offered by their less-radical equivalents. Despite this evidence, surgeons and oncologists clung to these treatments with an almost religious zeal, long after they should have given up and abandoned them. Perhaps they couldn’t bear to believe that they had needlessly poisoned or maimed their patients. Or perhaps the superstition was so strong that they felt they were courting doom by doing anything else.

The simplest way to avoid superstition is to wait for large scale trials. But from both Gawande articles, I get a sense that matches with anecdotal evidence from my own life and that of my friends. It’s the sense that if you want to do something, anything, important – if you want to increase your productivity or manage your depression/anxiety, or keep CF patients alive – you’re likely to do much better if you take the large scale empirical results and use them as a springboard (or ignore them entirely if they don’t seem to work for you).

For people interested in nootropics, melatonin, or vitamins, there are self-blinding trials, which provide many of the benefits of larger trials without the wait. But for other interventions, it’s very hard to effectively blind yourself. If you want to see if meditation improves your focus, for example, then you can’t really hide the fact that you meditated on certain days from yourself [1].

When I think about how far from the established evidence I’ve gone to increase my productivity, I worry about the chance I could become superstitious.

For example, trigger-action plans (TAPs) have a lot of evidence behind them. They’re also entirely useless to me (I think because I lack a visual imagination with which to prepare a trigger) and I haven’t tried to make one in years. The Pomodoro method is widely used to increase productivity, but I find I work much better when I cut out the breaks entirely – or work through them and later take an equivalent amount of time off whenever I please. I use pomos only as a convenient, easy to Beemind measure of how long I worked on something.

I know modest epistemologies are supposed to be out of favour now, but I think it can be useful to pause, reflect, and wonder: when is one like the doctors saving CF patients and when is one like the doctors doing super-radical mastectomies? I’ve written at length about the productivity regime I’ve developed. How much of it is chaff?

It is undeniable that I am better at things. I’ve rigorously tracked the outputs on Beeminder and the graphs don’t lie. Last year I averaged 20,000 words per month. This year, it’s 30,000. When I started my blog more than a year ago, I thought I’d be happy if I could publish something once per month. This year, I’ve published 1.1 times per week.

But people get better over time. The uselessness of super-radical mastectomies was masked by other cancer treatments getting better. Survival rates went up, but when the accounting was finished, none of that was to the credit of those surgeries.

And it’s not just uselessness that I’m worried about, but also harm; it’s possible that my habits have constrained my natural development, rather than promoting it. This has happened in the past, when poorly chosen metrics made me fall victim to Campbell’s Law.

From the perspective of avoiding superstition: even if you believe that medicine cannot wait for placebo controlled trials to try new, potentially life-saving treatments, surely you must admit that placebo controlled trials are good for determining which things aren’t worth it (take as an example the very common knee surgery, arthroscopic partial meniscectomy, which has repeatedly performed no better than sham surgery when subjected to controlled trials).

Scott Alexander recently wrote about an exciting new antidepressant failing in Stage I trials. When the drug was first announced, a few brave souls managed to synthesize some. When they tried it, they reported amazing results, results that we now know to have been placebo. Look. You aren’t getting an experimental drug synthesized and trying it unless you’re pretty familiar with nootropics. Is the state of self-experimentation really that poor among the nootropics community? Or is it really hard to figure out if something works on you or not [2]?

Still, reflection isn’t the same thing as abandoning the inside view entirely. I’ve been thinking up heuristics since I read Dr. Gawande’s articles; armed with these, I expect to have a reasonable shot at knowing when I’m at risk of becoming superstitious. They are:

  • If you genuinely care only about the outcome, not the techniques you use to attain it, you’re less likely to mislead yourself (beware the person with a favourite technique or a vested interest!).
  • If the thing you’re trying to improve doesn’t tend to get better on its own and you’re only trying one potentially successful intervention at a time, fewer of your interventions will turn out to be superstitions and you’ll need to prune less often (much can be masked by a steady rate of change!).
  • If you regularly abandon sunk costs (“You abandon a sunk cost. You didn’t want to. It’s crying.”), superstitions do less damage, so you can afford to spend less mental effort on avoiding them.
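
The second heuristic, about steady improvement masking useless interventions, can be illustrated with a small sketch. Everything here is hypothetical: the monthly word counts are made up, the intervention is assumed to do nothing at all, and fitting a straight line to the "before" period is just one simple way to account for a background trend, not a recommended analysis.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def naive_effect(before, after):
    """Difference in means, ignoring any background trend."""
    return sum(after) / len(after) - sum(before) / len(before)


def detrended_effect(before, after):
    """Mean gap between 'after' values and the extrapolated 'before' trend."""
    slope, intercept = fit_line(list(range(len(before))), before)
    start = len(before)
    predicted = [slope * (start + i) + intercept for i in range(len(after))]
    return sum(a - p for a, p in zip(after, predicted)) / len(after)


# Hypothetical data: output rises by 1,000 words a month on its own, and a
# useless habit is adopted at month 6.
words = [20_000 + 1_000 * month for month in range(12)]
before, after = words[:6], words[6:]

print(naive_effect(before, after))      # 6000.0, credited to the habit
print(detrended_effect(before, after))  # approximately 0 once the trend is removed
```

The naive comparison hands the habit a 6,000 word improvement it did nothing to earn, which is the same accounting error that flattered super-radical mastectomies.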

Finally, it might be that you don’t care that some effects are placebo, so long as you get them and get them repeatedly. That’s what happened with the experiment I worked on that summer. We knew we were superstitious, but we didn’t care. We just needed enough data to publish. And eventually, we got it.

[Special thanks go to Tessa Alexanian, who provided incisive comments on an earlier draft. Without them, this would be very much an incoherent mess. This was cross-posted on Less Wrong 2.0 and as of the time of posting it here, there’s at least one comment over there.]

Footnotes:

[1] Even so, there are things you can do here to get useful information. For example, you could get in the habit of collecting information on yourself for a month or so (like happiness, focus, etc.), then try several combinations of interventions you think might work (e.g. A, B, C, AB, BC, CA, ABC, then back to baseline) for a few weeks each. Assuming that at least one of the interventions doesn’t work, you’ll have a placebo to compare against. Just be sure to correct any results for multiple comparisons.

[2] That people still buy anything from HVMN (after they rebranded themselves in what might have been an attempt to avoid a study showing their product did no better than coffee) actually makes me suspect the latter explanation is true, but still.
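
As a rough sketch of the multiple comparisons correction mentioned in footnote [1], here is a Bonferroni adjustment applied to the seven combinations. The p-values are invented for illustration, and I’m assuming each combination is compared against baseline. Bonferroni is only one option (and a conservative one); the footnote doesn’t specify a method.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag results that stay significant after dividing alpha by the number of tests."""
    threshold = alpha / len(p_values)
    return {label: p < threshold for label, p in p_values.items()}


# Hypothetical p-values, one per combination of interventions vs. baseline.
p_values = {
    "A": 0.04, "B": 0.20, "C": 0.004,
    "AB": 0.03, "BC": 0.50, "CA": 0.01, "ABC": 0.006,
}

significant = bonferroni(p_values)
print([label for label, keep in significant.items() if keep])  # ['C', 'ABC']
```

With seven comparisons the bar drops from 0.05 to about 0.007, so A, AB and CA, which would each have looked like a win on their own, no longer count.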