A new paper from British psychologists David Shanks and colleagues will add to the growing sense of a "reproducibility crisis" in the field of psychology. The paper, Romance, Risk, and Replication, examines whether subtle reminders of 'mating motives' (i.e. sex) can make people more willing to spend money and take risks.

In 'romantic priming' experiments, participants are first 'primed', e.g. by reading a story about meeting an attractive member of the opposite sex. They are then given an ostensibly unrelated task, such as saying how much money they would be willing to spend on a new watch.

There have been many published studies of romantic priming (43 experiments across 15 papers, according to Shanks et al.), and the vast majority found statistically significant effects. The effect would appear to be reproducible! But in the new paper, Shanks et al. report that they tried to replicate these effects in eight experiments, with a total of over 1,600 participants, and came up with nothing. Romantic priming had no effect.

So what happened? Why do the replication results differ so much from those of the original studies? The answer is rather depressing, and it lies in a graph plotted by Shanks et al. It is a funnel plot: a scatter plot in which each point represents one previously published study. The effect size reported by each study is plotted against the standard error of that effect size - essentially, the precision of the result, which is mostly determined by the sample size.
This particular plot is a statistical smoking gun. It suggests that the positive results from the original studies (black dots) were probably the product of p-hacking: chance findings, selectively published because they were positive.

Here's why. In theory, the points in a funnel plot should form a 'funnel' - a triangle pointing straight up. The more precise studies at the top should show less spread than the noisier estimates below, but both should converge on the same average effect size.

In this plot, however, the black dots form a 'funnel' that is seriously tilted to the left. The trend line through these points is a diagonal (the red line). In other words, the more precise studies tended to find smaller romantic priming effects. The bigger the study, the smaller the priming effect.

In fact, the diagonal red trend line closely tracks the line at which an effect stops being statistically significant at p < 0.05 - marked as the outer edge of the grey triangle on the plot. Another way of putting this is that p values just below 0.05 are overrepresented: the published results "hug" the p = 0.05 significance line. Each study tended to report an effect just strong enough to be statistically significant. It is very difficult to see how such a pattern could arise - except through bias.

Shanks et al. say that this is evidence of "either p-hacking in previously published studies or selective publication of results (or both)." These two forms of bias go hand in hand, so the answer is probably both. Publication bias is the tendency of scientists (including peer reviewers and editors) to prefer positive results over negative ones; p-hacking is the process by which scientists maximize their chances of obtaining positive results in the first place. I've been blogging about these issues for years, yet I was still taken aback by how dramatic the bias is in this case.
The studies are like a torrent, rolling down the mountain of significance. The image is not so much a funnel plot as an avalanche plot.
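To see how selective publication alone can generate exactly this pattern, here is a minimal simulation (my own sketch, not the authors' code). The true effect is set to zero, yet the "published" studies hug the significance boundary, with noisier (smaller) studies reporting bigger effects:

```python
import math
import random

random.seed(1)

# Hypothetical simulation: the true effect is zero, but only experiments
# that reach significance in the positive direction get "published".
published = []  # (effect estimate, standard error) pairs
for _ in range(2000):
    n = random.choice([20, 40, 80, 160])   # per-group sample size
    se = math.sqrt(2.0 / n)                # SE of a two-group mean difference (SD = 1)
    effect = random.gauss(0.0, se)         # estimate under a true effect of zero
    if effect / se > 1.96:                 # roughly p < .05, positive direction
        published.append((effect, se))

# By construction, every published effect sits just beyond the 1.96 * SE
# significance boundary, so a funnel plot of these points hugs the p = 0.05
# line: the larger the SE (the smaller the study), the larger the effect.
mean_se = sum(se for _, se in published) / len(published)
noisy = [e for e, se in published if se > mean_se]
precise = [e for e, se in published if se <= mean_se]
print(f"mean published effect, noisy studies:   {sum(noisy) / len(noisy):.2f}")
print(f"mean published effect, precise studies: {sum(precise) / len(precise):.2f}")
```

Nothing in this sketch requires fraud: an honest literature filtered through a p < 0.05 publication gate produces the tilted funnel all by itself.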
Taken together with the negative results of the eight replication experiments that Shanks et al. conducted, the funnel plot suggests that romantic priming doesn't exist, and that the many studies that did report the effect were wrong. This doesn't mean that the previous romantic priming researchers were consciously trying to deceive anyone by publishing results they knew to be false. In my view, they were probably led astray by their own cognitive biases, helped along by science's current culture of 'positive results or bust'. This system can produce replicated positive results out of nowhere. I don't think this is a sustainable way of doing research. Reform is needed.
Shanks DR, Vadillo MA, Riedel B, Clymo A, Govind S, Hickin N, Tamman AJ, & Puhlmann LM (2015). Romance, Risk, and Replication: Can Consumer Choices and Risk-Taking Be Primed by Mating Motives? Journal of Experimental Psychology: General. PMID: 26501730