Replication Markets: Can You Predict Which Social Science Papers Will Replicate?

Discusses the mechanics of the Replication Markets platform; issues in prediction market and survey incentive design; and the (un)replicability of six social science fields.

What are Replication Markets?

Replication Markets are a type of prediction market, mechanisms that enable individuals to bet on the outcome of events. Just as financial markets efficiently determine the perceived value of a security by aggregating the beliefs of buyers and sellers (e.g., I think this stock will rise in price, so I will buy more), prediction markets determine the perceived likelihood of an event by aggregating the beliefs of many individuals. Both financial and prediction markets draw on the wisdom (or sometimes lunacy) of crowds and the power of financial incentives.

There are many demonstrated applications of prediction markets, like predicting political events (PredictIt enables betting on who will be the Democratic nominee in the 2020 US election).

Replication Markets are another such application, and they will be the focus of this article. Replication Markets aggregate predictions about which scientific claims will successfully replicate (i.e., demonstrate a statistically significant effect when the experiment is repeated). Replication markets for social science research have a track record of success, reaching upwards of 75% accuracy and sometimes even surpassing the predictive performance of experts (see this paper and this other paper, or the papers here, for example).

A DARPA project called SCORE (Systematizing Confidence in Open Research and Evidence) intends to use prediction markets to forecast whether social science research claims will replicate; 3,000 such claims will be in the market. The claims come “from social and behavioral science articles published in the last 10 years” (psychology, economics, education research, management and marketing, political science, and sociology and criminology).

There is a public website, Replication Markets (RM), that is crowdsourcing these predictions for SCORE. Nearly anyone can participate and there are financial rewards for top forecasters. (When I refer to RM, I’m referring to the contest the site is currently hosting.)

In this essay, I’ll explain how RM works, cover its market mechanisms and game theory, and walk through my own thought process and strategy.

How Replication Markets Works

What Counts as Replication?

I myself am participating in the RM game. Because I’m competitive, I want to keep score. Though the total financial rewards sum to over $100k, realistically the chance of any one person winning a lot of money is pretty low, so I’m doing this mostly for fun. Nonetheless, I’ll use money as my metric for success.

To win money, we need to predict replication. To do that, we need to know what the criterion for replication is.

According to Replication Markets, for each of the claims (i.e., the social science assertions) selected to be tested, SCORE’s data team “will run a single high-quality replication of each selected claim.” A replication will be deemed successful if they get a statistically significant result in the same direction as the original claim. Markets for claims are binary: ultimately, each claim will either replicate or not.

Format of the Game

Replication Markets is composed of ten four-week rounds, starting in September 2019 (plus a special Round 0 for meta-claims in August 2019, which I’ll address later). Each round has two components: markets and surveys, both of which are ways to aggregate many opinions and achieve the wisdom of crowds.

Surveys

Surveys are private evaluations (i.e., opinion polls) of the probability that a given claim is true. These evaluations are made a week prior to the market’s opening. Surveys come in batches of 10 claims (all from the same journal, I believe), and there are around 15 per round. Surveys pay out at the end of the following round, with the top four participants receiving \$80, \$40, \$20, and \$20, respectively.

RM isn’t specifying exactly how the surveys will be scored. Because they resolve a month after closing, well before the actual replication attempts, they can’t be scored against replication outcomes; they must be scored against some aggregate of participants’ forecasts. I emailed RM to find out about the scoring and was told this:

There is actually an extensive academic literature studying more sophisticated and in many ways superior methods of aggregating crowd forecasts. We are experimenting with such cutting edge methods. In other words, in the survey you are scored against our best judgment of the right forecast, having processed all the forecasts of other participants using our best aggregation techniques.

Personal Communication

Markets

Markets are the prediction market aspect of RM. Each participant is given a set number of points, which she can use to buy Yes or No shares for a claim. Markets “pay out proportionally to winning shares” and will resolve by mid-2020, once the replications have been attempted.

According to the rules, only 100 of the 3,000 claims (i.e., 3.33%) will be selected for direct replication and pay out (the data-replicated claims do not pay out). Each of these has a total payout of \$900, paid proportional to the number of winning shares in a resolved claim (e.g., I have 10 Yes shares for Claim X; the total number of Yes shares held by RM participants is 1,000; so, I am paid 10/1,000 of \$900, which equals \$9).
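
To make the payout arithmetic concrete, here’s a minimal sketch of the proportional payout described above. The function name and numbers are mine, just reproducing the worked example; this is not RM’s actual code.

```python
def market_payout(my_winning_shares, total_winning_shares, prize_per_claim=900.0):
    """Each resolved claim's $900 prize is split in proportion to winning shares held."""
    return prize_per_claim * my_winning_shares / total_winning_shares

# The worked example above: 10 winning 'Yes' shares out of 1,000 total.
print(market_payout(10, 1_000))  # 9.0, i.e., $9
```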

For the meta round, Round 0, “Meta-claims resolve as the proportion of successful direct replications, so a meta-claim resolving as 60% would pay 0.6 points per ‘Yes’ share and 0.4 points per ‘No’ share.” The total payout for this round is $360 for each of the 12 meta-claims.

Pricing and Game Theory for Markets

How Are Shares Priced?

Unlike traditional financial markets, which use a double-auction process, RM doesn’t require someone on the other end of the transaction willing to sell/buy at your price when you want to buy/sell a share of a claim. Rather, in RM you buy directly from an automated market maker using a pricing system developed by Robin Hanson: the Logarithmic Market Scoring Rules (LMSR).

LMSR made no sense to me when I first learned about it. And frankly, it doesn’t totally make sense to me now. Robin Hanson wrote a couple of papers on it, so look into them if you want to deeply understand it.

Since I’m not a finance guy, I don’t care enough to understand it deeply. Most of my knowledge I’ve gleaned from the equations here. These are, in my opinion, the important points:

  1. The price of buying shares at time t depends on the cost of the share at time t-1; the cost of the share changes after every trade.
  2. The cost of shares depends on how many ‘Yes’ and ‘No’ shares have already been bought. The market’s price for a claim, p, is determined by that balance of ‘Yes’ and ‘No’ shares, and p equals the market’s probability rating of the claim.
  3. ‘Yes’ shares get cheaper as p decreases whereas ‘No’ shares get more expensive, and vice versa.
  4. The relationship between p and the cost of a ‘Yes’ or ‘No’ share is essentially linear or slightly concave upward or downward, depending on whether you’re buying minority or majority shares (see the sketch below).
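
Here is a minimal sketch of the standard LMSR formulas to make points 1–4 concrete. The liquidity parameter b and RM’s exact implementation are unknown to me, so treat the numbers as illustrative only.

```python
import math

B = 100.0  # liquidity parameter; RM's actual value is unknown to me

def lmsr_cost(q_yes, q_no, b=B):
    """Hanson's LMSR cost function: C(q) = b * ln(e^(q_yes/b) + e^(q_no/b))."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

def lmsr_price(q_yes, q_no, b=B):
    """Instantaneous 'Yes' price p, which equals the market's probability for the claim."""
    e_yes, e_no = math.exp(q_yes / b), math.exp(q_no / b)
    return e_yes / (e_yes + e_no)

def cost_to_buy(q_yes, q_no, d_yes=0.0, d_no=0.0, b=B):
    """Points the market maker charges for buying d_yes 'Yes' and d_no 'No' shares."""
    return lmsr_cost(q_yes + d_yes, q_no + d_no, b) - lmsr_cost(q_yes, q_no, b)

# A market at 50% (equal shares outstanding): buying 'Yes' costs about 0.5/share and pushes p up.
print(round(lmsr_price(0, 0), 3))             # 0.5
print(round(cost_to_buy(0, 0, d_yes=10), 2))  # ~5.13 points for 10 'Yes' shares
print(round(lmsr_price(10, 0), 3))            # ~0.525: the price moved against you
```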

How to Win the Markets

The game-theoretic aspect of RM is that the price of a ‘Yes’ or ‘No’ share depends on the market’s evaluation of that claim, which can swing considerably. This presents the opportunity for arbitrage, especially in Round 0.

As one participant in RM explains:

If you buy “yes” when the market is at 30% then sell when it is at 50% you end up with more points. Of course, points are only useful for buying shares in other markets, so you should only “cash out profits” if you think there’s another market that is further off than the one you’re cashing out of.

Davidoj, RM participant

Here’s an example of arbitrage, converting 23 ‘No’ shares to 36 ‘Yes’ shares (with the help of 2 other points, which buy about 6 of the 36 shares):

Round 0

For Round 0, you must predict the percentage of claims that will replicate per field and per time period. There are 12 such “meta-claims”. Pay outs are based on the number of ‘Yes’ or ‘No’ shares you have for each of these 12 meta-claims.

To determine the proper strategy and expected payout, we must consider the following parameters:

  • N = the number of total RM participants (as of August 2019, the site claims 500 people have signed up; however, it seems only 100 or so people are actively participating in the markets; this bodes well, as the fewer participants, the greater your chances of winning.)
  • The number of points (participants start with 100)
  • The proportion of N that spend all their points (I’ll assume this is 100% to be conservative)

A toy example: assume that every meta-claim had a replication rate of 50%. This means that ‘Yes’ and ‘No’ shares are paid the same amount. The total prize money is \$4320 for these 12 claims. The total pool of points is 50,000, assuming N = 500. So, if everyone bought their shares at the exact same price, each point is worth about \$0.086. Given that everyone started with 100 points, each person earns around \$8.60. If we change our assumptions and let N = 100, each person earns around \$43.
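
Here is the same toy calculation in code, under the assumptions above (uniform prices, everyone fully invested, every meta-claim at 50%); the function is mine, just restating the arithmetic.

```python
def dollars_per_person(total_prize, n_participants, points_each=100):
    """If everyone buys at the same price, prize money is split evenly per point."""
    per_point = total_prize / (n_participants * points_each)
    return per_point, per_point * points_each

total_prize = 12 * 360  # $4,320 across the 12 meta-claims
for n in (500, 100):
    per_point, per_person = dollars_per_person(total_prize, n)
    print(n, round(per_point, 4), round(per_person, 2))
# 500 participants -> ~$0.0864 per point -> ~$8.64 each
# 100 participants -> ~$0.432 per point  -> ~$43.20 each
```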

A more realistic example: assume that each meta-claim has a 0.5 chance of being either 40% or 60%. Suppose that you end up being on the correct side of all 12 of these markets (i.e., if the replication rate is 60%, you bought all ‘Yes’ shares). Also, suppose that the prediction market’s final estimates for all these claims are 50%. Then, if you start with 100 points and evenly distribute them across claims, your earnings will be around 10% more, as you receive 0.6 points for every “correct” share.

Strategy

There are possible arbitrage and subterfuge opportunities, but frankly, I don’t think they’re worth going into given how small the rewards are.

Rounds 1-10

The more interesting game theory is in Rounds 1-10. Should we uniformly distribute our points, or should we focus on areas in which we have more expertise, and therefore are more likely to correctly predict?

Replication Markets suggests the following:

“The market is probably most accurate when forecasters specialize, because one kind of forecaster won’t have enough points.”

Per round, each forecaster has 300 points to spend on 300 claims, 10 of which will be directly replicated and paid out. Assuming (obviously a simplification) that the directly replicated claims are spread evenly across the six fields, around 1.67 claims per field will be directly replicated each round. The total prizes per round are \$9000, meaning that the expected prize money for each claim is \$30.
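
The back-of-the-envelope numbers for Rounds 1–10, under the same simplifying assumptions (variable names are mine):

```python
claims_per_round = 300
directly_replicated_per_round = 10
fields = 6
prize_per_resolved_claim = 900.0

total_prizes = directly_replicated_per_round * prize_per_resolved_claim  # $9,000 per round
expected_prize_per_claim = total_prizes / claims_per_round               # $30 in expectation
replicated_per_field = directly_replicated_per_round / fields            # ~1.67 per field

print(total_prizes, expected_prize_per_claim, round(replicated_per_field, 2))
```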

Strategy

Conditional on a claim being selected for replication and you predicting the correct outcome, the fewer the number of participants in that market, the higher your payout. So, your best option is to participate in the markets that have the fewest participants (such a rule wouldn’t function in a traditional double auction financial market, but in RM it works because you buy shares from the market manager). Other than that, there isn’t much strategy to give at the individual level (prediction markets are about aggregate behavior, after all).

(Note, however, that this isn’t an evolutionarily stable strategy; if everyone were to realize markets with fewer participants increased their chances of winning, the equilibrium condition would be an even distribution of participants among all markets.)

How to Win the Surveys

Surveys are by far the more financially rewarding opportunity in RM. They have a tournament payout structure, with only the top four forecasters winning, and they don’t involve the confusing LMSR that the markets use. To win the surveys, you need to be good at predicting which claims will replicate, which I’ll get to below. But first, let’s consider the second-order question of which claims the rest of the survey-takers believe will replicate, as this is key to the survey scoring function.

Issues in Survey Incentive Design

RM will use some secret crowd aggregation technique for scoring surveys. All they’re willing to reveal is that “in the survey you are scored against our best judgment of the right forecast, having processed all the forecasts of other participants using our best aggregation techniques.”

So, I imagine that if you want to win the surveys, your private survey opinions must align with the aggregate survey predictions. Setting aside any sort of nested/higher-order logic on the part of other survey participants (e.g., if everyone adopted a strategy of biasing their survey results towards what they thought everyone else would say, that strategy wouldn’t work), how could we optimally align our survey answers?

This depends on the survey scoring method. A good candidate is to be found in the recent arXiv paper by Yang Liu and Yiling Chen (two members of the RM scoring team), “Surrogate Scoring Rules and a Uniform Dominant Truth Serum”. Considering that RM will be experimenting with novel methods of survey aggregation, I think this fits the bill. The question is: will understanding the scoring system allow us to game it?

Surrogate Scoring Rules

The scoring problem Liu and Chen address is the following:

If you want to aggregate the opinions of many individuals, you must elicit these opinions and verify them. In markets, for example, opinions are elicited by the prospect of financial reward and verified against the ground truth (whether or not a security’s price rises or falls). An agent’s score measures how well his prediction matches the ground truth; scoring functions designed so that honesty is optimal are called “Strictly Proper Scoring Rules” (SPSR). Under an SPSR, an agent maximizes his expected score by truthfully revealing his beliefs.
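
As a minimal illustration of why a strictly proper scoring rule elicits honest beliefs, here’s the generic Brier score (not RM’s actual scoring): your expected loss is minimized by reporting exactly what you believe.

```python
import numpy as np

def brier_loss(report, outcome):
    """Quadratic (Brier) loss, a strictly proper scoring rule (lower is better)."""
    return (report - outcome) ** 2

def expected_loss(report, belief):
    """Expected Brier loss when the outcome is Bernoulli(belief)."""
    return belief * brier_loss(report, 1) + (1 - belief) * brier_loss(report, 0)

belief = 0.7  # my honest probability that the claim replicates
reports = np.linspace(0, 1, 101)
best_report = reports[np.argmin([expected_loss(r, belief) for r in reports])]
print(round(best_report, 2))  # 0.7: honesty minimizes expected loss
```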

Conversely, situations in which verification is noisy or unavailable (e.g., Replication Markets, as we won’t know whether the claims replicate until mid-2020) are called “Information Elicitation Without Verification” (IEWV), and they face a host of issues, which a family of peer prediction mechanisms has been developed to address. One such issue is that because an agent’s score depends on the beliefs of others, an agent can benefit from misrepresenting his beliefs.

The question for the principal (i.e., RM) is:

Can we design scoring mechanisms to quantify the quality of information as SPSR do and achieve truthful elicitation in a certain form of dominant strategy for IEWV?

Liu and Chen

One of their proposed solutions is “surrogate scoring rules” (SSR).

While Hanson’s LMSR makes a bit of sense to me, SSR makes absolutely no sense to me at this time. One might be able to derive some value from really digging into this topic, but I think honestly reporting your opinions is close to an optimal strategy for the surveys. Just give your honest belief about whether a claim will replicate.

To that end, how should we predict if a claim will replicate?

The Replication Crisis and Our Base Rate of Replication

If we want to predict which studies will replicate, we first need a base rate of replication across the social sciences. To determine this, we must ask ourselves: is social science actually science? Some are skeptical:

Though I’m not Taleb-level skeptical, I believe there’s an inkling of truth in his claim that many academics are “intellectuals-yet-idiots” who lack “skin in the game” (though Taleb himself is an IYI on topics like behavioral genetics and psychometrics). Evidence includes the following list of social science and medicine research areas and claims that were once taken seriously but have failed to replicate:

  • The candidate gene era in psychology/genetics
  • The early, small N work done in cognitive neuroscience
  • Much of nutritional research and epidemiology, the methods of which are extremely suspect (e.g., doing a proper RCT is nearly impossible, retrospective food tracking is unreliable, genetic confounding, etc.)
  • The social psychology literature on priming, which is p-hacked and riddled with publication bias
  • The educational research on growth mindset
  • The effect of early-childhood educational interventions on academic outcomes (which has clear publication bias)
  • Candidate drug targets in oncology; one replication effort had a 6/53 (~11%) replication rate

This list doesn’t inspire confidence. Ex post facto, it’s easy to call these fields BS and impugn the motivations of the researchers doing the spurious research within them. (And sure, some of these fields still suffer from collective delusions inspired by political ideology.) But in most cases, we should remember Hanlon’s razor:

“Never attribute to malice that which is adequately explained by stupidity.”

And, frankly, I don’t think we can chalk it all up to stupidity. All the list of failed replications demonstrates is that the social science publishing game has a lot of perverse incentives. For example, there’s a massive principal-agent problem: the incentives of the agent—the researcher who wants to get tenure and publish novel findings—do not necessarily align with those of the principal—humanity as a whole, which (in theory) has much to gain from the public good, science, that the agent is contributing to.

Consequently, academic malpractice—whether due to malfeasance, obstinance, or negligence—is much more common than we think.

These failures of social science and ethics notwithstanding, and despite claims that most published research findings might be false, large parts of social science are replicable.

Case in point, a 2018 replication project for 21 social science papers published in Science and Nature (two of the top general science journals) between 2010 and 2015 had a replication rate of 62%.

Does Replication Rate Vary Significantly By Field?

A single general social science replication project is useful, but certainly not dispositive. A much better approach is to look at replication on a field-by-field basis, set these as our priors, and forecast replication of specific claims using these priors.

It’s clear that certain fields have better epistemic hygiene than others. One index of poor epistemic hygiene is the percentage of publications in a field that feature results confirming the tested hypothesis, i.e., positive results. Psychology is the worst culprit in this regard.

(Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2850928/)

Typically, among the social sciences, it’s the “harder” social sciences (e.g., much of economics, parts of psychology like behavioral genetics and quantitative psychology) that have better epistemic hygiene. Therefore, we’d expect replication to be higher in these fields.

However, I’d like to put concrete estimates on the replication rates and rank-order the fields. To do this, I’ll consider the following for each field:

  • Replication rates from existing replication efforts
  • Any indices of publication bias
  • How strong the culture of replication is
  • How much dissent and debate is tolerated
  • How mathematical each field is (which isn’t to say mathematical models are better—excessive math is often a form of formalist obscurantism, and dressing up a poor model with complex math doesn’t make the underlying model any better)
  • A gut feeling for how ideologically motivated members of each field are (I’ll glance at the journals in each field and see if my BS detector goes off)
  • Any major replication scandals that have occurred in the field

I’ll weigh the existing replication attempts most heavily, so here they are:

Economics

The Experimental Economics Replication Project (EERP) found a 61% replication rate.

Psychology

In 2015, the “Reproducibility Project: Psychology” attempted to replicate 100 papers published in three top psychology journals in 2008; 39% of them successfully replicated.

The Many Labs Projects 1, 2, and 3 performed ideal, high-quality replications (i.e., multiple labs attempted to replicate findings). The replication rates were 77%, 50%, and 30%, respectively. The total replication rate (i.e., the weighted sum) was 53%.
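
For the pooled figure, here’s the weighted-average arithmetic. The per-project effect counts are my assumption, chosen only because they are consistent with the 77%, 50%, and 30% rates quoted above.

```python
# (replicated, attempted): assumed counts consistent with the rates quoted above
many_labs = {"Many Labs 1": (10, 13), "Many Labs 2": (14, 28), "Many Labs 3": (3, 10)}

replicated = sum(r for r, _ in many_labs.values())
attempted = sum(n for _, n in many_labs.values())
print(f"{replicated}/{attempted} = {replicated / attempted:.0%}")  # 27/51 = 53%
```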

Other Metrics of Epistemic Hygiene

Economics

Economists have been discussing replication for quite a while.

That said, economics is having its own replication crisis. Consider publication bias among three of the top economics journals, for example:

We observe a two-humped camel shape with missing p-values between 0.25 and 0.10 that can be retrieved just after the 0.05 threshold and represent 10-20 percent of marginally rejected tests. Our interpretation is that researchers inflate the value of just-rejected tests by choosing “significant” specifications.

Brodeur et al., 2016


If there weren’t publication bias, we’d expect the distribution of z-statistics to smoothly decrease, approximating the tail of a normal distribution.
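
If you had a pile of test statistics harvested from journal articles, a quick way to look for this pattern is to plot their distribution against the usual significance thresholds. The data below are synthetic placeholders; in practice you would scrape the z-statistics from published tables.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
z_stats = np.abs(rng.normal(0, 1.5, 5000))  # placeholder for z-statistics scraped from papers

plt.hist(z_stats, bins=60, density=True)
for z, label in [(1.645, "p = 0.10"), (1.96, "p = 0.05"), (2.576, "p = 0.01")]:
    plt.axvline(z, linestyle="--", label=label)  # significance thresholds
plt.xlabel("z-statistic")
plt.ylabel("density")
plt.title("Look for a dip just below 1.96 and a bump just above it")
plt.legend()
plt.show()
```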

Additionally, certain “causal” inference methods in economics, like difference-in-differences (exploiting natural experiments in longitudinal data, e.g., the effects of a new state policy on some outcome), seem particularly prone to publication bias and p-hacking. By contrast, randomized controlled trials (the gold standard of causal inference) and regression discontinuity designs (which exploit putatively arbitrary treatment cutoffs to mimic an RCT) don’t demonstrate the same degree of publication bias.

[Figure omitted: the stars correspond to the z-scores for p-values of 0.1, 0.05, and 0.01.]

Effect sizes in economics are wildly overestimated, too.

Psychology

I’m pretty sure most academics have heard about the psychology replication crisis; there’s a lot of talk about it. Given this and the fact that there have been many more replication efforts in psychology than in other fields, I’ll just stick to the empirical data for my priors.

Education

With the exception of research on exceptional children and educational acceleration/tracking (as they’re grounded in strong psychometric foundations), most education research won’t replicate, I imagine.

Per Inside Higher Ed, “Only 0.13 percent of education articles published in the field’s top 100 journals are replications, write Matthew Makel, a gifted-education research specialist at Duke University, and Jonathan Plucker, a professor of educational psychology and cognitive science at Indiana University. In psychology, by contrast, 1.07 percent of studies in the field’s top 100 journals are replications, a 2012 study found.”

Education research also places a premium on novelty (which is what happens when you’re grasping at straws for an education intervention that produces long-lasting, positive effects):

With regard to specific replication failures in education research, growth mindset comes to mind (which I suppose is part of Psychology, too). Though growth mindset was initially hyped (TED talks and all), the most recent and largest replication studies have all flopped, demonstrating tiny effect sizes. Five days after posting a preprint of one such massive replication, the authors took the preprint down:

When pressed about the minuscule effect sizes, Dweck (the researcher who popularized mindset theory) was evasive:

Dweck is just a single bad-faith actor (*cough* financial incentives *cough*) in education/psychology research; she doesn’t represent the field as a whole. But the fact that growth mindset theory has been taken seriously for this long and still won’t die suggests that education research is hostile to dissent and has bad epistemic hygiene.

Other educational claims that have failed, are p-hacked to death, or have inflated/overhyped effect sizes (that I can recall off the top of my head):

  • Flipped classrooms, which exacerbate achievement gaps between the worst and best students
  • Teacher value-added studies, which have tiny effect sizes (viz., work done by Raj Chetty, who is an economist)
  • Early-life educational interventions to improve academic outcomes (viz., the work of Heckman); these interventions seem to have a positive effect on non-academic outcomes, though. That said, the literature is rife with publication bias.

Criminology and Sociology

In criminology journals in particular, replication studies constitute just over 2 percent of the articles published between 2006 and 2010. Further, those replication studies that were published in criminology journals in that period tended to conflict with the original studies.

“Replication in criminology: A necessary practice”

I don’t know much about criminology, but apparently it does replications at twice the rate of psychology, which bodes well. However, like education research and psychology, criminology has extolled novel interventions that failed to replicate with anything like the originally reported effect sizes.

In sociology, on the other hand, a large proportion of researchers aren’t willing to release their data (see this informal 2015 study, N = 53), suggesting a lack of replication culture. Here’s what the sociologists Jeremy Freese and David Peterson have to say:

As sociologists, the most striking thing in reviewing recent developments in social science replication is how much all our neighbors seem to be talking and doing about improving replicability. Reading economists, it is hard not to connect their relatively strict replication culture with their sense of importance: shouldn’t a field that has the ear of policy-makers do work that is available for critical inspection by others? The potential for a gloomy circle ensues, in which sociology would be more concerned with replication and transparency if it was more influential, but unwillingness to keep current on these issues prevents it from being more influential. In any case, the integrative and interdisciplinary ambitions of many sociologists are obviously hindered by the field’s inertness on these issues despite the growing sense in nearby disciplines that they are vital to ensuring research integrity.

Unsurprisingly, like in most social sciences, we see evidence of publication bias:

Political Science

I don’t know much about political science, but it seems to have had its fair share of shoddy findings and data fabrication. That said, researchers have responded with calls for better epistemic hygiene practices and journals have adjusted accordingly. There’s even an active page with replications of prominent political science papers.

Management and Marketing

Management and Marketing (M&M) seem to be a mix of social psychology, industrial/organizational (IO) psych, and political science. For example, one of the most cited articles in a top M&M journal showcases some priming interaction effects, which is essentially something straight out of psychology (which, we should note, hasn’t had the best replication record). So, we might imagine that the replication rate of M&M will be equal to that of psychology.

Unfortunately, there’s not a lot of data out there on M&M replication. It’s clear that the field doesn’t talk about replication much, though apparently “Management Review Quarterly (MRQ) [a top journal] publishes structured literature reviews, meta-analyses and, since 2018, replications.”

My impression is that Replication Markets participants are systematically underpredicting the replicability of research in Management and Marketing, probably because many of them are technical types allergic to business, marketing, and soft skills (even though most M&M papers aren’t specifically about those topics). This presents an interesting opportunity.

My Replication Priors

So, I’ll put my money where my mouth is (metaphorically, as there is no financial downside, just opportunity cost and being wrong) and provide my replication base rates:

Economics: 60 ± 5%

Psychology: 55 ± 5%

Marketing and Management: 50 ± 10%

Political Science: 45 ± 10%

Sociology and Criminology: 45 ± 10%

Education Research: 40 ± 10% (I hope my priors aren’t too influenced by my research on the topic)

Will Replication Rates Vary Significantly By Year?

One might intuit that as norms around scientific best practice change, so too will the quality of science. How could we confirm this before the replications are attempted?

A naïve approach would be to look at Google search term trends over time. For example, here are the trends for “psychology replication” and “open science” from 2004 to 2019:

Here is the number of publications mentioning both “psychology” and “replication”:

Likewise with economics, which has seen a similar 2x increase for publications mentioning both “economics” and “replication”:

The trends look similar for all other social science fields being replicated in SCORE. However, these graphs show total publications over time, which increase as the fields grow larger. If we compare these trends to the number of publications in the fields as a whole, we see that the number of publications with the term “replication” has kept pace with the field (that is, it hasn’t accelerated). 

This is what we’d expect. Yes, there have been some incredibly cool developments in open science over the past five years or so, like the Open Science Framework (OSF), which takes the friction out of preregistration, public sharing of data, methods, code, and more. There have also been a few notable replication projects. Many of these things are possible thanks to the Center for Open Science (COS), which is funded by Arnold Ventures, DARPA, and others. In 2016 the COS released new open-source preprint servers (e.g., PsyArXiv), which has coincided with increased publication of preprints (which have been available in the hard sciences for nearly 30 years, and in biology for around 5). All of these trends have conspired to make science more open, presumably limiting publication bias, p-hacking, the garden of forking paths, and other forms of academic malpractice and epistemic pollution.

All that said, talk is cheap. Though these trends indicate a growing awareness of replication and scientific best practices, they don’t guarantee that social scientists’ actual practices have changed. My subjective impression from various academic Twitter-verses is that young scientists tend to be quite excited about open science, replication, etc., whereas older, more established scientists aren’t quite as excited.

The question is: do these young scientists publish in the high-prestige journals that Replication Markets will investigate? Probably so, but the majority of papers are likely published by better-established, older scientists. I think there will be a 10-year lag before we see the quality of science increase significantly.

So, no, I don’t believe replication rates will vary a lot over time. Yes, there will probably be a slight upward trend, maybe about 1-2% per year, but I imagine much more of the variation will be among fields.

Forecasting Specific Claims

How should we forecast specific claims?

If we wanted to be lazy (i.e., use heuristics), we’d take the outside view every time, meaning we’d bet that the probability of study X replicating equals the replication rate of its field as a whole. If we have an inkling of epistemic humility, we might update this to reflect the belief of the prediction market as a whole, as prediction markets have been shown to be pretty good at predicting replication outcomes.

However, if we have subject-matter expertise, or are willing to do a bit of digging, we might take the inside view and update these priors accordingly.

Rules of Thumb

  1. Start with the replication rate for the field as a whole or, absent that, the replication rate for social science as a whole updated with our prior for the specific field. (A simple version of this whole procedure is sketched in code at the end of the section.)

  2. Update on any relevant knowledge you might have, but only slightly. This is a gross simplification, but: forecasting tournaments show that small, incremental updates of our priors result in the best forecasts.

  3. Be skeptical of certain methods (e.g., difference-in-differences designs in economics, as I discussed before) and statistical effects. For example, give less credence to interaction effects, as they require a much larger sample size to be properly estimated and aren’t as likely to hold up in replication:

Interaction effects are notorious for being much easier to publish than to replicate, partly because it is easy for researchers to forget (?) how they tested many dozens of possible interactions before finding one that is statistically significant and can be presented as though it was hypothesized by the researchers all along.

Various things ought to heighten suspicion that a statistically significant interaction effect has a strong likelihood of not being “real.” Results that imply a plot like the one above practically scream “THIS RESULT WILL NOT REPLICATE.” There are so many ways of dividing a sample into subgroups, and there are so many variables in a typical dataset that have low correlation with an outcome, that it is inevitable that there will be all kinds of little pockets for high correlation for some subgroup just by chance.

Jeremy Freese
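
To make these rules of thumb concrete, here’s a minimal sketch of the procedure I have in mind. The field base rates come from my priors above; the shrinkage weight and red-flag penalty are arbitrary assumptions for illustration, not a validated model.

```python
# Field base rates from "My Replication Priors" above (point estimates only).
FIELD_PRIORS = {
    "economics": 0.60,
    "psychology": 0.55,
    "marketing_management": 0.50,
    "political_science": 0.45,
    "sociology_criminology": 0.45,
    "education": 0.40,
}

def forecast(field, market_price=None, red_flags=0, w_market=0.5, flag_penalty=0.05):
    """Rule 1: start from the field prior. Rule 2: update gently toward the market price.
    Rule 3: subtract a small penalty per red flag (interaction effect, diff-in-diff, etc.)."""
    p = FIELD_PRIORS[field]
    if market_price is not None:
        p = (1 - w_market) * p + w_market * market_price  # defer partly to the crowd
    p -= flag_penalty * red_flags
    return min(max(p, 0.02), 0.98)  # never go all the way to 0 or 1

# A hypothetical psychology claim trading at 45% whose headline result is an interaction effect:
print(round(forecast("psychology", market_price=0.45, red_flags=1), 2))  # 0.45
```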