Has the Credibility of the Social Sciences Been Credibly Destroyed? Reanalyzing the “Many Analysts, One Data Set” Project

Speaker: Katrin Auspurg

LMU Munich

This event is part of the Sociology Seminar Series, which will take place online throughout Michaelmas Term 2020.

NOTE: This seminar will take place on Thursday.

Abstract: In 2018, R. Silberzahn and colleagues published a paper in which 61 researchers (in 29 teams) separately analyzed the same research question with the same data set: Are soccer referees more likely to give red cards to players with dark skin than to players with light skin? The 29 teams came up with strongly diverging answers to the research question at hand. The main conclusion drawn from this widely noted exercise was that the social sciences are not rigorous enough: the answer to a social science research problem depends to a large extent on who does the research. The lead authors recommended starting similar crowdsourcing initiatives to validate findings through more balanced results, and also to sensitize politicians to the high uncertainty built into social science research (Silberzahn and Uhlmann 2015).

We reanalyze the data of this crowdsourcing project and identify several reasons underlying the variability of the teams' results. Silberzahn et al. (2018) already hinted at the role of analytical flexibility: when analyzing social science data, we have to make many choices concerning data preparation, model selection, and model specification. We argue, however, that the most important reason was overlooked. Analytical choices must be based on a clear definition of the parameter of interest; model specification, for instance, depends crucially on it. Silberzahn et al. did not clearly define the parameter of interest. Consequently, each team had to come up with its own interpretation of the research question, and it is therefore no wonder that results diverged so widely.

The literal interpretation of the verbal statement of the research question is that the parameter of interest is the mean difference in red cards per soccer game between dark- and light-skinned players (the gross effect of skin tone). A simple bivariate analysis would thus have answered the research question.
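As a minimal sketch, the bivariate gross effect is just a difference in group means. The numbers below are made-up placeholders, not the Silberzahn et al. data:

```python
# Sketch of the "gross effect": mean difference in red cards per game
# between dark- and light-skinned players. Data are illustrative only.
from statistics import mean

# (red_cards_per_game, skin_tone) pairs -- hypothetical placeholder values
players = [(0.02, "dark"), (0.05, "dark"), (0.01, "light"),
           (0.00, "light"), (0.03, "dark"), (0.02, "light")]

dark = [r for r, s in players if s == "dark"]
light = [r for r, s in players if s == "light"]
gross_effect = mean(dark) - mean(light)
print(round(gross_effect, 4))  # → 0.0233
```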
However, some teams interpreted the verbal statement as a question about discrimination. The parameter of interest is then no longer the gross effect but the net effect of skin tone. These teams therefore estimated regression models in which they tried to net out "productivity-related" mediators (such as position played, body weight, etc.). It is no wonder that they arrived at different conclusions than the teams that tried to estimate the gross effect. Still other researchers aimed at exploratory or y-centered research, meaning they did not even try to isolate the effect of skin tone on red cards.

To sum up, our argument is that the main reason why results diverged so widely is that the research question was not clear enough. We suppose that results would have been much closer to each other if the research question had been defined precisely. Because their design added this extra source of uncertainty (what exactly is the focus of the research?), Silberzahn et al. underestimated the reliability of social science research and overlooked an important factor in improving the consistency and credibility of findings.

Empirically, we illustrate our arguments by reanalyzing their data. First, we show that the main findings of the teams can be reproduced simply by using 'mindless' specification-robustness algorithms, which consist of trying out different regression models and variable combinations (with far fewer research resources!). However, the mean effect identified by this method cannot be interpreted as a discrimination effect, because the data are weak on "productivity-related" measures (as some of the original teams also argued). Second, we therefore go beyond the original analyses by using a more rigorous specification of the causal paths of interest.
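A 'mindless' specification search of this kind can be sketched as follows: enumerate covariate subsets, fit an OLS model for each, and collect the focal coefficient. Everything here — the synthetic data, the variable names, and the pure-stdlib OLS solver — is an illustrative assumption, not the project's actual code or data:

```python
# Hypothetical specification-robustness sketch: try every covariate subset
# and record the coefficient on skin tone each time. Synthetic data; OLS
# via normal equations solved with a small Gauss-Jordan routine.
import itertools
import random

def solve(a, b):
    """Gauss-Jordan elimination for a small linear system a @ x = b."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col and m[r][col]:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols_coef(y, cols):
    """OLS coefficients for X = [intercept, cols...] regressed on y."""
    x = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)

random.seed(1)
n = 400
skin = [random.random() for _ in range(n)]           # focal regressor
position = [random.gauss(0, 1) for _ in range(n)]    # "productivity" covariates
weight = [random.gauss(0, 1) for _ in range(n)]
# simulated outcome depends on the covariates but NOT on skin tone
cards = [0.5 * p + 0.3 * w + random.gauss(0, 1)
         for p, w in zip(position, weight)]

covariates = {"position": position, "weight": weight}
estimates = {}
for r in range(len(covariates) + 1):
    for subset in itertools.combinations(covariates, r):
        beta = ols_coef(cards, [skin] + [covariates[s] for s in subset])
        estimates[subset or ("none",)] = beta[1]     # coefficient on skin tone

for spec, b in estimates.items():
    print(sorted(spec), round(b, 3))
```

Because the simulated outcome is independent of skin tone, every specification's focal coefficient hovers around zero here; the point of the sketch is only the mechanical loop over specifications, which is what the abstract means by a 'mindless' robustness algorithm.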
Including only covariates derived from theory and applying generalized sensitivity analysis (GSA), we arrive at the conclusion that the data at hand are most consistent with there being no discrimination by skin tone.

We end with some general lessons for social research: we need to specify the research question and the parameter of interest precisely, ideally based on a theoretical model of the data-generating process (perhaps visualized with a DAG). From this we should derive an identification strategy for the parameter of interest (e.g., by choosing an appropriate adjustment set for a regression model). Only by being more explicit and transparent in these crucial steps will social research become more rigorous.
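The DAG-to-adjustment-set step can be sketched in a few lines. The graph below is an assumed, illustrative model for the red-card question — the variable names and edges are not the speaker's actual model. Adjusting for the treatment's parents blocks every backdoor path (Pearl's backdoor criterion), while mediators must be left out when the gross (total) effect is the parameter of interest:

```python
# Hypothetical DAG for the red-card question; edges map cause -> effects.
dag = {
    "league":    ["skin_tone", "red_cards"],  # assumed common cause (confounder)
    "skin_tone": ["position", "red_cards"],   # position is a MEDIATOR
    "position":  ["red_cards"],
}

def parents(dag, node):
    """All direct causes of `node` in the DAG."""
    return sorted(c for c, effects in dag.items() if node in effects)

treatment, outcome = "skin_tone", "red_cards"
adjustment_set = parents(dag, treatment)  # blocks all backdoor paths
mediators = [v for v in dag.get(treatment, []) if v != outcome]

print("adjust for:", adjustment_set)                  # total (gross) effect
print("do NOT adjust for mediators such as:", mediators)
```

Under this (assumed) graph, adjusting for "league" identifies the gross effect, whereas additionally adjusting for "position" would instead target the net effect — exactly the choice the teams implicitly made in different ways.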

Cited References: Silberzahn, R., and Uhlmann, E. L. (2015): Crowdsourced research: Many hands make tight work. Comment. Nature 526: 189–191. Silberzahn, R., Uhlmann, E. L., Martin, D. P., et al. (2018): Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science 1(3): 337–356.

The Sociology Seminar Series for Trinity Term is convened by Christopher Barrie, Fangqi Wen and Tobias Ruttenauer. For more information about this or any of the seminars in the series, please contact