Introducing Reproducibility Reviews

By Frank Hutter, Isabelle Guyon, Marius Lindauer and Mihaela van der Schaar (general and program chairs of AutoML-Conf 2022)

Have you ever tried to reproduce a paper from a top ML conference and failed? You’re not alone! At AutoML-Conf (see automl.cc), we’re aiming for a higher standard: with the papers we publish, you shouldn’t have this problem!

Why is this important?
The reproducibility of research papers is key to the sustained progress of a field. Yet, there are many accounts of poor reproducibility in machine learning [Haibe-Kains et al., 2020], reinforcement learning [Henderson et al., 2018; Agarwal et al., 2021], and AutoML [Yan et al., 2019; Lindauer & Hutter, 2020]. We believe that at AutoML-Conf we can fix this, and if we’re successful, other top ML conferences may follow.

What is the status quo?
Let’s say Alice wants to reproduce the results of Bob’s paper. If Bob does not make code available, Alice has to implement it from scratch. This typically takes substantial time (often weeks) and rarely yields the same results as in the paper. The result? Much reduced impact for Bob, wasted effort for Alice, and a slow-down in progress for the community. Not good.
Fortunately, these days Bob does make code available in some cases. However, Alice may still have to fight with it: she may find it incomplete, not runnable, or unable to reproduce the original results. She emails Bob, who initially replies with helpful comments but at some point says he doesn’t have time to help further because he is busy with his next two publications. Again, the same failure as above. This happens far too often.

What can conferences do to improve this?
NeurIPS took a great first step by introducing a reproducibility checklist in 2019. However, NeurIPS neither mandates a code release nor makes the checklist available after the review process, which reduces transparency. Beyond this “official” checklist, individual reviewers who are sensitized to the topic of reproducibility sometimes ask about code during the rebuttal process. Such requests during the rebuttal are almost always successful. However, the fact that they occur for only a small fraction of papers increases the randomness in reviewing.

Can we do better?
Yes! The problem with the status quo is that the incentive system is broken. It is substantial work to ensure the reproducibility of results, and while there are many incentives for publishing the next paper (graduation, tenure reviews, hiring criteria, performance reviews at companies, etc.), the incentives for ensuring reproducibility aren’t comparable. We thus need to turn this incentive system around: authors should have to do the work of ensuring the reproducibility of their results *in order to get their paper published*. This incentive already works when individual reviewers ask about code, and at AutoML-Conf we’ll consistently integrate such discussions about reproducibility into the review process.

How will we achieve this?
To make the papers we publish at AutoML-Conf highly reproducible, we decided to invite dedicated reproducibility reviewers. These reviewers will be asked to check and verify the authors’ answers to the questions of the reproducibility checklist (see the author instructions included in the LaTeX template). For example, a reproducibility reviewer could check whether it is easy to install the software and run the experiments as documented (potentially using intermediate results / checkpoints for compute-intensive steps), and provide feedback on how to improve reproducibility further. Authors then have the chance to act on this feedback and improve their work’s reproducibility as part of the rebuttal.

What counts as “reproducible”?
At this point, we aim only for a limited notion of reproducibility, also known as “replicability”: when a reviewer repeats the authors’ steps, can she obtain the same results? Are the results exactly the same given the same seeds? Are results similar across seeds, e.g., with overlapping confidence bounds? Broader notions of reproducibility, such as qualitatively similar results on related datasets, would be great to consider in the future.
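
To make these checks concrete, here is a minimal sketch in Python of the two replicability questions above: exact equality of results under the same seed, and overlapping confidence bounds across seeds. The run_experiment function and the normal-approximation 95% intervals are our own illustrative assumptions, not part of any official review tooling.

    import numpy as np

    def same_result_for_seed(run_experiment, seed):
        # Check 1: repeating the authors' steps with the same seed yields the same result.
        return run_experiment(seed=seed) == run_experiment(seed=seed)

    def confidence_interval(scores, z=1.96):
        # Normal-approximation 95% confidence interval for the mean score across seeds.
        scores = np.asarray(scores, dtype=float)
        half_width = z * scores.std(ddof=1) / np.sqrt(len(scores))
        return scores.mean() - half_width, scores.mean() + half_width

    def intervals_overlap(reported_scores, reproduced_scores):
        # Check 2: the confidence bounds of reported and reproduced results overlap.
        lo_a, hi_a = confidence_interval(reported_scores)
        lo_b, hi_b = confidence_interval(reproduced_scores)
        return max(lo_a, lo_b) <= min(hi_a, hi_b)

For example, intervals_overlap([0.81, 0.83, 0.80], [0.82, 0.84, 0.79]) returns True, whereas clearly disjoint score distributions would flag a replicability concern.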

Won’t this cause a lot of additional work for authors?
It does indeed cost time to make your paper reproducible, but it is dramatically more efficient for the authors to do this than for anyone else. Recalling the example above, it would be quite easy for Bob to document the steps he followed right when submitting the paper. It takes him more time to do so when Alice asks about the code a few months later. And it takes yet more time (or is impossible) for Alice and other researchers to figure out Bob’s code by themselves. As a silver lining for Bob, papers with properly released code also have a much higher impact on the community than those without code, and when Bob himself wants to revisit his ideas a year later, he also benefits from having left everything in a clean state.

Can authors get extra time for the code release?
Yes; the supplemental material is only due a week after the main paper. Also, authors are free to continue updating their anonymous repository during the rebuttal.

Do authors have to clean up their code?
We see two possible approaches to releasing code:

  1. The code dump. A quick yet effective way to boost reproducibility: make the code and scripts available along with their requirements, provide a README detailing how to reproduce the results, etc., but don’t spend (“waste”) time cleaning it up.
  2. The open-source software package. Authors that want to build a community around their code may choose to invest more time into the code they release.

Option 2 tends to achieve greater lasting impact, and we’re thrilled to see such work, but for many papers Option 1 is more efficient; it is perfectly fine in terms of reproducibility.
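
For concreteness, here is a minimal, hypothetical sketch of the kind of single entry point that an Option 1 release could ship alongside a requirements file and a short README; the script name reproduce.py, its arguments, and the placeholder experiment are our own assumptions, not an AutoML-Conf requirement.

    # reproduce.py -- hypothetical entry point for an Option 1 ("code dump") release.
    # As the README might state:
    #   pip install -r requirements.txt
    #   python reproduce.py --seeds 0 1 2 3 4 --out results.csv
    import argparse
    import csv

    import numpy as np

    def run_experiment(seed):
        # Placeholder for the paper's actual experiment; returns one score per seed.
        rng = np.random.default_rng(seed)
        return float(rng.random())

    def main():
        parser = argparse.ArgumentParser(description="Re-run the paper's main experiments.")
        parser.add_argument("--seeds", type=int, nargs="+", default=[0, 1, 2, 3, 4])
        parser.add_argument("--out", default="results.csv")
        args = parser.parse_args()
        with open(args.out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["seed", "score"])
            for seed in args.seeds:
                writer.writerow([seed, run_experiment(seed)])

    if __name__ == "__main__":
        main()

The point of such a script is that a reproducibility reviewer can go from a fresh checkout to the paper’s numbers with the two commands in the header, without reverse-engineering the authors’ workflow.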

Who are the reproducibility reviewers?
Like other reviewers, these will be volunteers from the community. We expect reproducibility reviewers to be more junior on average than standard reviewers; in our eyes, if it requires a PhD to reproduce a paper’s results then the authors didn’t do their job fully.

How do I sign up as a reproducibility reviewer?
Glad you asked! This only works if we have enough volunteers, as we’re aiming to have one reproducibility reviewer per submission. Please sign up here: https://forms.gle/mxki3gaSN7jZZykH9

This process is very much an experiment, but we hope that it will work and contribute to improving reproducibility in our field, facilitating its sustained progress 🙂
