“Reproducibility” concerns whether reported scientific results are valid in the sense that they are not an artifact of a specific experimental environment. Following Pineau et al. (2020), using the same data and the same analytical tools should yield the same results as reported. This requires an exact specification of the data and tools used, so that readers can reproduce the results if desired. Ensuring reproducibility is good practice and must be promoted.
Clearly, reproducibility may take various forms. Depending on the nature of the results, the resources needed to recover the results may differ (proofs, algorithms, data, etc.). Such resources can be given in the paper itself (e.g., proof sketches, pseudo-code of algorithms and data generators, etc.) or via supplementary material (e.g., by links to data and the implementations of algorithms).
However, full reproducibility cannot always be ensured, whatever the efforts of the authors. For example, when a paper reports the run-times of a program on specific hardware, the hardware is part of the material needed for an exact reproduction of the results, and in general it cannot be shared. The implementation may also be so large that checking (or even running) the source code would require a tremendous effort, which is often disproportionate and certainly infeasible within a limited reviewing period. Similarly, authors may have privacy concerns or may worry about their intellectual property when sharing code before the publication of their results. Data used in some experiments can be proprietary or hard to anonymize; implementations may also be proprietary and impossible to release as open source.
In such cases, the results must nevertheless be reproducible in principle. This means, e.g., that even though an implementation is not readily available, the algorithm should be described so that it can be re-implemented. While we do expect all results to be reproducible in principle in this sense, we can nevertheless distinguish works according to the effort it would take to reproduce the results. While we encourage authors to facilitate reproducibility by sharing code and data as much as possible, this is not a strict requirement for a successful submission if the reviewers are convinced that the results can be reproduced in principle.
At IJCAI, reproducibility will thus form an important criterion (among others) according to which submitted papers will be evaluated. Reviewers will have to answer the following question, using the following scores:
Are the results (e.g., theorems, experimental results) in the paper easily reproducible?
- CONVINCING: I am convinced that the obtained results can be reproduced, possibly with some effort. Key resources (e.g., proofs, code, data) are already available, will be made available upon acceptance, or good reasons as to why they are not (e.g., proprietary data or code) are reported in the paper. Key details (e.g., proofs, experimental setup) are sufficiently well described but their exact recovery may require some work.
- CREDIBLE: I believe that the obtained results can, in principle, be reproduced. Even though key resources (e.g., proofs, code, data) are unavailable at this point, the key details (e.g., proof sketches, experimental setup) are sufficiently well described for an expert to confidently reproduce the main results, if given access to the missing resources.
- IRREPRODUCIBLE: It is not possible to reproduce the results from the description given in the paper. Some key details (e.g., proof sketches, experimental setup) are incomplete/unclear, or some key resources (e.g., proofs, code, data) are not furnished, without any further explanation or justification.
The following checklist can be used by authors and reviewers to figure out the extent to which the results presented in a paper can be considered as reproducible. We do not expect authors to submit this form, but they can feel free to add a section entitled “Reproducibility” to the paper, in case they want to clarify some issues to the reviewers (and readers), or to upload supplementary material for clarification (note, however, that reviewers are not required to read supplementary material).
If the paper introduces a new algorithm, it must include a conceptual outline and/or pseudocode of the algorithm for the paper to be classified as CONVINCING or CREDIBLE.
If the paper makes a theoretical contribution:
1. All assumptions and restrictions are stated clearly and formally
2. All novel claims are stated formally (e.g., in theorem statements)
3. Appropriate citations to theoretical tools used are given
4. Proof sketches or intuitions are given for complex and/or novel results
5. Proofs of all novel claims are included
For a paper to be classified as CREDIBLE or better, we expect that at least items 1 and 2 can be answered affirmatively; for CONVINCING, all five items should be answered with YES.
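The rule above can be read as a small decision procedure. The following Python sketch is one possible encoding of it, not part of the IJCAI guidelines: the function name classify_theory and the list-of-booleans representation are illustrative assumptions, and treating every remaining case as IRREPRODUCIBLE is an assumption based on that being the only other score on the scale.

```python
from typing import Sequence


def classify_theory(answers: Sequence[bool]) -> str:
    """Map the five theoretical-checklist answers (in checklist order) to a label.

    Hypothetical helper for illustration only; reviewers' judgment,
    not this function, determines the actual score.
    """
    if len(answers) != 5:
        raise ValueError("expected exactly five yes/no answers")
    # CONVINCING: all five items are answered with YES.
    if all(answers):
        return "CONVINCING"
    # CREDIBLE: at least items 1 and 2 (assumptions stated, claims formal).
    if answers[0] and answers[1]:
        return "CREDIBLE"
    # Assumption: anything else falls to the remaining score.
    return "IRREPRODUCIBLE"


# A paper with formal assumptions and claims but no proofs or citations.
print(classify_theory([True, True, False, False, False]))
```

The checklist remains a guide: a reviewer may still weigh, for instance, how complete a proof sketch is, which a yes/no encoding cannot capture.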
If the paper relies on one or more datasets:
1. All novel datasets introduced in this paper are included in a data appendix
2. All novel datasets introduced in this paper will be made publicly available upon publication of the paper
3. All datasets drawn from the existing literature (potentially including authors’ own previously published work) are accompanied by appropriate citations
4. All datasets drawn from the existing literature (potentially including authors’ own previously published work) are publicly available
5. All datasets that are not publicly available (especially proprietary datasets) are described in detail
Papers can be classified as CREDIBLE if at least items 3, 4, and 5 can be answered affirmatively, and as CONVINCING if all items can be answered with YES.
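One way authors might give an exact specification of the data they used is to publish a checksum manifest alongside a data appendix. The sketch below is only an illustration of that idea and is not something the guidelines require: the function names sha256_of and dataset_manifest and the directory layout are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_manifest(root: Path) -> dict[str, str]:
    """Map each file under `root` (relative POSIX path) to its SHA-256 digest.

    Hypothetical helper: a reader holding the same files can recompute
    the manifest and compare it with the published one.
    """
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

A manifest of this kind identifies the exact files without disclosing their contents, which may also help when a proprietary dataset can be described but not shared.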
If the paper includes computational experiments:
1. All code required for conducting experiments is included in a code appendix
2. All code required for conducting experiments will be made publicly available upon publication of the paper
3. Some code required for conducting experiments cannot be made available because of reasons reported in the paper or the appendix
4. This paper states the number and range of values tried per (hyper-)parameter during development of the paper, along with the criterion used for selecting the final parameter setting
5. This paper lists all final (hyper-)parameters used for each model/algorithm in the experiments reported in the paper
6. In the case of run-time critical experiments, the paper clearly describes the computing infrastructure in which they have been obtained
For CREDIBLE reproducibility, we expect that sufficient details about the experimental setup are provided, so that the experiments can be repeated once the algorithm and data are available (items 3, 5, and 6). For CONVINCING reproducibility, we also expect that not only the final results but also the experimental environment in which these results were obtained is accessible (items 1, 2, and 4).
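The checklist entries on (hyper-)parameters and computing infrastructure lend themselves to a machine-readable record that can accompany a paper. The following Python sketch shows one possible shape for such a record using only the standard library. The function name run_manifest, the field names, and all parameter values are hypothetical placeholders, not a format prescribed by IJCAI.

```python
import json
import os
import platform
import sys


def run_manifest(final_hyperparameters: dict, search_space: dict,
                 selection_criterion: str) -> dict:
    """Bundle the reported settings with a description of the machine.

    Hypothetical helper: the fields mirror the checklist entries on
    values tried, selection criterion, final values, and infrastructure.
    """
    return {
        "final_hyperparameters": final_hyperparameters,
        "search_space": search_space,
        "selection_criterion": selection_criterion,
        "infrastructure": {
            "python": sys.version.split()[0],
            "os": platform.platform(),
            "machine": platform.machine(),
            "processor": platform.processor(),  # may be empty on some systems
            "cpu_count": os.cpu_count(),
        },
    }


# Placeholder values for illustration only.
manifest = run_manifest(
    final_hyperparameters={"learning_rate": 0.01, "batch_size": 64},
    search_space={"learning_rate": [0.1, 0.01, 0.001], "batch_size": [32, 64]},
    selection_criterion="best validation accuracy",
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

The standard library reports only basic host information; details such as accelerator model or memory size would have to be added by hand or with additional tooling.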
Evaluation of Reproducibility
As for the evaluation, we do not necessarily expect a lower reproducibility score to be used as an argument to reject an otherwise acceptable paper that provides a very interesting idea, or, conversely, a high reproducibility score to increase the chances of an otherwise not very interesting paper. However, authors should keep in mind that reviewers are likely to find results backed by substantial evidence more convincing than results that are not justified at all. To sum up, on the one hand, reviewers cannot use the lack of code, data or full proofs as a key reason for rejecting a paper; on the other hand, authors cannot complain about reviewers viewing their paper(s) as not very convincing if they do not provide enough evidence to reproduce their results in principle. In any case, reviewers are not required to look at the supplementary material.
These guidelines are based on the following resources:
(1) Reproducibility Checklist in AAAI 2022, IJCAI 2021, NeurIPS 2020;
(2) Pineau et al. Improving Reproducibility in Machine Learning Research. arXiv:2003.12206.