Repro: An Open-Source Library for Improving the Reproducibility and Usability of Publicly Available Research Code
Abstract
We introduce Repro, an open-source library that aims to improve the reproducibility and usability of research code. The library provides a lightweight Python API for running software released by researchers inside Docker containers that contain the exact runtime configuration and dependencies the code requires. Because Docker handles the environment setup for each package, users do not need to do any configuration themselves. Once Repro is installed, users can run the code from the 30+ papers currently supported by the library. We hope researchers see the value that including their research code in Repro provides to others and consider adding support for their own code (https://github.com/danieldeutsch/repro).
1 Introduction
Running the code released by the original authors of a research paper can be difficult. Replicating the required runtime environment is often challenging: the correct versions of external libraries must be installed and resource files must be placed in the correct locations. Further, the released software may not have an easy-to-use API, making it hard to figure out how to run the code.
In this work, we describe Repro, an open-source library that aims to improve the reproducibility and usability of publicly available research code. Repro is a collection of lightweight Python wrappers around code released by the authors of papers; the wrappers provide a simple interface for running the original code without users needing to set up or maintain the necessary runtime environments themselves.
Repro achieves this using Docker, a platform for packaging software applications together with all of their necessary dependencies. Each paper supported by Repro has a corresponding Docker image that contains the exact runtime environment and dependencies for its code. Repro then exposes a simple Python API for passing data to the Docker containers, processing the data with the original code, and returning the output to the user.
Since each codebase’s dependencies are fully contained within the Docker images, users of Repro do not need to put any effort into setting up the correct environment to run code from a paper. Once Repro is installed, users can easily run any of the code from the 30+ papers supported by the library.
This paper describes why Docker is an ideal platform for releasing reproducible code (§2), how Repro is implemented (§3), the best practices for ensuring reproducible code that we have learned from developing the library (§4), and some of the library's limitations (§5). We hope that users see the value Repro provides and consider contributing Docker images and Python wrappers for their own papers' code.
2 Docker as a Tool for Reproducibility
2.1 Background
Repro is built on top of Docker (https://www.docker.com/), a tool for packaging software applications into isolated, standalone environments that contain all of the dependencies necessary to run them. These environments, called Docker images, can specify which operating system is used and which versions of software libraries are installed, and can include data files.
Docker images are built using Dockerfiles. Dockerfiles are text files with a specific syntax that contain a series of commands which are executed in sequence to build the image. They begin with a base image, such as a specific version of Ubuntu, followed by commands that can install software libraries, copy local data files into the image, etc.
Once an image is built, it can be run as a Docker container, which is similar to a virtual machine. The container allows the user to run software in, or interact with, the environment specified by the image from their own host machine. However, any modifications made within the container do not persist when the container terminates; every time a container is created, it starts with a fresh environment defined by the image.
Importantly, Docker images and containers are platform independent. If two different machines run the same container, the containers’ environments will be identical up to differences in the machines’ hardware.
Docker images can be easily distributed for others to use through an image registry server such as DockerHub (https://hub.docker.com/). Users can upload their image binaries to DockerHub, and others can then download and run them, analogous to how GitHub is used to distribute source code. Thus, once a developer creates a Docker image, it is easy for others to replicate that exact runtime environment on their own machines.
2.2 The Advantages of Docker for Reproducibility
It is often challenging and time consuming to reproduce results from research papers. Papers that publicly release code often include links to download pre-trained models or resources and written instructions for replicating the necessary runtime environment for their software. However, it is not uncommon for the environment to be under-specified or not specified at all, for paths to resource files to be hard-coded to locations on the author's machine, for the links to pre-trained models or required resources to be broken, etc. These problems can be exacerbated over time as information about the original environment configuration is forgotten or deleted entirely, making the code difficult to run.
Docker offers a solution to many of the common problems researchers encounter when they try to reproduce a result from a paper using resources released by the authors. In addition to releasing code and dependencies, authors could also release Docker images along with their papers that contain the exact environment necessary to run their code. Other researchers could then run the code without having to worry about exact library versions or the locations of pre-trained models, since these details would be taken care of by the Docker image. Further, if the images were stored on a public Docker image registry, the environment necessary to run the code associated with the paper would remain available indefinitely.
Thus, the advantages of Docker make it an ideal platform for building a library focused on the reproducibility of research code.
3 Repro
Repro is a Python-based library that aims to improve the reproducibility and usability of research code released along with papers. It provides easy-to-use Python APIs for running the original code released by the authors, all from a single, lightweight Python environment. Once Repro is installed, users can run the code from any paper supported by Repro without any additional setup effort.
Improving Reproducibility
Every codebase supported by Repro is packaged into its own Docker image, which contains the original source code, runtime environment, and necessary dependencies. Repro provides a lightweight Python wrapper around the Docker image which facilitates launching a Docker container, transferring data to the container, running the original code in the original environment, and returning the results to the user. Because the paper code is run within a Docker container, that runtime environment will be the same for all users of the library. Thus, the environment configuration which reproduces results from the original paper can be replicated for all Repro users, improving the reproducibility of the original research.
Improving Usability
Because the papers' code runs in Docker containers, users of Repro do not need to maintain the runtime environments or dependencies themselves; these are encapsulated within the images. For instance, users do not have to create a Python environment specific to a codebase, install its software dependencies, or download resource files, such as pre-trained models, to specific locations. All of this is taken care of by the Docker images, and thus Repro makes running the original research code far easier than before.
The only environment users need to maintain is the one in which Repro is installed. However, since Repro's wrappers around the Docker images do not have difficult-to-install dependencies, the Python environment from which all of the supported papers' code can be run is very lightweight.
Figure 1 contains an example of how a user can easily run BART (Lewis et al., 2020) to generate a summary of an input document. The BART model corresponds to a Python class that provides a function to run inference and return the summary. Behind the scenes, Repro launches the Docker container which contains the original code and models released by the BART authors, passes the input to the container, runs the original code in the container to produce the summary, and returns the result to the user in the original Python process. All of this processing is hidden from the user, so BART’s Python API looks like any other standard Python function.
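As a rough illustration, the usage shown in Figure 1 amounts to only a few lines of Python. Note that the import path and method name below are assumptions based on the description above, not necessarily Repro's exact API:

    # Illustrative sketch; the module path and method name are assumptions.
    from repro.models.lewis2020 import BART

    # Instantiating the model prepares the wrapper around the Docker image
    # that contains the original BART code and pre-trained weights.
    model = BART()

    document = (
        "Repro is an open-source library that runs research code inside "
        "Docker containers so that users do not need to set up each "
        "paper's runtime environment themselves."
    )

    # The inference call sends the document to the container, runs the
    # original code there, and returns the generated summary.
    summary = model.predict(document)
    print(summary)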
The API for each paper depends on what its code does. For instance, reference-based text generation metrics accept a text to score and a set of references, question-answering models take some input text and a question, etc. We have tried to standardize the input and output formats across models of the same type so that it is easy to quickly run multiple papers' code on the same inputs.
Installation & Running
The library itself is lightweight and has minimal dependencies. Our goal is to make it as easy as possible to install, which can be done with Python's pip package manager. Running Repro requires a host machine with Docker installed.
Communication with Docker
Exchanging data between the host machine’s Python process and the Docker container is done via the machine’s file system. First, the Python process serializes the data which needs to be processed to a directory on the host machine. Then, the process launches the Docker container and mounts that directory to the container, which gives the container the ability to read and write to the file system of the host machine. The Python process executes a command within the container to process the data, and the container serializes the result to the same mounted directory and terminates. The process then loads the results and returns them to the user.
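The following is a simplified sketch of this pattern rather than Repro's internal implementation; the image name and the command executed inside the container are placeholders:

    import json
    import subprocess
    import tempfile
    from pathlib import Path

    def run_in_container(image: str, inputs: list) -> list:
        # Exchange data with a Docker container through a mounted directory.
        with tempfile.TemporaryDirectory() as host_dir:
            # 1) Serialize the inputs to the host file system.
            (Path(host_dir) / "input.json").write_text(json.dumps(inputs))

            # 2) Launch the container with the directory mounted at /io and run
            #    the packaged command, which writes its output to the same mount.
            subprocess.run(
                [
                    "docker", "run", "--rm",
                    "-v", f"{host_dir}:/io",
                    image,
                    "sh", "-c",
                    "process --input /io/input.json --output /io/output.json",
                ],
                check=True,
            )

            # 3) Load the results and return them to the caller.
            return json.loads((Path(host_dir) / "output.json").read_text())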
While this communication with Docker may sound complex, it is entirely hidden from users of Repro. The Python API uses normal Python types as inputs and outputs even though the data processing is largely done in Docker. As such, it looks the same as a standard Python function. Therefore, users do not need to know how to use Docker in order to use the library.
Distributing Docker Images
All of the Docker images supported by Repro are hosted on DockerHub and have corresponding Dockerfiles in Repro. When a new Dockerfile is added to Repro, a GitHub Action (https://github.com/features/actions) is triggered which builds an image from the Dockerfile and pushes it to DockerHub.
If a user attempts to run code in a Docker container for which the corresponding image is not present on their machine, Repro will automatically download that image for them. The images can be manually downloaded from DockerHub as well.
Papers Implemented in Repro
As of this writing, there is code from 30+ papers supported by Repro. The majority of them are related to evaluating generated text based on our own research interests, but there are also models for text summarization, question-answering, question generation, constituency parsing, and more. See Appendix A for the full list.
Once a user installs Repro on a machine with Docker, all 30+ of the codebases can be run without any additional setup. We are continually adding support for more papers and welcome contributions from the research community.
Contributing
We hope that the research community sees the benefits of Repro and contributes their own Docker images to the library for others to use. Because many people within the NLP community may not have experience using Docker, our GitHub repository contains tutorials that explain how to install Docker, list basic Docker concepts and useful commands, and provide instructions for packaging a codebase into a Docker image. We additionally explain how to write the Repro wrapper around the Docker image.
4 Reproducibility Best Practices
During the process of building Docker images for various codebases, we have identified several best practices that researchers can follow to make it easier for others to faithfully run their code as intended.
First, example inputs and expected outputs should be included in the code's documentation to help ensure that reproductions are faithful to the original implementation.
The exact programming language environment should be specified in the documentation, including, for example, the version of Python as well as the versions of the Python packages used in the original environment. Package managers such as pip and conda provide tools (e.g., pip freeze) for exporting a list of installed packages to a file that can be distributed to others.
External resources, such as data files or pre-trained models, should be stored in locations which are not likely to be moved or deleted. Anecdotally, we have found that files stored in locations owned by organizations (e.g., universities or companies) are more stable than those stored in individuals’ personal storage platforms (e.g., Google Drive).
Finally, authors should provide a command line interface for running their code end-to-end that accepts one or more files as input and writes a file as output, rather than a series of scripts that must be run one after another to process the data. This makes the code easier to use, which in turn makes it easier for other researchers to run it on their own data and to integrate it into the Repro library.
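For example, a minimal sketch of such a command line interface might look like the following; the predict function is a placeholder for a paper's actual inference code, and the JSONL input/output format is only an illustration:

    import argparse
    import json

    def predict(example: dict) -> dict:
        # Placeholder for the paper's actual inference code.
        return {"output": example}

    def main() -> None:
        parser = argparse.ArgumentParser(description="Run the model end-to-end.")
        parser.add_argument("--input-file", required=True,
                            help="JSONL file with one input example per line")
        parser.add_argument("--output-file", required=True,
                            help="JSONL file to write one prediction per line")
        args = parser.parse_args()

        with open(args.input_file) as f:
            examples = [json.loads(line) for line in f]

        predictions = [predict(example) for example in examples]

        with open(args.output_file, "w") as out:
            for prediction in predictions:
                out.write(json.dumps(prediction) + "\n")

    if __name__ == "__main__":
        main()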
5 Hardware Limitations
Although Repro significantly improves the ease of reproducing the correct runtime environment and the usability of research code, the library has some limitations surrounding hardware compatibility.
Docker containers are only identical across machines up to differences in hardware. As new hardware is released, it may not be compatible with older software, which can make it difficult to run older Docker images. For example, a new GPU may require a minimum version of CUDA, while a model may depend on a specific version of PyTorch (Paszke et al., 2019) that is incompatible with every CUDA version supported by that GPU; in that case, the model cannot run on the new GPU. This issue can be mitigated by authors updating their code and Docker images to remain compatible with current hardware.
6 Related Work
Various other software libraries aim to make running research code easier. SacreBLEU (Post, 2018), SacreROUGE (Deutsch and Roth, 2020), Huggingface Datasets (Lhoest et al., 2021), and the GEM metrics library (https://github.com/GEM-benchmark/GEM-metrics) provide wrappers around, or implementations of, various text generation evaluation metrics in order to establish standardized implementations and make the metrics easier to run. Libraries such as AllenNLP (Gardner et al., 2018) or Transformers (Wolf et al., 2020) provide frameworks with which researchers can train deep learning models. Once someone is familiar with one of these frameworks, running a trained model built with it is relatively straightforward, since models within a framework typically share similar APIs.
The key difference between these approaches and Repro is how the environments for the models or metrics are maintained. These libraries either maintain one complex runtime environment for all of the models and metrics they support or require a new environment for each one. Repro instead has one lightweight Python environment for the codebase wrappers and uses Docker to maintain the paper-specific runtime environments, which makes the research code included in Repro very easy to use because library users do not need to maintain those environments themselves.
7 Conclusion
We have introduced Repro, a library built on Docker that aims to improve the reproducibility and usability of research code. We hope that other researchers see how the library makes running their code more accessible to others and consider contributing their own Docker images.
References
- Chen et al. (2020) Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2020. MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6521–6532, Online. Association for Computational Linguistics.
- Colombo et al. (2021a) Pierre Colombo, Chloé Clavel, and Pablo Piantanida. 2021a. InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation. ArXiv, abs/2112.01589.
- Colombo et al. (2021b) Pierre Colombo, Guillaume Staerman, Chloé Clavel, and Pablo Piantanida. 2021b. Automatic Text Evaluation through the Lens of Wasserstein Barycenters. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10450–10466, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA. Association for Computational Linguistics.
- Deutsch et al. (2021) Daniel Deutsch, Tania Bedrax-Weiss, and Dan Roth. 2021. Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary. Transactions of the Association for Computational Linguistics, 9:774–789.
- Deutsch and Roth (2020) Daniel Deutsch and Dan Roth. 2020. SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 120–125, Online. Association for Computational Linguistics.
- Dou et al. (2021) Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2021. GSum: A General Framework for Guided Neural Abstractive Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4830–4842, Online. Association for Computational Linguistics.
- Dugan et al. (2020) Liam Dugan, Daphne Ippolito, Arun Kirubarajan, and Chris Callison-Burch. 2020. RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 189–196, Online. Association for Computational Linguistics.
- Durmus et al. (2020) Esin Durmus, He He, and Mona Diab. 2020. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.
- FitzGerald et al. (2018) Nicholas FitzGerald, Julian Michael, Luheng He, and Luke Zettlemoyer. 2018. Large-Scale QA-SRL Parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2051–2060, Melbourne, Australia. Association for Computational Linguistics.
- Gao et al. (2020) Yang Gao, Wei Zhao, and Steffen Eger. 2020. SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1347–1354, Online. Association for Computational Linguistics.
- Gardner et al. (2018) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A Deep Semantic Natural Language Processing Platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.
- Goyal and Durrett (2020) Tanya Goyal and Greg Durrett. 2020. Evaluating Factuality in Generation with Dependency-level Entailment. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3592–3603, Online. Association for Computational Linguistics.
- Gupta et al. (2020) Nitish Gupta, Kevin Lin, Dan Roth, Sameer Singh, and Matt Gardner. 2020. Neural Module Networks for Reasoning over Text. In International Conference on Learning Representations.
- Hessel et al. (2021) Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Kane et al. (2020) Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NUBIA: NeUral Based Interchangeability Assessor for Text Generation. In Proceedings of the 1st Workshop on Evaluating NLG Evaluation, pages 28–37, Online (Dublin, Ireland). Association for Computational Linguistics.
- Kitaev et al. (2019) Nikita Kitaev, Steven Cao, and Dan Klein. 2019. Multilingual Constituency Parsing with Self-Attention and Pre-Training. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3499–3505, Florence, Italy. Association for Computational Linguistics.
- Kryscinski et al. (2020) Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the Factual Consistency of Abstractive Text Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Lhoest et al. (2021) Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. 2021. Datasets: A Community Library for Natural Language Processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
- Post (2018) Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
- Pyatkin et al. (2021) Valentina Pyatkin, Paul Roit, Julian Michael, Yoav Goldberg, Reut Tsarfaty, and Ido Dagan. 2021. Asking It All: Generating Contextualized Questions for any Semantic Role. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1429–1441, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
- Scialom et al. (2021) Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021. QuestEval: Summarization Asks for Fact-based Evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Scialom et al. (2019) Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers Unite! Unsupervised Metrics for Reinforced Summarization Models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256, Hong Kong, China. Association for Computational Linguistics.
- Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning Robust Metrics for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
- Staerman et al. (2021) Guillaume Staerman, Pavlo Mozharovskyi, Pierre Colombo, Stéphan Clémençon, and Florence d'Alché-Buc. 2021. A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions.
- Susanto et al. (2016) Raymond Hendy Susanto, Hai Leong Chieu, and Wei Lu. 2016. Learning to Capitalize with Character-Level Recurrent Neural Networks: An Empirical Study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2090–2095, Austin, Texas. Association for Computational Linguistics.
- Thompson and Post (2020) Brian Thompson and Matt Post. 2020. Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 90–121, Online. Association for Computational Linguistics.
- Vasilyev et al. (2020) Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11–20, Online. Association for Computational Linguistics.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Yuan et al. (2021) Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating Generated Text as Text Generation. ArXiv, abs/2106.11520.
- Zhang and Bansal (2021) Shiyue Zhang and Mohit Bansal. 2021. Finding a Balanced Degree of Automation for Summary Evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6617–6632, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
- Zhao et al. (2019) Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
Appendix A List of Supported Papers
The following is a list of papers with publicly available code that have implementations in Repro:
- BART (Lewis et al., 2020)
- BARTScore (Yuan et al., 2021)
- BERTScore (Zhang et al., 2020)
- BaryScore (Colombo et al., 2021b)
- BertSumExtAbs (Liu and Lapata, 2019)
- BLANC (Vasilyev et al., 2020)
- BLEU (Papineni et al., 2002)
- BLEURT (Sellam et al., 2020)
- CLIPScore (Hessel et al., 2021)
- COMET (Rei et al., 2020)
- DAE (Goyal and Durrett, 2020)
- DepthScore (Staerman et al., 2021)
- FactCC (Kryscinski et al., 2020)
- FEQA (Durmus et al., 2020)
- GSum (Dou et al., 2021)
- InfoLM (Colombo et al., 2021a)
- LERC (Chen et al., 2020)
- Lite3Pyramid (Zhang and Bansal, 2021)
- Meteor (Denkowski and Lavie, 2014)
- MoverScore (Zhao et al., 2019)
- NMN-Drop (Gupta et al., 2020)
- NUBIA (Kane et al., 2020)
- Prism (Thompson and Post, 2020)
- QAEval (Deutsch et al., 2021)
- QuestEval (Scialom et al., 2021)
- ROUGE (Lin, 2004)
- SummaQA (Scialom et al., 2019)
- SUPERT (Gao et al., 2020)
- the question generation model from Pyatkin et al. (2021)
- the constituency parser from Kitaev et al. (2019)
- the recipe generator from Dugan et al. (2020)
- the truecaser from Susanto et al. (2016)
- the QA-SRL parser from FitzGerald et al. (2018)