Software is a crucial element of modern scientific research. However, all too often, software is neither formally published nor cited in the literature, making it difficult for researchers and developers — and the organizations that fund them — to quantify its impact. A newly released data set aims to fill that gap.
Developed by the Chan Zuckerberg Initiative (CZI), a scientific funder based in Redwood City, California, the CZ Software Mentions data set does not catalogue formal citations, but rather mentions of software in the text of scientific articles [1]. With 67 million mentions from nearly 20 million full-text research articles, the data set, announced on 28 September last year, is the largest-ever database of scientific-software mentions, says Dario Taraborelli, a science program officer at CZI.
“If you look at the key breakthroughs, not just in biomedicine, but in science in the last decade, they have consistently been computational in nature,” Taraborelli says: the prediction of protein folding, for example, and the depiction of black holes. “And scientific open-source software specifically has been at the core of these breakthroughs.”
CZI has pledged US$40 million over 3 years through its Essential Open Source Software for Science (EOSS) programme to support the programmers developing such software in the biosciences field. But the organization wants future funders to know where their money will have the greatest effect. “Studying mentions was the best possible venue for us to draw a map of where software has an impact,” says Taraborelli, “and making it available to the community will help amplify these efforts.”
To create the data set, Taraborelli’s team started with an artificial-intelligence language model called SciBERT. This is a neural network that has been trained on research papers to view text and fill in missing sections. The researchers further trained SciBERT to process text and decide whether a word or phrase was the name of a piece of scientific software. To do this, they presented it with an existing data set of about 5,000 scientific papers called SoftCite, in which every software mention had been manually labelled. The researchers then applied their refined model to a collection of about 20 million articles that CZI had obtained from the online repository PubMed Central and directly from publishers.
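Training a model to decide whether a word or phrase names a piece of software is commonly framed as labelling every token in a sentence. The sketch below shows how manually labelled mentions could be converted into per-token training labels using the widely used BIO scheme. The article does not say which labelling scheme the CZI team used, so the scheme, the function name `bio_tags`, the `(start, end)` annotation format and the example sentence are all assumptions made for illustration.

```python
def bio_tags(tokens, mentions):
    """Label each token as B-SW (begins a software name),
    I-SW (continues one) or O (outside any mention).

    `mentions` is a list of (start, end) token-index pairs, end exclusive.
    This annotation format is hypothetical, not SoftCite's actual schema.
    """
    tags = ["O"] * len(tokens)
    for start, end in mentions:
        tags[start] = "B-SW"            # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I-SW"            # remaining tokens of the mention
    return tags


# Illustrative sentence with two hand-labelled mentions.
tokens = "We clustered cells with Seurat and plotted using GraphPad Prism".split()
tags = bio_tags(tokens, [(4, 5), (8, 10)])

for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```

A language model such as SciBERT would then be fine-tuned to predict these tags from the tokens; that step needs a deep-learning library and is left out here.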
They then tried to work out which specific software tool each mention referred to. Ana-Maria Istrate, a research scientist at CZI, says this was one of the biggest challenges. A set of tools for data analysis called scikit-learn, for example, might appear in text as ‘Scikit learn’, ‘sklearn’, ‘scikit-learn81’ or with other phrasing. The researchers first applied a clustering algorithm to group software mentions by similarity, such that each cluster represented one piece of software. They then picked the most common term in each cluster and searched for it in online software repositories, such as GitHub, to map software names to online locations. Finally, researchers manually cleaned the data to remove phrases that did not actually refer to software.
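The grouping step can be illustrated with a small sketch. The article says only that a clustering algorithm was used, so the greedy string-similarity approach below is a stand-in rather than the team's method. The helper names, the similarity threshold of 0.7 and the sample mentions are assumptions.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize(mention):
    """Lowercase, strip trailing digits (often a fused citation number)
    and drop everything except letters and digits.
    Crude: it also strips real version digits, as in 'DESeq2'."""
    text = re.sub(r"\d+$", "", mention.lower())
    return re.sub(r"[^a-z0-9]", "", text)


def cluster_mentions(mentions, threshold=0.7):
    """Greedily assign each mention to the first cluster whose
    representative is similar enough; otherwise start a new cluster.
    Returns {most common raw spelling: number of mentions in cluster}."""
    clusters = []  # each item: {"key": normalized representative, "raw": Counter}
    for mention in mentions:
        key = normalize(mention)
        for cluster in clusters:
            if SequenceMatcher(None, key, cluster["key"]).ratio() >= threshold:
                cluster["raw"][mention] += 1
                break
        else:
            clusters.append({"key": key, "raw": Counter([mention])})
    return {
        c["raw"].most_common(1)[0][0]: sum(c["raw"].values())
        for c in clusters
    }


# Hypothetical mentions, including the variants quoted above.
mentions = ["scikit-learn", "Scikit learn", "sklearn",
            "scikit-learn81", "scikit-learn", "Seurat"]
canonical = cluster_mentions(mentions)
print(canonical)
```

The most common spelling in each cluster serves as the name that would then be looked up in a repository such as GitHub. A greedy pass like this depends on input order and on the threshold, which is one reason a manual clean-up step remains useful.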
When the team applied the model to a subset of 2.4 million papers, it detected about 10 million mentions, corresponding to 97,600 unique pieces of software. People could use those data, for instance, to identify the most frequently mentioned tools by research field, to find software titles that appear together or to reveal the most popular pieces of software over time (see ‘Software rising’). These potential uses are documented in a computational notebook that accompanies the Software Mentions data set repository on GitHub. “We’re excited to note some of the software that ranked near the top are tools we fund through our EOSS programme,” Istrate says. These include titles such as Seurat, GSVA, IQ-TREE and Monocle.
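The kinds of analysis described above come down to counting. The following sketch is not taken from the CZI notebook: the record layout `(paper_id, year, software_name)` and the sample rows are invented for illustration, and the real data set may be organized differently.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records, not real data: (paper_id, year, software_name).
records = [
    ("p1", 2019, "Seurat"), ("p1", 2019, "Monocle"),
    ("p2", 2020, "Seurat"), ("p2", 2020, "IQ-TREE"),
    ("p3", 2020, "Seurat"), ("p3", 2020, "Monocle"),
]

# Most frequently mentioned tools overall.
mentions_per_tool = Counter(tool for _, _, tool in records)

# Popularity over time: mentions per (year, tool) pair.
mentions_per_year = Counter((year, tool) for _, year, tool in records)

# Tools that appear together: count each unordered pair once per paper.
tools_by_paper = {}
for paper, _, tool in records:
    tools_by_paper.setdefault(paper, set()).add(tool)

co_mentions = Counter()
for tools in tools_by_paper.values():
    co_mentions.update(combinations(sorted(tools), 2))

print(mentions_per_tool.most_common(3))
print(co_mentions.most_common(3))
```

Grouping by research field would work the same way if each record also carried a field label.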
Frank Krüger, a computer scientist at the Wismar University of Applied Sciences in Germany, who completed a similar project last year [2], says the CZI team “did a great job establishing such a great resource covering software mentions”.
Michelle Barker, who lives in Australia and directs the Research Software Alliance, a nonprofit organization that brings together developers and funders of scientific software, calls the data set an important contribution. “We’re at this fantastic juncture where there’s recognition that research software is a critical part of modern research”, she says, but researchers need “to be able to analyse the data”. Documenting software mentions does more than help to direct funding appropriately, she adds; it also gives developers recognition and helps organizations to know whom to hire and promote.
It also helps developers to know how their work is being used, and shows researchers which specific tools were used to conduct published computational analyses, which makes those analyses easier to reproduce.
New norms needed
Tools such as the CZ Software Mentions data set account for just one element in recognizing the work of developers. New norms are also needed, according to researchers. The Amsterdam Declaration on Funding Research Software Sustainability [3], created by the Research Software Alliance last November, lists several key principles and recommendations, including that research software should be recognized as a research output and that organizations need to hire people to maintain it. (The same arguments have been made about data sets.)
And in November, Taraborelli and others published ‘Ten simple rules for funding scientific open source software’ [4], which advises funders to encourage diversity, promote transparent governance of software projects and support not only the creation of tools but also the maintenance of existing ones.
Ironically, the more a tool is used, the less often it tends to be specifically mentioned in papers. Taraborelli points to the ubiquity of Matplotlib and NumPy (popular Python libraries for plotting graphs and for numerical analysis, respectively), the use of which often goes unstated. But on GitHub, hundreds of thousands of other software packages rely on these libraries. “If you counted software dependencies as citations, some of these projects would be the most impactful artefacts ever produced in science,” he says. “And yet, up until a couple of years ago, major funding agencies declined funding for these projects, stating that they lack sufficient impact.”
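Counting dependencies as citations amounts to tallying how many packages declare a given library as a requirement. The sketch below uses a made-up dependency table. The names `toolA`, `toolB` and `toolC` are hypothetical, and real figures would have to come from package manifests or a service that indexes them.

```python
from collections import Counter

# Hypothetical dependency lists: package -> libraries it requires.
dependencies = {
    "toolA": ["numpy", "matplotlib"],
    "toolB": ["numpy"],
    "toolC": ["toolA", "numpy"],
}

# Each package counts once towards every distinct library it depends on,
# much as a paper counts once towards each work it cites.
dependents = Counter(
    dep for deps in dependencies.values() for dep in set(deps)
)

print(dependents.most_common())
```

This tallies direct dependencies only; following the chain (here, `toolC` relies on `matplotlib` through `toolA`) would raise the counts for foundational libraries further.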
“Software, quite rightly, lives or dies depending on how much it’s used,” says Robert Lanfear, a biologist at the Australian National University in Canberra and co-developer of the IQ-TREE software. “Additional measures of usage are always welcome. They can only help us better understand how, and how much, each software package is used.”