Popularity and Innovation in Maven Central

Nkiru Ede, Jens Dietrich, and Ulrich Zülicke {nkiru.ede,jens.dietrich,uli.zuelicke}@vuw.ac.nz
Victoria University of Wellington
Wellington, New Zealand

Abstract

Maven Central is a large, popular repository of Java components that has evolved over the last 20 years. The distribution of dependencies indicates that the repository is dominated by a relatively small number of components that other components depend on. The question is whether those elites are static or change over time, and how this relates to innovation in the Maven ecosystem. We study those questions using several metrics. We find that elites are dynamic, and that the rate of innovation is slowing as the repository ages but remains healthy.

I Introduction

Software re-use has been revolutionized by the emergence of software ecosystems (SECOs) [26]. SECOs provide an infrastructure to rapidly release, discover and use software components. They are usually linked to build tools that include dependency managers. Those tools can resolve symbolic references to components in a SECO to physical references by downloading and managing the respective components, often adding additional functionality like security checks and conflict resolution. The availability of standard formats used to declare dependencies allows the study of SECOs as networks, where components are modelled as vertices and dependencies between components as edges [37, 27, 43, 7].

One such ecosystem is Maven. Maven repositories are used to distribute binaries in the Java byte code format, produced by projects using Java, Kotlin and other languages that can be compiled to Java byte code. There are numerous build tools and package managers that can be used to interact (publish, query, etc.) with Maven repositories, Maven and Gradle being the most popular ones. The main Maven repository used by open-source developers to publish is Maven Central (https://central.sonatype.com/).

The study of dependency networks is well-established [10, 17], and several datasets have been published to facilitate this over the years [33, 2, 19]. Goblin [19] is the most recent dataset; it is based on Maven Central and made available as a Neo4j graph database [20]. For our present study, we are using the 30-08-2024 version (goblin_maven_30_08_2024.dump).

It is well-known that SECOs exhibit strong growth over time, and this can be observed in the respective networks [10, 40, 21, 9]. There are several networks that can be considered here. Firstly, artifacts in Maven are identified by a combination of group id (G), artifact id (A) and version (V). In the context of network analysis, we therefore refer to the versioned components as GAVs. Such GAVs refer to other GAVs through the dependencies they declare. In principle, dependencies can refer to sets of GAVs through the use of version ranges, a feature intended to support semantic versioning [32]. However, this feature is rarely used in Maven [11], and it is difficult to accurately model it as the resolution of dependency ranges hinges on the semantics of a particular build tool and the state of the repository at the time a component was built. We therefore decided to ignore dependencies declared to such ranges, and eliminated them from the dataset during dataset cleaning. We also removed GAVs for which we were unable to identify the release date. Of the 14,459,139 vertices and 119,660,406 edges in the dataset, 813,343 vertices (5.26%) and 5,104,196 edges (4.26%) were ignored due to these issues.
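The cleaning step described above can be sketched as follows. This is a minimal illustration and not the authors' actual script: the `Dependency` record, the `release_dates` lookup, and the rule that any declared version starting with a bracket or parenthesis counts as a range are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class Dependency(NamedTuple):
    """Hypothetical record for one declared dependency."""
    source_gav: str      # "group:artifact:version" of the dependent release
    target_ga: str       # "group:artifact" of the dependency
    target_version: str  # declared version string, possibly a range


def is_version_range(version: str) -> bool:
    # Assumption: Maven range syntax starts with a bracket or parenthesis,
    # e.g. "[1.0,2.0)" or "(,1.5]".
    return version.strip().startswith(("[", "("))


def clean(dependencies: Iterable[Dependency],
          release_dates: Dict[str, Optional[str]]) -> List[Dependency]:
    """Drop range dependencies and edges touching GAVs without a release date."""
    kept = []
    for dep in dependencies:
        if is_version_range(dep.target_version):
            continue
        target_gav = f"{dep.target_ga}:{dep.target_version}"
        if release_dates.get(dep.source_gav) is None:
            continue
        if release_dates.get(target_gav) is None:
            continue
        kept.append(dep)
    return kept
```

Treating an edge as unusable when either endpoint lacks a release date is one possible reading of the cleaning rule; other readings would keep the edge and drop only the vertex.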

It is of interest to also consider the aggregation of GAVs into unversioned components identified only by group and artifact ids (GAs). Such GAs correspond to components, while GAVs correspond to releases. A GA $ga_{1}$ depends on some other GA $ga_{2}$ (i.e., there is a directed edge between $ga_{1}$ and $ga_{2}$ in the GA graph), if some version $gav_{1}$ of $ga_{1}$ depends on some version $gav_{2}$ of $ga_{2}$ . We consider a GA to be released in a given year if any of its versions (GAVs) is released in this year. Figure 1 summarizes the growth of the GAV and GA networks over time.
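As an illustration of this aggregation, the following sketch collapses GAV edges into GA edges and derives the release years of each GA. The colon-separated `group:artifact:version` string format and the function names are assumptions chosen for the example, not part of the Goblin dataset's API.

```python
from typing import Dict, Iterable, Set, Tuple


def to_ga(gav: str) -> str:
    """Strip the version from a "group:artifact:version" identifier."""
    return gav.rsplit(":", 1)[0]


def ga_edges(gav_edges: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    # A GA edge exists if some version of the source GA depends on
    # some version of the target GA.
    return {(to_ga(source), to_ga(target)) for source, target in gav_edges}


def ga_release_years(gav_years: Dict[str, int]) -> Dict[str, Set[int]]:
    # A GA counts as released in a year if any of its GAVs was released then.
    years: Dict[str, Set[int]] = {}
    for gav, year in gav_years.items():
        years.setdefault(to_ga(gav), set()).add(year)
    return years
```

Because the result of `ga_edges` is a set, many GAV-level edges between the same pair of components collapse into a single GA-level edge.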

Organizations often have to make choices about which programming languages to use for developing new products. One of the deciding factors for that can be the availability of a healthy ecosystem of open-source-software libraries. Desirable attributes for such ecosystems include maturity, stability, and positive evolution via continued maintenance and innovation [14, 12]. Our study aims to provide insight into such fundamental characteristics of SECOs. In particular, we focus on ways to observe and quantify innovation as well as the dynamism of popular artifacts.

Innovation is widely recognized as a key driver of growth [24, 28, 35]. The Technical Committee of the International Organization for Standardization (ISO/TC 279) defines innovation as “a new or modified entity that creates or redistributes value” [16]. We use release types and frequencies to assess both maintenance (activity) and innovation. To gain a better understanding of the behavior of the key artifacts (elites) responsible for such innovation, we examine whether these elite artifacts remain static over the years as the ecosystem grows. In any given year, we define elite artifacts as those with more dependencies (usages) than their peers.

We use junit (https://central.sonatype.com/artifact/junit/junit/versions) and three other test frameworks, TestNG (https://central.sonatype.com/artifact/org.testng/testng/versions), Mockito, and Spock (https://central.sonatype.com/artifact/org.spockframework/spock-core/versions), as a case study. Adding tests is a widely used practice in open source development, and almost all projects have to choose testing libraries for this purpose. Among these, junit has been a fundamental component of software testing within the Maven ecosystem for many years, consistently ranking among the most influential Java projects.

Figure 1: Time evolution of vertex and edge counts for the GAV and GA networks derived from Maven Central.

In this paper, we study the observed growth of Maven Central in more detail. We start our investigation in Section II by considering whether certain components dominate the ecosystem, finding that this is generally the case. We then check whether the set of those elite components is static or changes over time (Section III). The observations from this section motivate us to have a closer look at the innovation dynamics of the Maven ecosystem, as explained in Section IV. Related work (Section V), a brief discussion of threats to validity (Section VI), and a conclusion (Section VIII) wrap up our contribution.

II Popularity distribution

A well-known effect in dynamic networks is the uneven distribution of vertex degrees. This often reflects the uneven spread of resources modelled by the network, such as wealth in a population, links on the World Wide Web, citations to scientific articles or patents etc. [1, 3, 39].

For Maven, some components are extremely popular, such as junit, guava, and various components from the Apache Commons family. On the other hand, there are also many components that are not used by any other component as a dependency.

To characterize the popularity distribution and concentration of resources, we use the Gini index [13]. Several studies have employed this quantity to measure resource distributions in software systems, including [10, 18, 8, 41]. The basic idea is to model incoming dependencies as wealth. If few components attract most dependencies, this suggests inequality and would result in a Gini value close to $1$ .

In principle, such an analysis can be done on the level of either the GAs or the GAVs. We find it more meaningful to consider the GA network, because popular components tend to be more actively maintained and therefore release new versions more frequently. When versions (GAVs) are considered, incoming dependencies are split between those versions. For instance, the popular junit framework has already released 11 versions between January and October 2024 (https://central.sonatype.com/artifact/org.junit.jupiter/junit-jupiter-api/versions). Analysis of historical usage trends reveals that junit consistently ranked among the top 100 contributors to the ecosystem from 2005 to 2024, with an overall usage count of 150,000 and rising. This far outpaces other testing frameworks: the next closest project, TestNG, accumulated just over 11,900 usages. Here, we see junit’s dominance when compared with its competitors.

To compute the Gini index for the GA network for a given year $y_{0}$ , we model population, wealth and ownership as follows:

1. population: the components (GAs) with any version released in the year $y_{0}$ or before,
2. wealth: the dependencies (edges) of components released in the year $y_{0}$ ,
3. ownership: a component $ga_{2}$ owns a dependency $gav_{1}\rightarrow gav_{2}$ if there are versions $gav_{2}$ of $ga_{2}$ and $gav_{1}$ of $ga_{1}$ such that $gav_{1}$ is released in the year $y_{0}$ .
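The three modelling choices above can be turned into a computation along the following lines. This is a sketch under stated assumptions rather than the authors' implementation: it uses the standard rank-based formula for the Gini index of a finite sample, counts every GAV-level edge as one unit of wealth, and assumes hypothetical inputs `first_release_year` (per GA) and `gav_year` (per GAV).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def gini(values: Iterable[float]) -> float:
    """Rank-based Gini index of a finite sample (0 = equal, near 1 = unequal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def gini_for_year(year: int,
                  first_release_year: Dict[str, int],
                  gav_edges: Iterable[Tuple[str, str]],
                  gav_year: Dict[str, int]) -> float:
    # Population: GAs with any version released in `year` or before.
    population = {ga for ga, y in first_release_year.items() if y <= year}
    # Wealth and ownership: each edge whose source GAV was released in `year`
    # is credited to the GA of the target GAV.
    wealth: Counter = Counter()
    for source_gav, target_gav in gav_edges:
        if gav_year.get(source_gav) == year:
            wealth[target_gav.rsplit(":", 1)[0]] += 1
    # Members of the population without incoming edges have zero wealth.
    return gini(wealth.get(ga, 0) for ga in population)
```

Including zero-wealth components in the population matters here: the many components that nobody depends on are what push the index towards $1$.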

Figure 2 shows the respective Gini coefficients. The measured values suggest significant inequality within the distribution of dependencies, with the level of inequality increasing over time. A superficial interpretation of this trend could be that “the rich get richer”. Expressed differently, a few components could be dominating the network, attracting more and more dependencies, and preventing new components from becoming popular. We study next whether this scenario is actually realized here.

III Dynamism of elites

The observation of unevenness in the dependency distribution suggests that the repository is dominated by a relatively small number of components used by other components. To find out whether the widely used (“elite”) components change over time, we have studied whether and how the elite status of components evolves in time.

We define elite status on components (GAs) by “wealth” through in-degree as discussed in the last section. For each year, we are studying the top 10, top 100 and top 500 components. We then study how many components join and leave those elites in any given year.

There are some aspects to consider here to ensure that this analysis is sufficiently robust against confounding effects. Components can be renamed (either the group id, or the artifact id). To account for this, we have used information on artifact relocation from mvnrepository.com. For example, for the junit artifact junit:junit, this website contains an entry stipulating that “this artifact was moved to: org.junit.jupiter:junit-jupiter-api” (https://mvnrepository.com/artifact/junit/junit). We collected this information for all components (GAs) that had elite status at some stage and created an alias map with such renaming. This map has 4,000 entries. Using this map reduces the number of elite removals and additions that are due only to renaming.
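One way to apply such an alias map is to resolve every GA to a canonical name before ranking. The sketch below is illustrative only: the dictionary representation, the function name `canonical`, and the decision to follow chains of relocations are assumptions, and the paper does not state how chained or cyclic entries are handled.

```python
from typing import Dict


def canonical(ga: str, aliases: Dict[str, str]) -> str:
    """Follow relocation entries until a GA without a further alias is reached."""
    seen = set()
    # The `seen` set guards against accidental cycles in the alias map.
    while ga in aliases and ga not in seen:
        seen.add(ga)
        ga = aliases[ga]
    return ga


# Example entry taken from the relocation notice quoted in the text.
ALIASES = {"junit:junit": "org.junit.jupiter:junit-jupiter-api"}
```

With this map, `canonical("junit:junit", ALIASES)` resolves to the relocated name, so a rename alone does not register as one component leaving the elite and another one joining it.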

Using these data, we looked at the changes in elites over the years. The results are depicted in Figure 3. While the elites underwent some radical changes during the earlier years, this quickly stabilized around 2006. After some further gradual decline, we observe an annual turnover between 20 and 30% since 2012. This suggests a mature ecosystem where popular components change at a stable rate over time. For example, we see the increasing popularity of frameworks such as Mockito and Spock, which rank among the top 100 contributors between 2012 and 2024, just as TestNG became less popular and consequently fell out of the top 100 contributors in 2016. This further demonstrates the dynamic nature of the ecosystem as it has gained traction in recent years. In other words, this is an indication of ongoing innovation in Maven Central. We explore the notion of innovation further in the next section.
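The join and leave counts behind such a turnover figure can be sketched as below. The exact definition of the turnover percentage is not given in the text, so the rate used here (newly joined members as a share of the current elite) is an assumption, as are the function names and the alphabetical tie-breaking rule.

```python
from typing import Dict, Set, Tuple


def top_k(in_degree: Dict[str, int], k: int) -> Set[str]:
    """The k GAs with the highest in-degree; ties are broken alphabetically."""
    ranked = sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))
    return {ga for ga, _ in ranked[:k]}


def turnover(prev_elite: Set[str],
             curr_elite: Set[str]) -> Tuple[Set[str], Set[str], float]:
    """Return (joined, left, rate) between two consecutive yearly elites."""
    joined = curr_elite - prev_elite
    left = prev_elite - curr_elite
    rate = len(joined) / len(curr_elite) if curr_elite else 0.0
    return joined, left, rate
```

For a fixed elite size such as the top 100, the number of joiners equals the number of leavers once at least that many components exist, so either count yields the same rate.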

IV Innovation

To further study innovation [30], we introduce two metrics that reflect complementary types of innovation. Those metrics are based on the following quantities measured for a given year: (1) FirstGA is the number of components that had the first release of a version (GAV) during this year. (2) LastGA is the number of GAs that have seen the last release in a given year. Measuring LastGA for the last years of the time period studied is not meaningful, as there is still a reasonable chance that new versions will be released in the future. We therefore measure those values only up to 2022. Notably, once Maven components are added, they cannot be deleted. (This is a desirable property of a repository, as withdrawing components from the repository can break dependencies and compromise downstream clients. For instance, this has led to the infamous left-pad incident in npm [25].) (3) MajorReleaseGA is the number of GAs that have seen a major version release in the given year. We assume here that major version releases contain new features and some innovation [6], whereas minor and patch releases are mainly used for maintenance.

The quantities FirstGA, LastGA and MajorReleaseGA exhibit exponential growth over the considered time period, qualitatively similar to the growth pattern of the GA network displayed in Fig. 1. We then consider $innovation1=FirstGA/LastGA$ and $innovation2=MajorReleaseGA/LastGA$ as two proxy measures for innovation. Normalization with respect to LastGA is designed to measure innovative activity in the SECO relative to the noisy background of obsolete, or otherwise irrelevant, components present at any given time. The results are depicted in Fig. 4. Interesting trends are seen to emerge in the time evolution of both $innovation1$ and $innovation2$ that are obscured by exponential growth when only the unnormalized quantities FirstGA and MajorReleaseGA are considered.
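To make the definitions of the two ratios concrete, the sketch below derives the three yearly quantities from a list of releases. Several details are assumptions made for the example and are not confirmed by the text: a major release is detected as an increase of the leading integer of the version string, the first release of a component does not count as a major release, and the input is assumed to be sorted by release date. The 2022 cut-off for LastGA is not applied here.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple


def major_of(version: str) -> Optional[int]:
    """Leading integer of a version string, or None if there is none."""
    match = re.match(r"\d+", version)
    return int(match.group()) if match else None


def yearly_counts(releases: Iterable[Tuple[str, str, int]]):
    """releases: (ga, version, year) tuples, assumed sorted by release date."""
    first_year, last_year, highest_major = {}, {}, {}
    major_events = set()  # (ga, year) pairs in which the major number rose
    for ga, version, year in releases:
        first_year.setdefault(ga, year)
        last_year[ga] = year
        major = major_of(version)
        if major is None:
            continue
        previous = highest_major.get(ga)
        if previous is not None and major > previous:
            major_events.add((ga, year))
        if previous is None or major > previous:
            highest_major[ga] = major
    first_ga = Counter(first_year.values())
    last_ga = Counter(last_year.values())
    major_ga = Counter(year for _, year in major_events)
    return first_ga, last_ga, major_ga


def innovation(first_ga, last_ga, major_ga, year: int):
    """Return (innovation1, innovation2) for a year, or None if LastGA is 0."""
    if last_ga[year] == 0:
        return None
    return first_ga[year] / last_ga[year], major_ga[year] / last_ga[year]
```

Storing `(ga, year)` pairs in a set ensures that a component with several major releases in one year is counted once, in line with MajorReleaseGA being a count of GAs rather than of releases.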

We observe a steady slow increase of $innovation2$ (innovation through major improvements of existing components) over time. Conversely, $innovation1$ (innovation through creating new components) is overall decreasing but appears to be stabilizing over time. Notably, $innovation1$ remains greater than one over the time period considered, which is an indication that still more components are being added to the ecosystem than components being abandoned. We interpret these trends as an indication that Maven Central is both healthy and mature as a SECO.

V Related Work

Decan et al. [10], in their empirical analysis of package dependency networks across seven packaging ecosystems (Cargo for Rust, CPAN for Perl, CRAN for R, npm for JavaScript, NuGet for the .NET platform, Packagist for PHP, and RubyGems for Ruby), identified common challenges in these ecosystems. They assessed the growth, changeability, reusability, and fragility of these ecosystems, revealing trends of network expansion, the centrality of a small number of packages in driving updates, and the prevalence of fragile packages with numerous transitive dependencies. Similarly to our approach, their study performed a dynamic analysis using the Gini index as a key metric to explore inequality and concentration within these networks. However, while their study spans a diverse range of ecosystems, differing in size, age, and policies, our present research is more focused, dealing solely with Maven’s ecosystem.

Other researchers often use popularity metrics to sample datasets or investigate software properties. While some studies have described the popularity of software components in terms of social characteristics, others have described them in terms of technical aspects [42]. For instance, a study of GitHub developers conducted by Lee et al. [23] demonstrated that very well-known developers, who are often referred to as “rock stars”, have a greater influence on the projects their followers contribute to. In comparison to our study, which explores the network evolution and growth in the Maven ecosystem, they conducted a dynamic analysis of how the actions and interactions of developers evolve. Borges et al. [5] conducted a survey involving 400 Stack Overflow users. The results from their poll showed that the users viewed GitHub metrics such as stars, forks, and watchers as highly valuable indicators of how popular a project is. Furthermore, when Bogart et al. [4] questioned OSS developers on how they chose the right dependencies for their software projects, the majority of the comments fell into groups pertaining to reputation and popularity in the community.

Other researchers, such as Sajnani et al. [36], who measured the popularity of 2,406 Maven components by analyzing how often they were used in 55,191 open-source Java projects, have also argued that usage of software components can be used as a measure of their popularity. Their interpretation works under the assumption that if a component is widely (re)used, then it is generally regarded as good.

The shortcomings of using social attributes of software components as a popularity metric are exacerbated by the fact that they cannot provide a complete picture of real usage, as they can be easily influenced by individual preferences or trends [31]. Research conducted in the past by Kitchenham et al. [22], Fenton et al. [15], and Vasa et al. [38] has demonstrated how widely skewed software metrics are in general, making accurate interpretation with conventional descriptive statistical analysis challenging.

VI Threats to Validity and Reproducibility

VI-A Threats to Validity

There might be some additional patterns that could influence the accuracy of the analysis of elites in Section III. Notably, some components split into smaller components. We currently consider those modules as proper new components, but one could argue that some modules “inherit” the component status from their respective parents, and should be treated like aliases resulting from relocating components.

We decided not to model version ranges. The impact of this decision is small, as discussed in Section I.

In Section IV, we have made the assumption that major releases correspond to innovation, i.e., generally entail the introduction of new features. This is consistent with the objectives of semantic versioning [32]. We note that many projects do not strictly follow semantic versioning [34, 29]. However, this mainly relates to the presence of breaking changes in non-major releases.

VI-B Reproducibility of Results

The scripts and datasets created and used in this study are available on GitHub (https://github.com/nkiru-ede/Popularity_and_Innovation_in_Maven_Central/releases/tag/MSR25v1.0).

VII Ethical Implications

In attempts to measure lofty concepts such as utility, popularity, innovation and the like, researchers typically introduce noisy proxy measures of such socially constructed abstractions. One needs to reflect on limitations and inherent biases of the adopted quantities before attempting to interpret these more widely and derive deeper meaning, or even policy directions, from their cross-correlation. Furthermore, characterizing random distributions by only a few summary quantities (e.g., the Gini index) can lead to misrepresentations of diversity and variability in human endeavor. Quantitative analyses need to be augmented with qualitative insights from representative practitioners (in our case, software developers) to establish the real driving forces behind the trends exhibited in the data.

VIII Conclusion

We have studied the distribution of dependencies in Maven Central over 22 years (2002 to 2024), using the Goblin dataset. We find that Maven Central is dominated by a relatively small number of components that attract most dependencies. This in itself is hardly surprising. However, interestingly, the set of elite components is highly dynamic and exhibits significant annual turnover. We also observe that there is a stable rate of renewal through innovation in Maven Central.

References

[1] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–97, 2002.
[2] A. Benelallam, N. Harrand, C. Soto-Valero, B. Baudry, and O. Barais. The maven dependency graph: a temporal graph-based representation of maven central. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pages 344–348. IEEE, 2019.
[3] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang. Complex networks: Structure and dynamics. Physics Reports, 424(4):175–308, 2006.
[4] C. Bogart, C. Kästner, J. Herbsleb, and F. Thung. How to break an API: Cost negotiation and community values in three software ecosystems. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 109–120. ACM, 2016.
[5] H. Borges and M. T. Valente. What’s in a github star? Understanding repository starring practices in a social coding platform. Journal of Systems and Software, 146:112–129, 2018.
[6] E. M. Brown, C. Osborne, P. Cihon, M. Böhmecke-Schwafert, K. Xu, M. Boehm, and K. Blind. Measuring software innovation with open source software development data. Unpublished, available at https://arxiv.org/abs/2411.05087.
[7] M. Cataldo, I. Scholtes, and G. Valetto. A complex networks perspective on collaborative software engineering. Advances in Complex Systems, 17(7&8):1430001, 2014.
[8] T. Chełkowski, P. Gloor, and D. Jemielniak. Inequalities in open source software development: Analysis of contributor’s commits in Apache software foundation projects. PLoS One, 11(4):e0152976, 2016.
[9] A. Decan, T. Mens, and M. Claes. On the topology of package dependency networks: A comparison of three programming language ecosystems. In Proceedings of the 10th European Conference on Software Architecture Workshops. ACM, 2016. Article 21, 4 pages.
[10] A. Decan, T. Mens, and P. Grosjean. An empirical comparison of dependency network evolution in seven software packaging ecosystems. Empirical Software Engineering, 24(1):381–416, 2019.
[11] J. Dietrich, D. Pearce, J. Stringer, A. Tahir, and K. Blincoe. Dependency versioning in the wild. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pages 349–359. IEEE, 2019.
[12] J. Dijkers, R. Sincic, N. Wasankhasit, and S. Jansen. Exploring the effect of software ecosystem health on the financial performance of the open source companies. In 2018 ACM/IEEE 1st International Workshop on Software Health, pages 48–55. ACM, 2018.
[13] R. Dorfman. A formula for the Gini coefficient. The Review of Economics and Statistics, 61(1):146–149, 1979.
[14] S. Farshidi, S. Jansen, and M. Deldar. A decision model for programming language ecosystem selection: Seven industry case studies. Information and Software Technology, 139:106640, 2021.
[15] N. E. Fenton and M. Neil. A critique of software defect prediction models. IEEE Transactions on software engineering, 25(5):675–689, 1999.
[16] International Organization for Standardization. Innovation management — fundamentals and vocabulary (ISO 56000:2025), 2025. www.iso.org/standard/84436.html.
[17] C. Fritz, C.-P. Georg, A. Mele, and M. Schweinberger. A strategic model of software dependency networks. Unpublished, available at https://ssrn.com/abstract=4318082.
[18] O. Goloshchapova and M. Lumpe. On the application of inequality indices in comparative software analysis. In 2013 22nd Australian Software Engineering Conference, pages 117–126. IEEE, 2013.
[19] D. Jaime, J. El Haddad, and P. Poizat. Goblin: A framework for enriching and querying the maven central dependency graph. In 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR), pages 37–41. IEEE, 2024.
[20] D. Jaime, J. El Haddad, and P. Poizat. Navigating and exploring software dependency graphs using goblin. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). IEEE, 2025.
[21] R. Kikas, G. Gousios, M. Dumas, and D. Pfahl. Structure and evolution of package dependency networks. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pages 102–112. IEEE, 2017.
[22] B. A. Kitchenham. An evaluation of software structure metrics. In Proceedings COMPSAC 88: The Twelfth Annual International Computer Software & Applications Conference, pages 369–370. IEEE, 1988.
[23] M. J. Lee, B. Ferwerda, J. Choi, J. Hahn, J. Y. Moon, and J. Kim. Github developers use rockstars to overcome overflow of news. In CHI’13 Extended Abstracts on Human Factors in Computing Systems, pages 133–138. ACM, 2013.
[24] M. Mazzucato and C. Perez. Innovation as growth policy: The challenge for Europe. In J. Fagerberg, S. Laestadius, and B. R. Martin, editors, The Triple Challenge for Europe: Economic Development, Climate Change, and Governance, pages 229–264. Oxford University Press, 2015.
[25] T. Mens. An ecosystemic and socio-technical view on software maintenance and evolution. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 1–8. IEEE, 2016.
[26] T. Mens, C. De Roover, and A. Cleve, editors. Software Ecosystems. Springer, Cham, 2023.
[27] C. R. Myers. Software systems as complex networks: Structure, function, and evolvability of software collaboration graphs. Physical Review E, 68(4):046116, 2003.
[28] A. Nicolaides. Research and innovation–the drivers of economic development. African Journal of Hospitality, Tourism and Leisure, 3(2):1–16, 2014.
[29] L. Ochoa, T. Degueule, J.-R. Falleri, and J. Vinju. Breaking bad? semantic versioning and impact of breaking changes in maven central: An external and differentiated replication study. Empirical Software Engineering, 27(3):61, 2022.
[30] OECD and Eurostat. Oslo Manual 2018: Guidelines for Collecting, Reporting and Using Data on Innovation. The Measurement of Scientific, Technological and Innovation Activities. OECD Publishing, Paris/Eurostat, Luxembourg, 4th edition, 2018.
[31] M. D. Papamichail, T. Diamantopoulos, and A. L. Symeonidis. Measuring the reusability of software components using static analysis metrics and reuse rate information. Journal of Systems and Software, 158:110423, 2019.
[32] T. Preston-Werner. Semantic versioning 2.0.0. https://semver.org/.
[33] S. Raemaekers, A. Van Deursen, and J. Visser. The maven repository dataset of metrics, changes, and dependencies. In 2013 10th Working Conference on Mining Software Repositories (MSR), pages 221–224. IEEE, 2013.
[34] S. Raemaekers, A. van Deursen, and J. Visser. Semantic versioning and impact of breaking changes in the maven repository. Journal of Systems and Software, 129:140–158, 2017.
[35] M. Rönkkö, A. Ojala, and P. Tyrväinen. Innovation as a driver of internationalization in the software industry. In 2013 IEEE 3rd International Conference on Research and Innovation in Information Systems (ICRIIS), pages 49–54. IEEE, 2013.
[36] H. Sajnani, V. Saini, J. Ossher, and C. V. Lopes. Is popularity a measure of quality? an analysis of maven components. In 2014 IEEE international conference on software maintenance and evolution, pages 231–240. IEEE, 2014.
[37] S. Valverde, R. Ferrer Cancho, and R. V. Solé. Scale-free networks from optimal design. Europhysics Letters, 60(4):512, 2002.
[38] R. Vasa, J.-G. Schneider, and O. Nierstrasz. The inevitable stability of software change. In 2007 IEEE International Conference on Software Maintenance, pages 4–13. IEEE, 2007.
[39] A. Vespignani. Modelling dynamical processes in complex socio-technical systems. Nature Physics, 8:32–39, 2012.
[40] E. Wittern, P. Suter, and S. Rajagopalan. A look at the dynamics of the javascript package ecosystem. In 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pages 351–361. IEEE, 2016.
[41] D. Wu, G. Zeng, L. Meng, W. Zhou, and L. Li. Gini coefficient-based task allocation for multi-robot systems with limited energy resources. IEEE/CAA Journal of Automatica Sinica, 5(1):155–168, 2017.
[42] A. Zerouali, T. Mens, G. Robles, and J. M. Gonzalez-Barahona. On the diversity of software package popularity metrics: An empirical study of npm. In 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 589–593. IEEE, 2019.
[43] X. Zheng, D. Zeng, H. Li, and F. Wang. Analyzing open-source software systems as complex networks. Physica A, 387(24):6190–6200, 2008.