
Understanding Contributor Profiles in ML Libraries

Jiawen Liu1, Haoxiang Zhang2, Ying Zou1 1Department of Electrical and Computer Engineering, Queen’s University, Canada 2Software Analysis and Intelligence Lab (SAIL), Queen’s University, Canada
Abstract

With the increasing popularity of machine learning (ML), a growing number of software developers have been attracted to developing and adopting ML approaches. Establishing a deep understanding of developer profiles is critical to the success of the software development and maintenance process. Research efforts to study ML contributors have been emerging within the past few years. However, the focus of existing work is limited to the difficulties and challenges perceived by ML contributors, studied through user surveys, interviews, or posts on Q&A systems. There is a lack of understanding of the characteristics and contributing habits of ML contributors based on their behaviors in software repositories. In this paper, we aim to identify contributor profiles in ML software projects and study the characteristics, responsibilities, and challenges of each contributor profile by mining contributor activity traces in ML software repositories. By analyzing 6 popular ML libraries (i.e., Tensorflow, Pytorch, Keras, MXNet, Theano, and ONNX), we identify three ML contributor profiles, namely inactive, intermediate, and core contributors. We find that core contributors stay longer in the project and commit in more programming languages, compared to intermediate and inactive contributors. Additionally, core contributors make the majority of code contributions (i.e., perfective, corrective, and feature-introducing commits) and nonfunctional contributions. We also observe that contributors focus on different ML components in different projects.

1 Introduction

Throughout the past decade, machine learning (ML) has been gathering much attention from a wide range of domains, from autonomous driving vehicles to healthcare systems [9046805, KOUROU20158]. The growing availability of cloud computing services and open-source ML frameworks such as Tensorflow, Pytorch, Keras, MXNet, Theano, and ONNX has significantly lowered the barriers to implementing ML and made it accessible to non-ML specialists from other fields [blog]. In contrast, in the past, machine learning techniques were mainly available to ML specialists in research labs who possessed GPUs and ML-specific source code [mldeveloper].

Together with the popularization of ML frameworks and tools, an increasing number of software developers have been attracted to learning ML technologies and contributing to ML projects. In a software project, correctly understanding software developers can provide valuable insights for project management [ccetin2022analyzing]. Contributors can contribute to a software project in many ways, such as making code commits, conducting code reviews, fixing bugs, or updating documentation. These activities are critical to the success of the software development process and software evolution. Therefore, correctly understanding contributor behaviors in popular ML framework projects could reveal contribution patterns and anti-patterns, and provide guidelines that help developers in different ML contributor groups contribute effectively.

Regarding the study of open-source software (OSS) contributors, existing work mainly considers contributors in traditional software projects. These studies have depicted a nearly holistic picture of traditional software developers by revealing their characteristics in many aspects, including code contributions [da2014unveiling, geldenhuys2010finding], technical expertise [dey2021representation, montandon2019identifying], non-technical expertise [fe8a3ca6d43d4253855607d9a2cad816, 6880395], turnover [robles2006contributor], evolution [cheng2017developer, el2017periphery], working habits [binnewies2010recovery, zhang2020understanding, claes2018programmers], personalities [wiesche2014relationship], collaboration networks [el2019empirical, cohen2018large], and the roles of non-technical contributors [10.1145/3415251].

Research efforts to study ML software developers have been emerging within the past few years but are still limited. Considering the substantial differences between the traditional and ML software development processes [8987482], there is a need to revisit the studies of software developers while taking ML-specific aspects into account. Cai et al. conduct a survey and investigate the motivations, challenges, and desires of software developers when they begin learning ML [mldeveloper]. Hill et al. interview experienced ML contributors and find that the main challenge they face at work is establishing a repeatable workflow [MLdeveloperexp]. Ishikawa et al. deploy a questionnaire survey and identify the essential difficulties perceived by contributors when developing ML systems [8836142]. However, quantitative analysis of ML contributors is still limited, and few works have studied ML contributors based on their activities in ML software repositories.

To establish an initial understanding and identify the characteristics, responsibilities, and hurdles of ML contributors, we conduct an empirical study on contributors in 6 popular ML framework projects (i.e., Tensorflow, Pytorch, Keras, MXNet, Theano, and ONNX). We extract various contributor features from the repository history and cluster ML contributors into three groups, namely inactive, intermediate, and core. Then, we analyze the commits submitted by different ML contributors and the issue reports they raised in the project repositories. Such information can assist decision-makers in managing ML projects and reveal a potential pathway for traditional software developers to become ML experts. Through our study, we aim to answer the following research questions:

RQ1: What are the types of contributors in OSS ML projects? To characterize ML contributors, we use the K-Prototype clustering algorithm to cluster contributor features and obtain three contributor profiles, namely inactive, intermediate, and core. We further observe that most ML contributors are located in North America. Inactive and core contributors share a similar working pattern. Core contributors stay longer in the project, make more code contributions, and contribute in more programming languages, compared to the other two groups.

RQ2: How do responsibilities distribute among each type of contributor? We study contributors' contributions across different types of commits and ML components. We find that the core group makes the majority of all four types of commits (i.e., perfective, corrective, feature-introducing, and nonfunctional commits). Additionally, contributors focus on different ML components in different projects.

RQ3: What issues does each type of contributor tend to encounter? To identify the technical difficulties encountered and problems identified by different ML contributors, we study the issue reports they raise in Github repositories. We extract the textual content of the issue reports (i.e., titles, descriptions, and labels) and use Word2Vec to convert the text to vectors. The issue classifications are obtained through hierarchical clustering of the vectors based on cosine similarity. Lastly, we analyze the relationship between the issue categories and the ML contributor profiles.

In summary, we make the following contributions to the software engineering community:

(1) We establish an initial understanding of ML software developers by quantitatively studying contributor activities in popular ML project repositories.

(2) We identify the characteristics and contribution patterns of ML experts, which can assist ML newcomers in becoming experts.

(3) We identify the technical issues encountered by different ML contributors, which can help ML project managers allocate human resources.

Paper organization. The remainder of our paper is organized as follows. Section 2 introduces related studies. Section 3 describes the data collection and experiment setup. Section 4 presents the motivations, approaches, and results for answering our research questions. We describe the threats to the validity of our study in Section 5. Lastly, we conclude our paper in Section 6.

2 Related Work

In this section, we briefly introduce previous research studying ML library contributors.

Cai et al. investigate the motivations, challenges, and desires of software developers when they begin learning ML [mldeveloper]. The authors deploy an online survey on the Tensorflow.js framework website and receive 645 responses from its users. Through analyzing the responses, Cai et al. find that the major challenge for ML learners is the lack of conceptual and mathematical understanding, and that they desire ML frameworks to provide such support. Hill et al. conduct field interviews with 11 ML professionals who develop intelligent software systems in a large enterprise [MLdeveloperexp]. They find that the main challenge these contributors experience at work is establishing a repeatable workflow for the ML system development process. Ishikawa et al. conduct a questionnaire survey of 278 ML practitioners in Japanese companies and find that the essential difficulties perceived by ML contributors include low accuracy, lack of an oracle, and uncertainty of system behavior [8836142].

Gangash et al. conduct an empirical study on ML-related posts on Stack Overflow to investigate the topics that interest ML contributors [8816808]. Islam et al. classify the posts related to 10 ML libraries (i.e., Tensorflow, Keras, scikit-learn, Weka, Caffe, Theano, MLlib, Torch, Mahout, and H2O) on Stack Overflow and map the resulting categories to ML system development stages [islam2019developers]. They find that the most difficult problems faced by ML contributors relate to data preparation and training.

To summarize, most prior work on ML contributors studies the difficulties and challenges in ML software development, mainly through surveys, interviews, or analyzing posts on Q&A systems. In our study, we quantitatively study the characteristics, responsibilities, and challenges of ML contributors by mining open-source ML framework and library projects on Github. We extract contributor features covering different aspects (e.g., experience, contribution, social impact, geographical information, working habits, and technical merit) from the Github project repositories. Then, we identify ML contributor profiles by clustering the features, and further reveal the characteristics and responsibilities of each profile by analyzing their commit contributions. Lastly, we classify and analyze the issue reports raised by ML contributors to investigate the technical issues they encounter during the ML software development and maintenance process.

3 Experiment Setup

In this section, we present the data collection and experiment setup. An overview of our approach is shown in Figure 1. We first select the experiment projects from Github and collect the project data using the Github API (https://docs.github.com/en/rest?apiVersion=2022-11-28). Then, we extract features from the collected data and use the K-Prototype clustering algorithm to identify contributor profiles. We establish an understanding of ML contributors through further investigation of the resulting contributor profiles.

Figure 1: An overview of our approach.

3.1 Data Collection

As shown in Figure 1, we first select the experiment projects from Github, and then collect the data using the Github API and PyDriller.

Project Selection. To run our experiments on projects that contain rich contributor information and up-to-date machine learning technology, we choose to study projects related to ML libraries and frameworks. These projects gather collaborations between ML experts and are usually larger than projects that implement a single machine learning model to solve a problem. We rank popular ML library and framework projects on Github by the number of contributors, and select six projects of different sizes (i.e., two large projects, Tensorflow and Pytorch; two medium projects, Keras and MXNet; and two small projects, Theano and ONNX). All selected projects contain more than 2k commits, have a lifespan of more than 5 years, and were still active in May 2022 (i.e., when we collected the data), providing adequate and up-to-date information for our analysis. Detailed information on the selected projects is shown in Table I.

TABLE I: Details of the selected ML library and framework projects.
Project | Contributors | Commits | Pull Requests | Issues | Creation Time
Tensorflow | 3222 | 137512 | 21940 | 33767 | 2015-11
Pytorch | 2509 | 53069 | 59705 | 19499 | 2012-01
Keras | 1072 | 7463 | 5620 | 11249 | 2015-03
MXNet | 874 | 11893 | 10893 | 7751 | 2015-04
Theano | 342 | 28132 | 4016 | 2086 | 2008-01
ONNX | 242 | 2085 | 2360 | 1850 | 2012-01

Collecting Data. After the experiment projects are selected, we collect the data from their repositories on Github. We use the Github API to collect the (i) commits, (ii) pull requests, and (iii) issue reports of each project, and (iv) the account types of contributors (e.g., bot, organization, or user account) in the projects. PyDriller is used to fetch the complete timestamps of the commits and to identify the bug-introducing commits.
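As an illustration, below is a minimal sketch of how commit metadata can be traversed with PyDriller (assuming the PyDriller 2.x API; the repository path and printed fields are illustrative choices, not the exact extraction script we use):

```python
from pydriller import Repository

# Walk every commit of a locally cloned project (path is illustrative).
for commit in Repository("path/to/tensorflow").traverse_commits():
    print(commit.hash, commit.author.email, commit.author_date)
    for mf in commit.modified_files:
        # Per-file churn; new_path is None for deleted files.
        print("  ", mf.new_path, mf.added_lines, mf.deleted_lines)
```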

3.2 Feature Extraction

We extract 31 contributor features from the collected data. A full list of the features, together with their definitions and calculations, can be found in Appendix A. When extracting the contributor features, we exclude bots, organization accounts, and enterprise accounts so that our analysis focuses on human contributors. There are 7636 contributors in the collected projects, including 704 developers who contribute to multiple projects. We treat developers who contribute to multiple projects as different individuals in our experiments, since they may behave differently in different projects.
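To make the extraction concrete, here is a hedged sketch of how two of the numeric features in Table II (Duration and Code Contribution) could be derived from the commit stream; using author e-mails as identities and the dictionary layout are our assumptions:

```python
from collections import defaultdict
from pydriller import Repository

first_seen, last_seen = {}, {}
loc = defaultdict(int)

for commit in Repository("path/to/project").traverse_commits():
    author = commit.author.email  # assumed identity key; aliases not merged
    first_seen.setdefault(author, commit.author_date)
    last_seen[author] = commit.author_date
    loc[author] += commit.insertions + commit.deletions  # churn as LOC proxy

features = {
    author: {
        "duration_days": (last_seen[author] - first_seen[author]).days,
        "code_contribution_loc": loc[author],
    }
    for author in loc
}
```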

3.3 Correlation and Redundancy Analysis

Highly correlated features can be expressed by each other, and redundant features can be expressed by combinations of other features; both blur the importance of individual features in our analysis and cause unnecessary computation. Therefore, we conduct a correlation and redundancy analysis to remove highly correlated and redundant features.

  • Correlation Analysis: We find that the contributor features do not follow a normal distribution; thus, we use Spearman rank correlation for the correlation analysis [zar2005spearman]. Two features with a correlation coefficient higher than 0.7 are considered highly correlated [8613795]. For each pair of highly correlated features, we keep one in the candidate list and remove the other. The result of the correlation analysis is shown in Figure 2; 11 features are removed and 20 features remain.

    Figure 2: Result of the Spearman correlation analysis for contributors in all selected projects.
  • Redundancy Analysis: R-squared is a measure of how much variance of a variable can be explained by other variables. We use an R-squared cut-off of 0.9 to identify redundant features [miles2005r]. The number of pull requests reviewed, the number of total commits, and the number of buggy commits are found to be redundant. As shown in Table II, 17 uncorrelated and non-redundant features remain. A sketch of both filters is shown below.
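The following sketch shows both filters, assuming the numeric contributor features sit in a pandas DataFrame loaded from a hypothetical dump file; the thresholds mirror the ones above, and the greedy drop order is our choice:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("contributor_features.csv")  # hypothetical feature dump

# Spearman filter: for each pair with |rho| > 0.7, keep one feature.
corr = df.corr(method="spearman").abs()
dropped = set()
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a not in dropped and b not in dropped and corr.loc[a, b] > 0.7:
            dropped.add(b)
kept = df.drop(columns=sorted(dropped))

# Redundancy filter: drop features the remaining ones explain with R^2 > 0.9.
for col in list(kept.columns):
    rest = kept.drop(columns=[col])
    r2 = LinearRegression().fit(rest, kept[col]).score(rest, kept[col])
    if r2 > 0.9:
        kept = kept.drop(columns=[col])
```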

TABLE II: Uncorrelated and non-redundant contributor features.
Category | Features
Experience | Duration
Product | Code Contribution, Code Contribution Rate, Commit Rate, Code Commit Rate, Other Commits
Process | Issue Solved, Total Pull Requests, PR Merged, PR Contribution Rate, PR Approval Ratio, PR Approval Density
Social Network | Followers, Collaborations
Geographical Information | Time Zone
Work Pattern | Work Time
Technical | Languages

4 Results

In this section, we present the motivations, approaches, and the preliminary results of our research questions.

4.1 RQ1: What are the types of contributors in OSS ML projects?

4.1.1 Motivation

In software projects, a correct understanding of the software developers can provide valuable insights for project management [ccetin2022analyzing]. Existing studies in software engineering have characterized software developers from different aspects, including their personalities [wiesche2014relationship], code contributions [da2014unveiling, geldenhuys2010finding], non-source-code contributions [10.1145/3415251], and the number of days staying in the project [fe8a3ca6d43d4253855607d9a2cad816, 6880395]. However, these studies were conducted on traditional OSS projects and depict a portrayal of traditional OSS contributors. Machine learning has emerged rapidly in recent years, and many software projects are shifting from traditional approaches to ML-based approaches [shafiq2020machine]. The understanding of ML contributors is still under exploration. By addressing this research question, we identify the profiles of contributors in popular ML projects. Knowing the composition of the ML contributor group can assist practitioners in managing human resources and help them adopt ML approaches in their projects.

4.1.2 Approach

To identify ML contributor profiles, we cluster the contributors according to their behaviors in different aspects and identify different groups of contributors that share some similarities. Figure 3 shows an overview of our approach for this research question. We describe the detailed approach below.

First, we select the features that are representative of different aspects of the contributors for clustering (i.e., experience, contribution, social impact, geographical information, working habit, and technical merit). For each category in Table II, we select one feature to represent the specific aspect of the contributor. As an exception, we select only one feature to represent both the Product and Process categories, since both relate to a contributor's contribution to the project. The selected clustering features are: duration (i.e., days in project), code contribution (LOC), number of followers, languages, time zone, and work time. The first four features contain numeric values and the last two contain categorical values.

Clustering algorithms calculate the similarity between data points based on their distance in each dimension (i.e., feature). If some features have values significantly larger than others, the clustering result will be dominated by those features. Therefore, we use min-max normalization to scale all numeric features to the same range of 0 to 1, allowing an equal impact for all features on the clustering while maintaining the distribution of the feature values.

Lastly, we use the K-Prototype clustering algorithm [huang1998extensions] to cluster the contributor features. K-Prototype is an extension of the K-Means clustering algorithm [hartigan1979algorithm] that is suitable for clustering mixed numeric and categorical data [huang1998extensions]. We apply the elbow method to find the optimal number of resulting clusters (i.e., the K value) [syakur2018integration]. The elbow method plots the sum of squared errors (SSE) of the resulting clusters for each K value. The optimal K value is at the elbow point of the graph, where SSE drops sharply before it and only slowly after it as K increases. We identify 3 as the optimal K value, with a corresponding SSE of approximately 1025.0.
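A minimal sketch of this clustering step with the kmodes package is shown below; the synthetic data, column layout (four numeric columns followed by the two categorical ones), and the 'Cao' initialization are illustrative assumptions:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

rng = np.random.default_rng(0)
num = rng.random((200, 4))               # min-max scaled numeric features
cat = rng.integers(0, 5, size=(200, 2))  # encoded categorical features
X = np.hstack([num, cat]).astype(object)

# Elbow method: record the clustering cost for each K and pick the elbow.
costs = {}
for k in range(2, 9):
    kp = KPrototypes(n_clusters=k, init="Cao", random_state=0)
    kp.fit_predict(X, categorical=[4, 5])
    costs[k] = kp.cost_

# Final clustering with the K chosen at the elbow (K = 3 in our study).
labels = KPrototypes(n_clusters=3, init="Cao", random_state=0).fit_predict(
    X, categorical=[4, 5]
)
```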

Figure 3: Overview of our approach for identifying contributor profiles.

4.1.3 Result

Based on the clustering result of contributors from 6 popular ML framework projects (i.e., Tensorflow, Pytorch, Keras, MXNet, Theano, and ONNX), three ML contributor profiles are identified, namely inactive contributors, intermediate contributors, and core contributors. A summary of the performance of each profile on the selected features is shown in Figure 4.

Figure 4(a) and Figure 4(b) present the identified contributor profiles for the two categorical features (i.e., time zone and work time). We find that ML contributors are mainly located in North America (i.e., UTC -10 to -2), accounting for 54%, 56%, and 50% of core, intermediate, and inactive contributors respectively. Compared to intermediate contributors, more inactive contributors (20%) are in Europe (i.e., UTC -1 to +3), and more core contributors (33%) are located in Asia (i.e., UTC +4 to +13). Figure 4(b) shows that core contributors (50.5%) mainly make contributions after 5 PM and inactive contributors (52%) mainly make contributions after 3 PM, while intermediate contributors tend to make relatively equal contributions throughout the day. We speculate that the intermediate group may contain more students, whose schedules are more flexible during the day, while the core and inactive groups may consist of a greater portion of employed contributors.

Figure 4(c) presents the logarithm of the duration (i.e., the number of days in the project) for each contributor profile. We find that core, intermediate, and inactive contributors stay the longest, a medium amount, and the shortest time in the project, respectively, with medians of 1686, 549, and 27 days. From Figure 4(d), we find that core contributors make the most code contributions and inactive contributors the least; the median lines of code contributed by core, intermediate, and inactive contributors are 995, 42, and 10, respectively. Figure 4(e) shows that core contributors tend to have more followers on their Github accounts, while there is no obvious difference between intermediate and inactive contributors. Figure 4(f) shows that the majority of intermediate and inactive contributors commit code in a single programming language or make non-source-code contributions, while core contributors tend to contribute code in multiple languages, with a median of 3.

We use the Scott-Knott test [scott1974cluster, jelihovschi2014scottknott] to examine whether the three ML contributor profiles are statistically different groups in terms of each feature. We find that core, intermediate, and inactive are statistically different groups in most cases, except for the time zone of core and inactive contributors and the number of followers of intermediate and inactive contributors. This indicates that core and inactive contributors have a similar geographical distribution, and that intermediate and inactive contributors have a similar social popularity. The Scott-Knott results also support our finding that core contributors have more followers, while there is no significant difference between intermediate and inactive contributors.

To summarize, we find that ML contributors are mainly located in North America. Core contributors stay longer in the project, make significantly more code contributions, have more followers, and contribute in more programming languages than inactive and intermediate contributors. Intermediate contributors perform better than inactive contributors in making code contributions, staying longer in the project, and using more programming languages.

Figure 4: Performance of the identified ML contributor profiles on each feature: (a) time zone (UTC); (b) work time (24h); (c) logarithm of days in project; (d) logarithm of code contribution (LOC); (e) logarithm of number of followers; (f) number of programming languages.

4.2 RQ2: How do responsibilities distribute among each type of contributor?

4.2.1 Motivation

Existing work has proposed various methods to categorize and evaluate contributors' expertise in many specific aspects of traditional software development. However, few studies have identified what expertise is valued in ML software development and what responsibilities are taken by each type of contributor. With this research question, we identify the types of commits made by the three types of ML contributors and their contributions to different ML components.

4.2.2 Approach

Figure 5: Overview of our RQ2 approach for studying ML contributor responsibilities.

Figure 5 shows an overview of our approach for identifying the responsibilities of each ML contributor profile in the project. We describe the detailed approach below.

To study how responsibilities are distributed among each type of contributor, we study the difference in the types of commits made by inactive, intermediate, and core contributors, and the difference in their contributions to different ML components. We first classify the commits in the collected projects into five categories (i.e., perfective, features, corrective, nonfunctional, and unknown commits) using the fastText commit classification algorithm [dos2020commit]; a training sketch follows the category list below. The definitions of the five commit categories are described below.

  • Perfective: Commits for making system improvements.

  • Features: Commits for introducing new features.

  • Corrective: Commits for fixing faults.

  • Nonfunctional: Non source code commits, such as updating documents or making changes to comments.

  • Unknown: Any commit that does not belong to the above four categories.
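As promised above, here is a hedged sketch of training and applying such a supervised fastText classifier; the training file name, label format, hyperparameters, and example messages are our assumptions rather than the exact setup of [dos2020commit]:

```python
import fasttext

# commits.train: one commit message per line, prefixed with its label, e.g.
#   __label__corrective fix crash when the tensor shape is empty
#   __label__features add gradient support for sparse operators
model = fasttext.train_supervised("commits.train", epoch=25, wordNgrams=2)

# Predict the category of an unseen commit message.
labels, probs = model.predict("update installation instructions in README")
print(labels[0], probs[0])  # e.g. ('__label__nonfunctional', 0.91)
```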

We map the labeled commits to the contributors who submitted them and observe the relationship between the commit types and the ML contributor profiles.

To investigate the relationship between ML components and ML contributor profiles, we identify the five most active directories (i.e., directories where commits are most frequently made) that are common to all collected projects, and then observe the corresponding commit contributions made by the three types of contributors.
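A sketch of how such active directories could be identified, counting each top-level directory once per commit that touches it (restricting to top-level paths is our simplification):

```python
from collections import Counter
from pydriller import Repository

activity = Counter()
for commit in Repository("path/to/project").traverse_commits():
    touched = {mf.new_path.split("/")[0]
               for mf in commit.modified_files if mf.new_path}
    activity.update(touched)  # one count per directory per commit

print(activity.most_common(5))  # candidate "most active" directories
```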

4.2.3 Result

Figure 6: ML contributor contributions to different types of commits: logarithm of the number of (a) perfective, (b) corrective, (c) feature, and (d) nonfunctional commits.
Figure 7: Number of commits made by different ML contributors to ML components in (a) Tensorflow, (b) Pytorch, (c) Keras, (d) MXNet, (e) Theano, and (f) ONNX (normalized by the KLOC in each directory and the number of contributors of each profile in the project).

Our analysis shows that core contributors take on more responsibility for all types of commit contributions, while there is no significant difference between inactive and intermediate contributors. Additionally, we find that contributors concentrate their effort on different ML components in different projects.

We study the contributions of core, intermediate, and inactive contributors in making different types of commits (i.e., perfective, corrective, feature-introducing, and nonfunctional commits). Figure 6 presents the logarithm of the number of each type of commit made by different contributors. For perfective, corrective, and feature-introducing commits (i.e., code commits), the inactive and intermediate groups exhibit similar behavior: the majority of them make few or no contributions. In contrast, fewer core contributors make no code contributions. For nonfunctional commits, however, all three types of contributors tend to make few contributions. We also conduct Scott-Knott tests to examine whether the three contributor profiles are statistically different groups in terms of making each type of commit. The results show that for all four types of commits, the number of commits made by the core group is significantly larger, while inactive and intermediate are clustered into the same group, indicating that the difference between these two groups is negligible.

We identify the five most active ML components (i.e., directories) common to our collected projects: operator, test, tensor, compile, and core. Figure 7 presents the amount of effort made by ML contributors on each ML component, where the effort value is calculated with Equation 1. We find that contributors in different projects tend to focus on different ML components.

Figure 7(a) shows that core and intermediate contributors in Tensorflow tend to write more code related to test and compile, while inactive contributors make little contribution to any ML component. Our results agree with the finding that Tensorflow runs models faster than Pytorch on CPUs [8891042]. A possible reason is that Tensorflow contributors place emphasis on compilation time and have spent great effort on building compilers.

From Figure 7(b), we observe that core contributors in Pytorch make the majority of contributions to all studied ML components, while the intermediate and inactive groups make little contribution. Core contributors devote great effort to writing code for ML operators, tensor, and core functions. This suggests a possible reason why Pytorch is considered more user-friendly and provides more predefined functions and schedulers than Tensorflow [novac2022analysis].

As shown in Figure 7(c), contributors in Keras focus only on core functions and devote little effort to other ML components. A possible reason is that Keras is an interface built on top of Tensorflow: when users build models with Keras, the actual ML implementations and computations happen in the Tensorflow framework. Therefore, Keras contributors do not necessarily need to pay much attention to developing ML operators or compilers.

Figure 7(d) shows that MXNet contributors contribute more to ML operators, testing, and tensor functions, while there are no active directories for compile and core. This could explain its weaker CPU performance compared to Tensorflow. However, likely owing to the great effort spent on developing ML operators and tensor functionality, MXNet supports more programming languages and provides greater portability, allowing models to run on a wide range of devices.

Figure 7(e) shows that Theano contributors are active in developing ML operators, tests, tensor functions, and especially compilers. This agrees with the fact that Theano offers faster computation than Tensorflow.

Figure 7(f) shows that ONNX has no active directories for ML operators, compile, or core functions; its contributors are more active in testing and in writing code to support tensor functionality. The reason could be that ONNX is developed to help users save models to bin files: it does not run ML models itself but is usually backed by frameworks such as Tensorflow and Pytorch.

effort = \frac{commits_{cp}}{\sum file_{kloc} \times contributor_{cp}}   (1)

where cp stands for contributor profile, commits_{cp} is the number of commits submitted by contributors of a profile to an ML project directory, contributor_{cp} is the number of contributors of that profile, and \sum file_{kloc} is the sum of KLOC of all files under the directory.
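As a worked example of Equation 1 (the numbers are invented):

```python
def effort(commits_cp: int, dir_kloc: float, contributors_cp: int) -> float:
    """Normalized effort of one contributor profile on one directory (Eq. 1)."""
    return commits_cp / (dir_kloc * contributors_cp)

# e.g., 120 commits by 15 core contributors to a directory totalling 85 KLOC:
print(effort(120, 85.0, 15))  # ~0.094 commits per KLOC per contributor
```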

4.3 RQ3: What issues does each type of contributor tend to encounter?

4.3.1 Motivation

In the software development process, contributors encounter various issues. Existing work has proposed methods to predict the issues likely to occur in traditional software and the contributors best suited to solve them. Existing studies on the challenges of ML contributors are mostly conducted through surveys, interviews, or analyzing posts on Q&A systems. However, there is still a lack of direct understanding of the technical issues encountered and problems spotted during the ML software development and maintenance process. With this research question, we want to provide a categorization of the issues reported in machine learning software projects and identify the relationships between different types of issues and different types of contributors. Understanding the issues occurring in ML projects and their impact on project contributors would facilitate the management and prioritization of the issues to address.

4.3.2 Approach

An overview of our approach for this research question is shown in Figure 8.

We first extract the textual content of the issue reports of the studied projects, including titles, descriptions, and labels. Then, we use Word2Vec to convert the extracted text to vectors. We use a hierarchical clustering algorithm to cluster the vectors based on cosine similarity. Hierarchical clustering does not require the number of clusters to be specified in advance; instead, the number of clusters is determined by a distance threshold beyond which two groups are split. We manually investigate each resulting cluster and identify the type of issues it represents. Lastly, we map the issues back to the contributors who raised them and compare the types of issues raised by inactive, intermediate, and core contributors.
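A hedged sketch of this pipeline with gensim and SciPy follows; the tokenized issues, vector size, mean-pooling scheme, and cut threshold are illustrative assumptions:

```python
import numpy as np
from gensim.models import Word2Vec
from scipy.cluster.hierarchy import fcluster, linkage

# Tokenized issue texts (title + description + labels); contents invented.
docs = [
    ["segfault", "when", "loading", "saved", "model"],
    ["typo", "in", "installation", "docs"],
    ["cuda", "out", "of", "memory", "during", "training"],
    ["crash", "on", "empty", "tensor", "shape"],
]

w2v = Word2Vec(docs, vector_size=100, min_count=1, seed=0)

# Represent each issue as the mean of its word vectors.
vecs = np.array([w2v.wv[doc].mean(axis=0) for doc in docs])

# Agglomerative clustering on cosine distance; no cluster count is needed
# up front -- the distance threshold t decides how many clusters emerge.
Z = linkage(vecs, method="average", metric="cosine")
clusters = fcluster(Z, t=0.5, criterion="distance")
```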

However, due to time limitations, we were unable to finish this experiment at the time of writing this section. We will present the results in an updated version of this paper.

Figure 8: An overview of our approach for identifying issues encountered by contributors.

5 Threats To Validity

In this section, we discuss possible threats to the validity of our study.

Threats to Internal Validity. In the first research question, we use K-Prototype clustering to identify different types of contributors. We identify k=3 as the optimal number of clusters using the elbow method and identify three major contributor profiles accordingly. We acknowledge that choosing a different number of clusters would change our results. To mitigate this threat, we cluster all contributors in our dataset at once instead of clustering per project. We execute the elbow method 10 times and obtain a stable optimal number of clusters (i.e., k=3).

Threats to External Validity. Our study is conducted on a small number of ML projects on Github. We also limit our study to ML library and framework projects, instead of studying all projects on Github that implement ML technology. To improve the generalizability of our results, we select ML projects of different sizes, created for different purposes by different organizations. Additionally, our selected projects are mainly written in Python and C++. We acknowledge that including ML frameworks developed in other programming languages such as Java or Matlab might reveal more meaningful findings regarding ML contributors. However, considering that Python is the most popular programming language for scientific computing and machine learning [raschka2020machine], the generalizability of our results to other ML projects remains adequate.

Threats to Construct Validity. In our work, we aim to identify the profiles and characteristics of ML contributors. Although our dataset contains only ML projects, there might still be traditional software developers with little ML knowledge among their contributors. Unfortunately, we are unable to consult all contributors about their expertise in developing ML software. To minimize this threat, we select projects that develop ML frameworks and libraries rather than projects that adopt ML techniques in different domains. We believe that projects developing ML tools have a greater portion of ML experts than projects adopting ML tools.

6 Conclusion

In this study, we identify three ML contributor profiles, namely inactive, intermediate, and core, through analyzing 6 popular ML framework and library projects (i.e., Tensorflow, Pytorch, Keras, MXNet, Theano, and ONNX). We further investigate the responsibilities taken by each type of ML contributor in the software development process and the technical issues they encounter. By analyzing the ML contributor features, we observe that the majority of ML contributors are located in North America, and that inactive and core contributors share a similar working pattern (i.e., mainly contributing after 3 PM and 5 PM, respectively). Compared to inactive and intermediate contributors, core contributors stay longer in the project and contribute in more programming languages. As we further study the commit contributions, we find that core contributors make significantly more code commits (i.e., perfective, corrective, and feature-introducing commits) and nonfunctional commits. Contributors in different projects focus on different ML-specific components. Unfortunately, we have not yet obtained the results for the technical issues encountered by ML contributors. We plan to include them in our paper as a next step.

Appendix A