DistStat.jl: Towards Unified Programming for High-Performance Statistical Computing Environments in Julia

Seyoon Ko Seoul National University Hua Zhou University of California, Los Angeles Jin Zhou University of Arizona Joong-Ho Won Seoul National University

Abstract

The demand for high-performance computing (HPC) is ever-increasing for everyday statistical computing purposes. The downside is that we need to write specialized code for each HPC environment. CPU-level parallelization needs to be explicitly coded for effective use of multiple nodes in cluster supercomputing environments. Acceleration via graphics processing units (GPUs) requires to write kernel code. The Julia software package DistStat.jl implements a data structure for distributed arrays that work on both multi-node CPU clusters and multi-GPU environments transparently. This package paves a way to developing high-performance statistical software in various HPC environments simultaneously. As a demonstration of the transparency and scalability of the package, we provide applications to large-scale nonnegative matrix factorization, multidimensional scaling, and $\ell_{1}$ -regularized Cox proportional hazards model on an 8-GPU workstation and a 720-CPU-core virtual cluster in Amazon Web Services (AWS) cloud. As a case in point, we analyze the on-set of type-2 diabetes from the UK Biobank with 400,000 subjects and 500,000 single nucleotide polymorphisms using the $\ell_{1}$ -regularized Cox proportional hazards model. Fitting a half-million-variate regression model took less than 50 minutes on AWS.

Keywords— Julia, high-performance computing, MPI, distributed computing, graphics processing units, cloud computing

1 Introduction

There is no doubt that statistical practice needs more and more computing power. In recent years, increase in computing power is usually achieved by using more computing cores or interconnecting more machines. Modern supercomputers are dominantly cluster computers that utilize multiple cores of central processing units (CPUs) over multiple machines connected via fast communication devices. Also, co-processors such as graphics processing units (GPUs) are now widely used for accelerating many computing tasks involving linear algebra and convolution operations. In addition, as the cloud computing technology matures, users can now access virtual clusters through cloud service providers such as Amazon Web Services, Google Compute Platform, and Microsoft Azure without need of purchasing or maintaining machines physically. With the demand for analysis of terabyte- or petabyte-scale data in diverse disciplines, the crucial factor for the success of large-scale data analysis is how well one can utilize high-performance computing (HPC) environments.

However, the statistical community appears yet to fully embrace the power of HPC. This is partly because of the difficulty of programming in multiple nodes and on GPUs in R, the most popular programming language among statisticians. Furthermore, statisticians often face the burden of writing separate codes for different HPC environments. While there are a number of software packages in high-level languages, including R, that simplify GPU programming and multi-node programming separately (reviewed in Section 2), those that enable simplification for both multi-GPU programming and multi-node programming with the unified code base are rare. In particular, there are few packages for multi-device distributed linear algebra with non-CPU environments. This leads to the necessity of a tool that merges programming and implement linear algebra operations for disparate HPC environments. High-performance implementation of easily parallelizable algorithms in statistical computing, for example, MM algorithms (Lange et al., , 2000; Hunter and Lange, , 2004; Lange, , 2016), proximal gradient methods (Beck and Teboulle, , 2009), and proximal distance algorithms (Keys et al., , 2019) will benefit from such a tool.

In this paper, we introduce DistStat.jl, which implements a distributed array data structure on both distributed CPU and GPU environments and also provides an easy-to-use interface to the structure in the programming language Julia (Bezanson et al., , 2017). A user can switch between the underlying array implementations for CPU cores or GPUs only with minor configuration changes. Furthermore, DistStat.jl can generalize to any environment on which the message passing interface (MPI) is supported. This package leverages on the built-in support for multiple dispatch and vectorization semantics of Julia, resulting in easy-to-use syntax for elementwise operations and distributed linear algebra. We also demonstrate the merits of DistStat.jl, in applications to highly-parallelizable statistical optimization problems. Concrete examples include nonnegative matrix factorization, multidimensional scaling, and $\ell_{1}$ -regularized Cox regression. These examples were considered in Ko et al., (2020) with their PyTorch (Paszke et al., , 2019) implementation, dist_stat, of the algorithms. We selected these in order to to highlight the merits of Julia in developing high-performance statistical computing software. The improved performance brings sharp contrast in the genome-wide survival analysis of the UK Biobank data with the full 500,000 genomic loci and 400,000 subjects in a cloud of twenty 144 GB-memory, 36-physical-core instances. This doubles the number of subjects analyzed in Ko et al., (2020).

The remainder of this paper is organized as follows. We review related software in Section 2. Then, basic usage and array semantics of Julia are covered in Section 3. In Section 4, we review software interface of the package DistStat.jl. Section 5 shows how DistStat.jl can be used with easily-parallelizable statistical optimization algorithms. We also compare the performance of DistStat.jl in Julia to that of dist_stat in Python. We analyze the real-world UK Biobank data with the Cox proportional hazards model in Section 6. We conclude the paper with summary and discussion in Section 7. The code is available at https://github.com/kose-y/DistStat.jl, released under the MIT License. The package can be installed using the instructions provided in Appendix A.

2 Related Software

2.1 Message-passing interface and distributed array interfaces

The de facto standard for inter-node communication in distributed computing environments is the message passing interface (MPI). The latter defines several ways to communicating between two processes (point-to-point communication) or a group of processes (collective communication). Although MPI is originally defined in C and Fortran, many other high-level languages offer wrappers to this interface: Rmpi (Yu, , 2002) for R, mpi4py (Dalcín et al., , 2008) for Python, and MPI.jl (JuliaParallel Contributors, , 2020) for Julia, etc.

Leveraging on distributed computing environments, there have been several attempts to incorporate array and linear algebra operations through the basic syntax of the base programming language. MATLAB has a distributed array implementation that uses MPI as a backend in the Parallel Computing Toolbox. In Julia, MPIArrays.jl (Janssens, , 2018) defines a matrix-vector multiplication routine that uses MPI as its backend. DistributedArrays.jl (JuliaParallel Contributors, , 2019) is a more general attempt to create a distributed array, allowing various communication modes, including Transmission Control Protocol/Internet Protocol (TCP/IP) and Secure Shell (SSH), and MPI. In R, a package called ddR (Ma et al., , 2016) supports distributed array operations.

2.2 Unified array interfaces for CPU and GPU

For GPU programming, CUDA C for Nvidia GPUs and OpenCL for generic GPUs are by far the most widely used. R package gputools (Buckner et al., , 2010) is one of the earliest efforts to incorporate GPU in R. PyCUDA and PyOpenCL (Klöckner et al., , 2012) for Python and CUDA.jl (previously CUDAnative.jl and CUDAdrv.jl) (Besard et al., , 2018) for Julia allow users to access low-level features of the respective interfaces.

An interface to array and linear algebra operations that works transparently on both CPU and GPU is highly desirable, as it reduces the programming burden tremendously. In MATLAB, the Parallel Computing Toolbox includes the data structure gpuArray. Simply wrapping an ordinary array with the function gpuArray() allows the users to use predefined functions to exploit the single instruction, multiple data (SIMD) parallelism of Nvidia GPUs. In Python, the recent deep learning (LeCun et al., , 2015) boom accelerated development of easy-to-use array interfaces and linear algebra operations along with automatic differentiation for both CPU and GPU. The most popular among them are TensorFlow (Abadi et al., , 2016) and PyTorch (Paszke et al., , 2019). The latter development is worth noting since most statistical computing tasks, not alone neural networks, can benefit from it. For example, the Distributions subpackage of TensorFlow (Dillon et al., , 2017) provides a convenient programming environment for Bayesian computing on GPUs, such as stochastic gradient Monte Carlo Markov chain (Baker et al., , 2018). In Julia, CUDA.jl (previously CuArrays.jl)(Besard et al., , 2019; JuliaGPU Contributors, , 2020) defines many array operations and simple linear algebra routines using the same syntax as the base CPU arrays.

2.3 Previous work

To lessen the burden of programming for different HPC environments, it is desirable to unify these approaches and have a common syntax for within-node CPU/GPU computation and inter-node/inter-GPU communication. In this regard, the authors have previously developed a package dist_stat (Ko et al., , 2020) that implements distributed array and linear algebra operations using PyTorch. Through the examples of nonnegative matrix factorization, multidimensional scaling, and $\ell_{1}$ -regularized Cox regression, they have demonstrated the scalability of their approach on both multiple GPUs and multiple CPU nodes on a cloud. This package provides a transparent and easy-to-use interface for distributed arrays on many computing environments, e.g., a single CPU-only node, a single node with a single GPU, a single node with multiple GPUs, multiple nodes with only CPUs, and multiple nodes with multiple GPUs. However, there are a couple of limitations in dist_stat, mainly coming from those of PyTorch. First, all the processes must have data of an equal size, hence the number of processes to be created must divide the total size of the data. This limite the tunability of the computation jobs. Second, it is very difficult to write effective device-specific code for further acceleration without writing code in C or other packages that support just-in-time (JIT) compilation. Hence the authors decided to migrate to Julia in order to avoid these issues and leverage higher flexibility.

3 Julia Basics

Julia is a high-level programming language that has a flavor of interpreter languages such as R and Python, but compiles for efficient execution via LLVM (Lattner and Adve, , 2004). Its syntax is similar to those of MATLAB and R, leading to easy-to-read code that can run on various hardware with only minor changes, including CPUs and GPUs. In this section, we review the basic syntax of Julia. Our description is based on Julia version 1.4. For more details, see the official documentation (Julia Contributors, , 2020).

3.1 Methods and multiple dispatch

In Julia, a function is “an object that maps a tuple of argument values to a return value”. A function can have many different specific implementations, distinguished by the types of input arguments. Each specific implementation is called a method, and a method of a function is selected by the mechanism called multiple dispatch. Many core functions in Julia possess several methods attached to each of them. A user can also define additional methods to existing functions. For example, a method for function f can be defined as follows:

julia> f(x, y) = "foo"
f (generic function with 1 method)

Each argument can be constrained to certain type, for example:

julia> f(x::Float64, y::Float64) = x * y
f (generic function with 2 methods)

julia> f(x::String, y::String) = x * y
f (generic function with 3 methods)

Float64 is the data type for a double-precision (64-bit) floating point number. An asterisk (*) between two String objects means concatenating them. At runtime, the most specific method is used for the given combination of input arguments.

julia> f("Candy", 3.0)
"foo"

julia> f("test", "me")
"testme"

julia> f(2.0, 3.0)
6.0

Methods and types may have parameters, enclosed by a pair of curly braces ({}). A parametric method is defined as follows:

julia> g(x::T, y::T) where {T <: Real} = x * y
g (generic function with 1 method)

The function g() performs multiplication of the two arguments if the two arguments are the same subtype of Real (a type for real numbers, for example, Float64, Int32 (32-bit integer), etc.) and both are of the same type.

julia> g(2.0, 3.0)
6.0

julia> g(2, 3)
6

julia> g(2.0, 3)
ERROR: MethodError: no method matching g(::Float64, ::Int64)
Closest candidates are:
  g(::T<:Real, ::T<:Real) where T<:Real at REPL[17]:1
Stacktrace:
 [1] top-level scope at REPL[28]:1

The third command throws an error, because the two arguments have different types. Such an error can be avoided by defining a more general method:

julia> g(x::Real, y::Real) = x * y
g (generic function with 2 methods)

Here, the exact type of x and y may be different. An example of parametric types, AbstractArray, is discussed in Section 3.2.

3.2 Multidimensional arrays

An array in Julia is defined as “a collection of objects stored in a multi-dimensional grid”. Each object should be of a specific type for optimized performance, such as Float64, Int32, or String.

The top-level abstract type for a multidimensional array is AbstractArray{T,N}, where parameter T is the type of element (such as Float64, Int32), and N is the number of dimensions. AbstractVector{T} and AbstractMatrix{T} are aliases for AbstractArray{T, 1} and AbstractArray{T, 2}, respectively. Operations for AbstractArrays are provided as fallback methods which would generally work correctly in many cases, but are often slow.

The type DenseArray is a subtype of AbstractArray representing an array stored in contiguous CPU memory. The most frequently used instance of DenseArray is Array, a type for basic CPU array with a grid structure. Vector{T} and Matrix{T} are aliases for Array{T, 1} and Array{T, 2}, respectively. Another subtype of DenseArray is CuArray, defined in CUDA.jl, a contiguous array data type on a CUDA GPU. Many of array operations for CuArray are provided using the same syntax as Arrays.

A Matrix (or an instance of Array{T, 2}) is easily created using a MATLAB-like syntax such as:

julia> A = [1 2; 3 4]
22 Array{Int64,2}:
 1  2
 3  4

An Array can be allocated with undefined values using:

julia> B = Array{Float64}(undef, 2, 3)
23 Array{Float64,2}:
 6.90922e-310  6.90922e-310  6.90922e-310
 6.90922e-310  6.90922e-310  6.90921e-310

There are predefined basic functions for array operations, such as size(A) that returns a tuple of dimensions of A, eltype(A) that returns the type of elements in A, and ndims(A), that shows the number of dimensions of A.

3.3 Matrix multiplication

Linear algebra operations in Julia are defined in the basic package LinearAlgebra. The functions in LinearAlgebra can be loaded to the workspace with the keyword using:

julia> using LinearAlgebra

Matrix multiplication in Julia is defined in the function LinearAlgebra.mul!(C, A, B)¹¹1It is a convention in Julia to end the name of a function that changes the value of its arguments with an exclamation mark (!).. This function computes the post-multiplication of matrix B to matrix A, and stores the result in matrix C. The most general definition of LinearAlgebra.mul!() is:

LinearAlgebra.mul!(C::AbstractMatrix, A::AbstractVecOrMat,
                    B::AbstractVecOrMat)

which implements a naive algorithm for matrix multiplication. For a Matrix stored in the CPU memory, the call to LinearAlgebra.mul!() with arguments

LinearAlgebra.mul!(C::Matrix, A::Matrix, B::Matrix)

invokes the gemm (general matrix multiplication) routine of the BLAS, or the basic linear algebra subprograms (Blackford et al., , 2002), e.g.,

julia> A= [1. 2.; 3. 4.]
22 Array{Float64,2}:
 1.0  2.0
 3.0  4.0

julia> B = [3. 4.; 5. 6.]
22 Array{Float64,2}:
 3.0  4.0
 5.0  6.0

julia> C = Array{Float64, 2}(undef, 2, 2)
22 Array{Float64,2}:
 6.90922e-310  6.90921e-310
 6.90922e-310  6.90922e-310

julia> mul!(C, A, B)
22 Array{Float64,2}:
 13.0  16.0
 29.0  36.0

julia> C
22 Array{Float64,2}:
 13.0  16.0
 29.0  36.0

On the other hand, for matrices on GPU, a call to the same function LinearAlgebra.mul!() with arguments

LinearAlgebra.mul!(C::CuMatrix, A::CuMatrix, B::CuMatrix)

results in operations using cuBLAS (NVIDIA, , 2013), a CUDA version of BLAS:

julia> using CUDA

julia> A_d = cu(A)
22 CuArray{Float32,2,Nothing}:
 1.0  2.0
 3.0  4.0

julia> B_d = cu(B)
22 CuArray{Float32,2,Nothing}:
 3.0  4.0
 5.0  6.0

julia> C_d = cu(C)
22 CuArray{Float32,2,Nothing}:
 0.0  0.0
 0.0  0.0

julia> mul!(C_d, A_d, B_d)
22 CuArray{Float32,2,Nothing}:
 13.0  16.0
 29.0  36.0

The function cu() transforms an Array{T, N} into a CuArray{Float32, N}, transferring data from CPU to GPU.

3.4 Dot syntax for vectorization

Julia has a special “dot” syntax for vectorization. The dot syntax is invoked by prepending a dot to an operator (e.g., .+) or appending a dot to a function name (e.g., soft_threshold.()). Unlike many other programming languages (e.g., R), vectorization in Julia can be applied to any function without a need to deliberately tailor the corresponding method. Julia’s JIT compiler automatically matches singleton dimensions of array arguments to the dimensions of other array arguments. For example,

julia> a = [1, 2]
2-element Array{Int64,1}:
 1
 2

julia> b = [3 4; 5 6]
22 Array{Int64,2}:
 3  4
 5  6

julia> a .+ b
22 Array{Int64,2}:
 4  5
 7  8

Note that a is a column vector and b is a matrix.

The dot syntax can be extended by defining the method broadcast() for each array interface, allowing its generalization to any underlying hardware architecture. In addition, multiple dots on the same line of code fuse into a single call to broadcast(), translated into a single vectorized loop (for CPU) or a single generated kernel (for GPU) for that line.

While broadcasting is one of the simplest way to specifying elementwise operations, it is often not the fastest option. Broadcasting often allocates excessive memory, thus well-optimized compiled loops without memory allocation may be faster in many cases.

4 Software interface

In this section, we review the features of DistStat.jl. DistStat.jl implements a distributed MPI-based array data structure MPIArray as a subtype of AbstractArray, using MPI.jl as a backend. It has been tested for basic Arrays and CuArrays. The standard vectorized “dot” operations can be used for convenient element-by-element operations on MPIArrays as well as broadcasting operations. Furthermore, key distributed matrix operations, such as the matrix-matrix multiplication, for MPIMatrix, or two-dimensional MPIArrays, are also implemented. Reduction and accumulation operations are supported for MPIArrays of any dimensions. The package can be loaded by:

using DistStat

If GPUs are available, one that is to be used is automatically selected in a round-robin fashion upon loading the package. The rank, or the “ID” of a process, and the size, or the total number of the processes, can be accessed by:

DistStat.Rank()
DistStat.Size()

Ranks are indexed 0-based, following the MPI standard. (Note Julia’s array indexes are 1-based.)

4.1 Data Structure for Distributed MPI Array

In DistStat.jl, a distributed array data type MPIArray{T,N,AT} is defined. Here, parameter T is the type of the elements of the array, e.g., Float64 or Float32. Parameter N is the dimension of the array, 1 for vector and 2 for matrix, etc. Parameter AT (for “array type”) is the implementation of AbstractArray used for base operations: Array for the basic CPU array, and CuArray for the arrays on Nvidia GPUs (which require CUDA.jl). If there are multiple CUDA devices, a device is assigned to a process automatically by the rank of the process modulo the size. This assignment scheme extends to the setting in which there are multiple GPU devices in multiple CPU nodes. The type MPIArray{T,N,AT} is a subtype of AbstractArray{T, N}. In MPIArray{T,N,AT}, each rank holds a contiguous block of the full data in AT{T,N} split and distributed along the N-th or the last dimension of an MPIArray.

In the special case of two-dimensional arrays, aliased MPIMatrix{T,AT} and created using the function transpose(), the data is column-major ordered and column-split. The transpose of this matrix has type of Transpose{T,MPIMatrix{T,AT}}, which is a lazy wrapper representing row-major ordered and row-split matrix. There also is an alias for one-dimensional array MPIArray{T,1,AT}, which is MPIVector{T,AT}.

4.1.1 Creation

The syntax MPIArray{T,N,AT}(undef, m, ...) creates an uninitialized MPIArray. For example,

a = MPIArray{Float64, 2, Array}(undef, 3, 4)

creates an uninitialized 3 $\times$ 4 distributed array based on local Arrays of double precision floating-point numbers. The size of this array, the type of each element, and number of dimensions can be accessed using the usual functions in Julia: size(a), eltype(a), and ndims(a). Local data held by each process can be accessed by appending .localarray to the name of the array, e.g.,

a.localarray

Matrices are split and distributed as evenly as possible. For example, if the number of processes is 4 and the size(a) == (3, 7), processes of ranks 0 through 2 hold the local data of size (3, 2) and the rank-3 process holds the local data of size (3, 1).

An MPIArray can also be created by distributing an array residing in a single process. For example, in the following code:

if DistStat.Rank() == 0
    dat = [1, 2, 3, 4]
else
    dat = Array{Int64}(undef, 0)
end
d = distribute(dat)

the data is defined in the rank-0 process, and each other process has an empty instance of Array{Int64}. Using the function distribute, the MPIArray{Int64, 1, Array} of the data [1, 2, 3, 4], equally distributed over four processes, is created.

4.1.2 Filling an array

An MPIArray a can be filled with a number x using the usual syntax of the function fill!(a, x). For example, a can be filled with zero:

fill!(a, 0)

4.1.3 Random number generation

An array can also be filled with random values, extending Random.rand!() for the standard uniform distribution and Random.randn!() for the standard normal distribution. The following code fills a with uniform(0, 1) random numbers:

using Random
rand!(a)

In cases such as unit testing, generating identical data for any configuration is important. For this purpose, the following interface is defined:

function rand!(a::MPIArray{T,N,A}; seed=nothing, common_init=false,
    root=0) where {T,N,A}

If the keyword argument common_init is set true, the data are generated from the process with rank root. If common_init == false, each process independently fills its local array with random numbers. A seed can also be passed. If common_init == false and seed == k, the seed for each process is set to k plus the rank.

4.2 “Dot” syntax and vectorization

The “dot” broadcasting feature of DistStat.jl follows the standard Julia syntax. This syntax provides a convenient way to operate on both multi-node clusters and multi-GPU workstations with the same code. For example, the soft-thresholding operator, which appears commonly in sparse regression, can be defined in the element level:

function soft_threshold(x::T, lambda::T)::T where T <: AbstractFloat
    x > lambda && return (x - lambda)
    x < -lambda && return (x + lambda)
    return zero(T)
end

This function can be applied to each element of an MPIArray using the dot broadcasting, as follows. When the dot operation is used for an MPIArray{T,N,AT}, it is naturally passed to inner array implementation AT. Consider the following arrays filled with random numbers from the standard normal distribution:

a = MPIArray{Float64, 2, Array}(undef, 2, 4) |> randn!
b = MPIArray{Float64, 2, Array}(undef, 2, 4) |> randn!

The function soft_threshold() is applied elementwisely as the following:

a .= soft_threshold.(a .+ 2 .* b, 0.5)

The three dot operations, .=, .+, and .*, are fused into a single loop (in CPU) or a single kernel (in GPU) internally.

A singleton non-last dimension is treated as if the array is repeated along that dimension, just like Array operations. For example,

c = MPIArray{Float64, 2, Array}(undef, 1, 4) |> rand!
a .= soft_threshold.(a .+ 2 .* c, 0.5)

works as if c were a $2\times 4$ array, with its content repeated twice. The terminal dimension is a little bit subtle, as an MPIArray{T,N,AT} is distributed along that dimension. Dot broadcasting along this direction works if the broadcast array has the type AT and holds the same data across the processes. For example,

d = Array{Float64}(undef, 2, 1); fill!(d, -0.1)
a .= soft_threshold.(a .+ 2 .* d, 0.5)

As with any dot operations in Julia, the dot operations for DistStat.jl are convenient but usually not the fastest option. Its implementations can be further optimized by specializing in specific array types. An example of this is given in Section 5.3.

4.3 Reduction operations and accumulation operations

Reduction operations, such as sum(), prod(), maximum(), minimum(), and accumulation operations, such as cumsum(), cumsum!(), cumprod(), cumprod!(), are implemented just like their base counterparts. Example usages of sum() and sum!() are:

sum(a)
sum(abs2, a)
sum(a, dims=1)
sum(a, dims=2)
sum(a, dims=(1,2))
sum!(c, a)
sum!(d, a)

The first line computes the elementwise sum of a. The second line computes the sum of squared absolute values (abs2() is the function that computes the squared absolute values). The third and fourth lines compute the column sums and row sums, respectively. Similar to the dot operations, the third line reduces along the distributed dimensions, and returns a broadcast local Array. The fifth line returns the sum of all elements, but the data type is a $1\times 1$ MPIArray. The syntax sum!(p, q) selects which dimension to reduce based on the shape of p, the first argument. The sixth line computes the columnwise sum and saves it to c, because c is a $1\times 4$ MPIArray. The seventh line computes rowwise sum, because d is a $2\times 1$ local Array.

Given below are examples for cumsum() and cumsum!():

cumsum(a; dims=1)
cumsum(a; dims=2)
cumsum!(b, a; dims=1)
cumsum!(b, a; dims=2)

The first line computes the columnwise cumulative sum, and the second line computes the rowwise cumulative sum. So do the third and fourth lines, but save the results in b, which has the same size as a.

4.4 Distributed linear algebra

4.4.1 Dot product

The method LinearAlgebra.dot() for MPIArrays is defined just like the base LinearAlgebra.dot(), which sums all the elements after an elementwise multiplication of the two argument arrays:

using LinearAlgebra
dot(a, b)

4.4.2 Operations on the diagonal

The “getter” method for the diagonal, diag!(d, a), and the “setter” method for the diagonal, fill_diag!(), are also available. The former obtains the main diagonal of the MPIMatrix a and is stored in d. If d is an MPIMatrix with a single row, the result is obtained in a distributed form. On the other hand, if d is a local AbstractArray, all elements of the main diagonal is copied to all processes as a broadcast AbstractArray:

M = MPIMatrix{Float64, Array}(undef, 4, 4) |> rand!
v_dist = MPIMatrix{Float64, Array}(undef, 1, 4)
v = Array{Float64}(undef, 4)
diag!(v_dist, M)
diag!(v, M)

4.4.3 Matrix multiplication

Function LinearAlgebra.mul!(C, A, B) left-multiplies matrix A ( $p\times q$ ) to another matrix B ( $q\times r$ ) to store the product to matrix C ( $p\times r$ ). The present implementation of mul!() for MPIMatrixes improves the six distributed matrix multiplication scenarios for multiplying tall-and-skinny and/or wide-and-short matrices implemented using Python and MPI in Ko et al., (2020). Each input matrix can be either broadcast (represented by the type AbstractMatrix), column-split-and-distributed (MPIMatrix), or row-split-and-distributed (transposed MPIMatrix), resulting in nine possible combinations. Since MPI communication is not needed when both A and B are broadcast, there remain eight possibilities. While the configuration of the output C is mostly determined by the input combination, there are cases in which multiple configurations are possible. Table 1 collects the scenarios considered. If A is an MPIMatrix (column-distributed) and B has the same configuration or is broadcast, then the output C is a MPIMatrix to preserve the row distribution of A (Scenarios a and e). However, if B is a transposed MPIMatrix (row-distributed), then C can be any of AbstractMatrix, MPIMatrix, or the transposed MPIMatrix (Scenarios b, c, d). Likewise, if A is a transposed MPIMatrix and B has the same configuration or is broadcast, then the output C is a transposed MPIMatrix preserving the row distribution of A (Scenarios h and i), while if B is a MPIMatrix, then C can be either an MPIMatrix or a transposed MPIMatrix (Scenarios f and g). When A is broadcast but B is distributed, the splitting of the inner dimension $q$ does not affect the that of the output (Scenarios j and k). Thus, there are total 11 scenarios of matrix-matrix multiplication. Additional six scenarios are dedicated for matrix-vector multiplication, where A is a MPIMatrix or its transpose, and B is either MPIColVector{T, AT} defined as Union{MPIVector{T,AT}, Transpose{T, MPIMatrix{T,AT}}} to allow distribution of the rows, or broadcast AbstractVector. Scenarios b and c for matrix-matrix multiplication merge into Scenario l; Scenarios d and e translate into Scenarios m and n, respectively. The rest are similarly translated.

The internal implementation uses collective communication calls including MPI.Allreduce!() and MPI.Allgather!() from MPI.jl for communication. For example, when A is an MPIMatrix ( $p\times q$ , column-split), B is a transposed MPIMatrix ( $q\times r$ , row-split), and C is a broadcast AbstractMatrix ( $p\times r$ ), each process computes the local $p\times r$ multiplication first. Then, the results are reduced using MPI.Allreduce!() to compute the sum of local results.

In addition to the base syntax mul!(C, A, B), an extra keyword argument for a temporary memory can be optionally provided, e.g, mul!(C, A, B; tmp=Array(undef, 3, 4)), in order to to save intermediate results. This is useful for avoiding repetitive allocations in iterative algorithms. The user should determine which shape of C minimizes communication and suits better for their application.

	A ( $p\times q$ )	B ( $q\times r$ )	C ( $p\times r$ )	tmp (if defined)	dimension of tmp
a	MPIMatrix	MPIMatrix	MPIMatrix	AbstractMatrix	$p\times q$
b	MPIMatrix	Transposed MPIMatrix	MPIMatrix	AbstractMatrix	$p\times r$
c	MPIMatrix	Transposed MPIMatrix	Transposed MPIMatrix	AbstractMatrix	$r\times p$
d	MPIMatrix	Transposed MPIMatrix	AbstractMatrix
e	MPIMatrix	AbstractMatrix	AbstractMatrix
f	Transpose MPIMatrix	MPIMatrix	MPIMatrix	AbstractMatrix	$q\times p$
g	Transpose MPIMatrix	MPIMatrix	Transposed MPIMatrix	AbstractMatrix	$q\times r$
h	Transpose MPIMatrix	Transposed MPIMatrix	Transposed MPIMatrix	AbstractMatrix	$r\times q$
i	Transpose MPIMatrix	AbstractMatrix	Transposed MPIMatrix
j	AbstractMatrix	MPIMatrix	MPIMatrix
k	AbstractMatrix	Transposed MPIMatrix	AbstractMatrix
l	MPIMatrix	MPIColVector	MPIColVector	AbstractVector	$p$
m	MPIMatrix	MPIColVector	AbstractVector
n	MPIMatrix	AbstractVector	AbstractVector
o	Transposed MPIMatrix	MPIColVector	MPIColVector	AbstractVector	$q$
p	Transposed MPIMatrix	AbstractVector	MPIColVector
q	Transposed MPIMatrix	AbstractVector	AbstractVector

Table 1: List of implementations for LinearAlgebra.mul!(C, A, B) with an optional keyword argument tmp for temporary storage.

4.4.4 Operator norms

The method opnorm() either evaluates ( $\ell_{1}$ and $\ell_{\infty}$ ) or approximates ( $\ell_{2}$ ) matrix operator norms, defined for a matrix $A\in\mathbb{R}^{m\times n}$ as $\|A\|=\sup\{\|Ax\|:x\in\mathbb{R}^{n}\text{ with }\|x\|=1\}$ for each respective vector norm.

opnorm(a, 1)
opnorm(a, 2)
opnorm(a, Inf)

The $\ell_{2}$ -norm is estimated via the power iteration (Golub and Van Loan, , 2013), and can be further configured for convergence criterion and number of iterations. There also is an implementation based on the inequality $\|A\|_{2}\leq\|A\|_{1}\|A\|_{\infty}$ (method="quick"), which overestimates the $\ell_{2}$ -norm.

opnorm(a, 2; method="power", tol=1e-6, maxiter=1000, seed=95376)
opnorm(a, 2; method="quick")

5 Applications

In this section, we consider several statistical applications implemented with DistStat.jl, namely the nonnegative matrix factorization (NMF), multidimensional scaling (MDS), and $\ell_{1}$ -regularized Cox proportional hazards regression. All of these applications require efficient iterative algorithms to scale up to high dimensions. For each mathematical description of the algorithms, we provide a code snippet for the implementation. The snippets are simplified for ease of exposition, and the full implementation is given in the examples/ directory in the code submitted. These code files should be run in such a way to incorporate MPI: by using mpirun or through the job scheduler. Also, we demonstrate the scalability of DistStat.jl over multiple GPUs on a local workstation and a virtual cluster on Amazon Web Services Elastic Compute Cloud (AWS EC2) with these examples and compare the timing with dist_stat, our PyTorch implementation of distributed arrays (Ko et al., , 2020). Julia 1.2.0 was used for DistStat.jl, and PyTorch 1.0 compiled from source installed on Python 3.6 was used for dist_stat. To compare the execution time, we fixed the number of iterations. The system configurations are summarized in Table 2. These are deliberated set the same as Ko et al., (2020) for fair comparison. The virtual AWS EC2 cluster was managed by CfnCluster (Amazon Web Services, , 2019).

local node

AWS c5.18xlarge

CPU

GPU

CPU

Model

Intel Xeon E5-2680 v2

Nvidia GTX 1080

Intel Xeon

Platinum 8124M

# of cores

2560

Clock

2.8 GHz

1.6 GHz

3.0GHz

# of entities

2 (per instance)

\times

1-20 (instances)

Total memory

256 GB

64 GB

144 GB

\times

4–20

Total cores

20,480 (CUDA)

\times

4–20

Table 2: Hardware configuration of experiments. This table is equivalent to Table 2 of Ko et al., (2020)

All the experiments using AWS EC2 instances used double-precision floating-point numbers. All the GPU experiments were conducted with single-precision floating-point numbers unless otherwise noted. It is previously reported that the single-precision results are equivalent to the double-precision results up to six significant digits (Ko et al., , 2020).

In general, multi-GPU implementation results of DistStat.jl are largely comparable to that of dist_stat. In large-scale AWS EC2 experiments, DistStat.jl achieved faster computation thanks to the increased flexibility of process configuration: when the communication is heavy, we can use a configuration with less jobs, each job using more threads; when communication is of little problem, we can use a configuration with more jobs, each job using a single thread. This is nearly impossible with dist_stat: due to the limitation of torch.distributed subpackage of PyTorch, each process has to hold the same size of data. In addition, its MPI wrappers force copying of data before and after the data communication, while MPI.jl on the backend of DistStat.jl does not.

5.1 Nonnegative matrix factorization

Given a matrix $X\in\mathbb{R}^{m\times n}$ that consists of nonnegative values, NMF approximates $X$ by a product of two rank- $r$ ( $r\ll\min\{m,n\}$ ) nonnegative matrices $V\in\mathbb{R}^{m\times r}$ and $W\in\mathbb{R}^{r\times n}$ . This method has been applied in diverse disciplines, such as bioinformatics, recommender systems, astronomy, and image processing (Wang and Zhang, , 2013). In achieving this goal, consider minimizing the cost function

\displaystyle f(V,W)

\displaystyle=\|X-VW\|_{\mathrm{F}}^{2},

where $\|M\|_{\mathrm{F}}$ is the Frobenius norm of matrix $M$ . A famous multiplicative algorithm to compute a local minimum of this cost function is due to Lee and Seung, (1999, 2001):

	$\displaystyle V_{(n+1)}$	$\displaystyle=V_{(n)}\odot XW_{(n)}^{\top}\oslash V_{(n)}W_{(n)}W_{(n)}^{\top}$
	$\displaystyle W_{(n+1)}$	$\displaystyle=W_{(n)}\odot V_{(n+1)}^{\top}X\oslash V_{(n+1)}^{\top}V_{(n+1)}^{\top}W_{(n)},$

where $\odot$ and $\oslash$ are elementwise multiplication and division, respectively. An alternative is the alternating projected gradient (APG) algorithm (Lin, , 2007):

	$\displaystyle V_{(n+1)}$	$\displaystyle=\max\{0,V_{(n)}-\sigma_{n}(V_{(n)}W_{(n)}W_{(n)}^{\top}-XW_{(n)}^{\top})\}$
	$\displaystyle W_{(n+1)}$	$\displaystyle=\max\{0,W_{(n)}-\tau_{n}(V_{(n+1)}^{\top}V_{(n+1)}W_{(n)}-V_{(n+1)}^{\top}X)\},$

where the maximum is applied in an elementwise fashion. The $\sigma_{n}$ and $\tau_{n}$ are the step sizes, for which we can choose $\sigma_{n}=1/2\|W_{(n)}W_{(n)}^{\top}\|_{\mathrm{F}}^{2}$ and $\tau_{n}=1/2\|V_{(n+1)}^{\top}V_{(n+1)}\|_{\mathrm{F}}^{2}$ . Such selection is known to promote the algorithm to converge in fewer iterations than the multiplicative algorithm (Ko et al., , 2020).

An implementation of the multiplicative algorithm of NMF in DistStat.jl is

using DistStat, Random, LinearAlgebra
m = 10000; n = 10000; r = 20
T = Float64; A = Array
X  = MPIMatrix{T, A}(undef, m, n); rand!(X)
Vt = MPIMatrix{T, A}(undef, r, m); rand!(Vt)
W  = MPIMatrix{T, A}(undef, r, n); rand!(W)
WXt = MPIMatrix{T, A}(undef, r, m)
WWt = A{T}(undef, r, r)
WWtVt = MPIMatrix{T, A}(undef, r, m)
VtX = MPIMatrix{T, A}(undef, r, n)
VtV = A{T}(undef, r, r)
VtVW = MPIMatrix{T, A}(undef, r, n)
eps = 1e-10
for i in 1:iter
    mul!(WXt, W, transpose(X))
    mul!(WWt, W, transpose(W))
    mul!(WWtVt, WWt, Vt)
    Vt .= Vt .* WXt ./ (WWtVt .+ eps)
    mul!(VtX, Vt, X)
    mul!(VtV, Vt, transpose(Vt))
    mul!(VtVW, VtV, W)
    W .= W .* VtX ./ (VtVW .+ eps)
end

A main loop for the APG algorithm is given by:

for i in 1:iter
    mul!(WXt, W, transpose(X))
    mul!(WWt, W, transpose(W))
    mul!(WWtVt, WWt, Vt)
    sigma = 1.0 / (2 * (sum(WWt.^2)) + eps)
    Vt .= max.(0.0, Vt .- sigma .* (WWtVt .- WXt))
    mul!(VtX, Vt, X)
    mul!(VtV, Vt, transpose(Vt))
    mul!(VtVW, VtV, v.W)
    tau = 1.0 / (2 * (sum(VtV.^2)) + eps)
    W .= max.(0.0, W .- tau .* (VtVW .- VtX))
end

This code runs on a single CPU node, and can be modified to run on a GPU workstation (multiple GPUs are allowed) by importing the package CUDA.jl and changing A = Array to A = CuArray on the third line. This selection is made through a command-line argument for the full implementation (in the examples/ directory), where multi-node clusters are supported. Matrix $W$ (W) and the transpose of $V$ (Vt) are updated for each iteration. The intermediate results, WXt ( $WX^{\top}$ ), WWt ( $WW^{\top}$ ), WWtVt ( $WW^{\top}V^{\top}$ ), VtX ( $V^{\top}X$ ), VtV ( $V^{\top}V$ ), and VtVW ( $V^{\top}VW$ ) are preallocated. A very small number (eps) is added to the denominator for numerical stability.

Thanks to the transparent implementation of distributed arrays and mul!() in Section 4, the main loop is not different from what one would write with native Julia. The same code with slight changes in the first three lines run on various HPC environments including CPU clusters and multi-GPU workstations in a distributed fashion. In our numerical experiments, further optimization for memory efficiency was conducted.

Table 3 compares the performance of the two NMF algorithms on the multi-GPU setting in Table 2 with $10,000\times 10,000$ data for 10,000 iterations. It can be seen that the performance of the two algorithms is comparable, with APG being slightly slower with fixed number of iterations. This is because APG involves slightly more operations. With more than 4 GPUs, the communication burden outweighs the speed-up from using more GPU cores, and the algorithm slows down. The execution time between the dist_stat and DistStat.jl implementations are also largely comparable, with DistStat.jl version being faster in $r=20$ cases. Experiments with 3, 6, or 7 GPUs were not possible with dist_stat, because the size of the data was not divisible by 3, 6, and 7.

	GPUs	dist_stat			DistStat.jl
	GPUs	$r=20$	$r=40$	$r=60$	$r=20$	$r=40$	$r=60$
Multiplicative	1	62	71	75	62	72	83
	2	43	55	63	42	60	72
	3	–	–	–	38	57	71
	4	37	51	63	34	54	68
	5	39	54	66	38	56	75
	6	–	–	–	36	56	80
	7	–	–	–	37	58	81
	8	40	60	75	37	59	83
APG	1	68	76	82	61	80	85
	2	49	61	69	43	60	79
	3	–	–	–	38	59	74
	4	44	58	70	36	54	72
	5	46	60	73	37	59	78
	6	–	–	–	37	56	75
	7	–	–	–	38	61	88
	8	47	68	83	39	59	82

Table 3: Runtime (in seconds) of NMF algorithms on

10,000\times 10,000

simulated data on GPUs. Experiments could not be run with 3, 6, and 7 GPUs on dist_stat, because 3, 6, and 7 did not divide the data size.

Table 4 compares the algorithms and implementations using $200,000\times 200,000$ data on multiple AWS EC2 instances for 1000 iterations. We used two processes per instance to avoid the communication burden. Once again, elapsed time was largely similar between the two algorithms. APG was faster than the multiplicative algorithms in more cases compared to the GPU case, because the multiplicative algorithm on CPU often suffers from slowdown due to creation of denormal numbers (Ko et al., , 2020). Between the two implementations, the DistStat.jl implementation was faster in 24 out of 30 cases.

	Instances	dist_stat			DistStat.jl
	Instances	$r=20$	$r=40$	$r=60$	$r=20$	$r=40$	$r=60$
Multiplicative	4	1419	1748	2276	1392	1576	2057
	5	1076	1455	1698	1187	1383	1847
	8	859	966	1347	851	936	1430
	10	651	881	1115	708	856	1065
	16	549	700	959	553	694	907
	20	501	686	869	554	672	832
APG	4	1333	1756	2082	1412	1711	2023
	5	1088	1467	1720	1215	1372	1775
	8	766	994	1396	849	916	1388
	10	677	870	1165	673	799	1014
	16	539	733	936	547	684	867
	20	506	730	919	538	727	836

Table 4: Runtime (in seconds) of NMF algorithms on

200,000\times 200,000

simulated data on multiple AWS EC2 instances

5.2 Multidimensional scaling

MDS is a dimensionality reduction method that embeds a high-dimensional data $X=[x_{1},\dotsc,x_{m}]^{\top}\in\mathbb{R}^{m\times n}$ into $\boldsymbol{\theta}=[\theta_{1},\dotsc,\theta_{m}]^{\top}\in\mathbb{R}^{m\times q}$ in a lower dimensional Euclidean space ( $q\ll n$ ) while keeping the distance measures as close as possible. We use the cost function

	$\displaystyle f(\theta)$	$\displaystyle=\sum_{i=1}^{m}\sum_{j\neq i}w_{ij}(y_{ij}-\\|\theta_{i}-\theta_{j}\\|_{2})^{2}$
		$\displaystyle=\sum_{i=1}^{m}\sum_{j\neq i}\left[-2w_{ij}y_{ij}\\|\theta_{i}-\theta_{j}\\|_{2}+w_{ij}\\|\theta_{i}-\theta_{j}\\|_{2}^{2}\right]+\mathrm{const.},$

where $y_{ij}$ is the distance measure between $x_{i}$ and $x_{j}$ ; and $w_{ij}$ are the weights. Using the majorization-minimization principle (de Leeuw and Heiser, , 1977; de Leeuw, , 1977), the update equation is given by

\displaystyle\theta_{(n+1)ik}

\displaystyle=\left.\left(\sum_{j\neq i}w_{ij}\left[y_{ij}\frac{\theta_{(n)ik}-\theta_{(n)jk}}{\|\theta_{(n)i}-\theta_{(n)j}\|_{2}}+(\theta_{(n)ik}+\theta_{(n)jk})\right]\right)\right/\left(2\sum_{j\neq i}w_{ij}\right).

for $i=1,\dotsc,m$ and $k=1,\dotsc,q$ . The code below is a simple implementation of MDS in DistStat.jl, which runs on the same environment as the example of the previous subsection.

using DistStat, Random, LinearAlgebra
m = 1000; n = 1000; r = 20
T = Float64; A = Array; iter=1000
X = MPIMatrix{T, A}(undef, m, n); rand!(X)
Y = MPIMatrix{T, A}(undef, n, n)
theta = MPIMatrix{T, A}(undef, r, n); rand!(theta)
theta = 2theta .- 1.0 # range (-1, 1)
theta_distances = MPIMatrix{T, A}(undef, n, n)
theta_WmZ = MPIMatrix{T, A}(undef, r, n)
d_dist = MPIMatrix{T, A}(undef, 1, n) # dist. row vector
d_local = A{T}(undef, n) # bcast. col vector
DistStat.euclidean_distance!(Y, X) # compute pairwise dist
W_sums = convert(T, n - 1)
for i in 1:iter
    mul!(theta_distances, transpose(theta), theta)
    diag!(d_dist, theta_distances)
    diag!(d_local, theta_distances)
    theta_distances .= -2theta_distances .+ d_dist .+ d_local
    fill_diag!(theta_distances, Inf)
    Z = Y ./ theta_distances
    Z_sums = sum(Z; dims=1) # Z sums, length n dist. row vector.
    WmZ = 1.0 .- Z
    fill_diag!(WmZ, zero(T)) # weights of diagonal elements are zero
    mul!(theta_WmZ, theta, WmZ)
    theta .= (theta .* (Z_sums .+ W_sums) .+ theta_WmZ) ./ 2W_sums
end

This code can also run with local arrays if diag!() and fill_diag!() are defined for them. The function DistStat.euclidean_distance!(Y, X) computes the pairwise Euclidean distance matrix $Y$ between the input points in $X$ , which incorporates the chunking method (Li et al., , 2010).

Table 5 compares the performance of DistStat.jl and dist_stat on multiple GPUs with $10,000\times 10,000$ data matrix $X$ . Timing of dist_stat was faster when the number of GPUs employed was small. This is because dist_stat uses highly-optimized precompiled kernels from PyTorch, while DistStat.jl uses just-in-time compiled generated kernels. However, the gap between the two implementations vanishes as more GPUs are utilized.

GPUs	dist_stat			DistStat.jl
GPUs	$q=20$	$q=40$	$q=60$	$q=20$	$q=40$	$q=60$
1	292	301	307	402	415	423
2	146	151	154	267	275	279
3	–	–	–	210	212	216
4	81	84	88	89	93	97
5	74	78	80	77	83	85
6	–	–	–	64	70	72
7	–	–	–	58	64	69
8	52	58	64	53	60	65

Table 5: Runtime (in seconds) of MDS on

10,000\times 10,000

simulated data on multiple GPUs. Experiments could not be run with 3, 6, and 7 GPUs on dist_stat, because 3, 6, and 7 did not divide the data size.

For the AWS experiments, we used 36 processes per instance, because the step that mainly causes inter-instance communication is the matrix-vector multiplication, and its communication cost is much less than NMF. Note that this setting was not possible with the dist_stat implementation. Table 6 shows the runtime of each experiment for 1000 iterations on $100,000\times 1000$ dataset. It can be easily seen that DistStat.jl implementation is significantly faster.

Instances	dist_stat			DistStat.jl
Instances	$q=20$	$q=40$	$q=60$	$q=20$	$q=40$	$q=60$
4	2875	3097	3089	2093	2007	2188
5	2315	2378	2526	1625	1704	1746
8	1531	1580	1719	1073	1105	1215
10	1250	1344	1479	909	980	1022
16	821	914	1031	630	714	736
20	701	823	903	531	663	701

Table 6: Runtime (in seconds) of MDS on

100,000\times 1000

simulated data on multiple AWS EC2 instances

5.3 $\ell_{1}$ -regularized Cox regression

Another example we consider is the Cox proportional hazards model-based regression (Cox, , 1972) with $\ell_{1}$ -regularization. In this regression problem, the covariate matrix $X\in\mathbb{R}^{m\times n}$ is given, as well as the survival time for each sample, which is possibly right-censored. They are denoted by $y=(y_{1},\dotsc,y_{m})$ where $y_{i}=\min\{t_{i},c_{i}\}$ , and $t_{i}$ is the time to event and $c_{i}$ is time to right censoring. Let $\delta_{i}=I_{\{t_{i}\leq c_{i}\}}$ indicate whether the event occured before the right censoring of sample $i$ . The log-partial likelihood of the Cox proportional hazards model is defined by

\displaystyle L(\beta)

\displaystyle=\sum_{i=1}^{m}\delta_{i}\left[\beta^{\top}x_{i}\log\left(\sum_{j:y_{j}\geq y_{i}}\exp(\beta^{\top}x_{j})\right)\right].

We maximize $L(\beta)-\lambda\|\beta\|_{1}$ . Applying the proximal gradient algorithm, which is equivalent to the iterative soft-thresholding algorithm (Beck and Teboulle, , 2009) to this problem, the iteration is given by:

	$\displaystyle w_{(n+1)i}$	$\displaystyle=\exp(x_{i}^{\top}\beta);\;\;W_{(n+1)j}=\sum_{i:y_{i}\geq y_{j}}w_{(n+1)i}$
	$\displaystyle\pi_{(n+1)ij}$	$\displaystyle=I(t_{i}\geq t_{j})w_{(n+1)i}/W_{(n+1)j}$
	$\displaystyle\Delta_{(n+1)}$	$\displaystyle=X^{\top}(I-P_{(n+1)})\delta,\text{ where $P_{(n+1)}=(\pi_{(n+1)ij})$}$
	$\displaystyle\beta_{(n+1)}$	$\displaystyle=\mathcal{S}_{\lambda}(\beta_{(n)}+\sigma\Delta_{(n+1)}),$

where $\mathcal{S}_{\lambda}(\cdot)$ is the soft-thresholding operator discussed in Section 4.2. If the samples are sorted in nonincreasing order of $y_{i}$ , the summation $\sum_{i:y_{i}\geq y_{j}}w_{(n+1)i}$ can be implemented using the function cumsum(). Convergence of the algorithm is guaranteed if we choose the step size $\sigma=1/(2\|X\|_{2}^{2})$ (Ko et al., , 2020). A simple implementation of this algorithm in Julia, assuming no ties in $y_{i}$ , can be written as:

using DistStat, LinearAlgebra, Random
m = 1000; n = 1000
T = Float64; A = Array; iter=1000
sigma = 0.00001 # step size
lambda = 0.00001 # penalty param
X = MPIMatrix{T,A}(undef, m, n); randn!(X)
delta = convert(A{T}, rand(m) .> 0.3) # 30% 1, 70% 0.
DistStat.Bcast!(delta) # bcast the delta from rank-0 proc.
y = convert(A{T}, collect(reverse(1:size(X,1))))
# assume decreasing observed time
beta = MPIVector{T,A}(undef, n)
pi_ind = MPIMatrix{T,A}(undef, m, m)
y_dist = distribute(reshape(y, 1, :))
fill!(pi_ind, one(T))
pi_ind .= ((pi_ind .* y_dist) .- y) .<= 0
Xbeta = A{T}(undef, m)
W = A{T}(undef, m)
pi_delta = A{T}(undef, m)
gradient = MPIVector{T, A}(undef, n)
for i in 1:iter
    mul!(Xbeta, X, beta)
    w = exp.(Xbeta)
    cumsum!(W, w)
    W_dist = distribute(reshape(W, 1, :))
    pi = pi_ind .* w ./ W_dist
    mul!(pi_delta, pi, delta)
    dmpd = delta .- pi_delta
    mul!(gradient, transpose(X), dmpd)
    beta .= soft_threshold.(beta .+ sigma .* gradient, lambda)
end

For performance optimization, note that in addition to the memory for $X$ , an intermediate storage for two $m\times m$ matrices are needed for pi and pi_ind. This can be avoided by environment-specific implementation. For example, a CPU function to compute $P_{(n+1)}\delta$ can be written using LoopVectorization.jl (Elrod, , 2020) for efficient SIMD parallelization using the Advanced Vector Extensions (AVX, Firasta et al., , 2008). For GPU, a kernel directly computing $P_{(n+1)}\delta$ can be written using CUDA.jl. These environment-specific implementations not only use less memory, but also results in some speed-up. On the local workstation we used, the device-specific CPU implementation with four processes with each process using a single core took roughly half the time compared to the dot broadcasting-based implementation. The GPU implementation with four GPUs was 5-10% faster. Code for accelerating computation of $P_{(n+1)}\delta$ is avialable in Appendix B.

Table 7 demonstrates the scalability of the proximal gradient algorithm for $\ell_{1}$ -regularized Cox regression on multiple GPUs. The speed-up with 8 GPUs is clear. While DistStat.jl was slightly slower than dist_stat with fewere GPUs, the gap is in general small. Unfortunately, the underlying algorithm for the cumsum() method in PyTorch is numerically unstable, and could not be used for very small values of $\lambda$ . For this reason, Ko et al., (2020) resorted to double precision. On the other hand, the cumsum() function from CUDA.jl is numerically stable for these values of $\lambda$ .

GPUs	dist_stat	DistStat.jl
GPUs	Float64	Float64	Float32
1	382	447	292
2	205	196	113
3	–	160	91
4	115	136	80
5	98	121	75
6	–	113	71
7	–	106	69
8	124	86	67

Table 7: Runtime (in seconds) of

\ell_{1}

-regularized Cox regression on

10,000\times 10,000

simulated data on multiple GPUs with

\lambda=10^{-8}

. Experiments could not be run with 3, 6, and 7 GPUs on dist_stat, because 3, 6, and 7 did not divide the data size.

For the AWS experiments on DistStat.jl, 36 processes per instance was used once again. Table 8 shows the runtime of the algorithm for 1000 iterations with a simulated $100,000\times 200,000$ dataset. Thanks to the flexibility of the Julia implementation allowing 36 threads per instance, the speed-up of DistStat.jl over dist_stat, which used two threads per instance, is obvious.

Nodes	dist_stat	DistStat.jl
4	1455	918
5	1169	819
8	809	558
10	618	447
16	389	290
20	318	245

Table 8: Runtime (in seconds) of

\ell_{1}

-regularized Cox regression on

100,000\times 200,000

simulated data on multiple AWS EC2 instances with

\lambda=10^{-8}

6 Genome-wide survival analysis of the UK Biobank dataset

In this section, we demonstrate an application of DistStat.jl on the genome-wide survival analysis on the UK Biobank dataset (Sudlow et al., , 2015) using the $\ell_{1}$ -regularized Cox regression code for Type 2 Diabetes (T2D). The UK Biobank dataset consists of 500,000 subjects and approximately 800,000 single nucleotide polymorphisms (SNPs). We used 402,297 patients including 17,994 T2D patients and 470,194 SNPs, after filtering possible Type 1 Diabetes patients and filtering SNPs with high genotyping failure rates and low minor allele frequencies. In addition to the SNPs, gender, and the top ten principal components were included as unpenalized covariates. Unlike Ko et al., (2020), who used only a half of the full dataset due to the limitations of dist_stat, the improved memory efficiency of DistStat.jl allowed us to use the entire dataset, which is about 350 gigabytes, with the additional memory for computation. The age of onset was used as the “survival time” $y_{i}$ for T2D patients ( $\delta=1$ ), and the age at the last visit was used as $y_{i}$ for non-T2D subjects ( $\delta=0$ ). We used 43 different values of $\lambda$ in range $[6.0\times 10^{-9},1.5\times 10^{-8}]$ , where 0 to 320 SNPs were selected. Breslow’s method (Breslow, , 1972) was applied for any ties in $y_{i}$ . We used 20 c5.18xlarge instances for the analysis. It took less than 2050 iteration until convergence, where convergence was determined by testing if $\frac{|f(\beta_{(n)})-f(\beta_{(n-10)})|}{|f(\beta_{(n)})+1|}<10^{-5}$ . For each $\lambda$ , the experiment took between 3180 and 3720 seconds.

We ranked the SNPs based on the largest of $\lambda$ for which each SNP has nonzero coefficient with ties broken by the absolute values of the coefficients when joining the model. The set of top nine selections was identical to that of Ko et al., (2020) with a slightly different order, as listed in Table 9. The set includes SNPs on TCF7L2, which is a well-established T2D-related gene (Scott et al., , 2007; The Wellcome Trust Case Control Consortium, , 2007), and those from SLC45A2 and HERC2, whose variants are known to affect skin, eye, and hair pigmentation (Cook et al., , 2009). Occurrence of SLC45A2 and HERC2 may be because of bias in European subjects. As with Ko et al., (2020), significance test using unpenalized Cox regression with only selected SNPs, gender, and top 10 principal components gave a result that is more specific to T2D. We selected SNPs with $p$ -values less than $0.01/333$ using Bonferroni correction to control the family-wise error rate under 0.01. Table 10 lists the 9 selected SNPs.

Six of the SNPs, including the SNPs with five lowest $p$ -values were previously reported to have direct association with T2D (rs1801212 from WFS1 (Fawcett et al., , 2010), rs4506565 from TCF7L2 (The Wellcome Trust Case Control Consortium, , 2007; Dupuis et al., , 2010), rs2943640 from IRS1 (Langenberg et al., , 2014), rs10830962 from MTNR1B (Klimentidis et al., , 2014; Salman et al., , 2015), rs343092 from HMGA2 (Ng et al., , 2014), and rs231362 from KCNQ1 (Riobello et al., , 2016)). In addition, rs1351394 is from HMGA2, known to be associated with T2D. This appears to be an improvement over Ko et al., (2020) in which three of the top nine selections were found to be directly associated with T2D and three others were on the known T2D-associated genes.

Rank	SNP ID	Chr	Location	A1^A	A2^B	MAF^C	Mapped Gene	Sign
1	rs4506565	10	114756041	A	T	0.314	TCF7L2	$+$
2	rs16891982	5	33951693	G	C	0.073	SLC45A2	$-$
3	rs12243326	10	114788815	T	C	0.281	TCF7L2	$+$
4	rs12255372	10	1148088902	G	T	0.285	TCF7L2	$+$
5	rs28777	5	33958959	A	C	0.062	SLC45A2	$-$
6	rs35397	5	33951116	T	G	0.096	SLC45A2	$-$
7	rs1129038	15	28356859	T	C	0.261	HERC2	$-$
8	rs12913832	15	28365618	G	A	0.259	HERC2	$-$
9	rs10787472	10	114781297	A	C	0.470	TCF7L2	$+$

A

Major allele,
B

Minor allele,
C

Minor allele frequency. The boldface indicates the risk allele determined by the reference allele and the sign of the regression coefficient.

Table 9: Top nine SNPs selected by

\ell_{1}

-penalized Cox regression

SNP ID	Chr	Location	A1^A	A2^B	MAF^C	Mapped Gene	Coefficient	$p$ -value
rs1801212	4	6302519	A	G	0.270	WFS1	0.1123	<2E-16
rs4506565	10	114756041	A	T	0.314	TCF7L2	0.2665	<2E-16
rs2943640	2	227093585	C	A	0.336	IRS1	0.0891	1.57E-14
rs10830962	11	92698427	C	G	0.402	MTNR1B	0.0731	1.46E-11
rs343092	12	66250940	G	T	0.166	HMGA2	-0.0746	2.26E-07
rs1351394	12	66351826	C	T	0.478	HMGA2	0.0518	1.70E-06
rs2540917	2	60608759	T	C	0.389	RNU1-32P	-0.0476	2.18E-05
rs1254207	1	236368227	C	T	0.395	GPR137B	0.0458	2.84E-05
rs231362	11	2691471	G	A	0.461	KCNQ1	0.0607	2.87E-05

A

Major allele,
B

Minor allele,
C

Minor allele frequency. The boldface indicates the risk allele determined by the reference allele and the sign of the regression coefficient.

Table 10: SNPs with significant coefficients with significance level 0.01 after Bonferroni correction

7 Summary and Discussion

DistStat.jl is an initial step toward a unified programming in various HPC environments, which supplies a distributed array data structure with any underlying array, essential for high-performance distributed statistical computing. While tested on the basic Array on CPU nodes and CuArray on multi-GPU workstations, DistStat.jl can be used with any array type on any HPC environment provided that the array interface is implemented in Julia with MPI support.

Statistical applications of DistStat.jl including NMF, MDS, and $\ell_{1}$ -regularized Cox regularization is examined, and scalability are demonstrated on a 8-GPU workstation and a virtual cluster on AWS cloud with up to 20 nodes. The performance was equivalent to or better than its Python counterpart, dist_stat: the computation was 20-30% faster on CPUs because of flexibility in setting number of threads per node, and with unoptimized GPU kernels, performance on the multi-GPU workstation was comparable to dist_stat. Using DistStat.jl, we were able to analyze a large genomic dataset of size $400,000\times 500,000$ , which appears the largest analyzed using a multivariate survival model. (Conventional genome-wide analysis relies on SNP-by-SNP univariate models.)

DistStat.jl can be extended to communication-avoiding linear algebra (Ballard et al., , 2011). For example, an interface to ScaLAPACK (Choi et al., , 1992) can be merged into DistStat.jl for communication-avoiding dense matrix multiplication for CPU. For sparse matrix-dense matrix multiplication (Koanantakool et al., , 2016) often used in inverse covariance matrix estimation for graphical models (Koanantakool et al., , 2018), its incorporation into DistStat.jl will provide a highly productive distributed CPU and multi-GPU implementations at the same time. Furthermore, the distributed array can be used as the backbone for implementation of communication-avoiding statistical inference (Jordan et al., , 2019) or applications such as communication-avoiding least angle regression (Fountoulakis et al., , 2019).

Appendix A Installation

DistStat.jl requires a standard installation of MPI as a prerequisite. With MPI prepared, the rest of installation follows the standard method for GitHub package in Julia:

julia> using Pkg
julia> pkg"add https://github.com/kose-y/DistStat.jl"

Then, all the dependencies are installed along with the package DistStat.jl.

For the optional multi-GPU support, MPI should be installed with “CUDA-aware” support. For example, when using OpenMPI (Graham et al., , 2006), it should be configured with the flag ----with-cuda. In the shell:

$ ./configure --with-cuda
$ make install

From the Julia side, the standard set of CUDA interface package, CUDA.jl is required. MPI.jl, the underlying MPI interface, supports CUDA-aware MPI. Set the environment variable JULIA_MPI_BINARY=system with CUDA-aware MPI as the system-wide MPI.

Appendix B Code for memory-efficient $\ell_{1}$ -regularized Cox proportional hazards model

For CPU code, the following code accelerates the computation of $P_{(n+1)}\delta$ for $\ell_{1}$ -regularized Cox regression in Section 5.3 using the AVX.

using LoopVectorization
function pi_delta!(out, w, W_dist, delta, W_range)
    # fill ‘out‘ with zeros beforehand.
    m = length(delta)
    W_base = minimum(W_range) - 1
    W_local = W_dist.localarray
    @avx for i in 1:m
        outi = zero(eltype(w))
        for j in 1:length(W_range)
            outi += ifelse(i <= j + W_base,
                delta[j + W_base] * w[i] / W_local[j], zero(eltype(w)))
        end
        out[i] = outi
    end
    DistStat.Barrier()
    DistStat.Allreduce!(out)
    return out
end

DistStat.Allreduce!(out) computes the elementwise sum of out in all ranks, and saves it in the place of out. For GPU, the kernel function can be written as follows:

function pi_delta_kernel!(out, w, W_dist, delta, W_range)
    idx_x = (blockIdx().x-1) * blockDim().x + threadIdx().x
    stride_x = blockDim().x * gridDim().x
    W_base = minimum(W_range) - 1
    for i in idx_x:stride_x:length(out)
        for j in W_range
            @inbounds if i <= j
                out[i] += delta[j] * w[i] / W_dist[j - W_base]
            end
        end
    end
end

And the host function to compute $P_{(n+1)}\delta$ is:

function pi_delta!(out::CuArray, w::CuArray, W_dist, delta, W_range)
    fill!(out, zero(eltype(out)))
    numblocks = ceil(Int, length(w)/256)
    CUDA.@sync begin
        @cuda threads=256 blocks=numblocks pi_delta_kernel!(out,
            w, W_dist.localarray, delta, W_range)
    end
    DistStat.Allreduce!(out)
    out
end

References

Abadi et al., (2016) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
Amazon Web Services, (2019) Amazon Web Services (2019). Amazon cfncluster.
Baker et al., (2018) Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. J. (2018). sgmcmc: An R package for stochastic gradient markov chain monte carlo. Journal of Statistical Software, 91(3).
Ballard et al., (2011) Ballard, G., Demmel, J., Holtz, O., and Schwartz, O. (2011). Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications, 32(3):866–901.
Beck and Teboulle, (2009) Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.
Besard et al., (2019) Besard, T., Churavy, V., Edelman, A., and De Sutter, B. (2019). Rapid software prototyping for heterogeneous and distributed platforms. Advances in Engineering Software, 132:29–46.
Besard et al., (2018) Besard, T., Foket, C., and De Sutter, B. (2018). Effective extensible programming: Unleashing Julia on GPUs. IEEE Transactions on Parallel and Distributed Systems.
Bezanson et al., (2017) Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98.
Blackford et al., (2002) Blackford, L. S., Petitet, A., Pozo, R., Remington, K., Whaley, R. C., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., et al. (2002). An updated set of basic linear algebra subprograms (blas). ACM Transactions on Mathematical Software, 28(2):135–151.
Breslow, (1972) Breslow, N. E. (1972). Discussion of the paper by D. R. Cox. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):216–217.
Buckner et al., (2010) Buckner, J., Wilson, J., Seligman, M., Athey, B., Watson, S., and Meng, F. (2010). The gputools package enables gpu computing in R. Bioinformatics, 26(1):134–135.
Choi et al., (1992) Choi, J., Dongarra, J. J., Pozo, R., and Walker, D. W. (1992). ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In The Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 120–127. IEEE.
Cook et al., (2009) Cook, A. L., Chen, W., Thurber, A. E., Smit, D. J., Smith, A. G., Bladen, T. G., Brown, D. L., Duffy, D. L., Pastorino, L., Bianchi-Scarra, G., Leonard, J. H., Stow, J. L., and Strum, R. A. (2009). Analysis of cultured human melanocytes based on polymorphisms within the SLC45A2/MATP, SLC24A5/NCKX5, and OCA2/P loci. Journal of Investigative Dermatology, 129(2):392–405.
Cox, (1972) Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202.
Dalcín et al., (2008) Dalcín, L., Paz, R., Storti, M., and D’Elía, J. (2008). MPI for Python: Performance improvements and MPI-2 extensions. Journal of Parallel and Distributed Computing, 68(5):655–662.
de Leeuw, (1977) de Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics (Proc. European Meeting of Statisticians, Grenoble, 1976), pages 133–145.
de Leeuw and Heiser, (1977) de Leeuw, J. and Heiser, W. J. (1977). Convergence of correction matrix algorithms for multidimensional scaling. Geometric Representations of Relational Data, pages 735–752.
Dillon et al., (2017) Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., and Saurous, R. A. (2017). Tensorflow distributions. arXiv preprint arXiv:1711.10604.
Dupuis et al., (2010) Dupuis, J., Langenberg, C., Prokopenko, I., Saxena, R., Soranzo, N., Jackson, A. U., Wheeler, E., Glazer, N. L., Bouatia-Naji, N., Gloyn, A. L., et al. (2010). New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nature Genetics, 42(2):105.
Elrod, (2020) Elrod, C. (2020). LoopVectorization.jl: Macro(s) for vectorizing loops. Julia package version 0.6.21.
Fawcett et al., (2010) Fawcett, K. A., Wheeler, E., Morris, A. P., Ricketts, S. L., Hallmans, G., Rolandsson, O., Daly, A., Wasson, J., Permutt, A., Hattersley, A. T., Glaser, B., Franks, P. W., McCarthy, M. I., Wareham, N. J., Sandhu, M. S., and Barroso, I. (2010). Detailed investigation of the role of common and low-frequency wfs1 variants in type 2 diabetes risk. Diabetes, 59(3):741–746.
Firasta et al., (2008) Firasta, N., Buxton, M., Jinbo, P., Nasri, K., and Kuo, S. (2008). Intel AVX: New frontiers in performance improvements and energy efficiency. Intel White Paper.
Fountoulakis et al., (2019) Fountoulakis, K., Das, S., Grigori, L., Mahoney, M. W., and Demmel, J. (2019). Parallel and communication avoiding least angle regression. arXiv preprint arXiv:1905.11340.
Golub and Van Loan, (2013) Golub, G. H. and Van Loan, C. F. (2013). Matrix Computations. Johns Hopkins University Press, Baltimore, MD.
Graham et al., (2006) Graham, R. L., Shipman, G. M., Barrett, B. W., Castain, R. H., Bosilca, G., and Lumsdaine, A. (2006). Open MPI: A high-performance, heterogeneous mpi. In 2006 IEEE International Conference on Cluster Computing, pages 1–9. IEEE.
Hunter and Lange, (2004) Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. The American Statistician, 58(1):30–37.
Janssens, (2018) Janssens, B. (2018). MPIArrays.jl: Distributed arrays based on MPI onesided communication.
Jordan et al., (2019) Jordan, M. I., Lee, J. D., and Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the Americal Statistical Association, 114(526):668–681.
Julia Contributors, (2020) Julia Contributors (2020). Julia 1.4 Documentation. Julia version 1.4.
JuliaGPU Contributors, (2020) JuliaGPU Contributors (2020). CUDA.jl: CUDA programming in Julia. Julia package version 1.0.2.
JuliaParallel Contributors, (2019) JuliaParallel Contributors (2019). DistributedArrays.jl: Distributed arrays in Julia. Julia package version 0.6.4.
JuliaParallel Contributors, (2020) JuliaParallel Contributors (2020). MPI.jl: MPI wrappers for Julia. Julia package version 0.11.0.
Keys et al., (2019) Keys, K. L., Zhou, H., and Lange, K. (2019). Proximal distance algorithms: Theory and practice. Journal of Machine Learning Research, 20(66):1–38.
Klimentidis et al., (2014) Klimentidis, Y. C., Zhou, J., and Wineinger, N. E. (2014). Identification of allelic heterogeneity at type-2 diabetes loci and impact on prediction. PloS one, 9(11).
Klöckner et al., (2012) Klöckner, A., Pinto, N., Lee, Y., Catanzaro, B., Ivanov, P., and Fasih, A. (2012). PyCUDA and PyOpenCL: A scripting-based approach to gpu run-time code generation. Parallel Computing, 38(3):157–174.
Ko et al., (2020) Ko, S., Zhou, H., Zhou, J., and Won, J.-H. (2020). High-performance statistical computing in the computing environments of the 2020s. arXiv preprint arXiv:2001.01916.
Koanantakool et al., (2018) Koanantakool, P., Ali, A., Azad, A., Buluc, A., Morozov, D., Oliker, L., Yelick, K., and Oh, S.-Y. (2018). Communication-avoiding optimization methods for distributed massive-scale sparse inverse covariance estimation. In International Conference on Artificial Intelligence and Statistics, pages 1376–1386.
Koanantakool et al., (2016) Koanantakool, P., Azad, A., Buluç, A., Morozov, D., Oh, S.-Y., Oliker, L., and Yelick, K. (2016). Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 842–853. IEEE.
Lange, (2016) Lange, K. (2016). MM Optimization Algorithms, volume 147. SIAM.
Lange et al., (2000) Lange, K., Hunter, D. R., and Yang, I. (2000). Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1–20.
Langenberg et al., (2014) Langenberg, C., Sharp, S. J., Franks, P. W., Scott, R. A., Deloukas, P., Forouhi, N. G., Froguel, P., Groop, L. C., Hansen, T., Palla, L., et al. (2014). Gene-lifestyle interaction and type 2 diabetes: the epic interact case-cohort study. PLoS Med, 11(5):e1001647.
Lattner and Adve, (2004) Lattner, C. and Adve, V. (2004). Llvm: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004., pages 75–86. IEEE.
LeCun et al., (2015) LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436.
Lee and Seung, (1999) Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788.
Lee and Seung, (2001) Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562.
Li et al., (2010) Li, Q., Kecman, V., and Salman, R. (2010). A chunking method for Euclidean distance matrix calculation on large dataset using multi-GPU. In 2010 Ninth International Conference on Machine Learning and Applications, pages 208–213. IEEE.
Lin, (2007) Lin, C.-J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779.
Ma et al., (2016) Ma, E., Gupta, V., Hsu, M., and Roy, I. (2016). dmapply: A functional primitive to express distributed machine learning algorithms in R. Proceedings of the VLDB Endowment, 9(13):1293–1304.
Ng et al., (2014) Ng, M. C., Shriner, D., Chen, B. H., Li, J., Chen, W.-M., Guo, X., Liu, J., Bielinski, S. J., Yanek, L. R., Nalls, M. A., et al. (2014). Meta-analysis of genome-wide association studies in african americans provides insights into the genetic architecture of type 2 diabetes. PLoS Genetics, 10(8):e1004517.
NVIDIA, (2013) NVIDIA (2013). Basic linear algebra subroutines (cuBLAS) library.
Paszke et al., (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
Riobello et al., (2016) Riobello, C., Gómez, J., Gil-Peña, H., Tranche, S., Reguero, J. R., Jesús, M., Delgado, E., Calvo, D., Morís, C., Santos, F., Coto-Segura, P., Iglesial, S., Alonso, B., Alvarez, V., and Coto, E. (2016). Kcnq1 gene variants in the risk for type 2 diabetes and impaired renal function in the spanish renastur cohort. Molecular and cellular endocrinology, 427:86–91.
Salman et al., (2015) Salman, M., Dasgupta, S., Cholendra, A., Venugopal, P., Lakshmi, G., Xaviour, D., Rao, J., and D’Souza, C. J. (2015). Mtnr1b gene polymorphisms and susceptibility to type 2 diabetes: A pilot study in south indians. Gene, 566(2):189–193.
Scott et al., (2007) Scott, L. J., Mohlke, K. L., Bonnycastle, L. L., Willer, C. J., Li, Y., Duren, W. L., Erdos, M. R., Stringham, H. M., Chines, P. S., Jackson, A. U., et al. (2007). A genome-wide association study of type 2 diabetes in finns detects multiple susceptibility variants. Science, 316(5829):1341–1345.
Sudlow et al., (2015) Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., Liu, B., Matthews, P., Ong, G., Pell, J., Silman, A., Young, A., Sprosen, T., Peakman, T., and Collins, R. (2015). UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779.
The Wellcome Trust Case Control Consortium, (2007) The Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661.
Wang and Zhang, (2013) Wang, Y.-X. and Zhang, Y.-J. (2013). Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336–1353.
Yu, (2002) Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News, 2(2):10–14.