
MiniSeg: An Extremely Minimum Network for Efficient COVID-19 Segmentation

Yu Qiu, Yun Liu*, Shijie Li, Jing Xu
Yun Liu and Jing Xu are corresponding authors.
Abstract

The rapid spread of the new pandemic, i.e., COVID-19, has severely threatened global health. Deep-learning-based computer-aided screening, e.g., COVID-19 infected CT area segmentation, has attracted much attention. However, the publicly available COVID-19 training data are limited, easily causing overfitting for traditional deep learning methods that are usually data-hungry with millions of parameters. On the other hand, fast training/testing and low computational cost are also necessary for the quick deployment and development of COVID-19 screening systems, but traditional deep learning methods are usually computationally intensive. To address the above problems, we propose MiniSeg, a lightweight deep learning model for efficient COVID-19 segmentation. Compared with traditional segmentation methods, MiniSeg has several significant strengths: i) it only has 83K parameters and is thus not easy to overfit; ii) it has high computational efficiency and is thus convenient for practical deployment; iii) it can be quickly retrained by other users on their private COVID-19 data to further improve performance. In addition, we build a comprehensive COVID-19 segmentation benchmark for comparing MiniSeg with traditional methods.

Introduction

As one of the most severe pandemics in human history, coronavirus disease 2019 (COVID-19) threatens global health with thousands of newly infected patients every day. Effective screening of infected patients is of high importance to the fight against COVID-19. The gold standard for COVID-19 diagnosis is Reverse Transcription Polymerase Chain Reaction (RT-PCR) testing (Wang et al. 2020). Unfortunately, the sensitivity of RT-PCR testing is not high enough to prevent the spread of COVID-19 (Ai et al. 2020; Fang et al. 2020). Hence, computed tomography (CT) is used as a complementary tool to RT-PCR testing to improve screening sensitivity (Ai et al. 2020; Fang et al. 2020). Besides, CT analysis is necessary for clinical monitoring of disease severity (Inui et al. 2020). However, CT examination requires expert radiologists, who are in severe shortage during this pandemic. Therefore, computer-aided systems are desired for automatic CT interpretation.

When it comes to computer-aided COVID-19 screening, deep-learning-based technology is a good choice due to its numerous success stories (Sun et al. 2014; He et al. 2015; Liu et al. 2019, 2020, 2018a, 2018b). However, directly applying traditional deep learning models to COVID-19 screening is suboptimal. On the one hand, these models usually have millions of parameters and thus require a large amount of labeled data for training. The problem is that the publicly available COVID-19 data are limited, which easily causes overfitting for traditional data-hungry models. On the other hand, traditional deep learning methods, especially those for image segmentation, are usually computationally intensive. Considering the current severe pandemic situation, fast training/testing and a low computational load are essential for the quick deployment and development of computer-aided COVID-19 screening systems.

It is widely accepted that overfitting is more likely to happen when a model has more parameters and less training data. To solve the above problems of COVID-19 segmentation, we observe that lightweight networks are not only less prone to overfitting owing to their small number of parameters but also likely to be efficient, making them suitable for computer-aided COVID-19 screening systems. Therefore, we take lightweight COVID-19 segmentation as the technical route of this paper. The key is to achieve accurate segmentation under the constraints of a small number of network parameters and high efficiency. To achieve this goal, we note that the accuracy of image segmentation can be improved with effective multi-scale learning, which has significantly pushed forward the state of the art of segmentation (Chen et al. 2018a, 2017, b; Yang et al. 2018; Zhao et al. 2017; Mehta et al. 2018, 2019; Pohlen et al. 2017; Yu et al. 2018b). Hence, we resort to multi-scale learning to ensure the segmentation accuracy of lightweight networks.

With the above analyses, our effort starts with the design of an Attentive Hierarchical Spatial Pyramid (AHSP) module for effective, lightweight multi-scale learning. AHSP first builds a spatial pyramid of dilated depthwise separable convolutions and feature pooling for learning multi-scale semantic features. Then, the learned multi-scale features are fused hierarchically to enhance the capacity of multi-scale representation. Finally, the multi-scale features are merged under the guidance of the attention mechanism, which learns to highlight essential information and filter out noisy information in radiography images. With the AHSP module incorporated, we propose an extremely minimum network for efficient segmentation of COVID-19 infected areas in chest CT slices. Our model, namely MiniSeg, has only 83K parameters, two orders of magnitude fewer than traditional image segmentation methods, so that the currently limited COVID-19 data are enough for training it. Finally, we build a comprehensive COVID-19 segmentation benchmark to compare MiniSeg with previous methods extensively. Experiments demonstrate that, trained with limited COVID-19 data, MiniSeg performs favorably against previous state-of-the-art segmentation methods while being highly efficient.

In summary, our contributions are threefold:

  • We propose an Attentive Hierarchical Spatial Pyramid (AHSP) module for effective, lightweight multi-scale learning that is essential for image segmentation.

  • With AHSP incorporated, we present an extremely minimum network, MiniSeg, for accurate and efficient COVID-19 segmentation with limited training data.

  • For an extensive comparison of MiniSeg with previous state-of-the-art segmentation methods, we build a comprehensive COVID-19 segmentation benchmark.

Related Work

Image segmentation is a hot topic due to its wide range of applications. Multi-scale learning plays an essential role in image segmentation because objects in images usually exhibit very large scale changes. Hence most current state-of-the-art methods aim at designing fully convolutional networks (FCNs) (Shelhamer, Long, and Darrell 2017) to learn effective multi-scale representations from input images. For example, U-Net (Ronneberger, Fischer, and Brox 2015), U-Net++ (Zhou et al. 2018), and Attention U-Net (Oktay et al. 2018) propose encoder-decoder architectures to fuse multi-scale deep features at multiple levels. DeepLab (Chen et al. 2018a) and its variants (Chen et al. 2017, 2018b; Yang et al. 2018) design ASPP modules using dilated convolutions with different dilation rates to learn multi-scale features. Besides the multi-scale learning, some studies focus on exploiting the global context information through pyramid pooling (Zhao et al. 2017), context encoding (Zhang et al. 2018a), or non-local operations (Huang et al. 2019; Zhu et al. 2019). The above models aim at improving segmentation accuracy without considering the model size and inference speed, so they are suboptimal for COVID-19 segmentation that only has limited training data and requires high efficiency.

Lightweight networks aim at reducing the parameters and improving the efficiency of deep networks. Convolutional factorization is an intuitive way to reduce the computational complexity of convolution operations. Specifically, many well-known network architectures decompose the standard convolution into multiple steps to reduce the computational complexity, including Flattened Model (Jin, Dundar, and Culurciello 2015), Inception networks (Szegedy et al. 2017), Xception (Chollet 2017), MobileNets (Howard et al. 2017; Sandler et al. 2018), and ShuffleNets (Zhang et al. 2018b; Ma et al. 2018). Among them, Xception and MobileNets factorize a convolution into a pointwise convolution and a depthwise separable convolution. ShuffleNets further factorize a pointwise convolution into a channel shuffle operation and a grouped pointwise convolution. There are also some studies focusing on efficient semantic segmentation network design (Wu et al. 2018; Mehta et al. 2018, 2019; Lo et al. 2019; Wang et al. 2019). Considering COVID-19 segmentation, our goal is to achieve higher accuracy and faster speed by enhancing multi-scale learning in a lightweight setting.

Computer-aided COVID-19 screening has attracted much attention as a supplementary tool for RT-PCR testing to improve screening sensitivity. Some studies (Narin, Kaya, and Pamuk 2020; Gozes et al. 2020; Xu et al. 2020; Li et al. 2020a; Zhang et al. 2020; Wang and Wong 2020) design deep neural networks to classify chest X-rays or CT slices for COVID-19 screening. Fan et al. (2020) proposed a segmentation model for COVID-19 infected area segmentation from CT slices. However, their method falls into the same category as previous segmentation methods and is thus suboptimal. Several public COVID-19 imaging datasets have been released, such as the COVID-19 X-ray Collection (Cohen et al. 2020), COVID-CT-Dataset (Zhao et al. 2020), COVID-19 CT Segmentation Dataset (Jenssen 2020), and COVID-19-CT-Seg (Jun et al. 2020). In this paper, we focus on segmenting COVID-19 infected areas from chest CT slices.

Methodology

Attentive Hierarchical Spatial Pyramid Module

Although the factorization of a convolution operation into a pointwise convolution and a depthwise separable convolution (DSConv) can significantly reduce the number of network parameters and computational complexity, it usually comes with a degradation of accuracy (Howard et al. 2017; Sandler et al. 2018; Zhang et al. 2018b; Ma et al. 2018). Inspired by the fact that effective multi-scale learning plays an essential role in improving segmentation accuracy (Chen et al. 2018a, 2017, b; Yang et al. 2018; Zhao et al. 2017; Mehta et al. 2018, 2019; Pohlen et al. 2017; Yu et al. 2018b), we propose the AHSP module for effective and efficient multi-scale learning in a lightweight setting. Besides common convolution operations, such as vanilla convolution, pointwise convolution, and DSConv, we introduce the dilated DSConv, which adopts a dilated convolution kernel for each input channel. Suppose $\mathcal{F}_{r}^{k\times k}$ denotes a vanilla convolution, where $k\times k$ is the size of the convolution kernel and $r$ is the dilation rate. Suppose $\hat{\mathcal{F}}_{r}^{k\times k}$ denotes a depthwise separable convolution, where $k\times k$ and $r$ have the same meaning as in $\mathcal{F}_{r}^{k\times k}$. The subscript $r$ is omitted without ambiguity when the dilation rate is 1, i.e., $r=1$. For example, $\mathcal{F}^{1\times 1}$ represents a pointwise convolution (i.e., $1\times 1$ convolution), and $\hat{\mathcal{F}}_{2}^{3\times 3}$ is a dilated $3\times 3$ DSConv with a dilation rate of 2.
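For concreteness, the following PyTorch sketch (the paper's implementation framework) shows one way the basic operations above could be realized. The helper names are ours, and we implement $\hat{\mathcal{F}}_{r}^{k\times k}$ as the per-channel (depthwise) part only, since MiniSeg applies the channel-mixing pointwise convolutions separately.

```python
import torch.nn as nn

def pointwise(in_channels, out_channels, groups=1):
    """The pointwise convolution F^{1x1}, optionally grouped."""
    return nn.Conv2d(in_channels, out_channels, kernel_size=1,
                     groups=groups, bias=False)

def dilated_dsconv(channels, kernel_size=3, dilation=1, stride=1):
    """The dilated DSConv \\hat{F}_r^{k x k}: one k x k kernel per input
    channel (groups=channels) with dilation rate r."""
    padding = dilation * (kernel_size - 1) // 2  # keeps the spatial size when stride=1
    return nn.Conv2d(channels, channels, kernel_size, stride=stride,
                     padding=padding, dilation=dilation,
                     groups=channels, bias=False)
```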

With the above definitions for basic operations, we continue by introducing the proposed AHSP module illustrated in Fig. 1. Let $\mathbf{X}\in\mathbb{R}^{C\times H\times W}$ be the input feature map, so that the output feature map is $\mathcal{E}(\mathbf{X})\in\mathbb{R}^{C^{\prime}\times H^{\prime}\times W^{\prime}}$, where $\mathcal{E}$ denotes the transformation function of AHSP for its input. $C$, $H$, and $W$ are the number of channels, height, and width of the input feature map $\mathbf{X}$, respectively. Similar definitions hold for $C^{\prime}$, $H^{\prime}$, and $W^{\prime}$. The input feature map $\mathbf{X}$ is first processed by a pointwise convolution to shrink the number of channels to $C^{\prime}/K$, in which $K$ is the number of parallel branches described later. This operation can be written as

$\mathbf{S}=\mathcal{F}^{1\times 1}(\mathbf{X}).$ (1)

Then, the generated feature map $\mathbf{S}$ is fed into $K$ parallel dilated DSConvs, i.e.,

$\mathbf{F}_{k}=\hat{\mathcal{F}}_{2^{k-1}}^{3\times 3}(\mathbf{S}),\quad k=1,2,\cdots,K,$ (2)

where the dilation rate is increased exponentially to enlarge the receptive field. Eqn. 2 is the basis for multi-scale learning, with large dilation rates capturing large-scale contextual information and small dilation rates capturing local information. We also add an average pooling operation for $\mathbf{S}$ to enrich the multi-scale information, i.e.,

$\mathbf{F}_{0}={\rm AvgPool}^{3\times 3}(\mathbf{S}),$ (3)

where ${\rm AvgPool}^{3\times 3}$ represents average pooling with a kernel size of $3\times 3$. Note that we have $\mathbf{F}_{k}\in\mathbb{R}^{\frac{C^{\prime}}{K}\times H^{\prime}\times W^{\prime}}$ for $k=0,1,\cdots,K$. If we have $H\neq H^{\prime}$ or $W\neq W^{\prime}$, the convolution and pooling operations in Eqn. 2 and Eqn. 3 have a stride of 2 to downsample the feature map by a scale of 2; otherwise, the stride is 1.

Figure 1: Illustration of the proposed AHSP module.

These multi-scale feature maps produced by Eqn. 2 and Eqn. 3 are aggregated in an attentive hierarchical manner. We first add them up hierarchically as

$\dot{\mathbf{F}}_{1}=\mathbf{F}_{0}+\mathbf{F}_{1},\quad\dot{\mathbf{F}}_{2}=\dot{\mathbf{F}}_{1}+\mathbf{F}_{2},\quad\cdots,\quad\dot{\mathbf{F}}_{K}=\dot{\mathbf{F}}_{K-1}+\mathbf{F}_{K},$ (4)

where feature maps are gradually fused from small scales to large scales to enhance the representation capability of multi-scale learning. We further adopt a spatial attention mechanism to make the AHSP module automatically learn to focus on target structures of various scales. On the other hand, the attention mechanism can also learn to suppress irrelevant information at some feature scales and emphasize essential information at other scales. Such self-attention makes each scale speak for itself to decide how important it is in the multi-scale learning process. The transformation of $\dot{\mathbf{F}}_{k}$ by spatial attention can be formulated as

$\ddot{\mathbf{F}}_{k}=\dot{\mathbf{F}}_{k}+\dot{\mathbf{F}}_{k}\otimes\sigma(\mathcal{F}^{1\times 1}(\dot{\mathbf{F}}_{k})),\quad k=1,2,\cdots,K,$ (5)

in which $\sigma$ is the sigmoid activation function and $\otimes$ indicates element-wise multiplication. The pointwise convolution in Eqn. 5 outputs a single-channel feature map, which is then transformed into a spatial attention map by the sigmoid function. This attention map is replicated to the same size as $\dot{\mathbf{F}}_{k}$, i.e., $\frac{C^{\prime}}{K}\times H^{\prime}\times W^{\prime}$, before element-wise multiplication. Considering efficiency, we can compute the attention maps for all $K$ branches together, as

$\mathbf{A}=\sigma(\mathcal{F}^{1\times 1}({\rm Concat}(\dot{\mathbf{F}}_{1},\dot{\mathbf{F}}_{2},\cdots,\dot{\mathbf{F}}_{K}))),$ (6)

where ${\rm Concat}(\cdot)$ concatenates a series of feature maps along the channel dimension. The pointwise convolution in Eqn. 6 is a $K$-grouped convolution with $K$ output channels, so we have $\mathbf{A}\in\mathbb{R}^{K\times H^{\prime}\times W^{\prime}}$. Hence, we can rewrite Eqn. 5 as

$\ddot{\mathbf{F}}_{k}=\dot{\mathbf{F}}_{k}+\dot{\mathbf{F}}_{k}\otimes\mathbf{A}[k],\quad k=1,2,\cdots,K,$ (7)

in which $\mathbf{A}[k]$ denotes the $k$-th channel of $\mathbf{A}$.

Finally, we merge and fuse the above hierarchical feature maps as

$\ddot{\mathbf{F}}={\rm Concat}(\ddot{\mathbf{F}}_{1},\ddot{\mathbf{F}}_{2},\cdots,\ddot{\mathbf{F}}_{K}),\qquad\mathcal{E}(\mathbf{X})={\rm PReLU}({\rm BatchNorm}(\mathcal{F}^{1\times 1}(\ddot{\mathbf{F}}))),$ (8)

where ${\rm BatchNorm}(\cdot)$ denotes batch normalization (Ioffe and Szegedy 2015) and ${\rm PReLU}(\cdot)$ indicates the PReLU (i.e., Parametric ReLU) activation function (He et al. 2015). The pointwise convolution in Eqn. 8 is a $K$-grouped convolution with $C^{\prime}$ output channels, so this pointwise convolution fuses each $\ddot{\mathbf{F}}_{k}$ ($k=1,2,\cdots,K$) separately, i.e., it adds connections across the channels of the depthwise convolutions in Eqn. 2. The fusion among various feature scales is achieved through the first pointwise convolution (i.e., Eqn. 1) in the subsequent AHSP module of MiniSeg and the hierarchical feature aggregation (i.e., Eqn. 4). Such a design reduces the number of convolution parameters in Eqn. 8 by $K$ times compared with a vanilla pointwise convolution, i.e., $C^{\prime 2}/K$ vs. $C^{\prime 2}$.

Given an input feature map $\mathbf{X}\in\mathbb{R}^{C\times H\times W}$, we can compute the output feature map $\mathcal{E}(\mathbf{X})\in\mathbb{R}^{C^{\prime}\times H^{\prime}\times W^{\prime}}$ of an AHSP module using Eqns. 1-8. It is easy to see that increasing $K$ reduces the number of AHSP parameters. Considering the balance between segmentation accuracy and efficiency, we set $K=4$ in our experiments. The proposed AHSP module not only significantly reduces the number of parameters but also enables us to learn effective multi-scale features, so that we can use the limited COVID-19 data to train a high-quality segmenter.
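Putting Eqns. 1-8 together, a minimal PyTorch sketch of the AHSP module could look as follows. The layer names and the exact padding scheme are our assumptions, and `out_channels` is assumed to be divisible by `K`.

```python
import torch
import torch.nn as nn

class AHSP(nn.Module):
    """A sketch of the Attentive Hierarchical Spatial Pyramid module (Eqns. 1-8)."""
    def __init__(self, in_channels, out_channels, K=4, stride=1):
        super().__init__()
        self.K = K
        branch_ch = out_channels // K
        # Eqn. 1: pointwise convolution shrinks channels to C'/K
        self.reduce = nn.Conv2d(in_channels, branch_ch, 1, bias=False)
        # Eqn. 3: average pooling branch
        self.pool = nn.AvgPool2d(3, stride=stride, padding=1)
        # Eqn. 2: K parallel dilated depthwise convolutions with dilations 1, 2, 4, ...
        self.branches = nn.ModuleList([
            nn.Conv2d(branch_ch, branch_ch, 3, stride=stride,
                      padding=2 ** k, dilation=2 ** k,
                      groups=branch_ch, bias=False)
            for k in range(K)])
        # Eqn. 6: K-grouped pointwise convolution producing K attention maps
        self.attn = nn.Conv2d(K * branch_ch, K, 1, groups=K, bias=False)
        # Eqn. 8: K-grouped pointwise fusion back to C' channels
        self.fuse = nn.Conv2d(K * branch_ch, out_channels, 1, groups=K, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.PReLU(out_channels)

    def forward(self, x):
        s = self.reduce(x)                       # Eqn. 1
        feats = [self.pool(s)]                   # F_0 (Eqn. 3)
        feats += [b(s) for b in self.branches]   # F_1..F_K (Eqn. 2)
        # Eqn. 4: hierarchical aggregation from small to large dilation rates
        hier, acc = [], feats[0]
        for f in feats[1:]:
            acc = acc + f
            hier.append(acc)
        a = torch.sigmoid(self.attn(torch.cat(hier, dim=1)))       # Eqn. 6
        # Eqn. 7: per-branch spatial attention with a residual connection
        hier = [h + h * a[:, k:k + 1] for k, h in enumerate(hier)]
        out = self.fuse(torch.cat(hier, dim=1))                    # Eqn. 8
        return self.act(self.bn(out))
```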

Figure 2: Network architecture of the proposed MiniSeg.

Network Architecture

MiniSeg has an encoder-decoder structure. The encoder sub-network focuses on learning effective multi-scale representations for the input image. The decoder sub-network gradually aggregates the representations at different levels of the encoder to predict COVID-19 infected areas. The network architecture of MiniSeg is displayed in Fig. 2.

The encoder sub-network uses AHSP as the basic module and consists of two paths connected through a series of nested skip pathways. Suppose $\mathbf{I}\in\mathbb{R}^{3\times H\times W}$ denotes an input chest CT slice, where the grayscale CT slice is replicated three times to make its number of channels the same as that of color images. The input $\mathbf{I}$ is downsampled four times, resulting in four scales of $1/2$, $1/4$, $1/8$, and $1/16$, with four stages processing these four scales, respectively. Downsampling happens in the first block of each stage.

In the encoder sub-network, we denote the output feature map of the $j$-th block at the $i$-th stage as $\mathbf{E}^{i}_{j}$, w.r.t. $i\in\{1,2,3,4\}$ and $j\in\{1,2,\cdots,N_{i}\}$, where $N_{i}$ indicates the number of blocks at the $i$-th stage. Therefore, we have $\mathbf{E}^{i}_{j}\in\mathbb{R}^{C_{i}\times\frac{H}{2^{i}}\times\frac{W}{2^{i}}}$, in which $C_{i}$ is the number of feature channels at the $i$-th stage. The abovementioned block refers to the proposed AHSP module, except for the first stage, whose basic block is the vanilla Convolution Block (CB). Since the number of feature channels at the first stage (i.e., $C_{1}$) is small, the vanilla convolution does not introduce too many parameters. Without ambiguity, let $\mathcal{E}^{i}_{j}(\cdot)$ be the transformation function of the $j$-th block at the $i$-th stage, without distinguishing whether this block is a vanilla convolution or an AHSP module. For the other path, we propose a Downsampler Block (DB). The transformation function of a DB is denoted as $\mathcal{Q}^{i}_{k}(\cdot)$, w.r.t. $i\in\{1,2,3,4\}$ and $k\in\{1,2,\cdots,M_{i}\}$, where $M_{i}$ denotes the number of DBs at the $i$-th stage. We define DB as

$\mathcal{Q}^{i}_{k}(\mathbf{X})={\rm PReLU}({\rm BatchNorm}(\hat{\mathcal{F}}^{5\times 5}(\mathcal{F}^{1\times 1}(\mathbf{X})))),$ (9)

where $\hat{\mathcal{F}}^{5\times 5}(\cdot)$ has a stride of 2 for downsampling when $k=1$. The output of $\mathcal{Q}^{i}_{k}(\cdot)$ is denoted as $\mathbf{Q}^{i}_{k}$.
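A minimal sketch of the Downsampler Block following Eqn. 9 is shown below; the class name and layer arrangement are ours.

```python
import torch.nn as nn

class DownsamplerBlock(nn.Module):
    """Sketch of the DB (Eqn. 9): a pointwise convolution followed by a 5x5
    depthwise convolution, batch normalization, and PReLU; the stride is 2
    only for the first DB of a stage (k = 1)."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, bias=False),    # F^{1x1}
            nn.Conv2d(out_channels, out_channels, 5, stride=stride,
                      padding=2, groups=out_channels, bias=False),  # \hat{F}^{5x5}
            nn.BatchNorm2d(out_channels),
            nn.PReLU(out_channels),
        )

    def forward(self, x):
        return self.block(x)
```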

Therefore, for the first block of the first stage, we have

$\mathbf{E}^{1}_{1}=\mathcal{E}^{1}_{1}(\mathbf{I}),\quad\mathbf{Q}^{1}_{1}=\mathcal{Q}^{1}_{1}(\mathbf{I}).$ (10)

For the first block of other stages, we compute the output feature map as

$\mathbf{E}^{i-1}=\mathcal{F}^{1\times 1}({\rm Concat}(\mathbf{E}^{i-1}_{N_{i-1}},\mathbf{Q}^{i-1}_{M_{i-1}})),\quad\mathbf{E}^{i}_{1}=\mathcal{E}^{i}_{1}({\rm Split}(\mathbf{E}^{i-1})+\mathbf{E}^{i-1}_{N_{i-1}}),\quad\mathbf{Q}^{i}_{1}=\mathcal{Q}^{i}_{1}({\rm Split}(\mathbf{E}^{i-1})+\mathbf{Q}^{i-1}_{M_{i-1}}),$ (11)

where $i\in\{2,3,4\}$. The operation ${\rm Split}(\cdot)$ splits a feature map along the channel dimension into two chunks, which are fed into $\mathcal{E}^{i}_{1}$ and $\mathcal{Q}^{i}_{1}$, respectively. Here, $\mathcal{E}^{i}_{1}(\cdot)$ and $\mathcal{Q}^{i}_{1}(\cdot)$ ($i\in\{1,2,3,4\}$) have a stride of 2 for downsampling. Instead of only using an on-the-fly element-wise sum (Eqn. 12 and Eqn. 13), through Eqn. 11 we conduct a “concat-fuse-split” operation to fully integrate the features from the two paths, as concatenation avoids the information loss of summation and thus fuses features better (Huang et al. 2017). ${\rm Split}(\cdot)$ is used to handle the increased number of channels brought by concatenation.

For other blocks, the output feature map is

$\mathbf{E}^{i}_{j}=\mathcal{E}^{i}_{j}(\mathbf{E}^{i}_{j-1}+\mathbf{Q}^{i}_{j^{\prime}})+\mathbf{E}^{i}_{j-1},\quad\text{w.r.t. } i\in\{1,2,3,4\}\text{ and } j\in\{2,3,\cdots,N_{i}\},$ (12)

where $\mathcal{E}^{i}_{j}(\cdot)$ has a stride of 1, and a residual connection is included for better optimization. We have $j^{\prime}=j-1$ if $j-1\leq M_{i}$; otherwise, $j^{\prime}=M_{i}$. The computation of $\mathbf{Q}^{i}_{k}$ can be formulated as

$\mathbf{Q}^{i}_{k}=\mathcal{Q}^{i}_{k}(\mathbf{Q}^{i}_{k-1}+\mathbf{E}^{i}_{k-1})+\mathbf{Q}^{i}_{k-1},\quad\text{w.r.t. } i\in\{1,2,3,4\}\text{ and } k\in\{2,3,\cdots,M_{i}\}.$ (13)

Through Eqn. 12 and Eqn. 13, the two paths of the encoder sub-network build nested skip connections. Such a design benefits the multi-scale learning of the encoder. Considering the balance among the number of network parameters, segmentation accuracy, and efficiency, we set $C_{i}$ to $\{8,24,32,64\}$, $N_{i}$ to $\{3,4,9,9\}$, and $M_{i}$ to $\{2,2,5,4\}$ for $i\in\{1,2,3,4\}$, respectively.
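To illustrate how Eqns. 12 and 13 interleave the two paths within a stage, consider the sketch below. The lock-step ordering and the block containers are our reading of the equations, not code from the paper.

```python
def run_stage(e_blocks, q_blocks, e, q):
    """Sketch of the nested skip connections inside one encoder stage.
    `e_blocks` holds the stride-1 AHSP/CB blocks (indices j = 2..N_i) and
    `q_blocks` the stride-1 DBs (indices k = 2..M_i); `e` and `q` are the
    outputs of the first (downsampling) blocks of the stage."""
    n, m = len(e_blocks), len(q_blocks)
    e_prev, q_prev = e, q
    for idx in range(max(n, m)):
        # Eqn. 12: E^i_j = block(E^i_{j-1} + Q^i_{j'}) + E^i_{j-1}
        e_new = e_blocks[idx](e_prev + q_prev) + e_prev if idx < n else e_prev
        # Eqn. 13: Q^i_k = block(Q^i_{k-1} + E^i_{k-1}) + Q^i_{k-1}
        q_new = q_blocks[idx](q_prev + e_prev) + q_prev if idx < m else q_prev
        e_prev, q_prev = e_new, q_new
    return e_prev, q_prev
```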

The decoder sub-network is kept simple for efficient multi-scale feature decoding. Since the top feature map of the encoder has a scale of $1/16$ of the original input, directly predicting COVID-19 infected areas from it is suboptimal due to the loss of fine details. Instead, we utilize a simple decoder sub-network to gradually upsample and fuse the learned feature map at each scale. A Feature Fusion Module (FFM) is proposed for feature aggregation. Let $\mathcal{D}_{i}(\cdot)$ represent the function of FFM:

$\mathbf{S}^{\prime}_{i}=\mathcal{F}^{1\times 1}(\mathbf{X}),\qquad\mathcal{D}_{i}(\mathbf{X})={\rm BatchNorm}(\hat{\mathcal{F}}^{3\times 3}(\mathbf{S}^{\prime}_{i})+\hat{\mathcal{F}}_{2}^{3\times 3}(\mathbf{S}^{\prime}_{i})),$ (14)

in which $\mathcal{D}_{i}(\mathbf{X})$ ($i=1,2,3$) has $C_{i}$ channels, as the pointwise convolution is utilized to adjust the number of channels. We denote the feature map in the decoder as $\mathbf{D}_{i}\in\mathbb{R}^{C_{i}\times\frac{H}{2^{i}}\times\frac{W}{2^{i}}}$, and we have $\mathbf{D}_{4}={\rm BatchNorm}(\mathcal{F}^{1\times 1}(\mathbf{E}_{N_{4}}^{4}))$. We compute $\mathbf{D}_{i}$ ($i=3,2,1$) as

$\mathbf{S}^{\prime\prime}_{i}=\mathcal{D}_{i}({\rm Upsample}(\mathbf{D}_{i+1},2)),\qquad\mathbf{D}_{i}={\rm PReLU}(\mathbf{S}^{\prime\prime}_{i}+{\rm BatchNorm}(\mathcal{F}^{1\times 1}(\mathbf{E}_{N_{i}}^{i}))),$ (15)

where ${\rm Upsample}(\cdot,t)$ upsamples a feature map by a scale of $t$ using bilinear interpolation. In this way, the decoder sub-network enhances the high-level semantic features with low-level fine details, so that MiniSeg can make accurate predictions for COVID-19 infected areas.
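A minimal sketch of one decoder step (FFM plus the lateral fusion of Eqn. 15) is shown below; the parameter names and the lateral branch layout are our assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FFM(nn.Module):
    """Sketch of the Feature Fusion Module and one decoder step (Eqns. 14-15)."""
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        # Eqn. 14: pointwise reduction followed by two parallel 3x3 depthwise
        # convolutions (dilation 1 and 2) whose sum is batch-normalized.
        self.reduce = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.dw1 = nn.Conv2d(out_channels, out_channels, 3, padding=1,
                             groups=out_channels, bias=False)
        self.dw2 = nn.Conv2d(out_channels, out_channels, 3, padding=2,
                             dilation=2, groups=out_channels, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        # Eqn. 15: projection of the encoder feature at the same scale.
        self.lateral = nn.Sequential(
            nn.Conv2d(skip_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.act = nn.PReLU(out_channels)

    def forward(self, d_next, e_skip):
        x = F.interpolate(d_next, scale_factor=2, mode='bilinear',
                          align_corners=False)   # Upsample(D_{i+1}, 2)
        s = self.reduce(x)
        s = self.bn(self.dw1(s) + self.dw2(s))   # D_i(.) in Eqn. 14
        return self.act(s + self.lateral(e_skip))
```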

With $\mathbf{D}_{i}$ ($i=1,2,3,4$) computed, we make dense predictions using a pointwise convolution, i.e.,

$\mathbf{P}_{i}={\rm Softmax}({\rm Upsample}(\mathcal{F}^{1\times 1}(\mathbf{D}_{i}),2^{i})),$ (16)

where ${\rm Softmax}(\cdot)$ is the standard softmax function, and this pointwise convolution has two output channels representing the two classes of background and COVID-19, respectively. $\mathbf{P}_{i}\in\mathbb{R}^{H\times W}$ is the predicted class label map. We utilize $\mathbf{P}_{1}$ as the final output prediction. In training, we impose deep supervision (Lee et al. 2015) by replacing the softmax function in Eqn. 16 with the standard cross-entropy loss function, i.e., each side output is supervised by the ground truth.
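The prediction head of Eqn. 16 and the deeply supervised training loss could be sketched as follows; the head and loss function names are ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Sketch of the prediction head (Eqn. 16): a pointwise convolution with
    two output channels (background vs. COVID-19), upsampled to the input
    resolution."""
    def __init__(self, in_channels, num_classes=2, scale=2):
        super().__init__()
        self.classify = nn.Conv2d(in_channels, num_classes, 1)
        self.scale = scale

    def forward(self, d):
        logits = F.interpolate(self.classify(d), scale_factor=self.scale,
                               mode='bilinear', align_corners=False)
        return logits  # softmax at inference; cross-entropy during training

def deep_supervision_loss(heads, decoder_feats, target):
    """Sum the cross-entropy losses over all deeply supervised side outputs
    D_1..D_4 (upsampling scales 2, 4, 8, and 16, respectively)."""
    return sum(F.cross_entropy(h(d), target) for h, d in zip(heads, decoder_feats))
```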

Datasets | #Total / #COVID | #Patients
COVID-19-CT100 (Jenssen 2020) | 100 / 100 | ~60
COVID-19-P9 (Jenssen 2020) | 829 / 373 | 9
COVID-19-P20 (Jun et al. 2020) | 1844 / 1844 | 20
COVID-19-P1110 (Morozov et al. 2020) | 785 / 785 | 50
Table 1: A summary of public COVID-19 CT datasets. #Total and #COVID denote the numbers of all or COVID-19 infected CT slices, respectively.

Experiments

Experimental Setup

Implementation details.

We implement the proposed MiniSeg network using the well-known PyTorch framework (Paszke et al. 2017). The Adam optimizer (Kingma and Ba 2015) is used for training with a weight decay of 1e-4. We adopt the poly learning rate policy with an initial learning rate of 1e-3. We train for 80 epochs on the training set with a batch size of 5. For a fair comparison, we train all previous state-of-the-art segmentation methods using the same training settings as MiniSeg.
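A minimal sketch of this training setup is given below. The poly exponent of 0.9 and the per-epoch schedule update are assumptions; the paper only states the poly policy, the initial learning rate, the weight decay, the batch size, and the number of epochs.

```python
import torch
import torch.nn.functional as F

def poly_lr(base_lr, epoch, max_epochs, power=0.9):
    """Poly learning rate policy: base_lr * (1 - epoch / max_epochs)^power."""
    return base_lr * (1.0 - epoch / max_epochs) ** power

def train(model, train_loader, base_lr=1e-3, max_epochs=80, weight_decay=1e-4):
    """Sketch of the training loop with Adam and the poly schedule."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                                 weight_decay=weight_decay)
    for epoch in range(max_epochs):
        lr = poly_lr(base_lr, epoch, max_epochs)
        for group in optimizer.param_groups:
            group['lr'] = lr
        for images, masks in train_loader:   # batches of 5 CT slices
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), masks)
            loss.backward()
            optimizer.step()
```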

Dataset.

We utilize four open-access COVID-19 CT segmentation datasets, i.e., two sub-datasets of the COVID-19 CT Segmentation Dataset (Jenssen 2020), the COVID-19 CT Lung and Infection Segmentation Dataset (Jun et al. 2020), and MosMedData (Morozov et al. 2020), to evaluate MiniSeg. According to the number of CT slices or the number of COVID-19 patients, we rename these datasets as COVID-19-CT100, COVID-19-P9, COVID-19-P20, and COVID-19-P1110 for convenience, respectively. These datasets are summarized in Tab. 1. For MiniSeg and all baselines, we use standard cropping and random flipping for data augmentation in training. Moreover, we perform 5-fold cross-validation to ensure statistically reliable performance evaluation.
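For reference, a simple way to generate the 5-fold splits is sketched below; the exact grouping used in the experiments (e.g., patient-level vs. slice-level splitting) is an assumption.

```python
import numpy as np

def five_fold_splits(num_samples, seed=0):
    """Sketch of 5-fold cross-validation index splits."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_samples)
    folds = np.array_split(indices, 5)
    for k in range(5):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(5) if i != k])
        yield train_idx, val_idx
```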

SB | MB | AH | TP | CS | mIoU | SEN | SPC | DSC | HD
✓ |  |  |  |  | 75.78 | 71.80 | 96.78 | 59.63 | 92.12
 | ✓ |  |  |  | 76.31 | 76.22 | 97.40 | 61.57 | 88.05
 | ✓ | ✓ |  |  | 76.58 | 77.89 | 97.61 | 62.06 | 83.71
 | ✓ | ✓ | ✓ |  | 76.66 | 78.72 | 97.02 | 62.05 | 78.67
 | ✓ | ✓ | ✓ | ✓ | 78.33 | 79.62 | 97.71 | 64.84 | 71.69
Table 2: Effect of the main components in MiniSeg on the COVID-19-P1110 dataset. Note that the metric HD does not have the unit of %.
PReLU | DE | DS | CB | DB 5×5 | FFM | mIoU | SEN | SPC | DSC | HD
ReLU |  |  |  |  |  | 76.92 | 75.41 | 96.90 | 62.11 | 78.39
 | ✗ |  |  |  |  | 73.11 | 76.31 | 97.45 | 55.35 | 76.72
 |  | ✗ |  |  |  | 76.45 | 80.94 | 96.38 | 62.46 | 87.27
 |  |  | AHSP |  |  | 76.71 | 78.38 | 96.38 | 62.05 | 78.99
 |  |  |  | 3×3 |  | 76.54 | 78.43 | 97.09 | 61.81 | 80.61
 |  |  |  |  | AHSP | 77.15 | 78.69 | 97.33 | 62.46 | 82.98
 |  |  |  |  |  | 78.33 | 79.62 | 97.71 | 64.84 | 71.69 (default MiniSeg)
Table 3: Effect of some design choices on COVID-19-P1110. Each design choice is replaced with the operation in the table or directly removed (✗). DE: Decoder. DS: Deep Supervision.

Evaluation metrics.

We evaluate COVID-19 segmentation accuracy using five popular evaluation metrics in medical image analysis, i.e., mean intersection over union (mIoU), sensitivity (SEN), specificity (SPC), Dice similarity coefficient (DSC), and Hausdorff distance (HD). Specifically, mIoU, SEN, SPC, and DSC range between 0 and 1, and larger values indicate a better model, while a lower HD indicates better segmentation accuracy. Moreover, we also report the number of parameters, the number of FLOPs, and the speed, tested using a $512\times 512$ input image on a TITAN RTX GPU.
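The overlap-based metrics can be computed from the binary confusion counts as sketched below; HD is omitted here since it requires a surface-distance computation, and the function name is ours.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Sketch of mIoU, SEN, SPC, and DSC on binary masks.
    `pred` and `gt` are boolean numpy arrays of the same shape."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    eps = 1e-7
    iou_fg = tp / (tp + fp + fn + eps)   # IoU of the infected class
    iou_bg = tn / (tn + fp + fn + eps)   # IoU of the background class
    return {
        'mIoU': (iou_fg + iou_bg) / 2,
        'SEN': tp / (tp + fn + eps),     # sensitivity (recall)
        'SPC': tn / (tn + fp + eps),     # specificity
        'DSC': 2 * tp / (2 * tp + fp + fn + eps),
    }
```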

Ablation Studies

Effect of main components.

As shown in Tab. 2, we start with a single-branch (SB) module that only has the DSConv with a dilation rate of 1. We replace all AHSP modules in MiniSeg with such SB modules and remove the two-path design of the MiniSeg encoder (the 1st line of Tab. 2). Then, we extend the SB module to a multi-branch (MB) module using the spatial pyramid as in the AHSP module, which demonstrates the importance of multi-scale learning (the 2nd line of Tab. 2). Next, we add the attentive hierarchical fusion strategy (AH) to obtain the AHSP module, which proves the superiority of the attentive hierarchical fusion (the 3rd line of Tab. 2). We continue by adding the two-path design (TP) to the encoder sub-network, which validates that such a two-path design benefits the network optimization (the 4th line of Tab. 2). At last, we add the channel split (CS) operation to obtain the final MiniSeg model (the 5th line of Tab. 2). These ablation studies demonstrate that the main components of MiniSeg are all effective for COVID-19 segmentation.

Effect of some design choices.

We continue by evaluating the design choices of MiniSeg. The results are provided in Tab. 3. First, we replace the PReLU activation function with the ReLU function. Second, we remove the decoder sub-network and change the stride of the last stage from 2 to 1, so that we directly make predictions at the scale of $1/8$ and upsample them to the original size, as in previous studies (Mehta et al. 2018; Lo et al. 2019; Wu et al. 2018; Paszke et al. 2016; Chen et al. 2018a). Third, we remove deep supervision in training. Fourth, we replace the Convolution Blocks (CB) in the first stage with AHSP modules. Fifth, we replace the 5×5 DSConv in the Downsampler Blocks (DB) with a 3×3 DSConv. Sixth, we replace the Feature Fusion Modules (FFM) in the decoder sub-network with AHSP modules. The default setting achieves the best overall performance, demonstrating the effectiveness of our designs.

Method | Backbone | ImageNet Pretrained | #Param | FLOPs | Speed
U-Net | - | No | 8.43M | 65.73G | 57.3fps
FCN-8s | VGG16 | Yes | 15.53M | 105.97G | 4.5fps
SegNet | - | No | 28.75M | 160.44G | 3.0fps
FRRN | - | No | 17.30M | 237.70G | 15.8fps
PSPNet | ResNet50 | Yes | 64.03M | 257.79G | 17.1fps
DeepLabv3 | ResNet50 | Yes | 38.71M | 163.83G | 25.3fps
DenseASPP | - | No | 27.93M | 122.28G | 19.3fps
DFN | ResNet50 | Yes | 43.53M | 81.88G | 56.2fps
EncNet | ResNet50 | Yes | 51.25M | 217.46G | 18.1fps
DeepLabv3+ | Xception | Yes | 53.33M | 82.87G | 3.4fps
BiSeNet | ResNet18 | Yes | 12.50M | 13.01G | 172.4fps
UNet++ | - | No | 8.95M | 138.37G | 26.8fps
Attention U-Net | - | No | 8.52M | 67.14G | 49.2fps
OCNet | ResNet50 | Yes | 51.60M | 220.69G | 19.3fps
DUpsampling | ResNet50 | Yes | 28.46M | 123.01G | 36.5fps
DANet | ResNet50 | Yes | 64.87M | 275.72G | 16.4fps
CCNet | ResNet50 | Yes | 46.32M | 197.92G | 40.0fps
ANNNet | ResNet50 | Yes | 47.42M | 203.07G | 32.8fps
GFF | ResNet50 | Yes | 90.57M | 374.03G | 17.5fps
Inf-Net | ResNet50 | Yes | 30.19M | 27.30G | 155.9fps
MobileNet | MobileNet | Yes | 3.13M | 3.02G | 416.7fps
MobileNetv2 | MobileNetv2 | Yes | 2.17M | 1.60G | 137.0fps
ShuffleNet | ShuffleNet | Yes | 0.92M | 0.75G | 116.3fps
ShuffleNetv2 | ShuffleNetv2 | Yes | 1.22M | 0.77G | 142.9fps
EfficientNet | EfficientNet | No | 8.37M | 13.19G | 48.1fps
ENet | - | No | 0.36M | 1.92G | 71.4fps
ESPNet | - | No | 0.35M | 1.76G | 125.0fps
CGNet | - | No | 0.49M | 3.40G | 73.0fps
ESPNetv2 | - | No | 0.34M | 0.77G | 73.0fps
EDANet | - | No | 0.68M | 4.43G | 147.1fps
LEDNet | - | No | 2.26M | 6.32G | 94.3fps
MiniSeg | - | No | 82.91K | 0.50G | 516.3fps
Table 4: Comparison of MiniSeg to previous state-of-the-art methods in terms of parameters, FLOPs, and speed.
Figure 3: Statistical analysis for MiniSeg on COVID-19-P20. (a) The DSC score vs. the infected area; (b) The DSC score vs. the lesion count in the corresponding CT slice.
Methods | COVID-19-CT100 | COVID-19-P9 | COVID-19-P20 | COVID-19-P1110
(each dataset column group lists mIoU, SEN, SPC, DSC, and HD, in that order)
U-Net 77.56 72.24 97.71 68.37 94.25 76.51 88.53 98.93 65.69 133.64 81.81 82.73 97.92 72.66 61.66 74.26 81.85 97.33 58.62 95.72
FCN-8s 71.85 66.47 93.56 58.11 104.68 81.20 87.12 98.40 72.67 91.32 82.54 84.10 98.02 73.60 51.47 70.51 80.75 97.08 53.33 84.43
SegNet 75.02 80.02 96.34 64.84 109.05 73.88 73.59 98.79 62.07 98.38 79.55 81.68 98.44 69.68 77.28 72.32 76.77 97.24 55.92 105.42
FRRN 79.20 78.47 97.50 71.27 86.56 80.83 86.26 99.54 74.03 84.34 80.61 80.75 97.53 71.43 61.28 73.84 75.45 95.80 58.86 87.11
PSPNet 75.61 70.82 96.47 64.55 99.76 82.15 86.84 99.19 74.85 94.40 81.60 83.44 98.17 71.60 65.60 71.41 80.34 97.40 54.82 87.06
DeepLabv3 81.30 84.80 97.48 74.65 81.77 81.50 85.23 98.56 73.10 95.72 80.26 81.60 97.78 70.96 60.50 72.91 80.45 96.85 55.70 81.35
DenseASPP 78.43 81.14 97.02 70.37 156.23 72.78 70.26 98.65 65.53 98.61 81.11 82.21 97.80 71.68 64.05 74.84 69.38 95.65 57.24 76.61
DFN 81.07 84.27 97.49 74.45 83.73 79.19 85.78 98.64 69.93 106.23 79.13 80.96 96.51 69.46 66.56 73.40 80.12 97.13 57.31 87.10
EncNet 71.28 74.11 95.20 62.83 119.55 81.35 86.88 98.65 72.62 94.77 82.43 84.94 98.03 71.60 71.57 71.65 81.23 96.65 54.89 77.82
DeepLabv3+ 79.45 79.58 97.55 71.70 93.09 81.29 77.93 99.30 73.48 81.95 81.26 81.61 95.35 42.79 182.14 74.14 74.65 97.26 57.16 102.78
BiSeNet 63.09 74.07 87.41 58.66 110.47 72.33 67.17 96.35 55.40 164.07 78.08 76.13 97.07 65.24 85.94 70.29 70.90 95.49 52.26 95.11
UNet++ 77.64 77.26 97.28 69.04 91.73 77.95 86.83 99.39 69.27 104.83 80.73 79.61 96.75 70.34 63.01 73.39 75.67 96.13 59.08 88.21
Attention U-Net 77.71 74.75 97.56 68.93 92.15 76.26 76.39 99.24 66.74 102.43 80.70 82.92 97.41 71.27 64.91 74.62 81.32 97.63 59.34 95.16
OCNet 69.29 72.86 89.38 56.14 105.66 81.14 87.41 98.71 72.94 113.21 80.74 80.71 95.82 69.36 56.60 72.05 79.67 97.64 53.97 97.38
DUpsampling 81.69 84.54 97.60 75.27 81.07 79.96 74.42 96.38 69.60 64.62 81.05 79.37 96.34 71.01 60.19 72.16 65.18 91.77 53.98 72.29
DANet 73.57 66.30 92.76 61.34 99.11 81.59 88.78 99.13 73.82 114.69 78.35 79.87 97.31 67.04 83.13 73.47 75.00 95.80 56.07 74.04
CCNet 75.24 69.55 95.92 63.99 98.03 81.27 86.61 99.16 73.93 90.84 82.22 82.93 97.76 73.13 56.98 72.02 79.16 96.29 54.83 83.07
ANNNet 73.93 66.73 95.72 62.06 102.43 79.52 85.20 98.35 69.55 109.31 81.92 84.10 98.13 72.72 56.99 72.28 81.19 97.30 55.21 83.16
GFF 75.75 69.80 97.53 63.88 103.87 81.20 85.35 98.46 72.61 113.48 82.44 84.29 97.49 73.05 63.84 71.82 81.10 96.50 53.88 86.39
Inf-Net 81.62 76.50 98.32 74.44 86.81 80.28 77.59 98.72 71.76 69.46 64.62 69.46 99.02 63.38 79.68 74.32 62.93 93.45 56.39 71.77
MobileNet 80.07 81.19 95.92 63.99 98.03 81.32 85.53 99.62 74.18 128.95 80.52 82.66 97.95 72.05 70.70 74.84 80.08 97.67 59.91 92.88
MobileNetv2 79.73 82.83 97.32 72.53 88.40 80.09 81.77 99.45 72.16 85.15 80.99 83.16 98.20 71.50 68.54 74.32 80.41 96.96 59.43 93.11
ShuffleNet 77.50 74.57 97.64 69.02 86.97 80.87 83.62 99.28 72.66 105.56 81.97 82.34 98.03 73.33 56.68 74.51 77.73 96.38 58.64 78.16
ShuffleNetv2 78.58 81.21 97.30 71.37 84.72 79.54 82.44 98.75 70.29 102.75 81.31 81.86 98.29 71.67 70.06 74.56 76.89 96.58 58.67 78.55
EfficientNet 78.22 80.25 97.04 70.45 75.26 73.13 73.50 99.25 60.20 133.45 81.58 80.10 98.06 72.12 64.30 73.30 80.66 97.07 58.04 96.30
ENet 79.49 81.26 97.53 71.57 96.08 79.27 79.62 99.07 70.43 101.92 77.57 76.35 97.16 68.23 67.40 74.49 74.86 96.38 57.20 85.32
ESPNet 77.45 84.18 96.48 69.30 97.04 76.79 71.30 98.67 67.68 93.58 80.32 80.53 97.52 69.36 91.84 74.75 72.06 96.96 57.77 94.58
CGNet 79.34 81.55 96.34 71.42 90.37 75.10 70.27 92.57 60.37 134.43 82.24 80.73 97.35 72.35 53.63 74.12 74.83 96.16 56.45 74.34
ESPNetv2 78.66 77.84 96.53 70.46 87.77 78.22 72.42 97.23 67.12 88.58 80.78 79.03 97.41 70.13 73.67 74.10 76.60 97.67 58.37 96.73
EDANet 78.74 82.86 96.98 70.67 88.14 80.11 79.40 98.77 72.89 70.40 79.56 76.79 97.42 68.71 70.72 73.21 73.73 96.71 55.11 84.56
LEDNet 77.41 81.69 96.93 68.74 92.49 78.46 80.96 98.47 70.41 120.74 80.34 78.74 97.90 70.10 65.77 73.46 72.27 95.14 55.09 94.19
MiniSeg 82.15 84.95 97.72 75.91 74.42 85.31 90.60 99.15 80.06 58.46 84.49 85.06 99.05 76.27 51.06 78.33 79.62 97.71 64.84 71.69
 
Table 5: Comparison between MiniSeg and previous state-of-the-art segmentation methods.
Columns, from left to right: CT Slice, U-Net, FCN, DeepLabV3+, UNet++, Attention UNet, Inf-Net, ShuffleNetV2, ESPNetv2, MiniSeg.

Figure 4: Visual comparison between MiniSeg and other methods. Red: true positive; Green: false negative; Blue: false positive.

Comparison with State-of-the-art Methods

Quantitative Evaluation.

To compare MiniSeg to previous state-of-the-art competitors and promote COVID-19 segmentation research, we build a comprehensive benchmark. This benchmark contains 31 previous state-of-the-art image segmentation methods, including U-Net (Ronneberger, Fischer, and Brox 2015), FCN-8s (Shelhamer, Long, and Darrell 2017), SegNet (Badrinarayanan, Kendall, and Cipolla 2017), FRRN (Pohlen et al. 2017), PSPNet (Zhao et al. 2017), DeepLabv3 (Chen et al. 2017), DenseASPP (Yang et al. 2018), DFN (Yu et al. 2018b), EncNet (Zhang et al. 2018a), DeepLabv3+ (Chen et al. 2018b), BiSeNet (Yu et al. 2018a), UNet++ (Zhou et al. 2018), Attention U-Net (Oktay et al. 2018), OCNet (Yuan and Wang 2018), DUpsampling (Tian et al. 2019), DANet (Fu et al. 2019), CCNet (Huang et al. 2019), ANNNet (Zhu et al. 2019), GFF (Li et al. 2020b), Inf-Net (Fan et al. 2020), MobileNet (Howard et al. 2017), MobileNetv2 (Sandler et al. 2018), ShuffleNet (Zhang et al. 2018b), ShuffleNetv2 (Ma et al. 2018), EfficientNet (Tan and Le 2019), ENet (Paszke et al. 2016), ESPNet (Mehta et al. 2018), CGNet (Wu et al. 2018), ESPNetv2 (Mehta et al. 2019), EDANet (Lo et al. 2019), and LEDNet (Wang et al. 2019). Among them, Inf-Net is designed for COVID-19 segmentation. MobileNet, MobileNetv2, ShuffleNet, ShuffleNetv2, and EfficientNet are designed for lightweight image classification. We view them as the encoder and add the decoder of MiniSeg to them so that they are reformed as image segmentation models. ENet, ESPNet, CGNet, ESPNetv2, EDANet, and LEDNet are well-known lightweight segmentation models. The code of these methods is provided online by the authors. We believe that this benchmark would be useful for future research on COVID-19 segmentation.

The comparison between MiniSeg and its competitors in terms of the number of parameters, the number of FLOPs, and speed is summarized in Tab. 4. We can clearly see that the numbers of parameters and FLOPs of MiniSeg are extremely small. Meanwhile, MiniSeg runs much faster than the others. The numerical evaluation results of MiniSeg and the other competitors are presented in Tab. 5. MiniSeg consistently achieves the best or close to the best performance in terms of all metrics on all datasets. For the SPC metric, MiniSeg performs slightly worse than the best method on COVID-19-CT100 and COVID-19-P9. On the COVID-19-P1110 dataset, MiniSeg does not achieve the best results in terms of SEN. Nevertheless, MiniSeg performs favorably against the other competitors overall, which demonstrates its effectiveness and superiority in COVID-19 infected area segmentation. Note that MiniSeg does not need to be pretrained on ImageNet (Russakovsky et al. 2015) owing to its small model size. Therefore, we can conclude that MiniSeg has a low computational load, fast speed, and good accuracy, making it convenient for practical deployment, which is of high importance in the current severe situation of COVID-19.

Qualitative Comparison.

To explicitly show the superiority of MiniSeg, we provide a qualitative comparison between MiniSeg and eight state-of-the-art methods in Fig. 4. We select some representative images from the above datasets. This visual comparison further indicates that MiniSeg outperforms baseline methods remarkably.

Statistical Analysis.

To further study the characteristics of MiniSeg, we perform a statistical analysis on the largest dataset, COVID-19-P20. Fig. 3a and Fig. 3b illustrate the relationship between the DSC score and the infected area and between the DSC score and the lesion count in a CT slice, respectively. We find that MiniSeg achieves a DSC score larger than 0.7 for most CT slices regardless of the infected area. The median DSC is above 0.8 regardless of the lesion count. This suggests that MiniSeg is robust to different cases of COVID-19 infected area segmentation.

Conclusion

In this paper, we focus on segmenting COVID-19 infected areas from chest CT slices. To address the lack of COVID-19 training data and meet the efficiency requirement for the deployment of computer-aided COVID-19 screening systems, we propose an extremely minimum network, i.e., MiniSeg, for accurate and efficient COVID-19 infected area segmentation. MiniSeg adopts a novel multi-scale learning module, i.e., the Attentive Hierarchical Spatial Pyramid (AHSP) module, to ensure its accuracy under the constraint of the extremely minimum network size. To extensively compare MiniSeg with previous state-of-the-art image segmentation methods and promote future research on COVID-19 infected area segmentation, we build a comprehensive benchmark that would be useful for future research. The comparison between MiniSeg and state-of-the-art image segmentation methods demonstrates that MiniSeg not only achieves the best performance but also has high efficiency, making MiniSeg suitable for practical deployment.

References

  • Ai et al. (2020) Ai, T.; Yang, Z.; Hou, H.; Zhan, C.; Chen, C.; Lv, W.; Tao, Q.; Sun, Z.; and Xia, L. 2020. Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: A report of 1014 cases. Radiology 200642.
  • Badrinarayanan, Kendall, and Cipolla (2017) Badrinarayanan, V.; Kendall, A.; and Cipolla, R. 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI 39(12): 2481–2495.
  • Chen et al. (2018a) Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2018a. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI 40(4): 834–848.
  • Chen et al. (2017) Chen, L.-C.; Papandreou, G.; Schroff, F.; and Adam, H. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 .
  • Chen et al. (2018b) Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; and Adam, H. 2018b. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 801–818.
  • Chollet (2017) Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. In IEEE CVPR, 1251–1258.
  • Cohen et al. (2020) Cohen, J. P.; Morrison, P.; Dao, L.; Roth, K.; Duong, T. Q.; and Ghassemi, M. 2020. COVID-19 Image Data Collection: Prospective Predictions Are the Future. arXiv preprint arXiv:2006.11988 .
  • Fan et al. (2020) Fan, D.-P.; Zhou, T.; Ji, G.-P.; Zhou, Y.; Chen, G.; Fu, H.; Shen, J.; and Shao, L. 2020. Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Images. IEEE TMI 39(8): 2626–2637.
  • Fang et al. (2020) Fang, Y.; Zhang, H.; Xie, J.; Lin, M.; Ying, L.; Pang, P.; and Ji, W. 2020. Sensitivity of chest CT for COVID-19: Comparison to RT-PCR. Radiology 200432.
  • Fu et al. (2019) Fu, J.; Liu, J.; Tian, H.; Fang, Z.; and Lu, H. 2019. Dual attention network for scene segmentation. In IEEE CVPR, 3146–3154.
  • Gozes et al. (2020) Gozes, O.; Frid-Adar, M.; Greenspan, H.; Browning, P. D.; Zhang, H.; Ji, W.; Bernheim, A.; and Siegel, E. 2020. Rapid AI development cycle for the coronavirus (COVID-19) pandemic: Initial results for automated detection & patient monitoring using deep learning CT image analysis. arXiv preprint arXiv:2003.05037 .
  • He et al. (2015) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE ICCV, 1026–1034.
  • Howard et al. (2017) Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 .
  • Huang et al. (2017) Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In IEEE CVPR, 4700–4708.
  • Huang et al. (2019) Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; and Liu, W. 2019. CCNet: Criss-cross attention for semantic segmentation. In IEEE ICCV, 603–612.
  • Inui et al. (2020) Inui, S.; Fujikawa, A.; Jitsu, M.; Kunishima, N.; Watanabe, S.; Suzuki, Y.; Umeda, S.; and Uwabe, Y. 2020. Chest CT findings in cases from the cruise ship “Diamond Princess” with coronavirus disease 2019 (COVID-19). Radiology: Cardiothoracic Imaging 2(2): e200110.
  • Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 448–456.
  • Jenssen (2020) Jenssen, H. B. 2020. COVID-19 CT segmentation dataset. http://medicalsegmentation.com/covid19/. Accessed April 10, 2020.
  • Jin, Dundar, and Culurciello (2015) Jin, J.; Dundar, A.; and Culurciello, E. 2015. Flattened convolutional neural networks for feedforward acceleration. In ICLR.
  • Jun et al. (2020) Jun, M.; Cheng, G.; Yixin, W.; Xingle, A.; Jiantao, G.; Ziqi, Y.; Minqing, Z.; Xin, L.; Xueyuan, D.; Shucheng, C.; et al. 2020. COVID-19 CT Lung and Infection Segmentation Dataset. Zenodo, Apr 20.
  • Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Lee et al. (2015) Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; and Tu, Z. 2015. Deeply-supervised nets. In AISTATS, 562–570.
  • Li et al. (2020a) Li, L.; Qin, L.; Xu, Z.; Yin, Y.; Wang, X.; Kong, B.; Bai, J.; Lu, Y.; Fang, Z.; Song, Q.; et al. 2020a. Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT. Radiology 200905.
  • Li et al. (2020b) Li, X.; Zhao, H.; Han, L.; Tong, Y.; Tan, S.; and Yang, K. 2020b. Gated Fully Fusion for Semantic Segmentation. In AAAI, 11418–11425.
  • Liu et al. (2018a) Liu, Y.; Cheng, M.-M.; Fan, D.-P.; Zhang, L.; Bian, J.; and Tao, D. 2018a. Semantic edge detection with diverse deep supervision. arXiv preprint arXiv:1804.02864 .
  • Liu et al. (2019) Liu, Y.; Cheng, M.-M.; Hu, X.; Bian, J.-W.; Zhang, L.; Bai, X.; and Tang, J. 2019. Richer Convolutional Features for Edge Detection. IEEE TPAMI 41(8): 1939–1946.
  • Liu et al. (2018b) Liu, Y.; Jiang, P.-T.; Petrosyan, V.; Li, S.-J.; Bian, J.; Zhang, L.; and Cheng, M.-M. 2018b. DEL: Deep Embedding Learning for Efficient Image Segmentation. In IJCAI, 864–870.
  • Liu et al. (2020) Liu, Y.; Wu, Y.-H.; Ban, Y.; Wang, H.; and Cheng, M.-M. 2020. Rethinking computer-aided tuberculosis diagnosis. In IEEE CVPR, 2646–2655.
  • Lo et al. (2019) Lo, S.-Y.; Hang, H.-M.; Chan, S.-W.; and Lin, J.-J. 2019. Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In ACM Multimedia Asia, 1–6.
  • Ma et al. (2018) Ma, N.; Zhang, X.; Zheng, H.-T.; and Sun, J. 2018. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In ECCV, 116–131.
  • Mehta et al. (2018) Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; and Hajishirzi, H. 2018. ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, 552–568.
  • Mehta et al. (2019) Mehta, S.; Rastegari, M.; Shapiro, L.; and Hajishirzi, H. 2019. ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network. In IEEE CVPR, 9190–9200.
  • Morozov et al. (2020) Morozov, S.; Andreychenko, A.; Pavlov, N.; Vladzymyrskyy, A.; Ledikhova, N.; Gombolevskiy, V.; Blokhin, I.; Gelezhe, P.; Gonchar, A.; Chernina, V.; et al. 2020. MosMedData: Chest CT Scans with COVID-19 Related Findings. medRxiv .
  • Narin, Kaya, and Pamuk (2020) Narin, A.; Kaya, C.; and Pamuk, Z. 2020. Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks. arXiv preprint arXiv:2003.10849 .
  • Oktay et al. (2018) Oktay, O.; Schlemper, J.; Folgoc, L. L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N. Y.; Kainz, B.; et al. 2018. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 .
  • Paszke et al. (2016) Paszke, A.; Chaurasia, A.; Kim, S.; and Culurciello, E. 2016. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 .
  • Paszke et al. (2017) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS Workshop, 1–4.
  • Pohlen et al. (2017) Pohlen, T.; Hermans, A.; Mathias, M.; and Leibe, B. 2017. Full-resolution residual networks for semantic segmentation in street scenes. In IEEE CVPR, 4151–4160.
  • Ronneberger, Fischer, and Brox (2015) Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 234–241.
  • Russakovsky et al. (2015) Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. IJCV 115(3): 211–252.
  • Sandler et al. (2018) Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE CVPR, 4510–4520.
  • Shelhamer, Long, and Darrell (2017) Shelhamer, E.; Long, J.; and Darrell, T. 2017. Fully Convolutional Networks for Semantic Segmentation. IEEE TPAMI 39(4): 640–651.
  • Sun et al. (2014) Sun, Y.; Chen, Y.; Wang, X.; and Tang, X. 2014. Deep learning face representation by joint identification-verification. In NIPS, 1988–1996.
  • Szegedy et al. (2017) Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. A. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 4278–4284.
  • Tan and Le (2019) Tan, M.; and Le, Q. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In ICML, 6105–6114.
  • Tian et al. (2019) Tian, Z.; He, T.; Shen, C.; and Yan, Y. 2019. Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In IEEE CVPR, 3126–3135.
  • Wang and Wong (2020) Wang, L.; and Wong, A. 2020. COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images. arXiv preprint arXiv:2003.09871 .
  • Wang et al. (2020) Wang, W.; Xu, Y.; Gao, R.; Lu, R.; Han, K.; Wu, G.; and Tan, W. 2020. Detection of SARS-CoV-2 in different types of clinical specimens. J. American Medical Association .
  • Wang et al. (2019) Wang, Y.; Zhou, Q.; Liu, J.; Xiong, J.; Gao, G.; Wu, X.; and Latecki, L. J. 2019. LEDNet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation. In IEEE ICIP, 1860–1864. IEEE.
  • Wu et al. (2018) Wu, T.; Tang, S.; Zhang, R.; and Zhang, Y. 2018. CGNet: A light-weight context guided network for semantic segmentation. arXiv preprint arXiv:1811.08201 .
  • Xu et al. (2020) Xu, X.; Jiang, X.; Ma, C.; Du, P.; Li, X.; Lv, S.; Yu, L.; Chen, Y.; Su, J.; Lang, G.; et al. 2020. Deep learning system to screen coronavirus disease 2019 pneumonia. arXiv preprint arXiv:2002.09334 .
  • Yang et al. (2018) Yang, M.; Yu, K.; Zhang, C.; Li, Z.; and Yang, K. 2018. DenseASPP for semantic segmentation in street scenes. In IEEE CVPR, 3684–3692.
  • Yu et al. (2018a) Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; and Sang, N. 2018a. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 325–341.
  • Yu et al. (2018b) Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; and Sang, N. 2018b. Learning a discriminative feature network for semantic segmentation. In IEEE CVPR, 1857–1866.
  • Yuan and Wang (2018) Yuan, Y.; and Wang, J. 2018. OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 .
  • Zhang et al. (2018a) Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; and Agrawal, A. 2018a. Context encoding for semantic segmentation. In IEEE CVPR, 7151–7160.
  • Zhang et al. (2020) Zhang, J.; Xie, Y.; Li, Y.; Shen, C.; and Xia, Y. 2020. COVID-19 Screening on Chest X-ray Images Using Deep Learning based Anomaly Detection. arXiv preprint arXiv:2003.12338 .
  • Zhang et al. (2018b) Zhang, X.; Zhou, X.; Lin, M.; and Sun, J. 2018b. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In IEEE CVPR, 6848–6856.
  • Zhao et al. (2017) Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In IEEE CVPR, 2881–2890.
  • Zhao et al. (2020) Zhao, J.; Zhang, Y.; He, X.; and Xie, P. 2020. COVID-CT-Dataset: A CT scan dataset about COVID-19. arXiv preprint arXiv:2003.13865 .
  • Zhou et al. (2018) Zhou, Z.; Siddiquee, M. M. R.; Tajbakhsh, N.; and Liang, J. 2018. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, 3–11. Springer.
  • Zhu et al. (2019) Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; and Bai, X. 2019. Asymmetric non-local neural networks for semantic segmentation. In IEEE ICCV, 593–602.