¹ Beihang University, China
² Nanyang Technological University, Singapore
³ Deakin University, Australia
⁴ Stony Brook University, USA
⁵ University of Fukui, Japan

Deep Patch-based Human Segmentation

Dongbo Zhang¹ (joint first author), Zheng Fang² (joint first author), Xuequan Lu³, Hong Qin⁴, Antonio Robles-Kelly³, Chao Zhang⁵, Ying He²

Abstract

3D human segmentation has seen noticeable progress in recent years, yet it remains a challenge. In this paper, we introduce a deep patch-based method for 3D human segmentation. We first extract a local surface patch for each vertex and then parameterize it into a 2D grid (or image). We then embed selected shape descriptors into the 2D grids, which are fed into a 2D Convolutional Neural Network to regress the corresponding semantic labels (e.g., head, torso). Experiments demonstrate that our method is effective in human segmentation and achieves state-of-the-art accuracy.

Keywords: Human segmentation · Deep learning · Parameterization · Shape descriptors

1 Introduction

3D human segmentation is a fundamental problem in human-centered computing. It serves many other applications such as skeleton extraction, editing, and interaction. Given that traditional optimization methods yield limited segmentation quality, deep learning techniques have been put forward to achieve better results.

Recently, a variety of human segmentation methods based upon deep learning have emerged [13, 14, 22, 23]. The main challenges are twofold: first, the “parameterization” scheme and, second, the feature information used as input. Regarding the parameterization scheme, some methods convert 3D geometry data to a 2D image style with brute force [13]. Methods such as [22] convert the whole human model into an image-style 2D domain using geometric parameterization. However, this usually requires certain prior knowledge, such as the selection of different groups of triplet points. Some methods like [23] simply perform a geodesic polar map. Nevertheless, such methods often need augmentation to mitigate origin ambiguity and sometimes generate poor patches for non-rigid humans. Regarding the input feature information, one simple solution is to learn from 3D coordinates, which relies heavily on data augmentation [14]. Other methods [13, 22] employ shape descriptors like WKS [3] as their input.

In this paper, we propose a novel deep learning approach for 3D human segmentation. In particular, we first cast the 3D-2D mapping as a geometric parameterization problem. We then convert each local patch into a 2D grid. We do this so as to embed both global features and local features into the channels of the 2D grids which are taken as input for powerful image-based deep convolutional neural networks like VGG [30]. In the testing phase, we first parameterize a new 3D human shape in the same way as training, and then feed the generated 2D grids into the trained model to output the labels.

We conduct experiments to validate our method and compare it with state-of-the-art human segmentation methods. Experimental results demonstrate that it achieves highly competitive accuracy for 3D human segmentation. We also conduct further ablation studies on different features and different neural networks.

2 Related Work

2.1 Surface Mapping

Surface mapping approaches solve the mapping or parameterization problem for domains ranging from local patch-like surfaces to global shapes. The Exponential Map is often used to parameterize a local region around a central point. It defines a bijection in the local region and preserves distances with low distortion. The Geodesic Polar Map (GPM) describes the Exponential Map using polar coordinates. The methods in [9, 19, 24, 27] implemented GPM on triangular meshes based on approximate geodesics. Exact discrete geodesic algorithms such as [31, 35] trace geodesic paths, and hence polar angles, relatively accurately. A common problem with GPM is that it easily fails to generate a one-to-one map because of the poor approximation of geodesic distances and the miscalculation of polar angles. To overcome this problem, one needs to find the inward ray of geodesics mentioned in [21]. However, the local region sometimes does not form a topological disk, and tracing the isocurve among the triangles is then very difficult. To guarantee a one-to-one mapping in a local patch, one intuitive way is to adopt harmonic maps or angle-preserving conformal maps. A survey [10] reviewed the properties of these mappings. Harmonic maps minimize deformation, and the algorithm is easy to implement on complex surfaces. However, as shown in [6, 8], in the discrete context (i.e., a triangle mesh), the mapping can be flipped over if there are many obtuse triangles. The methods in [12, 15, 26, 28, 29] solved harmonic maps on closed genus-zero surfaces, which was further extended to arbitrary genus by [12, 20]. These global shapes are mapped to simple surfaces with the same genus. If the domains are not homeomorphic, one needs to cut or merge pieces into another topology [7, 32]. These methods are globally injective and maintain harmonicity, while producing greater distortion around the cutting points.

2.2 Deep Learning on Human Segmentation

Inspired by recent deep learning techniques, a number of approaches have attempted to extend these methods to the 3D human segmentation task. Because 3D surfaces form an irregular domain, successful network architectures cannot be applied straightforwardly. By leveraging Convolutional Neural Networks (CNNs), Guo et al. [13] first handled 3D mesh labeling/segmentation in a learning-based way. To use CNNs on 3D meshes, they reshape per-triangle hand-crafted features (e.g., curvatures, PCA, spin images) into a regular grid where CNNs are well defined. This approach is simple and flexible for applying CNNs to 3D meshes. However, as the method only considers per-triangle information, it fails to aggregate information among nearby triangles, which is crucial for human segmentation. At the same time, Masci et al. [23] designed a network architecture, named GCNN (Geodesic Convolutional Neural Networks), to deal with non-Euclidean manifolds. Its convolution is based on a local system of geodesic polar coordinates that parameterizes a local surface patch. This convolution is required to be insensitive to the origin of the angular coordinates, which means that it disregards patch orientation. Following [23], anisotropic heat kernels were introduced in [5] to learn local descriptors that incorporate patch orientation. To use CNNs in the surface setting, Maron et al. [22] introduced a deep learning method for 3D mesh models that parameterizes a surface to a canonical 2D domain where successful CNNs can be applied directly. However, their parameterization relies on the choice of three points on the surface, which can involve significant angle and scale distortion. Later, an improved parameterization was employed in [14] to produce a low-distortion coverage in the image domain. Recently, Hanocka et al. [16] designed a method specific to triangle meshes by modifying traditional CNNs to operate on mesh edges.

3 Method

3.1 Overview

In this work, we address 3D human segmentation by assigning a semantic label to each vertex with the aid of its local structure (patch). Due to the intrinsic irregularity of surfaces, traditional 2D CNNs cannot be applied to this task directly. To this end, we map a surface patch into a 2D grid (or image), on which we are able to leverage successful network architectures (e.g., ResNet [17], VGG [30]).

As shown in Fig. 1, for each vertex on a 3D human model, a local patch is built under geodesic measurement. We then convert each local patch into a 2D grid (or image) via a 3D-2D mapping step, to suit the powerful 2D CNNs. To preserve geometric information both locally and globally, we embed local and global shape descriptors into the 2D grid as input features. Finally, we establish the relation between per-vertex (or per-patch) feature tensor and its corresponding semantic label in a supervised learning manner. We first introduce the surface mapping step for converting a local patch into 2D grid in Section 3.2, and then explain the neural network and implementation details in Section 3.3.

Figure 1: Overview of our method. For each vertex, we first build a local patch on the surface and then parameterize it into a 2D grid (or image). We embed the global and local features (WKS, curvatures, AGD) into the 2D grid, which is finally fed into VGG16 [30] to regress its corresponding semantic label.

3.2 Surface Mapping

Patch extraction. Given a triangular mesh $M$ , we compute the local patch $P$ for each vertex $v\in M$ based on the discrete geodesic distance $d$ by satisfying $d(v_{p})<r_{p}$ for all $v_{p}\in P$ . $r_{p}$ is an empirically fixed radius for all patches. Assume the area of $M$ is $\alpha$ , $r_{p}=\sqrt{(\alpha/m)}$ , where $m$ is set to $1000$ in this work. The geodesic distance $d$ is computed locally using the ICH algorithm [35] due to its efficiency and effectiveness.
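The radius rule above can be illustrated with a minimal sketch. It assumes the mesh is given as a list of 3D vertex tuples and a list of index triples, and that the geodesic distances from the patch center are already available (in the paper they come from the ICH algorithm, which is not reproduced here). The function names are hypothetical.

```python
import math

def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the cross product of two edges."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)

def patch_radius(vertices, faces, m=1000):
    """r_p = sqrt(alpha / m), where alpha is the total surface area of the mesh."""
    alpha = sum(triangle_area(*(vertices[i] for i in f)) for f in faces)
    return math.sqrt(alpha / m)

def patch_vertices(geodesic_dist, r_p):
    """Indices v_p with d(v_p) < r_p, given precomputed distances from the center."""
    return [i for i, d in enumerate(geodesic_dist) if d < r_p]
```

Because $r_{p}$ scales with the square root of the surface area, patches cover a comparable fraction of the body regardless of the absolute size of the model.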

Parameterization. There are two cases for parameterization in our context: either $P$ is a topological disk or it is not. For the former case, denoting the 2D planar unit disk by $D$ , we compute the harmonic map $\mu:P\to D$ by solving the Laplace equations

$$\sum_{(v_{j},v_{i})\in M}c_{ij}\left(\mu(v_{j})-\mu(v_{i})\right)=0, \tag{1}$$

with Dirichlet boundary condition

$$\mu(v^{\prime}_{k})=(\cos\theta_{k},\sin\theta_{k}),\quad\theta_{k}=2\pi\frac{\sum_{l=1}^{k}|v^{\prime}_{l}-v^{\prime}_{l-1}|}{\sum_{o=1}^{m}|v^{\prime}_{o}-v^{\prime}_{o-1}|}, \tag{2}$$

where $v_{i}$ is an interior vertex of $P$ (Eq. (1)) and $c_{ij}$ is the cotangent weight on edge $(v_{i},v_{j})$ . In Eq. (2), $v^{\prime}_{k}$ ( $k\in[1,m]$ ) belongs to the boundary vertex set of $P$ . The boundary vertex set contains $m$ vertices (here $m$ denotes the number of boundary vertices, not the constant used for $r_{p}$ ), which are sorted in clockwise order according to their position on the boundary of $P$ . Suppose $(v_{i},v_{j})$ is an interior edge shared by two adjacent triangles $(v_{i},v_{j},v_{k})$ and $(v_{i},v_{j},v_{l})$ . Then $c_{ij}$ is calculated as

$$c_{ij}=\frac{1}{2}(\cot\beta_{k}+\cot\beta_{l}), \tag{3}$$

where $\beta_{k}$ and $\beta_{l}$ are the angles between $(v_{i},v_{k})$ and $(v_{j},v_{k})$ , and between $(v_{i},v_{l})$ and $(v_{j},v_{l})$ , respectively.
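As an illustration of Eqs. (1) to (3), the following is a minimal sketch of a discrete harmonic map onto the unit disk. It is not the authors' implementation: it assumes the patch is a topological disk, that the ordered boundary loop is supplied by the caller, and it solves the linear system with plain Gauss-Seidel iterations instead of a sparse solver. All function names are hypothetical.

```python
import math

def cot_at(p, q, r):
    """Cotangent of the angle at vertex p in triangle (p, q, r)."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    dot = sum(a * b for a, b in zip(u, v))
    cx = u[1] * v[2] - u[2] * v[1]
    cy = u[2] * v[0] - u[0] * v[2]
    cz = u[0] * v[1] - u[1] * v[0]
    return dot / math.sqrt(cx * cx + cy * cy + cz * cz)

def cotangent_weights(vertices, faces):
    """Eq. (3): c_ij is half the sum of cotangents of the angles opposite edge (i, j)."""
    w = {}
    for a, b, c in faces:
        for i, j, k in ((a, b, c), (b, c, a), (c, a, b)):
            half_cot = 0.5 * cot_at(vertices[k], vertices[i], vertices[j])
            w[(i, j)] = w.get((i, j), 0.0) + half_cot
            w[(j, i)] = w.get((j, i), 0.0) + half_cot
    return w

def boundary_circle(vertices, boundary):
    """Eq. (2): place the ordered boundary loop on the unit circle by chord length."""
    m = len(boundary)
    # seg[k] is |v'_k - v'_{k-1}|; index -1 wraps around to close the loop.
    seg = [math.dist(vertices[boundary[k]], vertices[boundary[k - 1]]) for k in range(m)]
    total, acc, mu = sum(seg), 0.0, {}
    for k in range(m):
        acc += seg[k]
        theta = 2.0 * math.pi * acc / total
        mu[boundary[k]] = (math.cos(theta), math.sin(theta))
    return mu

def harmonic_map(vertices, faces, boundary, iters=500):
    """Eq. (1): each interior vertex becomes the weighted average of its neighbors."""
    w = cotangent_weights(vertices, faces)
    mu = boundary_circle(vertices, boundary)
    interior = [i for i in range(len(vertices)) if i not in mu]
    nbrs = {i: [j for (a, j) in w if a == i] for i in interior}
    for i in interior:
        mu[i] = (0.0, 0.0)
    for _ in range(iters):  # Gauss-Seidel sweeps (an assumption of this sketch)
        for i in interior:
            s = sum(w[(i, j)] for j in nbrs[i])
            x = sum(w[(i, j)] * mu[j][0] for j in nbrs[i]) / s
            y = sum(w[(i, j)] * mu[j][1] for j in nbrs[i]) / s
            mu[i] = (x, y)
    return mu
```

The iteration is only expected to behave well when the cotangent weights are positive, which mirrors the caveat about obtuse triangles in Section 2.1.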

There are cases where the local patch $P$ is not a topological disk and the harmonic map cannot be computed. In these cases, we trace the geodesic path for each $v_{p}\in P$ by reusing the routing information stored by the ICH algorithm when computing $d$ . See Fig. 2 for an illustration of the parameterization. Similar to [23], we then obtain a surface charting represented by polar coordinates on $D$ . We next perform an alignment and a $32\times 32$ grid discretization on $D$ .

Alignment and grid discretization. The orientation of $D$ is ambiguous in the context of the local vertex indexing. We remove the ambiguity by aligning each patch with a flow vector field $\mathcal{\phi}$ on $M$ . For each vertex $v\in M$ and its associated patch $P_{v}$ , the flow vector $\mathcal{\phi}(v)$ serves as the reference direction of $P_{v}$ when mapping to $\mathbb{R}^{2}$ . Fig. 2 illustrates the reference direction with an example. $\mathcal{\phi}$ is defined as a vector field flowing from a set of pre-determined sources $v_{s}\in M$ to the sinks $v_{t}\in M$ . We first solve for a scalar function $u$ on $M$ using the following Laplace equation:

$$\Delta u(v)=0,\qquad u(v_{s})=0,\quad u(v_{t})=1,$$

and the flow vector field is $\mathcal{\phi}=\nabla u$ . We further calibrate the polar angles of $D$ with $\phi$ . Considering the first adjacent edge $e_{base}$ around a source vertex $v_{s}$ as a base edge, BaseToRef is the angle between the projected flow vector $\phi^{\prime}(v_{s})$ and $e_{base}$ . $\phi^{\prime}$ is the projected $\phi$ onto a random adjacent face of the base edge. From the harmonic maps $\mu$ , we easily obtain the polar angles of the local, randomly-oriented polar coordinate system. The polar angles are represented by AxisToV for all $v\in D$ and AxisToBase for $e_{base}$ . To align the local polar axis to the reference direction, the calibrated polar angle for all $v\in D$ is calculated as $\theta_{v}=\textit{AxisToV}-\textit{AxisToBase}-\textit{BaseToRef}$ .
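The calibration formula can be written as a one-line helper. The sketch below assumes all angles are in radians and wraps the result to $[0,2\pi)$, which the text does not state explicitly. The conversion to Cartesian coordinates is included because the grid discretization works in that system. Both function names are hypothetical.

```python
import math

def calibrate_polar_angle(axis_to_v, axis_to_base, base_to_ref):
    """theta_v = AxisToV - AxisToBase - BaseToRef, wrapped to [0, 2*pi)."""
    return (axis_to_v - axis_to_base - base_to_ref) % (2.0 * math.pi)

def polar_to_cartesian(radius, theta):
    """Cartesian coordinates with the pole as origin and the polar axis as x-axis."""
    return radius * math.cos(theta), radius * math.sin(theta)
```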

The grid with $32\times 32$ cells is embedded inside the calibrated $D$ such that $D$ is the circumcircle of the grid. We build a Cartesian coordinate system in $D$ whose origin is the pole of the polar system. The x-axis and the y-axis coincide with the polar axis and the direction $\theta=\pi/2$ , respectively. The vertices and triangles on $D$ are converted to this Cartesian coordinate system. A cell belongs to a triangle if its center lies on that triangle. For each such cell, we compute the barycentric coordinates of its center with respect to the three vertices of the triangle. The barycentric coordinates are used later for calculating cell features from vertex features.
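The cell-to-triangle assignment can be sketched as follows. This is a brute-force illustration under stated assumptions: the flattened patch is given as 2D points in the unit disk with non-degenerate triangles, every cell is tested against every triangle, and a cell whose center lies on a shared edge is assigned to the first matching triangle. The function names and the tolerance `eps` are hypothetical.

```python
import math

def cell_centers(n=32, radius=1.0):
    """Centers of an n x n grid whose circumcircle is the disk of the given radius."""
    half = radius / math.sqrt(2.0)  # half side length of the inscribed square
    step = 2.0 * half / n
    return [[(-half + (c + 0.5) * step, -half + (r + 0.5) * step)
             for c in range(n)] for r in range(n)]

def barycentric(p, a, b, c):
    """Barycentric coordinates of the 2D point p with respect to triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    l0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    l1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return l0, l1, 1.0 - l0 - l1

def assign_cells(points2d, faces, n=32, eps=1e-12):
    """Map each cell (row, col) to (face index, barycentric coords of its center)."""
    cells = {}
    for r, row in enumerate(cell_centers(n)):
        for c, p in enumerate(row):
            for f, (i, j, k) in enumerate(faces):
                lam = barycentric(p, points2d[i], points2d[j], points2d[k])
                if min(lam) >= -eps:  # center lies inside or on the triangle
                    cells[(r, c)] = (f, lam)
                    break
    return cells
```

Cells whose centers are not covered by any triangle are simply absent from the returned dictionary; how such cells are filled is not specified in the text.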

Shape descriptors. After generating the $32\times 32$ grids (or images), we embed shape descriptors into them as features. The features of each cell are calculated by linear interpolation using the barycentric coordinates computed above. The descriptors include the Wave Kernel Signature (WKS) [3], curvatures (minimal, maximal, mean, Gaussian) and the average geodesic distance (AGD) [18]. We normalize each kind of descriptor on a global basis; that is, the maximum and minimum values are selected from the whole descriptor matrix, rather than from a single row or column of the matrix.
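The two steps described above, global normalization and barycentric interpolation, can be sketched as follows. The sketch assumes a descriptor matrix stored as a list of per-vertex rows and assumes min-max scaling to $[0,1]$; the text only states that the extrema are taken over the whole matrix. The function names are hypothetical.

```python
def normalize_global(matrix):
    """Min-max normalize a descriptor matrix using extrema of the whole matrix."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant matrix
    return [[(x - lo) / span for x in row] for row in matrix]

def cell_feature(lam, tri, features):
    """Linearly interpolate per-vertex feature vectors at a cell center.

    lam: barycentric coordinates of the cell center in triangle `tri`.
    tri: the three vertex indices of the triangle.
    features: per-vertex feature vectors, indexed by vertex.
    """
    i, j, k = tri
    return [lam[0] * a + lam[1] * b + lam[2] * c
            for a, b, c in zip(features[i], features[j], features[k])]
```

Applied to all cells and all descriptor channels, this yields one multi-channel $32\times 32$ tensor per vertex.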

3.3 Neural Network and Implementation

Neural network. We adopt the VGG network architecture with $16$ layers (see Fig. 1), a powerful and successful network, as our backbone in this work. The cross-entropy loss is employed as the loss function for the VGG16 net. It is worth noting, however, that the surface parameterization presented in this work is quite general in nature and is applicable to many other CNNs in the literature.

Implementation details. We implement the VGG16 network in PyTorch on a desktop PC with an Intel Core i7-9800X CPU (3.80 GHz, 24GB memory). We train for $200$ epochs with a mini-batch size of $64$ . SGD is used as the optimizer, and the learning rate is decreased from $1.0\times 10^{-3}$ to $1.0\times 10^{-9}$ with increasing epochs. To balance the distribution of labels, in the training stage we randomly sample $5,000$ samples per label in each epoch. Training takes about $3.5$ hours on a GeForce RTX 2080 Ti GPU (11GB memory, CUDA 9.0).
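The per-epoch label balancing can be illustrated with a short sketch. It assumes per-sample integer labels and, as an assumption not stated in the text, falls back to sampling with replacement when a label has fewer samples than requested. The function name is hypothetical.

```python
import random

def balanced_epoch_indices(labels, per_label=5000, seed=None):
    """Draw `per_label` sample indices for every label and shuffle them into one epoch."""
    rng = random.Random(seed)
    by_label = {}
    for idx, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(idx)
    epoch = []
    for idxs in by_label.values():
        if len(idxs) >= per_label:
            epoch.extend(rng.sample(idxs, per_label))     # without replacement
        else:
            # Assumption: reuse samples when a label is under-represented.
            epoch.extend(rng.choices(idxs, k=per_label))  # with replacement
    rng.shuffle(epoch)
    return epoch
```

Calling the function once per epoch with a different seed gives every label the same number of training patches in each epoch.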

Once the model is trained, we can infer semantic labels of a human shape in a vertex-wise way. Given a human shape, we first compute the involved shape descriptors for each vertex. For each vertex, we build a local surface patch and parameterize it into a 2D grid (or image) as described in Section 3.2. We embed all the shape descriptors into a 2D grid and feed it into our trained model for prediction.

4 Experimental Results

In this section, we first introduce the dataset used in our experiments, and then explain the evaluation metric. We then show the visual and the quantitative results. We also perform ablation studies for the input features and different neural networks.

4.1 Dataset Configuration

In this work, we use the dataset from [22], which consists of $373$ training human models from SCAPE [2], FAUST [4], MIT [33] and Adobe Fuse [1], and $18$ test human models from SHREC07 [11]. Some examples from our training dataset are shown in Fig. 3. For each human model, there are $8$ semantic labels (e.g., Head, Arm, Torso, Limb, Feet), as shown in Fig. 1. To represent the geometric information of a human model both globally and locally, we concatenate a set of shape descriptors as input features: $26$ WKS features [3], $4$ curvature features ( $C_{min}$ , $C_{max}$ , $C_{mean}$ , $C_{gauss}$ ) and AGD [18], for $31$ features in total.

4.2 Evaluation Metric

To provide a fair comparison, we evaluate our segmentation results in an area-aware manner, following [22]. For each segmentation result, the accuracy is computed as the total area of correctly labeled triangles divided by the total area of all triangles. The overall accuracy on all involved human shapes is therefore defined as

$$ACC=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{A_{i}}\sum_{j\in J_{i}}a_{ij}, \tag{4}$$

where $N$ denotes the number of test human models and $A_{i}$ is the total triangle area of the $i$ -th human model. $J_{i}$ is the set of indices of correctly labeled triangles of the $i$ -th human model, and $a_{ij}$ is the area of the $j$ -th triangle of the $i$ -th human model. Since we address the human segmentation task in a vertex-wise manner, the per-vertex labels need to be transferred into per-face labels for the quantitative evaluation. The face label is estimated by voting among its three vertex labels: if two or three vertices share a label, that label is assigned to the face; if all three vertex labels differ, one of them is selected at random.
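The voting rule and Eq. (4) can be written compactly. The sketch below assumes per-face predicted labels, ground-truth labels and triangle areas are available as parallel lists. The function names are hypothetical.

```python
import random

def face_label(vertex_labels, rng=random):
    """Majority vote over three vertex labels; random pick if all three differ."""
    a, b, c = vertex_labels
    if a == b or a == c:
        return a
    if b == c:
        return b
    return rng.choice((a, b, c))

def model_accuracy(pred, truth, areas):
    """Area of correctly labeled triangles divided by the total triangle area A_i."""
    correct = sum(ar for p, t, ar in zip(pred, truth, areas) if p == t)
    return correct / sum(areas)

def overall_accuracy(models):
    """Eq. (4): mean of the per-model area-weighted accuracies.

    `models` is a list of (pred, truth, areas) tuples, one per test shape.
    """
    return sum(model_accuracy(*m) for m in models) / len(models)
```

Note that the average is taken over models, so each test shape contributes equally regardless of its triangle count or surface area.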

4.3 Visual and Quantitative Results

In this section, we show the visual and quantitative results. In Fig. 4, the top row shows several of our results on the test set, and the bottom row displays the corresponding ground-truth models. To further evaluate our method for 3D human segmentation, a quantitative comparison with recent human segmentation techniques is summarized in Table 1. As Table 1 shows, our method achieves an accuracy of $89.89\%$ , ranking second among all methods. Our approach is slightly inferior to the best method [14] ( $91.03\%$ ), which benefits from its data augmentation strategy.

Table 1: Comparisons with recent methods for 3D human segmentation.

Method	#Features	ACC
DynGCNN [34]	64	86.40%
Toric CNN [22]	26	88.00%
MDGCNN [25]	64	89.47%
SNGC [14]	3	91.03%
GCNN [23]	64	86.40%
Our Method	31	89.89%

4.4 Ablation Study

Besides the above results, we also evaluate different choices of input features. Table 2 shows that the combination of WKS, curvatures and AGD obtains the best accuracy. Moreover, we evaluate the performance of two different neural networks for 3D human segmentation, as shown in Table 3. VGG16 obtains a higher accuracy than ResNet50 ( $89.89\%$ vs. $87.60\%$ ), and we thus employ VGG16 as the backbone in this work.

Table 2: Comparisons for different input features. For simplicity, S, W, C and A are respectively short for SI-HKS, WKS, Curvatures (Cmin, Cmax, Cmean, Cgauss) and AGD.

Features Used	#Features	ACC
SWCA	50	89.25%
SWA	46	89.81%
WCA (Our)	31	89.89%

Table 3: Comparisons for two different network architectures.

Network	Features	ACC
ResNet50	WKS, Curvatures, AGD	87.60%
VGG16	WKS, Curvatures, AGD	89.89%

5 Conclusion

We have presented a deep learning method for 3D human segmentation. Given a 3D human mesh as input, we first parameterize each local patch in the shape into 2D image style, and feed it into the trained model for automatically predicting the label of each patch (i.e., vertex). Experiments demonstrate the effectiveness of our approach, and show that it can achieve state-of-the-art accuracy in 3D human segmentation. In the future, we would like to explore and design more powerful features for learning the complex relationship between the non-rigid 3D shapes and the semantic labels.

References

[1] Adobe fuse 3d characters. https://www.mixamo.com
[2] Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: Scape: shape completion and animation of people. In: ACM SIGGRAPH 2005 Papers, pp. 408–416 (2005)
[3] Aubry, M., Schlickewei, U., Cremers, D.: The wave kernel signature: A quantum mechanical approach to shape analysis. In: 2011 IEEE International Conference on Computer Vision Workshops. pp. 1626–1633. IEEE (2011)
[4] Bogo, F., Romero, J., Loper, M., Black, M.J.: Faust: Dataset and evaluation for 3d mesh registration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3794–3801 (2014)
[5] Boscaini, D., Masci, J., Rodolà, E., Bronstein, M.: Learning shape correspondence with anisotropic convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 3189–3197 (2016)
[6] Duchamp, T., Certain, A., DeRose, A., Stuetzle, W.: Hierarchical computation of pl harmonic embeddings. preprint (1997)
[7] Floater, M.: One-to-one piecewise linear mappings over triangulations. Mathematics of Computation 72(242), 685–696 (2003)
[8] Floater, M.S.: Parametric tilings and scattered data approximation. International Journal of Shape Modeling 4(03n04), 165–182 (1998)
[9] Floater, M.S.: Mean value coordinates. Computer Aided Geometric Design 20(1), 19–27 (2003)
[10] Floater, M.S., Hormann, K.: Surface parameterization: a tutorial and survey. In: Advances in Multiresolution for Geometric Modelling, pp. 157–186. Springer (2005)
[11] Giorgi, D., Biasotti, S., Paraboschi, L.: Shape retrieval contest 2007: Watertight models track. SHREC competition 8(7) (2007)
[12] Gu, X., Yau, S.T.: Global conformal surface parameterization. In: Proceedings of the 2003 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing. pp. 127–137 (2003)
[13] Guo, K., Zou, D., Chen, X.: 3d mesh labeling via deep convolutional neural networks. ACM Transactions on Graphics 35(1), 1–12 (2015)
[14] Haim, N., Segol, N., Ben-Hamu, H., Maron, H., Lipman, Y.: Surface networks via general covers. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 632–641 (2019)
[15] Haker, S., Angenent, S., Tannenbaum, A., Kikinis, R., Sapiro, G., Halle, M.: Conformal surface parameterization for texture mapping. IEEE Transactions on Visualization and Computer Graphics 6(2), 181–189 (2000)
[16] Hanocka, R., Hertz, A., Fish, N., Giryes, R., Fleishman, S., Cohen-Or, D.: Meshcnn: a network with an edge. ACM Transactions on Graphics 38(4), 1–12 (2019)
[17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[18] Hilaga, M., Shinagawa, Y., Kohmura, T., Kunii, T.L.: Topology matching for fully automatic similarity estimation of 3d shapes. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. pp. 203–212 (2001)
[19] Ju, T., Schaefer, S., Warren, J.: Mean value coordinates for closed triangular meshes. In: ACM SIGGRAPH 2005 Papers, pp. 561–566 (2005)
[20] Khodakovsky, A., Litke, N., Schröder, P.: Globally smooth parameterizations with low distortion. ACM Transactions on Graphics 22(3), 350–357 (2003)
[21] Kokkinos, I., Bronstein, M.M., Litman, R., Bronstein, A.M.: Intrinsic shape context descriptors for deformable shapes. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 159–166. IEEE (2012)
[22] Maron, H., Galun, M., Aigerman, N., Trope, M., Dym, N., Yumer, E., Kim, V.G., Lipman, Y.: Convolutional neural networks on surfaces via seamless toric covers. ACM Transactions on Graphics 36(4), 71–1 (2017)
[23] Masci, J., Boscaini, D., Bronstein, M., Vandergheynst, P.: Geodesic convolutional neural networks on riemannian manifolds. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. pp. 37–45 (2015)
[24] Melvær, E.L., Reimers, M.: Geodesic polar coordinates on polygonal meshes. In: Computer Graphics Forum. vol. 31, pp. 2423–2435. Wiley Online Library (2012)
[25] Poulenard, A., Ovsjanikov, M.: Multi-directional geodesic neural networks via equivariant convolution. ACM Transactions on Graphics 37(6), 1–14 (2018)
[26] Praun, E., Hoppe, H.: Spherical parametrization and remeshing. ACM Transactions on Graphics 22(3), 340–349 (2003)
[27] Schmidt, R., Grimm, C., Wyvill, B.: Interactive decal compositing with discrete exponential maps. In: ACM SIGGRAPH 2006 Papers, pp. 605–613 (2006)
[28] Sheffer, A., Gotsman, C., Dyn, N.: Robust spherical parameterization of triangular meshes. Computing 72(1-2), 185–193 (2004)
[29] Sheffer, A., de Sturler, E.: Parameterization of faceted surfaces for meshing using angle-based flattening. Engineering with Computers 17(3), 326–337 (2001)
[30] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[31] Surazhsky, V., Surazhsky, T., Kirsanov, D., Gortler, S.J., Hoppe, H.: Fast exact and approximate geodesics on meshes. ACM Transactions on Graphics 24(3), 553–560 (2005)
[32] Tutte, W.T.: How to draw a graph. Proceedings of the London Mathematical Society 3(1), 743–767 (1963)
[33] Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from multi-view silhouettes. In: ACM SIGGRAPH 2008 Papers, pp. 1–9 (2008)
[34] Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics 38(5), 1–12 (2019)
[35] Xin, S.Q., Wang, G.J.: Improving chen and han’s algorithm on the discrete geodesic problem. ACM Transactions on Graphics 28(4), 1–8 (2009)