A Novel Patch Convolutional Neural Network for View-based 3D Model Retrieval

Zan Gao Shandong Artifical Intelligence Institute, QiLu University of TechnologyJinanChina [email protected] , Yuxiang Shao Shandong Artifical Intelligence Institute, QiLu University of TechnologyJinanChina [email protected] , Weili Guan School of Information Technology, Monash UniversityMelbourneAustralia [email protected] , Meng Liu Shandong Jianzhu UniversityJinanChina [email protected] , Zhiyong Cheng Shandong AI InstituteJinanChina [email protected] and Shengyong Chen Tianjin University of TechnologyTianjinChina [email protected]

(2021)

Abstract.

Recently, many view-based 3D model retrieval methods have been proposed and have achieved state-of-the-art performance. Most of these methods focus on extracting more discriminative view-level features and effectively aggregating the multi-view images of a 3D model, but the latent relationship among these multi-view images is not fully explored. Thus, we tackle this problem from the perspective of exploiting the relationships between patch features to capture long-range associations among multi-view images. To capture associations among views, in this work, we propose a novel patch convolutional neural network (PCNN) for view-based 3D model retrieval. Specifically, we first employ a CNN to extract patch features of each view image separately. Secondly, a novel neural network module named PatchConv is designed to exploit intrinsic relationships between neighboring patches in the feature space to capture long-range associations among multi-view images. Then, an adaptive weighted view layer is further embedded into PCNN to automatically assign a weight to each view according to the similarity between each view feature and the view-pooling feature. Finally, a discrimination loss function is employed to extract the discriminative 3D model feature, which consists of softmax loss values generated by the fusion classifier and the specific classifier. Extensive experimental results on two public 3D model retrieval benchmarks, namely, the ModelNet40, and ModelNet10, demonstrate that our proposed PCNN can outperform state-of-the-art approaches, with mAP values of 93.67%, and 96.23%, respectively.

3D Model Retrieval, Patch Convolutional Neural Network, Adaptive Weighted View Layer, Discrimination Loss

^†^†copyright: acmcopyright^†^†journalyear: 2021^†^†doi: 10.1145/3474085.3475450^†^†conference: Proceedings of the 29th ACM International Conference on Multimedia; October 20–24, 2021; Virtual Event, China^†^†booktitle: Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), October 20–24, 2021, Virtual Event, China^†^†price: 15.00^†^†isbn: 978-1-4503-8651-7/21/10^†^†ccs: Information systems Multimedia and multimodal retrieval^†^†ccs: Information systems Information retrieval query processing

1. Introduction

With the wide application of 3D systems in industrial enterprises, a huge number of 3D models have been generated and stored in enterprise repositories. As the retrieval of 3D models can save a lot of time and cost in new product development and manufacturing, it plays an important role in industrial enterprises, and researchers have developed some 3D models retrieval methods (su2015multi, ; feng2018gvcnn, ; jiang2019mlvcnn, ; he2019view, ). In general, the 3D model retrieval problem is formulated as follows: given a 3D model (query), the matching or relevant 3D models (documents) are retrieved, and the documents are ranked according to the similarity with the query. Since the 3D model representation plays a key important role in the 3D model retrieval, researchers often pay more attention to the 3D model representation. Then only a simple similarity metric approach, such as Euclidean distance, is utilized. The effectiveness of the 3D model representation mainly determines the success or failure of the 3D model retrieval method. Thus, with the tremendous advances of deep learning in recent years (su2015multi, ; yang2020tree, ; Liu2018Cross, ; feng2018gvcnn, ; jiang2019mlvcnn, ; Gao2021Pairwise, ; Yang2021DVMR, ; he2019view, ; Liu2019Online, ), various deep networks have been employed for learning 3D model representation. According to different methods of feature representation, related methods can be roughly divided into two categories: 1) model-based methods (maturana2015voxnet, ; qi2017pointnet++, ) and 2) view-based methods (su2015multi, ; Nie2020Multigraph, ; feng2018gvcnn, ; Zhu2015Learning, ; jiang2019mlvcnn, ; he2019view, ; Zhou2020Semantic, ). In the model-based methods, the feature is directly extracted from the 3D model, and view-based methods place multiple virtual cameras around the 3D model to generate 2D images, which are used as the input data in view-based methods. In comparison with the above model-based methods, view images can be easily obtained in the real world, and view-based methods have achieved satisfying performance in the 3D model retrieval task due to the rapid development of deep networks in 2D image analysis. Besides, the feature extraction of the 3D model is often complex, and its performance is also unsatisfactory in the 3D model retrieval task. Thus, in this work, we continue to focus on view-based 3D model retrieval. Although many view-based 3D model retrieval approaches (su2015multi, ; feng2018gvcnn, ; Liu2021Hierarchical, ; jiang2019mlvcnn, ; he2019view, ) have been proposed, they often focus on applying different aggregating strategies on view-level features to explore the content relationships between multiple views of a 3D model. For example, some early methods, such as MVCNN (su2015multi, ) and MVCNN-MultiRes (qi2016volumetric, ), adopt view-wise pooling strategies to generate the model feature, which treat all views equally.

Nonetheless, the view-pooling scheme discards meaningful content information and spatial relationships between view images, which may help improve the performance in the retrieval task. GVCNN (feng2018gvcnn, ) assigns different weights to views to exploit relationships among view-level features. Other methods (han2018seqviews2seqlabels, ; dai2018siamese, ) utilize recurrent neural networks to obtain correlative information among view-level features. We note that these works extract the feature of each view image independently, while the latent relationships of views are left unexplored in the view feature extraction stage. Different from existing methods that aggregate view-level features, MHBN (yu2018multi, ) employs patch-to-patch similarity measurement to obtain the model representation, but it is difficult to capture long-range associations of all multi-view images. The reason is that view features might differ significantly due to viewpoint variations. Even if there are view features with distinct differences, patches within view images might be more relevant. In comparison with utilizing view-level features directly, it is more reasonable and effective to capture long-range associations among views with patch-level features.

To solve this issue, in this work, we propose a novel patch convolutional neural network that utilizes patch-level features to learn the content information and spatial information within multi-view images. The PCNN takes sequential multi-view images as input, which are rendered from a circle around the 3D model. In comparison with unordered views, sequential views can help the network to learn the spatial information between adjacent views. Specifically, we first employ the CNN to separately extract patch features of each view image. Second, a novel neural network module named PatchConv is designed to capture long-range associations of all multi-view images. In detail, a k-nearest neighbor graph is constructed for each patch according to the similarity between patch features, and then, a convolution-like operation is employed to extract relational information from neighboring patches. To avoid the information loss of multiple views caused by view pooling, an adaptive weighted view layer is further embedded into the PCNN, which explores spatial and content relationships between adjacent view features and automatically assigns a weight to each view to make full use of the distinguishing information in the views. Finally, to improve the feature discriminability, a discrimination loss function is employed to ensure that not only model features but also view features can be distinguished, which consists of the fusion classifier and the specific classifier. The main contributions of this work are summarized as follows:

•

We propose a novel PCNN method for view-based 3D model retrieval where the intrinsic relationships mining of all multi-view images, the fusion of different views, and the extraction of discriminative features are explored in a unified framework.
•

We design a patch convolutional layer to capture long-range associations among all multi-view images of a 3D model and propose an adaptive weighted view layer to automatically assign different weights to all views. Moreover, a discrimination loss function is further utilized to improve the feature discriminability of the 3D model, which consists of softmax loss values generated by the fusion classifier and the specific classifier.
•

Extensive experiments on two 3D model benchmarks show that PCNN can outperform state-of-the-art 3D model retrieval methods in terms of mAP.

2. Related Work

In the existing 3D model retrieval methods, researchers have paid more attention to the 3D model representation, and then a simple similarity metric approach such as Euclidean distance, and Cosine Distance, is used. Among the 3D model representation, since the feature extraction of 3D model features from voxels or 3D mesh or 3D point clouds (maturana2015voxnet, ; qi2017pointnet++, ) is often complex, and its performance is also unsatisfied in the 3D model retrieval task, but view-based 3D model representation have achieved good performance, thus, in this work, we also focus on the view-based 3D model representation, and review it in this section.

In view-based methods, the 3D model is usually projected into a set of 2D view images. In the early development stage, hand-crafted features were often extracted to describe each view (Chen2003On, ; Gao2020Exploring, ; gao20113d, ), such as Zernike moments, light-field descriptors, and compact multiview descriptors. For example, Gao $et$ $al.$ (gao20113d, ) proposed a view-based 3D model retrieval method, which employed a weighted bipartite graph matching to compare the similarity between two 3D models. Ansary $et$ $al.$ (Ansary2007A, ) utilized adaptive views clustering, which is based on an adaptive clustering algorithm, to select the number of views with statistical model distribution scores. To remove the limitation of camera array restriction, Gao $et$ $al.$ (Gao2012Camera, ) proposed a 3D model retrieval algorithm, in which views can be captured from any direction without camera constraint. With the development of deep neural networks, many deep learning-based methods (su2015multi, ; dai2018siamese, ) have been proposed for 3D model retrieval. For example, Su $et$ $al.$ (su2015multi, ) proposed a multi-view CNN (MVCNN) approach that first used a CNN to extract the feature of each view image individually and then max-pooled all view features into a global model feature. Further, a low-rank Mahalanobis metric was employed to improve retrieval performance. However, the simple aggregation strategy leads to the loss of meaningful information from different view images. To explore the discriminative information of different views, Feng $et$ $al.$ (feng2018gvcnn, ) introduced a hierarchical view-group-shape architecture to assign different weights to different view images. Dai $et$ $al.$ (dai2018siamese, ) proposed a Siamese CNN-BiLSTM network, which employed a BiLSTM to capture the relationships among view-level features. Jiang $et$ $al.$ (jiang2019mlvcnn, ) utilized multiple groups of views from different loop directions to represent the 3D model and proposed a hierarchical view-loop-shape architecture to explore the intrinsic associations among views. Gao et al. (Gao2020Exploring, ) systematically evaluated the performance of deep learning features in view-based 3D model retrieval on four popular datasets, meanwhile, these deep learning features were also compared with the hand-crafted features in view-based 3D model retrieval task. Inspired by triplet loss and center loss (wen2016discriminative, ), He $et$ $al.$ (he2018triplet, ) proposed triplet-center loss for 3D model retrieval, which could improve the feature discriminability. Most of these approaches focus on extracting more discriminative view-level features and effectively aggregating the multi-view images of a 3D model, but the patch information of multi-view images is ignored. Thus, different from aggregating view-level features, Yu $et$ $al.$ (yu2018multi, ) represented the 3D model from the perspective of patch-to-patch similarity measurement between all patch features. In detail, bilinear pooling was employed to aggregate patch features, and singular values of the fused feature were harmonized to generate more discriminative 3D model descriptors. However, we also observe that associations between neighboring patches in the feature space are ignored in the patch-to-patch matching. Thus, we focus on the capture of long-range associations among all multi-view images according to relational information of neighboring patches in this work.

3. Patch Convolutional Neural Network

In this section, we introduce the proposed PCNN framework in detail. As shown in Fig. 1, the framework of PCNN can be divided into four parts, namely, patch feature extraction, PatchConv, adaptive weighted view layer, and the discrimination loss function. A shared convolutional neural network is first employed to extract the patch features for each view image. Inspired by DGCNN (wang2019dynamic, ) and MHBN (yu2018multi, ), PatchConv is designed to capture the long-range associations among all multi-view images with patch features. Moreover, an adaptive weighted view layer is further embedded into PCNN to assign different weights to different view images. Finally, the discrimination loss function is designed to improve the feature discriminability.

Refer to caption — Figure 1. The framework of PCNN

3.1. Patch Feature Extraction

Given a 3D model $S$ , we first generate a set of 2D greyscale images $X_{S}=\{x_{1},x_{2},...,x_{N}\}$ , where $x_{i}$ represents the $i^{th}$ view image and $N$ denotes the number of view images. In this work, we follow the camera setup in the MVCNN (su2015multi, ), which generates 12 rendered 2D view images by setting 12 virtual cameras around the model every 30 degrees. The generated view images are arranged in order. To extract patch features, ResNet34 pre-trained on the ImageNet dataset is employed as the backbone, but the last average pooling layer of ResNet34 is removed. The reason why ResNet34 is utilized as the backbone, is that overfitting often occurs due to lack of sufficient training samples, when ResNet50 or ResNet101 is employed as the backbone. Also when ResNet18 is employed, it cannot well represent each view. Thus, to balance the performance and the number of training samples, ResNet34 is utilized as the backbone. In the experiments, to match the input of ResNet34, each 2D image is first scaled to $224\times 224$ , and then these multi-view images are simultaneously fed into the backbone, thus, patch-level features can be extracted for each view. Each patch feature contains information of the corresponding region in the image. Each view image can be represented with $7\times 7$ patch features whose dimensions are $D$ -dimension, and the patch features can be denoted as $p_{j}\in\mathbb{R}^{D}$ . Thus, we can define $P_{S}=\{p_{1},p_{2},...,p_{M}\}$ as the set of patch features from all multi-view images to denote the 3D model $S$ , where $p_{j}$ denotes the $j^{th}$ patch and $M$ represents the number of all patches of all multi-view images. Note that $7\times 7$ patches and the feature dimension of each patch depend on the structure of ResNet34. Since ResNet34 is employed, thus, in our experiments, $N=12$ , $D=512$ , and $M$ = $7\times 7\times N$ .

3.2. Patch Convolution Layer (PatchConv)

In the existing view-based 3D model retrieval methods, the relationships between view-level features are fully exploited, but this is not enough. Due to viewpoint variations, view features are very different from each other, in this way, it leads to the difficulty of directly mining relationships between view-level features. Although these views are very different, there might exist more relevant patches within all views. For example, if the patch feature represents the airplane wing in a particular view, we can find other patch features that represent airplane wings in different views through the k-NN algorithm, and combining features from all views can compensate for the information loss in a particular view. Through these more relevant patches, more relational information between views can be learned. Thus, it is more reasonable to utilize patch-level features, rather than view-level features, to capture long-range associations among all multi-view images, which is the key to obtain robust 3D model features.

Inspired by DGCNN (wang2019dynamic, ) and MHBN (yu2018multi, ), we propose a novel neural network module named PatchConv. Instead of directly working on view features, PatchConv exploits patch features and patch coordinates to construct multiple local neighborhood graphs in the patch feature space. The convolution-like operations are then utilized to extract relational information from the edges connecting neighboring patches in the graphs. Through this operation, we can obtain new patch-level features containing information of neighboring patches, which can make patch features more robust and capture long-range associations among all multi-view images. Supposing that there is a patch $p_{i}$ , the $k$ -nearest neighbors algorithm ( $k$ -NN) is employed to obtain the $k$ -nearest patches among all patch features $P_{S}=\{p_{1},p_{2},...,p_{M}\}\in\mathbb{R}^{M\times D}$ . Moreover, a graph $G(V,E)$ is further constructed, where $V=\{1,2,...,k\}$ and $E\subseteq|V|\times|V|$ denote the set of vertices and edges, respectively. In the simplest case, we construct $G$ as the k-nearest neighbors graph, and $E$ represents the edges of the form $(i,j_{1}),...,(i,j_{k})$ from $p_{i}$ to its $k$ -nearest patches $p_{j}\in\mathcal{N}(i)$ , which denotes the k-nearest neighbors of patch $i$ . In PatchConv, we adopt the form $(p_{i},p_{j}-p_{i})$ to represent the features of edges connecting patch $p_{i}$ and its neighboring patches, which combine both the patch feature (captured by $p_{i}$ ) and local information of neighboring patches (captured by $p_{j}-p_{i}$ ). Due to viewpoint variations, the position of the model in different views will change, which leads to the differences between patch features in the same position of views. Considering that each patch has its particular position in the views, more information about spatial transformation, which is caused by viewpoint variation, can be obtained by utilizing patch coordinates. Thus, we further introduce $C=\{c_{1},c_{2},...,c_{M}\}\in\mathbb{R}^{M\times 3}$ to represent patch coordinates, which contain spatial information of patches in the original space. Concretely, $c_{i}=(x,y,z)$ denotes the coordinate of the $i^{th}$ patch, and $(x,y,z)$ represents the $x^{th}$ row and the $y^{th}$ column of the $z^{th}$ view image. Thus, the new edge feature can be represented by $(p_{i},p_{j}-p_{i},c_{i},c_{j}-c_{i})$ , which contains not only the information of patches in the feature space but also the coordinates of patches in the original space. The PatchConv operation can be described as

(1)

\displaystyle p^{\prime}_{i}={\max\limits_{j\in\mathcal{N}(i)}}{h(p_{i},p_{j}-p_{i},c_{i},c_{j}-c_{i})},

where $h$ : $\mathbb{R}^{2\times(D+3)}\to\mathbb{R}^{D+3}$ is a parameterized nonlinear function. Here we implement function $h$ with 1 $\times$ 1 convolution, which extracts the relationships between the central patch and neighboring patches. Finally, we adopt max-pooling as the aggregation operation to generate new patch features, which can capture long-range associations among multi-view images according to the neighborhood information of patches.

3.3. Adaptive Weighted View Layer

Now, each view image is represented by $7\times 7$ patch features, which contain information of neighboring patches. The average pooling layer with kernel size $7\times 7$ is employed to generate view-level features $F_{S}=\{f_{1},f_{2},...,f_{N}\}$ , where $f_{i}$ represents the $i^{th}$ view feature and $N$ denotes the number of view features. Since multi-view images are sequential, features of adjacent views are relatively similar. To further mine the relationships between adjacent view images, a 1-dimensional convolution with kernel size $3$ is utilized. In the aggregation stage of view features, max-pooling is a simple strategy to pool the view-wise features into a global model feature. The max-pooling operation only keeps the maximum activation, while the non-maximum values are ignored, which leads to sub-optimal performance. To better utilize discriminative information of different views, we design an attention mechanism to automatically assign different weights to all view images, and its pipeline is shown in Fig. 2. Concretely, the view-pooling feature $g$ is first obtained through max-pooling the view-wise features, which keeps the most prominent feature within all multi-view images. Then, we estimate the similarity score between each view feature and the view-pooling feature with cosine similarity, which is also called self-attention. The cosine similarity can be formulated as

(2)

\displaystyle s_{j}(f_{j},g)=\frac{\sum\nolimits_{k=1}^{D}f_{j}(k)\times g(k)}{\sqrt{\sum\nolimits_{k=1}^{D}f_{j}(k)^{2}}\times\sqrt{\sum\nolimits_{k=1}^{D}g(k)^{2}}},

where $f_{j}$ represents the feature of the $j^{th}$ view image. $D$ is the feature dimension of $f_{j}$ and $g$ , and $s_{j}(f_{j},g)$ denotes the cosine similarity between $g$ and $f_{j}$ . According to the similarity $s_{j}$ , the weight $\alpha_{j}$ of the $j^{th}$ view is calculated as follows:

(3)

\displaystyle\alpha_{j}=\frac{\exp\{s_{j}(f_{j},g)\}}{\sum\nolimits_{i=1}^{N}\exp\{s_{i}(f_{i},g)\}},

where $N$ is the number of all multi-view 2D images. Moreover, the weights of all multi-view 2D images are further indicated by $\alpha=\{\alpha_{1},...,\alpha_{N}\}$ . Finally, based on the self-attention weight of each view, we can obtain the $j^{th}$ weighted view feature $f^{\prime}_{j}$ by

(4)

\displaystyle f^{\prime}_{j}=\alpha_{j}\times f_{j},

where $f^{\prime}_{j}$ is the specific view feature of the self-attention weight embedded into the view feature. The fusion feature $g^{\prime}$ can be obtained by

(5)

\displaystyle g^{\prime}=\sum\limits_{j=1}^{N}\alpha_{j}f_{j},

where $g^{\prime}$ is the fusion feature, which is the weighted sum of the view features.

3.4. Discrimination Loss Function

In the 3D model retrieval task, the softmax loss (krizhevsky2012imagenet, ) is usually utilized as the basic loss to increase the feature discriminability. Recently, some loss functions, such as center loss (wen2016discriminative, ) and triplet-center loss (he2018triplet, ), have been employed in view-based 3D model retrieval methods to ensure that 3D model features from the same classes are pulled closer and that 3D model features from different classes are pushed away.

We note that only the 3D model feature is utilized when computing the loss values in these works, while the view features are ignored, which leads to the loss of view information. Although a single view feature cannot effectively represent the whole 3D model, it contains abundant information of the 3D model in a particular view, which can make the 3D model features easier to distinguish from other 3D models. Since each view feature can also describe the 3D model from other perspectives, we hope that these multi-view images can also be accurately recognized. To tackle this problem, a novel discrimination loss function is designed, which consists of softmax loss values generated by the fusion classifier and the specific classifier, to simultaneously classify the 3D model and each view image. To prove its effectiveness, only the simplest softmax loss is employed in the fusion classifier and specific classifier. The discrimination loss function can be defined as follows:

(6)

\displaystyle L_{Dis}=\beta\times L_{model}+\gamma\times L_{views}

where $L_{Dis}$ is the discrimination loss function. $L_{model}$ indicates the loss value generated by the fusion classifier when the model feature is utilized, and $L_{views}$ denotes the loss value generated by the specific classifier when all multi-view features are used. $\beta$ and $\gamma$ are the hyperparameters used to control the contributions of $L_{model}$ and $L_{views}$ .

In the calculation of $L_{views}$ , the mean of losses of all multi-view features is first used, and we name it as the average view loss (AVL), which is shown as follows:

(7)

\displaystyle L_{views}=\frac{1}{N}{\sum\nolimits_{j=1}^{N}}L_{views}^{j},

where $L_{views}^{j}$ represents the loss value of the $j^{th}$ view image. In the experiments, we observe that some images are difficult to classify, while some images are easy to recognize. If the network can correctly recognize these difficult images, it is easier for it to classify the 3D model. Thus, we let the network focus on recognizing these difficult images. In detail, the view weight $\alpha$ in Eq.(3) is first taken as the loss weight $\alpha^{\prime}$ , and then, the loss weight $\alpha^{\prime}_{j}$ of the $j^{th}$ view image is set to $(\frac{2}{N}-\alpha^{\prime}_{j})$ , where N denotes the number of views. In other words, when the view weight of the $j^{th}$ image is small, the image feature is very different from the view-pooling feature, and it is difficult for the specific classifier to recognize it. Therefore, we hope that the specific classifier can pay more attention to difficult views by giving their loss values larger weights. In this way, if the difficult images can be classified, the model can be easily recognized. Thus, the weighted value of $L_{views}$ can be calculated by

(8)

\displaystyle L_{views}={\sum\nolimits_{j=1}^{N}}\alpha^{\prime}_{j}L_{views}^{j}.

We name the weighted calculation of $L_{views}$ as the weighted view loss (WVL)

4. Experiments and Discussion

In this section, the datasets and implementation details are first respectively introduced. The experimental results for two 3D model retrieval datasets and comparison with state-of-the-art methods are then described. Moreover, ablation studies are performed.

4.1. Implementation Details

To evaluate the performance of the proposed PCNN, we have conducted 3D model retrieval experiments on two public 3D model datasets including ModelNet40 (wu20153d, ), and ModelNet10 (wu20153d, ). ModelNet40 contains a total of 12,311 models from 40 categories, and ModelNet10 contains a total of 4,899 models from 10 categories. In our experiments, we strictly follow the training/testing splits in MVCNN (su2015multi, ). In PCNN, ResNet34 is utilized as the backbone. When converting the 3D model to 2D images, we strictly follow the experimental setup of MVCNN (su2015multi, ), and 12 virtual cameras around the model are placed every 30 degrees to generate 12 2D images. To make the network learn spatial relationships among views, the generated views are sequential. To match the input of ResNet34, each 2D image is scaled to 224 $\times$ 224. In PatchConv, $k$ of the $k$ -nearest neighbors algorithm is set to 12, which means that each new patch feature aggregates the information of the 12 nearest neighbor patches. In the implementation of PatchConv, the dimension of patch features is 515 after introducing patch coordinates, and the dimension of edge features is 1,030. $1\times 1$ convolution is employed to extract relational information from edge features, whose input dimension is 1,030, and the output dimension is 515, followed by batch normalization and LeakyReLU activation function. The parameter of LeakyReLU is set to 0.2. In the optimization of PCNN, Adam (Kingma2014Adam, ) is employed as the network optimizer, where the momentum is set to 0.9. To maintain the training stability, we clip the gradients into the range [-0.01, 0.01]. The learning rate is set to $4\times 10^{-5}$ . We train the network for 30 epochs with a mini-batch size of 16. In the calculation of discrimination loss, since both the fusion classifier and specific classifier are very important, hyperparameters $\beta$ and $\gamma$ are empirically set to 0.5 and 0.5, respectively. Note that $\alpha$ doesn’t need to be optimized, and it can be automatically calculated by the cosine similarity between the view-pooling feature $g$ and the feature of $j^{th}$ view image, and we also rerank the retrieval list according to classification results. Besides, in our experiments, two public evaluation metrics including Mean average precision (mAP) and precision-recall (PR) curve (feng2018gvcnn, ; gao20113d, ) is utilized.

Table 1. Comparison with the SOTA methods on the ModelNet40 dataset

Methods	Training Config		Modality	Retrieval
Methods	Pre-train	Fine-tune		mAP
SPH(kazhdan2003rotation, )	$-$	$-$	$-$	$33.3$ %
LFD(Chen2003On, )	$-$	$-$	$-$	$40.9\%$
3D ShapeNets(wu20153d, )	ModelNet40	ModelNet40	Voxel	$49.2\%$
MVCNN(su2015multi, )	ImageNet1K	ModelNet40	View	$80.2\%$
GIFT(bai2016gift, )	$-$	ModelNet40	View	$81.94\%$
GVCNN(feng2018gvcnn, )	ImageNet1K	ModelNet40	View	$85.7\%$
SeqViews2SeqLabels(han2018seqviews2seqlabels, )	ImageNet1K	ModelNet40	View	$89.09\%$
View N-gram(he2019view, )	ImageNet1K	ModelNet40	View	$89.3\%$
3D2SeqViews(han20193d2seqviews, )	ImageNet1K	ModelNet40	View	$90.76\%$
MDPCNN(Gao2020Multiple, )	ImageNet1K	ModelNet40	View	$87.6\%$
IMVCNN(He2020Improved, )	ImageNet1K	ModelNet40	View	$90.1\%$
MLVCNN(jiang2019mlvcnn, )	ImageNet1K	ModelNet40	View	$92.84\%$
PCNN (ours)	ImageNet1K	ModelNet40	View	$\textbf{93.67}\%$

Table 2. Comparison with the SOTA methods on the ModelNet10 dataset

Methods	Training Config		Modality	Retrieval
Methods	Pre-train	Fine-tune		mAP
SPH(kazhdan2003rotation, )	$-$	$-$	$-$	$44.05$ %
LFD(Chen2003On, )	$-$	$-$	$-$	$49.82\%$
3D ShapeNets(wu20153d, )	ModelNet10	ModelNet10	Voxel	$68.26\%$
GIFT(bai2016gift, )	$-$	ModelNet10	View	$91.12\%$
SeqViews2SeqLabels(han2018seqviews2seqlabels, )	ImageNet1K	ModelNet10	View	$91.43\%$
3D2SeqViews(han20193d2seqviews, )	ImageNet1K	ModelNet10	View	$92.12\%$
PANORAMA-ENN(sfikas2018ensemble, )	-	ModelNet10	View	$93.28\%$
View N-gram(he2019view, )	ImageNet1K	ModelNet10	View	$92.8\%$
IMVCNN(He2020Improved, )	ImageNet1K	ModelNet10	View	$93.0\%$
PCNN (ours)	ImageNet1K	ModelNet10	View	$\textbf{96.23}\%$

4.2. Comparison with SOTA Methods

In this section, we conduct a comprehensive comparison and evaluation of PCNN. Specifically, we first compare PCNN with state-of-the-art methods on two widely used datasets, the ModelNet40 and ModelNet10 datasets. Then, we present some retrieval examples for the ModelNet40 dataset. Finally, the PR curves are also given. Their results are shown in Table 1, Table 2, Fig.3, and Fig. 4, from which we make four key observations.

I) As shown in Table 1, we compare our method with competing view-based 3D retrieval methods on the ModelNet40 dataset, including SPH (kazhdan2003rotation, ), LFD (Chen2003On, ), 3D ShapeNets (wu20153d, ), MVCNN (su2015multi, ), GIFT (bai2016gift, ), GVCNN (feng2018gvcnn, ), SeqViews2SeqLabels (han2018seqviews2seqlabels, ), View N-gram (he2019view, ), 3D2SeqViews (han20193d2seqviews, ), MDPCNN (Gao2020Multiple, ), IMVCNN (He2020Improved, ) and MLVCNN(jiang2019mlvcnn, ). When compared with these state-of-the-art methods, PCNN can obtain the best performance on both datasets. For example, in Table 1, when comparing PCNN with traditional methods, its performance is 60.37%, 52.77%, and 44.47% higher than that of SPH, LFD, and 3D ShapeNets, respectively. Similarly, when comparing PCNN with deep learning methods, its performance is 13.47%, 11.73%, 7.97%, 4.58%, 2.91%, 6.01%, 3.57% and 0.83% higher than that of MVCNN, GIFT, GVCNN, SeqViews2SeqLabels, 3D2SeqViews, MDPCNN, IMVCNN and MLVCNN, respectively.

II) As shown in Table 2, we compare our method PCNN with competing view-based 3D retrieval methods on the ModelNet10 dataset, including SPH (kazhdan2003rotation, ), LFD (Chen2003On, ), 3D ShapeNets (wu20153d, ), GIFT (bai2016gift, ), SeqViews2SeqLabels (han2018seqviews2seqlabels, ), 3D2SeqViews (han20193d2seqviews, ), PANORAMA-ENN (sfikas2018ensemble, ), View N-gram (he2019view, ) and IMVCNN (He2020Improved, ). When compared with state-of-the-art methods, PCNN also can obtain the best performance on both datasets. For example, in Table 2, when comparing PCNN with traditional methods, its performance is 52.18%, 46.41%, and 27.97% higher than that of SPH, LFD, and 3D ShapeNets, respectively. Similarly, when comparing PCNN with deep learning methods, its performance is 5.11%, 4.80%, 4.11%, 2.95%, 3.43%, and 3.23% higher than that of GIFT, SeqViews2SeqLabels, 3D2SeqViews, PANORAMA-ENN, View N-gram and IMVCNN, respectively.

III) In these competitors, the latent relationships among different views are explored in different ways, for example, the view-pooling operation is proposed in the MVCNN to generate the model feature, and the GVCNN distinguishes different views to mine relationships among view images. However, these issues are separately discussed, and the patch information of each view image is ignored. In our proposed PCNN, the patch convolution layer is designed to capture long-range associations among all multi-view images, and an adaptive weighted view layer is further embedded into the PCNN to automatically assign a weight to each view. Finally, a discrimination loss function is employed to extract the discriminative model feature. Most importantly, these three modules are integrated into a unified framework. Thus, the performance of PCNN is much better than those of MVCNN and GVCNN, and the improvements are 13.47% and 7.97%, respectively. Moreover, we can observe the same conclusions in Table 2 when the ModelNet10 dataset is utilized.

IV) In Fig. 3, the retrieval examples of PCNN and MVCNN are given, where the results of the search error are indicated with red boxes. We observe that these 3D models are very similar and that it is even difficult for people to identify them. For example, when the desk is the query (the first row in Fig.3), it is often confused with the table. Similarly, when the sofa is used as the query (the second row in Fig.3), it is often confused with the bed. Since PatchConv in PCNN is used to capture long-range associations among all multi-view images and a discrimination loss function is designed, the feature discriminability is improved. Thus, when PCNN is employed, the extracted features have very good discrimination. Although there are few retrieval errors in Top10, the number of red boxes is very small. In contrast, when MVCNN is utilized, the feature discriminability needs to be improved, and there are many retrieval errors in Top10. Thus, the proposed PCNN is very effective and efficient. Besides, the precision-recall (PR) curves of different modules on the MonelNet40 dataset are also given in Fig.4.

4.3. Benefits of PatchConv and AWV

In this section, we will assess the benifits of PatchConv and AWV. The network architecture of MVCNN is first employed, and then different backbones including VGG-m and ResNet34 are utilized in the MVCNN, and they are named MVCNN-VGG-m and MVCNN-ResNet34, respectively. To verify the validity of PatchConv and AWV, PatchConv, AWV, and the combination of both are separately embedded into the MVCNN. Finally, the performance is evaluated on the ModelNet40 dataset. The results are shown in Table 3. Note that when PatchConv, AWV, and the combination of both are embedded into MVCNN-ResNet34, they are named MVCNN-ResNet34 (PatchConv), MVCNN-ResNet34 (AWV), and PCNN (PatchConv+AWV), respectively. Note that only $L_{model}$ , which is generated by the fusion classifier, is used.

Table 3. Benefits of PatchConv and AWV

Retrieval Methods		mAP
MVCNN-VGG-m		80.20%
MVCNN-ResNet34		81.70%
MVCNN-ResNet34 (PatchConv)		86.28%
MVCNN-ResNet34 (AWV)		89.16%
PCNN (EdgeConv+AWV)		92.14%
PCNN (PatchConv+AWV)		92.65%

In Table 3, we observe that MVCNN-ResNet34 can outperform MVCNN-VGG-m, whose improvement reaches 1.5%. In other words, when ResNet34 is utilized as the backbone of MVCNN, its retrieval accuracy is slightly better than that of VGG-m. Based on MVCNN-ResNet34, PatchConv is utilized to mine the intrinsic relationships among all different views. From Table 3, we can see that the performance of MVCNN-ResNet34 (PatchConv) is much higher than that of MVCNN-ResNet34, and the improvement achieves 4.58%. Thus, PatchConv is very effective in mining the intrinsic relationships among different views according to patch features and patch coordinates. Also, an adaptive weighted view layer is embedded within MVCNN-ResNet34. From Table 3, we can also observe that the performance of the MVCNN-ResNet34 (AWV) is much higher than that of MVCNN-ResNet34, and the improvement reaches 7.46%. Thus, the AWV is very efficient to automatically assign a weight to each view by the similarity between the view feature and the view-pooling feature. Besides, when both PatchConv and AWV are embedded into the MVCNN-ResNet34, the performance of PCNN (PatchConv+AWV) can be further improved.

In PatchConv, the patch coordinates are also used. In fact, this is inspired by methods based on point clouds. The point cloud is another representation of 3D models. Related methods take the coordinates of points as input and project coordinates into high-dimensional space. In the process of multi-level feature extraction, these methods combine coordinates and low-level features to generate high-level features with a multilayer perceptron. In this work, the position of the object in different images will change due to view variations, which leads to the differences between patch features in the same position. With patch coordinates, we can obtain more information about spatial transformation caused by view variation. As shown in Table 3, the performance of PCNN (EdgeConv+AWV) is 92.14%, in which patch coordinates are not utilized. The performance of PCNN (PatchConv+AWV) is 92.65%, which is 0.51% higher than that of PCNN (EdgeConv+AWV).

4.4. Effectiveness of Discrimination Loss

To verify the validity of the discrimination loss, the model loss (ML for short) is first used in MVCNN-ResNet34 and PCNN, and they are marked as MVCNN-ResNet34 (ML) and PCNN (ML). The specific view loss is then added into MVCNN-ResNet34 (ML) and PCNN (ML), and their corresponding names are MVCNN-ResNet34 (ML+AVL) and PCNN (ML+AVL), where AVL indicates the average view loss in Eq.(7). Besides, when the weighted view loss (WVL) shown in Eq.(8) is utilized, we refer to it as PCNN (discrimination loss). The results are shown in Table 4. From the findings, we can observe that when AVL is used in MVCNN-ResNet34, the mAPs of MVCNN-ResNet34 (ML) and MVCNN-ResNet34 (ML+AVL) are 81.7% and 88.39%, respectively, and the improvement achieved is 6.69%. Similarly, when employed in PCNN, the mAPs of PCNN (ML) and PCNN (ML+AVL) are 92.65% and 93.24%, respectively. When the discrimination loss is further used, its performance can be further improved. This result, therefore, proves the effectiveness of discrimination loss. Also, when comparing PCNN (discrimination loss) with MVCNN-ResNet34, the improvement can reach 11.97%. This further proves the effectiveness of PatchConv, AWV and discriminative loss. These modules make the network able to capture the intrinsic relationships among different views, automatically fuse all multi-view images and extract more discriminative model features.

To further prove the effectiveness of discrimination loss, we also visualize the results of MVCNN-ResNet34, MVCNN-ResNet34 (ML+AVL), PCNN (ML), and PCNN (discrimination loss) on the ModelNet10 dataset, and their visualization results are shown in Fig. 5. The reason why we choose the ModelNet10 dataset is that the number of categories in ModelNet40 is 40, thus, it is very difficult to clearly show the results. From Fig. 5(b) and Fig. 5(d), we can observe that points from the same category are aggregative, but points from different categories are dispersed. Thus, these model features can be distinguished easily.

Table 4. Effectiveness of Discrimination Loss

Loss Function		mAP
MVCNN-ResNet34 (ML)		81.7%
MVCNN-ResNet34 (ML+AVL)		88.39%
PCNN (ML)		92.65%
PCNN (ML+AVL)		93.24%
PCNN (discrimination loss)		93.67%

4.5. Convergence Analysis

In this section, we assess the convergence of the proposed PCNN on the ModelNet40 dataset. Moreover, we compare it with state-of-the-art approaches, and their convergence curves are shown in Fig. 6. From the results, we can see that although PatchConv and AWV are embedded into MVCNN-ResNet34, the convergence speed is also slightly quicker than those of MVCNN-ResNet34 and MVCNN. Thus, the modules we proposed are very efficient. Further, when the view loss is added into PCNN (ML), its convergence speed is slightly slower than those of PCNN (ML), MVCNN-ResNet34, and MVCNN. The reason for this result is that we classify the 3D model correctly and that all multi-view images are required to be accurately recognized in PCNN (ML+AVL) and PCNN (discrimination loss). However, PCNN (ML+AVL) and PCNN (discrimination loss) can quickly converge after only 20 epochs, which further proves the effectiveness of PCNN.

5. Conclusion

In this work, a novel PCNN is proposed for view-based 3D model retrieval. In PCNN, PatchConv and AWV are designed to exploit intrinsic relationships among all multi-view images and automatically fuse different views. Moreover, a discrimination loss function is employed to improve the feature discriminability. Extensive experiments on two public 3D model retrieval benchmarks demonstrate that PCNN can outperform state-of-the-art approaches. PatchConv is very effective in capturing long-range associations of all multi-view images, and AWV can automatically assign a weight to each view according to the similarity between the view feature and the view-pooling feature. Moreover, accurately classifying view features is very helpful for improving the discriminability of model features. In future work, we will focus on how to capture the relationships among all views and how to effectively fuse different views.

6. Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (No.61872270, No.62020106004, No.92048301, No.62006142, No.61872267, No.61572357). Young creative team in universities of Shandong Province (No.2020KJN012), Jinan 20 projects in universities (No. 2020GXRC040, No.2018GXRC014). New Artificial Intelligence project towards the integration of education and industry in Qilu University of Technology (No.2020KJC-JC01). Shandong Provincial Key Research and Development Program (No.2019TSLH0202).

References

[1] Tarik Filali Ansary, Mohamed Daoudi, and J. P. Vandeborre. A bayesian 3-d search engine using adaptive views clustering. IEEE Transactions on Multimedia, 9:78–88, 2007.
[2] Song Bai, Xiang Bai, and Zhichao et al. Zhou. Gift: A real-time and scalable 3d shape search engine. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5023–5032, 2016.
[3] Ding Yun Chen, Xiao Pei Tian, Yu Te Shen, and Ouhyoung Ming. On visual similarity based 3d model retrieval. Computer Graphics Forum, 22(3):223–232, 2003.
[4] Guoxian Dai, Jin Xie, and Yi Fang. Siamese cnn-bilstm architecture for 3d shape representation learning. In international joint conference on artificial intelligence, pages 670–676, 2018.
[5] Yifan Feng, Zizhao Zhang, Xibin Zhao, Rongrong Ji, and Yue Gao. Gvcnn: Group-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 264–272, 2018.
[6] Yue Gao, Qionghai Dai, Meng Wang, and Naiyao Zhang. 3d model retrieval using weighted bipartite graph matching. Signal Processing-image Communication, 26:39–47, 2011.
[7] Yue Gao, Jinhui Tang, Richang Hong, Shuicheng Yan, and Tat Seng Chua. Camera constraint-free view-based 3-d object retrieval. IEEE Transactions on Image Processing, 21(4):2269–2281, 2012.
[8] Zan Gao, Leming Guo, Weili Guan, An-An Liu, Tongwei Ren, and Shengyong Chen. A pairwise attentive adversarial spatiotemporal network for cross-domain few-shot action recognition. IEEE Trans. Image Process., 30:767–782, 2021.
[9] Zan Gao, Yinming Li, and Shaohua Wan. Exploring deep learning for view-based 3d model retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16:1–21, 2020.
[10] Zan Gao, Haixin Xue, and Shaohua Wan. Multiple discrimination and pairwise CNN for view-based 3d object retrieval. Neural Networks, 125:290–302, 2020.
[11] Zhizhong Han, Honglei Lu, and Zhenbao et al. Liu. 3d2seqviews: Aggregating sequential views for 3d global feature learning by cnn with hierarchical attention aggregation. IEEE Transactions on Image Processing, 28(8):3986–3999, 2019.
[12] Zhizhong Han, Mingyang Shang, and Zhenbao et al. Liu. Seqviews2seqlabels: Learning 3d global features via aggregating sequential views by rnn with attention. IEEE Transactions on Image Processing, 28(2):658–672, 2018.
[13] Xinwei He, Song Bai, Jiajia Chu, and Xiang Bai. An improved multi-view convolutional neural network for 3d object retrieval. IEEE Trans. Image Process., 29:7917–7930, 2020.
[14] Xinwei He, Tengteng Huang, Song Bai, and Xiang Bai. View n-gram network for 3d object retrieval. In Proceedings of the IEEE International Conference on Computer Vision, pages 7515–7524, 2019.
[15] Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. Triplet-center loss for multi-view 3d object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1945–1954, 2018.
[16] Jianwen Jiang, Di Bao, Ziqiang Chen, Xibin Zhao, and Yue Gao. Mlvcnn: Multi-loop-view convolutional neural network for 3d shape retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8513–8520, 2019.
[17] Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In Symposium on geometry processing, volume 6, pages 156–164, 2003.
[18] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Computer Science, 2014.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[20] An-An Liu, Heyu Zhou, and Weizhi Nie et al. Hierarchical multi-view context modelling for 3d object classification and retrieval. Inf. Sci., 547:984–995, 2021.
[21] Meng Liu, Liqiang Nie, Xiang Wang, Qi Tian, and Baoquan Chen. Online data organizer: Micro-video categorization by structure-guided multimodal dictionary learning. IEEE Transactions on Image Processing, 28(3):1235–1247, 2019.
[22] Meng Liu, Xiang Wang, Liqiang Nie, Qi Tian, Baoquan Chen, and Tat-Seng Chua. Cross-modal moment localization in videos. In Proceedings of the 26th ACM International Conference on Multimedia, page 843–851, 2018.
[23] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928, 2015.
[24] Weizhi Nie, Yue Zhao, An-An Liu, Zan Gao, and Yuting Su. Multi-graph convolutional network for unsupervised 3d shape retrieval. In MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, pages 3395–3403, 2020.
[25] Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2016.
[26] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
[27] Konstantinos Sfikas, Ioannis Pratikakis, and Theoharis Theoharis. Ensemble of panorama-based convolutional neural networks for 3d model classification and retrieval. Computers & Graphics, 71:208–218, 2018.
[28] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
[29] Yue Wang, Yongbin Sun, and Ziwei et al. Liu. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
[30] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pages 499–515, 2016.
[31] Zhirong Wu, Shuran Song, and Aditya et al. Khosla. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
[32] Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, and Tat-Seng Chua. Tree-augmented cross-modal encoding for complex-query video retrieval. In SIGIR, pages 1339–1348, 2020.
[33] Xun Yang, Fuli Feng, Wei Ji, Meng Wang, and Tat-Seng Chua. Deconfounded video moment retrieval with causal intervention. In SIGIR, 2021.
[34] Tan Yu, Jingjing Meng, and Junsong Yuan. Multi-view harmonized bilinear network for 3d object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 186–194, 2018.
[35] Heyu Zhou, Weizhi Nie, Dan Song, Nian Hu, Xuanya Li, and An-An Liu. Semantic consistency guided instance feature alignment for 2d image-based 3d shape retrieval. In MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, pages 925–933, 2020.
[36] Jing Zhu, Fan Zhu, Edward K. Wong, and Yi Fang. Learning pairwise neural network encoder for depth image-based 3d model retrieval. In Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM ’15, Brisbane, Australia, October 26 - 30, pages 1227–1230, 2015.