U-Net with a ResNet Backbone for Garment Landmarking
Abstract
We build a heatmap-based landmark detection model to locate important landmarks on 2D RGB garment images. The main goal is to detect edges, corners, and suitable interior regions of the garments. This lets us re-create 3D garments in modern 3D editing software by combining the landmark detection model with texture unwrapping. We use a U-Net architecture with a ResNet backbone, and with an appropriate loss function we are able to train a moderately robust model.
1 Introduction
In SaratiX, one of our projects is to provide 3D-garment reconstruction from 2D images. Specifically, we take a garment's front and back images and reconstruct the garment in a 3D virtual environment. Currently, we must know the garment type beforehand and construct a template mesh for each garment type in preparation for texture unwrapping. With this in mind, we develop a landmark detection model to find the contour of the garment and its important parts, such as the collar and armpits. We then use this information to perform geometric image transformations on the texture images so that they follow the UV maps of our template 3D garments (Figure 1).
We use a deep learning approach to detect important landmarks on various types of garments. In this article, we choose the T-shirt for illustration purposes. We adopt a U-Net architecture with a residual neural network backbone (the pretrained ResNet34 model from PyTorch) to accelerate training.

2 Setup
We take the T-shirt as an example and label 52 points on it. In Figure 2, the line segments are only cosmetic, illustrating the contours of the T-shirt.

We prepare 250 images to train the T-shirt landmarks, and the results look good after training for 400 epochs (we discuss the loss function in Section 4).
3 Model Architecture
We use a U-Net architecture with pretrained ResNet34 weights (from PyTorch) as the backbone of our heatmap-based landmarking model (see Figure 3). Given a single RGB image as input, we first resize the image to a fixed square resolution $d \times d$. The model then outputs a tensor of shape $N \times d \times d$ storing the heatmaps, where $N$ stands for the number of landmarks we need.
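As a minimal sketch, such a model can be instantiated with the third-party segmentation_models_pytorch package; our actual implementation may differ in its details, and the variable names here are illustrative.

```python
import segmentation_models_pytorch as smp

N_LANDMARKS = 52  # number of landmarks in the T-shirt model

# U-Net decoder on top of an ImageNet-pretrained ResNet34 encoder.
# The sigmoid activation keeps every output pixel in [0, 1],
# matching the ground truth heatmaps described in Section 4.
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,         # RGB input
    classes=N_LANDMARKS,   # one output channel per landmark heatmap
    activation="sigmoid",
)
```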

Figure 4 shows a visualization of the output for the 52 T-shirt landmarks. We use a grid to show the heatmaps stacked on top of the original T-shirt image. Our goal is to train the model for heatmap detection. We choose heatmaps over direct coordinate regression because they simply work better in our case. On top of that, compared to the regression method, the heatmap outputs give us plenty of insight into the performance of our model, for example by exposing double attentions (Figure 5). Each heatmap is a matrix with real entries in $[0, 1]$. Pixels with values near 0 are rendered black, intermediate values red, and values near 1 yellow.
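Such an overlay can be reproduced with a few lines of matplotlib; this is a sketch assuming `image` and `heatmap` are arrays prepared elsewhere, and the "hot" colormap only approximates the black-red-yellow scheme described above.

```python
import matplotlib.pyplot as plt

# Overlay a single (d, d) heatmap with values in [0, 1] onto the image.
# The "hot" colormap maps 0 -> black, mid values -> red, high values -> yellow.
plt.imshow(image)
plt.imshow(heatmap, cmap="hot", alpha=0.5)
plt.axis("off")
plt.show()
```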


After getting the heatmaps (Figure 6), we take the index of the maximum pixel in each heatmap as the corresponding landmark coordinate.
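A minimal sketch of this decoding step in PyTorch (the function and variable names are ours, not from the original implementation):

```python
import torch

def heatmaps_to_landmarks(heatmaps: torch.Tensor) -> torch.Tensor:
    """Convert a stack of heatmaps of shape (N, d, d) into integer
    landmark coordinates of shape (N, 2) via a per-heatmap argmax."""
    n, _, w = heatmaps.shape
    flat_idx = heatmaps.view(n, -1).argmax(dim=1)       # index of max pixel
    ys = torch.div(flat_idx, w, rounding_mode="floor")  # row
    xs = flat_idx % w                                   # column
    return torch.stack([xs, ys], dim=1)                 # (x, y) per landmark
```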

4 Dataset and Loss function
Before training, we augment the dataset, and this augmentation is done on the fly. To train a robust model, we apply random shear and random rotation to the data.
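On-the-fly shear and rotation that also remap the landmark coordinates could be set up as follows; this is a sketch assuming the albumentations library, and the angle ranges are illustrative rather than the ones we used.

```python
import albumentations as A

# Random rotation and shear, re-sampled on every call (i.e., on the fly).
# KeypointParams makes albumentations transform the 52 landmarks together
# with the image, so the ground truth stays consistent.
augment = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),                # rotate by up to +/-15 degrees
        A.Affine(shear={"x": (-10, 10)}, p=0.5),  # shear by up to +/-10 degrees
    ],
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)

out = augment(image=image, keypoints=landmarks)   # landmarks: list of (x, y)
image_aug, landmarks_aug = out["image"], out["keypoints"]
```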
Since we are using a heatmap approach for landmark detection, we use a weighted loss function so that the model places equal emphasis on the narrow landmark locations and the comparably vast background region. Let $N$ be the total number of landmarks; the model then outputs $N$ heatmaps $H_1, \dots, H_N$, where each $H_i$ is a matrix of size $d \times d$. Let the ground truth coordinates be $p_1, \dots, p_N$, where each $p_i$ is the location of the $i$-th landmark on the image. We assume the landmark coordinates are integers and have been rescaled to the image region, which means $p_i \in \{0, 1, \dots, d-1\}^2$.
Let $r$ be the radius of our landmarks; we use $r = 10$ pixels. We transform the set of landmarks $\{p_i\}_{i=1}^N$ into a set of ground truth heatmaps $\{G_i\}_{i=1}^N$, where each $G_i$ satisfies the following: the $p_i$-th entry of $G_i$ is 1, and every entry at distance at least $r$ pixels from $p_i$ equals 0. We then set $G_i$ to be interpolated linearly for those pixels at distance less than $r$ from $p_i$. This mimics a heatmap representation of each landmark.
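A small NumPy sketch of this construction (the default resolution below is illustrative):

```python
import numpy as np

def make_gt_heatmap(p, d=256, r=10):
    """Ground truth heatmap for a landmark p = (x, y): 1 at p,
    decaying linearly to 0 at distance r, and 0 beyond.
    d is the heatmap resolution (illustrative default)."""
    xs, ys = np.meshgrid(np.arange(d), np.arange(d))
    dist = np.sqrt((xs - p[0]) ** 2 + (ys - p[1]) ** 2)
    return np.clip(1.0 - dist / r, 0.0, 1.0)
```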
Since we train the model towards these targets in $[0, 1]$, we purposely add a Sigmoid function as the last layer of our model.
We could train our model with a standard regression loss function
$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{N d^2} \sum_{i=1}^{N} \lVert H_i - G_i \rVert_1,$$
where $\lVert A \rVert_1 = \sum_{j,k} \lvert A_{jk} \rvert$ is the L1 norm of a matrix $A$. However, as the background makes up the majority of each heatmap, the model would then tend to predict heatmaps that are entirely black. Therefore, we need to weight the components of the loss function. We define the indicator matrix $M_i$ of $p_i$ such that
$$(M_i)_{jk} = \begin{cases} 1 & \text{if } \operatorname{dist}\bigl((j, k), p_i\bigr) < r, \\ 0 & \text{otherwise.} \end{cases}$$
In other words, $M_i$ is a binary mask that is white on the disk of radius $r = 10$ around $p_i$ and black on the remaining pixels. Moreover, we also define the element-wise product $A \odot B$ of two matrices of the same shape as
$$(A \odot B)_{jk} = A_{jk} B_{jk},$$
where $1 \le j, k \le d$.
We can now formally define our loss function. Given ground truth landmarks $p_1, \dots, p_N$, we obtain the associated ground truth heatmaps $G_1, \dots, G_N$ and their indicators $M_1, \dots, M_N$. For the heatmap predictions $H_1, \dots, H_N$, the loss is defined by
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{\lVert M_i \odot (H_i - G_i) \rVert_1}{2 \lVert M_i \rVert_1} + \frac{\lVert (J - M_i) \odot (H_i - G_i) \rVert_1}{2 \lVert J - M_i \rVert_1} \right),$$
where $J$ is the matrix of ones with shape $d \times d$. In this function, different weights are assigned to the landmark disks and the background to resolve the imbalance caused by the background. Notice that the normalization factors make the value of the loss function fall between 0 and 1, enabling us to judge the performance of the model easily. For example, if the loss is $\varepsilon$, we can conclude that the pixel values of the predictions differ from the ground truths by $\varepsilon$ on average, in the weighted sense.
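Under the reconstruction above, the loss can be written in PyTorch as follows; the function name and the tensor layout (..., N, d, d) are our own conventions.

```python
import torch

def weighted_heatmap_loss(pred, gt, mask):
    """Weighted L1 loss over heatmaps of shape (..., N, d, d).
    `mask` is the binary disk indicator M_i of each landmark.
    Foreground and background terms are normalized by their own pixel
    counts so they contribute equally, keeping the loss in [0, 1]."""
    diff = (pred - gt).abs()
    fg = (mask * diff).sum(dim=(-2, -1)) / mask.sum(dim=(-2, -1))
    bg = ((1 - mask) * diff).sum(dim=(-2, -1)) / (1 - mask).sum(dim=(-2, -1))
    return (0.5 * (fg + bg)).mean()
```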
5 Training
With a batch size of 8, we use the Adam optimizer to train our model for 400 epochs. We use the majority of our dataset for training and hold out the remainder for validation.
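A minimal training loop under these settings, reusing the model and loss sketches above; `train_loader` and the learning rate are placeholders, since the exact learning rate is not reproduced here.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is a placeholder

for epoch in range(400):
    model.train()
    for images, gt_heatmaps, masks in train_loader:  # batches of 8
        optimizer.zero_grad()
        pred = model(images)                         # shape (8, 52, d, d)
        loss = weighted_heatmap_loss(pred, gt_heatmaps, masks)
        loss.backward()
        optimizer.step()
```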
Figure 7 shows a graph of training and validation loss.

6 Discussion
We use the model to perform inference on some unseen images (Figure 8). There are still flaws in our model. For example, the yellow shirt on the left has some unstable predictions near the bottom-left corner. In the other two images, the model could not predict the landmarks occluded behind the wearers' left hands. We conjecture this is because the last layer of our model is too simple and probably needs to be extended.

There are some methods that could possibly improve the model. First, we could use an Active Shape Model (similar to [1]) to better stabilize the predictions, so that failures like the yellow T-shirt above would not happen; it could also recover occluded landmarks, since the machine would know the overall contour of the landmarks. Furthermore, we could add some hidden layers before the last layer to exchange information among the heatmaps, so that nearby landmarks can relate to each other, possibly improving the predictions.
Finally, our dataset is still small (currently 200 images for each model). The model could benefit greatly from simply increasing the dataset to 500 images.
7 Acknowledgments
The outcome of this deep learning model, the datasets, and the results in this paper are owed to the support of the leaders and colleagues at SaratiX, Custlr Sdn. Bhd. Moreover, I gained excellent knowledge of building deep neural networks at Recogine Sdn. Bhd. during my previous internship. This work certainly could not have been completed with such integrity without the help of supportive and brilliant colleagues from these companies.
References
- [1] Fard, A. P., Abdollahi, H., Mahoor, M. (2021). ASMNet: a Lightweight Deep Neural Network for Face Alignment and Pose Estimation. CVPR 2021 Biometrics Workshop.