
Unsupervised Manga Character Re-identification via Face-body and Spatial-temporal Associated Clustering

Zhimin Zhang, Zheng Wang, and Wei Hu. Corresponding authors: Zheng Wang and Wei Hu. Zhimin Zhang and Zheng Wang are with the School of Computer Science, Wuhan University, No. 299, Bayi Road, Wuchang District, Wuhan City, Hubei Province, China (e-mails: {zhangzhimin,wangzwhu}@whu.edu.cn). Wei Hu is with Wangxuan Institute of Computer Technology, Peking University, No. 128, Zhongguancun North Street, Beijing, China (e-mail: [email protected]).
Abstract

In the past few years, there has been dramatic growth in e-manga (electronic Japanese-style comics). Faced with the booming demand for manga research and the large amount of unlabeled manga data, we raise a new task, called unsupervised manga character re-identification. However, the artistic expression and stylistic limitations of manga pose many challenges to the re-identification problem. Inspired by the idea that some content-related features may help clustering, we propose a Face-body and Spatial-temporal Associated Clustering method (FSAC). In the face-body combination module, a face-body graph is constructed to solve problems such as exaggeration and deformation in artistic creation by exploiting the integrity of the image. In the spatial-temporal relationship correction module, we analyze the appearance patterns of characters and design a temporal-spatial-related triplet loss to fine-tune the clustering. Extensive experiments on a manga book dataset with 109 volumes validate the superiority of our method in unsupervised manga character re-identification.

I Introduction

Recent years have witnessed increasing attention to cartoons and manga (Japanese-style comics) [1], driven by the strong demands of industrial applications. E-manga is also becoming more popular as people’s reading patterns have changed dramatically. For example, Amazon’s Kindle store has over 60,000 e-manga on sale (Amazon.com Kindle Comic, retrieved on August 12, 2021, from http://amzn.to/1KD5ZBK). Manga character recognition is a crucial task in manga-related research, which plays an important role in areas such as character retrieval and character clustering [18, 30].

Nevertheless, most existing character recognition methods require large amounts of labeled data [33], which limits the wide usability and scalability in real-world application scenarios, as it is both expensive and difficult to manually label large datasets. Unsupervised approaches have broader applicability in the face of a plethora of manga characters, but remain relatively under-explored.

Figure 1: (a) Taking advantage of the unique features of manga to make up for the challenges of the transfer task. (b) Why spatial-temporal relationship may help: exaggerated expressions such as B1 and A3 are difficult to recognize, but from the perspective of manga spatial-temporal relationship, A1-A3 and B1-B2 appear in sequences. (c) Why face-body combination may help: due to the limitation of painting style, different characters have similar faces, but different clothes constitute the signatures of the characters.

Faced with the booming demand for manga research and the large amount of unlabeled manga data, we raise a new task, called Unsupervised Manga character Re-identification (or UManga-ReID). A common strategy for UManga-ReID is the unsupervised domain adaptation approach, which transfers knowledge learned from the source domain to the target domain by optimizing with pseudo labels created by clustering algorithms [7]. In view of the intrinsic similarity between real-person identities and cartoon characters, most related studies [33] adopt real-person data as the source domain and cartoon data as the target domain. However, although this approach is effective to some extent, it is often plagued by the noise introduced by pseudo-labeling. How to effectively reduce this noise and obtain better clustering performance is the key to this problem.

Although manga and the real world share many similarities, manga has a more important task, which is narration. Some manga-related research has focused on the content of manga. [13] points out that there are certain patterns of transition between manga frames. [33] discusses the influence of context on the accuracy of manga face recognition. These works provide theoretical guidance for improving the clustering performance from the perspective of manga content.

Manga characters are often artistic creations of real people, so transferring models trained on real people data to manga characters is a feasible approach. However, as an art form, manga is expressed in a very different way from the real world, which poses many challenges to the transfer task (see Figure 1(a)):

  • Artistic expressions: for example, exaggerated viewpoints and exaggerated character expressions.

  • Drawing limitations: For example, the limitations of the frame cause some characters’ bodies to be left out, and the limitations of the drawing style make characters in the same manga book look similar.

On the other hand, a more important task of manga is narration, which brings many characteristics that real data do not have and may further help us in the UManga-ReID study:

  • Temporal-spatial relationship: For example, the same characters tend to appear on adjacent pages one after another, characters in the same frame tend to belong to different identities, some characters tend to appear in pairs, and so on. The frame order of the manga tells the contents of the book in chronological order, and we define the relationships on the frame order as temporal relationships. Moreover, each frame in the manga often represents a scene, and we define the relationships between characters in the same frame as spatial relationships.

  • Characters’ unique signatures: Many characters have unique signatures, such as special accessories, hairstyles, costumes, etc. These signatures usually do not change in the same manga book.

To verify the feasibility of our conjecture, we collected statistics on the spatial-temporal regularities of manga characters. Figure 2 shows the spatial relationship of manga, revealing that characters appearing in the same frame are more likely to belong to different identities. Table I shows the temporal relationship, and the statistics show that a character has a high probability of reappearing in neighboring frames.
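As an illustration of how the Table I statistic can be computed, here is a minimal Python sketch (not the script used for the paper), assuming each annotated character instance of a manga book is available as a (character_id, frame_index) pair:

```python
from collections import defaultdict

def neighbor_frame_proportion(instances, k):
    """instances: list of (character_id, frame_index) pairs for one manga book.
    Returns the fraction of instances whose character appears again within
    k frames (frame-index difference <= k), as in Table I."""
    frames_per_char = defaultdict(list)
    for char_id, frame_idx in instances:
        frames_per_char[char_id].append(frame_idx)

    hits, total = 0, 0
    for char_id, frame_idx in instances:
        total += 1
        # frames of the same character within distance k (the instance itself counts once)
        close = [f for f in frames_per_char[char_id] if abs(f - frame_idx) <= k]
        if len(close) > 1:
            hits += 1
    return hits / total if total else 0.0

# Example: A appears in frames 1 and 2, B in frames 1 and 40.
# neighbor_frame_proportion([("A", 1), ("A", 2), ("B", 1), ("B", 40)], k=1) -> 0.5
```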

We are interested in knowing whether the unique features of manga can be exploited to compensate for the challenges of transfer tasks and thus achieve better results in UManga-ReID. Based on the above analysis, we have further conjectures: On the one hand, there are some potential patterns among the appearance order of manga book characters that enable us to constrain and complement the clustering from the perspective of spatial-temporal relationships when facing exaggerated or omitted artistic representations. On the other hand, different characters in the same manga book often look similar due to the constraints of the manga book style. However, since characters usually have their unique signatures, such as hairstyles and clothing, we can integrate the results of the character’s face and body to include more character identifiers, so as to compensate for similarities in painting styles. These two conjectures enlighten us that making good use of manga content-related features, such as temporal relationship and face-body combination, can help us overcome the difficulties of manga artistic expressions and drawing limitations to achieve good clustering results of manga characters.

Based on the above insights, our contribution includes the following three aspects:

  • To the best of our knowledge, we are the first to propose and study the unsupervised manga character re-identification (UManga-ReID) task. We propose a Face-body and Spatial-temporal Associated Clustering (FSAC) framework for UManga-ReID.

  • We design a temporal-spatial-related loss to fine-tune the pseudo-labeling of manga characters, so as to weaken the challenges posed by artistic expressions. We take advantage of the characters’ unique signatures and combine the face image with the body image of the same character for re-identification to overcome the effects of drawing limitations, etc.

  • Experimental results show that our approach achieves superior performance on the face, body, and mixed manga image datasets built from Manga109, outperforming the state-of-the-art by a large margin.

Figure 2: The spatial relationships of frames can be divided into three types: single-character frames, multi-character frames, and frames with a single character appearing multiple times. Their probabilities are shown below. We can see that most of the characters appearing in the same manga frame belong to different identities.
TABLE I: The proportion of characters that appear more than once in the $k$-nearest frames. The data reveals that most characters tend to reappear in neighboring frames.
k 1 3 5
proportion 70.26% 90.39% 94.77%
Figure 3: The framework of our proposed method with face-body combination and spatial-temporal correction for clustering. The face-body graph combines the predictions of face and body images of the same character to provide comprehensive pseudo labels. The spatial-temporal triplet loss pulls over images in neighboring frames and pushes away images in the same frame.

II Related Works

In line with the focus of our work, we briefly review previous works in the following areas: 1) manga-related works, 2) unsupervised person re-identification methods, and 3) manga character identification works.

II-A Manga-related works

Manga-related research has increased with the popularity of e-manga. Some studies such as [23] focus on comic images, managing to generate comic-style images of real people using GANs. Some articles concentrate on manga translation. For example, [11] designs a fully automated manga translation system via a multi-modal and context-aware translation framework. There are also studies focusing on the content of comics. [13] analyzes the closure-driven narratives of comics, and [9] and [24] focus on the extraction and ordering of manga frames (or storyboards). Some of the well-known cartoon-related datasets include Manga109 [20], Danbooru [22], and iCartoonFace [33]. Manga109 provides 109 manga books with annotations of the face and body images of characters, manga frames, and texts, which provides a basis for us to study the clustering of manga characters from the perspective of the spatial-temporal relationship and face-body combination.

II-B Unsupervised person re-identification

Prior to our focus on the manga world, retrieval of pedestrian images in the real world has received a lot of attention. Person re-identification aims to retrieve images of a specific pedestrian across cameras and is widely used in surveillance scenes.

Among many deep learning methods, unsupervised domain adaptation (UDA) methods have attracted much attention because they do not rely on manual annotation. [5, 14] provide baselines of this approach. These methods usually consist of three steps: 1) features of images in the target domain are extracted using a model pre-trained in the source domain, 2) the generated features are used for clustering to generate pseudo-labels, and 3) the pseudo-labels are used to further fine-tune the model. However, this procedure produces hard labels with noise, and how to solve the noise problem is the key. On this basis, [7] designs a mutual mean-teaching framework and achieves effective results on open-set unsupervised domain adaptation problems (i.e., where the number of classes in the target domain is unknown).

In addition, problems such as image occlusion, appearance change, and pose transformation also plague person re-identification research.

Various methods have been proposed to cope with the appearance change problem in person re-identification research. [12] designs a Domain-Transferred Graph Neural Network to obtain robust representations for group images; the transferred graph addresses the challenge of appearance change, while the graph representation overcomes the challenge of layout and membership change. [29] proposed a Clothing Change Aware Network (CCAN) that addresses the appearance change issue by separately extracting the face and body context representations, and a similar idea is applied in [27].

Some patch-based methods learn partial/regional aggregation properties to make them robust against pose transformation and occlusion. [31] segmented the human body through a patch generation network and developed a PatchNet to learn discriminative features from patches instead of whole images. [6] proposed a self-similarity grouping (SSG) approach, which exploits the potential similarity (from the global body to local parts) of unlabeled samples to automatically build multiple clusters from different views. These methods of combining the part and the whole have achieved effective results and also provide ideas that we can transfer to the manga domain.

II-C Manga character identification

To date, a limited amount of published literature has reported research on manga character identification. The existing research can be broadly classified into supervised and unsupervised methods. Some supervised methods, such as [33], draw on the experience of face recognition methods and propose to use real-world face data to help learn classification hyperplanes in the manga domain. This supervised approach can achieve satisfactory performance, but it requires a lot of labeling work, which brings limitations in the manga context.

Unsupervised methods such as [18] and [19] proposed multiscale histograms of edge directions for sketch-based manga retrieval. However, these methods do not consider the diversity among individual manga volumes. [25] improves the clustering performance by adapting the manga face representation to the target volume, but manual rules are used to construct pseudo-label pairs, which limits its generality.

III The Proposed Formulation

In this section, we first introduce the preliminaries and revisit the generic clustering-based unsupervised domain adaptation (UDA) process in Sec. III-A. Then we present the proposed Face-body and Spatial-temporal Associated Clustering (FSAC) method in Sec. III-B. Further, we introduce two key modules in Sec. III-C and Sec. III-D: the face-body combination module and the spatial-temporal relationship correction module, respectively.

III-A Preliminary

In unsupervised re-identification based on clustering, given a target training set $X^{t}=\{x_{1}^{t},x_{2}^{t},\dots,x_{N^{t}}^{t}\}$ of $N^{t}$ images, the goal is to learn a feature embedding function $F(\theta;x_{i}^{t})$ from $X^{t}$ without any manual annotation, where the parameters of $F$ are collectively denoted as $\theta$.

For UDA methods, however, an additional source dataset $X^{s}=\{x_{1}^{s},x_{2}^{s},\dots,x_{N^{s}}^{s}\}$ of $N^{s}$ images is typically used for pre-training, aiming at learning domain-invariant features from both the source and target domains. The UDA approach applies the classifier $F(\cdot|\theta)$ learned from the source domain data to the target domain data with no or little labeling in the target domain. The features of the source and target domain images encoded by the network are denoted as $\{F(x_{i}^{s}|\theta)\}|_{i=1}^{N_{s}}$ and $\{F(x_{i}^{t}|\theta)\}|_{i=1}^{N_{t}}$, respectively.

To learn the feature embedding, traditional clustering-based UDA methods usually consist of three steps:

1) Features $\{F(x_{i}^{t}|\theta)\}|_{i=1}^{N_{t}}$ of images in the target domain are extracted using the model $F(\cdot|\theta)$ pre-trained in the source domain;

2) Using the generated features $\{F(x_{i}^{t}|\theta)\}|_{i=1}^{N_{t}}$, the images $x_{i}^{t}$ are clustered into $M^{t}$ classes. Let $\tilde{y}_{i}^{t}$ denote the pseudo label generated for image $x_{i}^{t}$;

3) Using the generated pseudo-labels, the model parameters $\theta$ and a learnable classifier $C^{t}: f^{t}\rightarrow\{1,\cdots,M^{t}\}$ of the target domain dataset are fine-tuned with the classification loss and triplet loss denoted as:

$\mathcal{L}_{id}^{t}(\theta)=\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\mathcal{L}_{ce}\left(C^{t}\left(F\left(x_{i}^{t}|\theta\right)\right),\tilde{y}_{i}^{t}\right),$ (1)

$\mathcal{L}_{tri}^{t}(\theta)=\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\max\left(0,\left\|F(x_{i}^{t}|\theta)-F(x_{i,p}^{t}|\theta)\right\|+m-\left\|F(x_{i}^{t}|\theta)-F(x_{i,n}^{t}|\theta)\right\|\right),$ (2)

where $\|\cdot\|$ denotes the $L_{2}$-norm distance, subscripts $i,p$ and $i,n$ represent the positive and negative feature indices in each mini-batch for the sample $x_{i}^{t}$, and $m$ denotes the triplet distance margin.
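As a concrete illustration of step 3, the following PyTorch-style sketch computes the two losses on one mini-batch; it assumes batch-hard positive/negative mining and a margin of 0.3, which are common choices rather than details confirmed by the cited baselines:

```python
import torch
import torch.nn.functional as F

def uda_finetune_losses(features, logits, pseudo_labels, margin=0.3):
    """Classification loss (Eq. 1) plus a batch-hard triplet loss (Eq. 2) on one
    mini-batch of target-domain samples with clustering pseudo labels.
    features: (B, D) embeddings, logits: (B, M) classifier outputs,
    pseudo_labels: (B,) long tensor of cluster ids."""
    loss_id = F.cross_entropy(logits, pseudo_labels)

    dist = torch.cdist(features, features)            # pairwise L2 distances
    same = pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)
    eye = torch.eye(len(pseudo_labels), dtype=torch.bool, device=features.device)
    # hardest positive: farthest sample sharing the pseudo label (excluding the anchor)
    pos_dist = (dist * (same & ~eye).float()).max(dim=1).values
    # hardest negative: closest sample with a different pseudo label
    neg_dist = (dist + same.float() * 1e6).min(dim=1).values
    loss_tri = F.relu(pos_dist + margin - neg_dist).mean()
    return loss_id + loss_tri
```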

This feature embedding function can be applied to the target gallery set $X_{g}^{t}=\{x_{g_{1}}^{t},x_{g_{2}}^{t},\dots,x_{g_{N_{g}^{t}}}^{t}\}$ of $N_{g}^{t}$ images and the target query set $X_{q}^{t}=\{x_{q_{1}}^{t},x_{q_{2}}^{t},\dots,x_{q_{N_{q}^{t}}}^{t}\}$ of $N_{q}^{t}$ images. During the evaluation, we use the feature of a target query image $F(\theta;x_{q_{i}}^{t})$ to search for similar image features in the target gallery set. The query result is a ranking list of all gallery images according to the Euclidean distance between the feature embeddings of the target query and gallery data, i.e.,

$d(x_{q_{i}}^{t},x_{g_{i}}^{t})=\left\|F(\theta;x_{q_{i}}^{t})-F(\theta;x_{g_{i}}^{t})\right\|.$ (3)

The feature embedding is expected to assign a higher rank to images of the same identity and a lower rank to images of different identities.
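A minimal sketch of this evaluation-time retrieval step, assuming the query and gallery embeddings are already extracted:

```python
import torch

def rank_gallery(query_feat, gallery_feats):
    """Rank all gallery images by Euclidean distance to one query embedding (Eq. 3).
    query_feat: (D,) tensor; gallery_feats: (N_g, D) tensor.
    Returns gallery indices from most to least similar."""
    dists = torch.norm(gallery_feats - query_feat.unsqueeze(0), dim=1)
    return torch.argsort(dists)  # ascending distance, i.e. descending similarity
```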

III-B Our proposed method

In this section, we present the proposed Face-body and Spatial-temporal Associated Clustering (FSAC) method. As mentioned above, manga characters have special features, such as spatial-temporal relationships and unique signatures of the characters, which may help clustering.

Therefore, our core idea is to make use of the two special features to make up for the shortage of clustering. To achieve this goal, we propose two modules: the face-body combination module and the spatial-temporal relationship correction module.

Overall, the framework of our method is shown in Figure 3. The target dataset is divided into a related face set and body set, where each face image is cropped from a body image and the two sets are in one-to-one correspondence. Formally, we denote the body data as $\mathbb{D}_{B}=\{(x_{i}^{B},k_{i}^{B})\}|_{i=1}^{N_{B}}$, where $x_{i}^{B}$ denotes the $i$-th body training sample, $k_{i}^{B}$ is the spatial-temporal information associated with $x_{i}^{B}$, indicating the manga frame index of the image $x_{i}^{B}$, and $N_{B}$ stands for the number of body images. We also denote the face data as $\mathbb{D}_{F}=\{(x_{i}^{F},k_{i}^{F})\}|_{i=1}^{N_{F}}$, where $x_{i}^{F}$ and $k_{i}^{F}$ denote the $i$-th face training sample and the manga frame index of the training image, and $N_{F}$ stands for the number of face images.

For each part, we first extract features, denoted as $\{F_{B}(x_{i}^{B}|\theta_{B})\}|_{i}^{N_{B}}$ and $\{F_{F}(x_{i}^{F}|\theta_{F})\}|_{i}^{N_{F}}$, through their respective networks. Then, we employ a clustering algorithm on the unlabeled data, and the pseudo labels $\tilde{y}_{i}^{B^{\prime}}$ and $\tilde{y}_{i}^{F^{\prime}}$ are generated under the constraint of a face-body graph. After that, based on these pseudo labels, we optimize the network parameters $\theta$ and a learnable classifier $C^{t}: f^{t}\rightarrow\{1,\cdots,M^{t}\}$ jointly with the classification loss and the spatial-temporal triplet loss described in Equation 9 and Equation 10. The above steps are repeated until convergence.

III-C The Face-body Combination Module

In the beginning of clustering, each image is assigned to a different cluster for initialization, and then similar images are gradually clustered together by distance. The cluster ID is used as the training set label, and we expect the network to minimize the intra-cluster variance and maximize the inter-cluster variance.

Following [28, 15, 16], we generate soft pseudo labels for the face and body data after clustering, which are expressed in terms of probability. For each image, the probability that it belongs to the $m$-th class is defined as

$\tilde{y}^{m}=p\left(y_{m}\,|\,x,\{a_{i}\}_{i=1}^{N}\right)=\frac{\exp\left(a_{m}^{\top}F\left(x|\theta\right)\right)}{\sum_{i=1}^{N}\exp\left(a_{i}^{\top}F\left(x|\theta\right)\right)},$ (4)

where $\{a_{i}\}_{i=1}^{N}$ are the reference agents that store the feature of each class and $N$ is the number of classes.
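A minimal sketch of Equation 4, assuming the reference agents are stored as a matrix with one row per class (the update rule for the agents is omitted):

```python
import torch

def soft_pseudo_labels(features, agents):
    """Eq. 4: soft pseudo label of each image, i.e. a softmax over the inner
    products with the reference agents (one stored feature per class).
    features: (B, D), agents: (N, D) -> returns (B, N) class probabilities."""
    logits = features @ agents.t()          # a_m^T F(x|theta) for every class m
    return torch.softmax(logits, dim=1)
```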

The core component of the face-body combination module is the face-body graph, which builds a mapping between the body and face images of the same character in the same manga frame. For face images $x_{i}^{F}$ and body images $x_{i}^{B}$ with a mapping relationship, their soft labels are expressed as $\tilde{y}_{i}^{F}$ and $\tilde{y}_{i}^{B}$, respectively. The reference agents corresponding to the soft labels with the maximum probability are denoted as $m_{i}^{F}$ and $m_{i}^{B}$. The graph enables us to synthesize the face and body results in the following way:

$\tilde{y}_{i}^{B^{\prime}}=\tilde{y}_{i}^{F^{\prime}}=\begin{cases}m_{i}^{B}, & \tilde{y}_{i}^{B}>\tilde{y}_{i}^{F}\\ m_{i}^{F}, & \tilde{y}_{i}^{B}\leq\tilde{y}_{i}^{F}.\end{cases}$ (5)

The refined labels $\tilde{y}_{i}^{B^{\prime}}$ and $\tilde{y}_{i}^{F^{\prime}}$ will be further used to fine-tune the network.
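The following sketch illustrates Equation 5, under the simplifying assumption that the face and body soft labels are expressed over a shared set of classes, so that the winning class index can be assigned to both images directly:

```python
import torch

def combine_face_body_labels(soft_face, soft_body):
    """Eq. 5: for each mapped face/body pair, take the class whose soft label has
    the larger maximum probability and assign it to both images.
    soft_face, soft_body: (B, N) soft labels of paired face and body crops."""
    face_conf, face_cls = soft_face.max(dim=1)
    body_conf, body_cls = soft_body.max(dim=1)
    # body wins on strict inequality, otherwise the face prediction is kept (Eq. 5)
    refined = torch.where(body_conf > face_conf, body_cls, face_cls)
    return refined   # shared refined hard label for the face and the body image
```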

III-D The Spatial-temporal Relationship Correction Module

In this module, we add spatial-temporal constraints on the basis of the classic triplet loss [10] and optimize network parameters jointly with the classification loss (i.e., the cross-entropy loss).

The classic triplet loss and classification loss are shown in Equation 1 and Equation 2. In the traditional triplet loss, the positive and negative sample pairs are selected based on the feature distance from the anchor pictures. However, in the UManga-ReID task, we find that spatial-temporal information may help clustering, so we set up spatial-temporal correlation rules to help select positive and negative sample pairs.

In order to add spatial-temporal constraints to the triplet loss, we define a spatial-temporal distance $d^{st}$ to find the positive and negative pairs of the sample $x_{i}$. The spatial-temporal distance between $x_{i}$ and $x_{j}$ is denoted as

$d_{i,j}^{st}=\min\left(\sigma,\left|k_{i}-k_{j}\right|\right)+\eta,$ (6)

where $\sigma$ indicates the threshold of the manga frame index difference and $\eta$ denotes the penalty factor for characters in the same manga frame, in the form of

$\eta=\begin{cases}0, & k_{i}\neq k_{j}\\ \eta, & k_{i}=k_{j}.\end{cases}$ (7)

The sum of the spatial-temporal distance and the $L_{2}$-norm distance of the images constitutes the total distance $d_{i,j}$, denoted as

$d_{i,j}=\left\|F\left(x_{i}|\theta\right)-F\left(x_{j}|\theta\right)\right\|+d_{i,j}^{st},$ (8)

which is used to search for pseudo positive and negative pairs in the triplet loss.

Meanwhile, we still use the $L_{2}$-norm distance of the pseudo positive and negative pairs to calculate the triplet loss.
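A sketch of Equations 6-8 and of how the total distance could be used to mine pseudo pairs; the hardest-pair selection and the assumption that every pseudo label occurs at least twice per batch are illustrative choices, not confirmed implementation details:

```python
import torch

def spatial_temporal_distance(frame_idx, sigma=100.0, eta=1000.0):
    """Eqs. 6-7: pairwise spatial-temporal distance from manga frame indices.
    frame_idx: (B,) tensor with the frame index k_i of each sample."""
    diff = (frame_idx.unsqueeze(0) - frame_idx.unsqueeze(1)).abs().float()
    d_st = torch.clamp(diff, max=sigma)          # min(sigma, |k_i - k_j|)
    d_st = d_st + (diff == 0).float() * eta      # same-frame penalty eta
    return d_st

def mine_st_pairs(features, frame_idx, pseudo_labels):
    """Eq. 8: total distance = L2 feature distance + spatial-temporal distance,
    used only to pick pseudo positive/negative indices; the triplet loss itself
    is still computed on the plain L2 distances, as stated above."""
    d_total = torch.cdist(features, features) + spatial_temporal_distance(frame_idx)
    same = pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)
    eye = torch.eye(len(pseudo_labels), dtype=torch.bool, device=features.device)
    pos_idx = (d_total * (same & ~eye).float()).argmax(dim=1)   # farthest same-label sample
    neg_idx = (d_total + same.float() * 1e9).argmin(dim=1)      # closest other-label sample
    return pos_idx, neg_idx
```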

Taking the face dataset as an example, the spatial-temporal triplet loss and classification loss shown below are still calculated in the classical way, but differ in some details.

$\mathcal{L}_{id}^{F}(\theta)=\frac{1}{N_{F}}\sum_{i=1}^{N_{F}}\mathcal{L}_{ce}\left(C^{F}\left(F\left(x_{i}^{F}|\theta\right)\right),\tilde{y}_{i}^{F}\right),$ (9)

$\mathcal{L}_{st\text{-}tri}^{F}(\theta)=\frac{1}{N_{F}}\sum_{i=1}^{N_{F}}\max\left(0,\left\|F(x_{i}^{F}|\theta)-F(x_{i,p}^{F}|\theta)\right\|+m-\left\|F(x_{i}^{F}|\theta)-F(x_{i,n}^{F}|\theta)\right\|\right).$ (10)

In the classification loss, $\tilde{y}_{i}^{F}$ uses the refined label after the face-body combination, and in the spatial-temporal triplet loss, positive and negative pairs are selected based on the spatial-temporal distance $d^{st}$.

(a) Source Body (b) Source Face (c) Manga Body (d) Manga Face
Figure 4: Some examples of the used datasets. For each dataset, images in each row are from the same identity. (a) Market-1501 [32] is selected as the source body dataset; (b) CelebA [17] converted to sketch style by the U2-Net network [21] is selected as the source face dataset; (c) and (d) are the target manga body and face datasets from Manga109 [18, 20].
TABLE II: Comparison with the state-of-the-art methods on the Manga109-face and Manga109-body datasets. Baseline is the general UDA pipeline of [7] illustrated in Sec. III-A. Strong Baseline is the updated version of the baseline.
Datasets Methods mAP rank-1 rank-5 rank-10
Face Dataset Direct infer on resnet50 7.1% 22.5% 39.6% 59.1%
Baseline [7] (in Sec. III-A) 23.9% 63.9% 78.9% 82.9%
Strong Baseline (update of baseline) 24.4% 51.7% 66.1% 70.6%
MMT [7] (in Sec. III-A) 21.3% 50.6% 62.9% 71.7%
Our Method 32.2% 78.4% 88.1% 91.7%
Body Dataset Direct infer on resnet50 0.6% 2.8% 7.4% 10.0%
Baseline [7] (in Sec. III-A) 5.9% 32.0% 49.0% 57.1%
Strong Baseline (update of baseline) 5.2% 28.8% 37.2% 46.0%
MMT [7] (in Sec. III-A) 4.8% 19.4% 30.9% 36.3%
Our Method 10.2% 40.3% 66.8% 71.1%
Face Dataset + Body Dataset Direct infer on resnet50 0.5% 16.9% 21.1% 23.5%
Baseline [7] (in Sec. III-A) 3.2% 31.4% 45.5% 53.7%
Strong Baseline (update of baseline) 2.8% 33.1% 39.6% 51.9%
MMT [7] (in Sec. III-A) 4.8% 38.8% 52.7% 60.0%
Our Method 9.8% 52.8% 68.6% 73.1%

IV The Proposed Algorithm

Algorithm 1 The FSAC framework

Input: Unlabeled manga body data $\mathbb{D}_{B}=\{(x_{i}^{B},k_{i}^{B})\}|_{i=1}^{N_{B}}$;
Unlabeled manga face data $\mathbb{D}_{F}=\{(x_{i}^{F},k_{i}^{F})\}|_{i=1}^{N_{F}}$;
$F_{B}(\cdot|\theta_{B})$ pre-trained on real human body data;
$F_{F}(\cdot|\theta_{F})$ pre-trained on real human face data.
Output: The learned CNN models $F_{B}(\cdot|\theta_{B})$ and $F_{F}(\cdot|\theta_{F})$.

1: Initialize the frame threshold $\sigma$ and penalty factor $\eta$ for Equation 6.
2: for $n$ in $[1, \mathrm{num\_epochs}]$ do
3:     Extract body features $\{F(x_{i}^{B}|\theta)\}|_{i}^{N_{B}}$ and face features $\{F(x_{i}^{F}|\theta)\}|_{i}^{N_{F}}$;
4:     Generate soft pseudo labels $\tilde{y}_{i}^{B}$ for each sample $x_{i}^{B}$ in $\mathbb{D}_{B}$ and $\tilde{y}_{i}^{F}$ for $x_{i}^{F}$ in $\mathbb{D}_{F}$ following Equation 4;
5:     for each mini-batch $B$, iteration $T$ do
6:         Generate combined pseudo labels $\tilde{y}_{i}^{B^{\prime}}$ and $\tilde{y}_{i}^{F^{\prime}}$ via the face-body graph following Equation 5;
7:         Update parameters $\theta_{B}$ and $\theta_{F}$ via the spatial-temporal triplet loss and classification loss following Equations 10 and 9 with the spatial-temporal distance.
8:     end for
9: end for

As demonstrated in Figure 3, our framework consists of two parts: the face part and the body part, which are interconnected but do not interfere with each other. Our goal is to optimize their respective models by taking advantage of their common relationships and their internal spatial-temporal relationships.

The proposed FSAC algorithm consists of two steps: (1) pre-training phase, which initializes the CNN network by pre-training with the labeled source domain datasets; (2) loop iteration phase, which boosts clustering leveraging the proposed two modules. We elaborate on the two steps as follows.

IV-A Pre-training stage

We use the labeled real-person face and body datasets as source domains for pre-training. These two pre-trained models $F_{F}(\cdot|\theta_{F})$ and $F_{B}(\cdot|\theta_{B})$ will be used as the CNN initializations for the face and body, respectively, in the subsequent loop iteration stage.

IV-B Loop iteration stage

In the iterative process, we first extract image features and generate soft pseudo-labels after clustering. Then, following Equation 5 in the face-body combination module, we generate the same hard pseudo-label $\tilde{y}_{i}^{B^{\prime}}=\tilde{y}_{i}^{F^{\prime}}$ for the corresponding face and body images. After that, we employ the spatial-temporal triplet loss and the classification loss to update their respective networks.

Such operations of generating pseudo labels by clustering and learning features with pseudo labels are alternated until the training converges. The entire process is summarized in Algorithm 1.
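The following structural sketch summarizes the loop iteration stage. soft_pseudo_labels and combine_face_body_labels are the sketches given earlier, while extract_features, cluster_to_agents, and st_triplet_and_id_loss are hypothetical placeholders for the feature extraction, clustering, and Eqs. 9-10 described in the text; it is an outline under those assumptions, not the released implementation.

```python
def fsac_loop_iteration(face_model, body_model, paired_loader, optimizer, num_epochs):
    for epoch in range(num_epochs):
        # hypothetical helpers: extract all features, cluster them, keep one agent per cluster
        agents_f = cluster_to_agents(extract_features(face_model, paired_loader, part="face"))
        agents_b = cluster_to_agents(extract_features(body_model, paired_loader, part="body"))
        # paired_loader yields matched face/body crops with their frame indices,
        # sampled without shuffling to preserve the temporal order (Sec. V-B2)
        for x_face, x_body, frame_idx in paired_loader:
            soft_f = soft_pseudo_labels(face_model(x_face), agents_f)
            soft_b = soft_pseudo_labels(body_model(x_body), agents_b)
            y = combine_face_body_labels(soft_f, soft_b)     # shared refined labels (Eq. 5)
            # hypothetical helper wrapping the classification and spatial-temporal triplet losses
            loss = (st_triplet_and_id_loss(face_model, x_face, frame_idx, y) +
                    st_triplet_and_id_loss(body_model, x_body, frame_idx, y))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```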

V Experiments

V-A Datasets

The experimental dataset includes four parts: the source face dataset and source body dataset for pre-training face and body networks, as well as the target face dataset and target body dataset for fine-tuning on the manga domain. The details are as follows.

V-A1 Market-1501 (source body dataset)

Market-1501 [32] is a large-scale public benchmark dataset for person re-identification. It contains 1,501 identities captured by six different cameras and 32,668 pedestrian bounding boxes obtained using the deformable part model pedestrian detector. Each person has 3.6 images on average at each viewpoint. The dataset is split into two parts: 750 identities are utilized for training and the remaining 751 identities are used for testing. In the official testing protocol, 3,368 query images are selected as the probe set to find the correct matches across 19,732 reference gallery images. The dataset split and statistics are shown in Table III.

V-A2 CelebA-U2Net (source face dataset)

CelebFaces Attributes Dataset (CelebA) [17] is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversity, large quantity, and rich annotations, including 10,177 identities, 202,599 face images, and 5 landmark locations and 40 binary attribute annotations per image.

To achieve better performance in the manga context, we use U2-Net [21] to acquire a sketch-style human face dataset, which we call CelebA-U2Net. U2-Net is a simple yet powerful deep network architecture for salient object detection and provides an interesting application for human portrait drawing, which helps us transform real-face pictures into sketch-style pictures. This increases the commonalities between the source and target datasets, which is conducive to subsequent training and learning. During the style transfer, we deleted images with undetectable faces. The newly obtained dataset is described in Table III.

V-A3 Manga109-body (target body dataset)

Manga109 [18, 20] is a dataset of a variety of 109 Japanese comic books publicly available for use for academic purposes. Frames, texts, and the face and body images of characters are defined and labeled in each manga book.

To improve the applicability of the network, we select the characters that appear more than 9 times for the face and body data, respectively. We find and match face and body images of the same character in the same manga frame and build face and body datasets that map to each other. Generally, the face image is the face region of its mapped body image, and the face and body datasets share the same number of images and characters.

After that, we have a face dataset and a body dataset with 186,817 images and 1506 characters each. We divided the datasets into training and gallery sets in a ratio of 2:1 and randomly selected one image from the gallery for each character as the respective query set. Therefore, the query sets of the face and body data each contain 502 images of different characters. Face images and body images are resized to 112 × 112 and 128 × 256, respectively, before being fed into the networks.
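A hedged sketch of this pairing and filtering step, where load_annotations is a hypothetical helper that yields (character_id, frame_index, face_box, body_box) tuples from the Manga109 annotations; it is not the authors' preprocessing script:

```python
from collections import Counter

def build_paired_sets(books, min_appearances=10):
    """Keep characters appearing more than 9 times and pair each face crop with
    its body crop from the same manga frame (one record per character instance)."""
    pairs = []
    for book in books:
        records = list(load_annotations(book))            # hypothetical loader
        counts = Counter(cid for cid, _, _, _ in records)
        for cid, frame_idx, face_box, body_box in records:
            if counts[cid] >= min_appearances and face_box is not None and body_box is not None:
                pairs.append({"char": cid, "frame": frame_idx,
                              "face": face_box, "body": body_box})
    return pairs  # faces and bodies map one-to-one, as required by the face-body graph
```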

V-A4 Manga109-face (target face dataset)

Manga109-face takes the same processing as manga109-body but uses a subset of the faces. Relevant statistics are shown in Table III and sample images of the datasets can be found in Figure 4.

TABLE III: Statistics of datasets.
dataset size subdataset images identities
Market-1501 64*128 train 12937 750
query 3368 751
gallery 19732 751
CelebA-U2Net 112*112 \ 194098 10172
Manga109-Body 64*128 train 137482 1004
query 502 502
gallery 48833 502
total 186817 1506
Manga109-Face 112*112 train 137482 1004
query 502 502
gallery 48833 502
total 186817 1506

V-B Implementation Details

V-B1 On pre-training stage

We use ResNet50 [8] pre-trained on real-human data for the convolutional layers. Considering the huge differences between face and body images, we use different pre-trained models for them.

Specifically, for the body part, we use the real-world pedestrian dataset Market-1501 [32] for pre-training. Market-1501 is a commonly used person re-identification dataset, consisting of 12,937 training images (from 750 different people) and 19,732 test images (from another 751 different people). We pre-train the ResNet50 model on Market-1501 via the MMT method [7] and select one of its models as the body pre-trained model.

For the face part, the CelebA [17] dataset is selected, as we find that sketch-style human images improve the clustering performance in the manga context. Therefore, we use U2-Net [21] to transfer the CelebA images into a black-and-white sketch style for pre-training. Unlike the body part, ArcFace [4] is used as the classifier for face data pre-training, which is validated to be a better way of mining facial features. The pre-training of the face and body networks is initialized with ImageNet [3] pre-trained weights. Figures 4(a) and 4(b) show some samples of Market-1501 and CelebA.

V-B2 On loop iteration stage

Based on the pre-trained models, we then fine-tune the networks through the face-body and spatial-temporal modules. In the clustering stage, we use DBSCAN as the clustering algorithm (k-means is also supported if the number of clusters is given). In the calculation of the spatial-temporal distance $d^{st}$, we set the threshold of the manga frame index difference to $\sigma=100$ and the penalty factor to $\eta=1000$ in Equation 6. In order to make full use of the spatial-temporal relationship, a no-shuffle strategy is used during sampling.
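A minimal sketch of these choices, where the DBSCAN eps and min_samples values are illustrative rather than reported hyper-parameters:

```python
from sklearn.cluster import DBSCAN

def cluster_features(features, eps=0.6, min_samples=4):
    """Cluster (N, D) L2-normalized embeddings with DBSCAN; label -1 marks
    outliers, which can be dropped before fine-tuning."""
    return DBSCAN(eps=eps, min_samples=min_samples, metric="euclidean").fit_predict(features)

# To preserve the manga reading order that the spatial-temporal distance relies on
# (sigma = 100, eta = 1000 in Eq. 6), the data loader is built without shuffling, e.g.
#   loader = torch.utils.data.DataLoader(manga_dataset, batch_size=64, shuffle=False)
```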

V-B3 Evaluation metrics

We employ the mean Average Precision (mAP) and Cumulative Matching Characteristic (CMC) [2] scores for evaluation. The mAP reflects the recall, while the CMC scores reflect the retrieval precision. We report the rank-1, rank-5, and rank-10 scores to represent the CMC curve.
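For reference, a compact sketch of how these metrics can be computed from a query-gallery distance matrix, assuming every query identity has at least one correct gallery match; it is not the exact evaluation code used here:

```python
import numpy as np

def cmc_and_map(dist, query_ids, gallery_ids, ranks=(1, 5, 10)):
    """dist: (num_query, num_gallery) distance matrix; query_ids, gallery_ids:
    integer numpy arrays of character identities."""
    cmc_hits = np.zeros(max(ranks))
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                      # gallery sorted by distance
        matches = gallery_ids[order] == query_ids[i]
        first_hit = int(np.argmax(matches))              # 0-based rank of first correct match
        cmc_hits[first_hit:] += 1                        # counted for every rank >= first hit
        hit_positions = np.where(matches)[0]
        precisions = np.arange(1, hit_positions.size + 1) / (hit_positions + 1)
        aps.append(precisions.mean())                    # average precision of this query
    num_q = dist.shape[0]
    cmc = {f"rank-{r}": cmc_hits[r - 1] / num_q for r in ranks}
    return cmc, float(np.mean(aps))
```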

Figure 5: Query results. The first image is the search image, followed by the rank-1 to rank-5 results. The images framed in green are the correct results. The numbers below the images denote the frame sequence indices of the images. The difference between two frame sequence indices reveals the temporal distance between the two images.

V-C Comparison with the State-of-the-Art

The comparisons with the state-of-the-art methods on Manga109 are shown in Table II.

The baseline method in the table uses the general Unsupervised Domain Adaptation (UDA) pipeline of [7] illustrated in Sec. III-A, which adopts features extracted from the pre-trained network for clustering and uses the classification loss and classic triplet loss to update the network, excluding the face-body combination and spatial-temporal relationship correction modules. Strong Baseline is the follow-up update of this method, which changes the pre-training step to synchronous training with shared weights.

Although these methods achieve outstanding performance in person re-identification, our manga-content-based method surpasses them by a large margin in the manga context in terms of mAP and CMC. On the face dataset, we achieve the best results with an mAP of 32.2% and a rank-1 score of 78.4%, which outperforms the state-of-the-art results by 7.8% and 14.5%, respectively. On the body dataset, we also surpass the state-of-the-art results by 4.3% and 8.3% in mAP and rank-1 score, respectively. When tested on a mixed dataset of Manga109-Face and Manga109-Body, we achieve improvements of 5.0% in mAP and 14.0% in rank-1 accuracy.

In addition, although [25] also studies this topic, we do not compare with their results because their dataset is pre-screened and we have no access to their code and experimental data.

V-D Ablation study

We conduct ablation studies on the two key parts: the spatial-temporal triplet loss and the face-body graph. For the model without the spatial-temporal triplet loss, the classic triplet loss [10] is used. As shown in Table IV, the comparison results on the two datasets validate the effectiveness of the spatial-temporal module, which can be seen from the mAP difference of 5.2% on the face dataset and 2.2% on the body dataset.

In addition, for the model without the face-body graph, we adopt the hard pseudo labels generated by clustering instead. The results in Table IV also show the effectiveness of the face-body combination module.

TABLE IV: Ablation studies. The model without spatial-temporal triplet loss uses classic triplet loss [10] instead. The model without the face-body graph uses the hard label generated by clustering instead.
Datasets Spatial-temporal triplet loss Face-body Graph mAP rank-1 rank-5 rank-10
Face Dataset ✗ ✗ 23.9% 63.9% 78.9% 82.9%
✓ ✗ 29.1% 72.5% 85.5% 89.4%
✓ ✓ 32.2% 78.4% 88.1% 91.7%
Body Dataset ✗ ✗ 5.9% 32.0% 49.0% 57.1%
✓ ✗ 8.1% 38.8% 60.2% 65.3%
✓ ✓ 10.2% 40.3% 66.8% 71.1%
Face + Body Dataset ✗ ✗ 3.2% 31.4% 45.5% 53.7%
✓ ✗ 9.3% 50.3% 62.6% 68.8%
✓ ✓ 9.8% 52.8% 68.6% 73.1%
Figure 6: Comparison of the t-SNE results of our method and the baseline method on the learned feature embeddings of a part of the Manga109-body training set (10 characters, 481 images). Points of the same color represent images of the same character. We show detailed images of two positive examples (the blue circles) and a negative example (the red circle). The points in the blue circles have the same identity. In the red circle, the blue and brown points are clustered together, indicating that our algorithm incorrectly merges them into one cluster. However, these samples look very similar and are hard to distinguish from each other.

V-E Algorithmic Analysis

To make the experimental analysis more comprehensive and to explore the details of the effects of the spatial-temporal relationship, we conducted experimental tests on the parameters $\sigma$ and $\eta$ in Equation 6. The experimental results are displayed in Figure 7.

V-E1 Analysis of the parameters

The parameter $\sigma$ is the threshold of the manga frame distance, which intuitively determines the upper limit of the spatial-temporal distance between two images that are not in the same manga frame and thus controls, to some extent, the weight of the manga spatial-temporal distance relative to the image similarity distance. Figure 7(a) shows the performance for $\sigma$ ranging between 50 and 150. A manga chapter usually contains 100 to 200 frames, which determined the range of $\sigma$ in our tests. We observe that as $\sigma$ increases, the results show a weak trend of first decreasing and then increasing, but the mAP values remain stable at around 32.5% and do not fluctuate much overall.

The parameter $\eta$ is a penalty term for two images in the same frame. We tested the performance of $\eta$ at different orders of magnitude, and the experimental results are shown in Figure 7(b). With the increase of $\eta$, the model performance first increases and then decreases, and the model performs best when $\eta$ is around 1000.

V-E2 Analysis of the shuffle and no-shuffle training strategy

In addition, we compared the shuffle and no-shuffle training strategies on the datasets; the results for the face and body datasets are shown in Figure 7(c) and Figure 7(d). We observe that training with the no-shuffle strategy outperforms the shuffle strategy on both the face and body datasets, which indicates that the temporal order of manga character images is effective for clustering.

Figure 7: (a) The rank-n accuracy and mAP curve with different values of the manga frame threshold parameter σ\sigma on the Manga109Face dataset. (b) The rank-n accuracy and mAP curve with different values of the penalty factor parameter η\eta on the Manga109Face dataset. (c) The rank-n and mAP performance of the model on the face dataset when trained with the shuffle and no shuffle strategies. (d) The rank-n and mAP performance of the model on the body dataset when trained with the shuffle and no shuffle strategies.

V-F Qualitative analysis

We visualize the query results of some images from the face dataset in Figure 5 and display the rank-1 to rank-5 results for each search (query) image. The frame sequence indices below the images indicate their temporal positions. From this figure, we can further examine the detailed impact of our approach and discuss the following questions:

Have spatial-temporal relationships helped clustering? As we can see, most negative results are far away from the search image in the temporal dimension. From the perspective of method design, we shorten the distance between images in neighboring frames while lengthening the distance between images in the same frame. Therefore, we can avoid clustering remote images and images in the same frame as much as possible, so as to overcome the style limitation to a certain extent. The results of rows 1 and 2 in Figure 5 validate the effectiveness of our method.

Has face-body combination helped clustering? Characters in rows 3 and 4 of Figure 5 have similar clothes, and they are wrongly clustered together by the baseline without face-body combination. The face-body combination module considers both the face and the body, making it less likely to miss the signatures of characters. Hence, the experimental results corroborate our idea.

Moreover, we utilize t-SNE [26] to visualize the feature embeddings of the clusters by plotting them to the 2D map, which is illustrated in Figure 6. Although it is still difficult to distinguish some very similar images, our method has a satisfactory effect on the whole.

VI Conclusion

In this paper, we proposed a content-related unsupervised manga character re-identification method to achieve better clustering performance in the manga domain. The key is to utilize manga spatial-temporal relationships and combine the face and body information of characters during clustering to tackle the problems of artistic expressions and style limitations in manga creation. We designed a face-body graph to generate refined pseudo labels for each dataset. Moreover, we modified the triplet loss with spatial-temporal relationships to fine-tune the network. Experimental results on the Manga109 dataset validate the superiority of the proposed method.

References

  • [1] Manga market in Japan hits record 612.6 billion yen in 2020. https://www.animenewsnetwork.com/news/2021-02-26/manga-market-in-japan-hits-record-612.6-billion-yen-in-2020/.169987, 2021.
  • [2] Ruud M Bolle, Jonathan H Connell, Sharath Pankanti, Nalini K Ratha, and Andrew W Senior. The relation between the roc curve and the cmc. In Fourth IEEE workshop on automatic identification advanced technologies (AutoID’05), pages 15–20. IEEE, 2005.
  • [3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [4] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
  • [5] Hehe Fan, Liang Zheng, Chenggang Yan, and Yi Yang. Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(4):1–18, 2018.
  • [6] Yang Fu, Yunchao Wei, Guanshuo Wang, Yuqian Zhou, Honghui Shi, and Thomas S Huang. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6112–6121, 2019.
  • [7] Yixiao Ge, Dapeng Chen, and Hongsheng Li. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. In ICLR 2020 : Eighth International Conference on Learning Representations, 2020.
  • [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [9] Zheqi He, Yafeng Zhou, Yongtao Wang, and Zhi Tang. Sren: Shape regression network for comic storyboard extraction. In AAAI, pages 4937–4938, 2017.
  • [10] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [11] Ryota Hinami, Shonosuke Ishiwatari, Kazuhiko Yasuda, and Yusuke Matsui. Towards fully automated manga translation. In AAAI, pages 12998–13008, 2020.
  • [12] Ziling Huang, Zheng Wang, Wei Hu, Chia-Wen Lin, and Shin’ichi Satoh. Dot-gnn: Domain-transferred graph neural network for group re-identification. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1888–1896, 2019.
  • [13] Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daume, and Larry S Davis. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives. In Proceedings of the IEEE Conference on Computer Vision and Pattern recognition, pages 7186–7195, 2017.
  • [14] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8738–8745, 2019.
  • [15] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8738–8745, 2019.
  • [16] Yutian Lin, Lingxi Xie, Yu Wu, Chenggang Yan, and Qi Tian. Unsupervised person re-identification via softened similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3390–3399, 2020.
  • [17] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [18] Yusuke Matsui, Kota Ito, Yuji Aramaki, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76(20):21811–21838, 2017.
  • [19] Rei Narita, Koki Tsubota, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using deep features. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pages 49–53, 2017.
  • [20] Toru Ogawa, Atsushi Otsubo, Rei Narita, Yusuke Matsui, Toshihiko Yamasaki, and Kiyoharu Aizawa. Object detection for comics using manga109 annotations. arXiv preprint arXiv:1803.08670, 2018.
  • [21] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition, 106:107404, 2020.
  • [22] Edwin Arkel Rios, Wen-Huang Cheng, and Bo-Cheng Lai. Daf:re: A challenging, crowd-sourced, large-scale, long-tailed dataset for anime character recognition. arXiv preprint arXiv:2101.08674, 2021.
  • [23] Hao Su, Jianwei Niu, Xuefeng Liu, Qingfeng Li, Jiahe Cui, and Ji Wan. Mangagan: Unpaired photo-to-manga translation based on the methodology of manga drawing. In AAAI, pages 2611–2619, 2021.
  • [24] Takamasa Tanaka, Kenji Shoji, Fubito Toyama, and Juichi Miyamichi. Layout analysis of tree-structured scene frames in comic images. In IJCAI’07 Proceedings of the 20th international joint conference on Artifical intelligence, pages 2885–2890, 2007.
  • [25] Koki Tsubota, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Adaptation of manga face representation for accurate clustering. In SIGGRAPH Asia 2018 Posters, page 15, 2018.
  • [26] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605, 2008.
  • [27] Fangbin Wan, Yang Wu, Xuelin Qian, Yixiong Chen, and Yanwei Fu. When person re-identification meets changing clothes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 830–831, 2020.
  • [28] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3415–3424, 2017.
  • [29] Jia Xue, Zibo Meng, Karthik Katipally, Haibo Wang, and Kees van Zon. Clothing change aware person identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2112–2120, 2018.
  • [30] Hideaki Yanagisawa, Takuro Yamashita, and Watanabe Hiroshi. Manga character clustering with dbscan using fine-tuned cnn model. In International Workshop on Advanced Image Technology (IWAIT) 2019, volume 11049, pages 305–310. SPIE, 2019.
  • [31] Qize Yang, Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. Patch-based discriminative feature learning for unsupervised person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3633–3642, 2019.
  • [32] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, pages 1116–1124, 2015.
  • [33] Yi Zheng, Yifan Zhao, Mengyuan Ren, He Yan, Xiangju Lu, Junhui Liu, and Jia Li. Cartoon face recognition: A benchmark dataset. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2264–2272, 2020.