
SWIS: Self-Supervised Representation Learning for Writer Independent Offline Signature Verification

Siladittya Manna
Computer Vision and Pattern Recognition Unit
Indian Statistical Institute, Kolkata
Email: [email protected]

Saumik Bhattacharya
Dept. of Electronics and Electrical Communication Engineering
Indian Institute of Technology Kharagpur
Email: [email protected]

Soumitri Chattopadhyay
Dept. of Information Technology
Jadavpur University, Kolkata
Email: [email protected]

Umapada Pal
Computer Vision and Pattern Recognition Unit
Indian Statistical Institute, Kolkata
Email: [email protected]
Abstract

Writer independent offline signature verification is one of the most challenging tasks in pattern recognition, as training data is often scarce. To handle this data scarcity problem, we propose in this paper a novel self-supervised learning (SSL) framework for writer independent offline signature verification. To our knowledge, this is the first attempt to utilize a self-supervised setting for the signature verification task. Self-supervised representation learning from signature images is achieved by minimizing the cross-covariance between two random variables belonging to different feature directions towards zero, and ensuring a positive cross-covariance between the random variables denoting the same feature direction. This ensures that the features are linearly decorrelated and redundant information is discarded. Experimental results on different datasets are encouraging. Although the results do not surpass supervised methods, our aim in this work is to introduce SSL to the signature verification area so that researchers can explore it further and obtain better verification performance in limited-data scenarios.

I Introduction

Signature verification has been used as one of the most essential steps for identity verification of person-specific documents like forms, bank cheques, or even the individuals themselves. Thus, signature verification is a critical task which requires utmost precision and accuracy in prediction. This makes signature verification an important task in the domain of computer vision and pattern recognition. There are mainly two types of signature verification processes: (1) offline and (2) online. In offline signature verification, the input is a 2D image which is scanned from the original signature or captured by some electronic device. In online signature verification, the writer usually pens down their signature on an electronic tablet using a stylus, and the position of the stylus is recorded at regular timesteps along with other information. Signature verification requires modelling a lot of information like strokes, writing style, etc. Considerable research effort has been put into this field over the last few decades to make the algorithms more accurate.

Due to the limited amount of available information, offline signature verification is typically more challenging than the online verification process. In offline signature verification, only an image is provided, which is the case in many real-life scenarios. Furthermore, in the offline scheme, only original signatures are initially available as references, and only in small numbers. It is generally not possible to obtain forged signatures beforehand and use them as references.

Offline signature verification can further be divided into two types: (1) writer dependent and (2) writer independent. In the writer dependent scenario, the system needs to be updated and retrained for every new user signature added to the system. This makes the process cumbersome and less feasible. In the writer independent scenario, however, a generalized system is built which can differentiate between genuine and forged signatures without repeated retraining.

In this work, we propose a self-supervised learning algorithm for offline writer-independent signature verification. Self-supervised learning is a sub-domain of unsupervised learning that aims at learning representations from data without any ground truth or human annotations. In particular, we use a learning framework which helps the encoder learn representations of signatures in terms of linearly decorrelated factors or dimensions in the feature space. Intuitively, this helps the encoder represent a signature in terms of its several generative factors such that the redundant information content of each factor is minimized. This in turn creates a bottleneck which allows useful representations to be learnt from the input. As a skilled forgery is supposed to be very close to the genuine signature, it is necessary to distinguish between each constituting element of the signatures for correct classification. However, since it is not possible to obtain a large number of annotated genuine signatures from individuals for training a large model, we use self-supervised learning to train the model to learn representations which generalize over signatures from a large number of individuals. This work is the first of its kind to apply a self-supervised learning framework for learning representations from signature images. Moreover, we do not use any siamese-type architecture in the downstream offline signature verification task, and we show the capability of the pretrained encoder to effectively cluster the genuine signatures of different unknown writers. This demonstrates that the pretrained encoder is capable of breaking down signatures in terms of their building blocks, i.e., linearly decorrelated generative factors.

The main contributions of this work are as follows:

  • A novel self-supervised approach is introduced for offline writer independent signature verification.

  • To the best of our knowledge, this is the first work to use self-supervised learning for signature verification.

  • We have shown that the proposed SSL framework performs better than state-of-the-art self-supervised contrastive learning approaches used in computer vision and medical image analysis.

The rest of the paper is organized as follows. Sec. II reviews the literature on offline signature verification and self-supervised learning. Sec. III describes the self-supervised learning methodology used in this work for pre-training the encoder, which is then used for feature extraction in the downstream task. Sec. IV presents the details of the datasets and the experimental configurations used in the pre-training as well as the downstream phase. In Sec. V, we present the experimental results and comparisons with the baseline models. Finally, we conclude the paper in Sec. VI.

II Related Works

II-A Offline Signature Verification

Most researchers have leveraged supervised learning methods [rantzsch2016signature, dey2017signet, ruiz2020off, wan2021learning, parcham2021cbcapsnet, bhunia2019signature] for offline signature verification. While handcrafted feature analyses have comprised the bulk of studies in this domain [bhunia2019signature, alaei2017efficient, hafemann2017review, banerjee2021new], various deep learning-based methods have also been proposed, particularly dwelling on metric learning approaches [rantzsch2016signature, dey2017signet, ruiz2020off, wan2021learning]. Dey et al. [dey2017signet] introduced a contrastive loss based convolutional Siamese network for handwritten signature verification. Ruiz et al. [ruiz2020off] combined synthetic signature generation with siamese networks for the verification task. The authors of [shariatmadari2019patch] proposed a hierarchical CNN to learn features from patches of genuine signatures. Zhu et al. [zhu2020point] sought to tackle intra-writer variations by introducing a point-to-set metric for offline signature verification. Zois et al. [zois2017parsimonious] and Berkay et al. [berkay2018hybrid] explored sparse dictionary learning and hybrid two-channel CNNs respectively for signature verification, whereas the authors of [wan2021learning] proposed two triplet losses, each to tackle random and skilled forgeries respectively. Other works include an interval symbolic representation and fuzzy similarity measure [alaei2017efficient] based handcrafted feature engineering method; region-based metric learning [liu2021offline]; a neuromotor equivariance inspired model [diaz2016approaching]; a recurrent neural network architecture [ghosh2021recurrent] and a graph neural network based approach [roy2021offline]. Nevertheless, all the aforementioned works are fully supervised methods and therefore, share the common bottleneck of data scarcity. To this end, we demonstrate the first use of self-supervision for offline signature verification.

II-B Self-supervised Learning

Self-supervised learning [jing2020self] aims at developing a pre-training paradigm to learn a robust representation from an unlabelled corpus for generalization to any given downstream task. Widely studied in recent years, several pretext tasks have been proposed, such as solving jigsaw puzzles [noroozi2016unsupervised], image colorization [zhang2016colorful], super-resolution [ledig2017photo] and cross-modal translation [bhunia2021vectorization], to name a few. Contrastive learning based self-supervised pre-training [chen2020simple, he2020momentum] has also gained popularity; it aims at learning similarity between augmented views of the same image while distancing views of different images. SimCLR [chen2020simple] and MoCo [he2020momentum] are some of the state-of-the-art contrastive learning-based self-supervised algorithms in the literature. In a different approach, the authors of [zbontar2021barlow] aimed at simultaneously maximizing similarity and minimizing redundancy between embeddings of two distorted views of an image.

III Methodology

In this section, we discuss the pre-processing and algorithmic steps used to train the proposed encoder.

III-A Pretraining Methodology

In signature images, it is essential to capture the stroke information of different authors as well as to learn the variations in the signatures of the same individual. This allows the model to learn representations which not only discriminate one author from another, but also help differentiate between genuine and forged signatures of an individual. To feed in the stroke information without any human supervision, we divide each signature image, resized to 224×224, into patches of dimension 32×32 with an overlap of 16 pixels. This gives 169 patches from a single image. As the base encoder, we choose ResNet-18 [he2016deep]. When the patches are passed through the encoder, we obtain an output of 1×1×512 from each patch. We arrange the patch outputs into a 13×13 grid such that the output from a single image after rearranging is 13×13×512. After applying global average pooling, we obtain a feature vector of dimension 1×512. This feature vector is then passed through a non-linear projector with one hidden layer and output dimension 512 to obtain the final output.
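For concreteness, the patch-and-pool pipeline just described might look as follows in PyTorch. This is a minimal sketch based on the text above, not the authors' released code; the use of tensor unfolding and the projector widths are our assumptions.

```python
import torch
import torch.nn as nn
import torchvision

encoder = torchvision.models.resnet18()
encoder.fc = nn.Identity()                 # expose the 512-d pooled features
projector = nn.Sequential(                 # non-linear projector, 1 hidden layer
    nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

def embed(images):                         # images: (B, 3, 224, 224)
    B, C = images.shape[:2]
    # 32x32 patches with stride 16 -> a 13x13 grid, i.e. 169 patches per image
    patches = images.unfold(2, 32, 16).unfold(3, 32, 16)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, 32, 32)
    feats = encoder(patches).view(B, 13, 13, 512)   # 1x1x512 output per patch
    pooled = feats.mean(dim=(1, 2))                 # global average pooling
    return projector(pooled)                        # (B, 512) final output
```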

The main objective behind dividing the signature image into overlapping patches is to extract stroke information without human annotation. To facilitate stroke information learning, we decorrelate the output dimensions of the encoder such that each dimension carries linearly uncorrelated information. By linearly uncorrelated information we mean that the encoder discards redundant features by minimizing the cross-covariance between each pair of dimensions, and extracts only meaningful information from the input by creating a bottleneck in the information flow from the input space to the feature space. We normalize the feature vectors such that they lie within a unit hypersphere 𝒮^D, where D is the dimension of the feature vector. The diagonal terms of the cross-covariance matrix are optimised to equal 1. This allows the uncorrelated dimensions to have zero mean and unit variance.

For forming positive pairs, we generate two randomly augmented views of a single signature image. The augmentation details are mentioned in Sec. IV-B. The images are then divided into patches as described above and passed through the encoder and the projector.

Thus, the proposed loss function has the form:

\mathcal{L}_{C} = \frac{1}{N}\sum_{i=1}^{D}\sum_{\substack{j=1 \\ j\neq i}}^{D}\left(\sum_{k=1}^{N} z_{k}^{i}\,{z'}_{k}^{j}\right)^{2} + \frac{1}{N}\sum_{i=1}^{D}\left(\sum_{k=1}^{N} z_{k}^{i}\,{z'}_{k}^{i} - 1\right)^{2}    (1)

where z_k^i is the scalar value at the i-th dimension of the k-th centered and normalized feature vector z_k. The pre-processing steps applied to each feature vector before feeding it to the loss function are as follows:

\overline{z}_{k}^{\,i} = \frac{\widetilde{z}_{k}^{\,i}}{\sqrt{\sum_{k=1}^{N}\left(\widetilde{z}_{k}^{\,i}\right)^{2}}} \quad \forall i \in [1, D]

\mu_{z}^{i} = \frac{1}{N}\sum_{k=1}^{N}\overline{z}_{k}^{\,i} \quad \forall i \in [1, D]

z_{k}^{i} = \overline{z}_{k}^{\,i} - \mu_{z}^{i} \quad \forall i \in [1, D]    (2)

It is to be noted that z_k^i and z'_k^i are not the same; they are obtained from the two elements of a positive pair. Thus, the proposed loss function does not optimize the terms of a cross-covariance matrix in the true sense of the term. We therefore refer to this matrix as a pseudo cross-covariance matrix, and in Figure 1, we illustrate how it is formed from the feature vectors. The diagonal elements (coloured white) constitute the second term of the loss function \mathcal{L}_{C}, whereas the remaining non-diagonal elements constitute the first.
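To make the computation concrete, the normalization of Eq. 2 and the loss of Eq. 1 translate directly into a short PyTorch routine. The sketch below is our own reading of the equations, assuming batch-major N×D projector outputs; it is not the authors' implementation.

```python
import torch

def normalize(z_tilde):
    # Eq. (2): per-dimension L2 normalization over the batch, then centering
    z_bar = z_tilde / torch.sqrt((z_tilde ** 2).sum(dim=0, keepdim=True))
    return z_bar - z_bar.mean(dim=0, keepdim=True)

def swis_loss(z_tilde, z_tilde_prime):
    # z_tilde, z_tilde_prime: (N, D) projector outputs of a positive pair
    N, D = z_tilde.shape
    z, z_prime = normalize(z_tilde), normalize(z_tilde_prime)
    c = z.T @ z_prime                                # D x D pseudo cross-covariance
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()   # drive diagonal terms to 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate
    return (on_diag + off_diag) / N                  # Eq. (1)
```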

Figure 1: Illustration of the pseudo cross-covariance matrix. Blocks of the same colour indicate outputs from the two elements of a positive pair sharing the same parent image. One element from each positive pair is collected to form z, and the other to form z'.

From Eq. 1, we can see that optimizing the proposed loss function decorrelates the dimensions of the output by converting the pseudo cross-covariance matrix into a diagonal matrix with positive diagonal values. This enforces positive correlation between outputs in the same dimension for the samples of a positive pair. It also linearly decorrelates one dimension from another, so that the representations learnt by the encoder discard redundant information and maximize useful information. In other words, the representation learning process can be described as learning linearly decorrelated generative factors of the input. Note that the cross-covariance and auto-covariance terms are not used in their true sense, as the two samples in a positive pair are not identical.

III-B Pretraining Model Architecture

The model architecture used in the pretraining phase is shown in Figure 2. The diagram shows the input that is fed to the ResNet-18 [he2016deep] encoder: each image is reshaped into its 169 patches of dimension 32×32 before being passed through the encoder. Figure 2 also shows an example of the input used in the pretraining phase.

Figure 2: Model architecture used in the pretraining phase of the proposed method.

III-C Downstream Evaluation

For predicting whether a signature is forged or genuine, we do not train any siamese neural network model for the downstream task. Instead, we take 8 reference signatures for each user and use them to train a Support Vector Machine (SVM) classifier with a radial basis function kernel. Since the objective of our work is writer-independent signature verification, we assume that the identity of the user whose signature is being verified is known. If the signature of a particular user is forged, it is expected to be mapped far from the cluster of genuine reference signatures; thus, we assume that the forged signature will be mapped outside the decision boundary of that particular user. If the user is predicted correctly and the signature is genuine, we count it as a correct prediction. Similarly, if the predicted user and the ground-truth user are not the same and the signature is actually forged, it is also counted as a correct prediction. In all other cases, the prediction is considered wrong.
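The decision rule above can be sketched with scikit-learn as follows. The writer-identity-as-class formulation matches the description above, but the variable names and the synthetic stand-in data are our own assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-ins: 10 writers x 8 reference signatures, 512-d features
# (in practice these come from the pretrained encoder)
rng = np.random.default_rng(0)
ref_feats = rng.normal(size=(80, 512))
ref_ids = np.repeat(np.arange(10), 8)    # writer identity of each reference

clf = SVC(kernel="rbf")                  # RBF-kernel SVM over writer classes
clf.fit(ref_feats, ref_ids)

def verify(query_feat, claimed_writer):
    """Accept as genuine iff the query maps inside the claimed writer's
    decision region; otherwise flag it as forged."""
    predicted = clf.predict(query_feat.reshape(1, -1))[0]
    return predicted == claimed_writer
```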

By using an SVM classifier, we rely on the feature extraction capability of the pretrained encoder to express the input in terms of its linearly decorrelated generative factors, whereas contemporary state-of-the-art supervised algorithms use siamese-type architectures or supervised contrastive learning frameworks for the offline signature verification task. Our approach also reduces the downstream inference time by several orders of magnitude.

IV Experimental Details

In this section, we are going to discuss the details of the datasets that were used in our experiments, and the configurations used for training our encoder in the pretext (or pretraining) task.

IV-A Datasets

In this work, we used two datasets, namely BHSig260 [pal2016performance] and ICDAR 2011 [alvarez2016offline]. The BHSig260 dataset contains signatures from 100 writers in Bengali and 160 writers in Hindi. For each writer in both languages, there are 24 genuine and 30 forged signatures. Among the 100 writers in the Bengali subset, we randomly selected 50 writers for the training set and used the remaining 50 for testing. For the Hindi subset, we randomly selected 50 writers for self-supervised pretraining and left the remaining 110 writers for testing. The ICDAR 2011 Signature Verification dataset contains signatures in Dutch and Chinese. The Dutch subset contains signatures from 64 writers; 10 writers are included in the training set as reference and the remaining 54 form the test set. These divisions are provided with the dataset. The Dutch subset contains 362 signatures in the training set and 1932 signatures in the test set; of the 1932 test signatures, 646 are used as reference signatures for the writers and the rest as unknown samples. Similarly, for the Chinese subset, the reference (training) set and the test set each contain signatures from 10 writers, with 575 and 602 signatures, respectively. Of the 602 test signatures, 115 are used as reference signatures and the rest as unknown samples. In the test set, there are 8 genuine reference signatures per writer. To adhere to this structure, we randomly selected 8 genuine signatures per writer from the test set of the BHSig260 dataset and used them as the reference set, for both Bengali and Hindi.

The train and test set divisions used in our work are described in Table I below:

Dataset               No. of writers
                      Training    Test
ICDAR2011 Dutch       10          54
ICDAR2011 Chinese     10          10
BHSig260 Bengali      50          50
BHSig260 Hindi        50          110
TABLE I: Train and test set divisions of the signature verification datasets

IV-B Pretraining Experiments Configuration

For the pretraining phase, we used a different number of epochs for each dataset. The models were trained by optimizing the loss function given by Eq. 1 using the LARS [you2017large] optimizer, with a learning rate of 0.1 and a momentum of 0.9. The batch-normalization and bias parameters were excluded from weight normalization. The learning rate followed a cosine decay schedule with a linear warmup period of 10 epochs at the start; the decay was scheduled over 1000 epochs irrespective of the actual number of training epochs.
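As an illustration, the warmup-plus-cosine schedule described above can be realised with a LambdaLR wrapper in PyTorch. LARS is not part of core PyTorch, so plain SGD with momentum stands in here, and the model is a placeholder; only the schedule itself follows the stated configuration.

```python
import math
import torch

base_lr, warmup_epochs, decay_horizon = 0.1, 10, 1000

def lr_lambda(epoch):
    # Linear warmup for the first 10 epochs, then cosine decay scheduled
    # over a fixed 1000-epoch horizon, as described above
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (decay_horizon - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))

model = torch.nn.Linear(512, 512)   # stand-in for the encoder + projector
# The paper uses LARS; SGD with momentum stands in here for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```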

For the ICDAR datasets, we pretrained the model for 500 epochs, whereas for the BHSig260 dataset, pretraining was carried out for only 200 epochs. For both datasets, the batch size was 32.

To ensure that the pretrained models learn generalized and robust features, we applied several augmentations, such as color jittering, affine transformation and random cropping to 224×224. The augmented images were normalized to the range [-1.0, +1.0].
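A plausible torchvision realisation of these augmentations is sketched below; the exact jitter and affine parameters are not given in the paper, so the values here are assumptions, and a single-channel (grayscale) input is assumed for the normalization.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4),      # assumed strengths
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)), # assumed ranges
    transforms.RandomCrop(224, pad_if_needed=True),            # random 224x224 crop
    transforms.ToTensor(),                                     # pixels -> [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),               # rescale to [-1, +1]
])
```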

As not all images in the datasets are perfectly cropped, we crop the signature images such that the input to the encoder is a tightly bounded signature image. To achieve this, we perform Otsu's thresholding [otsu1979threshold], followed by finding the bounding box with the least area containing all non-zero pixels around the centre of mass of the image. After this preprocessing step, the images were divided into 169 patches of dimension 32×32 and fed to the encoder for training.
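A compact OpenCV sketch of this cropping step is given below. It implements Otsu's thresholding followed by a tight axis-aligned bounding box over all foreground pixels, which is a simplified reading of the centre-of-mass-based minimum-area box described above; the file name is hypothetical.

```python
import cv2

def tight_crop(gray):
    # Otsu's thresholding; inversion makes the signature ink the foreground
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ys, xs = binary.nonzero()
    # Tight axis-aligned bounding box around all foreground pixels
    cropped = gray[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(cropped, (224, 224))  # resized before patch extraction

gray = cv2.imread("signature.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
patch_ready = tight_crop(gray)
```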

V Experimental Results

V-A Downstream Results

TABLE II: Comparison of the proposed method with state-of-the-art self-supervised learning baselines.

Method                   ICDAR 2011 Dutch [alvarez2016offline]   ICDAR 2011 Chinese [alvarez2016offline]   BHSig260 Bengali [pal2016performance]   BHSig260 Hindi [pal2016performance]
                         Accuracy (%)  FAR    FRR                Accuracy (%)  FAR    FRR                  Accuracy (%)  FAR    FRR                Accuracy (%)  FAR    FRR
SimCLR [chen2020simple]  69.46         0.554  0.060              59.76         0.431  0.317                73.45         0.117  0.543              72.45         0.103  0.599
Proposed                 77.62         0.316  0.133              64.68         0.278  0.583                72.04         0.367  0.116              72.43         0.104  0.598

The downstream task we consider in our work is the writer-independent classification of signatures into two classes: genuine or forged. The predictions were obtained using the procedure described in Section III-C. The results obtained by the proposed model on the ICDAR 2011 and BHSig260 signature verification datasets are given in Table II. These two datasets contain a large number of signatures in different languages, allowing us to validate the effectiveness of the proposed model.

V-B Comparison with SOTA Self-supervised Algorithms

In this section, we show how the proposed loss function fares at training the encoder to learn representations from the data. As shown in Table II, in spite of being trained in a self-supervised manner, the proposed framework performs satisfactorily on both multilingual datasets. Table II also presents the comparative results of one of the state-of-the-art self-supervised algorithms, SimCLR, on the same data. From Fig. 3, we can see that the proposed algorithm produces more distinct clusters on the ICDAR 2011 Chinese and BHSig260 Bengali datasets, whereas the plots for the ICDAR 2011 Dutch and BHSig260 Hindi datasets look equally well-clustered for both the proposed model and SimCLR. It should be mentioned that SimCLR was trained for 1000 epochs on the ICDAR 2011 dataset (both Dutch and Chinese), whereas the proposed model was trained for only 500 epochs on the same datasets. Both methods were trained for 200 epochs on the BHSig260 dataset.

[Figure 3: t-SNE visualisations for ICDAR 2011 Dutch, ICDAR 2011 Chinese, BHSig260 Bengali and BHSig260 Hindi. Panels (a), (c), (e), (g) are obtained by the proposed method; panels (b), (d), (f), (h) by SimCLR [chen2020simple]. The colour coding denotes each writer cluster.]

V-C Comparison with Supervised Methods

To further validate our proposed self-supervised pipeline for offline signature verification, we compare its performance with some fully supervised methods in the literature. The results are tabulated in Table III. We observe that the proposed framework performs competitively against the fully supervised works on the BHSig260 datasets, outperforming [pal2016performance] by a large margin on the Bengali signature dataset. Moreover, the low FAR and FRR values obtained by the proposed method on the signature datasets affirm its potential in separating forged signatures from genuine ones.

TABLE III: Comparison of the proposed method with supervised learning methods in the literature.

Method                             BHSig260 Bengali [pal2016performance]   BHSig260 Hindi [pal2016performance]
                                   Accuracy (%)  FAR     FRR               Accuracy (%)  FAR     FRR
Pal et al. [pal2016performance]    66.18         0.3382  0.3382            75.53         0.2447  0.2447
Dutta et al. [dutta2016compact]    84.90         0.1578  0.1443            85.90         0.1310  0.1509
Dey et al. [dey2017signet]         86.11         0.1389  0.1389            84.64         0.1536  0.1536
Alaei et al. [alaei2017efficient]  –             0.1618  0.3012            –             0.1618  0.3012
Proposed                           72.04         0.367   0.116             72.43         0.104   0.598

VI Conclusion

In this work, we proposed a self-supervised representation learning framework with a novel loss function that decorrelates the feature dimensions from each other to discard redundant features and encourage learning of linearly uncorrelated generative features of the input. We apply this loss function to learn stroke information from offline grayscale signature images for the task of writer-independent signature verification. Through t-SNE plots, we show that the proposed algorithm extracts better uncorrelated information from the input than state-of-the-art self-supervised methods on the same datasets. We compare the proposed method with SimCLR on the ICDAR 2011 (Dutch and Chinese) and BHSig260 (Bengali and Hindi) datasets. From the comparative results, it is evident that the proposed method performs better than or on par with the state-of-the-art algorithm SimCLR. This work shows the scope and applicability of the proposed method in the field of computer vision and paves the way for further research in this direction.