
Variational Augmentation for Enhancing Historical Document Image Binarization

Avirup Dey, Nibaran Das, and Mita Nasipuri
Jadavpur University, Kolkata, West Bengal, India, 700032
(2022)
Abstract.

Historical Document Image Binarization is a well-known segmentation problem in image processing. Despite their ubiquity, traditional thresholding algorithms have achieved limited success on severely degraded document images. With the advent of deep learning, several segmentation models were proposed that made significant progress in the field but were limited by the unavailability of large training datasets. To mitigate this problem, we propose a novel two-stage framework: the first stage comprises a generator that produces degraded samples using variational inference, and the second is a CNN-based binarization network that trains on the generated data. We evaluated our framework on a range of DIBCO datasets, where it achieved competitive results against previous state-of-the-art methods.

Binarization, DIBCO, GANs
copyright: rights retained; doi: 10.475/123_4; isbn: 123-4567-24-567/08/06; conference: 13th Indian Conference on Computer Vision, Graphics and Image Processing, December 2022, Gandhinagar, India; journalyear: 2022; price: 15.00; ccs: Computing methodologies, Image segmentation

1. Introduction

One of the most important preprocessing steps in the analysis of document images is their binarization. The classification of image pixels into text and non-text regions greatly aids systems further down the pipeline, typically optical character recognition algorithms. While binarization is often performed with a simple threshold that dictates how each image pixel is mapped in the output image, advanced binarization algorithms are needed when the task also involves the restoration of document images in order to obtain an ideal text/non-text demarcation.

Over the years, many restoration algorithms have been proposed focusing on the removal of degradations found in document images. Traditional image processing-based algorithms like Otsu (Otsu, 1979), Niblack (Niblack, 1985) and Sauvola (Sauvola and Pietikäinen, 2000) deal with minor degradations successfully and are particularly popular in this domain. These methods, however, do not perform successful binarization when the degradations are severe, typically those observed in historical document images. Binarization of severely degraded document images has been addressed recently with the help of deep learning algorithms, especially CNNs (LeCun et al., 2015), which have shown significant promise in this domain. Tensmeyer and Martinez proposed a fully convolutional network to binarize document images in (Tensmeyer and Martinez, 2017), treating binarization as a special case of semantic segmentation. Calvo-Zaragoza and A.J. Gallego proposed a deep encoder-decoder architecture, outperforming previous binarization algorithms in (Calvo-Zaragoza and Gallego, 2019). Vo et al. (Vo et al., 2018) proposed a hierarchical deep learning framework that achieved state-of-the-art results on multiple datasets. Later, He and Schomaker (He and Schomaker, 2019) proposed an iterative CNN framework and achieved similar results. Suh et al. (Suh et al., 2020) proposed a two-stage adversarial framework for binarizing RGB images. The first stage trains an adversarial network for each channel of the RGB image to extract foreground information from small local image patches by removing background information for document image enhancement. The second stage learns local and global features to produce the final binarized output. This algorithm outperformed previous methods and achieved state-of-the-art performance on multiple datasets.

Although the previous methods can perform successful binarization of degraded documents, they are constrained to grayscale inputs and require voluminous training data. To deal with the lack of adequate training data, Bhunia et al. (Bhunia et al., 2019) introduced a texture augmentation network (TANet) in conjunction with a binarization network that generates degraded document images with the help of style transfer. This method achieves performance competitive with previous methods but is constrained by the one-to-one mapping of style transfer-based approaches. Capobianco and Marinai (Capobianco and Marinai, 2017) proposed a toolkit to generate synthetic document images by combining text and degradations from a given palette. One could iteratively generate multiple images from a given text, but this requires a lot of manual cherry-picking of degradation styles.

In this paper we propose a novel two-stage framework addressing both the data scarcity in this domain and the binarization of severely degraded document images containing multiple artefacts in the form of stains and bleed-through. In the first stage, we employ an augmentation network, Aug-Net, inspired by BicycleGAN (Zhu et al., 2017a), that generates novel degraded samples from a single image-ground truth pair using variational inference. Combining the strengths of GANs (Goodfellow et al., 2014) and VAEs (Kingma and Welling, 2013), this network encodes the content and style of the input image into a probability distribution, allowing the model to generate multiple outputs from a single input by sampling from the distribution during inference. Additionally, it resolves the blurring effect of VAEs by encouraging a bijective mapping between the latent code and the output, making the generated samples more realistic.

In the following stage, we employ a paired image-to-image translation network, Bi-Net, inspired by Pix2Pix (Isola et al., 2017), to binarize the images. While previous methods performed binarization from grayscale images, we work with RGB images, attempting to capture the nature of the degradation from multiple channels. Our architecture further employs PixelShuffle (Shi et al., 2016) upsampling, replacing the traditional transposed convolutional layer. PixelShuffle uses Efficient Sub-Pixel Convolution (ESPC), which is computationally less expensive and has shown better results in upscaling images and videos.

In our experiments, we trained our network on the DIBCO 2009 (Gatos et al., 2009), DIBCO 2010 (Pratikakis et al., 2010), DIBCO 2011 (Pratikakis et al., 2011), DIBCO 2013 (Pratikakis et al., 2013), and DIBCO 2017 (Pratikakis et al., 2017) datasets and tested it on DIBCO 2014 (Ntirogiannis et al., 2014), DIBCO 2016 (Pratikakis et al., 2016), DIBCO 2018 (Pratikakis et al., 2018) and DIBCO 2019 (Pratikakis et al., 2019), using metrics like F-measure, pseudo F-measure, PSNR and DRD to evaluate performance. Our method achieves competitive performance on the said datasets against several state-of-the-art models, including Mesquita et al. (Mesquita et al., 2015) (winner of DIBCO 2014), Kligler and Tal (Kligler et al., 2018) (winner of DIBCO 2016), Xiong et al. (Xiong et al., 2018) (winner of DIBCO 2018) and Bera et al. (Bera et al., 2021) (winner of DIBCO 2019).

Our contributions can be summarised as follows:

  • We have used variational inference to generate novel images for training, scaling up the training data many-fold and thus mitigating the problem of limited training samples.

  • We have used PixelShuffle upsampling instead of transposed convolutions, and short skip connections in addition to the long symmetric skip connections, in the generator of our binarization model. These modifications not only yielded better results but also enabled faster training.

  • We have evaluated our model on multiple metrics against several state-of-the-art binarization algorithms, including the winners of previous DIBCO contests, showing that our proposed method achieves significant improvements over existing algorithms.

The paper is organised as follows: Section 2 summarizes the relevant literature, Section 3 covers our proposed methodology comprising the network architectures and training details, Section 4 and Section 5 cover our experimental results and ablation studies, respectively, and Section 6 concludes the paper.

2. Related Works

The overall goal of document binarization is to convert an input image into a two-tone version, enabling easy demarcation of text and non-text regions and enhancing the information that can be extracted from it. Towards this goal, a plethora of methods have been proposed, broadly classified into two groups: classical image processing-based methods and deep learning-based algorithms.

2.1. Image Processing Based Methods

Traditional image processing based methods like those proposed in (Otsu, 1979), (Niblack, 1985), and (Sauvola and Pietikäinen, 2000) formed the baseline of research in the domain of document image binarization. One of the most popular algorithms was proposed by Otsu in (Otsu, 1979); it computes a threshold that minimizes the intra-class variance and maximizes the inter-class variance of two pre-assumed classes. It selects the global threshold based on a histogram, and owing to its simplicity this algorithm is very fast. However, it is sensitive to deep stains, non-uniform backgrounds and bleed-through degradations.
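For illustration, the following is a minimal NumPy sketch of Otsu's histogram-based criterion; the function name and the 8-bit, 256-bin assumptions are ours and not part of the original formulation.

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the global threshold maximizing between-class variance
    of an 8-bit grayscale image (Otsu, 1979)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2          # inter-class variance
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Usage: binary = (gray >= otsu_threshold(gray)).astype(np.uint8) * 255
```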

To address the sensitivity of global thresholding to such degradations, local adaptive threshold methods were proposed, such as Sauvola (Sauvola and Pietikäinen, 2000), Niblack (Niblack, 1985) and AdOtsu (Moghaddam and Cheriet, 2012). These methods compute the local threshold for each pixel based on the mean and standard deviation of a local area around the pixel. Following these methods, several binarization methods have been proposed. Gatos et al. (Gatos et al., 2006) used a Wiener filter to estimate the background and foreground regions in a method that employs several post-processing steps to remove noise in the background and improve the quality of foreground regions. Su et al. (Su et al., 2012) introduced a map combining the local image contrast and the local image gradient; the local threshold is estimated based on the values on the detected edges in a local region. Pai et al. (Pai et al., 2010) proposed an adaptive window-size selection method based on the foreground characteristics. Jia et al. (Jia et al., 2018) employed the structural symmetry of strokes to compute the local threshold. Xiong et al. (Xiong et al., 2021) proposed an entropy-based formulation to segregate text from the image background and employed energy-based segmentation to binarize images. They achieved competitive results on DIBCO benchmarks.

However, these adaptive binarization methods require many empirical parameters that need to be manually fine-tuned and are still not satisfactory for use with highly degraded and poor-quality document images.

2.2. Deep Learning Based Methods

In recent years, convolutional neural networks (CNNs) (LeCun et al., 2015) have achieved several milestones in a variety of tasks in computer vision. Tensmeyer and Martinez (Tensmeyer and Martinez, 2017) used a fully convolutional network for document image binarization at multiple image scales. A deep auto-encoder-decoder architecture was proposed by Calvo-Zaragoza and Gallego (Calvo-Zaragoza and Gallego, 2019). Vo et al. (Vo et al., 2018) proposed a hierarchical deep supervised network to predict the full foreground map through the results of multi-scale networks; this method achieved state-of-the-art performance on several benchmarks. He and Schomaker (He and Schomaker, 2019) introduced an iterative CNN-based framework and achieved performance similar to that of (Vo et al., 2018)'s method. Souibgui et al. (Souibgui et al., 2022) developed a transformer-based (Vaswani et al., 2017) encoder-decoder framework which achieved results comparable to the previous methods.

Recently, generative adversarial networks (GANs) (Goodfellow et al., 2014) have emerged as a class of CNN models approximating real-world images, achieving significant success in image synthesis. Over the years, several variants have been proposed for image translation tasks as well. In GANs, a generator network competes against a discriminator network that distinguishes between generated and real images. Unlike the original GAN, the cGAN (Mirza and Osindero, 2014) conditions both the generator and the discriminator on additional inputs, such as class labels, partial data, or input images, so the generator must not only fool the discriminator but also respect the conditioning. Based on this principle, Isola et al. (Isola et al., 2017) proposed the Pix2Pix GAN for general-purpose image-to-image translation. Bhunia et al. (Bhunia et al., 2019) proposed a texture augmentation network to augment the training datasets and handle image binarization using a cGAN structure. Zhao et al. (Zhao et al., 2019) proposed a cascaded network based on Pix2Pix GAN to combine global and local information. Suh et al. (Suh et al., 2020) proposed a two-stage GAN framework, employing adversarial networks to remove noise from each channel separately in the first stage and fine-tuning the output using a similar network in the second stage.

3. Proposed Methodology

3.1. Overview

The application of deep learning algorithms in the field of historical document image binarization is limited due to the dearth of training data. In the past, researchers had to either curate and preprocess images manually from multiple sources, as shown in (Capobianco and Marinai, 2017), or use style transfer-based augmentation, as shown by Bhunia et al. (Bhunia et al., 2019) and Kumar et al. (Kumar et al., 2021). Despite their merits, both methods fail to capture the diversity of degradations real historical documents may exhibit.

The strength of our method lies in the generation of novel synthetic data in situ while training the binarization model, as shown in Fig. 1. In the first stage, an augmentation network, inspired by BicycleGAN (Zhu et al., 2017a), generates multiple degraded samples from an input-ground truth pair via variational inference. These synthetic samples are, in turn, used to train our binarization network in the next stage. We have employed a paired image translation procedure, following the work of Isola et al. (Isola et al., 2017).

Figure 1. Training the Binarization Network: Stage-I comprises the variational generator of the augmentation network (Aug-Net) that synthesizes novel training samples from a given input. Stage-II comprises the binarization network (Bi-Net) that is trained to transform the degraded images into binary masks.

3.2. Network Architectures

3.2.1. (A) Augmentation Network (Aug-Net)

The network comprises two components, cVAE-GAN (Larsen et al., 2016) and cLR-GAN (Dumoulin et al., 2016), which complement each other in the training process. cVAE-GAN learns the encoding from real data, but a random latent code may not yield realistic images at test time, and the KL loss may not be well optimized. More importantly, the discriminator does not have a chance to see results sampled from the prior during training. On the other hand, in cLR-GAN, the latent space is easily sampled from a simple distribution, but the generator is trained without the benefit of observing ground truth input-output pairs. Combining the two therefore helps us produce results that are diverse as well as realistic. Fig. 2 outlines the training procedure of our augmentation network.

Figure 2. Training the Augmentation Network: cVAE-GAN starts from a ground truth target image B and encodes it into the latent space. The generator then attempts to map the input image A along with a sampled z back into the original image B. cLR-GAN randomly samples a latent code from a known distribution, uses it to map A into the output B, and then tries to reconstruct the latent code from the output.

We use the trained generator of the cVAE-GAN to generate new degraded images while training our binarization network.
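The following is a minimal sketch of how Stage-I can be used at augmentation time; the generator signature G(A, z) and the helper name are assumptions made for illustration, with the latent size taken from Section 3.3.1.

```python
import torch

@torch.no_grad()
def synthesize_degraded(generator, clean_input, n_samples=4, z_dim=8):
    """Draw several latent codes from the prior N(0, I) and decode each into a
    differently degraded rendering of the same conditioning image.
    generator: trained cVAE-GAN generator G(A, z); clean_input: (1, C, H, W)."""
    samples = []
    for _ in range(n_samples):
        z = torch.randn(clean_input.size(0), z_dim, device=clean_input.device)
        samples.append(generator(clean_input, z))
    return torch.cat(samples, dim=0)  # (n_samples, C, H, W)
```

Each generated sample, paired with its ground truth, is fed to Stage-II as an additional training example.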

3.2.2. (B) Binarization Network (Bi-Net)

Our binarization network is inspired by the well-known Pix2Pix GAN first proposed by Isola et al. (Isola et al., 2017), with some changes in the generator architecture. Firstly, we have built the U-Net generator on top of a ResNet (He et al., 2016) backbone, which helps stabilize training. Secondly, we have swapped the transposed convolutions in the upsampling blocks with PixelShuffle (Shi et al., 2016), which produces sharper outputs.
The PixelShuffle layer implements Efficient Sub-Pixel Convolution. The operation is described by the following equation:

(1) I^{HR} = f^{L}(I^{LR}) = PS(W_{L} * f^{L-1}(I^{LR}) + b_{L})

where $I^{HR}$ denotes the high-resolution image, $I^{LR}$ denotes the low-resolution image, $W_{L}$ and $b_{L}$ are the weights and bias of the $L$-th convolutional layer, $f^{L-1}(I^{LR})$ is the output of the preceding layer, and $PS$ is a periodic shuffling operator that rearranges the elements of an $H\times W\times C\cdot r^{2}$ tensor into a tensor of shape $rH\times rW\times C$. This has been illustrated in Figure 3.
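As a small PyTorch sketch of the ESPC upsampling block assumed in our generator (channel widths and kernel size are illustrative choices, not the exact configuration):

```python
import torch
import torch.nn as nn

class PixelShuffleUp(nn.Module):
    """Convolution producing C*r^2 channels followed by the periodic shuffle PS:
    (N, C*r^2, H, W) -> (N, C, r*H, r*W)."""
    def __init__(self, in_ch, out_ch, r=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

x = torch.randn(1, 64, 32, 32)            # low-resolution feature map
y = PixelShuffleUp(64, 64, r=2)(x)
print(y.shape)                            # torch.Size([1, 64, 64, 64])
```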

We employed a PatchGAN (Li and Wand, 2016) discriminator with an output scale of $70\times 70$.
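A sketch of a standard 70×70 PatchGAN discriminator in the spirit of (Isola et al., 2017; Li and Wand, 2016) follows; the channel widths are the common Pix2Pix configuration, and the 4-channel input (RGB degraded image plus 1-channel mask) is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Each output logit judges one local patch of the (input, output) pair."""
    def __init__(self, in_ch=4):  # assumed: 3-channel degraded image + 1-channel mask
        super().__init__()
        def block(ci, co, stride, norm=True):
            layers = [nn.Conv2d(ci, co, 4, stride, 1)]
            if norm:
                layers.append(nn.BatchNorm2d(co))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers
        self.net = nn.Sequential(
            *block(in_ch, 64, 2, norm=False),
            *block(64, 128, 2),
            *block(128, 256, 2),
            *block(256, 512, 1),
            nn.Conv2d(512, 1, 4, 1, 1))   # per-patch logits

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

logits = PatchDiscriminator()(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
print(logits.shape)  # torch.Size([1, 1, 30, 30]); each logit has a 70x70 receptive field
```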

Figure 3. Efficient Sub-Pixel Convolution: A low-resolution image is passed through convolution layers to generate $r^{2}$ feature maps, which are aggregated to give a higher-resolution output.

3.3. Objective Functions

3.3.1. (A) For Augmentation Network

We train our augmentation network on a combination of losses that includes the KL divergence, adversarial losses and an $L_{1}$ regularization loss, as proposed by (Zhu et al., 2017a). The objective is given by:

G^{*},E^{*} = \arg\min_{G,E}\max_{D} \mathcal{L}_{GAN}^{VAE}(G,D,E) + \lambda\mathcal{L}^{VAE}_{1}(G,E) + \mathcal{L}_{GAN}(G,D) + \lambda_{latent}\mathcal{L}_{1}^{latent}(G,E) + \lambda_{KL}\mathcal{L}_{KL}(E)

$\lambda=10$, $\lambda_{latent}=0.5$ and $\lambda_{KL}=0.01$ were the optimum hyper-parameters. The length of the latent vector was taken as $|z|=8$.
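A minimal sketch of how the weighted terms above are combined for a single generator/encoder update; the loss values below are illustrative stand-ins, since in practice they come from the discriminators, the encoder and the generator of Fig. 2.

```python
import torch

# Illustrative stand-ins for the individual loss terms of one training step.
loss_gan_vae   = torch.tensor(0.70, requires_grad=True)  # adversarial loss, cVAE-GAN path
loss_l1_vae    = torch.tensor(0.20, requires_grad=True)  # L1 reconstruction of the target image
loss_gan_lr    = torch.tensor(0.60, requires_grad=True)  # adversarial loss, cLR-GAN path
loss_l1_latent = torch.tensor(0.10, requires_grad=True)  # L1 reconstruction of the latent code
loss_kl        = torch.tensor(0.05, requires_grad=True)  # KL divergence of the encoder posterior

lam, lam_latent, lam_kl = 10.0, 0.5, 0.01  # hyper-parameters from Section 3.3.1

loss_GE = (loss_gan_vae + lam * loss_l1_vae + loss_gan_lr
           + lam_latent * loss_l1_latent + lam_kl * loss_kl)
loss_GE.backward()
```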

3.3.2. (B) For Binarization Network

We train our binarization network on a combination of the conditional adversarial loss proposed by Isola et al. (Isola et al., 2017) and the $L_{1}$ loss. The conditional adversarial loss is given by:

(2) \mathcal{L}_{cGAN} = \mathbb{E}_{x,y}[\log D(x,y)] + \mathbb{E}_{x,z}[\log(1-D(x,G(x,z)))]

Conditioning the discriminator on the input $x$ forces the generator to produce images that are perceptually similar in structure to the target. It also reduces blurring artefacts. Adding an $L_{1}$ loss encourages the generator to achieve pixel-level accuracy in the generated images.
The overall objective can be written as:

(3) G^{*} = \arg\min_{G}\max_{D} \mathcal{L}_{cGAN}(G,D) + \lambda\mathcal{L}_{L1}(G)

$\lambda$ is a hyper-parameter that can be tuned to weight the loss terms. We have trained our model with $\lambda = 100$.
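A sketch of how the objective in Eq. 3 translates into per-step losses, assuming a PatchGAN discriminator that returns per-patch logits; the function names are ours.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc, x, y_real, y_fake, lam=100.0):
    """Conditional adversarial loss plus weighted L1 (Eq. 3), from the generator's side."""
    pred_fake = disc(x, y_fake)
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    return adv + lam * F.l1_loss(y_fake, y_real)

def discriminator_loss(disc, x, y_real, y_fake):
    """The discriminator is trained to separate real pairs from generated ones."""
    pred_real = disc(x, y_real)
    pred_fake = disc(x, y_fake.detach())
    return 0.5 * (
        F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
        + F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))
```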

3.4. Training

Both the networks were trained with patches extracted from DIBCO 2009 (Gatos et al., 2009), 2010 (Pratikakis et al., 2010), 2011 (Pratikakis et al., 2011), 2013 (Pratikakis et al., 2013) and 2017 (Pratikakis et al., 2017) datasets.
The augmentation network was trained on $256\times256$ patches extracted from the said datasets for 6 epochs using the Adam optimizer.
For training the binarization network, patches of size $512\times512$ were extracted from each image to obtain 13320 samples. Each sample was further reduced to $256\times256$ using augmentations like random crop and resize. The generator was pre-trained with just the $L_{1}$ loss for 5 epochs. The model was then trained adversarially for 20 epochs using the Adam optimizer ($\beta_{1}=0.5$ and $\beta_{2}=0.999$) with a learning rate of $2\times10^{-4}$.
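A condensed sketch of this schedule; the one-layer generator and single-batch loader are trivial stand-ins so the snippet runs, not the actual Bi-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: the real Bi-Net generator/discriminator and the DIBCO patch loader go here.
bi_net_G = nn.Conv2d(3, 1, kernel_size=3, padding=1)
loader = [(torch.randn(2, 3, 256, 256), torch.rand(2, 1, 256, 256))]

opt_G = torch.optim.Adam(bi_net_G.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Phase 1: warm up the generator with the L1 loss alone (5 epochs in our setup).
for epoch in range(5):
    for x, y in loader:
        opt_G.zero_grad()
        F.l1_loss(bi_net_G(x), y).backward()
        opt_G.step()

# Phase 2 (not shown): 20 epochs of adversarial training with the combined
# objective of Section 3.3.2, alternating discriminator and generator updates.
```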

4. Experimental Results


Figure 4. From left to right, top to bottom: Input, Ground Truth, Otsu, Sauvola, Niblack, Suh, Bhunia, Ours

4.1. Datasets

All comparisons were carried out using DIBCO datasets. For evaluation purposes, we have used DIBCO 2014 (Ntirogiannis et al., 2014), DIBCO 2016 (Pratikakis et al., 2016), DIBCO 2018 (Pratikakis et al., 2018) and DIBCO 2019 (Pratikakis et al., 2019) datasets.

Data Preparation: Each DIBCO dataset has 10 degraded historical document images for evaluation. We extracted 5 random patches of dimension $256\times256$ from each image, giving us a total of 50 samples from each dataset for evaluation.
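A minimal sketch of this patch-extraction step; the helper name and fixed seed are ours, and the image is assumed to be at least 256 pixels in each dimension.

```python
import numpy as np

def random_patches(image, gt, n=5, size=256, seed=0):
    """Extract n random, aligned size x size patches from an image and its ground truth."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    patches = []
    for _ in range(n):
        top = int(rng.integers(0, h - size + 1))
        left = int(rng.integers(0, w - size + 1))
        patches.append((image[top:top + size, left:left + size],
                        gt[top:top + size, left:left + size]))
    return patches
```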

4.2. Evaluation Metrics

  • F Measure:

    (4) FM = \frac{2\times Precision\times Recall}{Precision+Recall},

    where $Precision=\frac{TP}{TP+FP}$ and $Recall=\frac{TP}{TP+FN}$. TP, FP and FN denote true positives, false positives and false negatives, respectively.

  • Pseudo-F-Measure:

    (5) pFM = \frac{2\times Precision\times pRecall}{Precision+pRecall},

    where $pRecall$ denotes the recall computed on the skeletonized ground truth image (Pratikakis et al., 2010).

  • PSNR:

    (6) PSNR = 10\log\left(\frac{C^{2}}{MSE}\right),

    where $MSE$ is the mean squared error between the two images and $C$ is the maximum pixel value. The images were normalized, so here $C=1$.
    PSNR is a widely used metric to measure the similarity between two images.

  • DRD:

    (7) DRD = \frac{\sum_{k} DRD_{k}}{NUBN}

    where $DRD_{k}$ is the distortion of the $k$-th flipped pixel, and $NUBN$ is the number of non-uniform $8\times8$ blocks in the ground truth image (Ntirogiannis et al., 2014). A minimal sketch computing F-Measure and PSNR is given after this list.
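As referenced above, a minimal NumPy sketch of the two simpler metrics; pseudo-F-Measure and DRD additionally require skeletonization and a distortion weight matrix, so they are omitted here.

```python
import numpy as np

def f_measure(pred, gt, eps=1e-8):
    """pred, gt: boolean arrays where True marks text (foreground) pixels."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)

def psnr(pred, gt, C=1.0):
    """pred, gt: images normalized to [0, 1]; C is the maximum pixel value."""
    mse = np.mean((pred.astype(float) - gt.astype(float)) ** 2)
    return 10 * np.log10(C ** 2 / max(mse, 1e-12))
```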

4.3. Results and Discussion

From Table 1 we can see that our proposed model achieves competitive scores against previous state-of-the-art methods, and in some cases even surpasses them.
We also observe that, unlike the winning models of the DIBCO contests ((Kligler et al., 2018), (Xiong et al., 2018), (Bera et al., 2021)), our model not only maximises pixel-level accuracy (measured by F-Measure and PSNR) but also preserves perceptual quality of the binarized image (measured by DRD). This is a clear consequence of using a conditional GAN loss.
We compared our model to a range of traditional thresholding algorithms like Otsu (Otsu, 1979), Niblack (Niblack, 1985) and Sauvola (Sauvola and Pietikäinen, 2000), as well as recent state-of-the-art methods such as Vo (Vo et al., 2018), Xiong (Xiong et al., 2021) and Suh (Suh et al., 2020). The winning models of DIBCO 2014 (Ntirogiannis et al., 2014), DIBCO 2016 (Pratikakis et al., 2016), DIBCO 2018 (Pratikakis et al., 2018) and DIBCO 2019 (Pratikakis et al., 2019) have also been included in our experiments.
Bhunia’s method (Bhunia et al., 2019), Suh’s method (Suh et al., 2020), Souibgui’s method (Souibgui and Kessentini, 2020) and Cycle-GAN (Zhu et al., 2017b) were trained on the datasets we used for training our own model (DIBCO 2009 (Gatos et al., 2009), DIBCO 2010 (Pratikakis et al., 2010), DIBCO 2011 (Pratikakis et al., 2011), DIBCO 2013 (Pratikakis et al., 2013) and DIBCO 2017 (Pratikakis et al., 2017)) with the hyper-parameters set by their respective authors.

Bhunia et al.’s model (Bhunia et al., 2019), Suh et al.’s model (Suh et al., 2020), Vo et al.’s model (Vo et al., 2018) and He’s model (He and Schomaker, 2019) were originally trained on the PHIBD (Nafchi et al., 2013) dataset, which has over 100 images, in addition to multiple DIBCO datasets. In contrast, our training dataset consisted of only 74 DIBCO images. In effect, we trained our model on almost half the number of training images used by previous deep learning-based methods, which demonstrates the strength of variational augmentation while training the binarization model.

Table 1. Comparative results on DIBCO datasets.
Method   F-Measure ↑   pF-Measure ↑   PSNR ↑   DRD ↓
DIBCO 2014
Rank 1 (Mesquita et al., 2015) 96.880 97.650 22.660 0.902
Otsu (Otsu, 1979) 91.780 95.740 18.720 2.647
Niblack (Niblack, 1985) 86.010 88.040 16.540 8.260
Sauvola (Sauvola and Pietikäinen, 2000) 86.830 91.800 17.630 4.896
Vo (Vo et al., 2018) 95.970 97.420 21.490 1.090
He (He and Schomaker, 2019) 95.950 98.760 21.600 1.120
Xiong (Xiong et al., 2021) 96.770 97.730 22.470 0.950
Bhunia (Bhunia et al., 2019) 73.753 74.000 12.679 7.850
Suh (Suh et al., 2020) 96.360 98.870 21.960 1.070
Soibgui (Souibgui and Kessentini, 2020) 96.020 95.300 19.870 4.100
Cycle-GAN (Zhu et al., 2017b) 84.530 86.792 15.278 5.993
Ours 96.860 97.450 22.250 1.530
DIBCO 2016
Rank 1 (Kligler et al., 2018) 87.610 91.280 18.110 5.210
Otsu (Otsu, 1979) 85.660 88.860 16.260 5.580
Niblack (Niblack, 1985) 72.570 73.510 13.260 24.650
Sauvola (Sauvola and Pietikäinen, 2000) 84.270 89.100 17.150 6.090
Vo (Vo et al., 2018) 90.010 93.440 18.740 3.910
He (He and Schomaker, 2019) 91.190 95.740 19.510 3.020
Xiong (Xiong et al., 2021) 89.640 93.560 18.690 4.030
Das (Das, 2019) 88.930 91.750 18.080 4.120
Bhunia (Bhunia et al., 2019) 65.525 65.145 12.595 8.270
Suh (Suh et al., 2020) 92.240 95.950 19.930 2.770
Soibgui (Souibgui and Kessentini, 2020) 88.760 87.230 19.450 7.380
Cycle-GAN (Zhu et al., 2017b) 81.564 87.434 14.681 7.143
Ours 94.333 96.081 20.086 1.316
DIBCO 2018
Rank 1 (Xiong et al., 2018) 88.340 90.240 19.110 4.920
Otsu (Otsu, 1979) 51.450 53.050 9.740 59.070
Niblack (Niblack, 1985) 41.180 41.390 6.790 99.460
Sauvola (Sauvola and Pietikäinen, 2000) 67.810 74.080 13.780 17.690
Xiong (Xiong et al., 2021) 88.340 90.370 19.110 4.930
Bhunia (Bhunia et al., 2019) 59.254 59.178 11.797 9.555
Suh (Suh et al., 2020) 84.950 91.577 17.040 16.861
Soibgui (Souibgui and Kessentini, 2020) 73.700 75.610 16.990 10.210
Cycle-GAN (Zhu et al., 2017b) 72.972 77.391 13.462 129.277
Ours 89.751 93.141 17.439 3.824
DIBCO 2019
Rank 1 (Bera et al., 2021) 72.875 72.150 14.475 16.235
Otsu (Otsu, 1979) 52.800 52.550 12.640 24.210
Niblack (Niblack, 1985) 51.510 53.860 10.540 31.050
Sauvola (Sauvola and Pietikäinen, 2000) 42.520 39.760 7.710 120.120
Bhunia (Bhunia et al., 2019) 53.340 55.995 11.779 9.256
Suh (Suh et al., 2020) 62.893 62.726 15.584 3.362
Soibgui (Souibgui and Kessentini, 2020) 70.330 71.470 12.220 8.910
Cycle-GAN (Zhu et al., 2017b) 74.916 75.189 14.307 6.814
Ours 75.130 75.101 14.802 5.248

5. Ablation Studies

Although our experiments show that the proposed method achieves competitive performance against existing state-of-the-art methods, we would like to experimentally demonstrate the benefits brought by the pivotal components of our method: the generation of synthetic data while training and the modified Pix2Pix architecture described in Section 3.2 (B). Here, we perform ablation studies to verify the advantage of the individual modules in the proposed model.

5.1. W/O Variational Augmentation

We train our binarization network, Bi-Net, without Stage 1 and compare it with the proposed method in Table 2. We observe a significant drop in performance when variational augmentation is not incorporated. Given the limited amount of training data, the model fails to learn the various degradations that might be present in real-world historical document images, which justifies our case for simulating the degradations using a neural network.

Table 2. Ablation experiments w/o variational augmentation.
Method   F-Measure ↑   pF-Measure ↑   PSNR ↑   DRD ↓
DIBCO 2014
w/o Aug-Net 91.190 92.076 19.758 3.192
Proposed 96.860 97.450 22.250 1.530
DIBCO 2016
w/o Aug-Net 87.268 90.844 17.348 4.142
Proposed 94.333 96.081 20.086 1.316
DIBCO 2018
w/o Aug-Net 73.968 77.170 14.988 118.359
Proposed 89.751 93.141 17.439 3.824
DIBCO 2019
w/o Aug-Net 64.809 64.802 14.458 6.594
Proposed 75.130 75.101 14.802 5.248

5.2. W/O Residual Blocks + PixelShuffle

We also compare Bi-Net against its parent network Pix2Pix in Table 3. Our proposed architecture outperforms Pix2Pix on the DIBCO datasets by a significant margin. Bi-Net’s PixelShuffle upsampling preserves fine details better than conventional transposed convolutions, which is evident from the boost in F-Measure.

Table 3. Ablation experiments w/o changes in architecture of Bi-Net
Method   F-Measure ↑   pF-Measure ↑   PSNR ↑   DRD ↓
DIBCO 2014
Pix2Pix 73.928 74.606 18.548 9.059
Proposed 91.190 92.076 19.758 3.192
DIBCO 2016
Pix2Pix 72.636 73.689 13.641 8.800
Proposed 87.268 90.844 17.348 4.142
DIBCO 2018
Pix2Pix 68.135 68.516 9.324 55.791
Proposed 73.968 77.170 14.988 118.359
DIBCO 2019
Pix2Pix 68.626 68.735 9.478 17.714
Proposed 64.809 64.802 14.458 6.594

6. Conclusion

In this paper, we have proposed a novel document binarization algorithm that successfully deals with the binarization of historical document images. Our algorithm couples the strengths of variational inference and paired image-to-image translation and is hence able to perform well in scenarios where training data is scarce. We demonstrate the efficacy of our method by performing a quantitative analysis of the binarization performance using metrics like F-Measure, pseudo-F-Measure, PSNR and DRD. The experimental results show that the proposed method outperforms traditional and state-of-the-art methods on multiple metrics. Furthermore, we have conducted ablation experiments to demonstrate the merits of our methodology. We observe, however, that our method fails to binarize images where the foreground text has faded away. Furthermore, since we do not analyze the textual information within the image, we are unable to restore images where a part of the text is missing or areas where the ink has bloated, rendering the text unreadable. To address these limitations, future work in this domain should focus on integrating language processing algorithms within the binarization framework.

References

  • Otsu [1979] Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics, 9(1):62–66, 1979.
  • Niblack [1985] Wayne Niblack. An introduction to digital image processing. Strandberg Publishing Company, 1985.
  • Sauvola and Pietikäinen [2000] Jaakko Sauvola and Matti Pietikäinen. Adaptive document image binarization. Pattern recognition, 33(2):225–236, 2000.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
  • Tensmeyer and Martinez [2017] Chris Tensmeyer and Tony Martinez. Document image binarization with fully convolutional neural networks. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR), volume 1, pages 99–104. IEEE, 2017.
  • Calvo-Zaragoza and Gallego [2019] Jorge Calvo-Zaragoza and Antonio-Javier Gallego. A selectional auto-encoder approach for document image binarization. Pattern Recognition, 86:37–47, 2019.
  • Vo et al. [2018] Quang Nhat Vo, Soo Hyung Kim, Hyung Jeong Yang, and Gueesang Lee. Binarization of degraded document images based on hierarchical deep supervised network. Pattern Recognition, 74:568–586, 2018.
  • He and Schomaker [2019] Sheng He and Lambert Schomaker. Deepotsu: Document enhancement and binarization using iterative deep learning. Pattern recognition, 91:379–390, 2019.
  • Suh et al. [2020] Sungho Suh, Jihun Kim, Paul Lukowicz, and Yong Oh Lee. Two-stage generative adversarial networks for document image binarization with color noise and background removal. arXiv preprint arXiv:2010.10103, 2020.
  • Bhunia et al. [2019] Ankan Kumar Bhunia, Ayan Kumar Bhunia, Aneeshan Sain, and Partha Pratim Roy. Improving document binarization via adversarial noise-texture augmentation. In 2019 IEEE International Conference on Image Processing (ICIP), pages 2721–2725. IEEE, 2019.
  • Capobianco and Marinai [2017] Samuele Capobianco and Simone Marinai. Docemul: a toolkit to generate structured historical documents. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1186–1191. IEEE, 2017.
  • Zhu et al. [2017a] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. Advances in neural information processing systems, 30, 2017a.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • Shi et al. [2016] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.
  • Gatos et al. [2009] Basilis Gatos, Konstantinos Ntirogiannis, and Ioannis Pratikakis. Icdar 2009 document image binarization contest (dibco 2009). In 2009 10th International conference on document analysis and recognition, pages 1375–1382. IEEE, 2009.
  • Pratikakis et al. [2010] Ioannis Pratikakis, Basilis Gatos, and Konstantinos Ntirogiannis. H-dibco 2010-handwritten document image binarization competition. In 2010 12th International Conference on Frontiers in Handwriting Recognition, pages 727–732. IEEE, 2010.
  • Pratikakis et al. [2011] Ioannis Pratikakis, Basilis Gatos, and Konstantinos Ntirogiannis. Icdar 2011 document image binarization contest (dibco 2011). In 2011 International Conference on Document Analysis and Recognition, pages 1506–1510, 2011. doi: 10.1109/ICDAR.2011.299.
  • Pratikakis et al. [2013] Ioannis Pratikakis, Basilis Gatos, and Konstantinos Ntirogiannis. Icdar 2013 document image binarization contest (dibco 2013). In 2013 12th International Conference on Document Analysis and Recognition, pages 1471–1476. IEEE, 2013.
  • Pratikakis et al. [2017] Ioannis Pratikakis, Konstantinos Zagoris, George Barlas, and Basilis Gatos. Icdar2017 competition on document image binarization (dibco 2017). In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 1395–1403, 2017. doi: 10.1109/ICDAR.2017.228.
  • Ntirogiannis et al. [2014] Konstantinos Ntirogiannis, Basilis Gatos, and Ioannis Pratikakis. Icfhr2014 competition on handwritten document image binarization (h-dibco 2014). In 2014 14th International conference on frontiers in handwriting recognition, pages 809–813. IEEE, 2014.
  • Pratikakis et al. [2016] Ioannis Pratikakis, Konstantinos Zagoris, George Barlas, and Basilis Gatos. Icfhr2016 handwritten document image binarization contest (h-dibco 2016). In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 619–623. IEEE, 2016.
  • Pratikakis et al. [2018] Ioannis Pratikakis, Konstantinos Zagoris, Panagiotis Kaddas, and Basilis Gatos. Icfhr 2018 competition on handwritten document image binarization (h-dibco 2018). pages 489–493, 08 2018. doi: 10.1109/ICFHR-2018.2018.00091.
  • Pratikakis et al. [2019] Ioannis Pratikakis, Konstantinos Zagoris, Xenofon Karagiannis, Lazaros Tsochatzidis, Tanmoy Mondal, and Isabelle Marthot-Santaniello. Icdar 2019 competition on document image binarization (dibco 2019). pages 1547–1556, 09 2019. doi: 10.1109/ICDAR.2019.00249.
  • Mesquita et al. [2015] Rafael G Mesquita, Ricardo MA Silva, Carlos AB Mello, and Péricles BC Miranda. Parameter tuning for document image binarization using a racing algorithm. Expert Systems with Applications, 42(5):2593–2603, 2015.
  • Kligler et al. [2018] Netanel Kligler, Sagi Katz, and Ayellet Tal. Document enhancement using visibility detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2374–2382, 2018.
  • Xiong et al. [2018] Wei Xiong, Xiuhong Jia, Jingjing Xu, Zijie Xiong, Min Liu, and Juan Wang. Historical document image binarization using background estimation and energy minimization. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 3716–3721. IEEE, 2018.
  • Bera et al. [2021] Suman Kumar Bera, Soulib Ghosh, Showmik Bhowmik, Ram Sarkar, and Mita Nasipuri. A non-parametric binarization method based on ensemble of clustering algorithms. Multimedia Tools and Applications, 80(5):7653–7673, 2021.
  • Moghaddam and Cheriet [2012] Reza Farrahi Moghaddam and Mohamed Cheriet. Adotsu: An adaptive and parameterless generalization of otsu’s method for document image binarization. Pattern Recognition, 45(6):2419–2431, 2012.
  • Gatos et al. [2006] Basilios Gatos, Ioannis Pratikakis, and Stavros J Perantonis. Adaptive degraded document image binarization. Pattern recognition, 39(3):317–327, 2006.
  • Su et al. [2012] Bolan Su, Shijian Lu, and Chew Lim Tan. Robust document image binarization technique for degraded document images. IEEE transactions on image processing, 22(4):1408–1417, 2012.
  • Pai et al. [2010] Yu-Ting Pai, Yi-Fan Chang, and Shanq-Jang Ruan. Adaptive thresholding algorithm: Efficient computation technique based on intelligent block detection for degraded document images. Pattern Recognition, 43(9):3177–3187, 2010.
  • Jia et al. [2018] Fuxi Jia, Cunzhao Shi, Kun He, Chunheng Wang, and Baihua Xiao. Degraded document image binarization using structural symmetry of strokes. Pattern Recognition, 74:225–240, 2018.
  • Xiong et al. [2021] Wei Xiong, Lei Zhou, Ling Yue, Lirong Li, and Song Wang. An enhanced binarization framework for degraded historical document images. EURASIP Journal on Image and Video Processing, 2021(1):1–24, 2021.
  • Souibgui et al. [2022] Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, and Umapada Pal. Docentr: An end-to-end document image enhancement transformer. arXiv preprint arXiv:2201.10252, 2022.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • Zhao et al. [2019] Jinyuan Zhao, Cunzhao Shi, Fuxi Jia, Yanna Wang, and Baihua Xiao. Document image binarization with cascaded generators of conditional generative adversarial networks. Pattern Recognition, 96:106968, 2019.
  • Kumar et al. [2021] Amandeep Kumar, Shuvozit Ghose, Pinaki Nath Chowdhury, Partha Pratim Roy, and Umapada Pal. Udbnet: Unsupervised document binarization network via adversarial game. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 7817–7824. IEEE, 2021.
  • Larsen et al. [2016] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning, pages 1558–1566. PMLR, 2016.
  • Dumoulin et al. [2016] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Li and Wand [2016] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European conference on computer vision, pages 702–716. Springer, 2016.
  • Souibgui and Kessentini [2020] Mohamed Ali Souibgui and Yousri Kessentini. De-gan: a conditional generative adversarial network for document enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • Zhu et al. [2017b] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017b.
  • Nafchi et al. [2013] Hossein Ziaei Nafchi, Seyed Morteza Ayatollahi, Reza Farrahi Moghaddam, and Mohamed Cheriet. An efficient ground truthing tool for binarization of historical manuscripts. In 2013 12th International Conference on Document Analysis and Recognition, pages 807–811. IEEE, 2013.
  • Das [2019] Sayan Das. A statistical tool based binarization method for document images. Multimedia Tools and Applications, 78(19):27449–27462, 2019.