
Variational Augmentation for Enhancing Historical Document Image Binarization

Avirup Dey, Nibaran Das, and Mita Nasipuri
Jadavpur University, Kolkata, West Bengal, India, 700032
(2022)
Abstract.

Historical Document Image Binarization is a well-known segmentation problem in image processing. Despite their ubiquity, traditional thresholding algorithms have achieved limited success on severely degraded document images. With the advent of deep learning, several segmentation models were proposed that made significant progress in the field but were limited by the unavailability of large training datasets. To mitigate this problem, we propose a novel two-stage framework: the first stage comprises a generator that produces degraded samples using variational inference, and the second is a CNN-based binarization network that trains on the generated data. We evaluated our framework on a range of DIBCO datasets, where it achieved competitive results against previous state-of-the-art methods.

Binarization, DIBCO, GANs
copyright: rights retained; doi: 10.475/123_4; isbn: 123-4567-24-567/08/06; conference: 13th Indian Conference on Computer Vision, Graphics and Image Processing, December 2022, Gandhinagar, India; journalyear: 2022; price: 15.00; ccs: Computing methodologies, Image segmentation

1. Introduction

One of the most important preprocessing steps in the analysis of document images is their binarization. The classification of image pixels into text and non-text regions greatly aids systems further down the pipeline, typically optical character recognition algorithms. While binarization is often performed with a simple threshold that dictates how each image pixel is mapped in the output image, advanced binarization algorithms are needed when the task also involves the restoration of document images in order to obtain an ideal text/non-text demarcation.

Over the years, many restoration algorithms have been proposed focusing on the removal of degradations found in document images. Traditional image processing-based algorithms like Otsu (Otsu, 1979), Niblack (Niblack, 1985) and Sauvola (Sauvola and Pietikäinen, 2000) deal with minor degradations successfully and are particularly popular in this domain. These methods, however, do not perform successful binarization when the degradations are severe, typically those observed in historical document images. Binarization of severely degraded document images has been addressed recently with the help of deep learning algorithms, especially CNNs (LeCun et al., 2015), which have shown significant promise in this domain. Tensmeyer and Martinez proposed a fully convolutional network to binarize document images in (Tensmeyer and Martinez, 2017), treating binarization as a special case of semantic segmentation. Calvo-Zaragoza and A.J. Gallego proposed a deep encoder-decoder architecture, outperforming previous binarization algorithms in (Calvo-Zaragoza and Gallego, 2019). Vo et al. (Vo et al., 2018) proposed a hierarchical deep learning framework that achieved state-of-the-art results on multiple datasets. Later, He and Schomaker (He and Schomaker, 2019) proposed an iterative CNN framework and achieved similar results. Suh et al. (Suh et al., 2020) proposed a two-stage adversarial framework for binarizing RGB images. The first stage trains an adversarial network for each channel of the RGB image to extract foreground information from small local image patches by removing background information for document image enhancement. The second stage learns local and global features to produce the final binarized output. This algorithm outperformed previous methods and achieved state-of-the-art performance on multiple datasets.

Although the previous methods can perform successful binarization of degraded documents, they are constrained to grayscale inputs and require voluminous training data. To deal with the lack of adequate training data, Bhunia et al. (Bhunia et al., 2019) introduced a texture augmentation network (TANet) in conjunction with a binarization network that generates degraded document images with the help of style transfer. This method achieves performance competitive with previous methods but is constrained by the one-to-one mapping of style transfer-based approaches. Capobianco and Marinai (Capobianco and Marinai, 2017) proposed a toolkit to generate synthetic document images by combining text and degradations from a given palette. One could iteratively generate multiple images from a given text, but this requires a lot of manual cherry-picking of degradation styles.

In this paper we propose a novel two-stage framework addressing both the data scarcity in this domain and the binarization of severely degraded document images containing multiple artefacts in the form of stains and bleed-through. In the first stage, we employ an augmentation network, Aug-Net, inspired by BicycleGAN (Zhu et al., 2017a), that generates novel degraded samples from a single image-ground truth pair using variational inference. Combining the strengths of GANs (Goodfellow et al., 2014) and VAEs (Kingma and Welling, 2013), this network encodes the content and style of the input image into a probability distribution, allowing the model to generate multiple outputs from a single input by sampling from the distribution during inference. Additionally, it resolves the blurring effect of VAEs by encouraging a bijective mapping between the latent code and the output, making the generated samples more realistic.

In the following stage, we employ a paired image-to-image translation network, Bi-Net, inspired by Pix2Pix (Isola et al., 2017), to binarize the images. While previous methods performed binarization from grayscale images, we work with RGB images, attempting to capture the nature of the degradation from multiple channels. Our architecture further employs PixelShuffle (Shi et al., 2016) upsampling, replacing the traditional transposed convolutional layer. PixelShuffle uses Efficient Sub-Pixel Convolution (ESPC), which is computationally less expensive and has shown better results in upscaling images and videos.

In our experiments, we trained our network on the DIBCO 2009 (Gatos et al., 2009), DIBCO 2010 (Pratikakis et al., 2010), DIBCO 2011 (Pratikakis et al., 2011), DIBCO 2013 (Pratikakis et al., 2013), and DIBCO 2017 (Pratikakis et al., 2017) datasets and tested it on DIBCO 2014 (Ntirogiannis et al., 2014), DIBCO 2016 (Pratikakis et al., 2016), DIBCO 2018 (Pratikakis et al., 2018) and DIBCO 2019 (Pratikakis et al., 2019), using metrics like F-measure, pseudo F-measure, PSNR and DRD to evaluate performance. Our method achieves competitive performance on the said datasets against several state-of-the-art models, including Mesquita et al. (Mesquita et al., 2015) (winner of DIBCO 2014), Kligler and Tal (Kligler et al., 2018) (winner of DIBCO 2016), Xiong et al. (Xiong et al., 2018) (winner of DIBCO 2018) and Bera et al. (Bera et al., 2021) (winner of DIBCO 2019).

Our contributions can be summarised as follows:

  • We have used variational inference to generate novel images for training, scaling up the training data many-fold and thus mitigating the problem of limited training samples.

  • We have used PixelShuffle upsampling instead of transposed convolutions, and short skip connections in addition to the long symmetric skip connections, in the generator of our binarization model. These modifications not only yielded better results but also enabled faster training.

  • We have evaluated our model on multiple metrics against several state-of-the-art binarization algorithms, including the winners of previous DIBCO contests, showing that our proposed method achieves significant improvements over existing algorithms.

The paper is organised as follows: Section 2 summarizes the relevant literature, Section 3 covers our proposed methodology comprising the network architectures and training details, Section 4 and Section 5 cover our experimental results and ablation studies, respectively, and Section 6 concludes the paper.

2. Related Works

The overall goal of document binarization is to convert an input image into a two-tone version, enabling easy demarcation of text and non-text regions and enhancing the information that can be extracted from it. Towards this goal, a plethora of methods have been proposed, broadly classified into two groups: classical image processing-based methods and deep learning-based algorithms.

2.1. Image Processing Based Methods

Traditional image processing based methods like those proposed in (Otsu, 1979), (Niblack, 1985), and (Sauvola and Pietikäinen, 2000) formed the baseline of research in the domain of document image binarization. One of the most popular algorithms was proposed by Otsu in (Otsu, 1979); it computes a threshold that minimizes the intra-class variance and maximizes the inter-class variance of two pre-assumed classes. It selects the global threshold based on a histogram, and owing to its simplicity this algorithm is very fast. However, it is sensitive to deep stains, non-uniform backgrounds and bleed-through degradations.
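For illustration, the following is a minimal NumPy sketch of Otsu's histogram-based criterion; the function name and the 8-bit, 256-bin assumptions are ours and not part of the original formulation.

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the global threshold maximizing between-class variance
    of an 8-bit grayscale image (Otsu, 1979)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2          # inter-class variance
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Usage: binary = (gray >= otsu_threshold(gray)).astype(np.uint8) * 255
```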

To address the sensitivity of global thresholding to such degradations, local adaptive threshold methods were proposed, such as Sauvola (Sauvola and Pietikäinen, 2000), Niblack (Niblack, 1985) and AdOtsu (Moghaddam and Cheriet, 2012). These methods compute the local threshold for each pixel based on the mean and standard deviation of a local area around the pixel. Following these methods, several binarization methods have been proposed. Gatos et al. (Gatos et al., 2006) used a Wiener filter to estimate the background and foreground regions in a method that employs several post-processing steps to remove noise in the background and improve the quality of foreground regions. Su et al. (Su et al., 2012) introduced a map combining the local image contrast and the local image gradient; the local threshold is estimated based on the values on the detected edges in a local region. Pai et al. (Pai et al., 2010) proposed an adaptive window-size selection method based on the foreground characteristics. Jia et al. (Jia et al., 2018) employed the structural symmetry of strokes to compute the local threshold. Xiong et al. (Xiong et al., 2021) proposed an entropy-based formulation to segregate text from the image background and employed energy-based segmentation to binarize images. They achieved competitive results on DIBCO benchmarks.

However, these adaptive binarization methods require many empirical parameters that need to be manually fine-tuned and are still not satisfactory for use with highly degraded and poor-quality document images.

2.2. Deep Learning Based Methods

In recent years, convolutional neural networks (CNNs) (LeCun et al., 2015) have achieved several milestones in a variety of tasks in computer vision. Tensmeyer and Martinez (Tensmeyer and Martinez, 2017) used a fully convolutional network for document image binarization at multiple image scales. A deep auto-encoder-decoder architecture was proposed by Calvo-Zaragoza and Gallego (Calvo-Zaragoza and Gallego, 2019). Vo et al. (Vo et al., 2018) proposed a hierarchical deep supervised network to predict the full foreground map through the results of multi-scale networks; this method achieved state-of-the-art performance on several benchmarks. He and Schomaker (He and Schomaker, 2019) introduced an iterative CNN-based framework and achieved performance similar to that of (Vo et al., 2018)'s method. Souibgui et al. (Souibgui et al., 2022) developed a transformer-based (Vaswani et al., 2017) encoder-decoder framework which achieved results comparable to the previous methods.

Recently, generative adversarial networks (GANs) (Goodfellow et al., 2014) have emerged as a class of CNN models approximating real-world images, achieving significant success in image synthesis. Over the years, several variants have been proposed for image translation tasks as well. In GANs, a generator network competes against a discriminator network that distinguishes between generated and real images. Unlike the original GAN, the cGAN (Mirza and Osindero, 2014) conditions both the generator and the discriminator on additional inputs, such as class labels, partial data, or input images, so the generator must not only fool the discriminator but also respect the conditioning. Based on this principle, Isola et al. (Isola et al., 2017) proposed the Pix2Pix GAN for general-purpose image-to-image translation. Bhunia et al. (Bhunia et al., 2019) proposed a texture augmentation network to augment the training datasets and handle image binarization using a cGAN structure. Zhao et al. (Zhao et al., 2019) proposed a cascaded network based on Pix2Pix GAN to combine global and local information. Suh et al. (Suh et al., 2020) proposed a two-stage GAN framework, employing adversarial networks to remove noise from each channel separately in the first stage and fine-tuning the output using a similar network in the second stage.

3. Proposed Methodology

3.1. Overview

The application of deep learning algorithms in the field of historical document image binarization is limited due to the dearth of training data. In the past, researchers had to either curate and preprocess images manually from multiple sources, as shown in (Capobianco and Marinai, 2017), or use style transfer-based augmentation, as shown by Bhunia et al. (Bhunia et al., 2019) and Kumar et al. (Kumar et al., 2021). Despite their merits, both methods fail to capture the diversity of degradations real historical documents may exhibit.

The strength of our method lies in the generation of novel synthetic data in situ while training the binarization model, as shown in Fig. 1. In the first stage, an augmentation network, inspired by BicycleGAN (Zhu et al., 2017a), generates multiple degraded samples from an input-ground truth pair via variational inference. These synthetic samples are, in turn, used to train our binarization network in the next stage. We have employed a paired image translation procedure, following the work of Isola et al. (Isola et al., 2017).

Figure 1. Training the Binarization Network: Stage-I comprises the variational generator of the augmentation network (Aug-Net) that synthesizes novel training samples from a given input. Stage-II comprises the binarization network (Bi-Net) that is trained to transform the degraded images into binary masks.

3.2. Network Architectures

3.2.1. (A) Augmentation Network (Aug-Net)

The network comprises two components, cVAE-GAN (Larsen et al., 2016) and cLR-GAN (Dumoulin et al., 2016), which complement each other in the training process. cVAE-GAN learns the encoding from real data, but a random latent code may not yield realistic images at test time, and the KL loss may not be well optimized. More importantly, the discriminator does not have a chance to see results sampled from the prior during training. On the other hand, in cLR-GAN, the latent space is easily sampled from a simple distribution, but the generator is trained without the benefit of observing ground truth input-output pairs. Combining the two therefore helps us produce results that are diverse as well as realistic. Fig. 2 outlines the training procedure of our augmentation network.

Figure 2. Training the Augmentation Network: cVAE-GAN starts from a ground truth target image B and encodes it into the latent space. The generator then attempts to map the input image A along with a sampled z back into the original image B. cLR-GAN randomly samples a latent code from a known distribution, uses it to map A into the output B, and then tries to reconstruct the latent code from the output.

We use the trained generator of the cVAE-GAN to generate new degraded images while training our binarization network.
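The following is a minimal sketch of how Stage-I can be used at augmentation time; the generator signature G(A, z) and the helper name are assumptions made for illustration, with the latent size taken from Section 3.3.1.

```python
import torch

@torch.no_grad()
def synthesize_degraded(generator, clean_input, n_samples=4, z_dim=8):
    """Draw several latent codes from the prior N(0, I) and decode each into a
    differently degraded rendering of the same conditioning image.
    generator: trained cVAE-GAN generator G(A, z); clean_input: (1, C, H, W)."""
    samples = []
    for _ in range(n_samples):
        z = torch.randn(clean_input.size(0), z_dim, device=clean_input.device)
        samples.append(generator(clean_input, z))
    return torch.cat(samples, dim=0)  # (n_samples, C, H, W)
```

Each generated sample, paired with its ground truth, is fed to Stage-II as an additional training example.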

3.2.2. (B) Binarization Network (Bi-Net)

Our binarization network is inspired by the well-known Pix2Pix GAN first proposed by Isola et al. (Isola et al., 2017), with some changes in the generator architecture. Firstly, we have built the U-Net generator on top of a ResNet (He et al., 2016) backbone, which helps stabilize training. Secondly, we have swapped the transposed convolutions in the upsampling blocks with PixelShuffle (Shi et al., 2016), which produces sharper outputs.
The PixelShuffle layer implements Efficient Sub-Pixel Convolution. The operation is described by the following equation:

(1) I^{HR} = f^{L}(I^{LR}) = PS(W_{L} * f^{L-1}(I^{LR}) + b_{L})

where $I^{HR}$ denotes the high-resolution image, $I^{LR}$ denotes the low-resolution image, $W_{L}$ and $b_{L}$ are the weights and bias of the $L$-th convolutional layer, $f^{L-1}(I^{LR})$ is the output of the preceding layer, and $PS$ is a periodic shuffling operator that rearranges the elements of an $H\times W\times C\cdot r^{2}$ tensor into a tensor of shape $rH\times rW\times C$. This has been illustrated in Figure 3.
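As a small PyTorch sketch of the ESPC upsampling block assumed in our generator (channel widths and kernel size are illustrative choices, not the exact configuration):

```python
import torch
import torch.nn as nn

class PixelShuffleUp(nn.Module):
    """Convolution producing C*r^2 channels followed by the periodic shuffle PS:
    (N, C*r^2, H, W) -> (N, C, r*H, r*W)."""
    def __init__(self, in_ch, out_ch, r=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

x = torch.randn(1, 64, 32, 32)            # low-resolution feature map
y = PixelShuffleUp(64, 64, r=2)(x)
print(y.shape)                            # torch.Size([1, 64, 64, 64])
```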

We employed a PatchGAN (Li and Wand, 2016) discriminator with an output scale of $70\times 70$.
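A sketch of a standard 70×70 PatchGAN discriminator in the spirit of (Isola et al., 2017; Li and Wand, 2016) follows; the channel widths are the common Pix2Pix configuration, and the 4-channel input (RGB degraded image plus 1-channel mask) is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Each output logit judges one local patch of the (input, output) pair."""
    def __init__(self, in_ch=4):  # assumed: 3-channel degraded image + 1-channel mask
        super().__init__()
        def block(ci, co, stride, norm=True):
            layers = [nn.Conv2d(ci, co, 4, stride, 1)]
            if norm:
                layers.append(nn.BatchNorm2d(co))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers
        self.net = nn.Sequential(
            *block(in_ch, 64, 2, norm=False),
            *block(64, 128, 2),
            *block(128, 256, 2),
            *block(256, 512, 1),
            nn.Conv2d(512, 1, 4, 1, 1))   # per-patch logits

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

logits = PatchDiscriminator()(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
print(logits.shape)  # torch.Size([1, 1, 30, 30]); each logit has a 70x70 receptive field
```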

Figure 3. Efficient Sub-Pixel Convolution: A low-resolution image is passed through convolution layers to generate $r^{2}$ feature maps, which are aggregated to give a higher-resolution output.

3.3. Objective Functions

3.3.1. (A) For Augmentation Network

We train our augmentation network on a combination of losses that includes the KL divergence, adversarial losses and an $L_{1}$ regularization loss, as proposed by (Zhu et al., 2017a). The objective is given by:

G^{*},E^{*} = \arg\min_{G,E}\max_{D} \mathcal{L}_{GAN}^{VAE}(G,D,E) + \lambda\mathcal{L}^{VAE}_{1}(G,E) + \mathcal{L}_{GAN}(G,D) + \lambda_{latent}\mathcal{L}_{1}^{latent}(G,E) + \lambda_{KL}\mathcal{L}_{KL}(E)

$\lambda=10$, $\lambda_{latent}=0.5$ and $\lambda_{KL}=0.01$ were the optimum hyper-parameters. The length of the latent vector was taken as $|z|=8$.
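A minimal sketch of how the weighted terms above are combined for a single generator/encoder update; the loss values below are illustrative stand-ins, since in practice they come from the discriminators, the encoder and the generator of Fig. 2.

```python
import torch

# Illustrative stand-ins for the individual loss terms of one training step.
loss_gan_vae   = torch.tensor(0.70, requires_grad=True)  # adversarial loss, cVAE-GAN path
loss_l1_vae    = torch.tensor(0.20, requires_grad=True)  # L1 reconstruction of the target image
loss_gan_lr    = torch.tensor(0.60, requires_grad=True)  # adversarial loss, cLR-GAN path
loss_l1_latent = torch.tensor(0.10, requires_grad=True)  # L1 reconstruction of the latent code
loss_kl        = torch.tensor(0.05, requires_grad=True)  # KL divergence of the encoder posterior

lam, lam_latent, lam_kl = 10.0, 0.5, 0.01  # hyper-parameters from Section 3.3.1

loss_GE = (loss_gan_vae + lam * loss_l1_vae + loss_gan_lr
           + lam_latent * loss_l1_latent + lam_kl * loss_kl)
loss_GE.backward()
```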

3.3.2. (B) For Binarization Network

We train our binarization network on a combination of the conditional adversarial loss proposed by Isola et al. (Isola et al., 2017) and the $L_{1}$ loss. The conditional adversarial loss is given by:

(2) \mathcal{L}_{cGAN} = \mathbb{E}_{x,y}[\log D(x,y)] + \mathbb{E}_{x,z}[\log(1-D(x,G(x,z)))]

Conditioning the discriminator on the input $x$ forces the generator to produce images that are perceptually similar in structure to the target. It also reduces blurring artefacts. Adding an $L_{1}$ loss encourages the generator to achieve pixel-level accuracy in the generated images.
The overall objective can be written as:

(3) G^{*} = \arg\min_{G}\max_{D} \mathcal{L}_{cGAN}(G,D) + \lambda\mathcal{L}_{L1}(G)

$\lambda$ is a hyper-parameter that can be tuned to weight the loss terms. We have trained our model with $\lambda = 100$.
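A sketch of how the objective in Eq. 3 translates into per-step losses, assuming a PatchGAN discriminator that returns per-patch logits; the function names are ours.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc, x, y_real, y_fake, lam=100.0):
    """Conditional adversarial loss plus weighted L1 (Eq. 3), from the generator's side."""
    pred_fake = disc(x, y_fake)
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    return adv + lam * F.l1_loss(y_fake, y_real)

def discriminator_loss(disc, x, y_real, y_fake):
    """The discriminator is trained to separate real pairs from generated ones."""
    pred_real = disc(x, y_real)
    pred_fake = disc(x, y_fake.detach())
    return 0.5 * (
        F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
        + F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))
```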

3.4. Training

Both the networks were trained with patches extracted from DIBCO 2009 (Gatos et al., 2009), 2010 (Pratikakis et al., 2010), 2011 (Pratikakis et al., 2011), 2013 (Pratikakis et al., 2013) and 2017 (Pratikakis et al., 2017) datasets.
The augmentation network was trained on $256\times256$ patches extracted from the said datasets for 6 epochs using the Adam optimizer.
For training the binarization network, patches of size $512\times512$ were extracted from each image to obtain 13320 samples. Each sample was further reduced to $256\times256$ using augmentations like random crop and resize. The generator was pre-trained with just the $L_{1}$ loss for 5 epochs. The model was then trained adversarially for 20 epochs using the Adam optimizer ($\beta_{1}=0.5$ and $\beta_{2}=0.999$) with a learning rate of $2\times10^{-4}$.
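A condensed sketch of this schedule; the one-layer generator and single-batch loader are trivial stand-ins so the snippet runs, not the actual Bi-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: the real Bi-Net generator/discriminator and the DIBCO patch loader go here.
bi_net_G = nn.Conv2d(3, 1, kernel_size=3, padding=1)
loader = [(torch.randn(2, 3, 256, 256), torch.rand(2, 1, 256, 256))]

opt_G = torch.optim.Adam(bi_net_G.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Phase 1: warm up the generator with the L1 loss alone (5 epochs in our setup).
for epoch in range(5):
    for x, y in loader:
        opt_G.zero_grad()
        F.l1_loss(bi_net_G(x), y).backward()
        opt_G.step()

# Phase 2 (not shown): 20 epochs of adversarial training with the combined
# objective of Section 3.3.2, alternating discriminator and generator updates.
```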

4. Experimental Results


Figure 4. From left to right, top to bottom: Input, Ground Truth, Otsu, Sauvola, Niblack, Suh, Bhunia, Ours

4.1. Datasets

All comparisons were carried out using DIBCO datasets. For evaluation purposes, we have used DIBCO 2014 (Ntirogiannis et al., 2014), DIBCO 2016 (Pratikakis et al., 2016), DIBCO 2018 (Pratikakis et al., 2018) and DIBCO 2019 (Pratikakis et al., 2019) datasets.

Data Preparation: Each DIBCO dataset has 10 degraded historical document images for evaluation. We extracted 5 random patches of dimension $256\times256$ from each image, giving us a total of 50 samples from each dataset for evaluation.
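A minimal sketch of this patch-extraction step; the helper name and fixed seed are ours, and the image is assumed to be at least 256 pixels in each dimension.

```python
import numpy as np

def random_patches(image, gt, n=5, size=256, seed=0):
    """Extract n random, aligned size x size patches from an image and its ground truth."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    patches = []
    for _ in range(n):
        top = int(rng.integers(0, h - size + 1))
        left = int(rng.integers(0, w - size + 1))
        patches.append((image[top:top + size, left:left + size],
                        gt[top:top + size, left:left + size]))
    return patches
```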

4.2. Evaluation Metrics

  • F Measure:

    (4) FM = \frac{2\times Precision\times Recall}{Precision+Recall},

    where $Precision=\frac{TP}{TP+FP}$ and $Recall=\frac{TP}{TP+FN}$. TP, FP and FN denote true positives, false positives and false negatives, respectively.

  • Pseudo-F-Measure:

    (5) pFM = \frac{2\times Precision\times pRecall}{Precision+pRecall},

    where $pRecall$ denotes the recall computed on the skeletonized ground truth image (Pratikakis et al., 2010).

  • PSNR:

    (6) PSNR = 10\log\left(\frac{C^{2}}{MSE}\right),

    where $MSE$ is the mean squared error between the two images and $C$ is the maximum pixel value. The images were normalized, so here $C=1$.
    PSNR is a widely used metric to measure the similarity between two images.

  • DRD:

    (7) DRD = \frac{\sum_{k} DRD_{k}}{NUBN}

    where $DRD_{k}$ is the distortion of the $k$-th flipped pixel, and $NUBN$ is the number of non-uniform $8\times8$ blocks in the ground truth image (Ntirogiannis et al., 2014). A minimal sketch computing F-Measure and PSNR is given after this list.
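As referenced above, a minimal NumPy sketch of the two simpler metrics; pseudo-F-Measure and DRD additionally require skeletonization and a distortion weight matrix, so they are omitted here.

```python
import numpy as np

def f_measure(pred, gt, eps=1e-8):
    """pred, gt: boolean arrays where True marks text (foreground) pixels."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)

def psnr(pred, gt, C=1.0):
    """pred, gt: images normalized to [0, 1]; C is the maximum pixel value."""
    mse = np.mean((pred.astype(float) - gt.astype(float)) ** 2)
    return 10 * np.log10(C ** 2 / max(mse, 1e-12))
```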

4.3. Results and Discussion

From Table 1 we can see that our proposed model achieves competitive scores against previous state-of-the-art methods, and in some cases even surpasses them.
We also observe that, unlike the winning models of the DIBCO contests ((Kligler et al., 2018), (Xiong et al., 2018), (Bera et al., 2021)), our model not only maximises pixel-level accuracy (measured by F-Measure and PSNR) but also preserves perceptual quality of the binarized image (measured by DRD). This is a clear consequence of using a conditional GAN loss.
We compared our model to a range of traditional thresholding algorithms like Otsu (Otsu, 1979), Niblack (Niblack, 1985) and Sauvola (Sauvola and Pietikäinen, 2000), as well as recent state-of-the-art methods such as Vo (Vo et al., 2018), Xiong (Xiong et al., 2021) and Suh (Suh et al., 2020). The winning models of DIBCO 2014 (Ntirogiannis et al., 2014), DIBCO 2016 (Pratikakis et al., 2016), DIBCO 2018 (Pratikakis et al., 2018) and DIBCO 2019 (Pratikakis et al., 2019) have also been included in our experiments.
Bhunia’s method (Bhunia et al., 2019), Suh’s method (Suh et al., 2020), Souibgui’s method (Souibgui and Kessentini, 2020) and Cycle-GAN (Zhu et al., 2017b) were trained on the datasets we used for training our own model (DIBCO 2009 (Gatos et al., 2009), DIBCO 2010 (Pratikakis et al., 2010), DIBCO 2011 (Pratikakis et al., 2011), DIBCO 2013 (Pratikakis et al., 2013) and DIBCO 2017 (Pratikakis et al., 2017)) with the hyper-parameters set by their respective authors.

Bhunia et al.’s model (Bhunia et al., 2019), Suh et al.’s model (Suh et al., 2020), Vo et al.’s model (Vo et al., 2018) and He’s model (He and Schomaker, 2019) were originally trained on the PHIBD (Nafchi et al., 2013) dataset, which has over 100 images, in addition to multiple DIBCO datasets. In contrast, our training dataset consisted of only 74 DIBCO images. In effect, we trained our model on almost half the number of training images used by previous deep learning-based methods, which demonstrates the strength of variational augmentation while training the binarization model.

Table 1. Comparative results on DIBCO datasets.
Method   F-Measure ↑   pF-Measure ↑   PSNR ↑   DRD ↓
DIBCO 2014
Rank 1 (Mesquita et al., 2015) 96.880 97.650 22.660 0.902
Otsu (Otsu, 1979) 91.780 95.740 18.720 2.647
Niblack (Niblack, 1985) 86.010 88.040 16.540 8.260
Sauvola (Sauvola and Pietikäinen, 2000) 86.830 91.800 17.630 4.896
Vo (Vo et al., 2018) 95.970 97.420 21.490 1.090
He (He and Schomaker, 2019) 95.950 98.760 21.600 1.120
Xiong (Xiong et al., 2021) 96.770 97.730 22.470 0.950
Bhunia (Bhunia et al., 2019) 73.753 74.000 12.679 7.850
Suh (Suh et al., 2020) 96.360 98.870 21.960 1.070
Soibgui (Souibgui and Kessentini, 2020) 96.020 95.300 19.870 4.100
Cycle-GAN (Zhu et al., 2017b) 84.530 86.792 15.278 5.993
Ours 96.860 97.450 22.250 1.530
DIBCO 2016
Rank 1 (Kligler et al., 2018) 87.610 91.280 18.110 5.210
Otsu (Otsu, 1979) 85.660 88.860 16.260 5.580
Niblack (Niblack, 1985) 72.570 73.510 13.260 24.650
Sauvola (Sauvola and Pietikäinen, 2000) 84.270 89.100 17.150 6.090
Vo (Vo et al., 2018) 90.010 93.440 18.740 3.910
He (He and Schomaker, 2019) 91.190 95.740 19.510 3.020
Xiong (Xiong et al., 2021) 89.640 93.560 18.690 4.030
Das (Das, 2019) 88.930 91.750 18.080 4.120
Bhunia (Bhunia et al., 2019) 65.525 65.145 12.595 8.270
Suh (Suh et al., 2020) 92.240 95.950 19.930 2.770
Soibgui (Souibgui and Kessentini, 2020) 88.760 87.230 19.450 7.380
Cycle-GAN (Zhu et al., 2017b) 81.564 87.434 14.681 7.143
Ours 94.333 96.081 20.086 1.316
DIBCO 2018
Rank 1 (Xiong et al., 2018) 88.340 90.240 19.110 4.920
Otsu (Otsu, 1979) 51.450 53.050 9.740 59.070
Niblack (Niblack, 1985) 41.180 41.390 6.790 99.460
Sauvola (Sauvola and Pietikäinen, 2000) 67.810 74.080 13.780 17.690
Xiong (Xiong et al., 2021) 88.340 90.370 19.110 4.930
Bhunia (Bhunia et al., 2019) 59.254 59.178 11.797 9.555
Suh (Suh et al., 2020) 84.950 91.577 17.040 16.861
Soibgui (Souibgui and Kessentini, 2020) 73.700 75.610 16.990 10.210
Cycle-GAN (Zhu et al., 2017b) 72.972 77.391 13.462 129.277
Ours 89.751 93.141 17.439 3.824
DIBCO 2019
Rank 1 (Bera et al., 2021) 72.875 72.150 14.475 16.235
Otsu (Otsu, 1979) 52.800 52.550 12.640 24.210
Niblack (Niblack, 1985) 51.510 53.860 10.540 31.050
Sauvola (Sauvola and Pietikäinen, 2000) 42.520 39.760 7.710 120.120
Bhunia (Bhunia et al., 2019) 53.340 55.995 11.779 9.256
Suh (Suh et al., 2020) 62.893 62.726 15.584 3.362
Soibgui (Souibgui and Kessentini, 2020) 70.330 71.470 12.220 8.910
Cycle-GAN (Zhu et al., 2017b) 74.916 75.189 14.307 6.814
Ours 75.130 75.101 14.802 5.248

5. Ablation Studies

Although our experiments show that the proposed method achieves competitive performance against existing state-of-the-art methods, we would like to experimentally demonstrate the benefits brought by the pivotal components of our method: the generation of synthetic data while training and the modified Pix2Pix architecture described in Section 3.2 (B). Here, we perform ablation studies to verify the advantage of the individual modules in the proposed model.

5.1. W/O Variational Augmentation

We train our binarization network, Bi-Net, without Stage 1 and compare it with the proposed method in Table 2. We observe a significant drop in performance when variational augmentation is not incorporated. Given the limited amount of training data, the model fails to learn the various degradations that might be present in real-world historical document images, which justifies our case for simulating the degradations using a neural network.

Table 2. Ablation experiments w/o variational augmentation.
Method   F-Measure ↑   pF-Measure ↑   PSNR ↑   DRD ↓
DIBCO 2014
w/o Aug-Net 91.190 92.076 19.758 3.192
Proposed 96.860 97.450 22.250 1.530
DIBCO 2016
w/o Aug-Net 87.268 90.844 17.348 4.142
Proposed 94.333 96.081 20.086 1.316
DIBCO 2018
w/o Aug-Net 73.968 77.170 14.988 118.359
Proposed 89.751 93.141 17.439 3.824
DIBCO 2019
w/o Aug-Net 64.809 64.802 14.458 6.594
Proposed 75.130 75.101 14.802 5.248

5.2. W/O Residual Blocks + PixelShuffle

We also compare Bi-Net against its parent network Pix2Pix in Table 3. Our proposed architecture outperforms Pix2Pix on the DIBCO datasets by a significant margin. Bi-Net’s PixelShuffle upsampling preserves fine details better than conventional transposed convolutions, which is evident from the boost in F-Measure.

Table 3. Ablation experiments w/o changes in architecture of Bi-Net
Method   F-Measure ↑   pF-Measure ↑   PSNR ↑   DRD ↓
DIBCO 2014
Pix2Pix 73.928 74.606 18.548 9.059
Proposed 91.190 92.076 19.758 3.192
DIBCO 2016
Pix2Pix 72.636 73.689 13.641 8.800
Proposed 87.268 90.844 17.348 4.142
DIBCO 2018
Pix2Pix 68.135 68.516 9.324 55.791
Proposed 73.968 77.170 14.988 118.359
DIBCO 2019
Pix2Pix 68.626 68.735 9.478 17.714
Proposed 64.809 64.802 14.458 6.594

6. Conclusion

In this paper, we have proposed a novel document binarization algorithm that successfully deals with the binarization of historical document images. Our algorithm couples the strengths of variational inference and paired image-to-image translation and is hence able to perform well in scenarios where training data is scarce. We demonstrate the efficacy of our method by performing a quantitative analysis of the binarization performance using metrics like F-Measure, pseudo-F-Measure, PSNR and DRD. The experimental results show that the proposed method outperforms traditional and state-of-the-art methods on multiple metrics. Furthermore, we have conducted ablation experiments to demonstrate the merits of our methodology. We observe, however, that our method fails to binarize images where the foreground text has faded away. Furthermore, since we do not analyze the textual information within the image, we are unable to restore images where a part of the text is missing or areas where the ink has bloated, rendering the text unreadable. To address these limitations, future work in this domain should focus on integrating language processing algorithms within the binarization framework.

References

  • Otsu [1979] Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics, 9(1):62–66, 1979.
  • Niblack [1985] Wayne Niblack. An introduction to digital image processing. Strandberg Publishing Company, 1985.
  • Sauvola and Pietikäinen [2000] Jaakko Sauvola and Matti Pietikäinen. Adaptive document image binarization. Pattern recognition, 33(2):225–236, 2000.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
  • Tensmeyer and Martinez [2017] Chris Tensmeyer and Tony Martinez. Document image binarization with fully convolutional neural networks. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR), volume 1, pages 99–104. IEEE, 2017.
  • Calvo-Zaragoza and Gallego [2019] Jorge Calvo-Zaragoza and Antonio-Javier Gallego. A selectional auto-encoder approach for document image binarization. Pattern Recognition, 86:37–47, 2019.
  • Vo et al. [2018] Quang Nhat Vo, Soo Hyung Kim, Hyung Jeong Yang, and Gueesang Lee. Binarization of degraded document images based on hierarchical deep supervised network. Pattern Recognition, 74:568–586, 2018.
  • He and Schomaker [2019] Sheng He and Lambert Schomaker. Deepotsu: Document enhancement and binarization using iterative deep learning. Pattern recognition, 91:379–390, 2019.
  • Suh et al. [2020] Sungho Suh, Jihun Kim, Paul Lukowicz, and Yong Oh Lee. Two-stage generative adversarial networks for document image binarization with color noise and background removal. arXiv preprint arXiv:2010.10103, 2020.
  • Bhunia et al. [2019] Ankan Kumar Bhunia, Ayan Kumar Bhunia, Aneeshan Sain, and Partha Pratim Roy. Improving document binarization via adversarial noise-texture augmentation. In 2019 IEEE International Conference on Image Processing (ICIP), pages 2721–2725. IEEE, 2019.
  • Capobianco and Marinai [2017] Samuele Capobianco and Simone Marinai. Docemul: a toolkit to generate structured historical documents. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1186–1191. IEEE, 2017.
  • Zhu et al. [2017a] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. Advances in neural information processing systems, 30, 2017a.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • Shi et al. [2016] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.
  • Gatos et al. [2009] Basilis Gatos, Konstantinos Ntirogiannis, and Ioannis Pratikakis. Icdar 2009 document image binarization contest (dibco 2009). In 2009 10th International conference on document analysis and recognition, pages 1375–1382. IEEE, 2009.
  • Pratikakis et al. [2010] Ioannis Pratikakis, Basilis Gatos, and Konstantinos Ntirogiannis. H-dibco 2010-handwritten document image binarization competition. In 2010 12th International Conference on Frontiers in Handwriting Recognition, pages 727–732. IEEE, 2010.
  • Pratikakis et al. [2011] Ioannis Pratikakis, Basilis Gatos, and Konstantinos Ntirogiannis. Icdar 2011 document image binarization contest (dibco 2011). In 2011 International Conference on Document Analysis and Recognition, pages 1506–1510, 2011. doi: 10.1109/ICDAR.2011.299.
  • Pratikakis et al. [2013] Ioannis Pratikakis, Basilis Gatos, and Konstantinos Ntirogiannis. Icdar 2013 document image binarization contest (dibco 2013). In 2013 12th International Conference on Document Analysis and Recognition, pages 1471–1476. IEEE, 2013.
  • Pratikakis et al. [2017] Ioannis Pratikakis, Konstantinos Zagoris, George Barlas, and Basilis Gatos. Icdar2017 competition on document image binarization (dibco 2017). In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 1395–1403, 2017. doi: 10.1109/ICDAR.2017.228.
  • Ntirogiannis et al. [2014] Konstantinos Ntirogiannis, Basilis Gatos, and Ioannis Pratikakis. Icfhr2014 competition on handwritten document image binarization (h-dibco 2014). In 2014 14th International conference on frontiers in handwriting recognition, pages 809–813. IEEE, 2014.
  • Pratikakis et al. [2016] Ioannis Pratikakis, Konstantinos Zagoris, George Barlas, and Basilis Gatos. Icfhr2016 handwritten document image binarization contest (h-dibco 2016). In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 619–623. IEEE, 2016.
  • Pratikakis et al. [2018] Ioannis Pratikakis, Konstantinos Zagoris, Panagiotis Kaddas, and Basilis Gatos. Icfhr 2018 competition on handwritten document image binarization (h-dibco 2018). pages 489–493, 08 2018. doi: 10.1109/ICFHR-2018.2018.00091.
  • Pratikakis et al. [2019] Ioannis Pratikakis, Konstantinos Zagoris, Xenofon Karagiannis, Lazaros Tsochatzidis, Tanmoy Mondal, and Isabelle Marthot-Santaniello. Icdar 2019 competition on document image binarization (dibco 2019). pages 1547–1556, 09 2019. doi: 10.1109/ICDAR.2019.00249.
  • Mesquita et al. [2015] Rafael G Mesquita, Ricardo MA Silva, Carlos AB Mello, and Péricles BC Miranda. Parameter tuning for document image binarization using a racing algorithm. Expert Systems with Applications, 42(5):2593–2603, 2015.
  • Kligler et al. [2018] Netanel Kligler, Sagi Katz, and Ayellet Tal. Document enhancement using visibility detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2374–2382, 2018.
  • Xiong et al. [2018] Wei Xiong, Xiuhong Jia, Jingjing Xu, Zijie Xiong, Min Liu, and Juan Wang. Historical document image binarization using background estimation and energy minimization. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 3716–3721. IEEE, 2018.
  • Bera et al. [2021] Suman Kumar Bera, Soulib Ghosh, Showmik Bhowmik, Ram Sarkar, and Mita Nasipuri. A non-parametric binarization method based on ensemble of clustering algorithms. Multimedia Tools and Applications, 80(5):7653–7673, 2021.
  • Moghaddam and Cheriet [2012] Reza Farrahi Moghaddam and Mohamed Cheriet. Adotsu: An adaptive and parameterless generalization of otsu’s method for document image binarization. Pattern Recognition, 45(6):2419–2431, 2012.
  • Gatos et al. [2006] Basilios Gatos, Ioannis Pratikakis, and Stavros J Perantonis. Adaptive degraded document image binarization. Pattern recognition, 39(3):317–327, 2006.
  • Su et al. [2012] Bolan Su, Shijian Lu, and Chew Lim Tan. Robust document image binarization technique for degraded document images. IEEE transactions on image processing, 22(4):1408–1417, 2012.
  • Pai et al. [2010] Yu-Ting Pai, Yi-Fan Chang, and Shanq-Jang Ruan. Adaptive thresholding algorithm: Efficient computation technique based on intelligent block detection for degraded document images. Pattern Recognition, 43(9):3177–3187, 2010.
  • Jia et al. [2018] Fuxi Jia, Cunzhao Shi, Kun He, Chunheng Wang, and Baihua Xiao. Degraded document image binarization using structural symmetry of strokes. Pattern Recognition, 74:225–240, 2018.
  • Xiong et al. [2021] Wei Xiong, Lei Zhou, Ling Yue, Lirong Li, and Song Wang. An enhanced binarization framework for degraded historical document images. EURASIP Journal on Image and Video Processing, 2021(1):1–24, 2021.
  • Souibgui et al. [2022] Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, and Umapada Pal. Docentr: An end-to-end document image enhancement transformer. arXiv preprint arXiv:2201.10252, 2022.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • Zhao et al. [2019] Jinyuan Zhao, Cunzhao Shi, Fuxi Jia, Yanna Wang, and Baihua Xiao. Document image binarization with cascaded generators of conditional generative adversarial networks. Pattern Recognition, 96:106968, 2019.
  • Kumar et al. [2021] Amandeep Kumar, Shuvozit Ghose, Pinaki Nath Chowdhury, Partha Pratim Roy, and Umapada Pal. Udbnet: Unsupervised document binarization network via adversarial game. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 7817–7824. IEEE, 2021.
  • Larsen et al. [2016] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning, pages 1558–1566. PMLR, 2016.
  • Dumoulin et al. [2016] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Li and Wand [2016] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European conference on computer vision, pages 702–716. Springer, 2016.
  • Souibgui and Kessentini [2020] Mohamed Ali Souibgui and Yousri Kessentini. De-gan: a conditional generative adversarial network for document enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • Zhu et al. [2017b] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017b.
  • Nafchi et al. [2013] Hossein Ziaei Nafchi, Seyed Morteza Ayatollahi, Reza Farrahi Moghaddam, and Mohamed Cheriet. An efficient ground truthing tool for binarization of historical manuscripts. In 2013 12th International Conference on Document Analysis and Recognition, pages 807–811. IEEE, 2013.
  • Das [2019] Sayan Das. A statistical tool based binarization method for document images. Multimedia Tools and Applications, 78(19):27449–27462, 2019.