Smart Inference for Multidigit Convolutional Neural Network based Barcode Decoding
Abstract
Barcodes are ubiquitous and have been used in most critical daily activities for decades. However, most traditional decoders require a well-framed barcode under relatively standard conditions, while barcodes captured in wilder conditions, such as underexposed, occluded, blurry, wrinkled, and rotated ones, are common in reality and leave those decoders struggling. Several works have attempted to handle such challenging barcodes, but many limitations remain. This work aims to solve the decoding problem using a deep convolutional neural network, with the possibility of running on portable devices. Firstly, we propose Smart Inference (SI), a special modification of the prediction phase of a trained model, based on the checksum property of barcodes and test-time augmentation. SI considerably boosts accuracy and reduces false predictions for trained models. Secondly, we have created a large, practical evaluation dataset of real captured 1D barcodes under various challenging conditions to test our methods rigorously, publicly available to other researchers. The experimental results demonstrate the effectiveness of SI, with a best accuracy of 95.85%, which outperforms many existing decoders on the evaluation set. Finally, we successfully minimized the best model by knowledge distillation to a shallow model that retains high accuracy (90.85%) with a good inference speed of 34.2 ms per image on a real edge device.
Index Terms:
barcode, convolutional neural network

I Introduction
Linear 1D barcodes appeared in the 1970s and have become ubiquitous on almost all consumer products and in logistics due to their ease of identification. Newer tagging technologies that store more information (e.g. RFID, NFC) have emerged over the last decades. However, none of them has fully replaced the barcode's role in industry, because of its legacy and its economy: the low cost of printing barcodes and the durability of the tag under minor damage keep it an industry standard (standardized by GS1) for the coming decades.
One essential property of a tagging technology is that it must be read quickly, robustly, and accurately by its readers. Barcode readers (or scanners) fall into 3 types: laser-based, LED-based, and camera-based. For the first 2 types, the laser/LED ray must be close to the barcode, no stripe on the ray line may be obscured, and the emitter suffers from overheating. Camera-based readers have several advantages over laser/LED-based solutions. The first builds on the fact that numerous smartphones with high-quality cameras are already in use. With an Internet connection, useful mobile applications arise from online retrieval of product information: giving out ingredient information, alerting to allergies or calorie intake, and comparing prices between sellers; or, for retailers, learning which products are eye-catching, collecting consumer feedback, and so on (e.g. in [1]). Another advantage of the camera-based solution is the possibility of multiple and long-range recognition, supported by computer vision algorithms.
However, most current techniques (static image processing and pattern matching) used in camera-based readers have flaws that limit their usability. Their main problem is the need for well-framed, flatbed-scanned-style input rather than normally captured images. Wilder but commonly captured conditions, such as underexposed, occluded, blurry, curved, or non-horizontal barcodes (as in Figure 1), become unrecognizable. This requires user correction, which is unhandy and slows down the scanning process. There are 2 separate tasks in scanning a barcode: detecting (i.e. locating) the barcode region in the image, and decoding the detected region into the barcode sequence. Recent works show that the first task is nearly solved, even under challenging conditions.

On the other hand, decoding those challenging barcodes still needs improvement, since existing works have many limitations. Traditional methods [2, 3, 4] apply techniques like Hough transformations and scanline-based approaches with binarization thresholds, based on certain assumptions about barcode characteristics that do not always hold. Many evaluated their tools on unpublished sets; others published sets that are small and lack challenging conditions. Given these limitations and the successes of convolutional neural networks (CNNs) in many applications, [5] was the first work to propose a CNN for decoding these difficult codes. However, their work has weak points that keep performance well below the CNN's potential. Not only are their CNN feature extractors simple, but their input assumption is also oversimplified: they assume only horizontal barcodes as input, and their test set is made from generated rectangular barcodes printed on plain paper, while real-life barcodes are printed in customized shapes (e.g. a Coca-Cola icon shape) on various materials, with many kinds of distortion, and are sometimes covered by film. They also did not consider running the task on edge devices, as their models were unoptimized.
Therefore, in this study, we propose a CNN-based method for the decoding task with the following contributions: (i) we propose Smart Inference, 3 algorithms leveraging the checksum property and test-time augmentation, built on top of trained deep CNN models, which considerably boost model accuracy and reduce false predictions; (ii) we created a challenging 2500-sample cropped EAN13 barcode dataset (UPC-A and ISBN13 are its subsets) from real captured images of numerous products under various (including harsh) conditions; this dataset is published for other researchers to evaluate their models and to encourage more contributions to this task; (iii) lastly, we applied the knowledge distillation technique to derive from the best model a lightweight model suitable for handheld devices; the experimental results confirm this possibility with a good inference speed on a real edge board.
II Related work
Regarding the barcode locating task, several methods with improving performance have been presented over the past decade. In 2011, Lin et al. [6] presented the first multiple, rotation-tolerant barcode recognition method. The work focused more on the detection problem, using image processing schemes such as Gaussian smoothing to segment out barcode regions, enhancing the stripes, rotating the regions to a horizontal angle, and feeding them into a decoder with voting. Although the method did well on lottery barcodes (printed on plain paper), it was slow and did not reach high accuracy on a dataset of merchandise products whose difficulty level was unclear. Katona et al. [7] in 2012 proposed a method using morphological operations to segment out 1D and 2D barcodes under blurry, noisy, sheared, and rotated conditions with good performance. Sörös et al. [8] in 2013 continued dealing with blur, using the structure matrix and the saturation channel of the HSV color system to detect blurry barcodes better, at the expense of lower speed. Recently, Creusot et al. [9] proposed a faster method for blurry barcodes based on the Line Segment Detector, after their previous work [10] using Maximally Stable Extremal Regions proved sensitive to blur. In another direction, Hansen et al. [11] first applied a deep learning object detection model (YOLO) to both 1D and 2D codes, with the best bounding-box detection rate.

While the task of barcode detection has nearly reached saturation, works on decoding have appeared only sparsely since the 1990s. Early works [2, 3] achieved their goal with techniques such as Hough transformation and wavelet-based peak location on simple (scanned-style) inputs. Wachenfeld et al. [4] proposed a scanline-based approach accompanied by an EAN13 dataset (the so-called MuensterDB). However, because their method was scanline-based, it only worked well on slightly rotated (±15 degrees) barcodes; stronger rotation or distortion would be problematic. Similar to [7], Zamberletti et al. [12] also tackled out-of-focus (blurry) barcodes, using a multilayer perceptron to find parameters for adaptive thresholding (instead of standard binarization) to restore a blurry image to a clearer one, which was then passed to Zxing [13] for decoding. Nonetheless, this approach is simple, has low recall, and is time-consuming (2 steps). Recently, Yang et al. [14] addressed both tasks on 5 rigorous datasets. Their work outperformed all other methods on EAN13 barcodes, but since it relies heavily on scanline-based, hand-crafted feature analysis for each challenging condition, it is less scalable to the other 1D barcode types (even though all 1D codes use stripes, they may differ in guard-bar layouts and in how black and white stripes mix), and the double-obscured condition (Figure 1) would be inapplicable. Lastly, Fridborn [5] in 2017 first leveraged the power of CNNs to directly extract features and predict 13 outputs (corresponding to 13 digits) simultaneously (similar to [15] for the Street View House Numbers problem). Compared with traditional, hand-crafted-feature methods, the CNN-based approach is more straightforward and data-driven rather than relying on case-by-case analysis. One obvious example is the double-obscured condition, which is problematic for scanline-based approaches but easily learnt and overcome by a CNN classifier. Thus, our work is also CNN-based, but it differs from [5] in the following points: (i) we use more advanced CNN feature extractors; (ii) our input assumption is more practical, and our training and evaluation sets cover more cases; (iii) we propose Smart Inference, exploiting the checksum attribute of barcode sequences to enhance model accuracy; (iv) we minimize the model and verify the feasibility of the CNN-based approach on a real edge device.
Regarding the test-time augmentation we use in this work to enhance a model's inference accuracy, the technique is commonly used in deep learning, as shown in the survey [16], and can be found in the AlexNet [17] and ResNet [18] papers. While train-time augmentation gives the model more variants of the dataset so it can learn all possible variants, test-time augmentation applies suitable modifications to the original samples, lets the model predict on those modified versions, and picks the most suitable prediction by a voting mechanism. How to augment data for better performance is also a trendy topic in deep learning, with papers like AutoAugment [19] and Smart Augmentation [20]. In our work, we integrate test-time augmentation into Smart Inference quite effectively.
On the topic of model compression and the applicability of deep learning on mobile, Cheng et al. [21] categorize methods into 4 types: parameter pruning and sharing, low-rank factorization, compact convolutional filters, and knowledge distillation. The first reduces redundant parameters that are not sensitive to performance, while the second uses matrix decomposition to estimate the informative parameters. The third builds special filters to save parameters in convolutional layers, whereas the last trains a compact neural network with knowledge distilled from a large, so-called teacher model. For simplicity, in this paper we use the original knowledge distillation (KD) method proposed by Hinton et al. [22].
III Methodology
The base approach in this work is to train a probabilistic model that decodes a barcode sequence given a barcode image, as in [5]. Let $S$ represent the barcode sequence and $X$ the input barcode image. The goal is to learn a model of $P(S \mid X)$ by maximizing $\log P(S \mid X)$ on the training set. $S$ is modelled as a collection of $N$ random variables $S_1, \ldots, S_N$ representing the digits of the decoded sequence. To simplify, we assume the values of the digits are independent of each other, so the probability of a sequence is given by $P(S \mid X) = \prod_{i=1}^{N} P(S_i \mid X)$. Each digit is discrete with 10 possible values (0 to 9), so each digit can be represented by a softmax classifier that receives as input the features extracted from $X$ by a CNN. This type of model was originally proposed by [15], so we call it a Multidigit CNN. During the training phase, the loss is the sum of the cross-entropy losses over all digits, as usual. In the inference phase, however, instead of normal inference we propose a modification named Smart Inference (SI), one of the main contributions of this work. SI is described in the next paragraphs. The overall model is shown in Figure 2.
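As an illustration of this formulation, the following is a minimal PyTorch sketch of a Multidigit CNN with its summed cross-entropy loss. The class name, the ResNet-34 backbone, and the head layout are our assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultidigitCNN(nn.Module):
    """One shared CNN feature extractor, one independent 10-way softmax
    classifier per barcode digit (13 digits for EAN-13)."""
    def __init__(self, num_digits=13, num_classes=10):
        super().__init__()
        backbone = models.resnet34(weights=None)   # trained from scratch
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                # keep the extractor only
        self.backbone = backbone
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_digits)])

    def forward(self, x):
        feats = self.backbone(x)
        # (batch, num_digits, num_classes) logits, one row per digit
        return torch.stack([head(feats) for head in self.heads], dim=1)

def multidigit_loss(logits, targets):
    """Training loss: sum of per-digit cross-entropy losses."""
    return sum(nn.functional.cross_entropy(logits[:, i], targets[:, i])
               for i in range(targets.size(1)))
```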
III-A Smart Inference
Normally, after obtaining logits from the model for a barcode image, we apply the softmax function to get the probabilities of each value (0 to 9) for every digit and pick the value with the highest probability; in this way, we finally obtain an $N$-digit sequence from the values with the highest probabilities. However, the value with the highest probability is not always the correct one; the correct value may be the one with the second- or third-highest probability. Besides, most 1D barcodes have the characteristic of checksum satisfaction [23]. Let $s$ be the barcode sequence, $d_i$ the $i$-th digit of the sequence from left to right (so the first digit is $d_1$), and $L$ the length of the sequence (e.g. $L = 13$ for EAN13). The checksum attribute can be summarized as:

$$\sum_{i=1}^{L} w_i\, d_i \equiv 0 \pmod{10}, \qquad w_i = \begin{cases} 1, & \text{if } L - i \text{ is even} \\ 3, & \text{if } L - i \text{ is odd} \end{cases} \qquad (1)$$

For EAN13 ($L = 13$), this reduces to weight 1 on odd positions and weight 3 on even positions.
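Written as code, Equation (1) is just a weighted digit sum. Below is a tiny helper (the function name is ours) that the later sketches reuse:

```python
def checksum_ok(digits):
    """Equation (1) for EAN/UPC codes: alternating weights 1 and 3 with the
    rightmost (check) digit always weighted 1; the weighted sum must be
    divisible by 10."""
    weights = [1 if (len(digits) - 1 - i) % 2 == 0 else 3
               for i in range(len(digits))]
    return sum(w * d for w, d in zip(weights, digits)) % 10 == 0

# e.g. checksum_ok([int(c) for c in "4006381333931"]) -> True (valid EAN-13)
```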
Leveraging this characteristic, our initial idea was to generate more than one predicted sequence, taking not only the value with the highest probability but also the values with the 2nd (or 3rd) highest probability for each of the $N$ digits, and then verifying those combinations against Equation (1). Intuitively, the bigger the gap between a digit's highest and 2nd (or 3rd) highest probability is, the more confident the model is in the top value, and vice versa. Therefore, priority goes to the digits with the smallest gaps, where the model is more confused and less certain about the top value alone. Let $K$ be the number of top values considered per digit, and Max Iteration $m$ the number of digits for which more than one value is considered (as in Algorithm 1). Because a larger $K$ or Max Iteration creates more combinations and slows down inference, we picked $K = 2$ (i.e. we only consider the 2 values with the 2 highest probabilities) and conducted experiments with Max Iteration from 1 to 4 (the value of each of the $N - m$ other digits is the top value, giving $2^m$ combinations). Lastly, we sort the candidate combinations from larger to smaller joint probability and test them against Equation (1) one by one, stopping at the first satisfying combination for fast inference. This process is described in Algorithm 1.
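For concreteness, below is a condensed Python sketch of Algorithm 1 with $K = 2$, under our reading of the procedure. The function and variable names are ours, and it relies on the `checksum_ok` helper from the previous sketch; the paper's actual implementation may differ in detail.

```python
import numpy as np
from itertools import product

def smart_inference(probs, max_iter=2):
    """Sketch of Algorithm 1 with K = 2. probs: (num_digits, 10) softmax
    outputs. The max_iter digits with the smallest gap between their two
    highest probabilities may take either of their top-2 values; all other
    digits keep their argmax. Candidates are tested against Equation (1)
    in order of decreasing joint probability."""
    order = np.argsort(probs, axis=1)                      # ascending per digit
    best, second = order[:, -1], order[:, -2]
    rows = np.arange(probs.shape[0])
    p_best, p_second = probs[rows, best], probs[rows, second]
    uncertain = np.argsort(p_best - p_second)[:max_iter]   # smallest gaps first

    candidates = []
    for choice in product([0, 1], repeat=len(uncertain)):  # 2^max_iter combos
        seq, score = best.copy(), p_best.prod()
        for digit, swap in zip(uncertain, choice):
            if swap:                                       # use the 2nd-best value
                seq[digit] = second[digit]
                score = score / p_best[digit] * p_second[digit]
        candidates.append((score, seq.tolist()))

    for _, seq in sorted(candidates, key=lambda c: -c[0]):
        if checksum_ok(seq):          # helper from the previous sketch
            return seq                # first checksum-satisfying combination
    return None                       # no valid sequence found
```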
Algorithm 1 is enhanced by applying test-time augmentation in 2 ways: fast-track, as in Algorithm 2, and voting, as in Algorithm 3. For simplicity and fast inference, which is important in this application, we only use 3 rotation operations to augment each input image. Algorithm 2 iterates through the original input and its 3 variants step by step, calling Algorithm 1 and stopping as soon as Algorithm 1 finds the first checksum-satisfying combination; otherwise, no decoded sequence is returned. On the other hand, Algorithm 3 collects the satisfying combinations from all iterations (original input and variants) and picks the most frequent one.
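The two augmented variants can be sketched as thin wrappers around Algorithm 1. This is again illustrative: the paper does not specify the rotation angles, so 90-degree steps are assumed here, and `model` is assumed to map an image to per-digit softmax probabilities.

```python
from collections import Counter
import numpy as np

def variants(image):
    """Original image plus 3 rotated copies (90-degree steps assumed)."""
    return [np.rot90(image, k) for k in range(4)]

def fast_track(model, image, max_iter=2):
    """Algorithm 2 sketch: run Algorithm 1 on each variant in turn and
    stop at the first checksum-satisfying sequence."""
    for v in variants(image):
        seq = smart_inference(model(v), max_iter)
        if seq is not None:
            return seq
    return None

def voting(model, image, max_iter=2):
    """Algorithm 3 sketch: collect valid sequences from all variants and
    return the most frequent one."""
    found = [smart_inference(model(v), max_iter) for v in variants(image)]
    found = [tuple(s) for s in found if s is not None]
    return list(Counter(found).most_common(1)[0][0]) if found else None
```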
One thing we emphasize about this idea compared to [14] is that our proposed techniques are more scalable to other types of 1D barcodes (the EAN, UPC, and ITF barcode families) with only a few changes. The technique is also applicable to multiple barcode types in one model: we just need to add a few more output nodes, some to categorize the barcode type and some to pad to the length of the longest barcode type (each digit then having 11 values, 0-9 and NA), while Equation (1) still applies to all other EAN and UPC codes.
III-B Minimize Deep Model
To obtain a model more suitable for edge devices, we use the original knowledge distillation technique of [22] to distill knowledge from the best (deep) model into small shallow models, replacing the original loss function with the combined loss

$$\mathcal{L} = \alpha\, \mathcal{L}_{CE} + (1 - \alpha)\, \mathcal{L}_{KL}$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss on the hard labels, $\mathcal{L}_{KL}$ is the Kullback–Leibler divergence loss against the teacher's labels (soft labels), and $\alpha$ is a hyperparameter.
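A minimal PyTorch sketch of this combined loss for the Multidigit setting follows, assuming the temperature-scaled soft targets of [22]; the temperature $T$, its squared scaling, and the default values below come from [22] and are illustrative, not the paper's tuned settings.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha=0.5, T=4.0):
    """Combined loss: alpha * CE(hard labels) + (1 - alpha) * KL(soft labels),
    summed over digits. logits: (batch, 13, 10); targets: (batch, 13)."""
    num_digits = targets.size(1)
    ce = sum(F.cross_entropy(student_logits[:, i], targets[:, i])
             for i in range(num_digits))
    kl = sum(F.kl_div(F.log_softmax(student_logits[:, i] / T, dim=1),
                      F.softmax(teacher_logits[:, i] / T, dim=1),
                      reduction="batchmean") * (T * T)   # T^2 scaling per [22]
             for i in range(num_digits))
    return alpha * ce + (1 - alpha) * kl
```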
IV Experiments
IV-A Datasets
Our real collected set comprises 1055 samples from the extended MuensterDB, 408 samples from [12], and 1037 self-collected samples, for a total of 2500 samples after we drew bounding boxes and labeled the decoded sequences ([4] and [12] had not completed both tasks in their datasets). Our self-collected samples were captured in 5 supermarkets (1 in France, 2 in South Korea, 2 in Vietnam), both indoors and outdoors, over 2 weeks. They cover a wide range of products (food and edible product packages, books, kitchenware, office stationery, clothes tags) on various materials such as metal cans, wine bottles, plastic food bags, and cardboard boxes, under various light sources (fluorescent light, incandescent bulb, morning and afternoon sunlight) and conditions (auto-focus off, handshaking, long distance, obscured by fingers, wrinkled, distorted, cornered), plus 195 barcodes printed on plain paper with occluded and wrinkled conditions. This set is available at resl.kaist.ac.kr/doc/datasets.
Table I: Conditions of the synthesized training samples.

Condition(s) | Number of samples
---|---
norm | 30000 |
dark | 30000 |
occluded | 20000 |
occluded+dark | 20000 |
rotated & perspective transformed (RPT) | 20000 |
RPT + dark | 20000 |
cylindered & curvy warped (CCW) | 20000 |
CCW + dark | 20000 |
occluded + RPT | 5000 |
blur | 5000 |
RPT + blur | 5000 |
CCW + blur | 5000 |
upside down | 6000 |
upside down + dark | 6000 |
upside down + blur | 6000 |
upside down + CCW | 6000 |
upside down + occluded | 6000 |
heavy noise + rotated | 2000 |
overexposed + occluded + RPT + CCW | 6000 |
dark + occluded + RPT + CCW | 6000 |
occluded + RPT + CCW | 6000 |
Our training set consists of 250000 synthesized samples (without the decoded text under the stripes) with the conditions described in Table I (note that 40000 samples have randomly added noise) and 20000 samples augmented from 500 real samples chosen randomly from the real collected set. Some synthesized samples are shown in Figure 3. The remaining 2000 samples of the real collected set are used as the test set.

IV-B Experimental Setups
To demonstrate our proposed model's performance improvement, we ran experiments against enterprise solutions and deep learning base models. Zxing [13] is an open-source tool used by most developers, while the Google Barcode API is a commercialized version of Zxing. Cognex and Dynamsoft are two large corporations with a long history of machine-vision products for industrial use. For a fair evaluation of these 4 tools, we applied test-time augmentation in the same way as for our deep learning models in Algorithm 2 and set them to work specifically on EAN13. Note that we used the latest versions of the Zxing source and the Google API; Dynamsoft and Cognex were evaluated through their web-based demo APIs, so we deducted the round-trip message duration when measuring inference time. We considered comparing our results with the methods of [4, 6, 12, 14], but since we could not obtain any source or binary, we limited the comparison to the listed public tools.
Regarding deep neural network models, the Fridborn-similar model follows the description in [5]; however, since their input was 196x100x1 while ours is 285x285x3 (so that 4 convolutional blocks would exceed our GPU resources), we had to use 32 kernels instead of 256 for the last convolutional layer and 2048 nodes instead of 4096 for each of the 2 top FC layers. Next, we modeled a non-residual model with 8 convolutional blocks and 2 FC layers, having far fewer parameters than the Fridborn-similar one. The other models, using SOTA CNN feature extractors such as ResNet50, ResNet34 [18], MobileNetV2 [24], and DenseNet169 [25], place the original feature extractor directly before the 13 output nodes, as in Figure 2. We tried various batch sizes; in our empirical observation, 32 was the best. All models were trained from scratch without pretrained knowledge. Note that we had to train the models on the synthesized set alone until the loss dropped to around 1 (i.e. the models converged to a certain level) before training on the full training set (with the 20000 real-collected augmented samples), because training directly on the full training set resulted in a very high loss (even NaN). Training was done on an NVIDIA Titan RTX with 24 GB VRAM. Our CPU evaluation experiments were conducted on a desktop with an Intel Core i9 9900KF processor and 32 GB RAM, while the low-computation experiments were run on a CUDA-enabled NVIDIA Jetson Nano board using NVIDIA TensorRT models (converted from PyTorch).
IV-C Evaluation
We have 2 metrics to clarify here: accuracy and errors. A tool has 3 possible outcome states given a barcode image: a correct decoded sequence (i.e. matching the ground truth), an incorrect decoded sequence, or no barcode found (no checksum-satisfied sequence, for our proposed models). Accuracy in this work is calculated as

$$\text{accuracy} = \frac{\#\ \text{correctly decoded sequences}}{\#\ \text{test samples}}$$

while the number of errors is the number of incorrectly decoded sequences. This means a good model is one that achieves higher accuracy with fewer errors. Another figure reported in this section is the inference time, averaged per image, since each image takes a different amount of time to decode.
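To make the metric concrete, here is a small sketch of the evaluation under our reading of the three outcome states; the function name is ours, and `None` stands for "no barcode found".

```python
def evaluate(predictions, ground_truths):
    """predictions: decoded sequences or None (no checksum-satisfied sequence).
    accuracy = correct / total; errors = incorrect (non-None) sequences."""
    decoded = [(p, g) for p, g in zip(predictions, ground_truths)
               if p is not None]
    correct = sum(p == g for p, g in decoded)
    errors = len(decoded) - correct
    accuracy = correct / len(ground_truths)
    return accuracy, errors
```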
Table II: Basic evaluation on the test set: accuracy, average CPU inference time per image, and number of parameters.

Model | Accuracy | CPU (ms) | # of params (M)
---|---|---|---
Zxing | 58.25% | 7.65 | NA
Dynamsoft | 93.10% | 978.8 | NA
Google API | 82.45% | 211.9 | NA
Cognex | 84.60% | 111.9 | NA
ResNet50 | 93.35% | 66.5 | 99.5
MobileNetV2 | 72.25% | 32.4 | 15.7
MobileNetV2_kd | 83.45% | 32.4 | 15.7
DenseNet169 | 84.90% | 75.65 | 30
ResNet34 | 88.70% | 38.3 | 40.7
ResNet34_kd | 89.20% | 38.3 | 40.7
Fridborn-similar | 31.85% | 104.9 | 403.3
Non-residual | 80.80% | 103.2 | 78.5
Table III: Accuracy and number of errors with Algorithm 1 for Max Iteration 1 to 4 (nonMPA denotes plain inference without Smart Inference).

 | Model | nonMPA | max=1 | max=2 | max=3 | max=4
---|---|---|---|---|---|---
Accuracy | ResNet50 | 0.9335 | 0.942 | 0.9435 | 0.9445 | 0.9445
 | MobileNetV2 | 0.8345 | 0.8595 | 0.869 | 0.868 | 0.8645
 | DenseNet169 | 0.849 | 0.8645 | 0.876 | 0.8775 | 0.874
 | ResNet34 | 0.892 | 0.9075 | 0.911 | 0.912 | 0.911
 | Non-residual | 0.808 | 0.8295 | 0.844 | 0.8455 | 0.841
# of errors | ResNet50 | 133 | 31 | 48 | 77 | 106
 | MobileNetV2 | 331 | 59 | 106 | 164 | 241
 | DenseNet169 | 302 | 54 | 105 | 176 | 230
 | ResNet34 | 216 | 41 | 73 | 116 | 157
 | Non-residual | 384 | 55 | 104 | 197 | 285
Table IV: Accuracy and number of errors with Algorithm 2 (fast-track augmentation) for Max Iteration 1 to 4.

 | Model | nonMPA | max=1 | max=2 | max=3 | max=4
---|---|---|---|---|---|---
Accuracy | ResNet50 | 0.9335 | 0.958 | 0.956 | 0.951 | 0.946
 | MobileNetV2 | 0.8345 | 0.906 | 0.8975 | 0.8855 | 0.8705
 | DenseNet169 | 0.849 | 0.9155 | 0.9075 | 0.8915 | 0.877
 | ResNet34 | 0.892 | 0.9375 | 0.9315 | 0.9235 | 0.916
 | Non-residual | 0.808 | 0.89 | 0.879 | 0.861 | 0.8445
# of errors | ResNet50 | 133 | 56 | 73 | 95 | 108
 | MobileNetV2 | 331 | 121 | 182 | 225 | 259
 | DenseNet169 | 302 | 113 | 172 | 216 | 246
 | ResNet34 | 216 | 95 | 129 | 152 | 168
 | Non-residual | 384 | 145 | 212 | 274 | 311
Table V: Accuracy and number of errors with Algorithm 3 (voting augmentation) for Max Iteration 1 to 4.

 | Model | nonMPA | max=1 | max=2 | max=3 | max=4
---|---|---|---|---|---|---
Accuracy | ResNet50 | 0.9335 | 0.9585 | 0.9585 | 0.9525 | 0.95
 | MobileNetV2 | 0.8345 | 0.9085 | 0.906 | 0.899 | 0.8945
 | DenseNet169 | 0.849 | 0.9125 | 0.8985 | 0.8785 | 0.8545
 | ResNet34 | 0.892 | 0.933 | 0.93 | 0.9215 | 0.911
 | Non-residual | 0.808 | 0.8855 | 0.8735 | 0.853 | 0.8355
# of errors | ResNet50 | 133 | 55 | 68 | 92 | 100
 | MobileNetV2 | 331 | 116 | 165 | 198 | 211
 | DenseNet169 | 302 | 119 | 190 | 242 | 291
 | ResNet34 | 216 | 104 | 132 | 156 | 178
 | Non-residual | 384 | 154 | 223 | 290 | 329
Regarding inference time, we note that a perfect comparison is difficult, since Dynamsoft, the Google API, and Cognex were tested via APIs running on their own servers, which do not match our desktop configuration. Since some of the tools do not use machine learning (Dynamsoft and Cognex use DNNs in many of their other products, so theirs might be DNN models as well), their processing times are relatively smaller than those of the deep learning-based techniques, but with lower prediction accuracy. The basic evaluation is presented in Table II. Our best base model outperforms the other tools, with reasonable prediction time and an accuracy above 0.93.
To show the performance gained by applying Smart Inference at test time, we performed three experiments. The first uses Algorithm 1. As depicted in Table III, performance improves over the models without it. The results also show that performance improves as the number of considered gaps increases up to some level, after which it degrades. The second experiment demonstrates how fast-track augmentation (Algorithm 2) improves MPA performance. It clearly improves over Algorithm 1, as depicted in Table IV, and improves significantly over the basic approach; however, this time, after considering just one pair with the smallest gap, we already reach the best results. As with MPA without augmentation (Algorithm 1), the number of errors in Algorithm 2 increases as the number of considered gaps grows; nevertheless, Algorithm 1 still shows fewer errors than Algorithm 2. The third experiment corresponds to Algorithm 3 and comes with more cost, as shown in Table V. It sometimes slightly outperforms Algorithm 2, with similar behavior as the Max Iteration parameter changes. This suggests that the voting scenario is not always a good choice for our models; the models might already be relatively robust on the original input image and only need help after failing in the first place.
As mentioned in the last section, to demonstrate the possibility of running the solution on portable devices, we distilled knowledge from the best model (ResNet50) into 2 small models: ResNet34 and MobileNetV2. The results in Table II clearly show that knowledge distillation helps gain higher performance compared with training with the normal loss function. MobileNetV2 jumps considerably from 72.25% to 83.45%, while the improvement for ResNet34 is small. This could be because ResNet34 is still deep (40.7 million parameters compared to MobileNetV2's 15.7 million), so its own learning ability is robust enough to reach high performance without guidance from the teacher model. Finally, our tests on the NVIDIA Jetson Nano board show that MobileNetV2 and ResNet34 achieve average speeds of 34.2 and 45.6 milliseconds per image, respectively. This speed is equivalent to a smooth frames-per-second experience; combined with the robustness of the model, we expect it to be comfortable for users.
V Conclusion
In this work, we have proposed Smart Inference for Multidigit CNN based models to improve the performance of 1D barcode decoding. We collected multiple real barcodes with label data to train and test the proposed model, and added synthesized data to strengthen the training and testing process. The algorithms applied at test time boosted performance over the base models: the approach not only outperforms the base model in accuracy but also keeps inference time small, which makes it efficient. The Multidigit CNN based approach with Smart Inference is also a scalable solution, as it can be extended to decode more than one barcode type. We have also shown that the distillation technique effectively transfers knowledge from the best model to a shallower model that runs on low-computation edge devices and performs clearly better than training with the normal loss function.
Even though the performance is better in terms of accuracy, the proposed model still has the limitation of predicting false records (Dynamsoft also predicts 3 errors). Another limitation of this approach is that it is not applicable to non-fixed-length barcode types such as Code 39. In the future, the problem of false predictions can be mitigated by applying product recognition techniques.
VI Acknowledgement
This work was supported and funded by the Ministry of Science and ICT (MSIT) under the Korea-EU Joint Research Support Project of the National Research Foundation of Korea (NRF-2016K1A3A7A0395205414), the Main Research Program (E0162502) of the Korea Food Research Institute (KFRI), and the Grand Information Technology Research Center support program (IITP-2020-0-01489) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).
References
- [1] W. P. Fernandez, Y. Xian, and Y. Tian, “Image-based barcode detection and recognition to assist visually impaired persons,” in 2017 IEEE 7th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER). IEEE, 2017, pp. 1241–1245.
- [2] E. Joseph and T. Pavlidis, “Bar code waveform recognition using peak locations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 6, pp. 630–640, 1994.
- [3] R. Muniz, L. Junco, and A. Otero, “A robust software barcode reader using the hough transform,” in Proceedings 1999 International Conference on Information Intelligence and Systems (Cat. No. PR00446). IEEE, 1999, pp. 313–319.
- [4] S. Wachenfeld, S. Terlunen, and X. Jiang, “Robust recognition of 1-d barcodes using camera phones,” in 2008 19th International Conference on Pattern Recognition. IEEE, 2008, pp. 1–4.
- [5] F. Fridborn, “Reading barcodes with neural networks,” 2017. [Online]. Available: http://liu.diva-portal.org/smash/record.jsf?pid=diva2:1164104
- [6] D.-T. Lin, M.-C. Lin, and K.-Y. Huang, “Real-time automatic recognition of omnidirectional multiple barcodes and dsp implementation,” Machine Vision and Applications, vol. 22, no. 2, pp. 409–419, 2011.
- [7] M. Katona and L. G. Nyúl, “A novel method for accurate and efficient barcode detection with morphological operations,” in 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems. IEEE, 2012, pp. 307–314.
- [8] G. Sörös and C. Flörkemeier, “Blur-resistant joint 1d and 2d barcode localization for smartphones,” in Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia, 2013, pp. 1–8.
- [9] C. Creusot and A. Munawar, “Low-computation egocentric barcode detector for the blind,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 2856–2860.
- [10] ——, “Real-time barcode detection in the wild,” in 2015 IEEE winter conference on applications of computer vision. IEEE, 2015, pp. 239–245.
- [11] D. K. Hansen, K. Nasrollahi, C. B. Rasmussen, and T. B. Moeslund, “Real-time barcode detection and classification using deep learning.” in IJCCI, 2017, pp. 321–327.
- [12] A. Zamberletti, I. Gallo, M. Carullo, and E. Binaghi, “Decoding 1-d barcode from degraded images using a neural network,” in International Conference on Computer Vision, Imaging and Computer Graphics. Springer, 2010, pp. 45–55.
- [13] S. Owen et al., “Zxing,” Zebra Crossing, 2013.
- [14] H. Yang, L. Chen, Y. Chen, Y. Lee, and Z. Yin, “Automatic barcode recognition method based on adaptive edge detection and a mapping model,” Journal of Electronic Imaging, vol. 25, no. 5, p. 053019, 2016.
- [15] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, “Multi-digit number recognition from street view imagery using deep convolutional neural networks,” arXiv preprint arXiv:1312.6082, 2013.
- [16] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol. 6, no. 1, p. 60, 2019.
- [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- [18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [19] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation strategies from data,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 113–123.
- [20] J. Lemley, S. Bazrafkan, and P. Corcoran, “Smart augmentation learning an optimal data augmentation strategy,” IEEE Access, vol. 5, pp. 5858–5869, 2017.
- [21] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv preprint arXiv:1710.09282, 2017.
- [22] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [23] Anonymous, “How to calculate a check digit manually - services,” Dec 2014. [Online]. Available: https://www.gs1.org/services/how-calculate-check-digit-manually
- [24] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
- [25] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.