
Fine-Grained Image Generation from Bangla Text Description using Attentional Generative Adversarial Network

Md Aminul Haque Palash1, Md Abdullah Al Nasim2, Aditi Dhali3 and Faria Afrin4 Department of Research and Development; Pioneer Alpha1,2,3,4 ; Dhaka, Bangladesh
[email protected]1, [email protected]2, [email protected]3 and [email protected]4
Abstract

Generating fine-grained, realistic images from text has many applications in the visual and semantic domains. Motivated by this, we propose a Bangla Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage processing for high-resolution Bangla text-to-image generation. Our model can integrate the most specific details at different sub-regions of the image by concentrating on the relevant words in the natural language description. The framework achieves an improved inception score of 3.58 ± .06 on the CUB dataset with Bangla text descriptions. For the first time, fine-grained images are generated from Bangla text using an attentional GAN. Bangla ranks 7th among the 100 most spoken languages, which motivates us to focus explicitly on this language and serve the needs of a large population. Moreover, Bangla has a more complex syntactic structure and fewer natural language processing resources, which further validates our work.

Index Terms:
GAN, Bangla, text-to-image, Fine-grained image.

I Introduction

In recent years, image processing has become a promising sector to explore. Its vast application area encourages more and more researchers to work in this field every day. Nowadays, its application is not limited to processing everyday camera-captured images; rather, image processing plays a very important role in the medical field, the agricultural sector, and even in education. Combining natural language processing and digital image processing, a new thriving sector is emerging that strongly impacts current research. In this paper, we work on the Bangla natural language and generate high-resolution images (a text-to-image converter) using attentional GAN, a useful element of image processing.

The application of Generative Adversarial Networks (GANs) [1] has increased dramatically in recent years; using a unique attentional generative network, they can generate well-constructed information at different sub-parts of an image. In the deep learning field, GANs have gained a lot of attention for the variety of their applications and their recent popularity, such as image-to-image translation [2], [3], [4], text-to-image synthesis [5], [6], [7], [8], and generating images based on discrete labels [9], [10].

Generative Adversarial Networks (GANs) [1] are among the latest algorithms suggested and employed for text-to-image generation, producing life-like images from text descriptions (see Fig. 1). Building on a mechanism that captures structured word-level information for multi-staged text-to-image synthesis, we propose a Bangla Attentional Generative Adversarial Network (AttnGAN) for synthesizing intricate scenes. Multi-stage approaches [6], [7], [11] produce low-resolution initial images, which are then refined into high-resolution images.

Figure 1: Examples of our proposed model's outcomes. The first line displays low- to high-resolution images, while the second and third lines display the top-five most-attended words.

In this paper, we propose a Bangla attentional GAN for well-constructed text-to-image generation that supports attention-driven, multi-stage processing, inspired by [7]. Experiments were performed to synthesize images from text descriptions, covering the two main components, AttnGAN and DAMSM, by translating the CUB [12] dataset into Bangla (Bengali). To evaluate the proposed Bangla AttnGAN experimentally, a comprehensive investigation is carried out, which outperforms the previous state-of-the-art models and enhances the quality of generated images. Our contributions are as follows:

  • We build a text-to-image converter that generates high-resolution (256×256), fine-grained images.

  • We are the first to work on Bangla-language text-to-image generation.

II Related Works

Creating images from natural language descriptions is one of the essential uses of recent conditional generative models. Art generation, image editing (e.g., brightening and enhancing), computer-aided design, and many other fields are amazing applications of generating well-constructed, life-like, high-resolution images from text descriptions. Recently, incredible advancement has been accomplished along this path with the rise of deep generative models. The combination of deep architectures and the deep convolutional generative adversarial network (GAN) formulation can effectively bridge advances in text and image modeling, as GANs generate highly compelling images [5]. MirrorGAN considers both global and local attention and preserves semantic coherence for text-to-image generation [13]. A semantic text embedding module (STEM), a global-local collaborative attentive module (GLAM), and a semantic text regeneration and alignment module (STREAM) are used to build this model. AttnGAN is another model proposed to generate fine-grained images from text [7]; it permits attention-driven, multi-stage refinement for this type of image generation. Naveen et al. later improved this model by combining BERT, GPT-2, and XLNet with AttnGAN to better extract semantic information from the text descriptions [14]. Sharma et al. [15] came up with the idea of adding dialogue on top of text-to-image generation, leading to significant progress in the inception score. ControlGAN is a word-level spatial and channel-wise attention-driven text-to-image generator [16]. Yin et al. [17] proposed another photo-realistic text-to-image generator that fulfills both high-level and low-level semantic consistency. Attention loss and diversity loss have also been used to enhance the sensitivity of the GAN [18].

On the other hand, the Zero-Shot framework models the text and image tokens as a single stream of data [19]. In the ManiGAN framework, an image is semantically edited part by part to match a given text, focusing on the desired attributes [20]. TediGAN was proposed by Xia et al. for multi-modal image generation; this model can effectively manipulate images using textual descriptions [21]. Another model for generating face images from text was proposed by Khan et al. [22]. Schulze et al. generate images from text using a combined-attention GAN, which can produce photo-realistic images from textual descriptions [23].

III Methodology

A GAN is a generative model trained as a pair of neural networks, usually called the "generator" (generative network) and the "discriminator" (discriminative network). The generative network learns how to generate new plausible samples, while the discriminative network learns how to distinguish fabricated examples from real ones.
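For illustration, the following is a minimal PyTorch sketch of one training step of this two-network game. The layer sizes, optimizers, and data shapes are illustrative assumptions and not our actual implementation.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator (illustrative sizes; real AttnGAN modules are far larger).
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_images):                  # real_images: (batch, 784)
    batch = real_images.size(0)
    fake_images = G(torch.randn(batch, 100))  # generator maps noise to samples

    # Discriminator learns to score real images as 1 and generated images as 0.
    d_loss = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake_images.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator learns to make the discriminator score its samples as real.
    g_loss = bce(D(fake_images), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```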

The attentional generative network and the deep attentional multi-modal similarity model (DAMSM) are the two distinct components of our proposed Bangla Attentional Generative Adversarial Network (AttnGAN).

III-A Attentional Generative Adversarial Networks

The attentional generative network enables attention-driven, long-range dependency modeling for image generation from Bangla text. In conventional convolutional GANs for high-resolution image synthesis, fine-grained details are generated as a function of only spatially local points. In the attentional generative network, details can instead be constructed using signals from all feature locations. Furthermore, the discriminator is used to ensure the consistency of fine-detailed feature pieces in far-apart portions of the image. Current GAN-based models for text-to-image generation encode the full sentence into a single vector as the condition for image generation, although fine-grained word-level information is also required. The generative networks in our model are able to draw distinct sub-regions of the image conditioned on the words most relevant to those sub-regions. The three generators in our proposed Bangla AttnGAN accept the hidden states as input and generate images at small to large scales as output. The text encoder encodes sentence and word features as they flow through the model at various stages. Conditioning Augmentation is used for the conversion between the sentence vector and the conditioning vector.
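As a sketch of the Conditioning Augmentation step mentioned above, the module below samples a conditioning vector around the sentence embedding via the reparameterization trick; the dimensions are assumptions, not the exact values used in our model.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Map a sentence embedding to a smoother conditioning vector by sampling
    from a Gaussian whose mean and log-variance are predicted by a linear layer."""
    def __init__(self, sent_dim=256, cond_dim=100):
        super().__init__()
        self.fc = nn.Linear(sent_dim, cond_dim * 2)      # predicts mean and log-variance

    def forward(self, sentence_embedding):               # (batch, sent_dim)
        mu, logvar = self.fc(sentence_embedding).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)                      # reparameterization trick
        cond = mu + eps * std                            # conditioning vector
        return cond, mu, logvar                          # stats are kept for a KL regularizer
```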

The word features and the image features from the preceding hidden layer are both inputs to the attention model. By introducing a new perceptron layer, the word features are first mapped into the common semantic space of the image features. Then, depending on the image's hidden features, a word-context vector is computed for each sub-region. At the next stage, images are generated from the combination of the image features and the associated word-context features. To create realistic images conditioned on several levels of information (i.e., word level and sentence level), the final objective function of the attentional generative network is defined as:

L = L_{G} + \beta L_{DAMSM}   (1)

where,

L_{G} = \sum_{i=0}^{n-1} L_{G_{i}}   (2)

Here, β is a hyperparameter that balances the two terms of Eq. (1). The first term represents the GAN loss that jointly approximates the conditional and unconditional distributions. Together with its corresponding discriminator D_{i}, the adversarial loss for G_{i} is defined as the combination of an unconditional loss and a conditional loss. The unconditional loss determines whether the image is real or fake, and the conditional loss evaluates the similarity between the image and the sentence.
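The word-level attention described above, which projects word features into the image feature space and computes a word-context vector for each image sub-region, can be sketched as follows; the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def word_context_vectors(word_feats, img_feats, proj):
    """word_feats: (batch, n_words, word_dim)  -- text-encoder outputs
       img_feats:  (batch, n_regions, img_dim) -- hidden image features
       proj:       linear layer (the new perceptron) mapping words into image space."""
    words = proj(word_feats)                                # (batch, n_words, img_dim)
    scores = torch.bmm(img_feats, words.transpose(1, 2))    # (batch, n_regions, n_words)
    attn = F.softmax(scores, dim=-1)                        # attention over words per sub-region
    return torch.bmm(attn, words)                           # word-context vector per sub-region

# Example with assumed sizes: 4 images, 18 words, 17x17 = 289 sub-regions.
proj = nn.Linear(256, 48)
ctx = word_context_vectors(torch.randn(4, 18, 256), torch.randn(4, 289, 48), proj)
```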

Our Bangla attentional generative adversarial network produces high-resolution output from Bangla words and sentences. The overall architecture is shown in Fig. 2.

Figure 2: Overall architecture of our proposed model.

III-B Deep Attentional Multimodal Similarity Model

The DAMSM trains two neural networks that map sub-regions of the image and words of the sentence to a common semantic space, computing image-sentence matching at the word level to obtain a fine-grained loss for image generation.

III-B1 Text encoder

The text encoder is a bidirectional Long Short-Term Memory (LSTM) network [24] that extracts semantic vectors from the text description. We concatenate its two hidden states to represent the semantic meaning of a word.
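A minimal sketch of such a bidirectional LSTM text encoder is shown below; the vocabulary size and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bidirectional LSTM: each word is represented by the concatenation of the
    forward and backward hidden states; the final states summarize the sentence."""
    def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                         # (batch, n_words)
        emb = self.embed(token_ids)
        word_feats, (h_n, _) = self.lstm(emb)             # word_feats: (batch, n_words, 2*hidden)
        sent_feat = torch.cat([h_n[0], h_n[1]], dim=-1)   # sentence vector: (batch, 2*hidden)
        return word_feats, sent_feat
```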

III-B2 Image encoder

The image encoder is a Convolutional Neural Network (CNN) that maps images into the common semantic space. Images are partitioned into several sub-regions: the intermediate CNN layers capture the local features of the different sub-regions of the image, while the last layers extract the global features of the image.
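The local/global split can be illustrated with the toy CNN encoder below. The original AttnGAN [7] builds its image encoder on a pre-trained Inception-v3 backbone, whereas this sketch uses a small self-contained network with assumed dimensions.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Toy CNN encoder: an intermediate feature map supplies local sub-region
    features, while a pooled final layer supplies the global image feature."""
    def __init__(self, feat_dim=48):
        super().__init__()
        self.local_net = nn.Sequential(                   # keeps spatial resolution
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.global_net = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))

    def forward(self, images):                            # (batch, 3, 256, 256)
        local_map = self.local_net(images)                # (batch, feat_dim, 64, 64)
        local_feats = local_map.flatten(2).transpose(1, 2)    # one vector per sub-region
        global_feat = self.global_net(local_map).flatten(1)   # (batch, feat_dim)
        return local_feats, global_feat
```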

III-B3 Attention-driven image-text matching score

The attention-driven image-text matching score quantifies how well an image-text pair matches, based on an attention model between the image sub-regions and the words of the text. Images for the following stage are then created by merging image features with the relevant text-context features.
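A simplified, cosine-similarity-based version of this matching score is sketched below; the smoothing factors gamma1 and gamma2 are illustrative hyperparameters, and the code is not our exact implementation.

```python
import torch
import torch.nn.functional as F

def matching_score(word_feats, region_feats, gamma1=5.0, gamma2=5.0):
    """word_feats:   (n_words, d)   word embeddings of one caption
       region_feats: (n_regions, d) sub-region features of one image
       Returns a scalar image-text matching score (higher = better match)."""
    words = F.normalize(word_feats, dim=-1)
    regions = F.normalize(region_feats, dim=-1)
    sim = words @ regions.t()                               # cosine similarity word vs. region
    attn = F.softmax(gamma1 * sim, dim=-1)                  # attention of each word over regions
    context = attn @ region_feats                           # region-context vector per word
    rel = F.cosine_similarity(context, word_feats, dim=-1)  # word-level relevance
    return torch.logsumexp(gamma2 * rel, dim=0) / gamma2    # pooled image-text score
```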

Figure 3: Comparison between images generated by our model and real images.
Figure 4: More examples from our AttnGAN on CUB as supplementary material. The first line gives the low- to high-resolution pictures; the second and third lines show the top-5 most-attended words.

Following AttnGAN [7], we then minimize a loss that classifies the input as a real or fake image.

III-B4 DAMSM Loss

When comparing images and text descriptions, we use the DAMSM loss [7]. The DAMSM loss is important because it improves the consistency of the generated images with their text descriptions.

IV Machine configuration

We run our experiments on a machine with 1 GPU, 4 vCPUs, and 61 GiB of RAM. Details of the machine are presented in Table I.

TABLE I: Machine details
Name GPUs vCPUs RAM (GiB) Network Bandwidth
p2.xlarge 1 4 61 High

V Experimental Result

We perform extensive quantitative and qualitative evaluations to validate our method, carrying out a large-scale experimental analysis of our proposed Bangla AttnGAN. First, we study the AttnGAN and the DAMSM. After that, we compare against the previous GAN models mentioned earlier for text-to-image synthesis. We trained the DAMSM model for 200 epochs and the AttnGAN model for 600 epochs.

V-A Dataset

As in previous text-to-image methods [25], [5], [6], [11], [7], CUB [12] is used to evaluate our proposed method. This dataset contains bird images from 200 categories, with 10 captions per image [25]. Table II lists the details of this dataset. Using a train:test split ratio of 70:30, we obtain our best result, an FID score of 41.08, which is better than the other splits.

Other methods have used the COCO dataset [26]. We are still in the process of adapting this dataset for use with the Bangla language.

TABLE II: Statistics of the CUB dataset under different train:test split ratios
Split ratio     | 70:30          | 80:20          | 60:40
                | Train  | Test  | Train  | Test  | Train  | Test
#Samples        | 8,855  | 2,933 | 9,500  | 2,288 | 7,200  | 4,588
Captions/image  | 10     | 10    | 10     | 10    | 10     | 10
FID score       | 41.08          | 52.34          | 58.27

V-B Preprocessing

For the implementation, we use the Natural Language Toolkit (NLTK), a Python library for statistical natural language processing (NLP). To translate the text from English to Bengali, we use Google Translator.
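For example, captions can be lower-cased and tokenized with NLTK's RegexpTokenizer before building the vocabulary; the exact pattern below is an assumption for illustration, not necessarily the one used in our pipeline.

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")   # keep word characters, drop punctuation

def preprocess_caption(caption):
    # Lower-case and tokenize; the word-character class matches Unicode,
    # so Bangla captions are split on whitespace and punctuation as well.
    return tokenizer.tokenize(caption.lower())

print(preprocess_caption("This bird has a red head and a short beak."))
# ['this', 'bird', 'has', 'a', 'red', 'head', 'and', 'a', 'short', 'beak']
```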

V-C Google translator

Google Translator is a Google-provided API that currently translates between 108 languages. It is fast, free, and recognizes most existing languages. We use it to translate the input text collected from the CUB dataset. Although it does not always provide a fully accurate translation, it was successful roughly 95% of the time.
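A sketch of how the CUB captions could be translated in bulk is given below, assuming the unofficial googletrans Python client for Google Translate; the package choice and error handling are tooling assumptions rather than part of our method description.

```python
from googletrans import Translator   # unofficial Google Translate client (assumption)

translator = Translator()

def translate_captions(captions, src="en", dest="bn"):
    """Translate a list of English CUB captions to Bangla.
    Failed translations fall back to English so the dataset stays aligned."""
    translated = []
    for caption in captions:
        try:
            translated.append(translator.translate(caption, src=src, dest=dest).text)
        except Exception:
            translated.append(caption)   # keep the original text on API errors
    return translated

bangla = translate_captions(["this bird has a bright red crown and black wings."])
```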

V-D Evaluation Metric

The evaluation of generative models has received a lot of attention recently, and numerous quantitative and qualitative methods have been presented so far. The performance of GANs needs to be monitored closely. A few metrics that can be used to evaluate GANs are as follows:

V-D1 Quantitative Evaluation

Inception Score: The Inception Score [27] is a metric for evaluating the quality of produced images, especially synthetic images generated by generative adversarial network models. It was created in order to reduce subjective human image judgment. The inception score uses a pre-trained deep learning neural network model for image classification to classify the generated images.
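The score can be computed from the classifier's class-probability outputs as sketched below; the random predictions are a stand-in for real Inception-v3 outputs on generated images.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """p_yx: (n_images, n_classes) softmax outputs of a pre-trained classifier
    for the generated images. Returns exp(mean KL(p(y|x) || p(y)))."""
    p_y = p_yx.mean(axis=0, keepdims=True)                  # marginal class distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Toy example: 1000 fake "predictions" over 200 CUB classes.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(200) * 0.1, size=1000)
print(inception_score(p))
```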

TABLE III: Inception scores by other GAN models [5, 6, 11, 25] and our Bangla AttnGAN on the CUB test set.
Dataset GAN-INT-CLS [5] StackGAN [6] StackGAN-v2 [11] GAWWN [25] Bangla AttnGAN
CUB 2.88 ± .04 3.70 ± .04 3.82 ± .06 3.62 ± .07 3.58 ± .06
TABLE IV: Performance of FID and Inception scores by other GAN models [5, 28, 6, 11, 25] and our Bangla-AttnGAN on the CUB dataset with Bangla text data.
Medium | Reference | Dataset | Technique      | FID    | Inception Score
Bangla | [5]       | CUB     | GAN-INT-CLS    | 134.23 | 2.01 ± .02
Bangla | [6]       | CUB     | StackGAN       | 73.02  | 2.94 ± .09
Bangla | [11]      | CUB     | StackGAN-v2    | 58.33  | 3.16 ± .02
Bangla | [25]      | CUB     | GAWWN          | 98.91  | 2.59 ± .06
Bangla | –         | CUB     | Bangla AttnGAN | 41.08  | 3.58 ± .06

FID Score: The Fréchet Inception Distance (FID) [29], which is computed from Inception-v3 features, has achieved significant usage because of its concordance with human inspection and its sensitivity to small changes in the real distribution. It is used to assess the quality and consistency of the generated images: as the FID drops, the quality improves, i.e., real and generated images become strikingly similar. The FID score is calculated as:

FID = |\mu - \mu_{w}|^{2} + \operatorname{tr}\left(\Sigma + \Sigma_{w} - 2(\Sigma\Sigma_{w})^{1/2}\right)   (3)

The FID score is the squared Wasserstein-2 distance between two multidimensional Gaussian distributions: 𝒩(μ, Σ), fitted to the Inception-network features of the images produced by the GAN, and 𝒩(μ_w, Σ_w), fitted to the same features extracted from the real-world images used to train the GAN. When the generated and real images are passed through the Inception network, the means and covariances of the activations are measured and plugged into Eq. (3).
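Equation (3) can be evaluated directly from the two feature sets, as in the sketch below; the random features are a stand-in for real Inception-v3 activations.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_score(real_feats, fake_feats):
    """real_feats, fake_feats: (n, d) Inception activations of real and generated
    images. Implements Eq. (3): |mu - mu_w|^2 + tr(S + S_w - 2 (S S_w)^{1/2})."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Toy example with random 64-d features (real usage extracts 2048-d Inception-v3 activations).
rng = np.random.default_rng(0)
print(fid_score(rng.normal(size=(256, 64)), rng.normal(size=(256, 64)) + 0.1))
```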

V-D2 Qualitative evaluation

Figure 5: Results of our Bangla AttnGAN model trained on the CUB dataset, with some of the most frequently used words in the sentence changed.

Generalization Ability: Through our experiments, we have demonstrated the AttnGAN's ability to generalize by creating images from text descriptions. Fig. 3 shows a few examples; the difference between real and model-generated images can be seen there, and it illustrates that the generated images follow our text descriptions. We also observe that images generated by AttnGAN are sharp, show great detail, and are very close to realistic. Moreover, as shown in Fig. 5, we modified some of the most frequently used words in the text descriptions to see how sensitive the results are to changes in the input; the results demonstrate that our proposed model is capable of detecting subtle semantic variations in text descriptions. We also examine the AttnGAN's intermediate outputs and attention maps to better grasp what the AttnGAN has learned. As shown in Fig. 4, the AttnGAN's initial stage only paints the basic size, structure, and colors of objects and produces low-detail images. The following stages then learn to fix flaws in the previous stages' outcomes and add more information to create fine-detailed images.

To summarize, our proposed Bangla AttnGAN's generalization ability is demonstrated by the results presented in Figs. 3, 4, and 5.

V-D3 Comparison with previous methods

Text-to-image generation on the CUB dataset with Bangla text descriptions is compared between Bangla AttnGAN and previous state-of-the-art GAN models. As shown in Tables II and III, our Bangla AttnGAN achieves an Inception score of 3.58 ± .06 and an FID score of 41.08 on the CUB dataset. Table IV reports the FID and Inception score performance on the CUB dataset with Bangla text data for state-of-the-art GAN models [5, 28, 6, 11, 25] and our Bangla AttnGAN. On the CUB dataset, our Bangla AttnGAN can generate 256×256 images for various scenarios. The experiments show that, thanks to its novel attention mechanism, our Bangla AttnGAN model produces more complex images of higher quality than the other approaches.

VI Conclusion and Future Work

In this study, we present Bangla AttnGAN for high-resolution Bangla text-to-image analysis and generation. To generate high-quality images, we first develop a novel attentional generative network; image quality is ensured using a multi-stage procedure. Second, a deep attentional multi-modal similarity model is presented for training the generator, which computes a fine-grained image-text matching loss. Our model achieves an inception score of 3.58 ± .06, outperforming prior GAN models for Bangla text-to-image synthesis. Moreover, this is one of the first works combining the GAN model with the Bangla language. Our large-scale experimental analysis effectively demonstrates the potency and efficacy of our proposed model. We plan to work on the more complex COCO dataset, as no such work has been carried out for the Bangla language. Moreover, we look forward to analyzing Bangla text with more complex structures.

References

  • [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [2] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
  • [3] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
  • [4] Z. Yi, H. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual learning for image-to-image translation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2849–2857.
  • [5] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International Conference on Machine Learning.   PMLR, 2016, pp. 1060–1069.
  • [6] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5907–5915.
  • [7] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1316–1324.
  • [8] Z. Zhang, Y. Xie, and L. Yang, “Photographic text-to-image synthesis with a hierarchically-nested adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6199–6208.
  • [9] T. Miyato and M. Koyama, “cgans with projection discriminator,” arXiv preprint arXiv:1802.05637, 2018.
  • [10] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” in International conference on machine learning.   PMLR, 2017, pp. 2642–2651.
  • [11] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan++: Realistic image synthesis with stacked generative adversarial networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 8, pp. 1947–1962, 2018.
  • [12] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
  • [13] T. Qiao, J. Zhang, D. Xu, and D. Tao, “Mirrorgan: Learning text-to-image generation by redescription,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1505–1514.
  • [14] S. Naveen, M. R. Kiran, M. Indupriya, T. Manikanta, and P. Sudeep, “Transformer models for enhancing attngan based text to image generation,” Image and Vision Computing, p. 104284, 2021.
  • [15] S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio, “Chatpainter: Improving text to image generation using dialogue,” arXiv preprint arXiv:1802.08216, 2018.
  • [16] B. Li, X. Qi, T. Lukasiewicz, and P. H. Torr, “Controllable text-to-image generation,” arXiv preprint arXiv:1909.07083, 2019.
  • [17] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, “Semantics disentangling for text-to-image generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2327–2336.
  • [18] T. Hu, C. Long, and C. Xiao, “Crd-cgan: Category-consistent and relativistic constraints for diverse text-to-image generation,” arXiv preprint arXiv:2107.13516, 2021.
  • [19] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” arXiv preprint arXiv:2102.12092, 2021.
  • [20] B. Li, X. Qi, T. Lukasiewicz, and P. H. Torr, “Manigan: Text-guided image manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7880–7889.
  • [21] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, “Tedigan: Text-guided diverse face image generation and manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2256–2265.
  • [22] M. Z. Khan, S. Jabeen, M. U. G. Khan, T. Saba, A. Rehmat, A. Rehman, and U. Tariq, “A realistic image generation of face from text description using the fully trained generative adversarial networks,” IEEE Access, vol. 9, pp. 1250–1260, 2020.
  • [23] H. Schulze, D. Yaman, and A. Waibel, “Cagan: Text-to-image generation with combined attention gans,” arXiv preprint arXiv:2104.12663, 2021.
  • [24] Z. Huang, W. Xu, and K. Yu, “Bidirectional lstm-crf models for sequence tagging,” arXiv preprint arXiv:1508.01991, 2015.
  • [25] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” Advances in neural information processing systems, vol. 29, pp. 217–225, 2016.
  • [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [27] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, pp. 2234–2242, 2016.
  • [28] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski, “Plug & play generative networks: Conditional iterative generation of images in latent space,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4467–4477.
  • [29] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.