Synthesizing spoken descriptions of images
I Introduction
Automatically describing visual scenes with natural language has great potential in many scenarios, e.g., helping visually impaired people interact with their surroundings. In recent years, many works [xu2015show, chen2018boosted, li2019entangled, huang2019attention, cornia2020meshed, rennie2017self, chen2017show, keneshloo2019deep] have been published in the field of image captioning, which aims to automatically synthesize textual descriptions of images. This task, inspired by the architecture of neural machine translation and benefiting from the development of attention mechanisms, has achieved impressive results. However, these text-based methods prevent people who use unwritten languages from benefiting from image describing technology. In fact, nearly half of the world's languages have no written form. Therefore, it is necessary to develop technology that automatically describes visual scenes while bypassing text, so that speakers of any language can benefit from image captioning systems.
Hasegawa-Johnson et al. [hasegawa2017image2speech] first proposed the image-to-speech task, which aims to synthesize spoken descriptions of images without using textual descriptions. In their method, image-to-speech is decomposed into two stages: the first stage generates speech units, e.g., phonemes, with an image as input, and the second stage performs phoneme-to-speech synthesis. Three types of speech units were investigated, i.e., L1 phonemes transcribed by an ASR trained on the same language, L2 phonemes transcribed by an ASR trained on another language, and pseudo phones generated by an unsupervised acoustic unit discovery system. The L2 phoneme- and pseudo phone-based methods make the image-to-speech system usable for unwritten languages, while the L1 phoneme-based method provides a convenient reference point for comparing newly proposed methods, because L2 phonemes and pseudo phones depend heavily on the unsupervised speech unit discovery method, which is an important and independent research topic in its own right.
This image-to-speech method [hasegawa2017image2speech] offers speakers of unwritten languages a chance to benefit from image describing systems. However, as a new task, image-to-speech still has many gaps to be filled.
First, although image-to-speech is a new task, the image-to-speech unit process shares the same idea as text-based image captioning: both follow the basic structure of neural machine translation. Compared with the image-to-speech task, many more models have been proposed in the field of image captioning. However, no work has shown whether those image captioning methods can be used for image-to-speech synthesis, or how well they perform on the image-to-speech task.
Besides, in the original image-to-speech paper [hasegawa2017image2speech], the evaluation was performed on the speech unit synthesis process, with the BLEU score and unit error rate adopted as evaluation metrics. However, no methods or experiments have been proposed to show how to appropriately evaluate the image-to-speech unit process, i.e., which evaluation metrics are suitable for evaluating image-to-speech unit methods.
Moreover, the performance of this speech unit-based method depends on the quality of speech unit discovery. L1 phonemes transcribed by a well-trained same-language ASR yield good performance for the image-to-speech system, but they cannot be used for unwritten languages. L2 phonemes and pseudo phones can be used for unwritten languages, but obtaining good-quality L2 phonemes and pseudo phones is still quite challenging, and the non-L1 phonemes adopted in [hasegawa2017image2speech] performed much worse than the L1 phoneme-based method.
In this paper, we try to fill these gaps. To show whether models designed for the image captioning task can be used in the speech unit-based image-to-speech system, and how well they perform, we implemented several representative image captioning models in the image-to-speech system. The image-to-speech unit model proposed in [hasegawa2017image2speech] was re-implemented to serve as the baseline against which these image captioning methods are compared.
To assess how effective different evaluation metrics are for the image-to-speech unit task, we conducted a human rating experiment on the results produced by several different models. The human ratings were then correlated with different evaluation metrics to show which metric is most consistent with human judgment, so that we can determine which evaluation metric is the best choice for evaluating the image-to-speech unit task.
Last, L1 phonemes cannot be used for unwritten languages, L2 phones and pseudo phones perform much worse on the image-to-speech task, and developing an unsupervised system that yields good enough speech units is still quite challenging. In order to make the image-to-speech system bypass the dependency on both text and speech units, we propose an end-to-end method that synthesizes spoken descriptions directly from images.
Preliminary work was presented in [van2020evaluating, wang2020show], in which we investigated how an image-to-phoneme system could be evaluated objectively on the basis of a re-implementation of [hasegawa2017image2speech], and proposed an end-to-end image-to-speech model, respectively. In this paper, more models proposed for the image captioning task were re-implemented for the image-to-speech task, and the human rating experiments investigating how to objectively evaluate the image-to-phoneme system were performed on more methods. The contributions of this work are as follows:
• Experiments on the image-to-phoneme task with various image captioning models showed that image captioning methods can perform well on the image-to-phoneme task.
• Analysis of the correlation between various evaluation metrics and human ratings gives insight into how to appropriately evaluate the image-to-phoneme system.
• An end-to-end image-to-speech method was proposed for the first time, demonstrating that synthesizing spoken descriptions of images while bypassing text and phonemes is feasible.
The rest of the paper is organized as follows: Section II reviews related works, including image captioning and visual-speech multi-modal learning. Section III introduces several image captioning models that were re-implemented for the image-to-speech task; several evaluation metrics and the human rating method are also introduced in this section. Section IV describes the proposed end-to-end method. Section V presents the results of both the image-to-phoneme task and the end-to-end method. Section VI discusses the limitations of the current image-to-speech methods. Finally, Section VII concludes this paper.
II Related works
As we would like to investigate whether image captioning methods can be used in the image-to-speech system, related works on image captioning are reviewed in this section. In addition, since image-to-speech is a visual-speech cross-modal task, related works on visual-speech cross-modal learning are also reviewed.
II-A Image captioning
To do
II-B Visual-speech cross-modal learning
To do
III Image-to-phoneme
Introduction of this section: To do
III-A Re-implementation of the image-to-phoneme model
Following the original image-to-phoneme paper [hasegawa2017image2speech], the eXtensible Neural Machine Translation Toolkit (XNMT) [neubig2018xnmt] was adopted to re-implement the image-to-phoneme model. The image-to-phoneme model is an attention-guided encoder-decoder architecture: the encoder takes image features as input, and the decoder outputs the predicted phoneme sequence. The encoder is a 3-layer pyramidal LSTM with 128 units. The attender is a multi-layer perceptron with a state dimension of 512 and a hidden dimension of 128. The decoder is a 3-layer LSTM with 512 units, followed by a multi-layer perceptron with a hidden dimension of 1024 that transforms the LSTM outputs before a final softmax layer. Compared to the original image-to-phoneme model [hasegawa2017image2speech], we made two slight changes: the number of encoder layers was increased from 1 to 3 and the attender state dimension from 128 to 512, both of which improved performance. For convenience, this re-implemented image-to-phoneme model is referred to as I2P hereafter. Image features: To do
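For concreteness, the following is a minimal PyTorch-style sketch of the attention-guided encoder-decoder described above (a 3-layer pyramidal LSTM encoder, an MLP attender, and a 3-layer LSTM decoder followed by an MLP and softmax). It is not the XNMT implementation; the image feature dimensionality, vocabulary size, and class names are illustrative assumptions.

```python
# Minimal PyTorch-style sketch of the I2P encoder-decoder described above.
# This is NOT the XNMT implementation; feature_dim, vocabulary size, and
# class names are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidalLSTMEncoder(nn.Module):
    """3-layer pyramidal LSTM (128 units): adjacent steps are concatenated between layers."""
    def __init__(self, feature_dim=2048, hidden=128, layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = feature_dim
        for _ in range(layers):
            self.layers.append(nn.LSTM(in_dim, hidden, batch_first=True))
            in_dim = hidden * 2  # next layer sees pairs of concatenated steps

    def forward(self, x):                       # x: (B, T, feature_dim)
        for i, lstm in enumerate(self.layers):
            x, _ = lstm(x)
            if i < len(self.layers) - 1:        # halve the sequence length
                B, T, H = x.shape
                if T % 2:
                    x = x[:, :-1]
                x = x.reshape(B, T // 2, H * 2)
        return x                                # (B, T', 128)


class MLPAttender(nn.Module):
    """MLP attention with state dimension 512 and hidden dimension 128."""
    def __init__(self, enc_dim=128, state_dim=512, hidden=128):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, hidden)
        self.W_dec = nn.Linear(state_dim, hidden)
        self.v = nn.Linear(hidden, 1)

    def forward(self, enc, state):              # enc: (B, T', enc_dim), state: (B, state_dim)
        scores = self.v(torch.tanh(self.W_enc(enc) + self.W_dec(state).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)
        return (alpha * enc).sum(dim=1)         # context vector (B, enc_dim)


class PhonemeDecoder(nn.Module):
    """3-layer LSTM (512 units) + MLP (hidden 1024) + softmax over phonemes."""
    def __init__(self, n_phones, enc_dim=128, state_dim=512):
        super().__init__()
        self.state_dim = state_dim
        self.embed = nn.Embedding(n_phones, state_dim)
        self.attend = MLPAttender(enc_dim, state_dim)
        self.lstm = nn.LSTM(state_dim + enc_dim, state_dim, num_layers=3, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(state_dim, 1024), nn.Tanh(),
                                 nn.Linear(1024, n_phones))

    def forward(self, enc, prev_phone, hidden=None):   # one decoding step
        state = hidden[0][-1] if hidden is not None else enc.new_zeros(enc.size(0), self.state_dim)
        ctx = self.attend(enc, state)
        inp = torch.cat([self.embed(prev_phone), ctx], dim=-1).unsqueeze(1)
        out, hidden = self.lstm(inp, hidden)
        return self.mlp(out.squeeze(1)), hidden        # phoneme logits, new decoder state
```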
III-B Image captioning methods
To do: Why choose those methods
SAT [xu2015show] : To do
Att2in [rennie2017self]: To do
The Updown model [anderson2018bottom] combines bottom-up and top-down attention, which enables attention to be calculated at the level of objects and other salient image regions. "Bottom-up" refers to purely visual feed-forward attention mechanisms, and "top-down" refers to attention mechanisms driven by non-visual or task-specific context. In this model, the bottom-up mechanism is based on Faster R-CNN, which detects regions of interest in the image, and the top-down mechanism determines the weights assigned to the features of the different image regions.
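The following is a hypothetical sketch of the top-down weighting step: each of the k bottom-up region features is scored against the current decoder state and the regions are pooled accordingly. Dimensions and the class name are assumptions, not the authors' code.

```python
# Hypothetical sketch of top-down attention over bottom-up region features
# (e.g., k Faster R-CNN region vectors). Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopDownAttention(nn.Module):
    def __init__(self, region_dim=2048, state_dim=512, hidden=512):
        super().__init__()
        self.W_v = nn.Linear(region_dim, hidden)
        self.W_h = nn.Linear(state_dim, hidden)
        self.w_a = nn.Linear(hidden, 1)

    def forward(self, regions, h_att):            # regions: (B, k, region_dim)
        scores = self.w_a(torch.tanh(self.W_v(regions) + self.W_h(h_att).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)           # one weight per detected region
        return (alpha * regions).sum(dim=1)        # attended image feature (B, region_dim)
```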
The Attention on Attention model (AoANet) [huang2019attention] adopts a multi-head attention mechanism, similar to that of the transformer, to encode the image features. Different from the original transformer encoder, in AoANet the feed-forward layer is replaced by the proposed Attention on Attention (AoA) module. In the decoder, the AoA module is incorporated with an LSTM to predict word sequences. The AoA module is designed to determine the relevance between the attention result and the query.
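A minimal sketch of the AoA gating idea is shown below: an "information" vector and a sigmoid gate are both computed from the concatenation of the query and the attention result, and the gate filters the information. The feature dimension is a placeholder assumption.

```python
# Sketch of the AoA module: gated filtering of the attention result by its
# relevance to the query. The feature dimension is a placeholder.
import torch
import torch.nn as nn


class AoA(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.info = nn.Linear(2 * dim, dim)   # information vector
        self.gate = nn.Linear(2 * dim, dim)   # attention gate

    def forward(self, query, att_result):
        x = torch.cat([query, att_result], dim=-1)
        return torch.sigmoid(self.gate(x)) * self.info(x)   # gated attention result
```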
The M2 Transformer (Meshed-Memory Transformer) [cornia2020meshed] is a transformer-based method. Compared to the vanilla transformer architecture, its encoding and decoding layers are connected in a mesh-like structure to exploit both low-level and high-level contributions. Specifically, the cross-attention in the M2 Transformer, called meshed cross-attention, attends to all encoding layers instead of only the last one, so that multi-level contributions can be exploited by this meshed cross-attention operator.
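The meshed cross-attention idea can be sketched as follows: the decoder query attends to every encoder layer, and the per-layer results are combined through learned sigmoid gates. The layer count, dimensions, and use of nn.MultiheadAttention are illustrative assumptions rather than the original implementation.

```python
# Sketch of meshed cross-attention: attend to all encoder layers and combine
# the results with learned gates. Sizes are illustrative assumptions.
import torch
import torch.nn as nn


class MeshedCrossAttention(nn.Module):
    def __init__(self, dim=512, n_enc_layers=3, n_heads=8):
        super().__init__()
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_enc_layers))
        self.gates = nn.ModuleList(
            nn.Linear(2 * dim, dim) for _ in range(n_enc_layers))

    def forward(self, query, enc_outputs):          # enc_outputs: list of (B, T, dim)
        out = 0
        for attn, gate, enc in zip(self.cross, self.gates, enc_outputs):
            c, _ = attn(query, enc, enc)            # cross-attention to one encoder layer
            g = torch.sigmoid(gate(torch.cat([query, c], dim=-1)))
            out = out + g * c                       # gated multi-level contribution
        return out
```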
III-C Training and inferring methods
All the models originally designed for image captioning that are adopted in this paper were trained with a phoneme-level cross-entropy loss. To investigate whether reinforcement learning and beam search, both of which perform well in the image captioning task, also work for the image-to-phoneme task, these strategies were implemented in the image-to-phoneme experiments as well. Specifically, reinforcement learning was adopted to further fine-tune the trained models, and during inference the effect of beam search was analyzed. Details of the reinforcement learning and beam search methods are described in this section.
III-C1 Reinforcement learning
In the image captioning task, models are usually trained using the cross-entropy loss, while they are evaluated using discrete, non-differentiable NLP metrics such as BLEU, ROUGE, METEOR, or CIDEr. Therefore, a discrepancy can exist between the training objective and the evaluation metrics. Reinforcement learning, which directly optimizes these metrics, has proven effective in mitigating this discrepancy in image captioning. Here, reinforcement learning is also investigated for the image-to-phoneme task. To do: method.
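For illustration, the sketch below shows a self-critical (SCST-style) policy-gradient step in the spirit of [rennie2017self], in which the reward of a sampled phoneme sequence is baselined by the reward of the greedy decode. The `model.sample`, `model.greedy_decode`, and `reward_fn` helpers are assumptions and not part of this paper's pipeline.

```python
# Illustrative SCST-style policy-gradient step; sampling, greedy decoding, and
# the reward function (e.g., CIDEr or BLEU) are assumed helpers.
import torch


def self_critical_loss(model, image_feats, references, reward_fn):
    sampled, log_probs = model.sample(image_feats)       # stochastic decoding, (B, T) log-probs
    with torch.no_grad():
        greedy = model.greedy_decode(image_feats)        # baseline sequences
    advantage = reward_fn(sampled, references) - reward_fn(greedy, references)   # (B,)
    # REINFORCE with baseline: reward sequences that beat the greedy decode
    return -(advantage.unsqueeze(-1) * log_probs).mean()
```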
III-C2 Beam search
During inference, an auto-regressive model normally selects the most probable output at each step greedily. Rather than keeping only a single hypothesis, beam search maintains a list of the N most probable sub-sequences generated so far, computes posterior probabilities for the next token of each of these sub-sequences, and then again prunes down to the N best sub-sequences. Beam search provides a boost in image captioning performance; here, we test whether it also works for the image-to-phoneme task.
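A minimal sketch of this procedure is given below; the `step(prefix)` function, which returns (token, log-probability) pairs for every possible next phoneme given a prefix, is an assumed helper.

```python
# Minimal beam-search sketch with beam width N; `step` is an assumed helper.
import heapq


def beam_search(step, bos, eos, beam_width=5, max_len=50):
    beams = [(0.0, [bos])]                           # (cumulative log-prob, prefix)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:                       # keep finished hypotheses as they are
                candidates.append((score, seq))
                continue
            for token, logp in step(seq):            # expand each live prefix
                candidates.append((score + logp, seq + [token]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]         # best-scoring phoneme sequence
```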
III-D Evaluation metrics
To do
Model | Extra | BLEU1 | BLEU2 | BLEU3 | BLEU4 | BLEU5 | BLEU6 | BLEU7 | BLEU8 | METEOR | ROUGE-L | CIDEr | PER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I2P (Justin) | — | 82.6 | 62.3 | 46.4 | 36.1 | 24.6 | 18.2 | 13.7 | 9.3 | 29.4 | 49.3 | 42.4 | 71.4 |
SAT [xu2015show] | — | 83.2 | 60.9 | 45.4 | 35.5 | 28.4 | 23.3 | 19.3 | 16.1 | 28.2 | 47.2 | 51.6 | 74.2 |
bs | 83.8 | 64.2 | 49.9 | 40.7 | 34.0 | 28.9 | 24.7 | 21.1 | 27.1 | 48.1 | 50.1 | 70.5 | |
rf_c | 85.7 | 64.8 | 49.2 | 39.1 | 31.8 | 26.3 | 21.9 | 18.4 | 27.2 | 47.9 | 57.8 | 70.1 | |
rf_b | 76.1 | 62.4 | 51.4 | 43.7 | 37.5 | 32.4 | 28.1 | 24.4 | 24.2 | 47.7 | 39.0 | 68.1 | |
rf_c + bs | 85.5 | 64.6 | 49.0 | 39.0 | 31.7 | 26.1 | 21.7 | 18.2 | 27.2 | 47.9 | 57.9 | 70.2 | |
rf_b + bs | 75.8 | 62.1 | 51.1 | 43.4 | 37.2 | 32.1 | 27.8 | 24.1 | 24.2 | 47.5 | 38.2 | 68.3 | |
Att2in [rennie2017self] | — | 83.4 | 64.5 | 51.3 | 42.2 | 35.3 | 29.8 | 25.3 | 21.6 | 30.8 | 51.5 | 64.1 | 72.7 |
bs | 85.1 | 68.0 | 55.4 | 46.6 | 39.9 | 34.5 | 29.9 | 25.9 | 29.3 | 51.7 | 64.2 | 68.2 | |
rf_c | 85.8 | 68.1 | 55.0 | 46.0 | 39.1 | 33.6 | 29.0 | 25.1 | 28.6 | 51.7 | 65.2 | 67.5 | |
rf_b | 81.2 | 66.6 | 54.9 | 46.6 | 40.0 | 34.7 | 30.3 | 26.4 | 25.8 | 50.0 | 46.5 | 66.6 | |
rf_c + bs | 84.6 | 68.4 | 56.3 | 47.9 | 41.5 | 36.2 | 31.7 | 27.7 | 27.8 | 51.5 | 60.9 | 66.2 | |
rf_b + bs | 77.3 | 64.2 | 53.6 | 46.2 | 40.2 | 35.3 | 31.2 | 27.5 | 24.8 | 49.2 | 42.9 | 66.7 | |
Updown [anderson2018bottom] | — | 85.6 | 66.4 | 53.1 | 44.0 | 37.0 | 31.5 | 27.0 | 23.2 | 30.0 | 51.0 | 64.4 | 71.3 |
bs | 83.6 | 66.9 | 55.1 | 46.9 | 40.4 | 35.0 | 30.4 | 26.5 | 27.7 | 50.3 | 59.5 | 67.6 | |
rf_c | 86.3 | 63.0 | 46.4 | 35.8 | 28.0 | 22.2 | 17.9 | 14.6 | 27.0 | 46.3 | 57.8 | 73.4 | |
rf_b | 77.3 | 64.7 | 54.0 | 46.2 | 40.0 | 34.8 | 30.4 | 26.5 | 24.7 | 49.0 | 41.7 | 67.1 | |
rf_c + bs | 86.4 | 63.2 | 46.6 | 36.0 | 28.2 | 22.3 | 18.0 | 14.7 | 27.0 | 46.5 | 57.7 | 73.2 | |
rf_b + bs | 77.0 | 64.5 | 54.0 | 46.3 | 40.0 | 34.9 | 30.5 | 26.7 | 24.6 | 49.1 | 41.7 | 67.1 | |
AoANet [huang2019attention] | — | 84.8 | 66.5 | 53.7 | 44.8 | 37.9 | 32.4 | 27.9 | 24.1 | 30.4 | 51.6 | 67.3 | 70.7 |
bs | 80.9 | 65.8 | 55.0 | 47.3 | 41.2 | 36.1 | 31.6 | 27.8 | 27.2 | 50.2 | 57.5 | 67.2 | |
rf_c | 86.1 | 70.4 | 58.7 | 50.3 | 43.7 | 38.1 | 33.4 | 29.3 | 29.7 | 53.5 | 72.9 | 65.3 | |
rf_b | 81.0 | 68.5 | 59.0 | 51.9 | 46.0 | 40.9 | 36.4 | 32.4 | 26.7 | 52.0 | 55.1 | 63.9 | |
rf_c + bs | 86.1 | 71.0 | 59.8 | 51.7 | 45.2 | 39.8 | 35.1 | 31.0 | 29.0 | 53.4 | 70.0 | 64.7 | |
rf_b + bs | 79.8 | 68.0 | 58.8 | 52.0 | 46.2 | 41.2 | 36.8 | 32.8 | 26.3 | 51.5 | 52.7 | 64.1 | |
xlan [pan2020x] | — | 85.5 | 67.5 | 55.0 | 46.1 | 39.0 | 33.3 | 28.5 | 24.5 | 30.2 | 51.6 | 67.6 | 70.0 |
bs | 84.0 | 67.5 | 56.0 | 47.8 | 41.2 | 35.8 | 31.1 | 27.1 | 28.3 | 50.7 | 64.3 | 67.6 | |
rf_c | 86.6 | 70.2 | 58.4 | 49.9 | 43.1 | 37.5 | 32.7 | 28.6 | 29.3 | 53.1 | 71.5 | 65.6 | |
rf_b | 82.2 | 68.3 | 58.1 | 50.6 | 44.4 | 39.2 | 34.6 | 30.6 | 27.7 | 52.2 | 62.2 | 64.6 | |
rf_c + bs | 85.1 | 69.8 | 58.8 | 50.8 | 44.4 | 38.9 | 34.2 | 30.1 | 28.5 | 52.9 | 67.4 | 64.5 | |
rf_b + bs | 80.5 | 67.4 | 57.6 | 50.5 | 44.6 | 39.6 | 35.2 | 31.3 | 26.9 | 51.6 | 57.5 | 64.6 | |
xtransformer [pan2020x] | — | 83.6 | 65.6 | 52.9 | 44.0 | 37.0 | 31.3 | 26.7 | 22.7 | 31.2 | 51.7 | 66.3 | 73.0 |
bs | 85.9 | 68.8 | 56.5 | 47.9 | 41.2 | 35.6 | 30.9 | 26.7 | 29.7 | 52.2 | 67.9 | 67.7 | |
rf_c | 86.8 | 70.6 | 58.7 | 50.2 | 43.3 | 37.6 | 32.8 | 28.6 | 29.0 | 53.3 | 69.5 | 65.5 | |
rf_b | 79.0 | 66.2 | 56.3 | 49.2 | 43.4 | 38.4 | 34.1 | 30.2 | 26.4 | 51.6 | 53.6 | 64.7 | |
rf_c + bs | 82.8 | 68.6 | 57.7 | 50.0 | 43.7 | 38.4 | 33.8 | 29.7 | 27.3 | 52.5 | 58.5 | 64.3 | |
rf_b + bs | 75.3 | 63.5 | 54.3 | 47.7 | 42.2 | 37.5 | 33.4 | 29.8 | 25.3 | 50.4 | 47.5 | 65.1 |
Model | Extra | BLEU1 | BLEU2 | BLEU3 | BLEU4 | BLEU5 | BLEU6 | BLEU7 | BLEU8 | METEOR | ROUGE-L | CIDEr | PER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I2P (Justin) | — | 82.6 | 62.3 | 46.4 | 36.1 | 24.6 | 18.2 | 13.7 | 9.3 | 29.4 | 49.3 | 42.4 | 71.4 |
01_SAT | _ | 78.6 | 57.0 | 42.5 | 33.2 | 26.7 | 21.8 | 18.1 | 15.1 | 29.6 | 49.8 | 53.5 | |
bs | 84.1 | 63.9 | 49.7 | 40.6 | 34.0 | 28.9 | 24.7 | 21.1 | 29.0 | 52.2 | 64.4 | ||
rf_c | 86.2 | 64.4 | 48.9 | 39.1 | 31.9 | 26.4 | 22.0 | 18.5 | 29.3 | 52.0 | 75.7 | ||
rf_b | 85.9 | 69.8 | 57.5 | 48.9 | 42.0 | 36.2 | 31.4 | 27.2 | 27.1 | 54.0 | 66.8 | ||
rf_c + bs | 86.0 | 64.2 | 48.8 | 38.9 | 31.7 | 26.3 | 21.9 | 18.4 | 29.3 | 52.0 | 76.1 | ||
rf_b + bs | 85.6 | 69.5 | 57.1 | 48.5 | 41.7 | 35.9 | 31.1 | 26.9 | 27.0 | 53.8 | 65.5 | ||
07_att2in | _ | 78.3 | 59.9 | 47.4 | 38.8 | 32.3 | 27.2 | 23.0 | 19.5 | 31.5 | 52.9 | 57.5 | |
bs | 84.4 | 66.9 | 54.4 | 45.8 | 39.1 | 33.7 | 29.2 | 25.2 | 31.1 | 55.3 | 74.4 | ||
rf_c | 85.5 | 67.1 | 54.2 | 45.2 | 38.4 | 33.0 | 28.4 | 24.5 | 30.4 | 55.2 | 76.9 | ||
rf_b | 87.6 | 71.1 | 58.5 | 49.7 | 42.7 | 37.0 | 32.3 | 28.1 | 28.4 | 55.3 | 72.5 | ||
rf_c + bs | 87.0 | 69.6 | 57.2 | 48.7 | 42.0 | 36.6 | 31.9 | 27.8 | 30.1 | 56.0 | 81.1 | ||
rf_b + bs | 86.7 | 71.3 | 59.5 | 51.2 | 44.6 | 39.2 | 34.5 | 30.4 | 27.6 | 55.3 | 71.0 | ||
13_updown | _ | 81.2 | 62.2 | 49.5 | 40.9 | 34.3 | 29.1 | 24.9 | 21.3 | 31.3 | 53.1 | 61.9 | |
bs | 85.9 | 68.0 | 55.9 | 47.5 | 40.8 | 35.3 | 30.6 | 26.6 | 29.9 | 54.6 | 76.3 | ||
rf_c | 85.8 | 62.2 | 46.0 | 35.5 | 27.9 | 22.1 | 17.9 | 14.6 | 29.0 | 50.4 | 71.2 | ||
rf_b | 87.7 | 72.7 | 60.5 | 51.9 | 44.9 | 39.1 | 34.1 | 29.8 | 27.4 | 55.1 | 68.8 | ||
rf_c + bs | 86.2 | 62.6 | 46.3 | 35.8 | 28.1 | 22.3 | 18.0 | 14.7 | 29.0 | 50.6 | 72.0 | ||
rf_b + bs | 87.5 | 72.6 | 60.6 | 52.0 | 45.0 | 39.2 | 34.3 | 29.9 | 27.4 | 55.1 | 68.7 | ||
19_aoa | _ | 80.8 | 62.7 | 50.3 | 41.8 | 35.3 | 30.1 | 25.8 | 22.2 | 31.6 | 53.7 | 67.3 | |
bs | 84.6 | 68.1 | 56.8 | 48.7 | 42.3 | 37.0 | 32.3 | 28.3 | 29.4 | 54.7 | 76.0 | ||
rf_c | 86.6 | 70.0 | 58.2 | 49.7 | 43.0 | 37.5 | 32.7 | 28.6 | 31.6 | 56.9 | 86.5 | ||
rf_b | 87.4 | 73.2 | 62.8 | 55.1 | 48.7 | 43.2 | 38.4 | 34.0 | 29.5 | 57.7 | 84.2 | ||
rf_c + bs | 87.8 | 71.8 | 60.2 | 51.9 | 45.3 | 39.7 | 34.9 | 30.7 | 31.2 | 57.5 | 88.9 | ||
rf_b + bs | 87.4 | 73.7 | 63.5 | 55.9 | 49.6 | 44.1 | 39.3 | 35.0 | 29.2 | 57.4 | 82.7 | ||
25_xlan | _ | 81.8 | 63.8 | 51.7 | 43.2 | 36.5 | 31.1 | 26.6 | 22.8 | 31.7 | 54.4 | 70.1 | |
bs | 85.6 | 68.1 | 56.3 | 48.0 | 41.5 | 36.0 | 31.3 | 27.3 | 30.4 | 54.9 | 77.5 | ||
rf_c | 87.0 | 69.7 | 57.7 | 49.2 | 42.5 | 36.9 | 32.2 | 28.1 | 31.3 | 56.9 | 85.1 | ||
rf_b | 87.6 | 72.1 | 61.1 | 53.1 | 46.6 | 41.0 | 36.2 | 31.9 | 30.4 | 57.2 | 86.6 | ||
rf_c + bs | 87.6 | 71.1 | 59.6 | 51.4 | 44.8 | 39.3 | 34.5 | 30.3 | 30.9 | 57.2 | 88.1 | ||
rf_b + bs | 87.3 | 72.4 | 61.7 | 54.0 | 47.7 | 42.3 | 37.6 | 33.4 | 29.7 | 57.1 | 85.0 | ||
31_xtransformer | _ | 78.7 | 60.9 | 48.8 | 40.2 | 33.6 | 28.3 | 24.0 | 20.3 | 32.1 | 53.3 | 60.2 | |
bs | 84.0 | 66.3 | 54.2 | 45.8 | 39.2 | 33.8 | 29.2 | 25.2 | 31.5 | 55.6 | 75.2 | ||
rf_c | 87.1 | 70.0 | 57.8 | 49.1 | 42.3 | 36.6 | 31.8 | 27.5 | 31.0 | 56.8 | 81.3 | ||
rf_b | 87.3 | 72.4 | 61.3 | 53.4 | 47.0 | 41.5 | 36.7 | 32.4 | 29.2 | 57.0 | 79.6 | ||
rf_c + bs | 88.3 | 72.2 | 60.5 | 52.2 | 45.6 | 39.9 | 35.1 | 30.8 | 29.9 | 57.3 | 82.9 | ||
rf_b + bs | 85.8 | 71.7 | 61.1 | 53.6 | 47.3 | 42.0 | 37.3 | 33.1 | 28.3 | 56.5 | 76.7 |
Model | Extra | BLEU1 by Justin | BLEU1 by Xinsheng | PER by Justin |
---|---|---|---|---|
I2P (Justin) | — | 82.6 | 62.3 | 71.4 |
SAT [xu2015show] | — | 83.2 | 78.6 | 74.2 |
bs | 83.8 | 84.1 | 70.5 | |
rf_c | 85.7 | 86.2 | 70.1 | |
rf_b | 76.1 | 85.9 | 68.1 | |
rf_c + bs | 85.5 | 86.0 | 70.2 | |
rf_b + bs | 75.8 | 85.6 | 68.3 | |
Att2in [rennie2017self] | — | 83.4 | 78.3 | 72.7 |
bs | 85.1 | 84.4 | 68.2 | |
rf_c | 85.8 | 85.5 | 67.5 | |
rf_b | 81.2 | 87.6 | 66.6 | |
rf_c + bs | 84.6 | 87.0 | 66.2 | |
rf_b + bs | 77.3 | 86.7 | 66.7 | |
Updown [anderson2018bottom] | — | 85.6 | 81.2 | 71.3 |
bs | 83.6 | 85.9 | 67.6 | |
rf_c | 86.3 | 85.8 | 73.4 | |
rf_b | 77.3 | 87.7 | 67.1 | |
rf_c + bs | 86.4 | 86.2 | 73.2 | |
rf_b + bs | 77.0 | 87.5 | 67.1 | |
AoANet [huang2019attention] | — | 84.8 | 80.8 | 70.7 |
bs | 80.9 | 84.6 | 67.2 | |
rf_c | 86.1 | 86.6 | 65.3 | |
rf_b | 81.0 | 87.4 | 63.9 | |
rf_c + bs | 86.1 | 87.8 | 64.7 | |
rf_b + bs | 79.8 | 87.4 | 64.1 | |
xlan [pan2020x] | — | 85.5 | 81.8 | 70.0 |
bs | 84.0 | 85.6 | 67.6 | |
rf_c | 86.6 | 87.0 | 65.6 | |
rf_b | 82.2 | 87.6 | 64.6 | |
rf_c + bs | 85.1 | 87.6 | 64.5 | |
rf_b + bs | 80.5 | 87.3 | 64.6 | |
xtransformer [pan2020x] | — | 83.6 | 78.7 | 73.0 |
bs | 85.9 | 84.0 | 67.7 | |
rf_c | 86.8 | 87.1 | 65.5 | |
rf_b | 79.0 | 87.2 | 64.7 | |
rf_c + bs | 82.8 | 88.3 | 64.3 | |
rf_b + bs | 75.3 | 85.7 | 65.1 |
Justin | SOTA1 | SOTA2 | SOTA3 | SOTA4 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MTurk | —- | 0.569 | 0.627 | ||||||||||||
BLEU1 | 0.155 | 0.214 | 0.195 | ||||||||||||
BLEU2 | 0.355 | 0.388 | 0.411 | ||||||||||||
BLEU3 | 0.425 | 0.446 | 0.486 | ||||||||||||
BLEU4 | 0.435 | 0.449 | 0.494 | ||||||||||||
BLEU5 | 0.429 | 0.435 | 0.484 | ||||||||||||
BLEU6 | 0.410 | 0.406 | 0.451 | ||||||||||||
BLEU7 | 0.378 | 0.373 | 0.423 | ||||||||||||
BLEU8 | 0.340 | 0.319 | 0.376 | ||||||||||||
METEOR | 0.258 | 0.265 | 0.322 |
ROUGE-L | 0.425 | 0.416 | 0.485 | ||||||||||||
CIDEr | 0.272 | 0.305 | 0.315 | ||||||||||||
PER | -0.361 | -0.363 | -0.381 |
IV End-to-end image-to-speech
To do.
V Results
V-A Dataset
To do
V-B Image-to-phoneme
To do
V-C End-to-end image-to-speech
To do
VI Discussion
To do
VII Conclusion
To do