
Lifter training and sub-band modeling for computationally efficient and high-quality voice conversion using spectral differentials

Abstract

In this paper, we propose computationally efficient and high-quality methods for statistical voice conversion (VC) with direct waveform modification based on spectral differentials. The conventional method with a minimum-phase filter achieves high-quality conversion but requires heavy computation in filtering. This is because the minimum-phase filter, obtained by the Hilbert transform with a fixed lifter, often has a long tap length. One of our methods is a data-driven method for lifter training. Since this method takes filter truncation into account during training, it can shorten the tap length of the filter while preserving conversion accuracy. Our other method is sub-band processing for extending the conventional method from narrow-band (16 kHz) to full-band (48 kHz) VC, which can convert a full-band waveform with higher converted-speech quality. Experimental results indicate that 1) the proposed lifter-training method for narrow-band VC can shorten the tap length to 1/16 without degrading converted-speech quality and 2) the proposed sub-band-processing method for full-band VC achieves higher converted-speech quality than the conventional method.

Index Terms—  Voice conversion, spectral differentials, deep neural network, minimum-phase filter, sub-band processing

1 Introduction

Voice conversion (VC) is a technique for converting the characteristics of a source speech into those of a target speech while keeping the linguistic information unchanged [1]. VC has the potential to achieve speech communication beyond the physical constraints of the human vocal organs [2]. The most common VC approach is statistical VC, which constructs an acoustic model that converts speech features of a source speaker into those of a target speaker. Deep neural network (DNN)-based VC [3, 4] has been widely studied, and many models for achieving higher converted-speech quality have been proposed. From a practical point of view, real-time VC methods based on a Gaussian mixture model [5] and a DNN [6] have also been studied. They achieve online high-quality conversion of narrow-band (16 kHz) speech using a single CPU on a laptop PC. However, their computational cost is still high, and it must be reduced for portable VC (e.g., VC using a low-power CPU on a smartphone) or full-band (48 kHz) VC.

VC consists of three steps: feature analysis, feature conversion, and waveform synthesis. We particularly focus on the last step and use spectral-differential VC [7], which performs VC in the waveform domain by applying a spectral-differential filter to the source speech waveform. This approach 1) achieves high-quality conversion by avoiding vocoder errors and 2) incurs less computational cost than neural vocoders [8, 9, 10], which use large DNNs and require heavy sample-by-sample computation. Spectral-differential VC originally used a mel-log spectrum approximation (MLSA) filter [11] to filter the source speech, but Suda et al. found that a minimum-phase filter achieved higher converted-speech quality than the MLSA filter [12]. In the case of the minimum-phase filter, an acoustic model (e.g., DNN) outputs a real cepstrum of the converted speech, and the Hilbert transform, implemented as liftering with fixed parameters, determines the phases of the filter from the real cepstrum. These processes suit our aim because their computational costs (i.e., for filter design) are very small. However, since the minimum-phase filter is not guaranteed to have a short tap length (i.e., a small number of filter samples), it increases the computational cost of filtering. A practical way to reduce this cost is to truncate the filter in the time domain [13], e.g., using the first half of the taps instead of all of them. However, such filter truncation degrades converted-speech quality.

Fig. 1: Comparison of conventional and proposed lifter-training methods

Therefore, we propose a lifter-training method for reducing computational cost without degrading converted-speech quality. Our method jointly trains not only a DNN-based acoustic model but also a lifter with trainable parameters. Since the parameters of the DNNs and the lifter are optimized to maximize conversion accuracy while taking a truncated (i.e., short-tap) filter into account, our method can reduce the computational cost while preserving conversion accuracy. The main difference between our method and the conventional one using a minimum-phase filter lies in the lifter used to determine the phases of the filter, as shown in Fig. 1. Whereas the lifter of the minimum-phase filter is fixed, that of our method is trained from speech data to determine the phases of a truncated filter. Furthermore, in this paper we extend the conventional method from narrow-band (16 kHz) to full-band (48 kHz) VC. Since fluctuations in wider-band voices are difficult to model with statistical models, the quality of the converted speech is relatively low. We thus also propose a sub-band-processing method for improving converted-speech quality. This method statistically converts the lower frequency band and preserves the higher frequency band. We conducted objective and subjective evaluations to investigate the effectiveness of the two proposed methods. Experimental results indicate that 1) the proposed lifter-training method for narrow-band VC can shorten the tap length to 1/16 without degrading converted-speech quality and 2) the proposed sub-band-processing method for full-band VC can improve converted-speech quality.

2 Conventional spectral-differential VC with minimum-phase filter

This section describes the training and conversion processes of the conventional spectral-differential VC with a minimum-phase filter.

2.1 Training process

Let $\bm{F}^{(\mathrm{X})} = [{\bm{F}^{(\mathrm{X})}_{1}}^{\top}, \ldots, {\bm{F}^{(\mathrm{X})}_{t}}^{\top}, \ldots, {\bm{F}^{(\mathrm{X})}_{T}}^{\top}]^{\top}$ be a complex frequency spectrum sequence obtained by applying the short-time Fourier transform (STFT) to an input speech waveform, where $t$ is the frame index and $T$ is the total number of frames. For simplicity, we now focus on frame $t$. A low-order real cepstrum $\bm{C}_{t}^{(\mathrm{X})}$ can be extracted from $\bm{F}_{t}^{(\mathrm{X})}$ [14]. The DNNs then estimate the real cepstrum of the differential filter $\bm{C}_{t}^{(\mathrm{D})}$ from $\bm{C}_{t}^{(\mathrm{X})}$. The loss function for frame $t$ is calculated as $L_{t} = (\bm{C}_{t}^{(\mathrm{Y})} - \hat{\bm{C}}_{t}^{(\mathrm{Y})})^{\top} (\bm{C}_{t}^{(\mathrm{Y})} - \hat{\bm{C}}_{t}^{(\mathrm{Y})})$, where $\hat{\bm{C}}_{t}^{(\mathrm{Y})} = \bm{C}_{t}^{(\mathrm{X})} + \bm{C}_{t}^{(\mathrm{D})}$ is the real cepstrum of the converted speech and $\bm{C}_{t}^{(\mathrm{Y})}$ is the real cepstrum of the target speech. The DNNs are trained to minimize the loss function over all time frames:

$L = \frac{1}{T} \sum_{t=1}^{T} L_{t}.$ (1)
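To make these steps concrete, the following is a minimal NumPy sketch of low-order real-cepstrum extraction and the per-frame loss $L_t$. The helper names are ours, and the cepstrum order of 40 follows the narrow-band setting in Section 4.1; the paper's exact analysis pipeline may differ.

```python
import numpy as np

def low_order_real_cepstrum(F_t, order=40):
    """Low-order real cepstrum of one complex STFT frame (C_t in the paper).

    F_t: one-sided complex spectrum from an N-point FFT (length N//2 + 1).
    Returns the 0th..(order-1)th cepstral coefficients.
    """
    log_mag = np.log(np.abs(F_t) + 1e-10)  # log-magnitude spectrum
    cep = np.fft.irfft(log_mag)            # real cepstrum (length N)
    return cep[:order]

def frame_loss(C_y, C_x, C_d):
    """Per-frame loss L_t: squared error between target and converted cepstra."""
    C_y_hat = C_x + C_d                    # converted cepstrum = source + differential
    e = C_y - C_y_hat
    return float(e @ e)
```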

2.2 Conversion process

$\bm{C}_{t}^{(\mathrm{D})}$ is estimated with the DNNs. After the high-order components of the cepstrum are padded with zeros, $\bm{C}_{t}^{(\mathrm{D})}$ is multiplied by a time-independent lifter $\bm{u}_{\mathrm{min}}$ for a minimum-phase filter. The complex frequency spectrum of the differential filter $\bm{F}_{t}^{(\mathrm{D})}$ is obtained by taking the inverse discrete Fourier transform (IDFT) of the liftered cepstrum, followed by exponential calculation. The lifter $\bm{u}_{\mathrm{min}}$ is represented as follows [15]:

$\bm{u}_{\mathrm{min}}(n) = \begin{cases} 1 & (n = 0,\; n = N/2) \\ 2 & (0 < n < N/2) \\ 0 & (n > N/2) \end{cases}$ (2)

where $N$ is the number of frequency bins of the DFT. The differential filter in the time domain $\bm{f}_{t}^{(\mathrm{D})}$ is obtained by applying the IDFT to $\bm{F}_{t}^{(\mathrm{D})}$; its tap length is equal to $N$. To reduce the computational cost of the convolution operation, we can truncate $\bm{f}_{t}^{(\mathrm{D})}$ to a fixed tap length $l$ ($l < N$). We define the $l$-tap truncated filter as $\bm{f}_{t}^{(l)}$. Although filter truncation efficiently reduces the computational cost, using $\bm{f}_{t}^{(l)}$ degrades converted-speech quality.
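For illustration, the NumPy sketch below designs the minimum-phase differential filter from a zero-padded cepstrum using the lifter of Eq. (2) and truncates it to $l$ taps. The function names are ours, and the DFT/IDFT and scaling conventions may differ from the paper's implementation.

```python
import numpy as np

def minimum_phase_lifter(N):
    """Fixed lifter u_min of Eq. (2) for an N-point DFT."""
    u = np.zeros(N)
    u[0] = u[N // 2] = 1.0
    u[1:N // 2] = 2.0          # entries for n > N/2 stay zero
    return u

def truncated_min_phase_filter(C_d, N=512, l=64):
    """Design the minimum-phase differential filter and keep its first l taps.

    C_d: low-order differential cepstrum (zero-padded to N inside).
    """
    cep = np.zeros(N)
    cep[:len(C_d)] = C_d
    F_d = np.exp(np.fft.fft(cep * minimum_phase_lifter(N)))  # filter spectrum
    f_d = np.fft.ifft(F_d).real                              # N-tap impulse response
    return f_d[:l]                                           # l-tap truncated filter f_t^(l)
```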

3 Proposed methods

We present two proposed methods: lifter training with filter truncation, which reduces computational cost, and sub-band processing, which improves full-band converted-speech quality.

3.1 Lifter training with filter truncation

Our lifter-training method trains not only the DNNs but also a lifter to avoid the converted-speech quality degradation caused by filter truncation. Let $\bm{u} = [u_{1}, \ldots, u_{c}]^{\top}$ be a time-independent trainable lifter, where $c$ is the dimension of the real cepstrum. The filter-truncation process with tap length $l$ is integrated into training, as shown in Fig. 2.

As described in Section 2.1, the DNNs estimate $\bm{C}_{t}^{(\mathrm{D})}$ from $\bm{C}_{t}^{(\mathrm{X})}$. Then $\bm{C}_{t}^{(\mathrm{D})}$ is multiplied by the trainable lifter $\bm{u}$, and the complex frequency spectrum of the differential filter $\bm{F}_{t}^{(\mathrm{D})}$ is obtained from the IDFT of the liftered cepstrum followed by exponential calculation. The differential filter in the time domain $\bm{f}_{t}^{(\mathrm{D})}$ is obtained by applying the IDFT to $\bm{F}_{t}^{(\mathrm{D})}$ and is truncated to $\bm{f}_{t}^{(l)}$ by applying the window function $\bm{w}$ given in Eqs. (3) and (4):

$\bm{f}_{t}^{(l)} = \bm{f}_{t}^{(\mathrm{D})} \cdot \bm{w},$ (3)

$\bm{w} = \left[\overset{0\mathrm{th}}{1}, \cdots, \overset{(l-1)\mathrm{th}}{1}, \overset{l\mathrm{th}}{0}, \cdots, \overset{(N-1)\mathrm{th}}{0}\right]^{\top}.$ (4)

By applying the DFT again, the complex spectrum of the $l$-tap truncated differential filter $\bm{F}_{t}^{(l)}$ can be obtained. The complex spectrum of the converted speech $\hat{\bm{F}}_{t}^{(\mathrm{Y})}$ is obtained by multiplying $\bm{F}_{t}^{(\mathrm{X})}$ by $\bm{F}_{t}^{(l)}$, and the real cepstrum of the converted speech $\hat{\bm{C}}_{t}^{(\mathrm{Y})}$ is extracted from $\hat{\bm{F}}_{t}^{(\mathrm{Y})}$. The parameters of the DNNs and the lifter are jointly trained to minimize the same loss function as Eq. (1). Since all processes of this method are differentiable, training can be done by back-propagation [16].
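Below is a minimal differentiable sketch (PyTorch, with our function names) of the pipeline in Fig. 2, from the estimated differential cepstrum to the loss of Eq. (1). The DFT/IDFT conventions and the `dnn` and `lifter` objects are assumptions; this is a sketch of our reading of Section 3.1, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def lifter_training_loss(F_x, C_x, C_y, dnn, lifter, N=512, l=32, order=40):
    """One batch's loss for the proposed lifter-training method (a sketch).

    F_x: (B, N) complex source spectra; C_x, C_y: (B, order) real cepstra.
    lifter: trainable (N,) tensor initialized with the minimum-phase lifter.
    Every step is differentiable, so gradients reach both the DNN and the lifter.
    """
    C_d = dnn(C_x)                                   # differential cepstrum (B, order)
    cep = F.pad(C_d, (0, N - order))                 # zero-pad high quefrencies
    F_d = torch.exp(torch.fft.fft(cep * lifter))     # liftered cepstrum -> filter spectrum
    f_d = torch.fft.ifft(F_d).real                   # time-domain differential filter
    f_l = F.pad(f_d[..., :l], (0, N - l))            # truncation window of Eqs. (3)-(4)
    F_l = torch.fft.fft(f_l)                         # spectrum of truncated filter
    F_y_hat = F_x * F_l                              # converted complex spectrum
    log_mag = torch.log(torch.abs(F_y_hat) + 1e-8)
    C_y_hat = torch.fft.ifft(log_mag).real[..., :order]  # converted real cepstrum
    return torch.mean(torch.sum((C_y - C_y_hat) ** 2, dim=-1))  # Eq. (1)
```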

Fig. 2: Procedure of proposed lifter-training method

3.2 Conversion process

In the conversion process, the trained DNNs and lifter estimate $\bm{F}_{t}^{(\mathrm{D})}$. The time-domain filter $\bm{f}_{t}^{(\mathrm{D})}$ is obtained by applying the IDFT to $\bm{F}_{t}^{(\mathrm{D})}$, and $\bm{f}_{t}^{(l)}$ is obtained by truncating it to $l$ taps. The converted speech waveform is obtained by applying $\bm{f}_{t}^{(l)}$ to the source speech waveform.
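As an illustration of this filtering step, the sketch below applies the per-frame truncated filters to the source waveform by frame-wise convolution with overlap-add. This construction and the windowing details are our assumptions, since the paper does not specify the filtering implementation.

```python
import numpy as np

def apply_time_varying_filter(x, filters, frame_shift=80):
    """Convolve each frame-shift segment of x with that frame's truncated
    filter f_t^(l) and overlap-add the results (a hedged sketch).

    x: source waveform; filters: (T, l) array of per-frame filters.
    """
    T, l = filters.shape
    y = np.zeros(len(x) + l - 1)
    for t in range(T):
        seg = x[t * frame_shift:(t + 1) * frame_shift]
        if len(seg) == 0:
            break
        start = t * frame_shift
        y[start:start + len(seg) + l - 1] += np.convolve(seg, filters[t])
    return y[:len(x)]
```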

3.3 Sub-band processing for full-band VC

We now describe our sub-band-processing method for the full-band extension of the conventional method. When the bandwidth of spectral-differential VC is extended to 48 kHz, conversion cannot be performed well because of the large fluctuations in the wider-band components. To avoid errors in the high-frequency components, our method uses sub-band processing, i.e., filtering each frequency band separately. We apply the differential filter to the frequency region below 8 kHz and do not apply it to the region above 8 kHz. In practice, converting only the low-frequency band is achieved by 1) subtracting 1.0 from the absolute value of $\bm{F}_{t}^{(\mathrm{D})}$, 2) multiplying by a sigmoid function that transitions from 1 to 0 at around 8 kHz, and 3) adding 1.0 again.
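The following NumPy sketch implements our reading of these three steps as a frequency-dependent mask on the filter spectrum. The sigmoid slope and the phase handling are assumptions; the paper only specifies the transition at around 8 kHz.

```python
import numpy as np

def subband_filter_spectrum(F_d, fs=48000, fc=8000, slope=0.01):
    """Restrict the differential filter to the band below fc (Sec. 3.3).

    F_d: (N,) complex filter spectrum. Steps: subtract 1.0 from |F_d|,
    multiply by a sigmoid mask going from 1 to 0 around fc, add 1.0.
    """
    N = len(F_d)
    freq = np.abs(np.fft.fftfreq(N, d=1.0 / fs))       # bin frequencies in Hz
    mask = 1.0 / (1.0 + np.exp(slope * (freq - fc)))   # ~1 below fc, ~0 above
    mag = 1.0 + (np.abs(F_d) - 1.0) * mask             # masked magnitude
    return mag * np.exp(1j * np.angle(F_d))            # keep the original phase
```

Because the masked magnitude approaches 1.0 above 8 kHz, the filter passes the high-frequency band of the source speech through unchanged.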

3.4 Discussion

With the conventional method, the cepstrum is multiplied by the lifter coefficients to determine the shape of the filter so that the phase is minimized. With the proposed lifter-training method, although the shape of the differential filter changes due to truncation, the Hilbert transform using the trained lifter transforms the filter so as to compensate for the effect of the truncation. As a result, our lifter-training method can reduce the amount of computation while suppressing the converted-speech quality degradation caused by filter truncation. Figure 3 shows the cumulative power distribution of the differential filter with the conventional method ($l=512$) and the proposed lifter-training method ($l=32$). The values on the vertical axis are normalized by the cumulative total. The power is concentrated in the region of tap lengths 0 to 100. Figure 4 shows the difference between the lifter trained with the proposed method ($l=64$) and the lifter for minimum phasing.

As explained in Section 1, liftering-based phase estimation requires little computation. Since our lifter-training method uses the same estimation procedure as the conventional method, it does not increase the computational cost of phase estimation.

We applied our lifter-training method to VC, i.e., speaker conversion. We expect that it can also be applied to other filtering-based tasks, e.g., source separation and speech enhancement.

Fig. 3: Cumulative power distributions of the differential filter
Fig. 4: Difference between the lifter trained with the proposed lifter-training method ($l=64$) and that for minimum phasing

We also discuss our sub-band-processing method. Since the characteristics of a speech waveform vary significantly from band to band, it is effective to process the waveform separately for each band. In sub-band WaveNet [17], the speech waveform is divided into several bands and down-sampled, and the waveform of each band is processed separately. In contrast, our sub-band-processing method using full-band speech does not down-sample the waveform of each frequency band. In implementing real-time conversion, the computational cost of filtering could be reduced to 1/3 by dividing the frequency domain into three bands and down-sampling the waveform of each band.

4 Experimental evaluations

4.1 Experimental conditions

We built two intra-gender VC systems: female-to-female (f2f) and male-to-male (m2m). The source and target speakers for female-to-female conversion were taken from the JSUT corpus [18] and the Voice Actress Corpus [19], respectively, and those for male-to-male conversion from the JVS corpus [20]. We used 100 utterances (approx. 12 min) of each speaker; the numbers of utterances for training, validation, and test data were 80, 10, and 10, respectively.

We used narrow-band (16 kHz) and full-band (48 kHz) speech for the evaluation. In the narrow-band case, the window length was 25 ms, the frame shift was 5 ms, the fast Fourier transform (FFT) length was 512 samples, and the number of dimensions of the cepstrum was 40 (0th through 39th). In the full-band case, the window length and frame shift were the same as in the narrow-band case, but the FFT length was 2048 samples and the number of dimensions of the cepstrum was 120 (0th through 119th). For pre-processing, the silent intervals of the training and validation data were removed, and the lengths of the source and target speech were aligned by dynamic time warping.
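As a concrete example of the narrow-band analysis settings, the STFT could be computed as follows. The use of librosa and the file name are our choices, not the paper's.

```python
import librosa

# Narrow-band settings from Sec. 4.1: 16 kHz audio, 25-ms window
# (400 samples), 5-ms shift (80 samples), 512-point FFT.
y, sr = librosa.load("source.wav", sr=16000)  # "source.wav" is a placeholder
F_x = librosa.stft(y, n_fft=512, hop_length=80, win_length=400)
# F_x has shape (257, T): one-sided complex spectra for T frames.
```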

The DNN architecture of the acoustic model was a multi-layer perceptron with two hidden layers. The numbers of hidden units were 280 and 100 in the narrow-band case and 840 and 300 in the full-band case. Each hidden layer used a gated linear unit [21] consisting of a sigmoid activation layer and a tanh activation layer, and batch normalization [22] was applied before each activation function. Adam [23] was used for optimization. During training, the cepstra of the source and target speech were normalized to zero mean and unit variance. The batch size and number of epochs were set to 1,000 and 100, respectively. The model parameters of the DNNs used with the proposed lifter-training method were initialized with those of the conventional method, and the lifter coefficients were initialized with those of the lifter for minimum phasing. In the narrow-band case, the learning rates for the conventional and proposed lifter-training methods were 0.0005 and 0.00001, respectively; in the full-band case, they were 0.0001 and 0.000005, respectively.
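For concreteness, the following is one possible PyTorch realization of the described acoustic model for the narrow-band case. The gating arrangement and the final linear output layer are our reading of the text, not the authors' released code.

```python
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    """One gated hidden layer: batch-normalized tanh and sigmoid branches
    multiplied together, as in a gated linear unit [21] (our reading)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc_a, self.fc_g = nn.Linear(d_in, d_out), nn.Linear(d_in, d_out)
        self.bn_a, self.bn_g = nn.BatchNorm1d(d_out), nn.BatchNorm1d(d_out)

    def forward(self, x):
        return torch.tanh(self.bn_a(self.fc_a(x))) * \
               torch.sigmoid(self.bn_g(self.fc_g(x)))

class AcousticModel(nn.Module):
    """MLP with two gated hidden layers (280 and 100 units, narrow-band)."""
    def __init__(self, order=40, h1=280, h2=100):
        super().__init__()
        self.net = nn.Sequential(GatedLayer(order, h1), GatedLayer(h1, h2),
                                 nn.Linear(h2, order))

    def forward(self, c_x):
        return self.net(c_x)  # differential cepstrum C_t^(D)
```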

The proposed lifter-training method was evaluated using both narrow-band (16 kHz) and full-band (48 kHz) speech. The truncated tap length $l$ was 128, 64, 48, or 32 in the narrow-band case and 224 or 192 in the full-band case. The proposed sub-band-processing method was evaluated using only full-band speech.

4.2 Objective evaluation

We compared the root mean squared error (RMSE) of the proposed lifter-training and conventional methods when changing $l$. The RMSE is obtained by taking the square root of Eq. (1). Figure 5 plots the RMSEs in male-to-male and female-to-female VC using narrow-band speech (16 kHz). The proposed lifter-training method achieved higher-precision conversion than the conventional method for all $l$. The differences in the RMSEs between the proposed and conventional methods also tended to become more significant as $l$ became smaller. This result indicates that the proposed lifter-training method can reduce the effect of filter truncation.

Fig. 5: RMSEs at each $l$ in narrow-band case (16 kHz)
Table 1: Preference scores with proposed lifter-training and conventional methods in narrow-band case (16 kHz)

Proposed | Score | $p$-value | Conventional
$l=32$ (m2m) | 0.587 vs. 0.413 | $1.3\times10^{-5}$ | $l=32$ (m2m)
$l=32$ (m2m) | 0.463 vs. 0.537 | $7.3\times10^{-2}$ | $l=512$ (m2m)
$l=32$ (f2f) | 0.642 vs. 0.358 | $<10^{-10}$ | $l=32$ (f2f)
$l=32$ (f2f) | 0.543 vs. 0.457 | $3.4\times10^{-2}$ | $l=512$ (f2f)
$l=48$ (m2m) | 0.533 vs. 0.467 | $1.0\times10^{-1}$ | $l=48$ (m2m)
$l=48$ (m2m) | 0.550 vs. 0.450 | $1.4\times10^{-2}$ | $l=512$ (m2m)
$l=48$ (f2f) | 0.613 vs. 0.387 | $1.3\times10^{-8}$ | $l=48$ (f2f)
$l=48$ (f2f) | 0.548 vs. 0.452 | $2.0\times10^{-2}$ | $l=512$ (f2f)
(a) Speaker similarity

Proposed | Score | $p$-value | Conventional
$l=32$ (m2m) | 0.687 vs. 0.313 | $<10^{-10}$ | $l=32$ (m2m)
$l=32$ (m2m) | 0.529 vs. 0.471 | $2.3\times10^{-1}$ | $l=512$ (m2m)
$l=32$ (f2f) | 0.807 vs. 0.193 | $<10^{-10}$ | $l=32$ (f2f)
$l=32$ (f2f) | 0.742 vs. 0.258 | $<10^{-10}$ | $l=512$ (f2f)
$l=48$ (m2m) | 0.606 vs. 0.394 | $8.7\times10^{-8}$ | $l=48$ (m2m)
$l=48$ (m2m) | 0.523 vs. 0.477 | $2.6\times10^{-1}$ | $l=512$ (m2m)
$l=48$ (f2f) | 0.581 vs. 0.419 | $5.5\times10^{-5}$ | $l=48$ (f2f)
$l=48$ (f2f) | 0.513 vs. 0.487 | $5.1\times10^{-1}$ | $l=512$ (f2f)
(b) Speech quality
Table 2: Preference scores with proposed lifter-training and conventional methods in full-band case (48 kHz)

Proposed | Score | $p$-value | Conventional
$l=192$ (m2m) | 0.431 vs. 0.569 | $4.9\times10^{-4}$ | $l=2048$ (m2m)
$l=192$ (f2f) | 0.519 vs. 0.481 | $3.4\times10^{-1}$ | $l=2048$ (f2f)
$l=224$ (m2m) | 0.474 vs. 0.526 | $2.0\times10^{-1}$ | $l=2048$ (m2m)
$l=224$ (f2f) | 0.519 vs. 0.481 | $3.4\times10^{-1}$ | $l=2048$ (f2f)
(a) Speaker similarity

Proposed | Score | $p$-value | Conventional
$l=192$ (m2m) | 0.529 vs. 0.471 | $2.3\times10^{-1}$ | $l=2048$ (m2m)
$l=192$ (f2f) | 0.447 vs. 0.553 | $8.9\times10^{-3}$ | $l=2048$ (f2f)
$l=224$ (m2m) | 0.513 vs. 0.487 | $5.2\times10^{-1}$ | $l=2048$ (m2m)
$l=224$ (f2f) | 0.517 vs. 0.483 | $4.2\times10^{-1}$ | $l=2048$ (f2f)
(b) Speech quality
Table 3: Preference scores with proposed sub-band-processing and conventional methods in full-band case (48 kHz)

Proposed (sub-band) | Score | $p$-value | Conventional
m2m | 0.519 vs. 0.481 | $3.4\times10^{-1}$ | m2m
f2f | 0.603 vs. 0.397 | $5.0\times10^{-7}$ | f2f
(a) Speaker similarity

Proposed (sub-band) | Score | $p$-value | Conventional
m2m | 0.721 vs. 0.279 | $<10^{-10}$ | m2m
f2f | 0.700 vs. 0.300 | $<10^{-10}$ | f2f
(b) Speech quality

4.3 Subjective evaluations

4.3.1 Evaluation of lifter training

To investigate the effectiveness of the proposed methods, we conducted a series of preference AB tests on speech quality and XAB tests on speaker similarity of converted speech. Thirty listeners participated in each of the evaluations through our crowd-sourced evaluation systems, and each listener evaluated ten speech samples. The target speaker’s natural speech was used as the reference X in the preference XAB tests.

We compared several settings of the conventional and proposed lifter-training methods. Table 1 lists the results for the narrow-band (16 kHz) case. Compared with the truncated conventional method (“Conventional ($l=32, 48$)”), the proposed lifter-training method significantly outperformed the conventional one in terms of both speaker similarity and speech quality. Also, compared with the non-truncated conventional method (“Conventional ($l=512$)”), the proposed lifter-training method (“Proposed ($l=32, 48$)”) achieved the same or higher quality. These results indicate that the proposed lifter-training method can reduce the tap length to 1/16 without degrading converted-speech quality, whereas the truncated conventional method significantly degrades converted-speech quality. The same tendency can be seen in the full-band (48 kHz) case, as shown in Table 2. The proposed method with $l=224$ had the same converted-speech quality as the non-truncated conventional method, but the proposed lifter-training method with $l=192$ degraded speaker similarity and speech quality. Therefore, the proposed lifter-training method can significantly reduce the tap length in the full-band case as well, though not as much as in the narrow-band case.

4.3.2 Evaluation of sub-band processing

We also compared our proposed sub-band-processing method with the conventional method. The AB tests on speech quality and XAB tests on speaker similarity were conducted in the same manner as described in Section 4.3.1. Table 3 lists the results. Note that lifter training and filter truncation were used in neither the conventional nor the proposed method. The proposed sub-band-processing method achieved a higher score than the conventional method except for speaker similarity in male-to-male VC, demonstrating its effectiveness.

5 Conclusion

We presented lifter-training and sub-band-processing methods for computationally efficient, high-quality voice conversion based on spectral differentials. The lifter was trained with filter truncation taken into account, and the sub-band-processing method efficiently converted the lower frequency band of a full-band voice. The experimental results indicate the superiority of our methods over the conventional method in terms of computational efficiency and converted-speech quality. For future work, we will implement real-time VC using the proposed methods and evaluate its effectiveness in converted-speech quality and latency.

Acknowledgements: Part of this work was supported by the MIC/SCOPE #182103104.

References

  • [1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice conversion through vector quantization,” in Proc. ICASSP, New York, U.S.A., Apr. 1988, pp. 655–658.
  • [2] T. Toda, “Augmented speech production based on real-time statistical voice conversion,” in Proc. GlobalSIP, Atlanta, U.S.A, Dec. 2014, pp. 592–596.
  • [3] S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, “Voice conversion using artificial neural networks,” in Proc. ICASSP, Taipei, Taiwan, Apr. 2009, pp. 3893–3896.
  • [4] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” in Proc. ICASSP, Brisbane, Australia, Apr. 2015, pp. 4869–4873.
  • [5] T. Toda, T. Muramatsu, and H. Banno, “Implementation of computationally efficient real-time voice conversion,” in Proc. INTERSPEECH, Portland, U.S.A., Sep. 2012, pp. 94–97.
  • [6] R. Arakawa, S. Takamichi, and H. Saruwatari, “Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device,” in Proc. SSW10, Vienna, Austria, Sep. 2019, pp. 93–98.
  • [7] K. Kobayashi, T. Toda, and S. Nakamura, “Intra-gender statistical singing voice conversion with direct waveform modification using log-spectral differential,” Speech Communication, vol. 99, pp. 211–220, 2018.
  • [8] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. INTERSPEECH, Stockholm, Sweden, Aug. 2017, pp. 1118–1122.
  • [9] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv, vol. abs/1802.08435, 2018.
  • [10] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 5916–5920.
  • [11] S. Imai, K. Sumita, and C. Furuichi, “Mel log spectrum approximation (MLSA) filter for speech synthesis,” Electronics and Communications in Japan, vol. 66, no. 2, pp. 10–18, 1983.
  • [12] H. Suda, G. Kotani, S. Takamichi, and D. Saito, “A revisit to feature handling for high-quality voice conversion,” in Proc. APSIPA ASC, Hawaii, U.S.A., Nov. 2018, pp. 816–822.
  • [13] M. Sunohara, C. Haruta, and N. Ono, “Low-latency real-time blind source separation with binaural directional hearing aids,” in Proc. CHAT, Stockholm, Sweden, Aug. 2017, pp. 9–13.
  • [14] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive algorithm for mel-cepstral analysis of speech,” in Proc. ICASSP, San Francisco, U.S.A., Mar 1992, pp. 137–140.
  • [15] S.-C. Pei and H.-S. Lin, “Minimum-phase FIR filter design using real cepstrum,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 53, no. 10, pp. 1113–1117, 2006.
  • [16] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
  • [17] T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai, “Subband WaveNet with overlapped single-sideband filterbanks,” in Proc. ASRU, Okinawa, Japan, Dec. 2017, pp. 698–704.
  • [18] R. Sonobe, S. Takamichi, and H. Saruwatari, “JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis,” arXiv, vol. abs/1711.00354, 2017.
  • [19] y_benjo and MagnesiumRibbon, “Voice-actress corpus,” http://voice-statistics.github.io/.
  • [20] S. Takamichi, K. Mitsui, Y. Saito, N. Tanji, T. Koriyama, and H. Saruwatari, “JVS corpus: free Japanese multi-speaker voice corpus,” arXiv, vol. abs/1908.06248, 2019.
  • [21] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” arXiv, vol. abs/1612.08083, 2016.
  • [22] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv, vol. abs/1502.03167, 2015.
  • [23] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, vol. abs/1412.6980, 2014.