SqueezeFlow: Adaptive Text-to-Speech in Low Computational Resource Scenarios
This is the audio demo page for the SqueezeFlow paper. We present demos of TTS synthesis to seen target speakers, TTS synthesis to unseen target speakers, and an ablation study on vocoder quality.
Abstract
Adaptive text-to-speech (TTS) has many important applications on edge devices, such as synthesizing personalized voices for the speech impaired, producing customized speech in translation apps, etc. However, existing models either require too much memory to adapt on the edge or too much computation for real-time inference on the edge. On the one hand, some auto-regressive TTS models can run inference in real time on the edge, but the limited memory available on edge devices precludes training these models through backpropagation to adapt to unseen speakers. On the other hand, flow-based models are fully invertible, allowing efficient backpropagation with limited memory; however, the invertibility requirement of flow-based models reduces their expressivity, leading to larger and more expensive models to produce audio of the same fidelity. In this paper, we propose a flow-based adaptive TTS system with an extremely low computational cost, achieved by manipulating the dimensions of the "information bottleneck" between a series of flows. The system, which requires only 7.2 GMACs for inference (42x smaller than its flow-based baselines), can run inference in real time on the edge. Because it is flow-based, the system also has the potential to perform adaptation with the limited amount of memory available at the edge. Despite its low cost, we show empirically that the audio generated by our system matches target speakers' voices with no significant reduction in fidelity or audio naturalness compared to baseline models.
TTS Synthesis to Seen Target Speakers
In this section, we convert source mel-spectrograms to the voices of target speakers that are present in the training set. We present three systems for comparison:
- Blow, the baseline, which performs voice conversion directly on the ground-truth audio
- SqueezeFlow, our model presented in the paper, which first converts the source mel-spectrogram to the target speaker's voice and then vocodes the converted mel-spectrogram into audio (a sketch of this two-stage pipeline follows below)
- SqueezeFlow + WaveGlow, an ablation variant, which uses our proposed model to convert the mel-spectrogram to the target speaker's voice and then uses the existing WaveGlow model to transform the converted mel-spectrogram into audio
The sentences presented here come from our test set and are not seen during training (i.e., only the training sentences from these speakers are seen). The goal of our system is to generate audio that is both natural and similar to the target speaker's voice.
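As a rough illustration only, the sketch below outlines the two-stage pipeline (mel-to-mel conversion followed by vocoding) described above. The `converter` and `vocoder` objects and their `convert`/`infer` methods are hypothetical placeholders for this demo page, not the actual SqueezeFlow or WaveGlow APIs.

```python
import numpy as np

def synthesize_to_target(source_mel: np.ndarray,
                         target_speaker_id: int,
                         converter,
                         vocoder) -> np.ndarray:
    """Two-stage synthesis sketch: convert the source mel-spectrogram to the
    target speaker's voice, then vocode it into a waveform.
    (Placeholder interfaces, not the real SqueezeFlow/WaveGlow APIs.)"""
    # Stage 1: mel-to-mel conversion toward the target speaker's voice.
    converted_mel = converter.convert(source_mel, target_speaker_id)
    # Stage 2: vocode the converted mel-spectrogram into raw audio
    # (SqueezeFlow-V in our system; WaveGlow in the ablation variant).
    waveform = vocoder.infer(converted_mel)
    return waveform
```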
Conversion | Source Speech (Content Provider) | Target Speech (Style Provider) | Blow | SqueezeFlow | SqueezeFlow + WaveGlow
---|---|---|---|---|---
Male to Female | [audio] | [audio] | [audio] | [audio] | [audio]
Male to Male | [audio] | [audio] | [audio] | [audio] | [audio]
Female to Female | [audio] | [audio] | [audio] | [audio] | [audio]
Female to Male | [audio] | [audio] | [audio] | [audio] | [audio]
TTS Synthesis to Unseen Target Speakers
In this section, we convert source mel-spectrograms to the voices of target speakers that are not present in the training set. Since our baseline Blow does not currently support conversion to unseen speakers, we present only two systems here for comparison:
- SqueezeFlow, our model presented in the paper, which first converts the source mel-spectrogram to the target speaker's voice and then vocodes the converted mel-spectrogram into audio
- SqueezeFlow + WaveGlow, an ablation variant, which uses our proposed model to convert the mel-spectrogram to the target speaker's voice and then uses the existing WaveGlow model to transform the converted mel-spectrogram into audio
Neither the sentences nor the speakers are seen during training. The model learns each new speaker's style from roughly 21 minutes of speech, following the methodology described in Section 3.4 of the paper. The goal of our system is still to generate audio that is both natural and similar to the target speaker's voice.
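The exact adaptation procedure is given in Section 3.4 of the paper; the snippet below is only a generic sketch of what likelihood-based fine-tuning of a flow model on the new speaker's data looks like. The `log_likelihood` method, the batch iterator, and the step count and learning rate are assumptions for illustration, not the paper's actual interface or hyperparameters.

```python
import torch

def adapt_to_new_speaker(flow_model, new_speaker_batches, steps=1000, lr=1e-4):
    """Generic sketch of likelihood-based adaptation of a flow model to an
    unseen speaker (placeholder interface, not the actual SqueezeFlow API)."""
    optimizer = torch.optim.Adam(flow_model.parameters(), lr=lr)
    for _ in range(steps):
        mel = next(new_speaker_batches)  # mel-spectrograms of the unseen speaker
        optimizer.zero_grad()
        # Flows are fully invertible, so the exact log-likelihood can be
        # computed and backpropagated with limited activation memory.
        loss = -flow_model.log_likelihood(mel).mean()
        loss.backward()
        optimizer.step()
    return flow_model
```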
Conversion | Source Speech (Content Provider) | Target Speech (Style Provider) | SqueezeFlow | SqueezeFlow + WaveGlow
---|---|---|---|---
Male to Female | [audio] | [audio] | [audio] | [audio]
Male to Male | [audio] | [audio] | [audio] | [audio]
Female to Female | [audio] | [audio] | [audio] | [audio]
Female to Male | [audio] | [audio] | [audio] | [audio]
Ablation Study on Vocoder Quality
In this section, we present audio demos to evaluate the vocoder part of our SqueezeFlow model. We compare its audio naturalness to the baseline vocoder WaveGlow by using both models to generate audio from ground-truth mel-spectrograms. Since Section 4.2 of the paper tests our vocoder under a variety of bottleneck dimension configurations to achieve different levels of efficiency, we also use this demo to show their respective audio naturalness. The models we present here are:
- WaveGlow, the pretrained baseline vocoder proposed by Prenger et al. (2019)
- SqueezeFlow-V, our vocoder model presented in the paper, whose bottleneck dimensions are L=128, C=256
- SqueezeFlow-V-128S, a variant whose bottleneck dimensions are L=128, C=128
- SqueezeFlow-V-64L, a variant whose bottleneck dimensions are L=64, C=256
- SqueezeFlow-V-64S, a variant whose bottleneck dimensions are L=64, C=128
In the table below, we compare the models in terms of their generated audio, the computational cost of generating 1 s of 22 kHz audio (in GMACs), and the number of parameters (in millions).
Model | Bottleneck Config | GMACs | Params (M) | Audio
---|---|---|---|---
WaveGlow | -- | 228.9 | 87.7 | [audio]
SqueezeFlow-V | L=128, C=256 | 3.78 | 23.6 | [audio]
SqueezeFlow-V-128S | L=128, C=128 | 1.07 | 7.1 | [audio]
SqueezeFlow-V-64L | L=64, C=256 | 2.16 | 24.6 | [audio]
SqueezeFlow-V-64S | L=64, C=128 | 0.69 | 8.8 | [audio]
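To put the GMAC column in perspective, the short script below turns the cost per second of 22 kHz audio into a back-of-the-envelope real-time factor. Only the GMAC figures come from the table above; the assumed device throughput is an arbitrary example value, not a measurement from the paper.

```python
# Back-of-the-envelope real-time-factor estimate from the GMAC figures above.
# The throughput below (GMACs the hardware sustains per second) is a made-up
# example value for illustration, not a number from the paper.
ASSUMED_DEVICE_THROUGHPUT_GMACS_PER_S = 50.0

vocoder_cost_gmacs_per_second_of_audio = {
    "WaveGlow": 228.9,
    "SqueezeFlow-V": 3.78,
    "SqueezeFlow-V-128S": 1.07,
    "SqueezeFlow-V-64L": 2.16,
    "SqueezeFlow-V-64S": 0.69,
}

for name, cost in vocoder_cost_gmacs_per_second_of_audio.items():
    # Real-time factor: compute time needed per second of generated audio;
    # values below 1.0 mean faster than real time under this assumption.
    rtf = cost / ASSUMED_DEVICE_THROUGHPUT_GMACS_PER_S
    print(f"{name:>18}: {rtf:.3f}x real time")
```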