SqueezeFlow: Adaptive Text-to-Speech in Low Computational Resource Scenarios
This is the audio demo page for the SqueezeFlow paper. We present demos of TTS synthesis to seen target speakers, TTS synthesis to unseen target speakers, and an ablation study on vocoder quality.
Abstract
Adaptive text-to-speech (TTS) has many important applications on edge devices, such as synthesizing personalized voices for the speech impaired, producing customized speech in translation apps, etc. However, existing models either require too much memory to adapt on the edge or too much computation for real-time inference on the edge. On the one hand, some auto-regressive TTS models can run inference in real time on the edge, but the limited memory available on edge devices precludes training these models through backpropagation to adapt to unseen speakers. On the other hand, flow-based models are fully invertible, allowing efficient backpropagation with limited memory; however, the invertibility requirement of flow-based models reduces their expressivity, leading to larger and more expensive models to produce audio of the same fidelity. In this paper, we propose a flow-based adaptive TTS system with an extremely low computational cost, achieved by manipulating the dimensions of the "information bottleneck" between a series of flows. The system, which requires only 7.2 GMACs for inference (42x smaller than its flow-based baselines), can run inference in real time on the edge. Because it is flow-based, the system also has the potential to perform adaptation with the limited amount of memory available at the edge. Despite its low cost, we show empirically that the audio generated by our system matches target speakers' voices with no significant reduction in fidelity or audio naturalness compared to baseline models.
TTS Synthesis to Seen Target Speakers
In this section, we convert source mel-spectrograms to the voices of target speakers that are present in the training set. We present three systems for comparison:
- Blow, the baseline, which performs voice conversion directly on the ground-truth audio
- SqueezeFlow, our model presented in the paper, which first converts the source mel-spectrogram to the target speaker's voice and then vocodes the converted mel-spectrogram into audio (a sketch of this two-stage pipeline follows below)
- SqueezeFlow + WaveGlow, an ablation variant, which uses our proposed model to convert the mel-spectrogram to the target speaker's voice and then uses the existing WaveGlow model to transform the converted mel-spectrogram into audio
The sentences presented here come from our test set and are not seen during training (i.e., only the training sentences from these speakers are seen). The goal of our system is to generate audio that is both natural and similar to the target speaker's voice.
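As a rough illustration only, the sketch below outlines the two-stage pipeline (mel-to-mel conversion followed by vocoding) described above. The `converter` and `vocoder` objects and their `convert`/`infer` methods are hypothetical placeholders for this demo page, not the actual SqueezeFlow or WaveGlow APIs.

```python
import numpy as np

def synthesize_to_target(source_mel: np.ndarray,
                         target_speaker_id: int,
                         converter,
                         vocoder) -> np.ndarray:
    """Two-stage synthesis sketch: convert the source mel-spectrogram to the
    target speaker's voice, then vocode it into a waveform.
    (Placeholder interfaces, not the real SqueezeFlow/WaveGlow APIs.)"""
    # Stage 1: mel-to-mel conversion toward the target speaker's voice.
    converted_mel = converter.convert(source_mel, target_speaker_id)
    # Stage 2: vocode the converted mel-spectrogram into raw audio
    # (SqueezeFlow-V in our system; WaveGlow in the ablation variant).
    waveform = vocoder.infer(converted_mel)
    return waveform
```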
Conversion | Source Speech (Content Provider) | Target Speech (Style Provider) | Blow | SqueezeFlow | SqueezeFlow + WaveGlow
---|---|---|---|---|---
Male to Female | [audio] | [audio] | [audio] | [audio] | [audio]
Male to Male | [audio] | [audio] | [audio] | [audio] | [audio]
Female to Female | [audio] | [audio] | [audio] | [audio] | [audio]
Female to Male | [audio] | [audio] | [audio] | [audio] | [audio]
TTS Synthesis to Unseen Target Speakers
In this section, we convert source mel-spectrograms to the voices of target speakers that are not present in the training set. Since our baseline Blow does not currently support conversion to unseen speakers, we present only two systems here for comparison:
- SqueezeFlow, our model presented in the paper, which first converts the source mel-spectrogram to the target speaker's voice and then vocodes the converted mel-spectrogram into audio
- SqueezeFlow + WaveGlow, an ablation variant, which uses our proposed model to convert the mel-spectrogram to the target speaker's voice and then uses the existing WaveGlow model to transform the converted mel-spectrogram into audio
Neither the sentences nor the speakers are seen during training. The model learns each new speaker's style from roughly 21 minutes of speech, following the methodology described in Section 3.4 of the paper. The goal of our system is still to generate audio that is both natural and similar to the target speaker's voice.
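The exact adaptation procedure is given in Section 3.4 of the paper; the snippet below is only a generic sketch of what likelihood-based fine-tuning of a flow model on the new speaker's data looks like. The `log_likelihood` method, the batch iterator, and the step count and learning rate are assumptions for illustration, not the paper's actual interface or hyperparameters.

```python
import torch

def adapt_to_new_speaker(flow_model, new_speaker_batches, steps=1000, lr=1e-4):
    """Generic sketch of likelihood-based adaptation of a flow model to an
    unseen speaker (placeholder interface, not the actual SqueezeFlow API)."""
    optimizer = torch.optim.Adam(flow_model.parameters(), lr=lr)
    for _ in range(steps):
        mel = next(new_speaker_batches)  # mel-spectrograms of the unseen speaker
        optimizer.zero_grad()
        # Flows are fully invertible, so the exact log-likelihood can be
        # computed and backpropagated with limited activation memory.
        loss = -flow_model.log_likelihood(mel).mean()
        loss.backward()
        optimizer.step()
    return flow_model
```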
Conversion | Source Speech (Content Provider) | Target Speech (Style Provider) | SqueezeFlow | SqueezeFlow + WaveGlow
---|---|---|---|---
Male to Female | [audio] | [audio] | [audio] | [audio]
Male to Male | [audio] | [audio] | [audio] | [audio]
Female to Female | [audio] | [audio] | [audio] | [audio]
Female to Male | [audio] | [audio] | [audio] | [audio]
Ablation Study on Vocoder Quality
In this section, we present audio demos to evaluate the vocoder part of our SqueezeFlow model. We compare its audio naturalness to the baseline vocoder WaveGlow by using both models to generate audio from ground-truth mel-spectrograms. Since Section 4.2 of the paper tests our vocoder under a variety of bottleneck dimension configurations to achieve different levels of efficiency, we also use this demo to show their respective audio naturalness. The models we present here are:
- WaveGlow, the pretrained baseline vocoder proposed by Prenger et al. (2019)
- SqueezeFlow-V, our vocoder model presented in the paper, whose bottleneck dimensions are L=128, C=256
- SqueezeFlow-V-128S, a variant whose bottleneck dimensions are L=128, C=128
- SqueezeFlow-V-64L, a variant whose bottleneck dimensions are L=64, C=256
- SqueezeFlow-V-64S, a variant whose bottleneck dimensions are L=64, C=128
In the table below, we compare the models in terms of their generated audio, the computational cost of generating 1 s of 22 kHz audio (in GMACs), and the number of parameters (in millions).
Model | Bottleneck Config | GMACs | Params (M) | Audio
---|---|---|---|---
WaveGlow | -- | 228.9 | 87.7 | [audio]
SqueezeFlow-V | L=128, C=256 | 3.78 | 23.6 | [audio]
SqueezeFlow-V-128S | L=128, C=128 | 1.07 | 7.1 | [audio]
SqueezeFlow-V-64L | L=64, C=256 | 2.16 | 24.6 | [audio]
SqueezeFlow-V-64S | L=64, C=128 | 0.69 | 8.8 | [audio]
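To put the GMAC column in perspective, the short script below turns the cost per second of 22 kHz audio into a back-of-the-envelope real-time factor. Only the GMAC figures come from the table above; the assumed device throughput is an arbitrary example value, not a measurement from the paper.

```python
# Back-of-the-envelope real-time-factor estimate from the GMAC figures above.
# The throughput below (GMACs the hardware sustains per second) is a made-up
# example value for illustration, not a number from the paper.
ASSUMED_DEVICE_THROUGHPUT_GMACS_PER_S = 50.0

vocoder_cost_gmacs_per_second_of_audio = {
    "WaveGlow": 228.9,
    "SqueezeFlow-V": 3.78,
    "SqueezeFlow-V-128S": 1.07,
    "SqueezeFlow-V-64L": 2.16,
    "SqueezeFlow-V-64S": 0.69,
}

for name, cost in vocoder_cost_gmacs_per_second_of_audio.items():
    # Real-time factor: compute time needed per second of generated audio;
    # values below 1.0 mean faster than real time under this assumption.
    rtf = cost / ASSUMED_DEVICE_THROUGHPUT_GMACS_PER_S
    print(f"{name:>18}: {rtf:.3f}x real time")
```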