ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization
Abstract
In recent years, diffusion-based generative models such as Denoising Diffusion Probabilistic Models (DDPM) have demonstrated remarkable performance in voice conversion. However, this performance comes at the cost of a large number of sampling steps, which hinders practical application in real-world scenarios. In this paper, we introduce ReFlow-VC, a novel high-fidelity voice conversion method based on rectified flow. Specifically, ReFlow-VC is an Ordinary Differential Equation (ODE) model that transforms a Gaussian distribution into the true Mel-spectrogram distribution along the most direct path. Furthermore, we propose a modeling approach that optimizes speaker features using both content and pitch information, so that the speaker features reflect the properties of the current speech more accurately. Experimental results show that ReFlow-VC performs exceptionally well on small datasets and in zero-shot scenarios.
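As a rough illustration of why a direct ODE path reduces the number of sampling steps, the sketch below integrates a rectified-flow style ODE with fixed Euler steps. It is not the ReFlow-VC implementation: `toy_velocity`, `euler_sample`, and the three-element `target` are hypothetical stand-ins, and the velocity field is a closed-form toy that assumes every path ends at a single known point rather than a learned, speaker-conditioned network over Mel-spectrograms.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    Assumes all probability paths end at one known point `target`, so the
    ideal straight-line velocity is (target - x) / (1 - t).
    """
    return [(m - xi) / (1.0 - t) for xi, m in zip(x, target)]


def euler_sample(x0, target, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
target = [0.5, -1.0, 2.0]  # toy "Mel frame", an assumption for illustration
noise = [random.gauss(0.0, 1.0) for _ in target]  # Gaussian starting point

one_step = euler_sample(noise, target, 1)
fifty_steps = euler_sample(noise, target, 50)
```

In this toy setting the trajectory is exactly straight, so a single Euler step lands on the same point as 50 steps. A learned velocity field is only approximately straight, which is presumably why the samples below compare 1 step, 50 steps, and an adaptive RK45 solver.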
Voice Conversion Results
The following audio examples compare the original recordings, AUTO-VC, Free-VC, Diff-VC (1 step, 50 steps, 1000 steps), our proposed ReFlow-VC (1 step, 50 steps, RK45 solver), and the ablation model NReFlow-VC (1 step, 50 steps, RK45 solver), which omits the feature fusion module. The samples are drawn from the LibriTTS test set for zero-shot voice conversion.