Conditional Vocal Timbral Technique Conversion via Embedding-Guided Dual Attribute Modulation
Ting-Chao Hsu and Yi-Hsuan Yang
Abstract
Vocal timbral techniques—such as whisper, falsetto, and vocal fry scream—uniquely shape the spectral properties of the human voice, presenting a complex challenge for converting between them while preserving the original speaker’s identity. Traditional voice conversion methods, while effective at altering speaker identity or broad timbral qualities, often struggle to transform specialized timbral techniques without compromising speaker-specific traits. Similarly, existing style-transfer models, which are designed to capture broad categories like emotional expressiveness or singing styles, lack the necessary granularity to handle technique-specific variations. To address this, we propose FABYOL, a novel framework for timbral technique conversion built upon FACodec. FABYOL leverages supervised contrastive learning to generate embeddings that encode specific timbral techniques. These embeddings are then used to modulate timbre and prosody, enabling authentic technique conversion while preserving speaker identity. Experimental evaluation, using both tailored objective metrics and a user study, demonstrates that FABYOL achieves promising performance and offers significant improvements in fidelity and flexibility compared to state-of-the-art models. To support this task, we also introduce the EMO dataset, a high-quality, paired corpus developed with a specific focus on vocal fry scream.
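As a rough illustration of how supervised contrastive learning can yield technique embeddings, the following PyTorch sketch implements a supervised contrastive (SupCon) loss over technique labels. The embedding dimension, temperature, and the absence of a projection head are illustrative assumptions, not FABYOL's actual training configuration.

```python
# Minimal supervised contrastive (SupCon) loss over timbral-technique labels.
# Dimensions, temperature, and batching are illustrative assumptions.
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor, temperature: float = 0.07):
    """embeddings: (N, D) technique embeddings; labels: (N,) integer technique ids."""
    z = F.normalize(embeddings, dim=1)                     # unit-norm embeddings
    sim = z @ z.t() / temperature                          # pairwise scaled cosine similarity
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))        # exclude self-pairs
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    # Mean log-probability over same-technique positives, averaged over valid anchors.
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts)
    return loss[pos_mask.any(dim=1)].mean()
```

Training with such an objective pulls together clips that share a technique label and pushes apart clips of different techniques, regardless of singer, which is what lets the resulting embedding serve as a technique condition downstream.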
Timbral Technique Conversion
To evaluate FABYOL and the baseline models on the timbral technique conversion task, we randomly select reference and source samples from the test set, covering both seen and unseen singers. The source audio, in modal voice, is conditioned on the reference audio to convert it to the target timbral technique.
🎵 Falsetto Conversion
Unseen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
Seen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
🎵 Whisper Conversion
Unseen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
Seen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
🎵 Scream Conversion
Unseen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
Seen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
🎵 Modal Conversion
In the modal-to-modal conversion, we can better assess the model's ability to preserve the source speaker's identity.
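Beyond the paper's tailored objective metrics, one common way to check identity preservation (shown purely as an illustration, not necessarily the metric used here) is the cosine similarity between speaker embeddings of the source and the converted audio, for example with Resemblyzer; the file names below are placeholders.

```python
# Illustrative speaker-similarity check: cosine similarity between d-vector
# embeddings of the source and converted clips (not necessarily the paper's metric).
from pathlib import Path
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def speaker_similarity(path_a: str, path_b: str) -> float:
    # Resemblyzer embeddings are already L2-normalized, so a dot product is cosine similarity.
    emb_a = encoder.embed_utterance(preprocess_wav(Path(path_a)))
    emb_b = encoder.embed_utterance(preprocess_wav(Path(path_b)))
    return float(np.dot(emb_a, emb_b))

# Higher similarity to the source indicates better identity preservation.
print(speaker_similarity("source_modal.wav", "converted_modal.wav"))
```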
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
Ablation Study
We first assess the contribution of dual attribute modulation (of timbre and prosody) to achieving authentic conversions. We then evaluate the targeted augmentation strategy used for learning timbral technique representations with contrastive learning.
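To make the idea of dual attribute modulation concrete, the sketch below conditions both the timbre and the prosody feature streams on the same technique embedding using FiLM-style scale-and-shift layers. The module names, dimensions, and the FiLM formulation itself are illustrative assumptions; FABYOL's actual modulation mechanism may differ.

```python
# FiLM-style conditioning of two attribute streams by a technique embedding.
# Layer names and dimensions are illustrative assumptions, not FABYOL's exact design.
import torch
import torch.nn as nn

class DualAttributeModulation(nn.Module):
    def __init__(self, emb_dim=256, timbre_dim=256, prosody_dim=256):
        super().__init__()
        # Each head predicts a per-channel scale (gamma) and shift (beta).
        self.timbre_film = nn.Linear(emb_dim, 2 * timbre_dim)
        self.prosody_film = nn.Linear(emb_dim, 2 * prosody_dim)

    @staticmethod
    def _film(x, params):
        gamma, beta = params.chunk(2, dim=-1)
        return (1 + gamma).unsqueeze(1) * x + beta.unsqueeze(1)  # broadcast over time

    def forward(self, timbre_feats, prosody_feats, technique_emb):
        """timbre_feats, prosody_feats: (B, T, D); technique_emb: (B, emb_dim)."""
        timbre_out = self._film(timbre_feats, self.timbre_film(technique_emb))
        prosody_out = self._film(prosody_feats, self.prosody_film(technique_emb))
        return timbre_out, prosody_out
```

Under this reading, the "w/o Prosody Modulation" condition below corresponds roughly to dropping the prosody branch while keeping timbre modulation.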
Dual Modulation
Audio samples for Falsetto, Whisper, Scream, and Modal, comparing: Ground Truth | w/o Prosody Modulation | Dual Modulation
BYOL-TT Augmentations
Audio samples for Falsetto, Whisper, Scream, and Modal, comparing: Ground Truth | DSP Augmentations | Real Data Selection
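The page does not spell out which transforms the "DSP Augmentations" condition uses; purely as an illustration of the kind of signal-level augmentations commonly used to create contrastive views of a clip, here is a small librosa sketch. The transforms, parameter ranges, and file name are assumptions, not the paper's recipe.

```python
# Generic DSP augmentations often used to create two contrastive views of a voice clip.
# The specific transforms and parameter ranges are assumptions, not the paper's recipe.
import numpy as np
import librosa

def augment(wav, sr, rng=np.random.default_rng()):
    wav = librosa.effects.pitch_shift(wav, sr=sr, n_steps=rng.uniform(-2, 2))  # semitones
    wav = librosa.effects.time_stretch(wav, rate=rng.uniform(0.9, 1.1))
    wav = wav + 0.005 * rng.standard_normal(len(wav))  # light additive noise
    return wav.astype(np.float32)

y, sr = librosa.load("clip.wav", sr=None)          # placeholder path
view_a, view_b = augment(y, sr), augment(y, sr)    # two augmented views of the same clip
```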
Ting-Chao Hsu and Yi-Hsuan Yang, "Conditional vocal timbral technique conversion via embedding-guided dual attribute modulation," in AAAI Workshop on Emerging AI Technologies for Music (EAIM), 2026.