
Abstract

Vocal timbral techniques—such as whisper, falsetto, and vocal fry scream—uniquely shape the spectral properties of the human voice, presenting a complex challenge for converting between them while preserving the original speaker’s identity. Traditional voice conversion methods, while effective at altering speaker identity or broad timbral qualities, often struggle to transform specialized timbral techniques without compromising speaker-specific traits. Similarly, existing style-transfer models, which are designed to capture broad categories like emotional expressiveness or singing styles, lack the necessary granularity to handle technique-specific variations. To address this, we propose FABYOL, a novel framework for timbral technique conversion built upon FACodec. FABYOL leverages supervised contrastive learning to generate embeddings that encode specific timbral techniques. These embeddings are then used to modulate timbre and prosody, enabling authentic technique conversion while preserving speaker identity. Experimental evaluation, using both tailored objective metrics and a user study, demonstrates that FABYOL achieves promising performance and offers significant improvements in fidelity and flexibility compared to state-of-the-art models. To support this task, we also introduce the EMO dataset, a high-quality, paired corpus developed with a specific focus on vocal fry scream.
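
For readers who want a concrete picture of the supervised contrastive learning step mentioned above, the sketch below implements a standard SupCon-style loss over technique labels (modal, falsetto, whisper, scream) in PyTorch. The function name, dimensions, and hyperparameters are illustrative assumptions of ours, not code from the FABYOL implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    """Illustrative SupCon-style loss: pull together embeddings that share a
    timbral technique label, push apart the rest (after Khosla et al., 2020)."""
    z = F.normalize(embeddings, dim=1)            # unit-norm technique embeddings (N, D)
    sim = torch.matmul(z, z.T) / temperature      # pairwise scaled cosine similarities (N, N)
    # Exclude self-similarity from both numerator and denominator.
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # Positives: pairs sharing the same technique label (anchor itself excluded).
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()       # average over anchors that have positives

# Toy usage: a batch of 8 embeddings covering four techniques.
emb = torch.randn(8, 128, requires_grad=True)
lab = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])      # 0=modal, 1=falsetto, 2=whisper, 3=scream
print(supervised_contrastive_loss(emb, lab))
```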

FABYOL Model Architecture

Timbral Technique Conversion

To evaluate FABYOL and the baseline models on the timbral technique conversion task, we randomly select samples from the test set, with unseen singers as both reference and source; examples with seen singers are also included below for comparison. The source audio, in modal voice, is conditioned on the reference audio to convert it to the target timbral technique.

🎵 Falsetto Conversion

Unseen Speakers

Audio samples: Reference · Source · Ground Truth · CosyVoice · FreeVC · FACodec · FABYOL (Proposed)

Seen Speakers

Audio samples: Reference · Source · Ground Truth · CosyVoice · FreeVC · FACodec · FABYOL (Proposed)

🎵 Whisper Conversion

Unseen Speakers

Audio samples: Reference · Source · Ground Truth · CosyVoice · FreeVC · FACodec · FABYOL (Proposed)

Seen Speakers

Audio samples: Reference · Source · Ground Truth · CosyVoice · FreeVC · FACodec · FABYOL (Proposed)

🎵 Scream Conversion

Unseen Speakers

Audio samples: Reference · Source · Ground Truth · CosyVoice · FreeVC · FACodec · FABYOL (Proposed)

Seen Speakers

Audio samples: Reference · Source · Ground Truth · CosyVoice · FreeVC · FACodec · FABYOL (Proposed)

🎵 Modal Conversion

Modal-to-modal conversion lets us better assess the model's ability to preserve the source speaker's identity.

Audio samples: Reference · Source · Ground Truth · CosyVoice · FreeVC · FACodec · FABYOL (Proposed)

Ablation Study

We conduct two ablation studies. The first assesses the importance of dual attribute modulation (of timbre and prosody) for achieving authentic conversions; the second evaluates the targeted augmentation strategy used to learn timbral technique representations with contrastive learning.
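
As background for the dual modulation ablation below, the following sketch shows one way embedding-guided dual attribute modulation could be realized: FiLM-style affine conditioning of the timbre and prosody feature streams by the technique embedding. This is our own illustrative assumption; the class name, layer choices, and dimensions are placeholders rather than FABYOL's actual modulation module.

```python
import torch
import torch.nn as nn

class DualAttributeModulator(nn.Module):
    """Illustrative FiLM-style modulation of timbre and prosody features
    by a technique embedding. Dimensions and layers are placeholders."""
    def __init__(self, emb_dim=128, feat_dim=256):
        super().__init__()
        # One affine (scale, shift) predictor per attribute stream.
        self.timbre_film = nn.Linear(emb_dim, 2 * feat_dim)
        self.prosody_film = nn.Linear(emb_dim, 2 * feat_dim)

    @staticmethod
    def _film(features, params):
        # features: (B, T, F); params: (B, 2F) -> per-channel scale and shift.
        scale, shift = params.chunk(2, dim=-1)
        return features * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)

    def forward(self, timbre_feats, prosody_feats, technique_emb):
        timbre_out = self._film(timbre_feats, self.timbre_film(technique_emb))
        prosody_out = self._film(prosody_feats, self.prosody_film(technique_emb))
        return timbre_out, prosody_out

# Toy usage: modulate 100-frame timbre/prosody streams with one technique embedding.
mod = DualAttributeModulator()
timbre, prosody = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
tech = torch.randn(2, 128)
t_out, p_out = mod(timbre, prosody, tech)
print(t_out.shape, p_out.shape)
```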

Dual Modulation

Audio samples (Falsetto · Whisper · Scream · Modal): Ground Truth · w/o Prosody Modulation · Dual Modulation

BYOL-TT Augmentations

Audio samples (Falsetto · Whisper · Scream · Modal): Ground Truth · DSP Augmentations · Real Data Selection
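
The "DSP Augmentations" condition above contrasts synthetic view generation against selecting real paired recordings. As a rough illustration of what DSP-based view generation for technique representation learning could look like, the snippet below builds two randomly augmented views of one utterance with librosa; the specific transforms and parameter ranges are our assumptions, not the paper's actual pipeline.

```python
import numpy as np
import librosa

def dsp_views(wav, sr, rng=None):
    """Create two stochastically augmented views of one utterance, e.g. as
    positive pairs for contrastive or bootstrap-style representation learning.
    Transform set and ranges below are illustrative only."""
    if rng is None:
        rng = np.random.default_rng()

    def one_view(y):
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))  # +/- 2 semitones
        y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))        # 0.9x - 1.1x speed
        y = y + rng.normal(0.0, 0.005, size=y.shape).astype(y.dtype)           # light additive noise
        return y

    return one_view(wav), one_view(wav)

# Toy usage on a one-second 220 Hz tone.
sr = 16000
wav = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
view_a, view_b = dsp_views(wav, sr)
print(view_a.shape, view_b.shape)
```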

Ting-Chao Hsu and Yi-Hsuan Yang, "Conditional vocal timbral technique conversion via embedding-guided dual attribute modulation," in AAAI Workshop on Emerging AI Technologies for Music (EAIM), 2026.