Conditional Vocal Timbral Technique Conversion via Embedding-Guided Dual Attribute Modulation
Ting-Chao Hsu and Yi-Hsuan Yang
Abstract
Vocal timbral techniques—such as whisper, falsetto, and vocal fry scream—uniquely shape the spectral properties of the human voice, presenting a complex challenge for converting between them while preserving the original speaker’s identity. Traditional voice conversion methods, while effective at altering speaker identity or broad timbral qualities, often struggle to transform specialized timbral techniques without compromising speaker-specific traits. Similarly, existing style-transfer models, which are designed to capture broad categories like emotional expressiveness or singing styles, lack the necessary granularity to handle technique-specific variations. To address this, we propose FABYOL, a novel framework for timbral technique conversion built upon FACodec. FABYOL leverages supervised contrastive learning to generate embeddings that encode specific timbral techniques. These embeddings are then used to modulate timbre and prosody, enabling authentic technique conversion while preserving speaker identity. Experimental evaluation, using both tailored objective metrics and a user study, demonstrates that FABYOL achieves promising performance and offers significant improvements in fidelity and flexibility compared to state-of-the-art models. To support this task, we also introduce the EMO dataset, a high-quality, paired corpus developed with a specific focus on vocal fry scream.
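As a rough illustration of how supervised contrastive learning can yield technique embeddings, the following PyTorch sketch implements a supervised contrastive (SupCon) loss over technique labels. The embedding dimension, temperature, and the absence of a projection head are illustrative assumptions, not FABYOL's actual training configuration.

```python
# Minimal supervised contrastive (SupCon) loss over timbral-technique labels.
# Dimensions, temperature, and batching are illustrative assumptions.
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor, temperature: float = 0.07):
    """embeddings: (N, D) technique embeddings; labels: (N,) integer technique ids."""
    z = F.normalize(embeddings, dim=1)                     # unit-norm embeddings
    sim = z @ z.t() / temperature                          # pairwise scaled cosine similarity
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))        # exclude self-pairs
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    # Mean log-probability over same-technique positives, averaged over valid anchors.
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts)
    return loss[pos_mask.any(dim=1)].mean()
```

Training with such an objective pulls together clips that share a technique label and pushes apart clips of different techniques, regardless of singer, which is what lets the resulting embedding serve as a technique condition downstream.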
Timbral Technique Conversion
To evaluate FABYOL and the baseline models on the timbral technique conversion task, we randomly select reference and source samples from the test set, covering both seen and unseen singers. The source audio, in modal voice, is conditioned on the reference audio to convert it to the target timbral technique.
🎵 Falsetto Conversion
Unseen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
Seen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
🎵 Whisper Conversion
Unseen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
Seen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
🎵 Scream Conversion
Unseen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
Seen Speakers
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
🎵 Modal Conversion
In the modal-to-modal conversion, we can better assess the model's ability to preserve the source speaker's identity.
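Beyond the paper's tailored objective metrics, one common way to check identity preservation (shown purely as an illustration, not necessarily the metric used here) is the cosine similarity between speaker embeddings of the source and the converted audio, for example with Resemblyzer; the file names below are placeholders.

```python
# Illustrative speaker-similarity check: cosine similarity between d-vector
# embeddings of the source and converted clips (not necessarily the paper's metric).
from pathlib import Path
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def speaker_similarity(path_a: str, path_b: str) -> float:
    # Resemblyzer embeddings are already L2-normalized, so a dot product is cosine similarity.
    emb_a = encoder.embed_utterance(preprocess_wav(Path(path_a)))
    emb_b = encoder.embed_utterance(preprocess_wav(Path(path_b)))
    return float(np.dot(emb_a, emb_b))

# Higher similarity to the source indicates better identity preservation.
print(speaker_similarity("source_modal.wav", "converted_modal.wav"))
```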
Two audio examples, each with: Reference | Source | Ground Truth | CosyVoice | FreeVC | FACodec | FABYOL (Proposed)
Ablation Study
We first assess the contribution of dual attribute modulation (of timbre and prosody) to achieving authentic conversions. We then evaluate the targeted augmentation strategy used for learning timbral technique representations with contrastive learning.
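To make the idea of dual attribute modulation concrete, the sketch below conditions both the timbre and the prosody feature streams on the same technique embedding using FiLM-style scale-and-shift layers. The module names, dimensions, and the FiLM formulation itself are illustrative assumptions; FABYOL's actual modulation mechanism may differ.

```python
# FiLM-style conditioning of two attribute streams by a technique embedding.
# Layer names and dimensions are illustrative assumptions, not FABYOL's exact design.
import torch
import torch.nn as nn

class DualAttributeModulation(nn.Module):
    def __init__(self, emb_dim=256, timbre_dim=256, prosody_dim=256):
        super().__init__()
        # Each head predicts a per-channel scale (gamma) and shift (beta).
        self.timbre_film = nn.Linear(emb_dim, 2 * timbre_dim)
        self.prosody_film = nn.Linear(emb_dim, 2 * prosody_dim)

    @staticmethod
    def _film(x, params):
        gamma, beta = params.chunk(2, dim=-1)
        return (1 + gamma).unsqueeze(1) * x + beta.unsqueeze(1)  # broadcast over time

    def forward(self, timbre_feats, prosody_feats, technique_emb):
        """timbre_feats, prosody_feats: (B, T, D); technique_emb: (B, emb_dim)."""
        timbre_out = self._film(timbre_feats, self.timbre_film(technique_emb))
        prosody_out = self._film(prosody_feats, self.prosody_film(technique_emb))
        return timbre_out, prosody_out
```

Under this reading, the "w/o Prosody Modulation" condition below corresponds roughly to dropping the prosody branch while keeping timbre modulation.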
Dual Modulation
Audio samples for Falsetto, Whisper, Scream, and Modal, comparing: Ground Truth | w/o Prosody Modulation | Dual Modulation
BYOL-TT Augmentations
Audio samples for Falsetto, Whisper, Scream, and Modal, comparing: Ground Truth | DSP Augmentations | Real Data Selection
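The page does not spell out which transforms the "DSP Augmentations" condition uses; purely as an illustration of the kind of signal-level augmentations commonly used to create contrastive views of a clip, here is a small librosa sketch. The transforms, parameter ranges, and file name are assumptions, not the paper's recipe.

```python
# Generic DSP augmentations often used to create two contrastive views of a voice clip.
# The specific transforms and parameter ranges are assumptions, not the paper's recipe.
import numpy as np
import librosa

def augment(wav, sr, rng=np.random.default_rng()):
    wav = librosa.effects.pitch_shift(wav, sr=sr, n_steps=rng.uniform(-2, 2))  # semitones
    wav = librosa.effects.time_stretch(wav, rate=rng.uniform(0.9, 1.1))
    wav = wav + 0.005 * rng.standard_normal(len(wav))  # light additive noise
    return wav.astype(np.float32)

y, sr = librosa.load("clip.wav", sr=None)          # placeholder path
view_a, view_b = augment(y, sr), augment(y, sr)    # two augmented views of the same clip
```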
Ting-Chao Hsu and Yi-Hsuan Yang, "Conditional vocal timbral technique conversion via embedding-guided dual attribute modulation," in AAAI Workshop on Emerging AI Technologies for Music (EAIM), 2026.