CtrlSpeech: Coarse-to-Fine Control for Expressive Speech Synthesis

phone-aligned controls, expressive speech
¹The University of Texas at Austin, ²Amazon

Abstract. Recent text-to-speech (TTS) systems achieve strong naturalness and zero-shot voice cloning performance, but fine-grained control of expressive speech at the word or phoneme level remains challenging. We propose CtrlSpeech, a controllable, expressive TTS framework with coarse-to-fine control. Built on the DiTAR architecture, CtrlSpeech combines global speaker conditioning with phone-aligned pitch, loudness, and duration signals, enabling localized prosodic control while preserving the target speaker's timbre. This design lets users adjust expressive attributes at fine temporal granularity, which makes iterative speech refinement more flexible. Experimental results show that CtrlSpeech achieves competitive zero-shot TTS performance and improves controllability of expressive attributes, demonstrating its effectiveness for practical expressive speech synthesis.

System Overview

[Figure: CtrlSpeech system overview, showing a coarse control round followed by fine pitch, loudness, and duration adjustments.]

Figure 1. CtrlSpeech follows a coarse-to-fine workflow: users first provide input text, and optionally specify speaker settings or provide prompt audio; in subsequent rounds, they iteratively refine pitch, loudness, and word duration.

Coarse-to-Fine Control Design

Coarse Global Control → Phone-Level Control → Local Refinement

Stage 1: Coarse

Global Speaker / Timbre Conditioning

Reference speech or speaker embeddings provide utterance-level conditioning, preserving the target speaker identity and vocal timbre.
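
The idea of utterance-level conditioning can be illustrated with a minimal sketch. It assumes, purely for illustration, that the prompt is already encoded as frame-level feature vectors, that the speaker vector is obtained by mean pooling and L2 normalization, and that it is appended to every phone-level feature vector. The page does not state how CtrlSpeech computes or injects the speaker condition, so the function names and toy numbers below are hypothetical.

```python
import math

def utterance_embedding(frames):
    """Mean-pool frame-level feature vectors into one L2-normalized vector."""
    if not frames:
        raise ValueError("prompt must contain at least one frame")
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid division by zero
    return [x / norm for x in mean]

def condition_phones(phone_features, speaker_vec):
    """Append the same speaker vector to every phone-level feature vector."""
    return [list(p) + list(speaker_vec) for p in phone_features]

# Toy prompt: three frames of 2-dimensional features (made-up numbers).
prompt_frames = [[3.0, 0.0], [3.0, 8.0], [3.0, 4.0]]
speaker_vec = utterance_embedding(prompt_frames)

# Four phone positions, each with a 1-dimensional placeholder feature.
conditioned = condition_phones([[0.1], [0.2], [0.3], [0.4]], speaker_vec)
```

Because the same vector is attached at every position, the speaker condition is global: it carries no information about where in the utterance a prosodic change should occur.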

prompt speech, speaker embedding, timbre

Stage 2: Fine

Phone-Aligned Prosody Control

[Diagram: the phones /h/ /e/ /l/ /ow/, each with aligned pitch, loudness, and duration controls.]

Pitch, loudness, and duration controls are aligned to phones, enabling explicit manipulation of intonation, intensity, and rhythm at fine temporal granularity.
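
One way to picture phone-aligned controls is as one record per phone, expanded to frame-level tracks by repeating each phone's values for its duration. The sketch below assumes that representation; the `PhoneControl` fields, the units (Hz, dB, frames), and the numeric values are illustrative choices and are not taken from CtrlSpeech.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhoneControl:
    phone: str            # phone label, e.g. "ow"
    pitch_hz: float       # target pitch for this phone (assumed unit)
    loudness_db: float    # target loudness for this phone (assumed unit)
    duration_frames: int  # number of frames the phone occupies

def expand_to_frames(controls):
    """Repeat each phone's pitch and loudness for its duration in frames."""
    pitch, loudness, owners = [], [], []
    for c in controls:
        if c.duration_frames < 0:
            raise ValueError(f"negative duration for phone {c.phone!r}")
        pitch.extend([c.pitch_hz] * c.duration_frames)
        loudness.extend([c.loudness_db] * c.duration_frames)
        owners.extend([c.phone] * c.duration_frames)
    return pitch, loudness, owners

# The phones from the diagram above, with made-up control values.
hello = [
    PhoneControl("h", 0.0, -30.0, 3),    # unvoiced, so pitch is set to 0 here
    PhoneControl("e", 180.0, -18.0, 5),
    PhoneControl("l", 170.0, -20.0, 4),
    PhoneControl("ow", 150.0, -19.0, 8),
]
pitch_track, loudness_track, frame_phones = expand_to_frames(hello)
```

In this representation, changing a phone's duration changes how many frames it occupies, while pitch and loudness edits change the values those frames carry.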

pitch, loudness, duration, phone alignment

Stage 3: Refine

Iterative Local Prosody Refinement

[Diagram: a word and its phones (ph), with an edit applied to a selected phone.]

Users can adjust selected phone- or word-level controls and regenerate refined speech while keeping the global speaker condition fixed.
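
A refinement round of this kind can be sketched as an edit applied only to the phones of one selected word, with the speaker condition passed through unchanged. The example assumes a hypothetical per-phone control record and a request dictionary; the model call itself is omitted, and none of these names come from CtrlSpeech.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PhoneControl:
    phone: str
    word_index: int       # which word the phone belongs to
    pitch_hz: float
    loudness_db: float
    duration_frames: int

def refine_word(controls, word_index, pitch_scale=1.0,
                loudness_gain_db=0.0, duration_scale=1.0):
    """Return new controls with edits applied only to the selected word."""
    refined = []
    for c in controls:
        if c.word_index == word_index:
            c = replace(
                c,
                pitch_hz=c.pitch_hz * pitch_scale,
                loudness_db=c.loudness_db + loudness_gain_db,
                duration_frames=max(1, round(c.duration_frames * duration_scale)),
            )
        refined.append(c)
    return refined

# "leave now" with made-up control values; word 0 = "leave", word 1 = "now".
controls = [
    PhoneControl("l", 0, 170.0, -20.0, 4),
    PhoneControl("iy", 0, 190.0, -18.0, 7),
    PhoneControl("v", 0, 160.0, -22.0, 4),
    PhoneControl("n", 1, 165.0, -20.0, 4),
    PhoneControl("aw", 1, 175.0, -17.0, 8),
]
speaker_vec = (0.6, 0.8)  # fixed global speaker condition (placeholder values)

round1 = {"speaker": speaker_vec, "controls": controls}

# Round 2: emphasize "now" by making it louder and longer; pitch is untouched.
refined = refine_word(controls, word_index=1,
                      loudness_gain_db=6.0, duration_scale=1.5)
round2 = {"speaker": round1["speaker"], "controls": refined}
```

Since the records are immutable and `refine_word` returns a new list, earlier rounds remain available for comparison or rollback.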

localized edits, iterative refinement, timbre preserving

Expressive Control Samples

Case | Target Text | Reference | Neutral | Controlled
Pitch | The room grew quiet, and then everyone started laughing. | reference speaker | baseline synthesis | raised phrase-level pitch
Loudness | I said we should leave now, not after the storm arrives. | reference speaker | baseline synthesis | emphasized word loudness
Duration | Please wait here for a moment before opening the door. | reference speaker | baseline synthesis | lengthened local duration

Table 1. Representative sample layout for pitch, loudness, and duration control experiments.

BibTeX

@inproceedings{zheng2026ctrlspeech,
  title     = {CtrlSpeech: Coarse-to-Fine Control for Expressive Speech Synthesis},
  author    = {Zheng, Zhisheng and Sun, Xiaohang and Liu, Zhu and Chen, Caren and Kumar, Rohith and Aggarwal, Manoj and Medioni, Gerard and Harwath, David},
  booktitle = {Proc. Interspeech 2026},
  year      = {2026}
}