I’m a second-year Ph.D. student in Computer Science at the University of Texas at Austin, with a primary focus on Multimodal Large Language Models, Speech and Audio Understanding, and Text-to-Speech.
Previously, I served as a research intern at Microsoft Research Asia, under the mentorship of Lei He and Xu Tan, concentrating on Multilingual Text-to-Speech.
During the summer of 2023, I had the opportunity to work as a research intern at the SALT Lab at UT-Austin, collaborating with Prof. David Harwath and Prof. Eunsol Choi. Additionally, I’ve been a research intern at the X-Lance Lab at SJTU since 2021, supervised by Prof. Xie Chen.
Download my resumé.
Ph.D. in Computer Science, 2024 - 2028 (expected)
The University of Texas at Austin
BSc in Electrical Engineering & Zhiyuan Honors Program of Engineering, 2020 - 2024
Shanghai Jiao Tong University
Recent text-to-speech (TTS) systems have achieved strong naturalness and zero-shot voice cloning performance, but fine-grained control of expressive speech at the word or phoneme level remains challenging. We propose CtrlSpeech, a controllable, expressive TTS framework with coarse-to-fine control. Built on the DiTAR architecture, CtrlSpeech combines global speaker conditioning with phone-aligned pitch, loudness, and duration signals, enabling localized prosodic control while preserving the target speaker’s timbre. This design allows users to adjust expressive attributes at a fine temporal granularity, making speech refinement more flexible and controllable. Experimental results show that CtrlSpeech achieves competitive zero-shot TTS performance and improves controllability over expressive attributes, demonstrating its effectiveness for flexible and practical expressive speech synthesis.
Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret our surroundings based on sound. In this paper, we present BAT, which combines the spatial sound perception ability of a binaural acoustic scene analysis model with the natural language reasoning capabilities of a large language model (LLM) to replicate this innate ability. To address the lack of existing datasets of in-the-wild spatial sounds, we synthesized a binaural audio dataset using AudioSet and SoundSpaces 2.0. Next, we developed SpatialSoundQA, a spatial sound-based question-answering dataset, offering a range of QA tasks that train BAT in various aspects of spatial sound perception and reasoning. The acoustic front-end encoder of BAT is a novel spatial audio encoder named Spatial Audio Spectrogram Transformer, or Spatial-AST, which by itself achieves strong performance across sound event detection, spatial localization, and distance estimation. By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment. Our experiments demonstrate BAT’s superior performance on both spatial sound perception and reasoning, showcasing the immense potential of LLMs in navigating and interpreting complex spatial audio environments.