Zhisheng Zheng

Ph.D. Student in Computer Science

The University of Texas at Austin

About

I’m a second-year Ph.D. student in Computer Science at the University of Texas at Austin, with a primary focus on Multimodal Large Language Models, Speech and Audio Understanding, and Text-to-Speech.

Previously, I served as a research intern at Microsoft Research Asia, under the mentorship of Lei He and Xu Tan, concentrating on Multilingual Text-to-Speech.

During the summer of 2023, I had the opportunity to work as a research intern at the SALT Lab at UT-Austin, collaborating with Prof. David Harwath and Prof. Eunsol Choi. Additionally, I’ve been a research intern at the X-Lance Lab at SJTU since 2021, supervised by Prof. Xie Chen.

Download my resumé.

Interests

Multimodal Large Language Model
Self-Supervised Learning
Speech and Audio Understanding

Education

Ph.D. in Computer Science, 2024 - 2028 (expected)

The University of Texas at Austin

BSc in Electrical Engineering & Zhiyuan Honors Program of Engineering, 2020 - 2024

Shanghai Jiao Tong University

News

2025.08 2 papers were accepted by EMNLP 2025.
2024.05 BAT was accepted by ICML 2024.
2024.04 EAT: Self-Supervised Pre-Training with Efficient Audio Transformer was accepted by IJCAI 2024.
2023.12 We release emotion2vec, the first universal speech emotion model that excels across diverse emotional tasks and languages.
2023.12 1 paper was accepted by ICASSP 2024. See details.
2023.09 🚀 We release Fast-HuBERT, which speeds up HuBERT pre-training by 5.2x without a performance drop.
2023.09 2 papers were accepted by IEEE ASRU 2023. See Fast-HuBERT and paper b.
2023.08 Our work MT4SSL was shortlisted for the ISCA Interspeech Best Student Paper award.
2023.07 I've started working as a visiting scholar at UT-Austin! 🤘
2023.05 3 papers were accepted by ISCA INTERSPEECH 2023. See paper a, paper b, and paper c.
2023.02 1 paper was accepted by ICASSP 2023. See details.

Featured Publications

Zhisheng Zheng, Xiaohang Sun, Zhu Liu, Caren Chen, Rohith Kumar, Manoj Aggarwal, Gerard Medioni, David Harwath

Interspeech 2026

CtrlSpeech: Coarse-to-Fine Control for Expressive Speech Synthesis

Recent Text-To-Speech (TTS) systems have achieved strong naturalness and zero-shot voice cloning performance, but fine-grained control of expressive speech at the word or phoneme level remains challenging. We propose CtrlSpeech, a controllable, expressive TTS framework with coarse-to-fine control. Built on the DiTAR architecture, CtrlSpeech combines global speaker conditioning with phone-aligned pitch, loudness, and duration signals, enabling localized prosodic control while preserving the target speaker’s timbre. This design allows users to adjust expressive attributes at a fine temporal granularity, making speech refinement more flexible and controllable. Experimental results show that CtrlSpeech achieves competitive zero-shot TTS performance and improves controllability over expressive attributes, demonstrating its effectiveness for flexible and practical expressive speech synthesis.

Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath

ICML 2024

BAT: Learning to Reason about Spatial Sounds with Large Language Models

Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret our surroundings based on sound. In this paper we present BAT, which combines the spatial sound perception ability of a binaural acoustic scene analysis model with the natural language reasoning capabilities of a large language model (LLM) to replicate this innate ability. To address the lack of existing datasets of in-the-wild spatial sounds, we synthesized a binaural audio dataset using AudioSet and SoundSpaces 2.0. Next, we developed SpatialSoundQA, a spatial sound-based question-answering dataset, offering a range of QA tasks that train BAT in various aspects of spatial sound perception and reasoning. The acoustic front-end encoder of BAT is a novel spatial audio encoder named Spatial Audio Spectrogram Transformer, or Spatial-AST, which by itself achieves strong performance across sound event detection, spatial localization, and distance estimation. By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment. Our experiments demonstrate BAT’s superior performance on both spatial sound perception and reasoning, showcasing the immense potential of LLMs in navigating and interpreting complex spatial audio environments.

Experience

Research Intern

NVIDIA

Feb 2026 – Present Santa Clara, CA, USA

Applied Scientist Intern

Amazon

May 2025 – Feb 2026 Seattle, WA, USA

Research Intern

Microsoft Research Asia

Apr 2024 – Aug 2024 Beijing, China

Research Intern

SALT Lab at UT-Austin CS NLP

May 2023 – Jan 2024 Austin, TX, USA

Research Intern

X-Lance at Shanghai Jiao Tong University

Dec 2021 – Jul 2024 Shanghai, China

Awards & Honors

Shanghai Outstanding Graduates, 2024
SenseTime Scholarship for Undergraduate AI Researchers (30 winners nationwide each year), SenseTime, 2023
Rongchang Science and Technology Innovation Scholarship (<0.1%), Shanghai Rongchang Public Welfare Foundation, 2023
Best Student Paper Shortlist, INTERSPEECH, 2023
Zhiyuan College Honors Scholarship (TOP 10%), 2020-2024
Tencent Scholarship (TOP 2%), Tencent Technology (Shenzhen) Co., Ltd., 2021