SpeechParaling-Bench: Evaluating Paralinguistic-Aware Speech Generation

Introduction

Paralinguistic cues are pivotal to natural human-computer interaction, and state-of-the-art Large Audio-Language Models (LALMs) have demonstrated preliminary advances in speech generation that incorporate paralinguistic features. Yet, evaluation of LALMs in this area remains limited due to the lack of a specialized, comprehensive benchmark. Existing benchmarks address this aspect but mainly focus on basic dimensions like emotion, gender, and age.

To bridge this gap, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. Structured around application-oriented tasks, it comprises over 1,000 English-Chinese parallel speech queries covering more than 100 paralinguistic features. Unlike previous approaches that rely on costly human evaluation or specialized models (e.g., speech or emotion recognition models), which may suffer from information loss, we employ an LALM-based judge.

Extensive evaluations on state-of-the-art LALMs reveal that even leading proprietary models struggle with multilingual paralinguistic mastery, and that the failure to grasp paralinguistic cues is a critical bottleneck, accounting for 43.3% of errors in situational dialogues. These findings highlight challenges in developing voice assistants that align with human speech.

Leaderboard (Chinese)

Normalized scores (0-100) on the Chinese subset SpeechParaling-Bench.

#	Model	Date	Overall	Paralanguage Control	Dynamic Variation	Situational Adaptation
1	*Doubao Realtime 🥇**	2025-12-20	70.84	71.86	54.09	58.21
2	GPT Audio 🥈	2025-12-20	39.09	35.57	63.33	40.18
3	Gemini Audio 🥉	2025-12-20	28.18	29.64	29.17	23.04
4	Qwen3-Omni-Flash	2026-01-01	22.58	14.16	35.00	44.64
5	Qwen3-Omni-Realtime	2026-01-12	14.34	4.72	5.83	49.29

Leaderboard (English)

Normalized scores (0-100) on the English subset of SpeechParaling-Bench.

#	Model	Date	Overall	Paralanguage Control	Dynamic Variation	Situational Adaptation
1	*Gemini Audio 🥇**	2025-12-20	64.97	66.49	61.08	52.37
2	GPT Audio 🥈	2025-12-20	49.39	46.38	52.92	57.68
3	Doubao Realtime 🥉	2025-12-20	31.39	28.05	22.50	46.07
4	Qwen3-Omni-Realtime	2026-01-01	15.52	6.75	5.00	48.57
5	Qwen3-Omni-Flash	2026-01-12	13.73	12.51	7.92	20.18

Baseline models*: Doubao Realtime and Gemini Audio serve as baseline models for Chinese and English subset, respectively. We adopt a pairwise comparison mechanism for scoring, where each candidate model is compared against the baseline. Gemini-3 Pro is utilized as an LALM-based judge.

🚨 To submit your results to the leaderboard, please send to this email with your result files.

🚨 For more evaluation details, please refer to this link.

Overview

SpeechParaling-Bench is a comprehensive paralinguistic-aware speech generation benchmark within real-world interaction contexts. It consists of three specific tasks: Paralanguage Control, Dynamic Variation, and Situational Adaptation, which is tailored to evaluate speech generation with paralanguage, dynamic control on paralinguistic features, and adaptive use of appropriate paralanguage grounded in specific contexts, respectively. It also incorporates 13 dimensions and more than 100 features, comprehensively covering various paralinguistic nuances across different real-life scenarios. In total, SpeechParaling-Bench includes more than 1,000 examples newly curated with our data pipeline.

All the data samples are divided into two subsets: Chinese and English. You can download the dataset on Hugging Face Dataset.

Key statistics and data composition of SpeechParaling-Bench.
For a full list of paralinguistic dimensions, definitions, and feature ranges, please refer to our paper.

Examples

We show one representative example for each task types in SpeechParaling-Bench

Control Chinese

Prompt

请用活泼搞怪的声音说: 嘿，瞧我新学的这个小魔术！等下变没了你的零食可别哭鼻子哦！

Style: Lively and Mischievous

Multi-dimensional Control

System Prompt Directly speak the user's requested sentence in the specified tone without any prefix.

Model Output

GPT Audio

Doubao Realtime

Gemini Audio

Qwen3-Omni-Flash

Qwen3-Omni-Realtime

Control English

Prompt

Please read this sentence with a lively and mischievous voice: 'Hey, check out this new magic trick I learned! Don't cry if your snacks disappear later, okay?'

Style: Lively and Mischievous

Multi-dimensional Control

System Prompt Directly speak the user's requested sentence in the specified tone without any prefix.

Model Output

GPT Audio

Doubao Realtime

Gemini Audio

Qwen3-Omni-Flash

Qwen3-Omni-Realtime

Variation Chinese

Prompt

请先用喊叫开始，然后逐渐减弱到耳语说: 我不要吃这个！我讨厌青菜！可是妈妈说吃了才能长高高。

Feature: Volume Transition: Shouting → Whisper

Dynamic Volume Change

System Prompt Directly speak the user's requested sentence in the specified tone without any prefix.

Model Output

GPT Audio

Doubao Realtime

Gemini Audio

Qwen3-Omni-Flash

Qwen3-Omni-Realtime

Variation English

Prompt

Please read this sentence starting with a shouting manner and gradually decreasing the volume to a whisper: 'I don't want to eat this! I hate vegetables! But Mommy said I need to eat them to grow tall.'

Feature: Volume Transition: Shouting → Whisper

Dynamic Volume Change

System Prompt Directly speak the user's requested sentence in the specified tone without any prefix.

Model Output

GPT Audio

Doubao Realtime

Gemini Audio

Qwen3-Omni-Flash

Qwen3-Omni-Realtime

Adaptation Chinese

Prompt

我不敢一个人睡，天太黑了，我害怕有坏人。

Emotion: Fear Age: Child

Empathetic Generation

System Prompt You're a great conversationalist, please chat with users in an appropriate tone and keep responses concise and to the point.

Model Output

GPT Audio

Doubao Realtime

Gemini Audio

Qwen3-Omni-Flash

Qwen3-Omni-Realtime

Adaptation English

Prompt

I don't dare to sleep alone, it's too dark, I'm afraid of bad guys.

Emotion: Fear Age: Child

Empathetic Generation

System Prompt You're a great conversationalist, please chat with users in an appropriate tone and keep responses concise and to the point.

Model Output

GPT Audio

Doubao Realtime

Gemini Audio

Qwen3-Omni-Flash

Qwen3-Omni-Realtime

Results and Findings

Overall performance comparison of state-of-the-art LALMs.
(a) Currently, Doubao and Gemini are the leading models in Chinese and English, respectively;
(b) Multilingual proficiency remains a significant challenge for existing audio assistants.

Dimension-wise performance on Paralanguage Control task (Chinese).
(a) A well-rounded top-performer is still lacking.
For instance, Doubao excels in expressive features (e.g., emotion and attitude) and acoustic features (e.g., timbre and volume),
yet lags in prosodic attributes such as pauses and stress;
(b) Current models generally underperform in abstract, expressive nuances, such as NLV and style.

Manual error pattern analysis of Gemini Audio.
In Chinese subset, Situational Adaptation task, Gemini Audio fails on 67 out of 190 samples, of which a substantial portion (43.3%) stems from
overlooking paralinguistic information embedded in the user's speech.
These results highlight the current gap in capturing paralinguistic nuances beyond mere linguistic content.

BibTeX

@article{speechparaling-bench,
  title     = {SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation},
  author    = {Liu, Ruohan and Yin, Shukang and Wang, Tao and Zhang, Dong and Zhuang, Weiji and Ren, Shuhuai and He, Ran and Shan, Caifeng and Fu, Chaoyou},
  journal   = {arXiv preprint arXiv:2604.20842},
  year      = {2026}
}

SpeechParaling-Bench

Evaluating Speech-Aware Speech Generation in
Real-World Scenarios

Introduction

Leaderboard (Chinese)

Leaderboard (English)

SpeechParaling-Bench Dataset

Overview

Examples

Experiment Results

Results and Findings

BibTeX

SpeechParaling-Bench

Evaluating Speech-Aware Speech Generation in Real-World Scenarios

Introduction

Leaderboard (Chinese)

Leaderboard (English)

SpeechParaling-Bench Dataset

Overview

Examples

Experiment Results

Results and Findings

BibTeX

Evaluating Speech-Aware Speech Generation in
Real-World Scenarios