Logo SpeechParaling-Bench

Evaluating Speech-Aware Speech Generation in
Real-World Scenarios

*Equal Contribution Corresponding Author
1Nanjing University, 2Xiaomi

Introduction

Paralinguistic cues are pivotal to natural human-computer interaction, and state-of-the-art Large Audio-Language Models (LALMs) have demonstrated preliminary advances in speech generation that incorporate paralinguistic features. Yet, evaluation of LALMs in this area remains limited due to the lack of a specialized, comprehensive benchmark. Existing benchmarks address this aspect but mainly focus on basic dimensions like emotion, gender, and age.

To bridge this gap, we introduce Logo SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. Structured around application-oriented tasks, it comprises over 1,000 English-Chinese parallel speech queries covering more than 100 paralinguistic features. Unlike previous approaches that rely on costly human evaluation or specialized models (e.g., speech or emotion recognition models), which may suffer from information loss, we employ an LALM-based judge.

Extensive evaluations on state-of-the-art LALMs reveal that even leading proprietary models struggle with multilingual paralinguistic mastery, and that the failure to grasp paralinguistic cues is a critical bottleneck, accounting for 43.3% of errors in situational dialogues. These findings highlight challenges in developing voice assistants that align with human speech.

Leaderboard (Chinese)

Normalized scores (0-100) on the Chinese subset Logo SpeechParaling-Bench.

# Model Date Overall Paralanguage Control Dynamic Variation Situational Adaptation
1 Doubao Realtime* 🥇 2025-12-20 70.84 71.86 54.09 58.21
2 GPT Audio 🥈 2025-12-20 39.09 35.57 63.33 40.18
3 Gemini Audio 🥉 2025-12-20 28.18 29.64 29.17 23.04
4 Qwen3-Omni-Flash 2026-01-01 22.58 14.16 35.00 44.64
5 Qwen3-Omni-Realtime 2026-01-12 14.34 4.72 5.83 49.29

Leaderboard (English)

Normalized scores (0-100) on the English subset of Logo SpeechParaling-Bench.

# Model Date Overall Paralanguage Control Dynamic Variation Situational Adaptation
1 Gemini Audio* 🥇 2025-12-20 64.97 66.49 61.08 52.37
2 GPT Audio 🥈 2025-12-20 49.39 46.38 52.92 57.68
3 Doubao Realtime 🥉 2025-12-20 31.39 28.05 22.50 46.07
4 Qwen3-Omni-Realtime 2026-01-01 15.52 6.75 5.00 48.57
5 Qwen3-Omni-Flash 2026-01-12 13.73 12.51 7.92 20.18
Baseline models*: Doubao Realtime and Gemini Audio serve as baseline models for Chinese and English subset, respectively. We adopt a pairwise comparison mechanism for scoring, where each candidate model is compared against the baseline. Gemini-3 Pro is utilized as an LALM-based judge.

🚨 To submit your results to the leaderboard, please send to this email with your result files.

🚨 For more evaluation details, please refer to this link.

Logo SpeechParaling-Bench Dataset

Overview

Logo SpeechParaling-Bench is a comprehensive paralinguistic-aware speech generation benchmark within real-world interaction contexts. It consists of three specific tasks: Paralanguage Control, Dynamic Variation, and Situational Adaptation, which is tailored to evaluate speech generation with paralanguage, dynamic control on paralinguistic features, and adaptive use of appropriate paralanguage grounded in specific contexts, respectively. It also incorporates 13 dimensions and more than 100 features, comprehensively covering various paralinguistic nuances across different real-life scenarios. In total, Logo SpeechParaling-Bench includes more than 1,000 examples newly curated with our data pipeline.

All the data samples are divided into two subsets: Chinese and English. You can download the dataset on Hugging Face Dataset.

data-overview

Key statistics and data composition of Logo SpeechParaling-Bench.
For a full list of paralinguistic dimensions, definitions, and feature ranges, please refer to our paper.

Examples

We show one representative example for each task types in Logo SpeechParaling-Bench

Control Chinese
Prompt

请用活泼搞怪的声音说: 嘿,瞧我新学的这个小魔术!等下变没了你的零食可别哭鼻子哦!

Style: Lively and Mischievous
Multi-dimensional Control
System Prompt Directly speak the user's requested sentence in the specified tone without any prefix.
Model Output
GPT Audio
Doubao Realtime
Gemini Audio
Qwen3-Omni-Flash
Qwen3-Omni-Realtime
Control English
Prompt

Please read this sentence with a lively and mischievous voice: 'Hey, check out this new magic trick I learned! Don't cry if your snacks disappear later, okay?'

Style: Lively and Mischievous
Multi-dimensional Control
System Prompt Directly speak the user's requested sentence in the specified tone without any prefix.
Model Output
GPT Audio
Doubao Realtime
Gemini Audio
Qwen3-Omni-Flash
Qwen3-Omni-Realtime
Variation Chinese
Prompt

请先用喊叫开始,然后逐渐减弱到耳语说: 我不要吃这个!我讨厌青菜!可是妈妈说吃了才能长高高。

Feature: Volume Transition: Shouting → Whisper
Dynamic Volume Change
System Prompt Directly speak the user's requested sentence in the specified tone without any prefix.
Model Output
GPT Audio
Doubao Realtime
Gemini Audio
Qwen3-Omni-Flash
Qwen3-Omni-Realtime
Variation English
Prompt

Please read this sentence starting with a shouting manner and gradually decreasing the volume to a whisper: 'I don't want to eat this! I hate vegetables! But Mommy said I need to eat them to grow tall.'

Feature: Volume Transition: Shouting → Whisper
Dynamic Volume Change
System Prompt Directly speak the user's requested sentence in the specified tone without any prefix.
Model Output
GPT Audio
Doubao Realtime
Gemini Audio
Qwen3-Omni-Flash
Qwen3-Omni-Realtime
Adaptation Chinese
Prompt

我不敢一个人睡,天太黑了,我害怕有坏人。

Emotion: Fear Age: Child
Empathetic Generation
System Prompt You're a great conversationalist, please chat with users in an appropriate tone and keep responses concise and to the point.
Model Output
GPT Audio
Doubao Realtime
Gemini Audio
Qwen3-Omni-Flash
Qwen3-Omni-Realtime
Adaptation English
Prompt

I don't dare to sleep alone, it's too dark, I'm afraid of bad guys.

Emotion: Fear Age: Child
Empathetic Generation
System Prompt You're a great conversationalist, please chat with users in an appropriate tone and keep responses concise and to the point.
Model Output
GPT Audio
Doubao Realtime
Gemini Audio
Qwen3-Omni-Flash
Qwen3-Omni-Realtime

Experiment Results

Results and Findings

BibTeX

@article{speechparaling-bench,
  author    = {},
  title     = {SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation},
  journal   = {arXiv preprint arXiv:},
  year      = {2026}
}