Recent research has explored speech enhancement (SE) approaches that leverage audio embeddings from pre-trained models, diverging from time-frequency masking or signal prediction techniques. This paper introduces an efficient and extensible SE method. Our approach first extracts audio embeddings from noisy speech using a pre-trained audioencoder; these embeddings are then denoised by a compact encoder network. Subsequently, a vocoder synthesizes the clean speech from the denoised embeddings. An ablation study confirms the parameter efficiency of the denoising encoder when paired with a pre-trained audioencoder and vocoder. Experimental results on both speech enhancement and speaker fidelity demonstrate that our generative-audioencoder-based SE system outperforms models utilizing discriminative audioencoders. Furthermore, subjective listening tests validate that our proposed system surpasses an existing state-of-the-art SE model in terms of perceptual quality.
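The three-stage pipeline described above (frozen audioencoder → compact trainable denoiser → frozen vocoder) can be sketched as follows. This is an illustrative toy sketch only: the frame size, the mean-removal "denoiser," and the concatenating "vocoder" are placeholder stand-ins, not the actual models (Whisper, WavLM, Dasheng, or the paper's trained denoising encoder).

```python
from typing import List

FRAME = 160  # toy frame size; a real audioencoder defines its own hop length

def audioencoder(waveform: List[float]) -> List[List[float]]:
    """Stand-in for a frozen pre-trained audioencoder: splits the noisy
    waveform into frames and treats each frame as an 'embedding'."""
    return [waveform[i:i + FRAME]
            for i in range(0, len(waveform) - FRAME + 1, FRAME)]

def denoiser(embeddings: List[List[float]]) -> List[List[float]]:
    """Stand-in for the compact denoising encoder (the only trained
    component in the paper): here just per-frame mean removal as a
    placeholder transformation."""
    out = []
    for frame in embeddings:
        mean = sum(frame) / len(frame)
        out.append([x - mean for x in frame])
    return out

def vocoder(embeddings: List[List[float]]) -> List[float]:
    """Stand-in for a frozen pre-trained vocoder: concatenates the
    (denoised) frame embeddings back into a waveform."""
    return [x for frame in embeddings for x in frame]

def enhance(noisy: List[float]) -> List[float]:
    """The pipeline: encode -> denoise -> vocode."""
    return vocoder(denoiser(audioencoder(noisy)))

noisy = [0.5] * 1600  # 0.1 s of constant 'noise' at 16 kHz, for illustration
enhanced = enhance(noisy)
print(len(enhanced))  # 1600
```

The key design point the sketch reflects is that only `denoiser` would be trained; the audioencoder and vocoder are reused as-is, which is what makes the method parameter-efficient.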
The samples listed below are processed by different models, referred to as follows:
Noisy: the original noisy speech.
Clean: the corresponding clean speech.
DEMUCS: speech denoised by the open-source DEMUCS model.
LMS: speech denoised by the proposed method using the log-Mel spectrogram as a hand-crafted embedding.
Whisper: speech denoised by the proposed method using Whisper as the pre-trained audioencoder.
WavLM: speech denoised by the proposed method using WavLM as the pre-trained audioencoder.
Dasheng: speech denoised by the proposed method using Dasheng as the pre-trained audioencoder.
|         | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| ------- | -------- | -------- | -------- | -------- |
| Noisy   | ![]()    | ![]()    | ![]()    | ![]()    |
| Clean   | ![]()    | ![]()    | ![]()    | ![]()    |
| DEMUCS  | ![]()    | ![]()    | ![]()    | ![]()    |
| LMS     | ![]()    | ![]()    | ![]()    | ![]()    |
| Whisper | ![]()    | ![]()    | ![]()    | ![]()    |
| WavLM   | ![]()    | ![]()    | ![]()    | ![]()    |
| Dasheng | ![]()    | ![]()    | ![]()    | ![]()    |
|         | Sample 1 | Sample 2 |
| ------- | -------- | -------- |
| Noisy   | ![]()    | ![]()    |
| DEMUCS  | ![]()    | ![]()    |
| Whisper | ![]()    | ![]()    |
| WavLM   | ![]()    | ![]()    |
| Dasheng | ![]()    | ![]()    |
@inproceedings{xingwei2025dashengdenoise,
  title={Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders},
  author={Xingwei Sun and Heinrich Dinkel and Yadong Niu and Linzhang Wang and Junbo Zhang and Jian Luan},
  booktitle={Interspeech 2025},
  year={2025}
}