Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement

Qi, Jun; Hu, Hu; Wang, Yannan; Yang, Chao-Han Huck; Siniscalchi, Sabato Marco; Lee, Chin-Hui

(Submitted on 25 Jul 2020 (v1), last revised 3 Aug 2020 (this version, v2))

Abstract: This paper investigates trade-offs between the number of model parameters and enhanced speech quality by employing several deep tensor-to-vector regression models for speech enhancement. We find that a hybrid architecture, namely CNN-TT, maintains good speech quality with a reduced number of model parameters. CNN-TT is composed of several convolutional layers at the bottom, which extract features to improve speech quality, and a tensor-train (TT) output layer on top, which reduces the number of model parameters. We first derive a new upper bound on the generalization power of convolutional neural network (CNN) based vector-to-vector regression models. Then, we provide experimental evidence on the Edinburgh noisy speech corpus to demonstrate that, in single-channel speech enhancement, the CNN outperforms the deep neural network (DNN) at the expense of a small increase in model size. Moreover, CNN-TT slightly outperforms its CNN counterpart while using only 32% of the CNN model parameters, and further performance improvement can be attained if the number of CNN-TT parameters is increased to 44% of the CNN model size. Finally, our multi-channel speech enhancement experiments on a simulated noisy WSJ0 corpus demonstrate that the proposed hybrid CNN-TT architecture achieves better results than both the DNN and CNN models, with higher enhanced speech quality and a smaller parameter count.
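To illustrate why a tensor-train output layer can cut the parameter count, the following minimal sketch compares the number of weights in an ordinary fully connected layer with the number in a TT-factorized layer of the same input and output size. The mode sizes, TT ranks, and function names are hypothetical choices made for illustration; they are not taken from the paper, and the resulting ratio is unrelated to the 32% and 44% figures reported in the abstract, which refer to whole-model sizes.

```python
from math import prod


def dense_param_count(in_modes, out_modes):
    """Weights in an ordinary fully connected layer (bias ignored)."""
    return prod(in_modes) * prod(out_modes)


def tt_param_count(in_modes, out_modes, ranks):
    """Weights in a tensor-train (TT) layer (bias ignored).

    Core k has shape (ranks[k], in_modes[k], out_modes[k], ranks[k + 1]),
    and the boundary ranks ranks[0] and ranks[-1] must both equal 1.
    """
    if len(in_modes) != len(out_modes) or len(ranks) != len(in_modes) + 1:
        raise ValueError("need one core per mode pair and len(modes) + 1 ranks")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(
        ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
        for k in range(len(in_modes))
    )


# Hypothetical sizes for illustration only, not the paper's configuration.
in_modes = (4, 8, 8, 8)    # input dimension 4 * 8 * 8 * 8 = 2048
out_modes = (4, 4, 4, 4)   # output dimension 4 * 4 * 4 * 4 = 256
ranks = (1, 4, 4, 4, 1)    # TT ranks, one more entry than there are cores

dense = dense_param_count(in_modes, out_modes)
tt = tt_param_count(in_modes, out_modes, ranks)
print(f"dense weights: {dense}")
print(f"TT weights:    {tt}")
print(f"ratio:         {tt / dense:.4%}")
```

Larger TT ranks make the factorized layer more expressive at the cost of more parameters, which is the kind of size-versus-quality trade-off the abstract describes.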

Comments:	Accepted to InterSpeech 2020
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD)
Cite as:	arXiv:2007.13024 [eess.AS]
	(or arXiv:2007.13024v2 [eess.AS] for this version)

Submission history

From: C.-H. Huck Yang
[v1] Sat, 25 Jul 2020 22:21:05 GMT (1236kb,D)
[v2] Mon, 3 Aug 2020 00:07:39 GMT (1236kb,D)
