Title: Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks

Abstract: Thanks to the latest deep learning algorithms, silent speech interfaces (SSI) are now able to synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, which makes switching quickly between users troublesome. Even for the same speaker, these models perform poorly cross-session, i.e., after dismounting and re-mounting the recording equipment. To aid quick speaker and session adaptation of ultrasound tongue imaging-based SSI models, we extend our deep networks with a spatial transformer network (STN) module, capable of performing an affine transformation on the input images. Although the STN part takes up only about 10% of the network, our experiments show that adapting just the STN module may reduce MSE by 88% on average, compared to retraining the whole network. The improvement is even larger (around 92%) when adapting the network to different recording sessions from the same speaker.
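
The affine transformation step mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (affine_grid, bilinear_sample, spatial_transform), the normalized [-1, 1] coordinate convention, and the zero padding outside the image are all assumptions made for illustration. In an actual STN the 2x3 matrix theta would be predicted by a small localization network; here it is simply passed in by hand.

```python
import math

def affine_grid(theta, height, width):
    """For every output pixel, compute the source coordinate it reads from.

    Coordinates are normalized to [-1, 1] (an assumed convention);
    theta is a 2x3 affine matrix given as nested lists.
    """
    grid = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1) if height > 1 else 0.0
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1) if width > 1 else 0.0
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid

def bilinear_sample(image, grid):
    """Sample the image at the grid coordinates with bilinear interpolation."""
    height, width = len(image), len(image[0])

    def pixel(r, c):
        # Assumed padding rule: positions outside the image count as 0.
        if 0 <= r < height and 0 <= c < width:
            return image[r][c]
        return 0.0

    out = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Convert normalized coordinates back to pixel positions.
            col = (xs + 1.0) * (width - 1) / 2.0
            r = (ys + 1.0) * (height - 1) / 2.0
            c0, r0 = math.floor(col), math.floor(r)
            dc, dr = col - c0, r - r0
            value = (pixel(r0, c0) * (1 - dr) * (1 - dc)
                     + pixel(r0, c0 + 1) * (1 - dr) * dc
                     + pixel(r0 + 1, c0) * dr * (1 - dc)
                     + pixel(r0 + 1, c0 + 1) * dr * dc)
            out_row.append(value)
        out.append(out_row)
    return out

def spatial_transform(image, theta):
    """Apply the affine transformation theta to a 2D image (list of rows)."""
    return bilinear_sample(image, affine_grid(theta, len(image), len(image[0])))

# Toy 3x3 "image"; an ultrasound frame would be a much larger 2D array.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
shift = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # reads one pixel to the right
unchanged = spatial_transform(image, identity)
shifted = spatial_transform(image, shift)
```

Because the whole transformation is controlled by the six entries of theta, adapting only the part of the network that produces theta touches far fewer parameters than retraining the full model, which is the adaptation idea the abstract describes.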
Comments: 5 pages, 3 figures, 3 tables
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Journal reference: Proceedings of Interspeech 2023
DOI: 10.21437/Interspeech.2023-1607
Cite as: arXiv:2305.19130 [cs.SD]
  (or arXiv:2305.19130v3 [cs.SD] for this version)

Submission history

From: Amin Honarmandi Shandiz
[v1] Tue, 30 May 2023 15:41:47 GMT (610kb)
[v2] Wed, 31 May 2023 07:51:32 GMT (610kb)
[v3] Tue, 17 Oct 2023 08:01:34 GMT (610kb)