LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Xue, Feng; Li, Yu; Liu, Deyin; Xie, Yincen; Wu, Lin; Hong, Richang

(Submitted on 4 Feb 2023)

Abstract: Lipreading refers to interpreting the speech of a speaker in a video and translating it into natural language. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both the training and inference sets. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, due to the limited number of speakers in the training set and the evident visual variations caused by the differing lip shapes and colors of different speakers. Therefore, depending merely on the visible changes of the lips tends to cause model overfitting. To address this problem, we propose to use multi-modal features spanning visuals and facial landmarks, which can describe lip motion irrespective of speaker identity. We then develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer. Specifically, LipFormer consists of a lip motion stream, a facial landmark stream, and a cross-modal fusion module. The embeddings from the two streams are produced by self-attention and are fed to a cross-attention module to align the visual and landmark features. Finally, the resulting fused features are decoded into text by a cascaded seq2seq model. Experiments demonstrate that our method effectively enhances model generalization to unseen speakers.
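
The two-stream fusion described in the abstract can be illustrated with a minimal sketch of scaled dot-product attention in plain Python. This is not the authors' implementation: the function names, the toy embeddings, the absence of learned projection matrices and multiple heads, and the choice of the visual stream as the query side are all assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows.

    Simplification (assumption): no learned Q/K/V projections, single head.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Hypothetical toy inputs: 3 video frames, 2-dimensional features per stream.
visual_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # lip motion stream
landmark_emb = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]  # facial landmark stream

# Self-attention inside each stream (queries, keys, values from the same stream).
visual_ctx = attention(visual_emb, visual_emb, visual_emb)
landmark_ctx = attention(landmark_emb, landmark_emb, landmark_emb)

# Cross-attention: visual frames query the landmark sequence (assumed direction).
fused = attention(visual_ctx, landmark_ctx, landmark_ctx)
```

Here `fused` holds one vector per frame, each a weighted mixture of the landmark features; in the framework the abstract describes, features of this kind would then be passed to the seq2seq decoder.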

Comments:	Under review
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:2302.02141 [cs.CV]
	(or arXiv:2302.02141v1 [cs.CV] for this version)

Submission history

From: Lin Wu
[v1] Sat, 4 Feb 2023 10:22:18 GMT (27653kb,D)
