Computer Science > Sound
Title: Towards Listening to 10 People Simultaneously: An Efficient Permutation Invariant Training of Audio Source Separation Using Sinkhorn's Algorithm
(Submitted on 22 Oct 2020 (v1), last revised 16 May 2021 (this version, v2))
Abstract: In neural network-based monaural speech separation, it has recently become common to train models with the permutation invariant training (PIT) loss. However, ordinary PIT requires trying all $N!$ permutations between $N$ ground truths and $N$ estimates. Since this factorial complexity explodes as $N$ increases, PIT-based training works only when the number of source signals is small, such as $N = 2$ or $3$. To overcome this limitation, this paper proposes SinkPIT, a novel variant of the PIT loss that is much more efficient than the ordinary PIT loss when $N$ is large. SinkPIT is based on Sinkhorn's matrix balancing algorithm, which efficiently finds a doubly stochastic matrix that approximates the best permutation in a differentiable manner. The author conducted an experiment training a neural network model to decompose a single-channel mixture into 10 sources using SinkPIT, and obtained promising results.
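The contrast the abstract draws between exhaustive PIT and Sinkhorn-based balancing can be illustrated with a minimal sketch. This is not the paper's implementation: the mean-squared-error cost, the temperature `beta`, the iteration count `n_iters`, and all function names are assumptions chosen for illustration, and plain Python lists stand in for an array library.

```python
import itertools
import math


def pairwise_cost(estimates, targets):
    """C[i][j] = mean squared error between estimate i and target j.

    MSE is an assumed stand-in; the abstract does not state the pairwise cost.
    """
    return [[sum((e - t) ** 2 for e, t in zip(est, tgt)) / len(est)
             for tgt in targets] for est in estimates]


def _logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(cost, beta=0.01, n_iters=100):
    """Balance exp(-cost / beta) into an approximately doubly stochastic matrix.

    Rows and columns are normalized alternately in the log domain.
    beta and n_iters are hypothetical settings, not values from the paper.
    """
    n = len(cost)
    log_p = [[-c / beta for c in row] for row in cost]
    for _ in range(n_iters):
        for i in range(n):  # make each row sum to 1
            lse = _logsumexp(log_p[i])
            log_p[i] = [v - lse for v in log_p[i]]
        for j in range(n):  # make each column sum to 1
            lse = _logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= lse
    return [[math.exp(v) for v in row] for row in log_p]


def soft_pit_loss(cost, beta=0.01, n_iters=100):
    """Cost weighted by the soft assignment: sum_ij P[i][j] * C[i][j]."""
    n = len(cost)
    p = sinkhorn(cost, beta, n_iters)
    return sum(p[i][j] * cost[i][j] for i in range(n) for j in range(n))


def brute_force_pit_loss(cost):
    """Ordinary PIT: minimum total cost over all N! permutations."""
    n = len(cost)
    return min(sum(cost[i][perm[i]] for i in range(n))
               for perm in itertools.permutations(range(n)))
```

In this sketch each Sinkhorn iteration touches the $N \times N$ cost matrix a constant number of times, whereas `brute_force_pit_loss` enumerates all $N!$ permutations, so only the former remains practical at $N = 10$. A small `beta` pushes the balanced matrix towards a hard permutation; a larger one gives a smoother assignment.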
Submission history
From: Hideyuki Tachibana
[v1] Thu, 22 Oct 2020 17:08:17 GMT (3192kb,D)
[v2] Sun, 16 May 2021 13:40:26 GMT (3528kb,D)