Simplifying and Understanding State Space Models with Diagonal Linear RNNs

Gupta, Ankit; Mehta, Harsh; Berant, Jonathan

(Submitted on 1 Dec 2022 (this version), latest version 14 Nov 2023 (v3))

Abstract: Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of architecture for modeling long range dependencies across various modalities. However, they invariably rely on discretization of a continuous state space, which complicates their presentation and understanding. In this work, we dispose of the discretization step, and propose a model based on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$). We empirically show that $\mathrm{DLR}$ is as performant as previously-proposed SSMs in the presence of strong supervision, despite being conceptually much simpler. Moreover, we characterize the expressivity of SSMs (including $\mathrm{DLR}$) and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks involving interactions over tens of thousands of tokens, ranging from simple operations, such as shifting an input sequence, to detecting co-dependent visual features over long spatial ranges in flattened images. We find that while SSMs report near-perfect performance on tasks that can be modeled via $\textit{few}$ convolutional kernels, they struggle on tasks requiring $\textit{many}$ such kernels and especially when the desired sequence manipulation is $\textit{context-dependent}$. For example, $\mathrm{DLR}$ learns to perfectly shift a $0.5M$-long input by an arbitrary number of positions but fails when the shift size depends on context. Despite these limitations, $\mathrm{DLR}$ reaches high performance on two higher-order reasoning tasks $\mathrm{ListOpsSubTrees}$ and $\mathrm{PathfinderSegmentation}\text{-}\mathrm{256}$ with input lengths $8K$ and $65K$ respectively, and gives encouraging performance on $\mathrm{PathfinderSegmentation}\text{-}\mathrm{512}$ with input length $262K$ for which attention is not a viable choice.
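The abstract's point that a diagonal linear RNN acts as a convolution, and that a fixed shift needs only one kernel, can be illustrated with a small sketch. It assumes the generic diagonal recurrence $x_k = \Lambda x_{k-1} + u_k$, $y_k = \mathrm{Re}(w \cdot x_k)$, which the abstract does not spell out and which may differ from the paper's exact parameterization. All function names, including the `shift_params` helper, are hypothetical and chosen only for illustration.

```python
import cmath


def dlr_recurrent(lam, w, u):
    """Step the diagonal recurrence x_k = lam * x_{k-1} + u_k, y_k = Re(w . x_k)."""
    state = [0j] * len(lam)  # assumed initial state x_0 = 0
    outputs = []
    for u_k in u:
        # Diagonal transition: every state coordinate evolves independently.
        state = [l * s + u_k for l, s in zip(lam, state)]
        outputs.append(sum(w_i * s for w_i, s in zip(w, state)).real)
    return outputs


def dlr_kernel(lam, w, length):
    """Unrolled convolution kernel K[j] = Re(sum_i w_i * lam_i**j)."""
    return [sum(w_i * l ** j for w_i, l in zip(w, lam)).real for j in range(length)]


def causal_conv(kernel, u):
    """Causal convolution y_k = sum_{j<=k} K[j] * u_{k-j}."""
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]


def shift_params(n, s):
    """Hypothetical helper: roots-of-unity parameters whose kernel is 1 at lag s
    and 0 at every other lag below n, so the model shifts its input by s."""
    lam = [cmath.exp(2j * cmath.pi * i / n) for i in range(n)]
    w = [cmath.exp(-2j * cmath.pi * i * s / n) / n for i in range(n)]
    return lam, w


u = [1.0, 2.0, 3.0, 4.0, 5.0]
lam, w = shift_params(n=8, s=2)
y_rnn = dlr_recurrent(lam, w, u)
y_conv = causal_conv(dlr_kernel(lam, w, len(u)), u)
print([round(v, 6) for v in y_rnn])
```

Up to floating-point error, both `y_rnn` and `y_conv` equal the input delayed by two positions, with zeros in the first two slots. Because the kernel is fixed by `lam` and `w`, the shift size cannot change with the input, which matches the abstract's remark that context-dependent shifts are harder for such models.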

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2212.00768 [cs.LG]
	(or arXiv:2212.00768v1 [cs.LG] for this version)

Submission history

From: Ankit Gupta
[v1] Thu, 1 Dec 2022 18:53:06 GMT (436kb,D)
[v2] Wed, 7 Dec 2022 10:46:50 GMT (448kb,D)
[v3] Tue, 14 Nov 2023 16:52:48 GMT (491kb,D)
