What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

Garg, Shivam; Tsipras, Dimitris; Liang, Percy; Valiant, Gregory

Full-text links:

Download:

Current browse context:

cs.LG

< prev | next >

new | recent | 2208

Computer Science > Computation and Language

Title: What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

Authors: Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant

(Submitted on 1 Aug 2022 (v1), last revised 11 Aug 2023 (this version, v3))

Abstract: In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions -- that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes -- namely sparse linear functions, two-layer neural networks, and decision trees -- with performance that matches or exceeds task-specific learning algorithms. Our code and models are available at this https URL .

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2208.01066 [cs.CL]
	(or arXiv:2208.01066v3 [cs.CL] for this version)

Submission history

From: Dimitris Tsipras [view email]
[v1] Mon, 1 Aug 2022 18:01:40 GMT (1319kb,D)
[v2] Sun, 15 Jan 2023 02:19:19 GMT (1343kb,D)
[v3] Fri, 11 Aug 2023 19:27:58 GMT (1367kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2208.01066

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Computation and Language

Title: What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

Submission history