The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor

(Submitted on 31 Dec 2020)

Abstract: Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2101.00027 [cs.CL] (or arXiv:2101.00027v1 [cs.CL] for this version)

Submission history

From: Leo Gao
[v1] Thu, 31 Dec 2020 19:00:10 GMT (2152kb,D)
