Designing compact training sets for data-driven molecular property prediction

Li, Bowen; Rangarajan, Srinivas

Title: Designing compact training sets for data-driven molecular property prediction

Authors: Bowen Li, Srinivas Rangarajan

(Submitted on 25 Jun 2019)

Abstract: In this paper, we consider the problem of designing a training set using the most informative molecules from a specified library to build data-driven molecular property models. Specifically, using (i) sparse generalized group additivity and (ii) kernel ridge regression as two representative classes of models, we propose a method that combines rigorous model-based design of experiments with cheminformatics-based diversity-maximizing subset selection within the epsilon-greedy framework to systematically minimize the amount of data needed to train these models. We demonstrate the effectiveness of the algorithm on subsets of various databases, including QM7, NIST, and a catalysis dataset. For sparse group additive models, a balance between exploration (diversity-maximizing selection) and exploitation (D-optimality selection) allows the model to be trained on a fraction of the data (sometimes as little as 15%) while achieving accuracy similar to that of five-fold cross-validation on the entire set. Kernel ridge regression, on the other hand, prefers diversity-maximizing selections.
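
The selection scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes what the abstract states, namely that each new molecule is chosen either by a diversity-maximizing rule (exploration) or by a D-optimality rule (exploitation) within an epsilon-greedy loop. The function name `epsilon_greedy_select`, the use of Euclidean distance on a generic feature matrix `X` as the diversity measure, the random seed molecule, and the small `ridge` term that keeps the information matrix invertible are all assumptions made for illustration.

```python
import numpy as np


def epsilon_greedy_select(X, n_select, epsilon=0.5, ridge=1e-6, seed=0):
    """Pick row indices of X, mixing diversity and D-optimality steps.

    X is an (n_molecules, n_features) array (hypothetical descriptors).
    With probability epsilon a step explores; otherwise it exploits.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    selected = [int(rng.integers(n))]  # assumed: random starting molecule
    remaining = [i for i in range(n) if i != selected[0]]
    # Regularized information matrix X_S^T X_S of the current design.
    M = ridge * np.eye(p) + np.outer(X[selected[0]], X[selected[0]])

    while len(selected) < n_select and remaining:
        cand = X[remaining]
        if rng.random() < epsilon:
            # Exploration: max-min rule, i.e. the candidate whose nearest
            # already-selected row is farthest away (Euclidean, assumed).
            d = np.linalg.norm(cand[:, None, :] - X[selected][None, :, :], axis=2)
            scores = d.min(axis=1)
        else:
            # Exploitation: greedy D-optimality. By the matrix determinant
            # lemma, det(M + x x^T) = det(M) * (1 + x^T M^{-1} x), so the
            # best candidate maximizes x^T M^{-1} x.
            scores = np.einsum("ij,ij->i", cand @ np.linalg.inv(M), cand)
        pick = remaining.pop(int(np.argmax(scores)))
        selected.append(pick)
        M += np.outer(X[pick], X[pick])
    return selected


# Usage on synthetic descriptors (random data, for illustration only).
X_demo = np.random.default_rng(42).normal(size=(100, 5))
chosen = epsilon_greedy_select(X_demo, n_select=15, epsilon=0.3)
```

Setting `epsilon=0.0` reduces the sketch to purely D-optimal selection and `epsilon=1.0` to purely diversity-maximizing selection, the two extremes that the abstract contrasts for group additive and kernel ridge regression models.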

Comments:	16 pages with supplemental material, 7 figures in main body and 3 figures in SI
Subjects:	Data Analysis, Statistics and Probability (physics.data-an); Computational Physics (physics.comp-ph)
Cite as:	arXiv:1906.10273 [physics.data-an]
	(or arXiv:1906.10273v1 [physics.data-an] for this version)

Submission history

From: Bowen Li
[v1] Tue, 25 Jun 2019 00:26:40 GMT (574kb,D)
