Current browse context:
physics.data-an
Change to browse by:
References & Citations
Physics > Data Analysis, Statistics and Probability
Title: Designing compact training sets for data-driven molecular property prediction
(Submitted on 25 Jun 2019)
Abstract: In this paper, we consider the problem of designing a training set using the most informative molecules from a specified library to build data-driven molecular property models. Specifically, we use (i) sparse generalized group additivity and (ii) kernel ridge regression as two representative classes of models, we propose a method combining rigorous model-based design of experiments and cheminformatics-based diversity-maximizing subset selection within the epsilon--greedy framework to systematically minimize the amount of data needed to train these models. We demonstrate the effectiveness of the algorithm on subsets of various databases, including QM7, NIST, and a catalysis dataset. For sparse group additive models, a balance between exploration (diversity-maximizing selection) and exploitation (D-optimality selection) leads to learning with a fraction (sometimes as little as 15%) of the data to achieve similar accuracy as five-fold cross validation on the entire set. On the other hand, kernel ridge regression prefers diversity-maximizing selections.
Link back to: arXiv, form interface, contact.