Quantitative Biology > Quantitative Methods
Title: Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization
(Submitted on 20 Jun 2017 (v1), last revised 9 May 2018 (this version, v2))
Abstract: Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems that accounts for the similarity amongst inactive molecules as well as active. We investigated seven widely-used benchmarks for virtual screening and classification, and show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously-applied unbiasing techniques. Therefore, it may be that the previously-reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than good prospective accuracy.
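The abstract does not give the formula for AVE, so the following is only a minimal sketch of one way a training-validation redundancy score covering both actives and inactives could be computed. It assumes fingerprints are represented as Python sets of feature ids, uses Tanimoto distance, and averages nearest-neighbour coverage over a grid of distance thresholds. The function names (`tanimoto_distance`, `nn_coverage`, `redundancy_bias`) and the threshold grid are hypothetical choices made for illustration, not the authors' implementation.

```python
def tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between two fingerprints given as sets."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def nn_coverage(queries, references, thresholds):
    """Mean, over thresholds, of the fraction of query molecules whose
    nearest reference molecule lies within that distance threshold."""
    nearest = [min(tanimoto_distance(q, r) for r in references) for q in queries]
    fractions = [
        sum(1 for d in nearest if d <= t) / len(nearest) for t in thresholds
    ]
    return sum(fractions) / len(fractions)


def redundancy_bias(val_act, val_inact, train_act, train_inact, thresholds=None):
    """Illustrative bias score (assumed form): how much closer validation
    actives are to training actives than to training inactives, plus the
    same comparison for validation inactives."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(101)]  # assumed grid 0.00 .. 1.00
    aa = nn_coverage(val_act, train_act, thresholds)
    ai = nn_coverage(val_act, train_inact, thresholds)
    ii = nn_coverage(val_inact, train_inact, thresholds)
    ia = nn_coverage(val_inact, train_act, thresholds)
    return (aa - ai) + (ii - ia)


# Toy data: validation molecules duplicate their own training class and
# share no features with the opposite class, so the score is strongly positive.
actives = [{1, 2, 3}, {2, 3, 4}]
inactives = [{10, 11}, {11, 12}]
biased = redundancy_bias(actives, inactives, actives, inactives)

# Toy data: every molecule is identical, so neither class is favoured.
same = [{1, 2}]
unbiased = redundancy_bias(same, same, same, same)
```

Under these assumptions, a score near zero means a validation molecule's class cannot be guessed from whether its nearest training neighbour is active or inactive, while a large positive score means a nearest-neighbour lookup alone would perform well on the split.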
Submission history
From: Abraham Heifets
[v1] Tue, 20 Jun 2017 18:47:46 GMT (2729kb,D)
[v2] Wed, 9 May 2018 20:59:37 GMT (2332kb,D)