We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

stat.AP

Change to browse by:

References & Citations

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Statistics > Applications

Title: Learning Disease vs Participant Signatures: a permutation test approach to detect identity confounding in machine learning diagnostic applications

Abstract: Recently, Saeb et al (2017) showed that, in diagnostic machine learning applications, having data of each subject randomly assigned to both training and test sets (record-wise data split) can lead to massive underestimation of the cross-validation prediction error, due to the presence of "subject identity confounding" caused by the classifier's ability to identify subjects, instead of recognizing disease. To solve this problem, the authors recommended the random assignment of the data of each subject to either the training or the test set (subject-wise data split). The adoption of subject-wise split has been criticized in Little et al (2017), on the basis that it can violate assumptions required by cross-validation to consistently estimate generalization error. In particular, adopting subject-wise splitting in heterogeneous data-sets might lead to model under-fitting and larger classification errors. Hence, Little et al argue that perhaps the overestimation of prediction errors with subject-wise cross-validation, rather than underestimation with record-wise cross-validation, is the reason for the discrepancies between prediction error estimates generated by the two splitting strategies. In order to shed light on this controversy, we focus on simpler classification performance metrics and develop permutation tests that can detect identity confounding. By focusing on permutation tests, we are able to evaluate the merits of record-wise and subject-wise data splits under more general statistical dependencies and distributional structures of the data, including situations where cross-validation breaks down. We illustrate the application of our tests using synthetic and real data from a Parkinson's disease study.
Subjects: Applications (stat.AP)
Cite as: arXiv:1712.03120 [stat.AP]
  (or arXiv:1712.03120v2 [stat.AP] for this version)

Submission history

From: Elias Chaibub Neto [view email]
[v1] Fri, 8 Dec 2017 15:23:58 GMT (1137kb,D)
[v2] Fri, 6 Jul 2018 23:32:58 GMT (1137kb,D)

Link back to: arXiv, form interface, contact.