Learning Disease vs Participant Signatures: a permutation test approach to detect identity confounding in machine learning diagnostic applications

Neto, Elias Chaibub; Pratap, Abhishek; Perumal, Thanneer M; Tummalacherla, Meghasyam; Bot, Brian M; Trister, Andrew D; Friend, Stephen H; Mangravite, Lara; Omberg, Larsson

(Submitted on 8 Dec 2017 (v1), last revised 6 Jul 2018 (this version, v2))

Abstract: Recently, Saeb et al. (2017) showed that, in diagnostic machine learning applications, having data of each subject randomly assigned to both training and test sets (record-wise data split) can lead to massive underestimation of the cross-validation prediction error, due to the presence of "subject identity confounding" caused by the classifier's ability to identify subjects, instead of recognizing disease. To solve this problem, Saeb et al. recommended the random assignment of the data of each subject to either the training or the test set (subject-wise data split). The adoption of subject-wise split has been criticized by Little et al. (2017), on the basis that it can violate assumptions required by cross-validation to consistently estimate generalization error. In particular, adopting subject-wise splitting in heterogeneous datasets might lead to model under-fitting and larger classification errors. Hence, Little et al. argue that perhaps the overestimation of prediction errors with subject-wise cross-validation, rather than underestimation with record-wise cross-validation, is the reason for the discrepancies between prediction error estimates generated by the two splitting strategies. In order to shed light on this controversy, we focus on simpler classification performance metrics and develop permutation tests that can detect identity confounding. By focusing on permutation tests, we are able to evaluate the merits of record-wise and subject-wise data splits under more general statistical dependencies and distributional structures of the data, including situations where cross-validation breaks down. We illustrate the application of our tests using synthetic and real data from a Parkinson's disease study.
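
The difference between the two splitting strategies, and the way a label permutation can expose identity confounding, can be illustrated with a small simulation. The sketch below is not the authors' procedure, which the abstract does not spell out. It assumes synthetic data in which every subject has a distinctive feature signature and the disease label is unrelated to the features, a hand-written 1-nearest-neighbour classifier, and a permutation scheme that shuffles labels across whole subjects. All names (`nearest_train_index`, `acc_record`, `null_acc`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic setup: each subject has a distinct feature signature,
# and the disease label carries no information about the features.
n_subjects, n_records, n_features = 20, 30, 5
subject_means = rng.normal(0.0, 5.0, size=(n_subjects, n_features))
subject_labels = np.tile([0, 1], n_subjects // 2)
subject_ids = np.repeat(np.arange(n_subjects), n_records)
X = subject_means[subject_ids] + rng.normal(size=(subject_ids.size, n_features))
y = subject_labels[subject_ids]


def nearest_train_index(X_train, X_test):
    """Index of the nearest training record for each test record (1-NN)."""
    sq_dist = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


def accuracy(y_train, y_test, nn_index):
    """Accuracy of predicting each test label from its nearest training record."""
    return float(np.mean(y_train[nn_index] == y_test))


# Record-wise split: records of the same subject land in both sets.
order = rng.permutation(subject_ids.size)
rec_train, rec_test = order[: order.size // 2], order[order.size // 2:]
nn_rec = nearest_train_index(X[rec_train], X[rec_test])
acc_record = accuracy(y[rec_train], y[rec_test], nn_rec)

# Subject-wise split: each subject lands entirely in one set.
train_subjects = rng.permutation(n_subjects)[: n_subjects // 2]
in_train = np.isin(subject_ids, train_subjects)
nn_sub = nearest_train_index(X[in_train], X[~in_train])
acc_subject = accuracy(y[in_train], y[~in_train], nn_sub)

# Permutation null (assumed scheme): shuffle labels across whole subjects.
# This breaks any label-feature link but keeps each subject's signature.
n_perm = 200
null_acc = np.empty(n_perm)
for b in range(n_perm):
    y_perm = rng.permutation(subject_labels)[subject_ids]
    null_acc[b] = accuracy(y_perm[rec_train], y_perm[rec_test], nn_rec)

print("record-wise accuracy:", acc_record)
print("subject-wise accuracy:", acc_subject)
print("mean record-wise accuracy under shuffled labels:", null_acc.mean())
```

In this setup the record-wise accuracy stays high even after the labels are shuffled across subjects, which suggests the classifier is re-identifying subjects rather than recognizing disease. A null distribution centred well above chance is the kind of signal a test for identity confounding would look for.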

Subjects:	Applications (stat.AP)
Cite as:	arXiv:1712.03120 [stat.AP]
	(or arXiv:1712.03120v2 [stat.AP] for this version)

Submission history

From: Elias Chaibub Neto
[v1] Fri, 8 Dec 2017 15:23:58 GMT (1137kb,D)
[v2] Fri, 6 Jul 2018 23:32:58 GMT (1137kb,D)
