Computer Science > Databases
Title: QUIP: Query-driven Missing Value Imputation
(Submitted on 31 Mar 2022 (v1), last revised 5 Apr 2022 (this version, v2))
Abstract: Missing values are common in real-world data sets, and failing to clean them can degrade the quality of query answers. Traditionally, missing value imputation has been studied as an offline process performed while preparing data for analysis. This paper studies query-time missing value imputation and proposes QUIP, which imputes only the minimal set of missing values needed to answer the query. Specifically, taking a reasonably good query plan as input, QUIP tries to minimize both the missing value imputation cost and the query processing overhead. QUIP proposes a new implementation of outer join that preserves missing values during query processing, and a Bloom filter based index structure that reduces the space and runtime overhead. QUIP also designs a cost-based decision function that automatically guides each operator to either impute missing values immediately or delay imputation. Efficient optimizations are proposed to speed up aggregate operations in QUIP, such as the MAX/MIN operators. Extensive experiments on both real and synthetic data sets demonstrate the effectiveness and efficiency of QUIP, which outperforms the state-of-the-art ImputeDB by 2 to 10 times on different query sets and data sets, and achieves order-of-magnitude improvements over the offline approach.
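The contrast between offline and query-time imputation can be illustrated with a minimal sketch. This is not QUIP's algorithm: the row layout, the names eager, lazy and impute_temp, and the constant-valued imputer are all hypothetical, chosen only to show why delaying imputation until after a selection on a complete column reduces the number of imputation calls while returning the same answer.

```python
from typing import Optional

# Hypothetical row shape (assumption): {"city": str, "temp": Optional[float]}
ROWS = [
    {"city": "Irvine", "temp": 21.5},
    {"city": "Irvine", "temp": None},
    {"city": "Boston", "temp": None},
    {"city": "Boston", "temp": None},
    {"city": "Austin", "temp": 30.0},
]


def impute_temp(row: dict) -> float:
    # Placeholder imputer (assumption): stands in for an expensive model call.
    return 20.0


def eager(rows: list, city: str) -> tuple:
    """Offline style: impute every missing temp, then run the selection."""
    imputed = 0
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input is not mutated
        if row["temp"] is None:
            row["temp"] = impute_temp(row)
            imputed += 1
        filled.append(row)
    return [r["temp"] for r in filled if r["city"] == city], imputed


def lazy(rows: list, city: str) -> tuple:
    """Query-time style: select on the complete column first,
    then impute only the rows that survive the selection."""
    imputed = 0
    result = []
    for row in rows:
        if row["city"] != city:
            continue  # a missing temp in this row is never needed
        temp: Optional[float] = row["temp"]
        if temp is None:
            temp = impute_temp(row)
            imputed += 1
        result.append(temp)
    return result, imputed


if __name__ == "__main__":
    print(eager(ROWS, "Irvine"))  # same answer, three imputations
    print(lazy(ROWS, "Irvine"))   # same answer, one imputation
```

In this toy query both strategies return the same temperatures for Irvine, but the lazy version never imputes the two Boston rows. Deciding when delaying is actually cheaper across joins and aggregates is the role the abstract assigns to QUIP's cost-based decision function, which this sketch does not attempt to model.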
Submission history
From: Yiming Lin
[v1] Thu, 31 Mar 2022 21:41:43 GMT (22176kb,D)
[v2] Tue, 5 Apr 2022 04:31:51 GMT (21625kb,D)