New submissions

[ total of 8 entries: 1-8 ]

New submissions for Fri, 9 Jun 23

[1]  arXiv:2306.04743 [pdf, other]
Title: ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems
Comments: 12 pages, 2 figures, 5 tables
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Natural Language to SQL systems (NL-to-SQL) have recently shown a significant increase in accuracy for natural language to SQL query translation. This improvement is due to the emergence of transformer-based language models, and the popularity of the Spider benchmark - the de-facto standard for evaluating NL-to-SQL systems. The top NL-to-SQL systems reach accuracies of up to 85%. However, Spider mainly contains simple databases with few tables, columns, and entries, which does not reflect a realistic setting. Moreover, complex real-world databases with domain-specific content have little to no training data available in the form of NL/SQL-pairs leading to poor performance of existing NL-to-SQL systems.
In this paper, we introduce ScienceBenchmark, a new complex NL-to-SQL benchmark for three real-world, highly domain-specific databases. For this new benchmark, SQL experts and domain experts created high-quality NL/SQL-pairs for each domain. To garner more data, we extended the small amount of human-generated data with synthetic data generated using GPT-3. We show that our benchmark is highly challenging, as the top performing systems on Spider achieve a very low performance on our benchmark. Thus, the challenge is many-fold: creating NL-to-SQL systems for highly complex domains with a small amount of hand-made training data augmented with synthetic data. To our knowledge, ScienceBenchmark is the first NL-to-SQL benchmark designed with complex real-world scientific databases, containing challenging training and test data carefully validated by domain experts.

[2]  arXiv:2306.04846 [pdf, ps, other]
Title: Learned spatial data partitioning
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)

Due to the significant increase in the size of spatial data, it is essential to use distributed parallel processing systems to efficiently analyze spatial data. In this paper, we first study learned spatial data partitioning, which effectively assigns groups of big spatial data to computers based on locations of data by using machine learning techniques. We formalize spatial data partitioning in the context of reinforcement learning and develop a novel deep reinforcement learning algorithm. Our learning algorithm leverages features of spatial data partitioning and prunes ineffective learning processes to find optimal partitions efficiently. Our experimental study, which uses Apache Sedona and real-world spatial data, demonstrates that our method efficiently finds partitions for accelerating distance join queries and reduces the workload run time by up to 59.4%.

[3]  arXiv:2306.04945 [pdf, ps, other]
Title: Modern Data Pricing Models: Taxonomy and Comprehensive Survey
Subjects: Databases (cs.DB)

Data play an increasingly important role in smart data analytics, which facilitates many data-driven applications. Various data markets aim to alleviate the issue of isolated data islands and thereby benefit data circulation. The problem of data pricing is indispensable yet challenging in data trade. In this paper, we conduct a comprehensive survey of modern data pricing solutions. We divide the data pricing solutions into three major strategies and thirteen models; the strategies are query pricing, feature-based data pricing, and pricing in machine learning. This is, so far, the first attempt to classify so many existing data pricing models. Moreover, we not only elaborate on the thirteen specific pricing models across these strategies, but also analyze these models in depth. We also identify five research directions for the data pricing field and put forward some novel and interesting data pricing topics. This paper aims to provide better insights and to direct future research towards practical and sophisticated pricing mechanisms for better data trading and sharing.

Cross-lists for Fri, 9 Jun 23

[4]  arXiv:2306.05402 (cross-list from cs.IT) [pdf, ps, other]
Title: Fully Robust Federated Submodel Learning in a Distributed Storage System
Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR); Databases (cs.DB); Signal Processing (eess.SP)

We consider the federated submodel learning (FSL) problem in a distributed storage system. In the FSL framework, the full learning model at the server side is divided into multiple submodels such that each selected client needs to download only the required submodel(s) and upload the corresponding update(s) in accordance with its local training data. The server comprises multiple independent databases and the full model is stored across these databases. An eavesdropper passively observes all the storage and listens to all the communicated data of the databases it controls, to gain knowledge about the remote client data and the submodel information. In addition, a subset of databases may fail, negatively affecting the FSL process, as the FSL process may take a non-negligible amount of time for large models. To resolve these two issues together (i.e., security and database repair), we propose a novel coding mechanism, coined ramp secure regenerating coding (RSRC), to store the full model in a distributed manner. Using our new RSRC method, the eavesdropper is permitted to learn a controllable amount of submodel information for the sake of reducing the communication and storage costs. Further, during the database repair process, in the construction of the replacement database, the submodels to be updated are stored in the form of their latest version from updating clients, while the remaining submodels are obtained from the previous version in other databases through routing clients. Our new RSRC-based distributed FSL approach is constructed on top of our earlier two-database FSL scheme, which uses private set union (PSU). A complete one-round FSL process consists of an FSL-PSU phase, an FSL-write phase, and additional auxiliary phases. Our proposed FSL scheme is also robust against database drop-outs, client drop-outs, client late-arrivals, and an active adversary controlling databases.

Replacements for Fri, 9 Jun 23

[5]  arXiv:2302.02118 (replaced) [pdf, other]
Title: Practical View-Change-Less Protocol through Rapid View Synchronization
Comments: 16 pages, 14 figures
Subjects: Databases (cs.DB)

[6]  arXiv:2302.03629 (replaced) [pdf, ps, other]
Title: Principlism Guided Responsible Data Curation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

[7]  arXiv:2303.06516 (replaced) [pdf, other]
Title: Efficient Computation of Shap Explanation Scores for Neural Network Classifiers via Knowledge Compilation
Comments: Conference submission. It replaces the previously uploaded paper "Opening Up the Neural Network Classifier for Shap Score Computation", by the same authors. This version considerably revised the previous one
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

[8]  arXiv:2306.04349 (replaced) [pdf, other]
Title: GPT Self-Supervision for a Better Data Annotator
Subjects: Computation and Language (cs.CL); Databases (cs.DB)