We gratefully acknowledge support from
the Simons Foundation and member institutions.

Databases

New submissions

[ total of 16 entries: 1-16 ]
[ showing up to 2000 entries per page: fewer | more ]

New submissions for Tue, 14 May 24

[1]  arXiv:2405.06730 [pdf, other]
Title: Ocean-DC: An analysis ready data cube framework for environmental and climate change monitoring over the port areas
Subjects: Databases (cs.DB); Atmospheric and Oceanic Physics (physics.ao-ph)

The environmental hazards and climate change effects causes serious problems in land and coastal areas. A solution to this problem can be the periodic monitoring over critical areas, like coastal region with heavy industrial activity (i.e., ship-buildings) or areas where a disaster (i.e., oil-spill) has occurred. Today there are several Earth and non-Earth Observation data available from several data providers. These data are huge in size and usually it is needed to combine several data from multiple sources (i.e., data with format differences) for a more effective evaluation. For addressing these issues, this work proposes the Ocean-DC framework as a solution in data harmonization and homogenization. A strong advantage of this Data Cube implementation is the generation of a single NetCDF product that contains Earth Observation data of several data types (i.e., Landsat-8 and Sentinel-2). To evaluate the effectiveness and efficiency of the Ocean-DC implementation, it is examined a case study of an oil-spill in Saronic gulf in September of 2017. The generated 4D Data Cube considers both Landsat-8,9 and Sentinel-2 products for a time-series analysis, before, during, and after the oil-spill event. The Ocean-DC framework successfully generated a NetCDF product, containing all the necessary remote sensing products for monitoring the oil-spill disaster in the Saronic gulf.

[2]  arXiv:2405.06767 [pdf, other]
Title: Color: A Framework for Applying Graph Coloring to Subgraph Cardinality Estimation
Subjects: Databases (cs.DB)

Graph workloads pose a particularly challenging problem for query optimizers. They typically feature large queries made up of entirely many-to-many joins with complex correlations. This puts significant stress on traditional cardinality estimation methods which generally see catastrophic errors when estimating the size of queries with only a handful of joins. To overcome this, we propose COLOR, a framework for subgraph cardinality estimation which applies insights from graph compression theory to produce a compact summary that captures the global topology of the data graph. Further, we identify several key optimizations that enable tractable estimation over this summary even for large query graphs. We then evaluate several designs within this framework and find that they improve accuracy by up to 10$^3$x over all competing methods while maintaining fast inference, a small memory footprint, efficient construction, and graceful degradation under updates.

[3]  arXiv:2405.07081 [pdf, other]
Title: T-curator: a trust based curation tool for LOD logs
Authors: Dihia Lanasri
Subjects: Databases (cs.DB); Computation and Language (cs.CL)

Nowadays, companies are racing towards Linked Open Data (LOD) to improve their added value, but they are ignoring their SPARQL query logs. If well curated, these logs can present an asset for decision makers. A naive and straightforward use of these logs is too risky because their provenance and quality are highly questionable. Users of these logs in a trusted way have to be assisted by providing them with in-depth knowledge of the whole LOD environment and tools to curate these logs. In this paper, we propose an interactive and intuitive trust based tool that can be used to curate these LOD logs before exploiting them. This tool is proposed to support our approach proposed in our previous work Lanasri et al. [2020].

[4]  arXiv:2405.07196 [pdf, other]
Title: Permissioned Blockchain-based Framework for Ranking Synthetic Data Generators
Subjects: Databases (cs.DB); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Synthetic data generation is increasingly recognized as a crucial solution to address data related challenges such as scarcity, bias, and privacy concerns. As synthetic data proliferates, the need for a robust evaluation framework to select a synthetic data generator becomes more pressing given the variety of options available. In this research study, we investigate two primary questions: 1) How can we select the most suitable synthetic data generator from a set of options for a specific purpose? 2) How can we make the selection process more transparent, accountable, and auditable? To address these questions, we introduce a novel approach in which the proposed ranking algorithm is implemented as a smart contract within a permissioned blockchain framework called Sawtooth. Through comprehensive experiments and comparisons with state-of-the-art baseline ranking solutions, our framework demonstrates its effectiveness in providing nuanced rankings that consider both desirable and undesirable properties. Furthermore, our framework serves as a valuable tool for selecting the optimal synthetic data generators for specific needs while ensuring compliance with data protection principles.

[5]  arXiv:2405.07792 [pdf, other]
Title: Optimal Matrix Sketching over Sliding Windows
Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Matrix sketching, aimed at approximating a matrix $\boldsymbol{A} \in \mathbb{R}^{N\times d}$ consisting of vector streams of length $N$ with a smaller sketching matrix $\boldsymbol{B} \in \mathbb{R}^{\ell\times d}, \ell \ll N$, has garnered increasing attention in fields such as large-scale data analytics and machine learning. A well-known deterministic matrix sketching method is the Frequent Directions algorithm, which achieves the optimal $O\left(\frac{d}{\varepsilon}\right)$ space bound and provides a covariance error guarantee of $\varepsilon = \lVert \boldsymbol{A}^\top \boldsymbol{A} - \boldsymbol{B}^\top \boldsymbol{B} \rVert_2/\lVert \boldsymbol{A} \rVert_F^2$. The matrix sketching problem becomes particularly interesting in the context of sliding windows, where the goal is to approximate the matrix $\boldsymbol{A}_W$, formed by input vectors over the most recent $N$ time units. However, despite recent efforts, whether achieving the optimal $O\left(\frac{d}{\varepsilon}\right)$ space bound on sliding windows is possible has remained an open question.
In this paper, we introduce the DS-FD algorithm, which achieves the optimal $O\left(\frac{d}{\varepsilon}\right)$ space bound for matrix sketching over row-normalized, sequence-based sliding windows. We also present matching upper and lower space bounds for time-based and unnormalized sliding windows, demonstrating the generality and optimality of \dsfd across various sliding window models. This conclusively answers the open question regarding the optimal space bound for matrix sketching over sliding windows. Furthermore, we conduct extensive experiments with both synthetic and real-world datasets, validating our theoretical claims and thus confirming the correctness and effectiveness of our algorithm, both theoretically and empirically.

Cross-lists for Tue, 14 May 24

[6]  arXiv:2405.07022 (cross-list from cs.LG) [pdf, other]
Title: DTMamba : Dual Twin Mamba for Time Series Forecasting
Subjects: Machine Learning (cs.LG); Databases (cs.DB)

We utilized the Mamba model for time series data prediction tasks, and the experimental results indicate that our model performs well.

[7]  arXiv:2405.07460 (cross-list from cs.LG) [pdf, other]
Title: HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)

Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging due to the complexity and heterogeneity of medical data. To address this challenge, we introduce HoneyBee, a scalable modular framework for building multimodal oncology datasets that leverages foundational models to generate representative embeddings. HoneyBee integrates various data modalities, including clinical records, imaging data, and patient outcomes. It employs data preprocessing techniques and transformer-based architectures to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format using Hugging Face datasets and PyTorch dataloaders for accessibility. Vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments assessing the quality and representativeness of the embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.

[8]  arXiv:2405.07601 (cross-list from cs.LG) [pdf, other]
Title: On-device Online Learning and Semantic Management of TinyML Systems
Comments: Accepted by Journal Transactions on Embedded Computing Systems (TECS)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)

Recent advances in Tiny Machine Learning (TinyML) empower low-footprint embedded devices for real-time on-device Machine Learning. While many acknowledge the potential benefits of TinyML, its practical implementation presents unique challenges. This study aims to bridge the gap between prototyping single TinyML models and developing reliable TinyML systems in production: (1) Embedded devices operate in dynamically changing conditions. Existing TinyML solutions primarily focus on inference, with models trained offline on powerful machines and deployed as static objects. However, static models may underperform in the real world due to evolving input data distributions. We propose online learning to enable training on constrained devices, adapting local models towards the latest field conditions. (2) Nevertheless, current on-device learning methods struggle with heterogeneous deployment conditions and the scarcity of labeled data when applied across numerous devices. We introduce federated meta-learning incorporating online learning to enhance model generalization, facilitating rapid learning. This approach ensures optimal performance among distributed devices by knowledge sharing. (3) Moreover, TinyML's pivotal advantage is widespread adoption. Embedded devices and TinyML models prioritize extreme efficiency, leading to diverse characteristics ranging from memory and sensors to model architectures. Given their diversity and non-standardized representations, managing these resources becomes challenging as TinyML systems scale up. We present semantic management for the joint management of models and devices at scale. We demonstrate our methods through a basic regression example and then assess them in three real-world TinyML applications: handwritten character image classification, keyword audio classification, and smart building presence detection, confirming our approaches' effectiveness.

[9]  arXiv:2405.07770 (cross-list from quant-ph) [pdf, other]
Title: Hype or Heuristic? Quantum Reinforcement Learning for Join Order Optimisation
Subjects: Quantum Physics (quant-ph); Databases (cs.DB); Machine Learning (cs.LG)

Identifying optimal join orders (JOs) stands out as a key challenge in database research and engineering. Owing to the large search space, established classical methods rely on approximations and heuristics. Recent efforts have successfully explored reinforcement learning (RL) for JO. Likewise, quantum versions of RL have received considerable scientific attention. Yet, it is an open question if they can achieve sustainable, overall practical advantages with improved quantum processors.
In this paper, we present a novel approach that uses quantum reinforcement learning (QRL) for JO based on a hybrid variational quantum ansatz. It is able to handle general bushy join trees instead of resorting to simpler left-deep variants as compared to approaches based on quantum(-inspired) optimisation, yet requires multiple orders of magnitudes fewer qubits, which is a scarce resource even for post-NISQ systems.
Despite moderate circuit depth, the ansatz exceeds current NISQ capabilities, which requires an evaluation by numerical simulations. While QRL may not significantly outperform classical approaches in solving the JO problem with respect to result quality (albeit we see parity), we find a drastic reduction in required trainable parameters. This benefits practically relevant aspects ranging from shorter training times compared to classical RL, less involved classical optimisation passes, or better use of available training data, and fits data-stream and low-latency processing scenarios. Our comprehensive evaluation and careful discussion delivers a balanced perspective on possible practical quantum advantage, provides insights for future systemic approaches, and allows for quantitatively assessing trade-offs of quantum approaches for one of the most crucial problems of database management systems.

Replacements for Tue, 14 May 24

[10]  arXiv:2404.18673 (replaced) [pdf, other]
Title: Open-Source Drift Detection Tools in Action: Insights from Two Use Cases
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
[11]  arXiv:2405.05413 (replaced) [pdf, ps, other]
Title: Digital Evolution: Novo Nordisk's Shift to Ontology-Based Data Management
Comments: 14 pages, 2 figures
Subjects: Databases (cs.DB)
[12]  arXiv:2311.00960 (replaced) [pdf, other]
Title: Trajectory Similarity Measurement: An Efficiency Perspective
Comments: Accepted by VLDB 2024
Subjects: Databases (cs.DB)
[13]  arXiv:2211.15363 (replaced) [pdf, other]
Title: On the Security Vulnerabilities of Text-to-SQL Models
Comments: Best Paper Candidate at ISSRE 2023. Replaced "PLM" with "LLM" for better visibility
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG); Software Engineering (cs.SE)
[14]  arXiv:2310.02129 (replaced) [pdf, other]
Title: Unveiling the Pitfalls of Knowledge Editing for Large Language Models
Comments: ICLR 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB); Machine Learning (cs.LG)
[15]  arXiv:2312.14322 (replaced) [pdf, other]
Title: Data Needs and Challenges of Quantum Dot Devices Automation: Workshop Report
Comments: White paper/overview based on a workshop held at the National Institute of Standards and Technology, Gaithersburg, MD. 13 pages
Subjects: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Databases (cs.DB); Machine Learning (cs.LG); Quantum Physics (quant-ph)
[16]  arXiv:2405.03708 (replaced) [pdf, ps, other]
Title: Delta Tensor: Efficient Vector and Tensor Storage in Delta Lake
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Databases (cs.DB); Machine Learning (cs.LG)
[ total of 16 entries: 1-16 ]
[ showing up to 2000 entries per page: fewer | more ]

Disable MathJax (What is MathJax?)

Links to: arXiv, form interface, find, cs, recent, 2405, contact, help  (Access key information)