We gratefully acknowledge support from
the Simons Foundation and member institutions.

Computer Science

New submissions

[ total of 605 entries: 1-605 ]
[ showing up to 2000 entries per page: fewer | more ]

New submissions for Fri, 19 Apr 24

[1]  arXiv:2404.11630 [pdf, other]
Title: SNP: Structured Neuron-level Pruning to Preserve Attention Scores
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Multi-head self-attention (MSA) is a key component of Vision Transformers (ViTs), which have achieved great success in various vision tasks. However, their high computational cost and memory footprint hinder their deployment on resource-constrained devices. Conventional pruning approaches can only compress and accelerate the MSA module using head pruning, although the head is not an atomic unit. To address this issue, we propose a novel graph-aware neuron-level pruning method, Structured Neuron-level Pruning (SNP). SNP prunes neurons with less informative attention scores and eliminates redundancy among heads. Specifically, it prunes graphically connected query and key layers having the least informative attention scores while preserving the overall attention scores. Value layers, which can be pruned independently, are pruned to eliminate inter-head redundancy. Our proposed method effectively compresses and accelerates Transformer-based models for both edge devices and server processors. For instance, the DeiT-Small with SNP runs 3.1$\times$ faster than the original model and achieves performance that is 21.94\% faster and 1.12\% higher than the DeiT-Tiny. Additionally, SNP combine successfully with conventional head or block pruning approaches. SNP with head pruning could compress the DeiT-Base by 80\% of the parameters and computational costs and achieve 3.85$\times$ faster inference speed on RTX3090 and 4.93$\times$ on Jetson Nano.

[2]  arXiv:2404.11631 [pdf, other]
Title: A Preliminary Study on Accelerating Simulation Optimization with GPU Implementation
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

We provide a preliminary study on utilizing GPU (Graphics Processing Unit) to accelerate computation for three simulation optimization tasks with either first-order or second-order algorithms. Compared to the implementation using only CPU (Central Processing Unit), the GPU implementation benefits from computational advantages of parallel processing for large-scale matrices and vectors operations. Numerical experiments demonstrate computational advantages of utilizing GPU implementation in simulation optimization problems, and show that such advantage comparatively further increase as the problem scale increases.

[3]  arXiv:2404.11661 [pdf, other]
Title: Designing an Intelligent Parcel Management System using IoT & Machine Learning
Comments: 6 pages, 6 figures, 2022 IEEE IAS Global Conference on Emerging Technologies (GlobConET)
Journal-ref: 2022 IEEE IAS Global Conference on Emerging Technologies (GlobConET), Arad, Romania, 2022, pp. 751-756
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)

Parcels delivery is a critical activity in railways. More importantly, each parcel must be thoroughly checked and sorted according to its destination address. We require an efficient and robust IoT system capable of doing all of these tasks with great precision and minimal human interaction. This paper discusses, We created a fully-fledged solution using IoT and machine learning to assist trains in performing this operation efficiently. In this study, we covered the product, which consists mostly of two phases. Scanning is the first step, followed by sorting. During the scanning process, the parcel will be passed through three scanners that will look for explosives, drugs, and any dangerous materials in the parcel and will trash it if any of the tests fail. When the scanning step is over, the parcel moves on to the sorting phase, where we use QR codes to retrieve the details of the parcels and sort them properly. The simulation of the system is done using the blender software. Our research shows that our procedure significantly improves accuracy as well as the assessment of cutting-edge technology and existing techniques.

[4]  arXiv:2404.11665 [pdf, other]
Title: Exploring DNN Robustness Against Adversarial Attacks Using Approximate Multipliers
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Deep Neural Networks (DNNs) have advanced in many real-world applications, such as healthcare and autonomous driving. However, their high computational complexity and vulnerability to adversarial attacks are ongoing challenges. In this letter, approximate multipliers are used to explore DNN robustness improvement against adversarial attacks. By uniformly replacing accurate multipliers for state-of-the-art approximate ones in DNN layer models, we explore the DNNs robustness against various adversarial attacks in a feasible time. Results show up to 7% accuracy drop due to approximations when no attack is present while improving robust accuracy up to 10% when attacks applied.

[5]  arXiv:2404.11667 [pdf, other]
Title: Deep Dependency Networks and Advanced Inference Schemes for Multi-Label Classification
Comments: Will appear in AISTATS 2024. arXiv admin note: substantial text overlap with arXiv:2302.00633
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

We present a unified framework called deep dependency networks (DDNs) that combines dependency networks and deep learning architectures for multi-label classification, with a particular emphasis on image and video data. The primary advantage of dependency networks is their ease of training, in contrast to other probabilistic graphical models like Markov networks. In particular, when combined with deep learning architectures, they provide an intuitive, easy-to-use loss function for multi-label classification. A drawback of DDNs compared to Markov networks is their lack of advanced inference schemes, necessitating the use of Gibbs sampling. To address this challenge, we propose novel inference schemes based on local search and integer linear programming for computing the most likely assignment to the labels given observations. We evaluate our novel methods on three video datasets (Charades, TACoS, Wetlab) and three image datasets (MS-COCO, PASCAL VOC, NUS-WIDE), comparing their performance with (a) basic neural architectures and (b) neural architectures combined with Markov networks equipped with advanced inference and learning techniques. Our results demonstrate the superiority of our new DDN methods over the two competing approaches.

[6]  arXiv:2404.11669 [pdf, other]
Title: Factorized Motion Fields for Fast Sparse Input Dynamic View Synthesis
Comments: Accepted at SIGGRAPH 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Designing a 3D representation of a dynamic scene for fast optimization and rendering is a challenging task. While recent explicit representations enable fast learning and rendering of dynamic radiance fields, they require a dense set of input viewpoints. In this work, we focus on learning a fast representation for dynamic radiance fields with sparse input viewpoints. However, the optimization with sparse input is under-constrained and necessitates the use of motion priors to constrain the learning. Existing fast dynamic scene models do not explicitly model the motion, making them difficult to be constrained with motion priors. We design an explicit motion model as a factorized 4D representation that is fast and can exploit the spatio-temporal correlation of the motion field. We then introduce reliable flow priors including a combination of sparse flow priors across cameras and dense flow priors within cameras to regularize our motion model. Our model is fast, compact and achieves very good performance on popular multi-view dynamic scene datasets with sparse input viewpoints. The source code for our model can be found on our project page: https://nagabhushansn95.github.io/publications/2024/RF-DeRF.html.

[7]  arXiv:2404.11671 [pdf, other]
Title: A Study of Undefined Behavior Across Foreign Function Boundaries in Rust Libraries
Comments: 26 pages without appendix supplement, preprint
Subjects: Software Engineering (cs.SE)

The Rust programming language restricts aliasing and mutability to provide static safety guarantees, which developers rely on to write secure and performant applications. However, Rust is frequently used to interoperate with other languages that have far weaker restrictions. These languages support cyclic and self-referential design patterns that conflict with current models of Rust's operational semantics, representing a potentially significant source of undefined behavior that no current tools can detect. We created MiriLLI, a tool which uses existing Rust and LLVM interpreters to jointly execute multi-language Rust applications. We used our tool in a large-scale study of Rust libraries that call foreign functions, and we found 45 instances of undefined or undesirable behavior. These include four bugs from libraries that had over 10,000 daily downloads on average, one from a component of the GNU Compiler Collection (GCC), and one from a library maintained by the Rust Project. Most of these errors were caused by incompatible aliasing and initialization patterns, incorrect foreign function bindings, and invalid type conversion. The majority of aliasing violations were caused by unsound operations in Rust, but they occurred in foreign code. The Rust community must invest in new tools for validating multi-language programs to ensure that developers can easily detect and fix these errors.

[8]  arXiv:2404.11672 [pdf, other]
Title: MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory
Subjects: Computation and Language (cs.CL)

While current large language models (LLMs) demonstrate some capabilities in knowledge-intensive tasks, they are limited by relying on their parameters as an implicit storage mechanism. As a result, they struggle with infrequent knowledge and temporal degradation. In addition, the uninterpretable nature of parametric memorization makes it challenging to understand and prevent hallucination. Parametric memory pools and model editing are only partial solutions. Retrieval Augmented Generation (RAG) $\unicode{x2013}$ though non-parametric $\unicode{x2013}$ has its own limitations: it lacks structure, complicates interpretability and makes it hard to effectively manage stored knowledge. In this paper, we introduce MemLLM, a novel method of enhancing LLMs by integrating a structured and explicit read-and-write memory module. MemLLM tackles the aforementioned challenges by enabling dynamic interaction with the memory and improving the LLM's capabilities in using stored knowledge. Our experiments indicate that MemLLM enhances the LLM's performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular. We see MemLLM as an important step towards making LLMs more grounded and factual through memory augmentation.

[9]  arXiv:2404.11673 [pdf, other]
Title: Hairpin Completion Distance Lower Bound
Comments: To be published in CPM 2024
Subjects: Data Structures and Algorithms (cs.DS)

Hairpin completion, derived from the hairpin formation observed in DNA biochemistry, is an operation applied to strings, particularly useful in DNA computing. Conceptually, a right hairpin completion operation transforms a string $S$ into $S\cdot S'$ where $S'$ is the reverse complement of a prefix of $S$. Similarly, a left hairpin completion operation transforms a string $S$ into $S'\cdot S$ where $S'$ is the reverse complement of a suffix of $S$. The hairpin completion distance from $S$ to $T$ is the minimum number of hairpin completion operations needed to transform $S$ into $T$. Recently Boneh et al. showed an $O(n^2)$ time algorithm for finding the hairpin completion distance between two strings of length at most $n$. In this paper we show that for any $\varepsilon>0$ there is no $O(n^{2-\varepsilon})$-time algorithm for the hairpin completion distance problem unless the Strong Exponential Time Hypothesis (SETH) is false. Thus, under SETH, the time complexity of the hairpin completion distance problem is quadratic, up to sub-polynomial factors.

[10]  arXiv:2404.11677 [pdf, other]
Title: Cross-Problem Learning for Solving Vehicle Routing Problems
Subjects: Artificial Intelligence (cs.AI)

Existing neural heuristics often train a deep architecture from scratch for each specific vehicle routing problem (VRP), ignoring the transferable knowledge across different VRP variants. This paper proposes the cross-problem learning to assist heuristics training for different downstream VRP variants. Particularly, we modularize neural architectures for complex VRPs into 1) the backbone Transformer for tackling the travelling salesman problem (TSP), and 2) the additional lightweight modules for processing problem-specific features in complex VRPs. Accordingly, we propose to pre-train the backbone Transformer for TSP, and then apply it in the process of fine-tuning the Transformer models for each target VRP variant. On the one hand, we fully fine-tune the trained backbone Transformer and problem-specific modules simultaneously. On the other hand, we only fine-tune small adapter networks along with the modules, keeping the backbone Transformer still. Extensive experiments on typical VRPs substantiate that 1) the full fine-tuning achieves significantly better performance than the one trained from scratch, and 2) the adapter-based fine-tuning also delivers comparable performance while being notably parameter-efficient. Furthermore, we empirically demonstrate the favorable effect of our method in terms of cross-distribution application and versatility.

[11]  arXiv:2404.11681 [pdf, ps, other]
Title: Evaluating Tenant-Landlord Tensions Using Generative AI on Online Tenant Forums
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)

Tenant-landlord relationships exhibit a power asymmetry where landlords' power to evict the tenants at a low-cost results in their dominating status in such relationships. Tenant concerns are thus often unspoken, unresolved, or ignored and this could lead to blatant conflicts as suppressed tenant concerns accumulate. Modern machine learning methods and Large Language Models (LLM) have demonstrated immense abilities to perform language tasks. In this study, we incorporate Latent Dirichlet Allocation (LDA) with GPT-4 to classify Reddit post data scraped from the subreddit r/Tenant, aiming to unveil trends in tenant concerns while exploring the adoption of LLMs and machine learning methods in social science research. We find that tenant concerns in topics like fee dispute and utility issues are consistently dominant in all four states analyzed while each state has other common tenant concerns special to itself. Moreover, we discover temporal trends in tenant concerns that provide important implications regarding the impact of the pandemic and the Eviction Moratorium.

[12]  arXiv:2404.11682 [pdf, other]
Title: How Well Can You Articulate that Idea? Insights from Automated Formative Assessment
Subjects: Computation and Language (cs.CL)

Automated methods are becoming increasingly integrated into studies of formative feedback on students' science explanation writing. Most of this work, however, addresses students' responses to short answer questions. We investigate automated feedback on students' science explanation essays, where students must articulate multiple ideas. Feedback is based on a rubric that identifies the main ideas students are prompted to include in explanatory essays about the physics of energy and mass, given their experiments with a simulated roller coaster. We have found that students generally improve on revised versions of their essays. Here, however, we focus on two factors that affect the accuracy of the automated feedback. First, we find that the main ideas in the rubric differ with respect to how much freedom they afford in explanations of the idea, thus explanation of a natural law is relatively constrained. Students have more freedom in how they explain complex relations they observe in their roller coasters, such as transfer of different forms of energy. Second, by tracing the automated decision process, we can diagnose when a student's statement lacks sufficient clarity for the automated tool to associate it more strongly with one of the main ideas above all others. This in turn provides an opportunity for teachers and peers to help students reflect on how to state their ideas more clearly.

[13]  arXiv:2404.11683 [pdf, other]
Title: Unifying Scene Representation and Hand-Eye Calibration with 3D Foundation Models
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Representing the environment is a central challenge in robotics, and is essential for effective decision-making. Traditionally, before capturing images with a manipulator-mounted camera, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag. However, recent advances in computer vision have led to the development of \emph{3D foundation models}. These are large, pre-trained neural networks that can establish fast and accurate multi-view correspondences with very few images, even in the absence of rich visual features. This paper advocates for the integration of 3D foundation models into scene representation approaches for robotic systems equipped with manipulator-mounted RGB cameras. Specifically, we propose the Joint Calibration and Representation (JCR) method. JCR uses RGB images, captured by a manipulator-mounted camera, to simultaneously construct an environmental representation and calibrate the camera relative to the robot's end-effector, in the absence of specific calibration markers. The resulting 3D environment representation is aligned with the robot's coordinate frame and maintains physically accurate scales. We demonstrate that JCR can build effective scene representations using a low-cost RGB camera attached to a manipulator, without prior calibration.

[14]  arXiv:2404.11691 [pdf, ps, other]
Title: Improvement in Semantic Address Matching using Natural Language Processing
Comments: 5 pages, 7 tables, 2021 2nd International Conference for Emerging Technology (INCET)
Journal-ref: 2021 2nd International Conference for Emerging Technology (INCET), Belagavi, India, 2021, pp. 1-5
Subjects: Computation and Language (cs.CL)

Address matching is an important task for many businesses especially delivery and take out companies which help them to take out a certain address from their data warehouse. Existing solution uses similarity of strings, and edit distance algorithms to find out the similar addresses from the address database, but these algorithms could not work effectively with redundant, unstructured, or incomplete address data. This paper discuss semantic Address matching technique, by which we can find out a particular address from a list of possible addresses. We have also reviewed existing practices and their shortcoming. Semantic address matching is an essentially NLP task in the field of deep learning. Through this technique We have the ability to triumph the drawbacks of existing methods like redundant or abbreviated data problems. The solution uses the OCR on invoices to extract the address and create the data pool of addresses. Then this data is fed to the algorithm BM-25 for scoring the best matching entries. Then to observe the best result, this will pass through BERT for giving the best possible result from the similar queries. Our investigation exhibits that our methodology enormously improves both accuracy and review of cutting-edge technology existing techniques.

[15]  arXiv:2404.11698 [pdf, other]
Title: A Secure and Trustworthy Network Architecture for Federated Learning Healthcare Applications
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Federated Learning (FL) has emerged as a promising approach for privacy-preserving machine learning, particularly in sensitive domains such as healthcare. In this context, the TRUSTroke project aims to leverage FL to assist clinicians in ischemic stroke prediction. This paper provides an overview of the TRUSTroke FL network infrastructure. The proposed architecture adopts a client-server model with a central Parameter Server (PS). We introduce a Docker-based design for the client nodes, offering a flexible solution for implementing FL processes in clinical settings. The impact of different communication protocols (HTTP or MQTT) on FL network operation is analyzed, with MQTT selected for its suitability in FL scenarios. A control plane to support the main operations required by FL processes is also proposed. The paper concludes with an analysis of security aspects of the FL architecture, addressing potential threats and proposing mitigation strategies to increase the trustworthiness level.

[16]  arXiv:2404.11699 [pdf, other]
Title: Retrieval-Augmented Embodied Agents
Comments: CVPR2024
Subjects: Robotics (cs.RO)

Embodied agents operating in complex and uncertain environments face considerable challenges. While some advanced agents handle complex manipulation tasks with proficiency, their success often hinges on extensive training data to develop their capabilities. In contrast, humans typically rely on recalling past experiences and analogous situations to solve new problems. Aiming to emulate this human approach in robotics, we introduce the Retrieval-Augmented Embodied Agent (RAEA). This innovative system equips robots with a form of shared memory, significantly enhancing their performance. Our approach integrates a policy retriever, allowing robots to access relevant strategies from an external policy memory bank based on multi-modal inputs. Additionally, a policy generator is employed to assimilate these strategies into the learning process, enabling robots to formulate effective responses to tasks. Extensive testing of RAEA in both simulated and real-world scenarios demonstrates its superior performance over traditional methods, representing a major leap forward in robotic technology.

[17]  arXiv:2404.11706 [pdf, other]
Title: Pretraining Billion-scale Geospatial Foundational Models on Frontier
Subjects: Artificial Intelligence (cs.AI)

As AI workloads increase in scope, generalization capability becomes challenging for small task-specific models and their demand for large amounts of labeled training samples increases. On the contrary, Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning and have been shown to adapt to various tasks with minimal fine-tuning. Although large FMs have demonstrated significant impact in natural language processing and computer vision, efforts toward FMs for geospatial applications have been restricted to smaller size models, as pretraining larger models requires very large computing resources equipped with state-of-the-art hardware accelerators. Current satellite constellations collect 100+TBs of data a day, resulting in images that are billions of pixels and multimodal in nature. Such geospatial data poses unique challenges opening up new opportunities to develop FMs. We investigate billion scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data. We studied from end-to-end the performance and impact in the solution by scaling the model size. Our larger 3B parameter size model achieves up to 30% improvement in top1 scene classification accuracy when comparing a 100M parameter model. Moreover, we detail performance experiments on the Frontier supercomputer, America's first exascale system, where we study different model and data parallel approaches using PyTorch's Fully Sharded Data Parallel library. Specifically, we study variants of the Vision Transformer architecture (ViT), conducting performance analysis for ViT models with size up to 15B parameters. By discussing throughput and performance bottlenecks under different parallelism configurations, we offer insights on how to leverage such leadership-class HPC resources when developing large models for geospatial imagery applications.

[18]  arXiv:2404.11707 [pdf, other]
Title: Perspectives on Contractivity in Control, Optimization, and Learning
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)

Contraction theory is a mathematical framework for studying the convergence, robustness, and modularity properties of dynamical systems and algorithms. In this opinion paper, we provide five main opinions on the virtues of contraction theory. These opinions are (i) contraction theory is a unifying framework emerging from classical and modern works, (ii) contractivity is computationally-friendly, robust, and modular stability, (iii) numerous dynamical systems are contracting, (iv) contraction theory is relevant to modern applications, and (v) contraction theory can be vastly extended in numerous directions. We survey recent theoretical and applied research in each of these five directions.

[19]  arXiv:2404.11709 [pdf, ps, other]
Title: Satisfiability of commutative vs. non-commutative CSPs
Subjects: Computational Complexity (cs.CC); Logic in Computer Science (cs.LO); Quantum Physics (quant-ph)

The Mermin-Peres magic square is a celebrated example of a system of Boolean linear equations that is not (classically) satisfiable but is satisfiable via linear operators on a Hilbert space of dimension four. A natural question is then, for what kind of problems such a phenomenon occurs? Atserias, Kolaitis, and Severini answered this question for all Boolean Constraint Satisfaction Problems (CSPs): For 2-SAT, Horn-SAT, and Dual Horn-SAT, classical satisfiability and operator satisfiability is the same and thus there is no gap; for all other Boolean CSPs, the two notions differ as there is a gap, i.e., there are unsatisfiable instances that are satisfied via operators on a finite-dimensional Hilbert space. We generalize their result to CSPs on arbitrary finite domains: CSPs of so-called bounded-width have no satisfiability gap, whereas all other CSPs have a satisfiability gap.

[20]  arXiv:2404.11712 [pdf, ps, other]
Title: Numerical Analysis of Locally Adaptive Penalty Methods For The Navier-Stokes Equations
Authors: Rui Fang
Subjects: Numerical Analysis (math.NA)

Penalty methods relax the incompressibility condition and uncouple velocity and pressure. Experience with them indicates that the velocity error is sensitive to the choice of penalty parameter $\epsilon$. So far, there is no effective \'a prior formula for $\epsilon$. Recently, Xie developed an adaptive penalty scheme for the Stokes problem that picks the penalty parameter $\epsilon$ self-adaptively element by element small where $\nabla \cdot u^h$ is large. Her numerical tests gave accurate fluid predictions. The next natural step, developed here, is to extend the algorithm with supporting analysis to the non-linear, time-dependent incompressible Navier-Stokes equations. In this report, we prove its unconditional stability, control of $\|\nabla \cdot u^h\|$, and provide error estimates. We confirm the predicted convergence rates with numerical tests.

[21]  arXiv:2404.11714 [pdf, ps, other]
Title: Implementation and Evaluation of a Gradient Descent-Trained Defensible Blackboard Architecture System
Subjects: Artificial Intelligence (cs.AI)

A variety of forms of artificial intelligence systems have been developed. Two well-known techniques are neural networks and rule-fact expert systems. The former can be trained from presented data while the latter is typically developed by human domain experts. A combined implementation that uses gradient descent to train a rule-fact expert system has been previously proposed. A related system type, the Blackboard Architecture, adds an actualization capability to expert systems. This paper proposes and evaluates the incorporation of a defensible-style gradient descent training capability into the Blackboard Architecture. It also introduces the use of activation functions for defensible artificial intelligence systems and implements and evaluates a new best path-based training algorithm.

[22]  arXiv:2404.11716 [pdf, other]
Title: A Survey on Semantic Modeling for Building Energy Management
Comments: 29 pages, 6 figures, 2 tables
Subjects: Artificial Intelligence (cs.AI)

Buildings account for a substantial portion of global energy consumption. Reducing buildings' energy usage primarily involves obtaining data from building systems and environment, which are instrumental in assessing and optimizing the building's performance. However, as devices from various manufacturers represent their data in unique ways, this disparity introduces challenges for semantic interoperability and creates obstacles in developing scalable building applications. This survey explores the leading semantic modeling techniques deployed for energy management in buildings. Furthermore, it aims to offer tangible use cases for applying semantic models, shedding light on the pivotal concepts and limitations intrinsic to each model. Our findings will assist researchers in discerning the appropriate circumstances and methodologies for employing these models in various use cases.

[23]  arXiv:2404.11717 [pdf, other]
Title: How often are errors in natural language reasoning due to paraphrastic variability?
Comments: accepted to TACL 2024 (pre-MIT Press publication version)
Subjects: Computation and Language (cs.CL)

Large language models have been shown to behave inconsistently in response to meaning-preserving paraphrastic inputs. At the same time, researchers evaluate the knowledge and reasoning abilities of these models with test evaluations that do not disaggregate the effect of paraphrastic variability on performance. We propose a metric for evaluating the paraphrastic consistency of natural language reasoning models based on the probability of a model achieving the same correctness on two paraphrases of the same problem. We mathematically connect this metric to the proportion of a model's variance in correctness attributable to paraphrasing. To estimate paraphrastic consistency, we collect ParaNLU, a dataset of 7,782 human-written and validated paraphrased reasoning problems constructed on top of existing benchmark datasets for defeasible and abductive natural language inference. Using ParaNLU, we measure the paraphrastic consistency of several model classes and show that consistency dramatically increases with pretraining but not finetuning. All models tested exhibited room for improvement in paraphrastic consistency.

[24]  arXiv:2404.11718 [pdf, other]
Title: Linear and nonlinear filtering for a two-layer quasi-geostrophic ocean model
Comments: 34 pages, 13 figures
Subjects: Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)

Although the two-layer quasi-geostrophic equations (2QGE) are a simplified model for the dynamics of a stratified, wind-driven ocean, their numerical simulation is still plagued by the need for high resolution to capture the full spectrum of turbulent scales. Since such high resolution would lead to unreasonable computational times, it is typical to resort to coarse low-resolution meshes combined with the so-called eddy viscosity parameterization to account for the diffusion mechanisms that are not captured due to mesh under-resolution. We propose to enable the use of further coarsened meshes by adding a (linear or nonlinear) differential low-pass to the 2QGE, without changing the eddy viscosity coefficient. While the linear filter introduces constant (additional) artificial viscosity everywhere in the domain, the nonlinear filter relies on an indicator function to determine where and how much artificial viscosity is needed. Through several numerical results for a double-gyre wind forcing benchmark, we show that with the nonlinear filter we obtain accurate results with very coarse meshes, thereby drastically reducing the computational time (speed up ranging from 30 to 300).

[25]  arXiv:2404.11720 [pdf, other]
Title: GEOBIND: Binding Text, Image, and Audio through Satellite Images
Comments: 2024 IEEE International Geoscience and Remote Sensing Symposium
Subjects: Artificial Intelligence (cs.AI)

In remote sensing, we are interested in modeling various modalities for some geographic location. Several works have focused on learning the relationship between a location and type of landscape, habitability, audio, textual descriptions, etc. Recently, a common way to approach these problems is to train a deep-learning model that uses satellite images to infer some unique characteristics of the location. In this work, we present a deep-learning model, GeoBind, that can infer about multiple modalities, specifically text, image, and audio, from satellite imagery of a location. To do this, we use satellite images as the binding element and contrastively align all other modalities to the satellite image data. Our training results in a joint embedding space with multiple types of data: satellite image, ground-level image, audio, and text. Furthermore, our approach does not require a single complex dataset that contains all the modalities mentioned above. Rather it only requires multiple satellite-image paired data. While we only align three modalities in this paper, we present a general framework that can be used to create an embedding space with any number of modalities by using satellite images as the binding element. Our results show that, unlike traditional unimodal models, GeoBind is versatile and can reason about multiple modalities for a given satellite image input.

[26]  arXiv:2404.11721 [pdf, other]
Title: Functionality Locality, Mixture & Control = Logic = Memory
Authors: Xiangjun Peng
Subjects: Hardware Architecture (cs.AR)

This work provides new insights and constructs to the field of computer architecture and systems, and these insights are expected to be useful for the broad software stack. First, this work introduces Functionality Locality: this form of Functionality Locality shows that functionalities can be changed with a single piece of information, by solely changing the access order. This broadens the scope of ``principle of locality", which originally includes spatial and temporal locality. Second, this work coins the term Mixture, by incorporating the layout-directed functionalities with the original quantifiers such as scalar and vector. The implications of Mixture significantly expands new understanding of quantifiers, and this work identifies several important ones (from the author perspective). Third, with Functionality and Mixture, this work identifies the principle ``Control = Logic = Memory", and provides a revisit to Von Neumann architectures and Harvard architectures. This centers the focus on the memory, and brings further guidelines on memory-centric architectures with a new analytic framework. Fourth, this work discusses several important implications from this work in a variety of aspects.

[27]  arXiv:2404.11726 [pdf, other]
Title: Investigating Gender Bias in Turkish Language Models
Comments: arXiv admin note: text overlap with arXiv:1903.10561 by other authors
Subjects: Computation and Language (cs.CL)

Language models are trained mostly on Web data, which often contains social stereotypes and biases that the models can inherit. This has potentially negative consequences, as models can amplify these biases in downstream tasks or applications. However, prior research has primarily focused on the English language, especially in the context of gender bias. In particular, grammatically gender-neutral languages such as Turkish are underexplored despite representing different linguistic properties to language models with possibly different effects on biases. In this paper, we fill this research gap and investigate the significance of gender bias in Turkish language models. We build upon existing bias evaluation frameworks and extend them to the Turkish language by translating existing English tests and creating new ones designed to measure gender bias in the context of T\"urkiye. Specifically, we also evaluate Turkish language models for their embedded ethnic bias toward Kurdish people. Based on the experimental results, we attribute possible biases to different model characteristics such as the model size, their multilingualism, and the training corpora. We make the Turkish gender bias dataset publicly available.

[28]  arXiv:2404.11727 [pdf, ps, other]
Title: Deep Learning for Video-Based Assessment of Endotracheal Intubation Skills
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Endotracheal intubation (ETI) is an emergency procedure performed in civilian and combat casualty care settings to establish an airway. Objective and automated assessment of ETI skills is essential for the training and certification of healthcare providers. However, the current approach is based on manual feedback by an expert, which is subjective, time- and resource-intensive, and is prone to poor inter-rater reliability and halo effects. This work proposes a framework to evaluate ETI skills using single and multi-view videos. The framework consists of two stages. First, a 2D convolutional autoencoder (AE) and a pre-trained self-supervision network extract features from videos. Second, a 1D convolutional enhanced with a cross-view attention module takes the features from the AE as input and outputs predictions for skill evaluation. The ETI datasets were collected in two phases. In the first phase, ETI is performed by two subject cohorts: Experts and Novices. In the second phase, novice subjects perform ETI under time pressure, and the outcome is either Successful or Unsuccessful. A third dataset of videos from a single head-mounted camera for Experts and Novices is also analyzed. The study achieved an accuracy of 100% in identifying Expert/Novice trials in the initial phase. In the second phase, the model showed 85% accuracy in classifying Successful/Unsuccessful procedures. Using head-mounted cameras alone, the model showed a 96% accuracy on Expert and Novice classification while maintaining an accuracy of 85% on classifying successful and unsuccessful. In addition, GradCAMs are presented to explain the differences between Expert and Novice behavior and Successful and Unsuccessful trials. The approach offers a reliable and objective method for automated assessment of ETI skills.

[29]  arXiv:2404.11728 [pdf, other]
Title: Araucaria: Simplifying INC Fault Tolerance with High-Level Intents
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)

Network programmability allows modification of fine-grain data plane functionality. The performance benefits of data plane programmability have motivated many researchers to offload computation that previously operated only on servers to the network, creating the notion of in-network computing (INC). Because failures can occur in the data plane, fault tolerance mechanisms are essential for INC. However, INC operators and developers must manually set fault tolerance requirements using domain knowledge to change the source code. These manually set requirements may take time and lead to errors in case of misconfiguration. In this work, we present Araucaria, a system that aims to simplify the definition and implementation of fault tolerance requirements for INC. The system allows requirements specification using an intent language, which enables the expression of consistency and availability requirements in a constrained natural language. A refinement process translates the intent and incorporates the essential building blocks and configurations into the INC code. We present a prototype of Araucaria and analyze the end-to-end system behavior. Experiments demonstrate that the refinement scales to multiple intents and that the system provides fault tolerance with negligible overhead in failure scenarios.

[30]  arXiv:2404.11730 [pdf, other]
Title: Missed Connections: Lateral Thinking Puzzles for Large Language Models
Comments: 8 pages, 3 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. Solving the puzzle requires both common linguistic knowledge (i.e. definitions and typical usage) as well as, in many cases, lateral or abstract thinking. This is because the four categories ascend in complexity, with the most challenging category often requiring thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern large language models (LLMs). We report their accuracy on the task, measure the impacts of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.

[31]  arXiv:2404.11731 [pdf, other]
Title: A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search
Subjects: Information Retrieval (cs.IR)

A critical piece of the modern information retrieval puzzle is approximate nearest neighbor search. Its objective is to return a set of $k$ data points that are closest to a query point, with its accuracy measured by the proportion of exact nearest neighbors captured in the returned set. One popular approach to this question is clustering: The indexing algorithm partitions data points into non-overlapping subsets and represents each partition by a point such as its centroid. The query processing algorithm first identifies the nearest clusters -- a process known as routing -- then performs a nearest neighbor search over those clusters only. In this work, we make a simple observation: The routing function solves a ranking problem. Its quality can therefore be assessed with a ranking metric, making the function amenable to learning-to-rank. Interestingly, ground-truth is often freely available: Given a query distribution in a top-$k$ configuration, the ground-truth is the set of clusters that contain the exact top-$k$ vectors. We develop this insight and apply it to Maximum Inner Product Search (MIPS). As we demonstrate empirically on various datasets, learning a simple linear function consistently improves the accuracy of clustering-based MIPS.

[32]  arXiv:2404.11732 [pdf, other]
Title: Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach
Comments: Accepted at CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The emergence of attention-based transformer models has led to their extensive use in various tasks, due to their superior generalization and transfer properties. Recent research has demonstrated that such models, when prompted appropriately, are excellent for few-shot inference. However, such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work, we examine the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is to achieve strong performance not only on novel categories with limited examples, but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions. Additionally, we introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data. This mechanism enriches the novel prompts without deteriorating the base class performance. Overall, this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets: COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or transduction). Furthermore, test-time optimization leveraging unlabelled test data can be used to improve the prompts, which we refer to as transductive prompt tuning.

[33]  arXiv:2404.11734 [pdf, other]
Title: Let's Ask AI About Their Programs: Exploring ChatGPT's Answers To Program Comprehension Questions
Subjects: Computers and Society (cs.CY)

Recent research has explored the creation of questions from code submitted by students. These Questions about Learners' Code (QLCs) are created through program analysis, exploring execution paths, and then creating code comprehension questions from these paths and the broader code structure. Responding to the questions requires reading and tracing the code, which is known to support students' learning. At the same time, computing education researchers have witnessed the emergence of Large Language Models (LLMs) that have taken the community by storm. Researchers have demonstrated the applicability of these models especially in the introductory programming context, outlining their performance in solving introductory programming problems and their utility in creating new learning resources. In this work, we explore the capability of the state-of-the-art LLMs (GPT-3.5 and GPT-4) in answering QLCs that are generated from code that the LLMs have created. Our results show that although the state-of-the-art LLMs can create programs and trace program execution when prompted, they easily succumb to similar errors that have previously been recorded for novice programmers. These results demonstrate the fallibility of these models and perhaps dampen the expectations fueled by the recent LLM hype. At the same time, we also highlight future research possibilities such as using LLMs to mimic students as their behavior can indeed be similar for some specific tasks.

[34]  arXiv:2404.11735 [pdf, other]
Title: Learning with 3D rotations, a hitchhiker's guide to SO(3)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Many settings in machine learning require the selection of a rotation representation. However, choosing a suitable representation from the many available options is challenging. This paper acts as a survey and guide through rotation representations. We walk through their properties that harm or benefit deep learning with gradient-based optimization. By consolidating insights from rotation-based learning, we provide a comprehensive overview of learning functions with rotation representations. We provide guidance on selecting representations based on whether rotations are in the model's input or output and whether the data primarily comprises small angles.

[35]  arXiv:2404.11737 [pdf, other]
Title: Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection
Comments: technical report
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information of geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework by considering both spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, and flip, rotation and scene flow. For spatial augmentations, we find that depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We show our pre-training method for 3D object detection which outperforms existing equivariant and invariant approaches in many settings.

[36]  arXiv:2404.11740 [pdf, ps, other]
Title: Simulating Cloud Environments of Connected Vehicles for Anomaly Detection
Comments: 11 pages, 10 figures
Subjects: Emerging Technologies (cs.ET); Distributed, Parallel, and Cluster Computing (cs.DC)

The emergence of connected vehicles is driven by increasing customer and regulatory demands. To meet these, more complex software applications, some of which require service-based cloud and edge backends, are developed. When new software is deployed however, the high complexity and interdependencies between components can lead to unforeseen side effects in other system parts. As such, it becomes more challenging to recognize whether deviations to the intended system behavior are occurring, ultimately resulting in higher monitoring efforts and slower responses to errors. To overcome this problem, a simulation of the cloud environment running in parallel to the system is proposed. This approach enables the live comparison between simulated and real cloud behavior. Therefore, a concept is developed mirroring the existing cloud system into a simulation. To collect the necessary data, an observability platform is presented, capturing telemetry and architecture information. Subsequently, a simulation environment is designed that converts the architecture into a simulation model and simulates its dynamic workload by utilizing captured communication data. The proposed concept is evaluated in a real-world application scenario for electric vehicle charging: Vehicles can apply for an unoccupied charging station at a cloud service backend, the latter which manages all incoming requests and performs the assignment. Benchmarks are conducted by comparing the collected telemetry data with the simulated results under different loads and injected faults. The results show that regular cloud behavior is mirrored well by the simulation and that misbehavior due to fault injection is well visible, indicating that simulations are a promising data source for anomaly detection in connected vehicle cloud environments during operation.

[37]  arXiv:2404.11742 [pdf, other]
Title: Meta-Decomposition: Dynamic Segmentation Approach Selection in IoT-based Activity Recognition
Subjects: Artificial Intelligence (cs.AI)

Internet of Things (IoT) devices generate heterogeneous data over time; and relying solely on individual data points is inadequate for accurate analysis.
Segmentation is a common preprocessing step in many IoT applications, including IoT-based activity recognition, aiming to address the limitations of individual events and streamline the process. However, this step introduces at least two families of uncontrollable biases. The first is caused by the changes made by the segmentation process on the initial problem space, such as dividing the input data into 60 seconds windows. The second category of biases results from the segmentation process itself, including the fixation of the segmentation method and its parameters.
To address these biases, we propose to redefine the segmentation problem as a special case of a decomposition problem, including three key components: a decomposer, resolutions, and a composer.
The inclusion of the composer task in the segmentation process facilitates an assessment of the relationship between the original problem and the problem after the segmentation. Therefore, It leads to an improvement in the evaluation process and, consequently, in the selection of the appropriate segmentation method.
Then, we formally introduce our novel meta-decomposition or learning-to-decompose approach. It reduces the segmentation biases by considering the segmentation as a hyperparameter to be optimized by the outer learning problem. Therefore, meta-decomposition improves the overall system performance by dynamically selecting the appropriate segmentation method without including the mentioned biases. Extensive experiments on four real-world datasets demonstrate the effectiveness of our proposal.

[38]  arXiv:2404.11744 [pdf, other]
Title: Incremental Bootstrapping and Classification of Structured Scenes in a Fuzzy Ontology
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Logic in Computer Science (cs.LO); Robotics (cs.RO)

We foresee robots that bootstrap knowledge representations and use them for classifying relevant situations and making decisions based on future observations. Particularly for assistive robots, the bootstrapping mechanism might be supervised by humans who should not repeat a training phase several times and should be able to refine the taught representation. We consider robots that bootstrap structured representations to classify some intelligible categories. Such a structure should be incrementally bootstrapped, i.e., without invalidating the identified category models when a new additional category is considered. To tackle this scenario, we presented the Scene Identification and Tagging (SIT) algorithm, which bootstraps structured knowledge representation in a crisp OWL-DL ontology. Over time, SIT bootstraps a graph representing scenes, sub-scenes and similar scenes. Then, SIT can classify new scenes within the bootstrapped graph through logic-based reasoning. However, SIT has issues with sensory data because its crisp implementation is not robust to perception noises. This paper presents a reformulation of SIT within the fuzzy domain, which exploits a fuzzy DL ontology to overcome the robustness issues. By comparing the performances of fuzzy and crisp implementations of SIT, we show that fuzzy SIT is robust, preserves the properties of its crisp formulation, and enhances the bootstrapped representations. On the contrary, the fuzzy implementation of SIT leads to less intelligible knowledge representations than the one bootstrapped in the crisp domain.

[39]  arXiv:2404.11746 [pdf, ps, other]
Title: On the Representation of Block Languages
Subjects: Formal Languages and Automata Theory (cs.FL)

In this paper we consider block languages, namely sets of words having the same length, and we propose a new representation for these languages. In particular, given an alphabet of size $k$ and a length $\ell$, these languages can be represented by bitmaps of size $k^\ell$, in which each bit indicates whether the correspondent word, according to the lexicographical order, belongs to the language (bit equal to 1) or not (bit equal to 0). This representation turns out to be a good tool for the investigation of several properties of block languages, making proofs simpler and reasoning clearer. After showing how to convert bitmaps into minimal deterministic and nondeterministic finite automata, we use this representation as a tool to study the deterministic and nondeterministic state complexity of block languages, as well as the costs of basic operations on block languages, in terms of the sizes of the equivalent finite automata.

[40]  arXiv:2404.11752 [pdf, ps, other]
Title: Mapping Violence: Developing an Extensive Framework to Build a Bangla Sectarian Expression Dataset from Social Media Interactions
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)

Communal violence in online forums has become extremely prevalent in South Asia, where many communities of different cultures coexist and share resources. These societies exhibit a phenomenon characterized by strong bonds within their own groups and animosity towards others, leading to conflicts that frequently escalate into violent confrontations. To address this issue, we have developed the first comprehensive framework for the automatic detection of communal violence markers in online Bangla content accompanying the largest collection (13K raw sentences) of social media interactions that fall under the definition of four major violence class and their 16 coarse expressions. Our workflow introduces a 7-step expert annotation process incorporating insights from social scientists, linguists, and psychologists. By presenting data statistics and benchmarking performance using this dataset, we have determined that, aside from the category of Non-communal violence, Religio-communal violence is particularly pervasive in Bangla text. Moreover, we have substantiated the effectiveness of fine-tuning language models in identifying violent comments by conducting preliminary benchmarking on the state-of-the-art Bangla deep learning model.

[41]  arXiv:2404.11753 [pdf, other]
Title: Virtual Foundry Graphnet for Metal Sintering Deformation Prediction
Subjects: Machine Learning (cs.LG)

Metal Sintering is a necessary step for Metal Injection Molded parts and binder jet such as HP's metal 3D printer. The metal sintering process introduces large deformation varying from 25 to 50% depending on the green part porosity. In this paper, we use a graph-based deep learning approach to predict the part deformation, which can speed up the deformation simulation substantially at the voxel level. Running a well-trained Metal Sintering inferencing engine only takes a range of seconds to obtain the final sintering deformation value. The tested accuracy on example complex geometry achieves 0.7um mean deviation for a 63mm testing part.

[42]  arXiv:2404.11754 [pdf, other]
Title: Improved Generalization Bounds for Communication Efficient Federated Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

This paper focuses on reducing the communication cost of federated learning by exploring generalization bounds and representation learning. We first characterize a tighter generalization bound for one-round federated learning based on local clients' generalizations and heterogeneity of data distribution (non-iid scenario). We also characterize a generalization bound in R-round federated learning and its relation to the number of local updates (local stochastic gradient descents (SGDs)). Then, based on our generalization bound analysis and our representation learning interpretation of this analysis, we show for the first time that less frequent aggregations, hence more local updates, for the representation extractor (usually corresponds to initial layers) leads to the creation of more generalizable models, particularly for non-iid scenarios. We design a novel Federated Learning with Adaptive Local Steps (FedALS) algorithm based on our generalization bound and representation learning analysis. FedALS employs varying aggregation frequencies for different parts of the model, so reduces the communication cost. The paper is followed with experimental results showing the effectiveness of FedALS.

[43]  arXiv:2404.11755 [pdf, other]
Title: The Ramshaw-Mesina Hybrid Algorithm applied to the Navier Stokes Equations
Comments: 19 pages, 7 figures,1 table
Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)

In 1991, Ramshaw and Mesina proposed a novel synthesis of penalty methods and artificial compression methods. When the two were balanced they found the combination was 3-4 orders more accurate than either alone. This report begins the study of their interesting method applied to the Navier-Stokes equations. We perform stability analysis, semi-discrete error analysis, and tests of the algorithm. Although most of the results for implicit time discretizations of our numerical tests comply with theirs for explicit time discretizations, the behavior in damping pressure oscillations and violations of incompressibility are different from their findings and our heuristic analysis.

[44]  arXiv:2404.11757 [pdf, other]
Title: Language Models Still Struggle to Zero-shot Reason about Time Series
Subjects: Computation and Language (cs.CL)

Time series are critical for decision-making in fields like finance and healthcare. Their importance has driven a recent influx of works passing time series into language models, leading to non-trivial forecasting on some datasets. But it remains unknown whether non-trivial forecasting implies that language models can reason about time series. To address this gap, we generate a first-of-its-kind evaluation framework for time series reasoning, including formal tasks and a corresponding dataset of multi-scale time series paired with text captions across ten domains. Using these data, we probe whether language models achieve three forms of reasoning: (1) Etiological Reasoning - given an input time series, can the language model identify the scenario that most likely created it? (2) Question Answering - can a language model answer factual questions about time series? (3) Context-Aided Forecasting - does highly relevant textual context improve a language model's time series forecasts?
We find that otherwise highly-capable language models demonstrate surprisingly limited time series reasoning: they score marginally above random on etiological and question answering tasks (up to 30 percentage points worse than humans) and show modest success in using context to improve forecasting. These weakness showcase that time series reasoning is an impactful, yet deeply underdeveloped direction for language model research. We also make our datasets and code public at to support further research in this direction at https://github.com/behavioral-data/TSandLanguage

[45]  arXiv:2404.11760 [pdf, other]
Title: Predictive Model Development to Identify Failed Healing in Patients after Non-Union Fracture Surgery
Comments: To be presented at the 46th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC 2024)
Subjects: Machine Learning (cs.LG)

Bone non-union is among the most severe complications associated with trauma surgery, occurring in 10-30% of cases after long bone fractures. Treating non-unions requires a high level of surgical expertise and often involves multiple revision surgeries, sometimes even leading to amputation. Thus, more accurate prognosis is crucial for patient well-being. Recent advances in machine learning (ML) hold promise for developing models to predict non-union healing, even when working with smaller datasets, a commonly encountered challenge in clinical domains. To demonstrate the effectiveness of ML in identifying candidates at risk of failed non-union healing, we applied three ML models (logistic regression, support vector machine, and XGBoost) to the clinical dataset TRUFFLE, which includes 797 patients with long bone non-union. The models provided prediction results with 70% sensitivity, and the specificities of 66% (XGBoost), 49% (support vector machine), and 43% (logistic regression). These findings offer valuable clinical insights because they enable early identification of patients at risk of failed non-union healing after the initial surgical revision treatment protocol.

[46]  arXiv:2404.11762 [pdf, other]
Title: IrrNet: Advancing Irrigation Mapping with Incremental Patch Size Training on Remote Sensing Imagery
Comments: Full version of the paper will be appearing in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Irrigation mapping plays a crucial role in effective water management, essential for preserving both water quality and quantity, and is key to mitigating the global issue of water scarcity. The complexity of agricultural fields, adorned with diverse irrigation practices, especially when multiple systems coexist in close quarters, poses a unique challenge. This complexity is further compounded by the nature of Landsat's remote sensing data, where each pixel is rich with densely packed information, complicating the task of accurate irrigation mapping. In this study, we introduce an innovative approach that employs a progressive training method, which strategically increases patch sizes throughout the training process, utilizing datasets from Landsat 5 and 7, labeled with the WRLU dataset for precise labeling. This initial focus allows the model to capture detailed features, progressively shifting to broader, more general features as the patch size enlarges. Remarkably, our method enhances the performance of existing state-of-the-art models by approximately 20%. Furthermore, our analysis delves into the significance of incorporating various spectral bands into the model, assessing their impact on performance. The findings reveal that additional bands are instrumental in enabling the model to discern finer details more effectively. This work sets a new standard for leveraging remote sensing imagery in irrigation mapping.

[47]  arXiv:2404.11763 [pdf, other]
Title: The Code the World Depends On: A First Look at Technology Makers' Open Source Software Dependencies
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)

Open-source software (OSS) supply chain security has become a topic of concern for organizations. Patching an OSS vulnerability can require updating other dependent software products in addition to the original package. However, the landscape of OSS dependencies is not well explored: we do not know what packages are most critical to patch, hindering efforts to improve OSS security where it is most needed. There is thus a need to understand OSS usage in major software and device makers' products. Our work takes a first step toward closing this knowledge gap. We investigate published OSS dependency information for 108 major software and device makers, cataloging how available and how detailed this information is and identifying the OSS packages that appear the most frequently in our data.

[48]  arXiv:2404.11764 [pdf, other]
Title: Multimodal 3D Object Detection on Unseen Domains
Comments: technical report
Subjects: Computer Vision and Pattern Recognition (cs.CV)

LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.

[49]  arXiv:2404.11766 [pdf, other]
Title: End-to-End Mesh Optimization of a Hybrid Deep Learning Black-Box PDE Solver
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)

Deep learning has been widely applied to solve partial differential equations (PDEs) in computational fluid dynamics. Recent research proposed a PDE correction framework that leverages deep learning to correct the solution obtained by a PDE solver on a coarse mesh. However, end-to-end training of such a PDE correction model over both solver-dependent parameters such as mesh parameters and neural network parameters requires the PDE solver to support automatic differentiation through the iterative numerical process. Such a feature is not readily available in many existing solvers. In this study, we explore the feasibility of end-to-end training of a hybrid model with a black-box PDE solver and a deep learning model for fluid flow prediction. Specifically, we investigate a hybrid model that integrates a black-box PDE solver into a differentiable deep graph neural network. To train this model, we use a zeroth-order gradient estimator to differentiate the PDE solver via forward propagation. Although experiments show that the proposed approach based on zeroth-order gradient estimation underperforms the baseline that computes exact derivatives using automatic differentiation, our proposed method outperforms the baseline trained with a frozen input mesh to the solver. Moreover, with a simple warm-start on the neural network parameters, we show that models trained by these zeroth-order algorithms achieve an accelerated convergence and improved generalization performance.

[50]  arXiv:2404.11769 [pdf, other]
Title: QGen: On the Ability to Generalize in Quantization Aware Training
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications on model performance. In particular, first, we develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound for the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 models trained on CIFAR-10, CIFAR-100, and ImageNet datasets on convolutional and transformer-based models.

[51]  arXiv:2404.11770 [pdf, other]
Title: Event-Based Eye Tracking. AIS 2024 Challenge Survey
Comments: Qinyu Chen is the corresponding author
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge. The task of the challenge focuses on processing eye movement recorded with event cameras and predicting the pupil center of the eye. The challenge emphasizes efficient eye tracking with event cameras to achieve good task accuracy and efficiency trade-off. During the challenge period, 38 participants registered for the Kaggle competition, and 8 teams submitted a challenge factsheet. The novel and diverse methods from the submitted factsheets are reviewed and analyzed in this survey to advance future event-based eye tracking research.

[52]  arXiv:2404.11771 [pdf, ps, other]
Title: IoT-Driven Cloud-based Energy and Environment Monitoring System for Manufacturing Industry
Subjects: Systems and Control (eess.SY)

This research focused on the development of a cost-effective IoT solution for energy and environment monitoring geared towards manufacturing industries. The proposed system is developed using open-source software that can be easily deployed in any manufacturing environment. The system collects real-time temperature, humidity, and energy data from different devices running on different communication such as TCP/IP, Modbus, etc., and the data is transferred wirelessly using an MQTT client to a database working as a cloud storage solution. The collected data is then visualized and analyzed using a website running on a host machine working as a web client.

[53]  arXiv:2404.11773 [pdf, other]
Title: Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommendation Systems
Comments: Accepted by the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Large Language Models (LLMs) have demonstrated great potential in Conversational Recommender Systems (CRS). However, the application of LLMs to CRS has exposed a notable discrepancy in behavior between LLM-based CRS and human recommenders: LLMs often appear inflexible and passive, frequently rushing to complete the recommendation task without sufficient inquiry.This behavior discrepancy can lead to decreased accuracy in recommendations and lower user satisfaction. Despite its importance, existing studies in CRS lack a study about how to measure such behavior discrepancy. To fill this gap, we propose Behavior Alignment, a new evaluation metric to measure how well the recommendation strategies made by a LLM-based CRS are consistent with human recommenders'. Our experiment results show that the new metric is better aligned with human preferences and can better differentiate how systems perform than existing evaluation metrics. As Behavior Alignment requires explicit and costly human annotations on the recommendation strategies, we also propose a classification-based method to implicitly measure the Behavior Alignment based on the responses. The evaluation results confirm the robustness of the method.

[54]  arXiv:2404.11776 [pdf, ps, other]
Title: 3D object quality prediction for Metal Jet Printer with Multimodal thermal encoder
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

With the advancements in 3D printing technologies, it is extremely important that the quality of 3D printed objects, and dimensional accuracies should meet the customer's specifications. Various factors during metal printing affect the printed parts' quality, including the power quality, the printing stage parameters, the print part's location inside the print bed, the curing stage parameters, and the metal sintering process. With the large data gathered from HP's MetJet printing process, AI techniques can be used to analyze, learn, and effectively infer the printed part quality metrics, as well as assist in improving the print yield. In-situ thermal sensing data captured by printer-installed thermal sensors contains the part thermal signature of fusing layers. Such part thermal signature contains a convoluted impact from various factors. In this paper, we use a multimodal thermal encoder network to fuse data of a different nature including the video data vectorized printer control data, and exact part thermal signatures with a trained encoder-decoder module. We explored the data fusing techniques and stages for data fusing, the optimized end-to-end model architecture indicates an improved part quality prediction accuracy.

[55]  arXiv:2404.11778 [pdf, other]
Title: CU-Mamba: Selective State Space Models with Channel Learning for Image Restoration
Authors: Rui Deng, Tianpei Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Reconstructing degraded images is a critical task in image processing. Although CNN and Transformer-based models are prevalent in this field, they exhibit inherent limitations, such as inadequate long-range dependency modeling and high computational costs. To overcome these issues, we introduce the Channel-Aware U-Shaped Mamba (CU-Mamba) model, which incorporates a dual State Space Model (SSM) framework into the U-Net architecture. CU-Mamba employs a Spatial SSM module for global context encoding and a Channel SSM component to preserve channel correlation features, both in linear computational complexity relative to the feature map size. Extensive experimental results validate CU-Mamba's superiority over existing state-of-the-art methods, underscoring the importance of integrating both spatial and channel contexts in image restoration.

[56]  arXiv:2404.11782 [pdf, other]
Title: REQUAL-LM: Reliability and Equity through Aggregation in Large Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

The extensive scope of large language models (LLMs) across various domains underscores the critical importance of responsibility in their application, beyond natural language processing. In particular, the randomized nature of LLMs, coupled with inherent biases and historical stereotypes in data, raises critical concerns regarding reliability and equity. Addressing these challenges are necessary before using LLMs for applications with societal impact. Towards addressing this gap, we introduce REQUAL-LM, a novel method for finding reliable and equitable LLM outputs through aggregation. Specifically, we develop a Monte Carlo method based on repeated sampling to find a reliable output close to the mean of the underlying distribution of possible outputs. We formally define the terms such as reliability and bias, and design an equity-aware aggregation to minimize harmful bias while finding a highly reliable output. REQUAL-LM does not require specialized hardware, does not impose a significant computing load, and uses LLMs as a blackbox. This design choice enables seamless scalability alongside the rapid advancement of LLM technologies. Our system does not require retraining the LLMs, which makes it deployment ready and easy to adapt. Our comprehensive experiments using various tasks and datasets demonstrate that REQUAL- LM effectively mitigates bias and selects a more equitable response, specifically the outputs that properly represents minority groups.

[57]  arXiv:2404.11784 [pdf, other]
Title: Analysis of Evolutionary Diversity Optimisation for the Maximum Matching Problem
Subjects: Neural and Evolutionary Computing (cs.NE)

This paper explores the enhancement of solution diversity in evolutionary algorithms (EAs) for the maximum matching problem, concentrating on complete bipartite graphs and paths. We adopt binary string encoding for matchings and use Hamming distance to measure diversity, aiming for its maximization. Our study centers on the $(\mu+1)$-EA and $2P-EA_D$, which are applied to optimize diversity. We provide a rigorous theoretical and empirical analysis of these algorithms.
For complete bipartite graphs, our runtime analysis shows that, with a reasonably small $\mu$, the $(\mu+1)$-EA achieves maximal diversity with an expected runtime of $O(\mu^2 m^4 \log(m))$ for the small gap case (where the population size $\mu$ is less than the difference in the sizes of the bipartite partitions) and $O(\mu^2 m^2 \log(m))$ otherwise. For paths, we establish an upper runtime bound of $O(\mu^3 m^3)$. The $2P-EA_D$ displays stronger performance, with bounds of $O(\mu^2 m^2 \log(m))$ for the small gap case, $O(\mu^2 n^2 \log(n))$ otherwise, and $O(\mu^3 m^2)$ for paths. Here, $n$ represents the total number of vertices and $m$ the number of edges. Our empirical studies, which examine the scaling behavior with respect to $m$ and $\mu$, complement these theoretical insights and suggest potential for further refinement of the runtime bounds.

[58]  arXiv:2404.11788 [pdf, other]
Title: NonGEMM Bench: Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)

Machine Learning (ML) operators are the building blocks to design ML models with various target applications. GEneral Matrix Multiplication (GEMM) operators are the backbone of ML models. They are notorious for being computationally expensive requiring billions of multiply-and-accumulate. Therefore, significant effort has been put to study and optimize the GEMM operators in order to speed up the execution of ML models. GPUs and accelerators are widely deployed to accelerate ML workloads by optimizing the execution of GEMM operators. Nonetheless, the performance of NonGEMM operators have not been studied as thoroughly as GEMMs. Therefore, this paper describes \bench, a benchmark to study NonGEMM operators. We first construct \bench using popular ML workloads from different domains, then perform case studies on various grade GPU platforms to analyze the behavior of NonGEMM operators in GPU accelerated systems. Finally, we present some key takeaways to bridge the gap between GEMM and NonGEMM operators and to offer the community with potential new optimization directions.

[59]  arXiv:2404.11791 [pdf, other]
Title: Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing
Subjects: Information Retrieval (cs.IR)

The powerful generative abilities of large language models (LLMs) show potential in generating relevance labels for search applications. Previous work has found that directly asking about relevancy, such as ``How relevant is document A to query Q?", results in sub-optimal ranking. Instead, the pairwise ranking prompting (PRP) approach produces promising ranking performance through asking about pairwise comparisons, e.g., ``Is document A more relevant than document B to query Q?". Thus, while LLMs are effective at their ranking ability, this is not reflected in their relevance label generation. In this work, we propose a post-processing method to consolidate the relevance labels generated by an LLM with its powerful ranking abilities. Our method takes both LLM generated relevance labels and pairwise preferences. The labels are then altered to satisfy the pairwise preferences of the LLM, while staying as close to the original values as possible. Our experimental results indicate that our approach effectively balances label accuracy and ranking performance. Thereby, our work shows it is possible to combine both the ranking and labeling abilities of LLMs through post-processing.

[60]  arXiv:2404.11792 [pdf, other]
Title: Enhancing Q&A with Domain-Specific Fine-Tuning and Iterative Reasoning: A Comparative Study
Comments: 15 pages, 5 figures
Subjects: Artificial Intelligence (cs.AI)

This paper investigates the impact of domain-specific model fine-tuning and of reasoning mechanisms on the performance of question-answering (Q&A) systems powered by large language models (LLMs) and Retrieval-Augmented Generation (RAG). Using the FinanceBench SEC financial filings dataset, we observe that, for RAG, combining a fine-tuned embedding model with a fine-tuned LLM achieves better accuracy than generic models, with relatively greater gains attributable to fine-tuned embedding models. Additionally, employing reasoning iterations on top of RAG delivers an even bigger jump in performance, enabling the Q&A systems to get closer to human-expert quality. We discuss the implications of such findings, propose a structured technical design space capturing major technical components of Q&A AI, and provide recommendations for making high-impact technical choices for such components. We plan to follow up on this work with actionable guides for AI teams and further investigations into the impact of domain-specific augmentation in RAG and into agentic AI capabilities such as advanced planning and reasoning.

[61]  arXiv:2404.11793 [pdf, other]
Title: Enhancing Argument Summarization: Prioritizing Exhaustiveness in Key Point Generation and Introducing an Automatic Coverage Evaluation Metric
Comments: NAACL 2024 Main Conference
Subjects: Computation and Language (cs.CL)

The proliferation of social media platforms has given rise to the amount of online debates and arguments. Consequently, the need for automatic summarization methods for such debates is imperative, however this area of summarization is rather understudied. The Key Point Analysis (KPA) task formulates argument summarization as representing the summary of a large collection of arguments in the form of concise sentences in bullet-style format, called key points. A sub-task of KPA, called Key Point Generation (KPG), focuses on generating these key points given the arguments. This paper introduces a novel extractive approach for key point generation, that outperforms previous state-of-the-art methods for the task. Our method utilizes an extractive clustering based approach that offers concise, high quality generated key points with higher coverage of reference summaries, and less redundant outputs. In addition, we show that the existing evaluation metrics for summarization such as ROUGE are incapable of differentiating between generated key points of different qualities. To this end, we propose a new evaluation metric for assessing the generated key points by their coverage. Our code can be accessed online.

[62]  arXiv:2404.11795 [pdf, other]
Title: Prompt-Driven Feature Diffusion for Open-World Semi-Supervised Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

In this paper, we present a novel approach termed Prompt-Driven Feature Diffusion (PDFD) within a semi-supervised learning framework for Open World Semi-Supervised Learning (OW-SSL). At its core, PDFD deploys an efficient feature-level diffusion model with the guidance of class-specific prompts to support discriminative feature representation learning and feature generation, tackling the challenge of the non-availability of labeled data for unseen classes in OW-SSL. In particular, PDFD utilizes class prototypes as prompts in the diffusion model, leveraging their class-discriminative and semantic generalization ability to condition and guide the diffusion process across all the seen and unseen classes. Furthermore, PDFD incorporates a class-conditional adversarial loss for diffusion model training, ensuring that the features generated via the diffusion process can be discriminatively aligned with the class-conditional features of the real data. Additionally, the class prototypes of the unseen classes are computed using only unlabeled instances with confident predictions within a semi-supervised learning framework. We conduct extensive experiments to evaluate the proposed PDFD. The empirical results show PDFD exhibits remarkable performance enhancements over many state-of-the-art existing methods.

[63]  arXiv:2404.11797 [pdf, other]
Title: When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Foundation models, i.e., very large deep learning models, have demonstrated impressive performances in various language and vision tasks that are otherwise difficult to reach using smaller-size models. The major success of GPT-type of language models is particularly exciting and raises expectations on the potential of foundation models in other domains including satellite remote sensing. In this context, great efforts have been made to build foundation models to test their capabilities in broader applications, and examples include Prithvi by NASA-IBM, Segment-Anything-Model, ViT, etc. This leads to an important question: Are foundation models always a suitable choice for different remote sensing tasks, and when or when not? This work aims to enhance the understanding of the status and suitability of foundation models for pixel-level classification using multispectral imagery at moderate resolution, through comparisons with traditional machine learning (ML) and regular-size deep learning models. Interestingly, the results reveal that in many scenarios traditional ML models still have similar or better performance compared to foundation models, especially for tasks where texture is less useful for classification. On the other hand, deep learning models did show more promising results for tasks where labels partially depend on texture (e.g., burn scar), while the difference in performance between foundation models and deep learning models is not obvious. The results conform with our analysis: The suitability of foundation models depend on the alignment between the self-supervised learning tasks and the real downstream tasks, and the typical masked autoencoder paradigm is not necessarily suitable for many remote sensing problems.

[64]  arXiv:2404.11798 [pdf, other]
Title: Establishing a Baseline for Gaze-driven Authentication Performance in VR: A Breadth-First Investigation on a Very Large Dataset
Comments: 28 pages, 18 figures, 5 tables, includes supplementary material
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

This paper performs the crucial work of establishing a baseline for gaze-driven authentication performance to begin answering fundamental research questions using a very large dataset of gaze recordings from 9202 people with a level of eye tracking (ET) signal quality equivalent to modern consumer-facing virtual reality (VR) platforms. The size of the employed dataset is at least an order-of-magnitude larger than any other dataset from previous related work. Binocular estimates of the optical and visual axes of the eyes and a minimum duration for enrollment and verification are required for our model to achieve a false rejection rate (FRR) of below 3% at a false acceptance rate (FAR) of 1 in 50,000. In terms of identification accuracy which decreases with gallery size, we estimate that our model would fall below chance-level accuracy for gallery sizes of 148,000 or more. Our major findings indicate that gaze authentication can be as accurate as required by the FIDO standard when driven by a state-of-the-art machine learning architecture and a sufficiently large training dataset.

[65]  arXiv:2404.11800 [pdf, other]
Title: Developing Situational Awareness for Joint Action with Autonomous Vehicles
Comments: 16 pages
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Unanswered questions about how human-AV interaction designers can support rider's informational needs hinders Autonomous Vehicles (AV) adoption. To achieve joint human-AV action goals - such as safe transportation, trust, or learning from an AV - sufficient situational awareness must be held by the human, AV, and human-AV system collectively. We present a systems-level framework that integrates cognitive theories of joint action and situational awareness as a means to tailor communications that meet the criteria necessary for goal success. This framework is based on four components of the shared situation: AV traits, action goals, subject-specific traits and states, and the situated driving context. AV communications should be tailored to these factors and be sensitive when they change. This framework can be useful for understanding individual, shared, and distributed human-AV situational awareness and designing for future AV communications that meet the informational needs and goals of diverse groups and in diverse driving contexts.

[66]  arXiv:2404.11803 [pdf, other]
Title: TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal Aggregation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. Thereby, the effectiveness of developed BEV encoders crucially depends on the operators used to aggregate temporal information and on the used latent representation spaces. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in image or in BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We consider subsequent image frames as stereo through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. The ablation uncovers a strong synergy of joint temporal aggregation in the image and BEV latent space. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.

[67]  arXiv:2404.11807 [pdf, other]
Title: Continuous Dynamic Bipedal Jumping via Adaptive-model Optimization
Comments: 8 pages, 9 figures, submitted to IEEE RA-L for review and possible publication
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

Dynamic and continuous jumping remains an open yet challenging problem in bipedal robot control. The choice of dynamic models in trajectory optimization (TO) problems plays a huge role in trajectory accuracy and computation efficiency, which normally cannot be ensured simultaneously. In this letter, we propose a novel adaptive-model optimization approach, a unified framework of Adaptive-model TO and Adaptive-frequency Model Predictive Control (MPC), to effectively realize continuous and robust jumping on HECTOR bipedal robot. The proposed Adaptive-model TO fuses adaptive-fidelity dynamics modeling of bipedal jumping motion for model fidelity necessities in different jumping phases to ensure trajectory accuracy and computation efficiency. In addition, conventional approaches have unsynchronized sampling frequencies in TO and real-time control, causing the framework to have mismatched modeling resolutions. We adapt MPC sampling frequency based on TO trajectory resolution in different phases for effective trajectory tracking. In hardware experiments, we have demonstrated robust and dynamic jumps covering a distance of up to 40 cm (57% of robot height). To verify the repeatability of this experiment, we run 53 jumping experiments and achieve 90% success rate. In continuous jumps, we demonstrate continuous bipedal jumping with terrain height perturbations (up to 5 cm) and discontinuities (up to 20 cm gap).

[68]  arXiv:2404.11809 [pdf, other]
Title: Sharing Parameter by Conjugation for Knowledge Graph Embeddings in Complex Space
Comments: 8 pages, 1 figure, 6 tables, accepted at TextGraphs-16 workshop held in conjunction with COLING 2022
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

A Knowledge Graph (KG) is the directed graphical representation of entities and relations in the real world. KG can be applied in diverse Natural Language Processing (NLP) tasks where knowledge is required. The need to scale up and complete KG automatically yields Knowledge Graph Embedding (KGE), a shallow machine learning model that is suffering from memory and training time consumption issues. To mitigate the computational load, we propose a parameter-sharing method, i.e., using conjugate parameters for complex numbers employed in KGE models. Our method improves memory efficiency by 2x in relation embedding while achieving comparable performance to the state-of-the-art non-conjugate models, with faster, or at least comparable, training time. We demonstrated the generalizability of our method on two best-performing KGE models $5^{\bigstar}\mathrm{E}$ and $\mathrm{ComplEx}$ on five benchmark datasets.

[69]  arXiv:2404.11810 [pdf, other]
Title: Holographic Parallax Improves 3D Perceptual Realism
Comments: 33 pages, 34 figures
Subjects: Graphics (cs.GR)

Holographic near-eye displays are a promising technology to solve long-standing challenges in virtual and augmented reality display systems. Over the last few years, many different computer-generated holography (CGH) algorithms have been proposed that are supervised by different types of target content, such as 2.5D RGB-depth maps, 3D focal stacks, and 4D light fields. It is unclear, however, what the perceptual implications are of the choice of algorithm and target content type. In this work, we build a perceptual testbed of a full-color, high-quality holographic near-eye display. Under natural viewing conditions, we examine the effects of various CGH supervision formats and conduct user studies to assess their perceptual impacts on 3D realism. Our results indicate that CGH algorithms designed for specific viewpoints exhibit noticeable deficiencies in achieving 3D realism. In contrast, holograms incorporating parallax cues consistently outperform other formats across different viewing conditions, including the center of the eyebox. This finding is particularly interesting and suggests that the inclusion of parallax cues in CGH rendering plays a crucial role in enhancing the overall quality of the holographic experience. This work represents an initial stride towards delivering a perceptually realistic 3D experience with holographic near-eye displays.

[70]  arXiv:2404.11812 [pdf, other]
Title: Cross-model Mutual Learning for Exemplar-based Medical Image Segmentation
Authors: Qing En, Yuhong Guo
Comments: AISTATS 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Medical image segmentation typically demands extensive dense annotations for model training, which is both time-consuming and skill-intensive. To mitigate this burden, exemplar-based medical image segmentation methods have been introduced to achieve effective training with only one annotated image. In this paper, we introduce a novel Cross-model Mutual learning framework for Exemplar-based Medical image Segmentation (CMEMS), which leverages two models to mutually excavate implicit information from unlabeled data at multiple granularities. CMEMS can eliminate confirmation bias and enable collaborative training to learn complementary information by enforcing consistency at different granularities across models. Concretely, cross-model image perturbation based mutual learning is devised by using weakly perturbed images to generate high-confidence pseudo-labels, supervising predictions of strongly perturbed images across models. This approach enables joint pursuit of prediction consistency at the image granularity. Moreover, cross-model multi-level feature perturbation based mutual learning is designed by letting pseudo-labels supervise predictions from perturbed multi-level features with different resolutions, which can broaden the perturbation space and enhance the robustness of our framework. CMEMS is jointly trained using exemplar data, synthetic data, and unlabeled data in an end-to-end manner. Experimental results on two medical image datasets indicate that the proposed CMEMS outperforms the state-of-the-art segmentation methods with extremely limited supervision.

[71]  arXiv:2404.11815 [pdf, other]
Title: AquaSonic: Acoustic Manipulation of Underwater Data Center Operations and Resource Management
Subjects: Cryptography and Security (cs.CR)

Underwater datacenters (UDCs) hold promise as next-generation data storage due to their energy efficiency and environmental sustainability benefits. While the natural cooling properties of water save power, the isolated aquatic environment and long-range sound propagation in water create unique vulnerabilities which differ from those of on-land data centers. Our research discovers the unique vulnerabilities of fault-tolerant storage devices, resource allocation software, and distributed file systems to acoustic injection attacks in UDCs. With a realistic testbed approximating UDC server operations, we empirically characterize the capabilities of acoustic injection underwater and find that an attacker can reduce fault-tolerant RAID 5 storage system throughput by 17% up to 100%. Our closed-water analyses reveal that attackers can (i) cause unresponsiveness and automatic node removal in a distributed filesystem with only 2.4 minutes of sustained acoustic injection, (ii) induce a distributed database's latency to increase by up to 92.7% to reduce system reliability, and (iii) induce load-balance managers to redirect up to 74% of resources to a target server to cause overload or force resource colocation. Furthermore, we perform open-water experiments in a lake and find that an attacker can cause controlled throughput degradation at a maximum allowable distance of 6.35 m using a commercial speaker. We also investigate and discuss the effectiveness of standard defenses against acoustic injection attacks. Finally, we formulate a novel machine learning-based detection system that reaches 0% False Positive Rate and 98.2% True Positive Rate trained on our dataset of profiled hard disk drives under 30-second FIO benchmark execution. With this work, we aim to help manufacturers proactively protect UDCs against acoustic injection attacks and ensure the security of subsea computing infrastructures.

[72]  arXiv:2404.11816 [pdf, other]
Title: Tailoring Generative Adversarial Networks for Smooth Airfoil Design
Subjects: Machine Learning (cs.LG)

In the realm of aerospace design, achieving smooth curves is paramount, particularly when crafting objects such as airfoils. Generative Adversarial Network (GAN), a widely employed generative AI technique, has proven instrumental in synthesizing airfoil designs. However, a common limitation of GAN is the inherent lack of smoothness in the generated airfoil surfaces. To address this issue, we present a GAN model featuring a customized loss function built to produce seamlessly contoured airfoil designs. Additionally, our model demonstrates a substantial increase in design diversity compared to a conventional GAN augmented with a post-processing smoothing filter.

[73]  arXiv:2404.11817 [pdf, ps, other]
Title: Reinforcement Learning of Multi-robot Task Allocation for Multi-object Transportation with Infeasible Tasks
Comments: 8 pages, 10 figures
Subjects: Robotics (cs.RO)

Multi-object transport using multi-robot systems has the potential for diverse practical applications such as delivery services owing to its efficient individual and scalable cooperative transport. However, allocating transportation tasks of objects with unknown weights remains challenging. Moreover, the presence of infeasible tasks (untransportable objects) can lead to robot stoppage (deadlock). This paper proposes a framework for dynamic task allocation that involves storing task experiences for each task in a scalable manner with respect to the number of robots. First, these experiences are broadcasted from the cloud server to the entire robot system. Subsequently, each robot learns the exclusion levels for each task based on those task experiences, enabling it to exclude infeasible tasks and reset its task priorities. Finally, individual transportation, cooperative transportation, and the temporary exclusion of tasks considered infeasible are achieved. The scalability and versatility of the proposed method were confirmed through numerical experiments with an increased number of robots and objects, including unlearned weight objects. The effectiveness of the temporary deadlock avoidance was also confirmed by introducing additional robots within an episode. The proposed method enables the implementation of task allocation strategies that are feasible for different numbers of robots and various transport tasks without prior consideration of feasibility.

[74]  arXiv:2404.11818 [pdf, other]
Title: Automated Similarity Metric Generation for Recommendation
Subjects: Information Retrieval (cs.IR)

The embedding-based architecture has become the dominant approach in modern recommender systems, mapping users and items into a compact vector space. It then employs predefined similarity metrics, such as the inner product, to calculate similarity scores between user and item embeddings, thereby guiding the recommendation of items that align closely with a user's preferences. Given the critical role of similarity metrics in recommender systems, existing methods mainly employ handcrafted similarity metrics to capture the complex characteristics of user-item interactions. Yet, handcrafted metrics may not fully capture the diverse range of similarity patterns that can significantly vary across different domains.
To address this issue, we propose an Automated Similarity Metric Generation method for recommendations, named AutoSMG, which can generate tailored similarity metrics for various domains and datasets. Specifically, we first construct a similarity metric space by sampling from a set of basic embedding operators, which are then integrated into computational graphs to represent metrics. We employ an evolutionary algorithm to search for the optimal metrics within this metric space iteratively. To improve search efficiency, we utilize an early stopping strategy and a surrogate model to approximate the performance of candidate metrics instead of fully training models. Notably, our proposed method is model-agnostic, which can seamlessly plugin into different recommendation model architectures. The proposed method is validated on three public recommendation datasets across various domains in the Top-K recommendation task, and experimental results demonstrate that AutoSMG outperforms both commonly used handcrafted metrics and those generated by other search strategies.

[75]  arXiv:2404.11819 [pdf, other]
Title: Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often generated from biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adversarial images, that is images that deceive a deep neural network but not humans, as counterfactuals for fair model training.
Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. By incorporating adversarial images into the training data, we aim to prevent biases from propagating through the pipeline. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods. Qualitatively, our results indicate that post-training, the decisions made by the model are less dependent on the sensitive attribute and our model better disentangles the relationship between sensitive attributes and classification variables.

[76]  arXiv:2404.11822 [pdf, ps, other]
Title: A class of maximum-based iteration methods for the generalized absolute value equation
Subjects: Numerical Analysis (math.NA)

In this paper, by using $|x|=2\max\{0,x\}-x$, a class of maximum-based iteration methods is established to solve the generalized absolute value equation $Ax-B|x|=b$. Some convergence conditions of the proposed method are presented. By some numerical experiments, the effectiveness and feasibility of the proposed method are confirmed.

[77]  arXiv:2404.11824 [pdf, other]
Title: TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation
Comments: 7 pages, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent advancements in Text-to-image (T2I) generation have witnessed a shift from adapting text to fixed backgrounds to creating images around text. Traditional approaches are often limited to generate layouts within static images for effective text placement. Our proposed approach, TextCenGen, introduces a dynamic adaptation of the blank region for text-friendly image generation, emphasizing text-centric design and visual harmony generation. Our method employs force-directed attention guidance in T2I models to generate images that strategically reserve whitespace for pre-defined text areas, even for text or icons at the golden ratio. Observing how cross-attention maps affect object placement, we detect and repel conflicting objects using a force-directed graph approach, combined with a Spatial Excluding Cross-Attention Constraint for smooth attention in whitespace areas. As a novel task in graphic design, experiments indicate that TextCenGen outperforms existing methods with more harmonious compositions. Furthermore, our method significantly enhances T2I model outcomes on our specially collected prompt datasets, catering to varied text positions. These results demonstrate the efficacy of TextCenGen in creating more harmonious and integrated text-image compositions.

[78]  arXiv:2404.11825 [pdf, other]
Title: Hypergraph Self-supervised Learning with Sampling-efficient Signals
Comments: 9 pages,4 figures,4 tables
Subjects: Machine Learning (cs.LG)

Self-supervised learning (SSL) provides a promising alternative for representation learning on hypergraphs without costly labels. However, existing hypergraph SSL models are mostly based on contrastive methods with the instance-level discrimination strategy, suffering from two significant limitations: (1) They select negative samples arbitrarily, which is unreliable in deciding similar and dissimilar pairs, causing training bias. (2) They often require a large number of negative samples, resulting in expensive computational costs. To address the above issues, we propose SE-HSSL, a hypergraph SSL framework with three sampling-efficient self-supervised signals. Specifically, we introduce two sampling-free objectives leveraging the canonical correlation analysis as the node-level and group-level self-supervised signals. Additionally, we develop a novel hierarchical membership-level contrast objective motivated by the cascading overlap relationship in hypergraphs, which can further reduce membership sampling bias and improve the efficiency of sample utilization. Through comprehensive experiments on 7 real-world hypergraphs, we demonstrate the superiority of our approach over the state-of-the-art method in terms of both effectiveness and efficiency.

[79]  arXiv:2404.11826 [pdf, other]
Title: AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
Comments: 19 pages, 11 figures
Subjects: Computation and Language (cs.CL)

As the integration of large language models into daily life is on the rise, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark developed to assess LLMs' capability in offering advice for deeply personalized concerns, utilizing the LifeProTips subreddit forum. This forum features a dynamic interaction where users post advice-seeking questions, receiving an average of 8.9 advice per query, with 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. Therefore, we've completed a benchmark encompassing daily life questions, diverse corresponding responses, and majority vote ranking to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, analyzing phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant leap in enhancing QA systems for providing personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.

[80]  arXiv:2404.11828 [pdf, ps, other]
Title: Aerodynamic Design and Performance Evaluation of Pipe Diffuser for Centrifugal Compressor of Micro Gas Turbine
Subjects: Numerical Analysis (math.NA)

This study introduces a new approach to optimize the geometrical parameters of pipe diffusers in centrifugal compressors for Micro Gas Turbines, tailored for a 100 kW unit. The methodology draws insights from optimized airfoil-type diffusers and addresses the unique topological challenges of pipe diffusers, using diffuser maps to enhance design precision. The effectiveness of this method is validated through 3D-RANS based steady CFD simulations, using the ANSYS CFX solver. Comparative performance assessments at 100 percent rotation speed show that the best-performing pipe diffuser slightly trails its airfoil counterpart in efficiency, achieving 82.2 percent total-to-total isentropic efficiency compared to 84.4 percent. However, it offers a reduced frontal area, enhancing compactness. The analysis also reveals a dualistic impact from the leading-edge geometry of the pipe diffuser, which generates two counter-rotating vortices. These vortices have beneficial effects in pseudo and semi-vaneless spaces while introducing destabilizing factors in channel spaces. This investigation highlights potential trade-offs and outlines conditions under which adverse effects dominate, leading to significant flow separation. These insights pave the way for refining diffuser designs to better balance performance with spatial efficiency, marking a critical step forward in compressor technology of micro gas turbine for decentralized power systems.

[81]  arXiv:2404.11831 [pdf, other]
Title: JointPPO: Diving Deeper into the Effectiveness of PPO in Multi-Agent Reinforcement Learning
Subjects: Multiagent Systems (cs.MA)

While Centralized Training with Decentralized Execution (CTDE) has become the prevailing paradigm in Multi-Agent Reinforcement Learning (MARL), it may not be suitable for scenarios in which agents can fully communicate and share observations with each other. Fully centralized methods, also know as Centralized Training with Centralized Execution (CTCE) methods, can fully utilize observations of all the agents by treating the entire system as a single agent. However, traditional CTCE methods suffer from scalability issues due to the exponential growth of the joint action space. To address these challenges, in this paper we propose JointPPO, a CTCE method that uses Proximal Policy Optimization (PPO) to directly optimize the joint policy of the multi-agent system. JointPPO decomposes the joint policy into conditional probabilities, transforming the decision-making process into a sequence generation task. A Transformer-based joint policy network is constructed, trained with a PPO loss tailored for the joint policy. JointPPO effectively handles a large joint action space and extends PPO to multi-agent setting with theoretical clarity and conciseness. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) testbed demonstrate the superiority of JointPPO over the strong baselines. Ablation experiments and analyses are conducted to explores the factors influencing JointPPO's performance.

[82]  arXiv:2404.11833 [pdf, ps, other]
Title: Planning with Language Models Through The Lens of Efficiency
Subjects: Artificial Intelligence (cs.AI)

We analyse the cost of using LLMs for planning and highlight that recent trends are profoundly uneconomical. We propose a significantly more efficient approach and argue for a responsible use of compute resources; urging research community to investigate LLM-based approaches that upholds efficiency.

[83]  arXiv:2404.11834 [pdf, other]
Title: Actor-Critic Reinforcement Learning with Phased Actor
Subjects: Machine Learning (cs.LG)

Policy gradient methods in actor-critic reinforcement learning (RL) have become perhaps the most promising approaches to solving continuous optimal control problems. However, the trial-and-error nature of RL and the inherent randomness associated with solution approximations cause variations in the learned optimal values and policies. This has significantly hindered their successful deployment in real life applications where control responses need to meet dynamic performance criteria deterministically. Here we propose a novel phased actor in actor-critic (PAAC) method, aiming at improving policy gradient estimation and thus the quality of the control policy. Specifically, PAAC accounts for both $Q$ value and TD error in its actor update. We prove qualitative properties of PAAC for learning convergence of the value and policy, solution optimality, and stability of system dynamics. Additionally, we show variance reduction in policy gradient estimation. PAAC performance is systematically and quantitatively evaluated in this study using DeepMind Control Suite (DMC). Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate. As PAAC can be piggybacked onto general policy gradient learning frameworks, we select well-known methods such as direct heuristic dynamic programming (dHDP), deep deterministic policy gradient (DDPG) and their variants to demonstrate the effectiveness of PAAC. Consequently we provide a unified view on these related policy gradient algorithms.

[84]  arXiv:2404.11835 [pdf, other]
Title: CAUS: A Dataset for Question Generation based on Human Cognition Leveraging Large Language Models
Comments: 8 pages, 4 figures and 3 tables. This work has been accepted for presentation at CogSci 2024 and is currently under revision
Subjects: Artificial Intelligence (cs.AI)

We introduce the CAUS (Curious About Uncertain Scene) dataset, designed to enable Large Language Models, specifically GPT-4, to emulate human cognitive processes for resolving uncertainties. Leveraging this dataset, we investigate the potential of LLMs to engage in questioning effectively. Our approach involves providing scene descriptions embedded with uncertainties to stimulate the generation of reasoning and queries. The queries are then classified according to multi-dimensional criteria. All procedures are facilitated by a collaborative system involving both LLMs and human researchers. Our results demonstrate that GPT-4 can effectively generate pertinent questions and grasp their nuances, particularly when given appropriate context and instructions. The study suggests that incorporating human-like questioning into AI models improves their ability to manage uncertainties, paving the way for future advancements in Artificial Intelligence (AI).

[85]  arXiv:2404.11841 [pdf, ps, other]
Title: On the Unprovability of Circuit Size Bounds in Intuitionistic $\mathsf{S}^1_2$
Subjects: Logic in Computer Science (cs.LO); Computational Complexity (cs.CC)

We show that there is a constant $k$ such that Buss's intuitionistic theory $\mathsf{IS}^1_2$ does not prove that SAT requires co-nondeterministic circuits of size at least $n^k$. To our knowledge, this is the first unconditional unprovability result in bounded arithmetic in the context of worst-case fixed-polynomial size circuit lower bounds. We complement this result by showing that the upper bound $\mathsf{NP} \subseteq \mathsf{coNSIZE}[n^k]$ is unprovable in $\mathsf{IS}^1_2$.

[86]  arXiv:2404.11844 [pdf, ps, other]
Title: Finding A Taxi with Illegal Driver Substitution Activity via Behavior Modelings
Subjects: Computers and Society (cs.CY)

In our urban life, Illegal Driver Substitution (IDS) activity for a taxi is a grave unlawful activity in the taxi industry, possibly causing severe traffic accidents and painful social repercussions. Currently, the IDS activity is manually supervised by law enforcers, i.e., law enforcers empirically choose a taxi and inspect it. The pressing problem of this scheme is the dilemma between the limited number of law-enforcers and the large volume of taxis. In this paper, motivated by this problem, we propose a computational method that helps law enforcers efficiently find the taxis which tend to have the IDS activity. Firstly, our method converts the identification of the IDS activity to a supervised learning task. Secondly, two kinds of taxi driver behaviors, i.e., the Sleeping Time and Location (STL) behavior and the Pick-Up (PU) behavior are proposed. Thirdly, the multiple scale pooling on self-similarity is proposed to encode the individual behaviors into the universal features for all taxis. Finally, a Multiple Component- Multiple Instance Learning (MC-MIL) method is proposed to handle the deficiency of the behavior features and to align the behavior features simultaneously. Extensive experiments on a real-world data set shows that the proposed behavior features have a good generalization ability across different classifiers, and the proposed MC-MIL method suppresses the baseline methods.

[87]  arXiv:2404.11845 [pdf, other]
Title: Challenging Negative Gender Stereotypes: A Study on the Effectiveness of Automated Counter-Stereotypes
Comments: LREC-COLING2024
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)

Gender stereotypes are pervasive beliefs about individuals based on their gender that play a significant role in shaping societal attitudes, behaviours, and even opportunities. Recognizing the negative implications of gender stereotypes, particularly in online communications, this study investigates eleven strategies to automatically counter-act and challenge these views. We present AI-generated gender-based counter-stereotypes to (self-identified) male and female study participants and ask them to assess their offensiveness, plausibility, and potential effectiveness. The strategies of counter-facts and broadening universals (i.e., stating that anyone can have a trait regardless of group membership) emerged as the most robust approaches, while humour, perspective-taking, counter-examples, and empathy for the speaker were perceived as less effective. Also, the differences in ratings were more pronounced for stereotypes about the different targets than between the genders of the raters. Alarmingly, many AI-generated counter-stereotypes were perceived as offensive and/or implausible. Our analysis and the collected dataset offer foundational insight into counter-stereotype generation, guiding future efforts to develop strategies that effectively challenge gender stereotypes in online interactions.

[88]  arXiv:2404.11848 [pdf, other]
Title: Partial Large Kernel CNNs for Efficient Super-Resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, in the super-resolution (SR) domain, transformers have outperformed CNNs with fewer FLOPs and fewer parameters since they can deal with long-range dependency and adaptively adjust weights based on instance. In this paper, we demonstrate that CNNs, although less focused on in the current SR domain, surpass Transformers in direct efficiency measures. By incorporating the advantages of Transformers into CNNs, we aim to achieve both computational efficiency and enhanced performance. However, using a large kernel in the SR domain, which mainly processes large images, incurs a large computational overhead. To overcome this, we propose novel approaches to employing the large kernel, which can reduce latency by 86\% compared to the naive large kernel, and leverage an Element-wise Attention module to imitate instance-dependent weights. As a result, we introduce Partial Large Kernel CNNs for Efficient Super-Resolution (PLKSR), which achieves state-of-the-art performance on four datasets at a scale of $\times$4, with reductions of 68.1\% in latency and 80.2\% in maximum GPU memory occupancy compared to SRFormer-light.

[89]  arXiv:2404.11852 [pdf, other]
Title: Cicero: Addressing Algorithmic and Architectural Bottlenecks in Neural Rendering by Radiance Warping and Memory Optimizations
Subjects: Hardware Architecture (cs.AR); Graphics (cs.GR)

Neural Radiance Field (NeRF) is widely seen as an alternative to traditional physically-based rendering. However, NeRF has not yet seen its adoption in resource-limited mobile systems such as Virtual and Augmented Reality (VR/AR), because it is simply extremely slow. On a mobile Volta GPU, even the state-of-the-art NeRF models generally execute only at 0.8 FPS. We show that the main performance bottlenecks are both algorithmic and architectural. We introduce, CICERO, to tame both forms of inefficiencies. We first introduce two algorithms, one fundamentally reduces the amount of work any NeRF model has to execute, and the other eliminates irregular DRAM accesses. We then describe an on-chip data layout strategy that eliminates SRAM bank conflicts. A pure software implementation of CICERO offers an 8.0x speed-up and 7.9x energy saving over a mobile Volta GPU. When compared to a baseline with a dedicated DNN accelerator, our speed-up and energy reduction increase to 28.2x and 37.8x, respectively - all with minimal quality loss (less than 1.0 dB peak signal-to-noise ratio reduction).

[90]  arXiv:2404.11853 [pdf, other]
Title: Oracle-Augmented Prophet Inequalities
Subjects: Computer Science and Game Theory (cs.GT)

In the classical prophet inequality settings, a gambler is given a sequence of $n$ random variables $X_1, \dots, X_n$, taken from known distributions, observes their values in this (potentially adversarial) order, and select one of them, immediately after it is being observed, so that its value is as high as possible. The classical \emph{prophet inequality} shows a strategy that guarantees a value at least half of that an omniscience prophet that picks the maximum, and this ratio is optimal.
Here, we generalize the prophet inequality, allowing the gambler some additional information about the future that is otherwise privy only to the prophet. Specifically, at any point in the process, the gambler is allowed to query an oracle $\mathcal{O}$. The oracle responds with a single bit answer: YES if the current realization is greater than the remaining realizations, and NO otherwise. We show that the oracle model with $m$ oracle calls is equivalent to the \textsc{Top-$1$-of-$(m+1)$} model when the objective is maximizing the probability of selecting the maximum. This equivalence fails to hold when the objective is maximizing the competitive ratio, but we still show that any algorithm for the oracle model implies an equivalent competitive ratio for the \textsc{Top-$1$-of-$(m+1)$} model.
We resolve the oracle model for any $m$, giving tight lower and upper bound on the best possible competitive ratio compared to an almighty adversary. As a consequence, we provide new results as well as improvements on known results for the \textsc{Top-$1$-of-$m$} model.

[91]  arXiv:2404.11854 [pdf, other]
Title: SGRU: A High-Performance Structured Gated Recurrent Unit for Traffic Flow Prediction
Comments: 7 pages, 6 figures, conference
Subjects: Artificial Intelligence (cs.AI)

Traffic flow prediction is an essential task in constructing smart cities and is a typical Multivariate Time Series (MTS) Problem. Recent research has abandoned Gated Recurrent Units (GRU) and utilized dilated convolutions or temporal slicing for feature extraction, and they have the following drawbacks: (1) Dilated convolutions fail to capture the features of adjacent time steps, resulting in the loss of crucial transitional data. (2) The connections within the same temporal slice are strong, while the connections between different temporal slices are too loose. In light of these limitations, we emphasize the importance of analyzing a complete time series repeatedly and the crucial role of GRU in MTS. Therefore, we propose SGRU: Structured Gated Recurrent Units, which involve structured GRU layers and non-linear units, along with multiple layers of time embedding to enhance the model's fitting performance. We evaluate our approach on four publicly available California traffic datasets: PeMS03, PeMS04, PeMS07, and PeMS08 for regression prediction. Experimental results demonstrate that our model outperforms baseline models with average improvements of 11.7%, 18.6%, 18.5%, and 12.0% respectively.

[92]  arXiv:2404.11862 [pdf, ps, other]
Title: A Fast Maximum Clique Algorithm Based on Network Decomposition for Large Sparse Networks
Comments: 12 pages, 2 figures, 1 table
Subjects: Social and Information Networks (cs.SI); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Data Analysis, Statistics and Probability (physics.data-an)

Finding maximum cliques in large networks is a challenging combinatorial problem with many real-world applications. We present a fast algorithm to achieve the exact solution for the maximum clique problem in large sparse networks based on efficient graph decomposition. A bunch of effective techniques is being used to greatly prune the graph and a novel concept called Complete-Upper-Bound-Induced Subgraph (CUBIS) is proposed to ensure that the structures with the potential to form the maximum clique are retained in the process of graph decomposition. Our algorithm first pre-prunes peripheral nodes, subsequently, one or two small-scale CUBISs are constructed guided by the core number and current maximum clique size. Bron-Kerbosch search is performed on each CUBIS to find the maximum clique. Experiments on 50 empirical networks with a scale of up to 20 million show the CUBIS scales are largely independent of the original network scale. This enables an approximately linear runtime, making our algorithm amenable for large networks. Our work provides a new framework for effectively solving maximum clique problems on massive sparse graphs, which not only makes the graph scale no longer the bottleneck but also shows some light on solving other clique-related problems.

[93]  arXiv:2404.11864 [pdf, other]
Title: Progressive Multi-modal Conditional Prompt Tuning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting, which leverages VLMs as knowledge bases to extract information beneficial for downstream tasks. However, existing methods primarily employ uni-modal prompting, which only engages a uni-modal branch, failing to simultaneously adjust vision-language (V-L) features. Additionally, the one-pass forward pipeline in VLM encoding struggles to align V-L features that have a huge gap. Confronting these challenges, we propose a novel method, Progressive Multi-modal conditional Prompt Tuning (ProMPT). ProMPT exploits a recurrent structure, optimizing and aligning V-L features by iteratively utilizing image and current encoding information. It comprises an initialization and a multi-modal iterative evolution (MIE) module. Initialization is responsible for encoding image and text using a VLM, followed by a feature filter that selects text features similar to image. MIE then facilitates multi-modal prompting through class-conditional vision prompting, instance-conditional text prompting, and feature filtering. In each MIE iteration, vision prompts are obtained from the filtered text features via a vision generator, promoting image features to focus more on target object during vision prompting. The encoded image features are fed into a text generator to produce text prompts that are more robust to class shift. Thus, V-L features are progressively aligned, enabling advance from coarse to exact classifications. Extensive experiments are conducted in three settings to evaluate the efficacy of ProMPT. The results indicate that ProMPT outperforms existing methods on average across all settings, demonstrating its superior generalization.

[94]  arXiv:2404.11865 [pdf, other]
Title: From Image to Video, what do we need in multimodal LLMs?
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information, covering from Image LLMs to the more complex Video LLMs. Numerous studies have illustrated their exceptional cross-modal comprehension. Recently, integrating video foundation models with large language models to build a comprehensive video understanding system has been proposed to overcome the limitations of specific pre-defined vision tasks. However, the current advancements in Video LLMs tend to overlook the foundational contributions of Image LLMs, often opting for more complicated structures and a wide variety of multimodal data for pre-training. This approach significantly increases the costs associated with these methods.In response to these challenges, this work introduces an efficient method that strategically leverages the priors of Image LLMs, facilitating a resource-efficient transition from Image to Video LLMs. We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs, which utilizes a temporal adaptation plug-and-play structure within the image fusion module of Image LLMs. This adaptation extends their understanding capabilities to include temporal information, enabling the development of Video LLMs that not only surpass baseline performances but also do so with minimal instructional data and training resources. Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models, effectively building upon the foundational work of Image LLMs.

[95]  arXiv:2404.11868 [pdf, other]
Title: OPTiML: Dense Semantic Invariance Using Optimal Transport for Self-Supervised Medical Image Representation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. However, despite the promising potential, conventional SSL methods encounter limitations, including challenges in achieving semantic alignment and capturing subtle details. This leads to suboptimal representations, which fail to accurately capture the underlying anatomical structures and pathological details. In response to these constraints, we introduce a novel SSL framework OPTiML, employing optimal transport (OT), to capture the dense semantic invariance and fine-grained details, thereby enhancing the overall effectiveness of SSL in medical image representation learning. The core idea is to integrate OT with a cross-viewpoint semantics infusion module (CV-SIM), which effectively captures complex, fine-grained details inherent in medical images across different viewpoints. In addition to the CV-SIM module, OPTiML imposes the variance and covariance regularizations within OT framework to force the model focus on clinically relevant information while discarding less informative features. Through these, the proposed framework demonstrates its capacity to learn semantically rich representations that can be applied to various medical imaging tasks. To validate its effectiveness, we conduct experimental studies on three publicly available datasets from chest X-ray modality. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.

[96]  arXiv:2404.11869 [pdf, other]
Title: Multi-view Graph Structural Representation Learning via Graph Coarsening
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Graph Transformers (GTs) have made remarkable achievements in graph-level tasks. However, most existing works regard graph structures as a form of guidance or bias for enhancing node representations, which focuses on node-central perspectives and lacks explicit representations of edges and structures. One natural question is, can we treat graph structures node-like as a whole to learn high-level features? Through experimental analysis, we explore the feasibility of this assumption. Based on our findings, we propose a novel multi-view graph structural representation learning model via graph coarsening (MSLgo) on GT architecture for graph classification. Specifically, we build three unique views, original, coarsening, and conversion, to learn a thorough structural representation. We compress loops and cliques via hierarchical heuristic graph coarsening and restrict them with well-designed constraints, which builds the coarsening view to learn high-level interactions between structures. We also introduce line graphs for edge embeddings and switch to edge-central perspective to construct the conversion view. Experiments on six real-world datasets demonstrate the improvements of MSLgo over 14 baselines from various architectures.

[97]  arXiv:2404.11870 [pdf, ps, other]
Title: Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory
Comments: Preprint
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

We propose Pointer-Augmented Neural Memory (PANM) to help neural networks understand and apply symbol processing to new, longer sequences of data. PANM integrates an external neural memory that uses novel physical addresses and pointer manipulation techniques to mimic human and computer symbol processing abilities. PANM facilitates pointer assignment, dereference, and arithmetic by explicitly using physical pointers to access memory content. Remarkably, it can learn to perform these operations through end-to-end training on sequence data, powering various sequential models. Our experiments demonstrate PANM's exceptional length extrapolating capabilities and improved performance in tasks that require symbol processing, such as algorithmic reasoning and Dyck language recognition. PANM helps Transformer achieve up to 100% generalization accuracy in compositional learning tasks and significantly better results in mathematical reasoning, question answering and machine translation tasks.

[98]  arXiv:2404.11871 [pdf, other]
Title: Group-On: Boosting One-Shot Segmentation with Supportive Query
Subjects: Computer Vision and Pattern Recognition (cs.CV)

One-shot semantic segmentation aims to segment query images given only ONE annotated support image of the same class. This task is challenging because target objects in the support and query images can be largely different in appearance and pose (i.e., intra-class variation). Prior works suggested that incorporating more annotated support images in few-shot settings boosts performances but increases costs due to additional manual labeling. In this paper, we propose a novel approach for ONE-shot semantic segmentation, called Group-On, which packs multiple query images in batches for the benefit of mutual knowledge support within the same category. Specifically, after coarse segmentation masks of the batch of queries are predicted, query-mask pairs act as pseudo support data to enhance mask predictions mutually, under the guidance of a simple Group-On Voting module. Comprehensive experiments on three standard benchmarks show that, in the ONE-shot setting, our Group-On approach significantly outperforms previous works by considerable margins. For example, on the COCO-20i dataset, we increase mIoU scores by 8.21% and 7.46% on ASNet and HSNet baselines, respectively. With only one support image, Group-On can be even competitive with the counterparts using 5 annotated support images.

[99]  arXiv:2404.11874 [pdf, other]
Title: Using a Local Surrogate Model to Interpret Temporal Shifts in Global Annual Data
Authors: Shou Nakano, Yang Liu
Comments: There are 9 pages and 5 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

This paper focuses on explaining changes over time in globally-sourced, annual temporal data, with the specific objective of identifying pivotal factors that contribute to these temporal shifts. Leveraging such analytical frameworks can yield transformative impacts, including the informed refinement of public policy and the identification of key drivers affecting a country's economic evolution. We employ Local Interpretable Model-agnostic Explanations (LIME) to shed light on national happiness indices, economic freedom, and population metrics, spanning variable time frames. Acknowledging the presence of missing values, we employ three imputation approaches to generate robust multivariate time-series datasets apt for LIME's input requirements. Our methodology's efficacy is substantiated through a series of empirical evaluations involving multiple datasets. These evaluations include comparative analyses against random feature selection, correlation with real-world events as elucidated by LIME, and validation through Individual Conditional Expectation (ICE) plots, a state-of-the-art technique proficient in feature importance detection.

[100]  arXiv:2404.11875 [pdf, other]
Title: Concept Induction using LLMs: a user experiment for assessment
Subjects: Artificial Intelligence (cs.AI)

Explainable Artificial Intelligence (XAI) poses a significant challenge in providing transparent and understandable insights into complex AI models. Traditional post-hoc algorithms, while useful, often struggle to deliver interpretable explanations. Concept-based models offer a promising avenue by incorporating explicit representations of concepts to enhance interpretability. However, existing research on automatic concept discovery methods is often limited by lower-level concepts, costly human annotation requirements, and a restricted domain of background knowledge. In this study, we explore the potential of a Large Language Model (LLM), specifically GPT-4, by leveraging its domain knowledge and common-sense capability to generate high-level concepts that are meaningful as explanations for humans, for a specific setting of image classification. We use minimal textual object information available in the data via prompting to facilitate this process. To evaluate the output, we compare the concepts generated by the LLM with two other methods: concepts generated by humans and the ECII heuristic concept induction system. Since there is no established metric to determine the human understandability of concepts, we conducted a human study to assess the effectiveness of the LLM-generated concepts. Our findings indicate that while human-generated explanations remain superior, concepts derived from GPT-4 are more comprehensible to humans compared to those generated by ECII.

[101]  arXiv:2404.11876 [pdf, other]
Title: CelluloTactix: Towards Empowering Collaborative Online Learning through Tangible Haptic Interaction with Cellulo Robots
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)

Online learning has soared in popularity in the educational landscape of COVID-19 and carries the benefits of increased flexibility and access to far-away training resources. However, it also restricts communication between peers and teachers, limits physical interactions and confines learning to the computer screen and keyboard. In this project, we designed a novel way to engage students in collaborative online learning by using haptic-enabled tangible robots, Cellulo. We built a library which connects two robots remotely for a learning activity based around the structure of a biological cell. To discover how separate modes of haptic feedback might differentially affect collaboration, two modes of haptic force-feedback were implemented (haptic co-location and haptic consensus). With a case study, we found that the haptic co-location mode seemed to stimulate collectivist behaviour to a greater extent than the haptic consensus mode, which was associated with individualism and less interaction. While the haptic co-location mode seemed to encourage information pooling, participants using the haptic consensus mode tended to focus more on technical co-ordination. This work introduces a novel system that can provide interesting insights on how to integrate haptic feedback into collaborative remote learning activities in future.

[102]  arXiv:2404.11879 [pdf, other]
Title: Public Event Scheduling with Busy Agents
Comments: To appear in IJCAI 2024
Subjects: Data Structures and Algorithms (cs.DS)

We study a public event scheduling problem, where multiple public events are scheduled to coordinate the availability of multiple agents. The availability of each agent is determined by solving a separate flexible interval job scheduling problem, where the jobs are required to be preemptively processed. The agents want to attend as many events as possible, and their agreements are considered to be the total length of time during which they can attend these events. The goal is to find a schedule for events as well as the job schedule for each agent such that the total agreement is maximized.
We first show that the problem is NP-hard, and then prove that a simple greedy algorithm achieves $\frac{1}{2}$-approximation when the whole timeline is polynomially bounded. Our method also implies a $(1-\frac{1}{e})$-approximate algorithm for this case. Subsequently, for the general timeline case, we present an algorithmic framework that extends a $\frac{1}{\alpha}$-approximate algorithm for the one-event instance to the general case that achieves $\frac{1}{\alpha+1}$-approximation. Finally, we give a polynomial time algorithm that solves the one-event instance, and this implies a $\frac{1}{2}$-approximate algorithm for the general case.

[103]  arXiv:2404.11881 [pdf, other]
Title: Joint Transmitter and Receiver Design for Movable Antenna Enhanced Multicast Communications
Comments: 13 double-column single-spaced pages, 9 figures, submitted to IEEE journal for possible publication
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

Movable antenna (MA) is an emerging technology that utilizes localized antenna movement to pursue better channel conditions for enhancing communication performance. In this paper, we study the MA-enhanced multicast transmission from a base station equipped with multiple MAs to multiple groups of single-MA users. Our goal is to maximize the minimum weighted signal-to-interference-plus-noise ratio (SINR) among all the users by jointly optimizing the position of each transmit/receive MA and the transmit beamforming. To tackle this challenging problem, we first consider the single-group scenario and propose an efficient algorithm based on the techniques of alternating optimization and successive convex approximation. Particularly, when optimizing transmit or receive MA positions, we construct a concave lower bound for the signal-to-noise ratio (SNR) of each user by applying only the second-order Taylor expansion, which is more effective than existing works utilizing two-step approximations. The proposed design is then extended to the general multi-group scenario. Simulation results demonstrate that significant performance gains in terms of achievable max-min SNR/SINR can be obtained by our proposed algorithm over benchmark schemes. Additionally, the proposed algorithm can notably reduce the required amount of transmit power or antennas for achieving a target level of max-min SNR/SINR performance compared to benchmark schemes.

[104]  arXiv:2404.11882 [pdf, other]
Title: Hybrid Navigation Acceptability and Safety
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)

Autonomous vessels have emerged as a prominent and accepted solution, particularly in the naval defence sector. However, achieving full autonomy for marine vessels demands the development of robust and reliable control and guidance systems that can handle various encounters with manned and unmanned vessels while operating effectively under diverse weather and sea conditions. A significant challenge in this pursuit is ensuring the autonomous vessels' compliance with the International Regulations for Preventing Collisions at Sea (COLREGs). These regulations present a formidable hurdle for the human-level understanding by autonomous systems as they were originally designed from common navigation practices created since the mid-19th century. Their ambiguous language assumes experienced sailors' interpretation and execution, and therefore demands a high-level (cognitive) understanding of language and agent intentions. These capabilities surpass the current state-of-the-art in intelligent systems. This position paper highlights the critical requirements for a trustworthy control and guidance system, exploring the complexity of adapting COLREGs for safe vessel-on-vessel encounters considering autonomous maritime technology competing and/or cooperating with manned vessels.

[105]  arXiv:2404.11884 [pdf, other]
Title: Seeing Motion at Nighttime with an Event Camera
Comments: Accepted by CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We focus on a very challenging task: imaging at nighttime dynamic scenes. Most previous methods rely on the low-light enhancement of a conventional RGB camera. However, they would inevitably face a dilemma between the long exposure time of nighttime and the motion blur of dynamic scenes. Event cameras react to dynamic changes with higher temporal resolution (microsecond) and higher dynamic range (120dB), offering an alternative solution. In this work, we present a novel nighttime dynamic imaging method with an event camera. Specifically, we discover that the event at nighttime exhibits temporal trailing characteristics and spatial non-stationary distribution. Consequently, we propose a nighttime event reconstruction network (NER-Net) which mainly includes a learnable event timestamps calibration module (LETC) to align the temporal trailing events and a non-uniform illumination aware module (NIAM) to stabilize the spatiotemporal distribution of events. Moreover, we construct a paired real low-light event dataset (RLED) through a co-axial imaging system, including 64,200 spatially and temporally aligned image GTs and low-light events. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods in terms of visual quality and generalization ability on real-world nighttime datasets. The project are available at: https://github.com/Liu-haoyue/NER-Net.

[106]  arXiv:2404.11886 [pdf, other]
Title: A Distributions-based Approach for Data-Consistent Inversion
Comments: 26 pages, 19 figures
Subjects: Numerical Analysis (math.NA)

We formulate a novel approach to solve a class of stochastic problems, referred to as data-consistent inverse (DCI) problems, which involve the characterization of a probability measure on the parameters of a computational model whose subsequent push-forward matches an observed probability measure on specified quantities of interest (QoI) typically associated with the outputs from the computational model. Whereas prior DCI solution methodologies focused on either constructing non-parametric estimates of the densities or the probabilities of events associated with the pre-image of the QoI map, we develop and analyze a constrained quadratic optimization approach based on estimating push-forward measures using weighted empirical distribution functions. The method proposed here is more suitable for low-data regimes or high-dimensional problems than the density-based method, as well as for problems where the probability measure does not admit a density. Numerical examples are included to demonstrate the performance of the method and to compare with the density-based approach where applicable.

[107]  arXiv:2404.11887 [pdf, other]
Title: EN-TensorCore: Advancing TensorCores Performance through Encoder-Based Methodology
Comments: 7 pages, 6 figures
Subjects: Hardware Architecture (cs.AR)

Tensor computations, with matrix multiplication being the primary operation, serve as the fundamental basis for data analysis, physics, machine learning, and deep learning. As the scale and complexity of data continue to grow rapidly, the demand for tensor computations has also increased significantly. To meet this demand, several research institutions have started developing dedicated hardware for tensor computations. To further improve the computational performance of tensor process units, we have reexamined the issue of computation reuse that was previously overlooked in existing architectures. As a result, we propose a novel EN-TensorCore architecture that can significantly reduce chip area and power consumption. Furthermore, our method is compatible with existing tensor processing architectures. We evaluated our method on prevalent microarchitectures, the results demonstrate an average improvement in area efficiency of 8.7\%, 12.2\%, and 11.0\% for tensor computing units at computational scales of 256 GOPS, 1 TOPS, and 4 TOPS, respectively. Similarly, there were energy efficiency enhancements of 13.0\%, 17.5\%, and 15.5\%.

[108]  arXiv:2404.11888 [pdf, other]
Title: The Dog Walking Theory: Rethinking Convergence in Federated Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Federated learning (FL) is a collaborative learning paradigm that allows different clients to train one powerful global model without sharing their private data. Although FL has demonstrated promising results in various applications, it is known to suffer from convergence issues caused by the data distribution shift across different clients, especially on non-independent and identically distributed (non-IID) data. In this paper, we study the convergence of FL on non-IID data and propose a novel \emph{Dog Walking Theory} to formulate and identify the missing element in existing research. The Dog Walking Theory describes the process of a dog walker leash walking multiple dogs from one side of the park to the other. The goal of the dog walker is to arrive at the right destination while giving the dogs enough exercise (i.e., space exploration). In FL, the server is analogous to the dog walker while the clients are analogous to the dogs. This analogy allows us to identify one crucial yet missing element in existing FL algorithms: the leash that guides the exploration of the clients. To address this gap, we propose a novel FL algorithm \emph{FedWalk} that leverages an external easy-to-converge task at the server side as a \emph{leash task} to guide the local training of the clients. We theoretically analyze the convergence of FedWalk with respect to data heterogeneity (between server and clients) and task discrepancy (between the leash and the original tasks). Experiments on multiple benchmark datasets demonstrate the superiority of FedWalk over state-of-the-art FL methods under both IID and non-IID settings.

[109]  arXiv:2404.11890 [pdf, other]
Title: FCNCP: A Coupled Nonnegative CANDECOMP/PARAFAC Decomposition Based on Federated Learning
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)

In the field of brain science, data sharing across servers is becoming increasingly challenging due to issues such as industry competition, privacy security, and administrative procedure policies and regulations. Therefore, there is an urgent need to develop new methods for data analysis and processing that enable scientific collaboration without data sharing. In view of this, this study proposes to study and develop a series of efficient non-negative coupled tensor decomposition algorithm frameworks based on federated learning called FCNCP for the EEG data arranged on different servers. It combining the good discriminative performance of tensor decomposition in high-dimensional data representation and decomposition, the advantages of coupled tensor decomposition in cross-sample tensor data analysis, and the features of federated learning for joint modelling in distributed servers. The algorithm utilises federation learning to establish coupling constraints for data distributed across different servers. In the experiments, firstly, simulation experiments are carried out using simulated data, and stable and consistent decomposition results are obtained, which verify the effectiveness of the proposed algorithms in this study. Then the FCNCP algorithm was utilised to decompose the fifth-order event-related potential (ERP) tensor data collected by applying proprioceptive stimuli on the left and right hands. It was found that contralateral stimulation induced more symmetrical components in the activation areas of the left and right hemispheres. The conclusions drawn are consistent with the interpretations of related studies in cognitive neuroscience, demonstrating that the method can efficiently process higher-order EEG data and that some key hidden information can be preserved.

[110]  arXiv:2404.11891 [pdf, other]
Title: Large Language Models Can Plan Your Travels Rigorously with Formal Verification Tools
Comments: 31 pages, 3 figures, 4 tables, submitted to ACL RR
Subjects: Artificial Intelligence (cs.AI)

The recent advancements of Large Language Models (LLMs), with their abundant world knowledge and capabilities of tool-using and reasoning, fostered many LLM planning algorithms. However, LLMs have not shown to be able to accurately solve complex combinatorial optimization problems. In Xie et al. (2024), the authors proposed TravelPlanner, a U.S. domestic travel planning benchmark, and showed that LLMs themselves cannot make travel plans that satisfy user requirements with a best success rate of 0.6%. In this work, we propose a framework that enables LLMs to formally formulate and solve the travel planning problem as a satisfiability modulo theory (SMT) problem and use SMT solvers interactively and automatically solve the combinatorial search problem. The SMT solvers guarantee the satisfiable of input constraints and the LLMs can enable a language-based interaction with our framework. When the input constraints cannot be satisfiable, our LLM-based framework will interactively offer suggestions to users to modify their travel requirements via automatic reasoning using the SMT solvers. We evaluate our framework with TravelPlanner and achieve a success rate of 97%. We also create a separate dataset that contain international travel benchmarks and use both dataset to evaluate the effectiveness of our interactive planning framework when the initial user queries cannot be satisfied. Our framework could generate valid plans with an average success rate of 78.6% for our dataset and 85.0% for TravelPlanner according to diverse humans preferences.

[111]  arXiv:2404.11894 [pdf, other]
Title: Rendering Participating Media Using Path Graphs
Subjects: Graphics (cs.GR)

Rendering volumetric scattering media, including clouds, fog, smoke, and other complex materials, is crucial for realism in computer graphics. Traditional path tracing, while unbiased, requires many long path samples to converge in scenes with scattering media, and a lot of work is wasted by paths that make a negligible contribution to the image. Methods to make better use of the information learned during path tracing range from photon mapping to radiance caching, but struggle to support the full range of heterogeneous scattering media. This paper introduces a new volumetric rendering algorithm that extends and adapts the previous \emph{path graph} surface rendering algorithm. Our method leverages the information collected through multiple-scattering transport paths to compute lower-noise estimates, increasing computational efficiency by reducing the required sample count. Our key contributions include an extended path graph for participating media and new aggregation and propagation operators for efficient path reuse in volumes. Compared to previous methods, our approach significantly boosts convergence in scenes with challenging volumetric light transport, including heterogeneous media with high scattering albedos and dense, forward-scattering translucent materials, under complex lighting conditions.

[112]  arXiv:2404.11895 [pdf, other]
Title: FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended precise editing target regions and the broader area impacted by the guidance in practice. Despite excellent methods leveraging attention mechanisms that have been developed to refine the editing guidance, these approaches necessitate modifications through complex network architecture and are limited to specific editing tasks. In this work, we re-examine the diffusion process and misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus brings excessive low-frequency signals for editing. Leveraging this insight, we introduce a novel fine-tuning free approach that employs progressive $\textbf{Fre}$qu$\textbf{e}$ncy truncation to refine the guidance of $\textbf{Diff}$usion models for universal editing tasks ($\textbf{FreeDiff}$). Our method achieves comparable results with state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.

[113]  arXiv:2404.11897 [pdf, other]
Title: AG-NeRF: Attention-guided Neural Radiance Fields for Multi-height Large-scale Outdoor Scene Rendering
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Existing neural radiance fields (NeRF)-based novel view synthesis methods for large-scale outdoor scenes are mainly built on a single altitude. Moreover, they often require a priori camera shooting height and scene scope, leading to inefficient and impractical applications when camera altitude changes. In this work, we propose an end-to-end framework, termed AG-NeRF, and seek to reduce the training cost of building good reconstructions by synthesizing free-viewpoint images based on varying altitudes of scenes. Specifically, to tackle the detail variation problem from low altitude (drone-level) to high altitude (satellite-level), a source image selection method and an attention-based feature fusion approach are developed to extract and fuse the most relevant features of target view from multi-height images for high-fidelity rendering. Extensive experiments demonstrate that AG-NeRF achieves SOTA performance on 56 Leonard and Transamerica benchmarks and only requires a half hour of training time to reach the competitive PSNR as compared to the latest BungeeNeRF.

[114]  arXiv:2404.11898 [pdf, ps, other]
Title: Enhancing Financial Inclusion and Regulatory Challenges: A Critical Analysis of Digital Banks and Alternative Lenders Through Digital Platforms, Machine Learning, and Large Language Models Integration
Authors: Luke Lee
Comments: 17 pages
Subjects: Artificial Intelligence (cs.AI)

This paper explores the dual impact of digital banks and alternative lenders on financial inclusion and the regulatory challenges posed by their business models. It discusses the integration of digital platforms, machine learning (ML), and Large Language Models (LLMs) in enhancing financial services accessibility for underserved populations. Through a detailed analysis of operational frameworks and technological infrastructures, this research identifies key mechanisms that facilitate broader financial access and mitigate traditional barriers. Additionally, the paper addresses significant regulatory concerns involving data privacy, algorithmic bias, financial stability, and consumer protection. Employing a mixed-methods approach, which combines quantitative financial data analysis with qualitative insights from industry experts, this paper elucidates the complexities of leveraging digital technology to foster financial inclusivity. The findings underscore the necessity of evolving regulatory frameworks that harmonize innovation with comprehensive risk management. This paper concludes with policy recommendations for regulators, financial institutions, and technology providers, aiming to cultivate a more inclusive and stable financial ecosystem through prudent digital technology integration.

[115]  arXiv:2404.11900 [pdf, other]
Title: A New Hybrid Automaton Framework with Partial Differential Equation Dynamics
Comments: 17 pages
Subjects: Systems and Control (eess.SY)

This paper presents the syntax and semantics of a novel type of hybrid automaton (HA) with partial differential equation (PDE) dynamic, partial differential hybrid automata (PDHA). In PDHA, we add a spatial domain $X$ and harness a mathematic conception, partition, to help us formally define the spatial relations. While classically the dynamics of HA are described by ordinary differential equations (ODEs) and differential inclusions, PDHA is capable of describing the behavior of cyber-physical systems (CPS) with continuous dynamics that cannot be modelled using the canonical hybrid systems' framework. For the purposes of analyzing PDHA, we propose another model called the discrete space partial differential hybrid automata (DSPDHA) which handles discrete spatial domains using finite difference methods (FDM) and this simple and intuitive approach reduces the PDHA into HA with ODE systems. We conclude with two illustrative examples in order to exhibit the nature of PDHA and DSPDHA.

[116]  arXiv:2404.11903 [pdf, other]
Title: Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition
Comments: 12 pages, 5 figures, submitted to IEEE Transactions on Multimedia
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The interactions between human and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected using a pretrained detector, and then are fed to an action recognition model for extracting video features and learning the object relations for action recognition. However, since the action prior is unknown in the object detection stage, important objects could be easily overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. Particularly, after extracting video features with a base network, we create three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation identifies important objects for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level info into a global video representation. Lastly, an Object Relation Modeling module encodes object relations. These three modules together with the video feature extractor can be trained jointly in an end-to-end fashion, thus avoiding the heavy reliance on an off-the-shelf object detector, and reducing the multi-stage training burden. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our proposed approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and we can outperform state-of-the-art methods on both datasets.

[117]  arXiv:2404.11905 [pdf, other]
Title: FedMID: A Data-Free Method for Using Intermediate Outputs as a Defense Mechanism Against Poisoning Attacks in Federated Learning
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Federated learning combines local updates from clients to produce a global model, which is susceptible to poisoning attacks. Most previous defense strategies relied on vectors derived from projections of local updates on a Euclidean space; however, these methods fail to accurately represent the functionality and structure of local models, resulting in inconsistent performance. Here, we present a new paradigm to defend against poisoning attacks in federated learning using functional mappings of local models based on intermediate outputs. Experiments show that our mechanism is robust under a broad range of computing conditions and advanced attack scenarios, enabling safer collaboration among data-sensitive participants via federated learning.

[118]  arXiv:2404.11907 [pdf, other]
Title: Sampling-based Pareto Optimization for Chance-constrained Monotone Submodular Problems
Subjects: Artificial Intelligence (cs.AI)

Recently surrogate functions based on the tail inequalities were developed to evaluate the chance constraints in the context of evolutionary computation and several Pareto optimization algorithms using these surrogates were successfully applied in optimizing chance-constrained monotone submodular problems. However, the difference in performance between algorithms using the surrogates and those employing the direct sampling-based evaluation remains unclear. Within the paper, a sampling-based method is proposed to directly evaluate the chance constraint. Furthermore, to address the problems with more challenging settings, an enhanced GSEMO algorithm integrated with an adaptive sliding window, called ASW-GSEMO, is introduced. In the experiments, the ASW-GSEMO employing the sampling-based approach is tested on the chance-constrained version of the maximum coverage problem with different settings. Its results are compared with those from other algorithms using different surrogate functions. The experimental findings indicate that the ASW-GSEMO with the sampling-based evaluation approach outperforms other algorithms, highlighting that the performances of algorithms using different evaluation methods are comparable. Additionally, the behaviors of ASW-GSEMO are visualized to explain the distinctions between it and the algorithms utilizing the surrogate functions.

[119]  arXiv:2404.11912 [pdf, other]
Title: TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

With large language models (LLMs) widely deployed in long content generation recently, there has emerged an increasing demand for efficient long-sequence inference support. However, key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length. Due to the auto-regressive nature of LLMs, the entire KV cache will be loaded for every generated token, resulting in low utilization of computational cores and high latency. While various compression methods for KV cache have been proposed to alleviate this issue, they suffer from degradation in generation quality. We introduce TriForce, a hierarchical speculative decoding system that is scalable to long sequence generation. This approach leverages the original model weights and dynamic sparse KV cache via retrieval as a draft model, which serves as an intermediate layer in the hierarchy and is further speculated by a smaller model to reduce its drafting latency. TriForce not only facilitates impressive speedups for Llama2-7B-128K, achieving up to 2.31$\times$ on an A100 GPU but also showcases scalability in handling even longer contexts. For the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108s/token$\unicode{x2014}$only half as slow as the auto-regressive baseline on an A100, which attains 7.78$\times$ on our optimized offloading system. Additionally, TriForce performs 4.86$\times$ than DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce's robustness is highlighted by its consistently outstanding performance across various temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.

[120]  arXiv:2404.11916 [pdf, other]
Title: SKIP: Skill-Localized Prompt Tuning for Inference Speed Boost-Up
Comments: 6 pages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Prompt-tuning methods have shown comparable performance as parameter-efficient fine-tuning (PEFT) methods in various natural language understanding tasks. However, existing prompt tuning methods still utilize the entire model architecture; thus, they fail to accelerate inference speed in the application. In this paper, we propose a novel approach called SKIll-localized Prompt tuning (SKIP), which is extremely efficient in inference time. Our method significantly enhances inference efficiency by investigating and utilizing a skill-localized subnetwork in a language model. Surprisingly, our method improves the inference speed up to 160% while pruning 52% of the parameters. Furthermore, we demonstrate that our method is applicable across various transformer-based architectures, thereby confirming its practicality and scalability.

[121]  arXiv:2404.11917 [pdf, other]
Title: Expected Coordinate Improvement for High-Dimensional Bayesian Optimization
Authors: Dawei Zhan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Bayesian optimization (BO) algorithm is very popular for solving low-dimensional expensive optimization problems. Extending Bayesian optimization to high dimension is a meaningful but challenging task. One of the major challenges is that it is difficult to find good infill solutions as the acquisition functions are also high-dimensional. In this work, we propose the expected coordinate improvement (ECI) criterion for high-dimensional Bayesian optimization. The proposed ECI criterion measures the potential improvement we can get by moving the current best solution along one coordinate. The proposed approach selects the coordinate with the highest ECI value to refine in each iteration and covers all the coordinates gradually by iterating over the coordinates. The greatest advantage of the proposed ECI-BO (expected coordinate improvement based Bayesian optimization) algorithm over the standard BO algorithm is that the infill selection problem of the proposed algorithm is always a one-dimensional problem thus can be easily solved. Numerical experiments show that the proposed algorithm can achieve significantly better results than the standard BO algorithm and competitive results when compared with five state-of-the-art high-dimensional BOs. This work provides a simple but efficient approach for high-dimensional Bayesian optimization.

[122]  arXiv:2404.11918 [pdf, other]
Title: TeachNow: Enabling Teachers to Provide Spontaneous, Realtime 1:1 Help in Massive Online Courses
Journal-ref: In Proceedings of the 2024 Innovation and Technology in Computer Science Education (ITiCSE 2024)
Subjects: Computers and Society (cs.CY)

One-on-one help from a teacher is highly impactful for students, yet extremely challenging to support in massive online courses (MOOCs). In this work, we present TeachNow: a novel system that lets volunteer teachers from anywhere in the world instantly provide 1:1 help sessions to students in MOOCs, without any scheduling or coordination overhead. TeachNow works by quickly finding an online student to help and putting them in a collaborative working session with the teacher. The spontaneous, on-demand nature of TeachNow gives teachers the flexibility to help whenever their schedule allows.
We share our experiences deploying TeachNow as an experimental feature in a six week online CS1 course with 9,000 students and 600 volunteer teachers. Even as an optional activity, TeachNow was used by teachers to provide over 12,300 minutes of 1:1 help to 375 unique students. Through a carefully designed randomised control trial, we show that TeachNow sessions increased student course retention rate by almost 15%. Moreover, the flexibility of our system captured valuable volunteer time that would otherwise go to waste. Lastly, TeachNow was rated by teachers as one of the most enjoyable and impactful aspects of their involvement in the course. We believe TeachNow is an important step towards providing more human-centered support in massive online courses.

[123]  arXiv:2404.11922 [pdf, other]
Title: Redefining the Shortest Path Problem Formulation of the Linear Non-Gaussian Acyclic Model: Pairwise Likelihood Ratios, Prior Knowledge, and Path Enumeration
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)

Effective causal discovery is essential for learning the causal graph from observational data. The linear non-Gaussian acyclic model (LiNGAM) operates under the assumption of a linear data generating process with non-Gaussian noise in determining the causal graph. Its assumption of unmeasured confounders being absent, however, poses practical limitations. In response, empirical research has shown that the reformulation of LiNGAM as a shortest path problem (LiNGAM-SPP) addresses this limitation. Within LiNGAM-SPP, mutual information is chosen to serve as the measure of independence. A challenge is introduced - parameter tuning is now needed due to its reliance on kNN mutual information estimators. The paper proposes a threefold enhancement to the LiNGAM-SPP framework.
First, the need for parameter tuning is eliminated by using the pairwise likelihood ratio in lieu of kNN-based mutual information. This substitution is validated on a general data generating process and benchmark real-world data sets, outperforming existing methods especially when given a larger set of features. The incorporation of prior knowledge is then enabled by a node-skipping strategy implemented on the graph representation of all causal orderings to eliminate violations based on the provided input of relative orderings. Flexibility relative to existing approaches is achieved. Last among the three enhancements is the utilization of the distribution of paths in the graph representation of all causal orderings. From this, crucial properties of the true causal graph such as the presence of unmeasured confounders and sparsity may be inferred. To some extent, the expected performance of the causal discovery algorithm may be predicted. The refinements above advance the practicality and performance of LiNGAM-SPP, showcasing the potential of graph-search-based methodologies in advancing causal discovery.

[124]  arXiv:2404.11924 [pdf, other]
Title: Toward Short-Term Glucose Prediction Solely Based on CGM Time Series
Subjects: Artificial Intelligence (cs.AI)

The global diabetes epidemic highlights the importance of maintaining good glycemic control. Glucose prediction is a fundamental aspect of diabetes management, facilitating real-time decision-making. Recent research has introduced models focusing on long-term glucose trend prediction, which are unsuitable for real-time decision-making and result in delayed responses. Conversely, models designed to respond to immediate glucose level changes cannot analyze glucose variability comprehensively. Moreover, contemporary research generally integrates various physiological parameters (e.g. insulin doses, food intake, etc.), which inevitably raises data privacy concerns. To bridge such a research gap, we propose TimeGlu -- an end-to-end pipeline for short-term glucose prediction solely based on CGM time series data. We implement four baseline methods to conduct a comprehensive comparative analysis of the model's performance. Through extensive experiments on two contrasting datasets (CGM Glucose and Colas dataset), TimeGlu achieves state-of-the-art performance without the need for additional personal data from patients, providing effective guidance for real-world diabetic glucose management.

[125]  arXiv:2404.11925 [pdf, other]
Title: EdgeFusion: On-Device Text-to-Image Generation
Comments: 4 pages, accepted to CVPR24 First Workshop on Efficient and On-Device Generation (EDGE)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. It leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.

[126]  arXiv:2404.11932 [pdf, other]
Title: CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment
Comments: 11 pages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance discrepancy mainly stems from the imbalanced distribution of training data across languages during pre-training and instruction tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data. Our method leverages the compressed representation shared by various languages to efficiently enhance the model's task-solving capabilities and multilingual proficiency within a single process. In addition, we introduce a multi-task and multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into the impact of cross-lingual data volume and the integration of translation data on enhancing multilingual consistency and accuracy.

[127]  arXiv:2404.11935 [pdf, other]
Title: A variational discretization method for mean curvature flows by the Onsager principle
Authors: Yihe Liu, Xianmin Xu
Subjects: Numerical Analysis (math.NA)

The mean curvature flow describes the evolution of a surface(a curve) with normal velocity proportional to the local mean curvature. It has many applications in mathematics, science and engineering. In this paper, we develop a numerical method for mean curvature flows by using the Onsager principle as an approximation tool. We first show that the mean curvature flow can be derived naturally from the Onsager variational principle. Then we consider a piecewisely linear approximation of the curve and derive a discrete geometric flow. The discrete flow is described by a system of ordinary differential equations for the nodes of the discrete curve. We prove that the discrete system preserve the energy dissipation structure in the framework of the Onsager principle and this implies the energy decreasing property. The ODE system can be solved by an improved Euler scheme and this leads to an efficient fully discrete scheme. We first consider the method for a simple mean curvature flow and then extend it to volume preserving mean curvature flow and also a wetting problem on substrates. Numerical examples show that the method has optimal convergence rate and works well for all the three problems.

[128]  arXiv:2404.11936 [pdf, other]
Title: LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights
Comments: 8 pages, accepted to CVPR24 First Workshop on Efficient and On-Device Generation (EDGE)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Latent Diffusion Models (LDMs) have emerged as powerful generative models, known for delivering remarkable results under constrained computational resources. However, deploying LDMs on resource-limited devices remains a complex issue, presenting challenges such as memory consumption and inference speed. To address this issue, we introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing LDMs. Traditional pruning methods for deep neural networks are not tailored to the unique characteristics of LDMs, such as the high computational cost of training and the absence of a fast, straightforward and task-agnostic method for evaluating model performance. Our method tackles these challenges by leveraging the latent space during the pruning process, enabling us to effectively quantify the impact of pruning on model performance, independently of the task at hand. This targeted pruning of components with minimal impact on the output allows for faster convergence during training, as the model has less information to re-learn, thereby addressing the high computational cost of training. Consequently, our approach achieves a compressed model that offers improved inference speed and reduced parameter count, while maintaining minimal performance degradation. We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG). Notably, we reduce the inference time of Stable Diffusion (SD) by 34.9% while simultaneously improving its FID by 5.2% on MS-COCO T2I benchmark. This work paves the way for more efficient pruning methods for LDMs, enhancing their applicability.

[129]  arXiv:2404.11938 [pdf, other]
Title: HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis
Comments: 13 pages, IJCAI-2024
Subjects: Multimedia (cs.MM); Distributed, Parallel, and Cluster Computing (cs.DC); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Multimodal Sentiment Analysis (MSA) aims to identify speakers' sentiment tendencies in multimodal video content, raising serious concerns about privacy risks associated with multimodal data, such as voiceprints and facial images. Recent distributed collaborative learning has been verified as an effective paradigm for privacy preservation in multimodal tasks. However, they often overlook the privacy distinctions among different modalities, struggling to strike a balance between performance and privacy preservation. Consequently, it poses an intriguing question of maximizing multimodal utilization to improve performance while simultaneously protecting necessary modalities. This paper forms the first attempt at modality-specified (i.e., audio and visual) privacy preservation in MSA tasks. We propose a novel Hybrid Distributed cross-modality cGAN framework (HyDiscGAN), which learns multimodality alignment to generate fake audio and visual features conditioned on shareable de-identified textual data. The objective is to leverage the fake features to approximate real audio and visual content to guarantee privacy preservation while effectively enhancing performance. Extensive experiments show that compared with the state-of-the-art MSA model, HyDiscGAN can achieve superior or competitive performance while preserving privacy.

[130]  arXiv:2404.11943 [pdf, other]
Title: AgentCoord: Visually Exploring Coordination Strategy for LLM-based Multi-Agent Collaboration
Subjects: Human-Computer Interaction (cs.HC)

The potential of automatic task-solving through Large Language Model (LLM)-based multi-agent collaboration has recently garnered widespread attention from both the research community and industry. While utilizing natural language to coordinate multiple agents presents a promising avenue for democratizing agent technology for general users, designing coordination strategies remains challenging with existing coordination frameworks. This difficulty stems from the inherent ambiguity of natural language for specifying the collaboration process and the significant cognitive effort required to extract crucial information (e.g. agent relationship, task dependency, result correspondence) from a vast amount of text-form content during exploration. In this work, we present a visual exploration framework to facilitate the design of coordination strategies in multi-agent collaboration. We first establish a structured representation for LLM-based multi-agent coordination strategy to regularize the ambiguity of natural language. Based on this structure, we devise a three-stage generation method that leverages LLMs to convert a user's general goal into an executable initial coordination strategy. Users can further intervene at any stage of the generation process, utilizing LLMs and a set of interactions to explore alternative strategies. Whenever a satisfactory strategy is identified, users can commence the collaboration and examine the visually enhanced execution result. We develop AgentCoord, a prototype interactive system, and conduct a formal user study to demonstrate the feasibility and effectiveness of our approach.

[131]  arXiv:2404.11944 [pdf, other]
Title: Trusted Multi-view Learning with Label Noise
Comments: 9 pages, 5 figures, accepted at IJCAI 2024
Subjects: Machine Learning (cs.LG)

Multi-view learning methods often focus on improving decision accuracy while neglecting the decision uncertainty, which significantly restricts their applications in safety-critical applications. To address this issue, researchers propose trusted multi-view methods that learn the class distribution for each instance, enabling the estimation of classification probabilities and uncertainty. However, these methods heavily rely on high-quality ground-truth labels. This motivates us to delve into a new generalized trusted multi-view learning problem: how to develop a reliable multi-view learning model under the guidance of noisy labels? We propose a trusted multi-view noise refining method to solve this problem. We first construct view-opinions using evidential deep neural networks, which consist of belief mass vectors and uncertainty estimates. Subsequently, we design view-specific noise correlation matrices that transform the original opinions into noisy opinions aligned with the noisy labels. Considering label noises originating from low-quality data features and easily-confused classes, we ensure that the diagonal elements of these matrices are inversely proportional to the uncertainty, while incorporating class relations into the off-diagonal elements. Finally, we aggregate the noisy opinions and employ a generalized maximum likelihood loss on the aggregated opinion for model training, guided by the noisy labels. We empirically compare TMNR with state-of-the-art trusted multi-view learning and label noise learning baselines on 5 publicly available datasets. Experiment results show that TMNR outperforms baseline methods on accuracy, reliability and robustness. We promise to release the code and all datasets on Github and show the link here.

[132]  arXiv:2404.11945 [pdf, other]
Title: Terrain-Aware Stride-Level Trajectory Forecasting for a Powered Hip Exoskeleton via Vision and Kinematics Fusion
Comments: 6 pages, submitted to IEEE RA-L, under review. This work has been submitted to the IEEE Robotics and Automation Letters (RA-L) for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Robotics (cs.RO)

Powered hip exoskeletons have shown the ability for locomotion assistance during treadmill walking. However, providing suitable assistance in real-world walking scenarios which involve changing terrain remains challenging. Recent research suggests that forecasting the lower limb joint's angles could provide target trajectories for exoskeletons and prostheses, and the performance could be improved with visual information. In this letter, We share a real-world dataset of 10 healthy subjects walking through five common types of terrain with stride-level label. We design a network called Sandwich Fusion Transformer for Image and Kinematics (SFTIK), which predicts the thigh angle of the ensuing stride given the terrain images at the beginning of the preceding and the ensuing stride and the IMU time series during the preceding stride. We introduce width-level patchify, tailored for egocentric terrain images, to reduce the computational demands. We demonstrate the proposed sandwich input and fusion mechanism could significantly improve the forecasting performance. Overall, the SFTIK outperforms baseline methods, achieving a computational efficiency of 3.31 G Flops, and root mean square error (RMSE) of 3.445 \textpm \ 0.804\textdegree \ and Pearson's correlation coefficient (PCC) of 0.971 \textpm\ 0.025. The results demonstrate that SFTIK could forecast the thigh's angle accurately with low computational cost, which could serve as a terrain adaptive trajectory planning method for hip exoskeletons. Codes and data are available at https://github.com/RuoqiZhao116/SFTIK.

[133]  arXiv:2404.11946 [pdf, other]
Title: S4TP: Social-Suitable and Safety-Sensitive Trajectory Planning for Autonomous Vehicles
Comments: 12 pages,4 figures, published to IEEE Transactions on Intelligent Vehicles
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

In public roads, autonomous vehicles (AVs) face the challenge of frequent interactions with human-driven vehicles (HDVs), which render uncertain driving behavior due to varying social characteristics among humans. To effectively assess the risks prevailing in the vicinity of AVs in social interactive traffic scenarios and achieve safe autonomous driving, this article proposes a social-suitable and safety-sensitive trajectory planning (S4TP) framework. Specifically, S4TP integrates the Social-Aware Trajectory Prediction (SATP) and Social-Aware Driving Risk Field (SADRF) modules. SATP utilizes Transformers to effectively encode the driving scene and incorporates an AV's planned trajectory during the prediction decoding process. SADRF assesses the expected surrounding risk degrees during AVs-HDVs interactions, each with different social characteristics, visualized as two-dimensional heat maps centered on the AV. SADRF models the driving intentions of the surrounding HDVs and predicts trajectories based on the representation of vehicular interactions. S4TP employs an optimization-based approach for motion planning, utilizing the predicted HDVs'trajectories as input. With the integration of SADRF, S4TP executes real-time online optimization of the planned trajectory of AV within lowrisk regions, thus improving the safety and the interpretability of the planned trajectory. We have conducted comprehensive tests of the proposed method using the SMARTS simulator. Experimental results in complex social scenarios, such as unprotected left turn intersections, merging, cruising, and overtaking, validate the superiority of our proposed S4TP in terms of safety and rationality. S4TP achieves a pass rate of 100% across all scenarios, surpassing the current state-of-the-art methods Fanta of 98.25% and Predictive-Decision of 94.75%.

[134]  arXiv:2404.11947 [pdf, other]
Title: VCC-INFUSE: Towards Accurate and Efficient Selection of Unlabeled Examples in Semi-supervised Learning
Comments: Accepted paper of IJCAI 2024. Shijie Fang and Qianhan Feng contributed equally to this paper
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Despite the progress of Semi-supervised Learning (SSL), existing methods fail to utilize unlabeled data effectively and efficiently. Many pseudo-label-based methods select unlabeled examples based on inaccurate confidence scores from the classifier. Most prior work also uses all available unlabeled data without pruning, making it difficult to handle large amounts of unlabeled data. To address these issues, we propose two methods: Variational Confidence Calibration (VCC) and Influence-Function-based Unlabeled Sample Elimination (INFUSE). VCC is an universal plugin for SSL confidence calibration, using a variational autoencoder to select more accurate pseudo labels based on three types of consistency scores. INFUSE is a data pruning method that constructs a core dataset of unlabeled examples under SSL. Our methods are effective in multiple datasets and settings, reducing classification errors rates and saving training time. Together, VCC-INFUSE reduces the error rate of FlexMatch on the CIFAR-100 dataset by 1.08% while saving nearly half of the training time.

[135]  arXiv:2404.11949 [pdf, other]
Title: Sketch-guided Image Inpainting with Partial Discrete Diffusion Process
Comments: Accepted to NTIRE Workshop @ CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural language-guided image inpainting, which excels in capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the object's shape and pose to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed novel transformer module accepts two inputs -- the image containing the masked region to be inpainted and the query sketch to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby, enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from the MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context in terms of the visual appearance of the provided sketch. To aid further research, we have made our code publicly available at https://github.com/vl2g/Sketch-Inpainting .

[136]  arXiv:2404.11957 [pdf, other]
Title: The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models
Authors: Cheng Shi, Sibei Yang
Comments: ICLR2024, Code is released at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Foundation models, pre-trained on a large amount of data have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, \textit{i.e.}, these foundation models fail to discern boundaries between individual objects. For the first time, we probe that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of its particular intermediate layer. Following this surprising observation, we propose $\textbf{Zip}$ which $\textbf{Z}$ips up CL$\textbf{ip}$ and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves comparable performance to the best-performing open-vocabulary object detecters using base annotations. Code is released at https://github.com/ChengShiest/Zip-Your-CLIP

[137]  arXiv:2404.11958 [pdf, other]
Title: Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation
Comments: Accepted by CVPR2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Semantic scene completion, also known as semantic occupancy prediction, can provide dense geometric and semantic information for autonomous vehicles, which attracts the increasing attention of both academia and industry. Unfortunately, existing methods usually formulate this task as a voxel-wise classification problem and treat each voxel equally in 3D space during training. As the hard voxels have not been paid enough attention, the performance in some challenging regions is limited. The 3D dense space typically contains a large number of empty voxels, which are easy to learn but require amounts of computation due to handling all the voxels uniformly for the existing models. Furthermore, the voxels in the boundary region are more challenging to differentiate than those in the interior. In this paper, we propose HASSC approach to train the semantic scene completion model with hardness-aware design. The global hardness from the network optimization process is defined for dynamical hard voxel selection. Then, the local hardness with geometric anisotropy is adopted for voxel-wise refinement. Besides, self-distillation strategy is introduced to make training process stable and consistent. Extensive experiments show that our HASSC scheme can effectively promote the accuracy of the baseline model without incurring the extra inference cost. Source code is available at: https://github.com/songw-zju/HASSC.

[138]  arXiv:2404.11959 [pdf, other]
Title: Segmented Model-Based Hydrogen Delivery Control for PEM Fuel Cells: a Port-Hamiltonian Approach
Comments: 12 pages, 11 Figures
Subjects: Systems and Control (eess.SY)

This paper proposes an extended interconnection and damping assignment passivity-based control technique (IDA-PBC) to control the pressure dynamics in the fuel delivery subsystem (FDS) of proton exchange membrane fuel cells. The fuel cell stack is a distributed parameter model which can be modeled by partial differential equations PDEs). In this paper, the segmentation concept is used to approximate the PDEs model by ordinary differential equations (ODEs) model. Therefore, each segments are having multiple ODEs to obtain the lump-sum model of the segments. Subsequently, a generalized multi-input multi-output lumped parameters model is developed in port-Hamiltonian framework based on mass balance to minimize the modeling error. The modeling errors arises due to the difference between spatially distributed pressures in FDS segments, and also due to the difference between the actual stack pressure and the measured output pressure of the anode. The segments interconnection feasibilities are ensured by maintaining passivity of each segment. With consideration of re-circulation and bleeding of the anode in the modeling, an extended energy-shaping and output tracking IDA-PBC based state-feedback controller is proposed to control the spatially distributed pressure dynamics in the anode. Furthermore, a sliding mode observer of high order is designed to estimate the unmeasurable pressures in FDS with known disturbances. Performance recovery of output feedback control is accomplished with explicit stability analysis. The effectiveness of the proposed IDA-PBC approach is validated by the simulation results.

[139]  arXiv:2404.11960 [pdf, other]
Title: Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

The most recent pointwise Large Language Model (LLM) rankers have achieved remarkable ranking results. However, these rankers are hindered by two major drawbacks: (1) they fail to follow a standardized comparison guidance during the ranking process, and (2) they struggle with comprehensive considerations when dealing with complicated passages. To address these shortcomings, we propose to build a ranker that generates ranking scores based on a set of criteria from various perspectives. These criteria are intended to direct each perspective in providing a distinct yet synergistic evaluation. Our research, which examines eight datasets from the BEIR benchmark demonstrates that incorporating this multi-perspective criteria ensemble approach markedly enhanced the performance of pointwise LLM rankers.

[140]  arXiv:2404.11962 [pdf, other]
Title: ©Plug-in Authorization for Human Content Copyright Protection in Text-to-Image Model
Comments: 20 pages, 6 figures
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

This paper addresses the contentious issue of copyright infringement in images generated by text-to-image models, sparking debates among AI developers, content creators, and legal entities. State-of-the-art models create high-quality content without crediting original creators, causing concern in the artistic community. To mitigate this, we propose the \copyright Plug-in Authorization framework, introducing three operations: addition, extraction, and combination. Addition involves training a \copyright plug-in for specific copyright, facilitating proper credit attribution. Extraction allows creators to reclaim copyright from infringing models, and combination enables users to merge different \copyright plug-ins. These operations act as permits, incentivizing fair use and providing flexibility in authorization. We present innovative approaches,"Reverse LoRA" for extraction and "EasyMerge" for seamless combination. Experiments in artist-style replication and cartoon IP recreation demonstrate \copyright plug-ins' effectiveness, offering a valuable solution for human copyright protection in the age of generative AIs.

[141]  arXiv:2404.11964 [pdf, ps, other]
Title: From Language Models to Practical Self-Improving Computer Agents
Authors: Alex Sheng
Subjects: Artificial Intelligence (cs.AI)

We develop a simple and straightforward methodology to create AI computer agents that can carry out diverse computer tasks and self-improve by developing tools and augmentations to enable themselves to solve increasingly complex tasks. As large language models (LLMs) have been shown to benefit from non-parametric augmentations, a significant body of recent work has focused on developing software that augments LLMs with various capabilities. Rather than manually developing static software to augment LLMs through human engineering effort, we propose that an LLM agent can systematically generate software to augment itself. We show, through a few case studies, that a minimal querying loop with appropriate prompt engineering allows an LLM to generate and use various augmentations, freely extending its own capabilities to carry out real-world computer tasks. Starting with only terminal access, we prompt an LLM agent to augment itself with retrieval, internet search, web navigation, and text editor capabilities. The agent effectively uses these various tools to solve problems including automated software development and web-based tasks.

[142]  arXiv:2404.11968 [pdf, other]
Title: P-NAL: an Effective and Interpretable Entity Alignment Method
Comments: 13 pages, 2 figures
Subjects: Computation and Language (cs.CL)

Entity alignment (EA) aims to find equivalent entities between two Knowledge Graphs. Existing embedding-based EA methods usually encode entities as embeddings, triples as embeddings' constraint and learn to align the embeddings. The structural and side information are usually utilized via embedding propagation, aggregation or interaction. However, the details of the underlying logical inference steps among the alignment process are usually omitted, resulting in inadequate inference process. In this paper, we introduce P-NAL, an entity alignment method that captures two types of logical inference paths with Non-Axiomatic Logic (NAL). Type 1 is the bridge-like inference path between to-be-aligned entity pairs, consisting of two relation/attribute triples and a similarity sentence between the other two entities. Type 2 links the entity pair by their embeddings. P-NAL iteratively aligns entities and relations by integrating the conclusions of the inference paths. Moreover, our method is logically interpretable and extensible due to the expressiveness of NAL. Our proposed method is suitable for various EA settings. Experimental results show that our method outperforms state-of-the-art methods in terms of Hits@1, achieving 0.98+ on all three datasets of DBP15K with both supervised and unsupervised settings. To our knowledge, we present the first in-depth analysis of entity alignment's basic principles from a unified logical perspective.

[143]  arXiv:2404.11972 [pdf, other]
Title: Aligning Language Models to Explicitly Handle Ambiguity
Subjects: Computation and Language (cs.CL)

In spoken languages, utterances are often shaped to be incomplete or vague for efficiency. This can lead to varying interpretations of the same input, based on different assumptions about the context. To ensure reliable user-model interactions in such scenarios, it is crucial for models to adeptly handle the inherent ambiguity in user queries. However, conversational agents built upon even the most recent large language models (LLMs) face challenges in processing ambiguous inputs, primarily due to the following two hurdles: (1) LLMs are not directly trained to handle inputs that are too ambiguous to be properly managed; (2) the degree of ambiguity in an input can vary according to the intrinsic knowledge of the LLMs, which is difficult to investigate. To address these issues, this paper proposes a method to align LLMs to explicitly handle ambiguous inputs. Specifically, we introduce a proxy task that guides LLMs to utilize their intrinsic knowledge to self-disambiguate a given input. We quantify the information gain from the disambiguation procedure as a measure of the extent to which the models perceive their inputs as ambiguous. This measure serves as a cue for selecting samples deemed ambiguous from the models' perspectives, which are then utilized for alignment. Experimental results from several question-answering datasets demonstrate that the LLMs fine-tuned with our approach are capable of handling ambiguous inputs while still performing competitively on clear questions within the task.

[144]  arXiv:2404.11973 [pdf, ps, other]
Title: Exploring the landscape of large language models: Foundations, techniques, and challenges
Subjects: Artificial Intelligence (cs.AI)

In this review paper, we delve into the realm of Large Language Models (LLMs), covering their foundational principles, diverse applications, and nuanced training processes. The article sheds light on the mechanics of in-context learning and a spectrum of fine-tuning approaches, with a special focus on methods that optimize efficiency in parameter usage. Additionally, it explores how LLMs can be more closely aligned with human preferences through innovative reinforcement learning frameworks and other novel methods that incorporate human feedback. The article also examines the emerging technique of retrieval augmented generation, integrating external knowledge into LLMs. The ethical dimensions of LLM deployment are discussed, underscoring the need for mindful and responsible application. Concluding with a perspective on future research trajectories, this review offers a succinct yet comprehensive overview of the current state and emerging trends in the evolving landscape of LLMs, serving as an insightful guide for both researchers and practitioners in artificial intelligence.

[145]  arXiv:2404.11976 [pdf, other]
Title: Large Language Models: From Notes to Musical Form
Authors: Lilac Atassi
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

While many topics of the learning-based approach to automated music generation are under active research, musical form is under-researched. In particular, recent methods based on deep learning models generate music that, at the largest time scale, lacks any structure. In practice, music longer than one minute generated by such models is either unpleasantly repetitive or directionless. Adapting a recent music generation model, this paper proposes a novel method to generate music with form. The experimental results show that the proposed method can generate 2.5-minute-long music that is considered as pleasant as the music used to train the model. The paper first reviews a recent music generation method based on language models (transformer architecture). We discuss why learning musical form by such models is infeasible. Then we discuss our proposed method and the experiments.

[146]  arXiv:2404.11977 [pdf, other]
Title: Corpus Christi: Establishing Replicability when Sharing the Bread is Not Allowed
Comments: Preprint of Submitted Paper
Subjects: Cryptography and Security (cs.CR); Digital Libraries (cs.DL)

In this paper, we provide practical tools to improve the scientific soundness of firmware corpora beyond the state of the art. We identify binary analysis challenges that significantly impact corpus creation. We use them to derive a framework of key corpus requirements that nurture the scientific goals of replicability and representativeness. We apply the framework to 44 top tier papers and collect 704 data points to show that there is currently no common ground on corpus creation. We discover in otherwise excellent work, that incomplete documentation and inflated corpus sizes blur visions on representativeness and hinder replicability. Our results show that the strict framework provides useful and practical guidelines that can identify miniscule step stones in corpus creation with significant impact on soundness.
Finally, we show that it is possible to meet all requirements: We provide a new corpus called LFwC. It is designed for large-scale static analyses on Linux-based firmware and consists of 10,913 high-quality images, covering 2,365 network appliances. We share rich meta data and scripts for replicability with the community. We verify unpacking, perform deduplication, identify contents, and provide bug ground truth. We identify ISAs and Linux kernels. All samples can be unpacked with the open source tool FACT.

[147]  arXiv:2404.11978 [pdf, other]
Title: EVIT: Event-Oriented Instruction Tuning for Event Reasoning
Subjects: Computation and Language (cs.CL)

Events refer to specific occurrences, incidents, or happenings that take place under a particular background. Event reasoning aims to infer events according to certain relations and predict future events. The cutting-edge techniques for event reasoning play a crucial role in various natural language processing applications. Large language models (LLMs) have made significant advancements in event reasoning owing to their wealth of knowledge and reasoning capabilities. However, smaller instruction-tuned models currently in use do not consistently demonstrate exceptional proficiency in managing these tasks. This discrepancy arises from the absence of explicit modeling of events and the interconnections of them within their instruction data. Consequently, these models face challenges in comprehending event structures and semantics while struggling to bridge the gap between their interpretations and human understanding of events. Additionally, their limitations in grasping event relations lead to constrained event reasoning abilities to effectively deduce and incorporate pertinent event knowledge. In this paper, we propose Event-Oriented Instruction Tuning (EvIT) to train our LLM. Specifically, we first propose a novel structure named event quadruple which contains the structure and semantics of events and is complete in the event representation. We then design event-relation learning based on the structures. We encapsulate the learning into the instruction-tuning formulation to better stimulate the event reasoning capacity of our model. We design a heuristic unsupervised method to mine event quadruple from a large-scale corpus. At last, we finetune a Llama model on our Event-Oriented Instruction Tuning. We conduct extensive experiments on event reasoning tasks on several datasets. Automatic and human evaluations demonstrate EvIT achieves competitive performances on event reasoning.

[148]  arXiv:2404.11979 [pdf, other]
Title: MTGA: Multi-view Temporal Granularity aligned Aggregation for Event-based Lip-reading
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Lip-reading is to utilize the visual information of the speaker's lip movements to recognize words and sentences. Existing event-based lip-reading solutions integrate different frame rate branches to learn spatio-temporal features of varying granularities. However, aggregating events into event frames inevitably leads to the loss of fine-grained temporal information within frames. To remedy this drawback, we propose a novel framework termed Multi-view Temporal Granularity aligned Aggregation (MTGA). Specifically, we first present a novel event representation method, namely time-segmented voxel graph list, where the most significant local voxels are temporally connected into a graph list. Then we design a spatio-temporal fusion module based on temporal granularity alignment, where the global spatial features extracted from event frames, together with the local relative spatial and temporal features contained in voxel graph list are effectively aligned and integrated. Finally, we design a temporal aggregation module that incorporates positional encoding, which enables the capture of local absolute spatial and global temporal information. Experiments demonstrate that our method outperforms both the event-based and video-based lip-reading counterparts. Our code will be publicly available.

[149]  arXiv:2404.11981 [pdf, other]
Title: Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Weakly Incremental Learning for Semantic Segmentation (WILSS) leverages a pre-trained segmentation model to segment new classes using cost-effective and readily available image-level labels. A prevailing way to solve WILSS is the generation of seed areas for each new class, serving as a form of pixel-level supervision. However, a scenario usually arises where a pixel is concurrently predicted as an old class by the pre-trained segmentation model and a new class by the seed areas. Such a scenario becomes particularly problematic in WILSS, as the lack of pixel-level annotations on new classes makes it intractable to ascertain whether the pixel pertains to the new class or not. To surmount this issue, we propose an innovative, tendency-driven relationship of mutual exclusivity, meticulously tailored to govern the behavior of the seed areas and the predictions generated by the pre-trained segmentation model. This relationship stipulates that predictions for the new and old classes must not conflict whilst prioritizing the preservation of predictions for the old classes, which not only addresses the conflicting prediction issue but also effectively mitigates the inherent challenge of incremental learning - catastrophic forgetting. Furthermore, under the auspices of this tendency-driven mutual exclusivity relationship, we generate pseudo masks for the new classes, allowing for concurrent execution with model parameter updating via the resolution of a bi-level optimization problem. Extensive experiments substantiate the effectiveness of our framework, resulting in the establishment of new benchmarks and paving the way for further research in this field.

[150]  arXiv:2404.11982 [pdf, other]
Title: SIGformer: Sign-aware Graph Transformer for Recommendation
Comments: Accepted by SIGIR2024
Subjects: Information Retrieval (cs.IR)

In recommender systems, most graph-based methods focus on positive user feedback, while overlooking the valuable negative feedback. Integrating both positive and negative feedback to form a signed graph can lead to a more comprehensive understanding of user preferences. However, the existing efforts to incorporate both types of feedback are sparse and face two main limitations: 1) They process positive and negative feedback separately, which fails to holistically leverage the collaborative information within the signed graph; 2) They rely on MLPs or GNNs for information extraction from negative feedback, which may not be effective.
To overcome these limitations, we introduce SIGformer, a new method that employs the transformer architecture to sign-aware graph-based recommendation. SIGformer incorporates two innovative positional encodings that capture the spectral properties and path patterns of the signed graph, enabling the full exploitation of the entire graph. Our extensive experiments across five real-world datasets demonstrate the superiority of SIGformer over state-of-the-art methods. The code is available at https://github.com/StupidThree/SIGformer.

[151]  arXiv:2404.11983 [pdf, other]
Title: Monte Carlo method and the random isentropic Euler system
Subjects: Numerical Analysis (math.NA)

We show several results on convergence of the Monte Carlo method applied to consistent approximations of the isentropic Euler system of gas dynamics with uncertain initial data. Our method is based on combination of several new concepts. We work with the dissipative weak solutions that can be seen as a universal closure of consistent approximations. Further, we apply the set-valued version of the Strong law of large numbers for general multivalued mapping with closed range and the Koml\'os theorem on strong converge of empirical averages of integrable functions. Theoretical results are illustrated by a series of numerical simulations obtained by an unconditionally convergent viscosity finite volume method combined with the Monte Carlo method.

[152]  arXiv:2404.11986 [pdf, other]
Title: New Analysis of Overlapping Schwarz Methods for Vector Field Problems in Three Dimensions with Generally Shaped Domains
Subjects: Numerical Analysis (math.NA)

This paper introduces a novel approach to analyzing overlapping Schwarz methods for N\'{e}d\'{e}lec and Raviart--Thomas vector field problems. The theory is based on new regular stable decompositions for vector fields that are robust to the topology of the domain. Enhanced estimates for the condition numbers of the preconditioned linear systems are derived, dependent linearly on the relative overlap between the overlapping subdomains. Furthermore, we present the numerical experiments which support our theoretical results.

[153]  arXiv:2404.11987 [pdf, other]
Title: MultiPhys: Multi-Person Physics-aware 3D Motion Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We introduce MultiPhys, a method designed for recovering multi-person motion from monocular videos. Our focus lies in capturing coherent spatial placement between pairs of individuals across varying degrees of engagement. MultiPhys, being physically aware, exhibits robustness to jittering and occlusions, and effectively eliminates penetration issues between the two individuals. We devise a pipeline in which the motion estimated by a kinematic-based method is fed into a physics simulator in an autoregressive manner. We introduce distinct components that enable our model to harness the simulator's properties without compromising the accuracy of the kinematic estimates. This results in final motion estimates that are both kinematically coherent and physically compliant. Extensive evaluations on three challenging datasets characterized by substantial inter-person interaction show that our method significantly reduces errors associated with penetration and foot skating, while performing competitively with the state-of-the-art on motion accuracy and smoothness. Results and code can be found on our project page (this http URL).

[154]  arXiv:2404.11988 [pdf, other]
Title: The Emerging AI Divide in the United States
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

The digital divide describes disparities in access to and usage of digital tooling between social and economic groups. Emerging generative artificial intelligence tools, which strongly affect productivity, could magnify the impact of these divides. However, the affordability, multi-modality, and multilingual capabilities of these tools could also make them more accessible to diverse users in comparison with previous forms of digital tooling. In this study, we characterize spatial differences in U.S. residents' knowledge of a new generative AI tool, ChatGPT, through an analysis of state- and county-level search query data. In the first six months after the tool's release, we observe the highest rates of users searching for ChatGPT in West Coast states and persistently low rates of search in Appalachian and Gulf states. Counties with the highest rates of search are relatively more urbanized and have proportionally more educated, more economically advantaged, and more Asian residents in comparison with other counties or with the U.S. average. In multilevel models adjusting for socioeconomic and demographic factors as well as industry makeup, education is the strongest positive predictor of rates of search for generative AI tooling. Although generative AI technologies may be novel, early differences in uptake appear to be following familiar paths of digital marginalization.

[155]  arXiv:2404.11993 [pdf, other]
Title: Knowledge-Aware Multi-Intent Contrastive Learning for Multi-Behavior Recommendation
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Multi-behavioral recommendation optimizes user experiences by providing users with more accurate choices based on their diverse behaviors, such as view, add to cart, and purchase. Current studies on multi-behavioral recommendation mainly explore the connections and differences between multi-behaviors from an implicit perspective. Specifically, they directly model those relations using black-box neural networks. In fact, users' interactions with items under different behaviors are driven by distinct intents. For instance, when users view products, they tend to pay greater attention to information such as ratings and brands. However, when it comes to the purchasing phase, users become more price-conscious. To tackle this challenge and data sparsity problem in the multi-behavioral recommendation, we propose a novel model: Knowledge-Aware Multi-Intent Contrastive Learning (KAMCL) model. This model uses relationships in the knowledge graph to construct intents, aiming to mine the connections between users' multi-behaviors from the perspective of intents to achieve more accurate recommendations. KAMCL is equipped with two contrastive learning schemes to alleviate the data scarcity problem and further enhance user representations. Extensive experiments on three real datasets demonstrate the superiority of our model.

[156]  arXiv:2404.11995 [pdf, other]
Title: Cost and CO2 emissions co-optimisation of green hydrogen production in a grid-connected renewable energy system
Subjects: Systems and Control (eess.SY)

Green hydrogen is essential for producing renewable fuels that are needed in sectors that are hard to electrify directly. Hydrogen production in a grid-connected hybrid renewable energy plant necessitates smart planning to meet long-term hydrogen trading agreements while minimising costs and emissions. Previous research analysed economic and environmental impact of hydrogen production based on full foresight of renewable energy availabilty, electricity price, and CO2 intensity in the electricity grid. However, the full foresight assumption is impractical in day-to-day operation, often leading to underestimations of both the cost and CO2 emissions associated with hydrogen production. Therefore, this research introduces a novel long-term planner that uses historical data and short-term forecasts to plan hydrogen production in the day-to-day operation of a grid-connected hybrid renewable energy plant. The long-term planner co-minimises cost and CO2 emissions to determine the hydrogen production for the next day taking into account the remaining hydrogen production and the time remaining until the end of the delivery period, which can be a week, a month, or a year. Extended delivery periods provide operation flexibility, enabling cost and CO2 emissions reductions. Significant reductions in CO2 emissions can be achieved with relatively small increases in the levelised cost. Under day-to-day operation, the levelised cost of hydrogen is marginally higher than that of the full foresight; the CO2 emissions can be up to 60% higher. Despite a significant portion of the produced hydrogen not meeting the criteria for green hydrogen designation under current rules, CO2 emissions are lower than those from existing alternative hydrogen production methods. These results underscore the importance of balancing cost considerations with environmental impacts in operational decision-making.

[157]  arXiv:2404.11996 [pdf, other]
Title: DST-GTN: Dynamic Spatio-Temporal Graph Transformer Network for Traffic Forecasting
Subjects: Artificial Intelligence (cs.AI)

Accurate traffic forecasting is essential for effective urban planning and congestion management. Deep learning (DL) approaches have gained colossal success in traffic forecasting but still face challenges in capturing the intricacies of traffic dynamics. In this paper, we identify and address this challenges by emphasizing that spatial features are inherently dynamic and change over time. A novel in-depth feature representation, called Dynamic Spatio-Temporal (Dyn-ST) features, is introduced, which encapsulates spatial characteristics across varying times. Moreover, a Dynamic Spatio-Temporal Graph Transformer Network (DST-GTN) is proposed by capturing Dyn-ST features and other dynamic adjacency relations between intersections. The DST-GTN can model dynamic ST relationships between nodes accurately and refine the representation of global and local ST characteristics by adopting adaptive weights in low-pass and all-pass filters, enabling the extraction of Dyn-ST features from traffic time-series data. Through numerical experiments on public datasets, the DST-GTN achieves state-of-the-art performance for a range of traffic forecasting tasks and demonstrates enhanced stability.

[158]  arXiv:2404.11998 [pdf, other]
Title: Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation
Comments: Accepted to CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions, yet relying on cost-intensive mask annotations. Weakly supervised RIS thus learns from image-text pairs to pixel-level semantics, which is challenging for segmenting fine-grained masks. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even lead to performance regression due to the inevitable noise issues and challenges in excessive focus on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), incorporated with the proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also generates negative point prompts to address the noisy and excessive focus issues inherently and effectively. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT gradually learn from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU by 11.34%, 14.14%, and 6.97% across RefCOCO, RefCOCO+, and G-Ref, respectively.

[159]  arXiv:2404.11999 [pdf, other]
Title: Token-level Direct Preference Optimization
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs in a token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.

[160]  arXiv:2404.12000 [pdf, other]
Title: How far are AI-powered programming assistants from meeting developers' needs?
Subjects: Software Engineering (cs.SE)

Recent In-IDE AI coding assistant tools (ACATs) like GitHub Copilot have significantly impacted developers' coding habits. While some studies have examined their effectiveness, there lacks in-depth investigation into the actual assistance process. To bridge this gap, we simulate real development scenarios encompassing three typical types of software development tasks and recruit 27 computer science students to investigate their behavior with three popular ACATs. Our goal is to comprehensively assess ACATs' effectiveness, explore characteristics of recommended code, identify reasons for modifications, and understand users' challenges and expectations. To facilitate the study, we develop an experimental platform that includes a data collection plugin for VSCode IDE and provides functions for screen recording, code evaluation, and automatic generation of personalized interview and survey questions. Through analysis of the collected data, we find that ACATs generally enhance task completion rates, reduce time, improve code quality, and increase self-perceived productivity. However, the improvement is influenced by both the nature of coding tasks and users' experience level. Notably, for experienced participants, the use of ACATs may even increase completion time. We observe that "edited line completion" is the most frequently recommended way, while "comments completion" and "string completion" have the lowest acceptance rates. The primary reasons for modifying recommended code are disparities between output formats and requirements, flawed logic, and inconsistent code styles. In terms of challenges and expectations, optimization of service access and help documentation is also concerned by participants except for functionality and performance. Our study provides valuable insights into the effectiveness and usability of ACATs, informing further improvements in their design and implementation.

[161]  arXiv:2404.12006 [pdf, other]
Title: Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction
Subjects: Computation and Language (cs.CL)

Multi-modal relation extraction (MMRE) is a challenging task that aims to identify relations between entities in text leveraging image information. Existing methods are limited by their neglect of the multiple entity pairs in one sentence sharing very similar contextual information (ie, the same text and image), resulting in increased difficulty in the MMRE task. To address this limitation, we propose the Variational Multi-Modal Hypergraph Attention Network (VM-HAN) for multi-modal relation extraction. Specifically, we first construct a multi-modal hypergraph for each sentence with the corresponding image, to establish different high-order intra-/inter-modal correlations for different entity pairs in each sentence. We further design the Variational Hypergraph Attention Networks (V-HAN) to obtain representational diversity among different entity pairs using Gaussian distribution and learn a better hypergraph structure via variational attention. VM-HAN achieves state-of-the-art performance on the multi-modal relation extraction task, outperforming existing methods in terms of accuracy and efficiency.

[162]  arXiv:2404.12008 [pdf, other]
Title: How Do Recommendation Models Amplify Popularity Bias? An Analysis from the Spectral Perspective
Comments: 23 pages, 7 figures
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Recommendation Systems (RS) are often plagued by popularity bias. Specifically,when recommendation models are trained on long-tailed datasets, they not only inherit this bias but often exacerbate it. This effect undermines both the precision and fairness of RS and catalyzes the so-called Matthew Effect. Despite the widely recognition of this issue, the fundamental causes remain largely elusive. In our research, we delve deeply into popularity bias amplification. Our comprehensive theoretical and empirical investigations lead to two core insights: 1) Item popularity is memorized in the principal singular vector of the score matrix predicted by the recommendation model; 2) The dimension collapse phenomenon amplifies the impact of principal singular vector on model predictions, intensifying the popularity bias. Based on these insights, we propose a novel method to mitigate this bias by imposing penalties on the magnitude of the principal singular value. Considering the heavy computational burden in directly evaluating the gradient of the principal singular value, we develop an efficient algorithm that harnesses the inherent properties of the singular vector. Extensive experiments across seven real-world datasets and three testing scenarios have been conducted to validate the superiority of our method.

[163]  arXiv:2404.12010 [pdf, other]
Title: ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Paraphrase generation is a pivotal task in natural language processing (NLP). Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English language sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLM) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation as it contains one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.

[164]  arXiv:2404.12011 [pdf, other]
Title: Pseudo-random generators using linear feedback shift registers with output extraction
Authors: Holger Nobach
Comments: 18 pages, 13 figures
Subjects: Cryptography and Security (cs.CR)

The use of three extractors, fed by linear feedback shift registers (LFSR) for generating pseudo-random bit streams is investigated. Specifically, a standard LFSR is combined with a von Neumann extractor, a modified LFSR, extended by the all-zero state, is combined with an output logic, which translates every three bits from the LFSR into up to two output bits and a run extraction of the input bit stream into single output bits are investigated. The latter two achieve better efficiency in using bits from the primary bit stream, the last one reaches 50\%. Compared to other generator logics, the three extractors investigated are less performant in terms of their cryptographic strength. However, the focus of this report is on the quality of the pseudo-random bit stream in comparison to really random bits and on the efficiency of using the bits of the primary stream from the LFSR and generating valid output bits, while fulfilling a minimum cryptographic strength only, beyond that of the pure LFSR.

[165]  arXiv:2404.12013 [pdf, other]
Title: Sequential Compositional Generalization in Multimodal Models
Comments: Accepted to the main conference of NAACL (2024) as a long paper
Subjects: Computation and Language (cs.CL)

The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using \textsc{CompAct} (\underline{Comp}ositional \underline{Act}ivities)\footnote{Project Page: \url{this http URL}}, a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.

[166]  arXiv:2404.12014 [pdf, other]
Title: Enhance Robustness of Language Models Against Variation Attack through Graph Integration
Comments: 12 pages, 4 figures, accepted by COLING 2024
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

The widespread use of pre-trained language models (PLMs) in natural language processing (NLP) has greatly improved performance outcomes. However, these models' vulnerability to adversarial attacks (e.g., camouflaged hints from drug dealers), particularly in the Chinese language with its rich character diversity/variation and complex structures, hatches vital apprehension. In this study, we propose a novel method, CHinese vAriatioN Graph Enhancement (CHANGE), to increase the robustness of PLMs against character variation attacks in Chinese content. CHANGE presents a novel approach for incorporating a Chinese character variation graph into the PLMs. Through designing different supplementary tasks utilizing the graph structure, CHANGE essentially enhances PLMs' interpretation of adversarially manipulated text. Experiments conducted in a multitude of NLP tasks show that CHANGE outperforms current language models in combating against adversarial attacks and serves as a valuable contribution to robust language model research. These findings contribute to the groundwork on robust language models and highlight the substantial potential of graph-guided pre-training strategies for real-world applications.

[167]  arXiv:2404.12015 [pdf, other]
Title: What does CLIP know about peeling a banana?
Comments: Accepted to MAR Workshop at CVPR2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Humans show an innate capability to identify tools to support specific actions. The association between objects parts and the actions they facilitate is usually named affordance. Being able to segment objects parts depending on the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP, to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordances detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions and iii) eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.

[168]  arXiv:2404.12018 [pdf, other]
Title: Automated Real-Time Inspection in Indoor and Outdoor 3D Environments with Cooperative Aerial Robots
Comments: 2024 International Conference on Unmanned Aircraft Systems (ICUAS)
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

This work introduces a cooperative inspection system designed to efficiently control and coordinate a team of distributed heterogeneous UAV agents for the inspection of 3D structures in cluttered, unknown spaces. Our proposed approach employs a two-stage innovative methodology. Initially, it leverages the complementary sensing capabilities of the robots to cooperatively map the unknown environment. It then generates optimized, collision-free inspection paths, thereby ensuring comprehensive coverage of the structure's surface area. The effectiveness of our system is demonstrated through qualitative and quantitative results from extensive Gazebo-based simulations that closely replicate real-world inspection scenarios, highlighting its ability to thoroughly inspect real-world-like 3D structures.

[169]  arXiv:2404.12020 [pdf, other]
Title: Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Comments: 16 pages, 9 figures,5 Tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, \textit{MUSIC-AVQA-R}, crafted in two steps: rephrasing questions within the test split of a public dataset (\textit{MUSIC-AVQA}) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on both datasets, especially obtaining a significant improvement of 9.68\% on the proposed dataset. Extensive ablation experiments are conducted on these two datasets to validate the effectiveness of the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through the evaluation on our dataset.

[170]  arXiv:2404.12022 [pdf, other]
Title: Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration
Subjects: Computation and Language (cs.CL)

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly evident when utilizing autoregressive decoding methods, which generate one token in a single forward process, thereby not fully capitalizing on the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, namely \textit{hidden transfer}, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to the \textit{pseudo} hidden states of the future tokens to be generated, and then the pseudo hidden states will pass the following transformer layers thereby assimilating more semantic information and achieving superior predictive accuracy of the future tokens.
Besides, we use the novel tree attention mechanism to simultaneously generate and verify multiple candidates of output sequences, which ensure the lossless generation and further improves the generation efficiency of our method. Experiments demonstrate the effectiveness of our method. We conduct a lot of analytic experiments to prove our motivation. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.

[171]  arXiv:2404.12023 [pdf, other]
Title: Context-Aware Orchestration of Energy-Efficient Gossip Learning Schemes
Comments: IEEE AIIOT 2024
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Fully distributed learning schemes such as Gossip Learning (GL) are gaining momentum due to their scalability and effectiveness even in dynamic settings. However, they often imply a high utilization of communication and computing resources, whose energy footprint may jeopardize the learning process, particularly on battery-operated IoT devices. To address this issue, we present Optimized Gossip Learning (OGL)}, a distributed training approach based on the combination of GL with adaptive optimization of the learning process, which allows for achieving a target accuracy while minimizing the energy consumption of the learning process. We propose a data-driven approach to OGL management that relies on optimizing in real-time for each node the number of training epochs and the choice of which model to exchange with neighbors based on patterns of node contacts, models' quality, and available resources at each node. Our approach employs a DNN model for dynamic tuning of the aforementioned parameters, trained by an infrastructure-based orchestrator function. We performed our assessments on two different datasets, leveraging time-varying random graphs and a measurement-based dynamic urban scenario. Results suggest that our approach is highly efficient and effective in a broad spectrum of network scenarios.

[172]  arXiv:2404.12024 [pdf, other]
Title: Meta-Auxiliary Learning for Micro-Expression Recognition
Comments: 10 pages, 7 figures, 3 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Micro-expressions (MEs) are involuntary movements revealing people's hidden feelings, which has attracted numerous interests for its objectivity in emotion detection. However, despite its wide applications in various scenarios, micro-expression recognition (MER) remains a challenging problem in real life due to three reasons, including (i) data-level: lack of data and imbalanced classes, (ii) feature-level: subtle, rapid changing, and complex features of MEs, and (iii) decision-making-level: impact of individual differences. To address these issues, we propose a dual-branch meta-auxiliary learning method, called LightmanNet, for fast and robust micro-expression recognition. Specifically, LightmanNet learns general MER knowledge from limited data through a dual-branch bi-level optimization process: (i) In the first level, it obtains task-specific MER knowledge by learning in two branches, where the first branch is for learning MER features via primary MER tasks, while the other branch is for guiding the model obtain discriminative features via auxiliary tasks, i.e., image alignment between micro-expressions and macro-expressions since their resemblance in both spatial and temporal behavioral patterns. The two branches of learning jointly constrain the model of learning meaningful task-specific MER knowledge while avoiding learning noise or superficial connections between MEs and emotions that may damage its generalization ability. (ii) In the second level, LightmanNet further refines the learned task-specific knowledge, improving model generalization and efficiency. Extensive experiments on various benchmark datasets demonstrate the superior robustness and efficiency of LightmanNet.

[173]  arXiv:2404.12025 [pdf, other]
Title: PID Tuning using Cross-Entropy Deep Learning: a Lyapunov Stability Analysis
Journal-ref: IFAC-PapersOnLine, Volume 55, Issue 31, 2022
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)

Underwater Unmanned Vehicles (UUVs) have to constantly compensate for the external disturbing forces acting on their body. Adaptive Control theory is commonly used there to grant the control law some flexibility in its response to process variation. Today, learning-based (LB) adaptive methods are leading the field where model-based control structures are combined with deep model-free learning algorithms. This work proposes experiments and metrics to empirically study the stability of such a controller. We perform this stability analysis on a LB adaptive control system whose adaptive parameters are determined using a Cross-Entropy Deep Learning method.

[174]  arXiv:2404.12030 [pdf, other]
Title: Mapping back and forth between model predictive control and neural networks
Comments: 13 pages
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)

Model predictive control (MPC) for linear systems with quadratic costs and linear constraints is shown to admit an exact representation as an implicit neural network. A method to "unravel" the implicit neural network of MPC into an explicit one is also introduced. As well as building links between model-based and data-driven control, these results emphasize the capability of implicit neural networks for representing solutions of optimisation problems, as such problems are themselves implicitly defined functions.

[175]  arXiv:2404.12031 [pdf, other]
Title: MLS-Track: Multilevel Semantic Interaction in RMOT
Comments: 17 pages 8 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The new trend in multi-object tracking task is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method base on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer through the introduction of Semantic Guidance Module (SGM) and Semantic Correlation Branch (SCB). Extensive experiments on Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework and it achieves state-of-the-art performance. Code and datatsets will be available.

[176]  arXiv:2404.12035 [pdf, ps, other]
Title: Monitoring Unmanned Aircraft: Specification, Integration, and Lessons-learned
Subjects: Software Engineering (cs.SE); Logic in Computer Science (cs.LO)

This paper reports on the integration of runtime monitoring into fully-electric aircraft designed by Volocopter, a German aircraft manufacturer of electric multi-rotor helicopters. The runtime monitor recognizes hazardous situations and system faults. Since the correct operation of the monitor is critical for the safety of the aircraft, the development of the monitor must follow strict aeronautical standards. This includes the integration of the monitor into different development environments, such as log-file analysis, hardware/software-in-the-loop testing, and test flights. We have used the stream-based monitoring framework RTLola to generate monitors for a range of requirements. In this paper, we present representative monitoring specifications and our lessons learned from integrating the generated monitors. Our main finding is that the specification and the integration need to be decoupled, because the specification remains stable throughout the development process, whereas the different development stages require a separate integration of the monitor into each environment. We achieve this decoupling with a novel abstraction layer in the monitoring framework that adapts the monitor to each environment without affecting the core component generated from the specification. The decoupling of the integration has also allowed us to react quickly to the frequent changes in the hardware and software environment of the monitor due to the fast-paced development of the aircraft in a startup company.

[177]  arXiv:2404.12037 [pdf, other]
Title: Data-free Knowledge Distillation for Fine-grained Visual Categorization
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Data-free knowledge distillation (DFKD) is a promising approach for addressing issues related to model compression, security privacy, and transmission restrictions. Although the existing methods exploiting DFKD have achieved inspiring achievements in coarse-grained classification, in practical applications involving fine-grained classification tasks that require more detailed distinctions between similar categories, sub-optimal results are obtained. To address this issue, we propose an approach called DFKD-FGVC that extends DFKD to fine-grained visual categorization~(FGVC) tasks. Our approach utilizes an adversarial distillation framework with attention generator, mixed high-order attention distillation, and semantic feature contrast learning. Specifically, we introduce a spatial-wise attention mechanism to the generator to synthesize fine-grained images with more details of discriminative parts. We also utilize the mixed high-order attention mechanism to capture complex interactions among parts and the subtle differences among discriminative features of the fine-grained categories, paying attention to both local features and semantic context relationships. Moreover, we leverage the teacher and student models of the distillation framework to contrast high-level semantic feature maps in the hyperspace, comparing variances of different categories. We evaluate our approach on three widely-used FGVC benchmarks (Aircraft, Cars196, and CUB200) and demonstrate its superior performance.

[178]  arXiv:2404.12038 [pdf, other]
Title: Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector
Subjects: Computation and Language (cs.CL)

Current open-source large language models (LLMs) are often undergone careful safety alignment before public release. Some attack methods have also been proposed that help check for safety vulnerabilities in LLMs to ensure alignment robustness. However, many of these methods have moderate attack success rates. Even when successful, the harmfulness of their outputs cannot be guaranteed, leading to suspicions that these methods have not accurately identified the safety vulnerabilities of LLMs. In this paper, we introduce a LLM attack method utilizing concept-based model explanation, where we extract safety concept activation vectors (SCAVs) from LLMs' activation space, enabling efficient attacks on well-aligned LLMs like LLaMA-2, achieving near 100% attack success rate as if LLMs are completely unaligned. This suggests that LLMs, even after thorough safety alignment, could still pose potential risks to society upon public release. To evaluate the harmfulness of outputs resulting with various attack methods, we propose a comprehensive evaluation method that reduces the potential inaccuracies of existing evaluations, and further validate that our method causes more harmful content. Additionally, we discover that the SCAVs show some transferability across different open-source LLMs.

[179]  arXiv:2404.12041 [pdf, other]
Title: Can We Catch the Elephant? The Evolvement of Hallucination Evaluation on Natural Language Generation: A Survey
Comments: 19 pages in total, with 9 pages as main body. Under review as a conference paper at CoLM 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Hallucination in Natural Language Generation (NLG) is like the elephant in the room, obvious but often overlooked until recent achievements significantly improved the fluency and grammatical accuracy of generated text. For Large Language Models (LLMs), hallucinations can happen in various downstream tasks and casual conversations, which need accurate assessment to enhance reliability and safety. However, current studies on hallucination evaluation vary greatly, and people still find it difficult to sort out and select the most appropriate evaluation methods. Moreover, as NLP research gradually shifts to the domain of LLMs, it brings new challenges to this direction. This paper provides a comprehensive survey on the evolvement of hallucination evaluation methods, aiming to address three key aspects: 1) Diverse definitions and granularity of facts; 2) The categories of automatic evaluators and their applicability; 3) Unresolved issues and future directions.

[180]  arXiv:2404.12042 [pdf, other]
Title: Exploring Boundaries and Intensities in Offensive and Hate Speech: Unveiling the Complex Spectrum of Social Media Discourse
Subjects: Computation and Language (cs.CL)

The prevalence of digital media and evolving sociopolitical dynamics have significantly amplified the dissemination of hateful content. Existing studies mainly focus on classifying texts into binary categories, often overlooking the continuous spectrum of offensiveness and hatefulness inherent in the text. In this research, we present an extensive benchmark dataset for Amharic, comprising 8,258 tweets annotated for three distinct tasks: category classification, identification of hate targets, and rating offensiveness and hatefulness intensities. Our study highlights that a considerable majority of tweets belong to the less offensive and less hate intensity levels, underscoring the need for early interventions by stakeholders. The prevalence of ethnic and political hatred targets, with significant overlaps in our dataset, emphasizes the complex relationships within Ethiopia's sociopolitical landscape. We build classification and regression models and investigate the efficacy of models in handling these tasks. Our results reveal that hate and offensive speech can not be addressed by a simplistic binary classification, instead manifesting as variables across a continuous range of values. The Afro-XLMR-large model exhibits the best performances achieving F1-scores of 75.30%, 70.59%, and 29.42% for the category, target, and regression tasks, respectively. The 80.22% correlation coefficient of the Afro-XLMR-large model indicates strong alignments.

[181]  arXiv:2404.12043 [pdf, other]
Title: Using Real-world Bug Bounty Programs in Secure Coding Course: Experience Report
Comments: 7 pages, 1 figure, to be published at ACM conference on Innovation and Technology in Computer Science Education (ITiCSE 2024)
Subjects: Cryptography and Security (cs.CR)

To keep up with the growing number of cyber-attacks and associated threats, there is an ever-increasing demand for cybersecurity professionals and new methods and technologies. Training new cybersecurity professionals is a challenging task due to the broad scope of the area. One particular field where there is a shortage of experts is Ethical Hacking. Due to its complexity, it often faces educational constraints. Recognizing these challenges, we propose a solution: integrating a real-world bug bounty programme into cybersecurity curriculum. This innovative approach aims to fill the gap in practical cybersecurity education and also brings additional positive benefits. To evaluate our idea, we include the proposed solution to a secure coding course for IT-oriented faculty. We let students choose to participate in a bug bounty programme as an option for the semester assignment in a secure coding course. We then collected responses from the students to evaluate the outcomes (improved skills, reported vulnerabilities, a better relationship with security, etc.). Evaluation of the assignment showed that students enjoyed solving such real-world problems, could find real vulnerabilities, and that it helped raise their skills and cybersecurity awareness. Participation in real bug bounty programmes also positively affects the security level of the tested products. We also discuss the potential risks of this approach and how to mitigate them.

[182]  arXiv:2404.12045 [pdf, other]
Title: RAM: Towards an Ever-Improving Memory System by Learning from Communications
Subjects: Artificial Intelligence (cs.AI)

We introduce RAM, an innovative RAG-based framework with an ever-improving memory. Inspired by humans' pedagogical process, RAM utilizes recursively reasoning-based retrieval and experience reflections to continually update the memory and learn from users' communicative feedback, namely communicative learning. Extensive experiments with both simulated and real users demonstrate significant improvements over traditional RAG and self-knowledge methods, particularly excelling in handling false premise and multi-hop questions. Furthermore, RAM exhibits promising adaptability to various feedback and retrieval method chain types, showcasing its potential for advancing AI capabilities in dynamic knowledge acquisition and lifelong learning.

[183]  arXiv:2404.12047 [pdf, ps, other]
Title: Self-Adjusting Evolutionary Algorithms Are Slow on Multimodal Landscapes
Subjects: Neural and Evolutionary Computing (cs.NE)

The one-fifth rule and its generalizations are a classical parameter control mechanism in discrete domains. They have also been transferred to control the offspring population size of the $(1, \lambda)$-EA. This has been shown to work very well for hill-climbing, and combined with a restart mechanism it was recently shown by Hevia Fajardo and Sudholt to improve performance on the multi-modal problem Cliff drastically.
In this work we show that the positive results do not extend to other types of local optima. On the distorted OneMax benchmark, the self-adjusting $(1, \lambda)$-EA is slowed down just as elitist algorithms because self-adaptation prevents the algorithm from escaping from local optima. This makes the self-adaptive algorithm considerably worse than good static parameter choices, which do allow to escape from local optima efficiently. We show this theoretically and complement the result with empirical runtime results.

[184]  arXiv:2404.12048 [pdf, other]
Title: Symbolic Computation for All the Fun
Subjects: Logic in Computer Science (cs.LO)

Motivated by the recent 10 million dollar AIMO challenge, this paper targets the problem of finding all functions conforming to a given specification. This is a popular problem at mathematical competitions and it brings about a number of challenges, primarily, synthesizing the possible solutions and proving that no other solutions exist. Often, there are infinitely many solutions and then the set of solutions has to be captured symbolically. We propose an approach to solving this problem and evaluate it on a set of problems that appeared in mathematical competitions and olympics.

[185]  arXiv:2404.12050 [pdf, other]
Title: emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information
Comments: The dataset is available in this https URL
Subjects: Computation and Language (cs.CL)

Machine Reading Comprehension (MRC) holds a pivotal role in shaping Medical Question Answering Systems (QAS) and transforming the landscape of accessing and applying medical information. However, the inherent challenges in the medical field, such as complex terminology and question ambiguity, necessitate innovative solutions. One key solution involves integrating specialized medical datasets and creating dedicated datasets. This strategic approach enhances the accuracy of QAS, contributing to advancements in clinical decision-making and medical research. To address the intricacies of medical terminology, a specialized dataset was integrated, exemplified by a novel Span extraction dataset derived from emrQA but restructured into 163,695 questions and 4,136 manually obtained answers, this new dataset was called emrQA-msquad dataset. Additionally, for ambiguous questions, a dedicated medical dataset for the Span extraction task was introduced, reinforcing the system's robustness. The fine-tuning of models such as BERT, RoBERTa, and Tiny RoBERTa for medical contexts significantly improved response accuracy within the F1-score range of 0.75 to 1.00 from 10.1% to 37.4%, 18.7% to 44.7% and 16.0% to 46.8%, respectively. Finally, emrQA-msquad dataset is publicy available at https://huggingface.co/datasets/Eladio/emrqa-msquad.

[186]  arXiv:2404.12055 [pdf, other]
Title: Improving the perception of visual fiducial markers in the field using Adaptive Active Exposure Control
Comments: Paper accepted by ISER 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Accurate localization is fundamental for autonomous underwater vehicles (AUVs) to carry out precise tasks, such as manipulation and construction. Vision-based solutions using fiducial marker are promising, but extremely challenging underwater because of harsh lighting condition underwater. This paper introduces a gradient-based active camera exposure control method to tackle sharp lighting variations during image acquisition, which can establish better foundation for subsequent image enhancement procedures. Considering a typical scenario for underwater operations where visual tags are used, we proposed several experiments comparing our method with other state-of-the-art exposure control method including Active Exposure Control (AEC) and Gradient-based Exposure Control (GEC). Results show a significant improvement in the accuracy of robot localization. This method is an important component that can be used in visual-based state estimation pipeline to improve the overall localization accuracy.

[187]  arXiv:2404.12056 [pdf, other]
Title: Deconstructing Human-AI Collaboration: Agency, Interaction, and Adaptation
Comments: 10 pages, 4 figures. Accepted to Proceedings of EuroVis 2024
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

As full AI-based automation remains out of reach in most real-world applications, the focus has instead shifted to leveraging the strengths of both human and AI agents, creating effective collaborative systems. The rapid advances in this area have yielded increasingly more complex systems and frameworks, while the nuance of their characterization has gotten more vague. Similarly, the existing conceptual models no longer capture the elaborate processes of these systems nor describe the entire scope of their collaboration paradigms. In this paper, we propose a new unified set of dimensions through which to analyze and describe human-AI systems. Our conceptual model is centered around three high-level aspects - agency, interaction, and adaptation - and is developed through a multi-step process. Firstly, an initial design space is proposed by surveying the literature and consolidating existing definitions and conceptual frameworks. Secondly, this model is iteratively refined and validated by conducting semi-structured interviews with nine researchers in this field. Lastly, to illustrate the applicability of our design space, we utilize it to provide a structured description of selected human-AI systems.

[188]  arXiv:2404.12059 [pdf, other]
Title: Constituents Correspond to Word Sequence Patterns among Sentences with Equivalent Predicate-Argument Structures: Unsupervised Constituency Parsing by Span Matching
Subjects: Computation and Language (cs.CL)

Unsupervised constituency parsing is about identifying word sequences that form a syntactic unit (i.e., constituents) in a target sentence. Linguists identify the constituent by evaluating a set of Predicate-Argument Structure (PAS) equivalent sentences where we find the constituent corresponds to frequent word sequences. However, such information is unavailable to previous parsing methods which identify the constituent by observing sentences with diverse PAS. In this study, we empirically verify that \textbf{constituents correspond to word sequence patterns in the PAS-equivalent sentence set}. We propose a frequency-based method \emph{span-overlap}, applying the word sequence pattern to computational unsupervised parsing for the first time. Parsing experiments show that the span-overlap parser outperforms state-of-the-art parsers in eight out of ten languages. Further discrimination analysis confirms that the span-overlap method can non-trivially separate constituents from non-constituents. This result highlights the utility of the word sequence pattern. Additionally, we discover a multilingual phenomenon: \textbf{participant-denoting constituents are more frequent than event-denoting constituents}. The phenomenon indicates a behavioral difference between the two constituent types, laying the foundation for future labeled unsupervised parsing.

[189]  arXiv:2404.12062 [pdf, other]
Title: MIDGET: Music Conditioned 3D Dance Generation
Comments: 12 pages, 6 figures Published in AI 2023: Advances in Artificial Intelligence
Journal-ref: In Australasian Joint Conference on Artificial Intelligence (pp. 277-288). Singapore: Springer Nature Singapore 2023
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Audio and Speech Processing (eess.AS)

In this paper, we introduce a MusIc conditioned 3D Dance GEneraTion model, named MIDGET based on Dance motion Vector Quantised Variational AutoEncoder (VQ-VAE) model and Motion Generative Pre-Training (GPT) model to generate vibrant and highquality dances that match the music rhythm. To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook based on the Motion VQ-VAE model to store different human pose codes, 2) employing Motion GPT model to generate pose codes with music and motion Encoders, 3) a simple framework for music feature extraction. We compare with existing state-of-the-art models and perform ablation experiments on AIST++, the largest publicly available music-dance dataset. Experiments demonstrate that our proposed framework achieves state-of-the-art performance on motion quality and its alignment with the music.

[190]  arXiv:2404.12063 [pdf, other]
Title: FastVPINNs: Tensor-Driven Acceleration of VPINNs for Complex Geometries
Comments: 31 pages, 19 figures, 4 algorithms
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)

Variational Physics-Informed Neural Networks (VPINNs) utilize a variational loss function to solve partial differential equations, mirroring Finite Element Analysis techniques. Traditional hp-VPINNs, while effective for high-frequency problems, are computationally intensive and scale poorly with increasing element counts, limiting their use in complex geometries. This work introduces FastVPINNs, a tensor-based advancement that significantly reduces computational overhead and improves scalability. Using optimized tensor operations, FastVPINNs achieve a 100-fold reduction in the median training time per epoch compared to traditional hp-VPINNs. With proper choice of hyperparameters, FastVPINNs surpass conventional PINNs in both speed and accuracy, especially in problems with high-frequency solutions. Demonstrated effectiveness in solving inverse problems on complex domains underscores FastVPINNs' potential for widespread application in scientific and engineering challenges, opening new avenues for practical implementations in scientific machine learning.

[191]  arXiv:2404.12064 [pdf, other]
Title: PureForest: A Large-scale Aerial Lidar and Aerial Imagery Dataset for Tree Species Classification in Monospecific Forests
Comments: 14 pages | 5 figures | Dataset is available at this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Knowledge of tree species distribution is fundamental to managing forests. New deep learning approaches promise significant accuracy gains for forest mapping, and are becoming a critical tool for mapping multiple tree species at scale. To advance the field, deep learning researchers need large benchmark datasets with high-quality annotations. To this end, we present the PureForest dataset: a large-scale, open, multimodal dataset designed for tree species classification from both Aerial Lidar Scanning (ALS) point clouds and Very High Resolution (VHR) aerial images. Most current public Lidar datasets for tree species classification have low diversity as they only span a small area of a few dozen annotated hectares at most. In contrast, PureForest has 18 tree species grouped into 13 semantic classes, and spans 339 km$^2$ across 449 distinct monospecific forests, and is to date the largest and most comprehensive Lidar dataset for the identification of tree species. By making PureForest publicly available, we hope to provide a challenging benchmark dataset to support the development of deep learning approaches for tree species identification from Lidar and/or aerial imagery. In this data paper, we describe the annotation workflow, the dataset, the recommended evaluation methodology, and establish a baseline performance from both 3D and 2D modalities.

[192]  arXiv:2404.12065 [pdf, other]
Title: RAGAR, Your Falsehood RADAR: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
Comments: 8 pages, submitted to ACL Rolling Review
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)

The escalating challenge of misinformation, particularly in the context of political discourse, necessitates advanced solutions for fact-checking. We introduce innovative approaches to enhance the reliability and efficiency of multimodal fact-checking through the integration of Large Language Models (LLMs) with Retrieval-augmented Generation (RAG)- based advanced reasoning techniques. This work proposes two novel methodologies, Chain of RAG (CoRAG) and Tree of RAG (ToRAG). The approaches are designed to handle multimodal claims by reasoning the next questions that need to be answered based on previous evidence. Our approaches improve the accuracy of veracity predictions and the generation of explanations over the traditional fact-checking approach of sub-question generation with chain of thought veracity prediction. By employing multimodal LLMs adept at analyzing both text and images, this research advances the capability of automated systems in identifying and countering misinformation.

[193]  arXiv:2404.12069 [pdf, other]
Title: Developing Application Profiles for Enhancing Data and Workflows in Cultural Heritage Digitisation Processes
Subjects: Digital Libraries (cs.DL)

As a result of the proliferation of 3D digitisation in the context of cultural heritage projects, digital assets and digitisation processes - being considered as proper research objects - must prioritise adherence to FAIR principles. Existing standards and ontologies, such as CIDOC CRM, play a crucial role in this regard, but they are often over-engineered for the need of a particular application context, thus making their understanding and adoption difficult. Application profiles of a given standard - defined as sets of ontological entities drawn from one or more semantic artefacts for a particular context or application - are usually proposed as tools for promoting interoperability and reuse while being tied entirely to the particular application context they refer to. In this paper, we present an adaptation and application of an ontology development methodology, i.e. SAMOD, to guide the creation of robust, semantically sound application profiles of large standard models. Using an existing pilot study we have developed in a project dedicated to leveraging virtual technologies to preserve and valorise cultural heritage, we introduce an application profile named CHAD-AP, that we have developed following our customised version of SAMOD. We reflect on the use of SAMOD and similar ontology development methodologies for this purpose, highlighting its strengths and current limitations, future developments, and possible adoption in other similar projects.

[194]  arXiv:2404.12071 [pdf, ps, other]
Title: Complexity-Aware Theoretical Performance Analysis of SDM MIMO Equalizers
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

We propose a theoretical framework to compute, rapidly and accurately, the signal-to-noise ratio at the output of spatial-division multiplexing (SDM) linear MIMO equalizers with arbitrary numbers of spatial modes and filter taps and demonstrate three orders of magnitude of speed-up compared to Monte Carlo simulations.

[195]  arXiv:2404.12074 [pdf, other]
Title: A Flexible Architecture for Web-based GIS Applications using Docker and Graph Databases
Comments: 11 pages, 4 figures
Subjects: Human-Computer Interaction (cs.HC)

Regional planning processes and associated redevelopment projects can be complex due to the vast amount of diverse data involved. However, all of this data shares a common geographical reference, especially in the renaturation of former open-cast mining areas. To ensure safety, it is crucial to maintain a comprehensive overview of the interrelated data and draw accurate conclusions. This requires special tools and can be a very time-consuming process. A geographical information system (GIS) is well-suited for this purpose, but even a GIS has limitations when dealing with multiple data types and sources. Additional tools are often necessary to process and view all the data, which can complicate the planning process. Our paper describes a system architecture that addresses the aforementioned issues and provides a simple, yet flexible tool for these activities. The architecture is based on microservices using Docker and is divided into a backend and a frontend. The backend simplifies and generalizes the integration of different data types, while a graph database is used to link relevant data and reveal potential new relationships between them. Finally, a modern web frontend displays the data and relationships.

[196]  arXiv:2404.12075 [pdf, other]
Title: E-Vote Your Conscience: Perceptions of Coercion and Vote Buying, and the Usability of Fake Credentials in Online Voting
Comments: 23 pages, 2024 IEEE Symposium on Security and Privacy
Subjects: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)

Online voting is attractive for convenience and accessibility, but is more susceptible to voter coercion and vote buying than in-person voting. One mitigation is to give voters fake voting credentials that they can yield to a coercer. Fake credentials appear identical to real ones, but cast votes that are silently omitted from the final tally. An important unanswered question is how ordinary voters perceive such a mitigation: whether they could understand and use fake credentials, and whether the coercion risks justify the costs of mitigation. We present the first systematic study of these questions, involving 150 diverse individuals in Boston, Massachusetts. All participants "registered" and "voted" in a mock election: 120 were exposed to coercion resistance via fake credentials, the rest forming a control group. Of the 120 participants exposed to fake credentials, 96% understood their use. 53% reported that they would create fake credentials in a real-world voting scenario, given the opportunity. 10% mistakenly voted with a fake credential, however. 22% reported either personal experience with or direct knowledge of coercion or vote-buying incidents. These latter participants rated the coercion-resistant system essentially as trustworthy as in-person voting via hand-marked paper ballots. Of the 150 total participants to use the system, 87% successfully created their credentials without assistance; 83% both successfully created and properly used their credentials. Participants give a System Usability Scale score of 70.4, which is slightly above the industry's average score of 68. Our findings appear to support the importance of the coercion problem in general, and the promise of fake credentials as a possible mitigation, but user error rates remain an important usability challenge for future work.

[197]  arXiv:2404.12076 [pdf, other]
Title: Evolutionary Multi-Objective Optimisation for Fairness-Aware Self Adjusting Memory Classifiers in Data Streams
Comments: This paper has been accepted by GECCO 2024
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

This paper introduces a novel approach, evolutionary multi-objective optimisation for fairness-aware self-adjusting memory classifiers, designed to enhance fairness in machine learning algorithms applied to data stream classification. With the growing concern over discrimination in algorithmic decision-making, particularly in dynamic data stream environments, there is a need for methods that ensure fair treatment of individuals across sensitive attributes like race or gender. The proposed approach addresses this challenge by integrating the strengths of the self-adjusting memory K-Nearest-Neighbour algorithm with evolutionary multi-objective optimisation. This combination allows the new approach to efficiently manage concept drift in streaming data and leverage the flexibility of evolutionary multi-objective optimisation to maximise accuracy and minimise discrimination simultaneously. We demonstrate the effectiveness of the proposed approach through extensive experiments on various datasets, comparing its performance against several baseline methods in terms of accuracy and fairness metrics. Our results show that the proposed approach maintains competitive accuracy and significantly reduces discrimination, highlighting its potential as a robust solution for fairness-aware data stream classification. Further analyses also confirm the effectiveness of the strategies to trigger evolutionary multi-objective optimisation and adapt classifiers in the proposed approach.

[198]  arXiv:2404.12077 [pdf, other]
Title: TIMIT Speaker Profiling: A Comparison of Multi-task learning and Single-task learning Approaches
Authors: Rong Wang, Kun Sun
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

This study employs deep learning techniques to explore four speaker profiling tasks on the TIMIT dataset, namely gender classification, accent classification, age estimation, and speaker identification, highlighting the potential and challenges of multi-task learning versus single-task models. The motivation for this research is twofold: firstly, to empirically assess the advantages and drawbacks of multi-task learning over single-task models in the context of speaker profiling; secondly, to emphasize the undiminished significance of skillful feature engineering for speaker recognition tasks. The findings reveal challenges in accent classification, and multi-task learning is found advantageous for tasks of similar complexity. Non-sequential features are favored for speaker recognition, but sequential ones can serve as starting points for complex models. The study underscores the necessity of meticulous experimentation and parameter tuning for deep learning models.

[199]  arXiv:2404.12079 [pdf, other]
Title: Trajectory Planning for Autonomous Vehicle Using Iterative Reward Prediction in Reinforcement Learning
Authors: Hyunwoo Park
Comments: 8 pages, 6 figures
Subjects: Robotics (cs.RO)

Traditional trajectory planning methods for autonomous vehicles have several limitations. Heuristic and explicit simple rules make trajectory lack generality and complex motion. One of the approaches to resolve the above limitations of traditional trajectory planning methods is trajectory planning using reinforcement learning. However, reinforcement learning suffers from instability of learning and prior works of trajectory planning using reinforcement learning didn't consider the uncertainties. In this paper, we propose a trajectory planning method for autonomous vehicles using reinforcement learning. The proposed method includes iterative reward prediction method that stabilizes the learning process, and uncertainty propagation method that makes the reinforcement learning agent to be aware of the uncertainties. The proposed method is experimented in the CARLA simulator. Compared to the baseline method, we have reduced the collision rate by 60.17%, and increased the average reward to 30.82 times.

[200]  arXiv:2404.12080 [pdf, other]
Title: A Mathematical Formalisation of the γ-contraction Problem
Authors: Elia Onofri
Subjects: Data Structures and Algorithms (cs.DS); Combinatorics (math.CO)

Networks play an ubiquitous role in computer science and real-world applications, offering multiple kind of information that can be retrieved with adequate methods. With the continuous growing in the amount of data available, networks are becoming larger day by day. Consequently, the tasks that were easily achievable on smaller networks, often becomes impractical on huge amount of data, either due to the high computational cost or due to the impracticality to visualise corresponding data. Using distinctive node features to group large amount of connected data into a limited number of clusters, hence represented by a representative per cluster, proves to be a valuable approach. The resulting contracted graphs are more manageable in size and can reveal previously hidden characteristics of the original networks. Furthermore, in many real-world use cases, a definition of cluster is intrinsic with the data, eventually obtained with the injection of some expert knowledge represent by a categorical function. Clusters then results in set of connected vertices taking the same values in a finite set C. In the recent literature, Lombardi and Onofri proposed a novel, fast, and easily parallelisable approach under the name of $\gamma$-contraction to contract a graph given a categorical function. In this work, we formally define such approach by providing a rigorous mathematical definition of the problem, which, to the best of our knowledge, was missing in the existing literature. Specifically, we explore the variadic nature of the contraction operation and use it to introduce the weaker version of the colour contraction, under the name of $\beta$-contraction, that the algorithmic solution exploits. We finally dive into the details of the algorithm and we provide a full assesment on its convergence complexity relying on two constructive proofs that deeply unveil its mode of operation.

[201]  arXiv:2404.12081 [pdf, other]
Title: MaskCD: A Remote Sensing Change Detection Network Based on Mask Classification
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Change detection (CD) from remote sensing (RS) images using deep learning has been widely investigated in the literature. It is typically regarded as a pixel-wise labeling task that aims to classify each pixel as changed or unchanged. Although per-pixel classification networks in encoder-decoder structures have shown dominance, they still suffer from imprecise boundaries and incomplete object delineation at various scenes. For high-resolution RS images, partly or totally changed objects are more worthy of attention rather than a single pixel. Therefore, we revisit the CD task from the mask prediction and classification perspective and propose MaskCD to detect changed areas by adaptively generating categorized masks from input image pairs. Specifically, it utilizes a cross-level change representation perceiver (CLCRP) to learn multiscale change-aware representations and capture spatiotemporal relations from encoded features by exploiting deformable multihead self-attention (DeformMHSA). Subsequently, a masked-attention-based detection transformers (MA-DETR) decoder is developed to accurately locate and identify changed objects based on masked attention and self-attention mechanisms. It reconstructs the desired changed objects by decoding the pixel-wise representations into learnable mask proposals and making final predictions from these candidates. Experimental results on five benchmark datasets demonstrate the proposed approach outperforms other state-of-the-art models. Codes and pretrained models are available online (https://github.com/EricYu97/MaskCD).

[202]  arXiv:2404.12083 [pdf, other]
Title: MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking
Comments: Accepted by CVPR 2024 Workshop (AIS: Vision, Graphics and AI for Streaming), top solution of challenge Event-based Eye Tracking, see this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Event-based eye tracking has shown great promise with the high temporal resolution and low redundancy provided by the event camera. However, the diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization. To achieve a stable event-based eye-tracking system, this paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information in response to the variability of eye movements. Specifically, the MambaPupil network is proposed, which consists of the multi-layer convolutional encoder to extract features from the event representations, a bidirectional Gated Recurrent Unit (GRU), and a Linear Time-Varying State Space Module (LTV-SSM), to selectively capture contextual correlation from the forward and backward temporal relationship. Furthermore, the Bina-rep is utilized as a compact event representation, and the tailor-made data augmentation, called as Event-Cutout, is proposed to enhance the model's robustness by applying spatial random masking to the event image. The evaluation on the ThreeET-plus benchmark shows the superior performance of the MambaPupil, which secured the 1st place in CVPR'2024 AIS Event-based Eye Tracking challenge.

[203]  arXiv:2404.12086 [pdf, other]
Title: Preserving Nature's Ledger: Blockchains in Biodiversity Conservation
Subjects: Computers and Society (cs.CY); Computational Engineering, Finance, and Science (cs.CE); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)

In the contemporary era, biodiversity conservation emerges as a paramount challenge, necessitating innovative approaches to monitoring, preserving, and enhancing the natural world. This paper explores the integration of blockchain technology in biodiversity conservation, offering a novel perspective on how digital resilience can be built within ecological contexts. Blockchain, with its decentralized and immutable ledger and tokenization affordances, presents a groundbreaking solution for the accurate monitoring and tracking of environmental assets, thereby addressing the critical need for transparency and trust in conservation efforts. Unlike previous more theoretical approaches, by addressing the research question of how blockchain supports digital resilience in biodiversity conservation, this study presents a grounded framework that justifies which blockchain features are essential to decipher specific data contribution and data leveraging processes in an effort to protect our planet's biodiversity, while boosting potential economic benefits for all actors involved, from local farmers, to hardware vendors and artificial intelligence experts, to investors and regular users, volunteers and donors.

[204]  arXiv:2404.12087 [pdf, other]
Title: Optimizing the diffusion of overdamped Langevin dynamics
Comments: 76 pages, 11 figures
Subjects: Numerical Analysis (math.NA)

This is a preleminary work. Overdamped Langevin dynamics are reversible stochastic differential equations which are commonly used to sample probability measures in high dimensional spaces, such as the ones appearing in computational statistical physics and Bayesian inference. By varying the diffusion coefficient, there are in fact infinitely many reversible overdamped Langevin dynamics which preserve the target probability measure at hand. This suggests to optimize the diffusion coefficient in order to increase the convergence rate of the dynamics, as measured by the spectral gap of the generator associated with the stochastic differential equation. We analytically study this problem here, obtaining in particular necessary conditions on the optimal diffusion coefficient. We also derive an explicit expression of the optimal diffusion in some homogenized limit. Numerical results, both relying on discretizations of the spectral gap problem and Monte Carlo simulations of the stochastic dynamics, demonstrate the increased quality of the sampling arising from an appropriate choice of the diffusion coefficient.

[205]  arXiv:2404.12090 [pdf, other]
Title: X-Light: Cross-City Traffic Signal Control Using Transformer on Transformer as Meta Multi-Agent Reinforcement Learner
Comments: Accepted by IJCAI 2024
Subjects: Artificial Intelligence (cs.AI)

The effectiveness of traffic light control has been significantly improved by current reinforcement learning-based approaches via better cooperation among multiple traffic lights. However, a persisting issue remains: how to obtain a multi-agent traffic signal control algorithm with remarkable transferability across diverse cities? In this paper, we propose a Transformer on Transformer (TonT) model for cross-city meta multi-agent traffic signal control, named as X-Light: We input the full Markov Decision Process trajectories, and the Lower Transformer aggregates the states, actions, rewards among the target intersection and its neighbors within a city, and the Upper Transformer learns the general decision trajectories across different cities. This dual-level approach bolsters the model's robust generalization and transferability. Notably, when directly transferring to unseen scenarios, ours surpasses all baseline methods with +7.91% on average, and even +16.3% in some cases, yielding the best results.

[206]  arXiv:2404.12091 [pdf, other]
Title: Harnessing Joint Rain-/Detail-aware Representations to Eliminate Intricate Rains
Comments: 21 pages, 14 figures
Journal-ref: International Conference on Learning Representations 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent advances in image deraining have focused on training powerful models on mixed multiple datasets comprising diverse rain types and backgrounds. However, this approach tends to overlook the inherent differences among rainy images, leading to suboptimal results. To overcome this limitation, we focus on addressing various rainy images by delving into meaningful representations that encapsulate both the rain and background components. Leveraging these representations as instructive guidance, we put forth a Context-based Instance-level Modulation (CoI-M) mechanism adept at efficiently modulating CNN- or Transformer-based models. Furthermore, we devise a rain-/detail-aware contrastive learning strategy to help extract joint rain-/detail-aware representations. By integrating CoI-M with the rain-/detail-aware Contrastive learning, we develop CoIC, an innovative and potent algorithm tailored for training models on mixed datasets. Moreover, CoIC offers insight into modeling relationships of datasets, quantitatively assessing the impact of rain and details on restoration, and unveiling distinct behaviors of models given diverse inputs. Extensive experiments validate the efficacy of CoIC in boosting the deraining ability of CNN and Transformer models. CoIC also enhances the deraining prowess remarkably when real-world dataset is included.

[207]  arXiv:2404.12093 [pdf, ps, other]
Title: Evaluating the Security of Merkle Trees in the Internet of Things: An Analysis of Data Falsification Probabilities
Comments: 7 pages
Subjects: Cryptography and Security (cs.CR)

Addressing the critical challenge of ensuring data integrity in decentralized systems, this paper delves into the underexplored area of data falsification probabilities within Merkle Trees, which are pivotal in blockchain and Internet of Things (IoT) technologies. Despite their widespread use, a comprehensive understanding of the probabilistic aspects of data security in these structures remains a gap in current research. Our study aims to bridge this gap by developing a theoretical framework to calculate the probability of data falsification, taking into account various scenarios based on the length of the Merkle path and hash length. The research progresses from the derivation of an exact formula for falsification probability to an approximation suitable for cases with significantly large hash lengths. Empirical experiments validate the theoretical models, exploring simulations with diverse hash lengths and Merkle path lengths. The findings reveal a decrease in falsification probability with increasing hash length and an inverse relationship with longer Merkle paths. A numerical analysis quantifies the discrepancy between exact and approximate probabilities, underscoring the conditions for the effective application of the approximation. This work offers crucial insights into optimizing Merkle Tree structures for bolstering security in blockchain and IoT systems, achieving a balance between computational efficiency and data integrity.

[208]  arXiv:2404.12096 [pdf, other]
Title: LongEmbed: Extending Embedding Models for Long Context Retrieval
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Embedding models play a pivot role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, refrained from application scenarios requiring long inputs such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models by several folds, regardless of their original context being 512 or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods, such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.

[209]  arXiv:2404.12097 [pdf, other]
Title: MPC of Uncertain Nonlinear Systems with Meta-Learning for Fast Adaptation of Neural Predictive Models
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

In this paper, we consider the problem of reference tracking in uncertain nonlinear systems. A neural State-Space Model (NSSM) is used to approximate the nonlinear system, where a deep encoder network learns the nonlinearity from data, and a state-space component captures the temporal relationship. This transforms the nonlinear system into a linear system in a latent space, enabling the application of model predictive control (MPC) to determine effective control actions. Our objective is to design the optimal controller using limited data from the \textit{target system} (the system of interest). To this end, we employ an implicit model-agnostic meta-learning (iMAML) framework that leverages information from \textit{source systems} (systems that share similarities with the target system) to expedite training in the target system and enhance its control performance. The framework consists of two phases: the (offine) meta-training phase learns a aggregated NSSM using data from source systems, and the (online) meta-inference phase quickly adapts this aggregated model to the target system using only a few data points and few online training iterations, based on local loss function gradients. The iMAML algorithm exploits the implicit function theorem to exactly compute the gradient during training, without relying on the entire optimization path. By focusing solely on the optimal solution, rather than the path, we can meta-train with less storage complexity and fewer approximations than other contemporary meta-learning algorithms. We demonstrate through numerical examples that our proposed method can yield accurate predictive models by adaptation, resulting in a downstream MPC that outperforms several baselines.

[210]  arXiv:2404.12103 [pdf, other]
Title: S3R-Net: A Single-Stage Approach to Self-Supervised Shadow Removal
Comments: NTIRE workshop @ CVPR 2024. Code & models available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

In this paper we present S3R-Net, the Self-Supervised Shadow Removal Network. The two-branch WGAN model achieves self-supervision relying on the unify-and-adaptphenomenon - it unifies the style of the output data and infers its characteristics from a database of unaligned shadow-free reference images. This approach stands in contrast to the large body of supervised frameworks. S3R-Net also differentiates itself from the few existing self-supervised models operating in a cycle-consistent manner, as it is a non-cyclic, unidirectional solution. The proposed framework achieves comparable numerical scores to recent selfsupervised shadow removal models while exhibiting superior qualitative performance and keeping the computational cost low.

[211]  arXiv:2404.12104 [pdf, other]
Title: Ethical-Lens: Curbing Malicious Usages of Open-Source Text-to-Image Models
Comments: 42 pages, 17 figures, 29 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

The burgeoning landscape of text-to-image models, exemplified by innovations such as Midjourney and DALLE 3, has revolutionized content creation across diverse sectors. However, these advancements bring forth critical ethical concerns, particularly with the misuse of open-source models to generate content that violates societal norms. Addressing this, we introduce Ethical-Lens, a framework designed to facilitate the value-aligned usage of text-to-image tools without necessitating internal model revision. Ethical-Lens ensures value alignment in text-to-image models across toxicity and bias dimensions by refining user commands and rectifying model outputs. Systematic evaluation metrics, combining GPT4-V, HEIM, and FairFace scores, assess alignment capability. Our experiments reveal that Ethical-Lens enhances alignment capabilities to levels comparable with or superior to commercial models like DALLE 3, ensuring user-generated content adheres to ethical standards while maintaining image quality. This study indicates the potential of Ethical-Lens to ensure the sustainable development of open-source text-to-image tools and their beneficial integration into society. Our code is available at https://github.com/yuzhu-cai/Ethical-Lens.

[212]  arXiv:2404.12107 [pdf, other]
Title: Effective Individual Fairest Community Search over Heterogeneous Information Networks
Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Social and Information Networks (cs.SI)

Community search over heterogeneous information networks has been applied to wide domains, such as activity organization and team formation. From these scenarios, the members of a group with the same treatment often have different levels of activity and workloads, which causes unfairness in the treatment between active members and inactive members (called individual unfairness). However, existing works do not pay attention to individual fairness and do not sufficiently consider the rich semantics of HINs (e.g., high-order structure), which disables complex queries. To fill the gap, we formally define the issue of individual fairest community search over HINs (denoted as IFCS), which aims to find a set of vertices from the HIN that own the same type, close relationships, and small difference of activity level and has been demonstrated to be NP-hard. To do this, we first develop an exploration-based filter that reduces the search space of the community effectively. Further, to avoid repeating computation and prune unfair communities in advance, we propose a message-based scheme and a lower bound-based scheme. At last, we conduct extensive experiments on four real-world datasets to demonstrate the effectiveness and efficiency of our proposed algorithms, which achieve at least X3 times faster than the baseline solution.

[213]  arXiv:2404.12115 [pdf, other]
Title: Caging in Motion: Characterizing Robustness in Manipulation through Energy Margin and Dynamic Caging Analysis
Comments: 8 pages
Subjects: Robotics (cs.RO)

To develop robust manipulation policies, quantifying robustness is essential. Evaluating robustness in general dexterous manipulation, nonetheless, poses significant challenges due to complex hybrid dynamics, combinatorial explosion of possible contact interactions, global geometry, etc. This paper introduces ``caging in motion'', an approach for analyzing manipulation robustness through energy margins and caging-based analysis. Our method assesses manipulation robustness by measuring the energy margin to failure and extends traditional caging concepts for a global analysis of dynamic manipulation. This global analysis is facilitated by a kinodynamic planning framework that naturally integrates global geometry, contact changes, and robot compliance. We validate the effectiveness of our approach in the simulation and real-world experiments of multiple dynamic manipulation scenarios, highlighting its potential to predict manipulation success and robustness.

[214]  arXiv:2404.12120 [pdf, other]
Title: Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

This paper presents RADAR-Robust Adversarial Detection via Adversarial Retraining-an approach designed to enhance the robustness of adversarial detectors against adaptive attacks, while maintaining classifier performance. An adaptive attack is one where the attacker is aware of the defenses and adapts their strategy accordingly. Our proposed method leverages adversarial training to reinforce the ability to detect attacks, without compromising clean accuracy. During the training phase, we integrate into the dataset adversarial examples, which were optimized to fool both the classifier and the adversarial detector, enabling the adversarial detector to learn and adapt to potential attack scenarios. Experimental evaluations on the CIFAR-10 and SVHN datasets demonstrate that our proposed algorithm significantly improves a detector's ability to accurately identify adaptive adversarial attacks -- without sacrificing clean accuracy.

[215]  arXiv:2404.12121 [pdf, ps, other]
Title: A Simplified Analysis of the Ascending Auction to Sell a Matroid Base
Subjects: Computer Science and Game Theory (cs.GT)

We give a simpler analysis of the ascending auction of Bikhchandani, de Vries, Schummer, and Vohra to sell a welfare-maximizing base of a matroid at Vickrey prices. The new proofs for economic efficiency and the charge of Vickrey prices only require a few matroid folklore theorems, therefore shortening the analysis of the design goals of the auction significantly.

[216]  arXiv:2404.12125 [pdf, other]
Title: Intelligence Education made in Europe
Comments: 16 pages, 2 figures. No potential conflict of interest was reported by the authors
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Global conflicts and trouble spots have thrown the world into turmoil. Intelligence services have never been as necessary as they are today when it comes to providing political decision-makers with concrete, accurate, and up-to-date decision-making knowledge. This requires a common co-operation, a common working language and a common understanding of each other. The best way to create this "intelligence community" is through a harmonized intelligence education.
In this paper, we show how joint intelligence education can succeed. We draw on the experience of Germany, where all intelligence services and the Bundeswehr are academically educated together in a single degree program that lays the foundations for a common working language. We also show how these experiences have been successfully transferred to a European level, namely to ICE, the Intelligence College in Europe. Our experience has shown that three aspects are particularly important: firstly, interdisciplinarity or better, transdisciplinarity, secondly, the integration of IT knowhow and thirdly, the development and learning of methodological skills. Using the example of the cyber intelligence module with a special focus on data-driven decision support, additionally with its many points of reference to numerous other academic modules, we show how the specific analytic methodology presented is embedded in our specific European teaching context.

[217]  arXiv:2404.12127 [pdf, other]
Title: Personalized Forgetting Mechanism with Concept-Driven Knowledge Tracing
Subjects: Artificial Intelligence (cs.AI)

Knowledge Tracing (KT) aims to trace changes in students' knowledge states throughout their entire learning process by analyzing their historical learning data and predicting their future learning performance. Existing forgetting curve theory based knowledge tracing models only consider the general forgetting caused by time intervals, ignoring the individualization of students and the causal relationship of the forgetting process. To address these problems, we propose a Concept-driven Personalized Forgetting knowledge tracing model (CPF) which integrates hierarchical relationships between knowledge concepts and incorporates students' personalized cognitive abilities. First, we integrate the students' personalized capabilities into both the learning and forgetting processes to explicitly distinguish students' individual learning gains and forgetting rates according to their cognitive abilities. Second, we take into account the hierarchical relationships between knowledge points and design a precursor-successor knowledge concept matrix to simulate the causal relationship in the forgetting process, while also integrating the potential impact of forgetting prior knowledge points on subsequent ones. The proposed personalized forgetting mechanism can not only be applied to the learning of specifc knowledge concepts but also the life-long learning process. Extensive experimental results on three public datasets show that our CPF outperforms current forgetting curve theory based methods in predicting student performance, demonstrating CPF can better simulate changes in students' knowledge status through the personalized forgetting mechanism.

[218]  arXiv:2404.12128 [pdf, other]
Title: Optimizing Intensive Database Tasks Through Caching Proxy Mechanisms
Comments: 8 pages, submitted at conference
Subjects: Databases (cs.DB)

Web caching is essential for the World Wide Web, saving processing power, bandwidth, and reducing latency. Many proxy caching solutions focus on buffering data from the main server, neglecting cacheable information meant for server writes. Existing systems addressing this issue are often intrusive, requiring modifications to the main application for integration. We identify opportunities for enhancement in conventional caching proxies. This paper explores, designs, and implements a potential prototype for such an application. Our focus is on harnessing a faster bulk-data-write approach compared to single-data-write within the context of relational databases. If a (upload) request matches a specified cacheable URL, then the data will be extracted and buffered on the local disk for later bulk-write. In contrast with already existing caching proxies, Squid, for example, in a similar uploading scenario, the request would simply get redirected, leaving out potential gains such as minimized processing power, lower server load, and bandwidth. After prototyping and testing the suggested application against Squid, concerning data uploads with 1, 100, 1.000, ..., and 100.000 requests, we consistently observed query execution improvements ranging from 5 to 9 times. This enhancement was achieved through buffering and bulk-writing the data, the extent of which depended on the specific test conditions.

[219]  arXiv:2404.12130 [pdf, other]
Title: One-Shot Sequential Federated Learning for Non-IID Data by Enhancing Local Model Diversity
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Traditional federated learning mainly focuses on parallel settings (PFL), which can suffer significant communication and computation costs. In contrast, one-shot and sequential federated learning (SFL) have emerged as innovative paradigms to alleviate these costs. However, the issue of non-IID (Independent and Identically Distributed) data persists as a significant challenge in one-shot and SFL settings, exacerbated by the restricted communication between clients. In this paper, we improve the one-shot sequential federated learning for non-IID data by proposing a local model diversity-enhancing strategy. Specifically, to leverage the potential of local model diversity for improving model performance, we introduce a local model pool for each client that comprises diverse models generated during local training, and propose two distance measurements to further enhance the model diversity and mitigate the effect of non-IID data. Consequently, our proposed framework can improve the global model performance while maintaining low communication costs. Extensive experiments demonstrate that our method exhibits superior performance to existing one-shot PFL methods and achieves better accuracy compared with state-of-the-art one-shot SFL methods on both label-skew and domain-shift tasks (e.g., 6%+ accuracy improvement on the CIFAR-10 dataset).

[220]  arXiv:2404.12132 [pdf, other]
Title: Enhancing Suicide Risk Assessment: A Speech-Based Automated Approach in Emergency Medicine
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

The delayed access to specialized psychiatric assessments and care for patients at risk of suicidal tendencies in emergency departments creates a notable gap in timely intervention, hindering the provision of adequate mental health support during critical situations. To address this, we present a non-invasive, speech-based approach for automatic suicide risk assessment. For our study, we have collected a novel dataset of speech recordings from $20$ patients from which we extract three sets of features, including wav2vec, interpretable speech and acoustic features, and deep learning-based spectral representations. We proceed by conducting a binary classification to assess suicide risk in a leave-one-subject-out fashion. Our most effective speech model achieves a balanced accuracy of $66.2\,\%$. Moreover, we show that integrating our speech model with a series of patients' metadata, such as the history of suicide attempts or access to firearms, improves the overall result. The metadata integration yields a balanced accuracy of $94.4\,\%$, marking an absolute improvement of $28.2\,\%$, demonstrating the efficacy of our proposed approaches for automatic suicide risk assessment in emergency medicine.

[221]  arXiv:2404.12133 [pdf, ps, other]
Title: On Target Detection in the Presence of Clutter in Joint Communication and Sensing Cellular Networks
Journal-ref: 2023 16th International Conference on Signal Processing and Communication System (ICSPCS)
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

Recent works on joint communication and sensing (JCAS) cellular networks have proposed to use time division mode (TDM) and concurrent mode (CM), as alternative methods for sharing the resources between communication and sensing signals. While the performance of these JCAS schemes for object tracking and parameter estimation has been studied in previous works, their performance on target detection in the presence of clutter has not been analyzed. In this paper, we propose a detection scheme for estimating the number of targets in JCAS cellular networks that employ TDM or CM resource sharing. The proposed detection method allows for the presence of clutter and/or temporally correlated noise. This scheme is studied with respect to the JCAS trade-off parameters that allow to control the time slots in TDM and the power resources in CM allocated to sensing and communications. The performance of two fundamental transmit beamforming schemes, typical for JCAS, is compared in terms of the receiver operating characteristics curves. Our results indicate that in general the TDM scheme gives a somewhat better detection performance compared to the CM scheme, although both schemes outperform existing approaches provided that their respective trade-off parameters are tuned properly.

[222]  arXiv:2404.12134 [pdf, other]
Title: Warped Time Series Anomaly Detection
Subjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

This paper addresses the problem of detecting time series outliers, focusing on systems with repetitive behavior, such as industrial robots operating on production lines.Notable challenges arise from the fact that a task performed multiple times may exhibit different duration in each repetition and that the time series reported by the sensors are irregularly sampled because of data gaps. The anomaly detection approach presented in this paper consists of three stages.The first stage identifies the repetitive cycles in the lengthy time series and segments them into individual time series corresponding to one task cycle, while accounting for possible temporal distortions.The second stage computes a prototype for the cycles using a GPU-based barycenter algorithm, specifically tailored for very large time series.The third stage uses the prototype to detect abnormal cycles by computing an anomaly score for each cycle.The overall approach, named WarpEd Time Series ANomaly Detection (WETSAND), makes use of the Dynamic Time Warping algorithm and its variants because they are suited to the distorted nature of the time series.The experiments show that \wetsand scales to large signals, computes human-friendly prototypes, works with very little data, and outperforms some general purpose anomaly detection approaches such as autoencoders.

[223]  arXiv:2404.12135 [pdf, other]
Title: mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture
Subjects: Multiagent Systems (cs.MA); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)

The escalating complexity of micro-services architecture in cloud-native technologies poses significant challenges for maintaining system stability and efficiency. To conduct root cause analysis (RCA) and resolution of alert events, we propose a pioneering framework, multi-Agent Blockchain-inspired Collaboration for root cause analysis in micro-services architecture (mABC), to revolutionize the AI for IT operations (AIOps) domain, where multiple agents based on the powerful large language models (LLMs) perform blockchain-inspired voting to reach a final agreement following a standardized process for processing tasks and queries provided by Agent Workflow. Specifically, seven specialized agents derived from Agent Workflow each provide valuable insights towards root cause analysis based on their expertise and the intrinsic software knowledge of LLMs collaborating within a decentralized chain. To avoid potential instability issues in LLMs and fully leverage the transparent and egalitarian advantages inherent in a decentralized structure, mABC adopts a decision-making process inspired by blockchain governance principles while considering the contribution index and expertise index of each agent. Experimental results on the public benchmark AIOps challenge dataset and our created train-ticket dataset demonstrate superior performance in accurately identifying root causes and formulating effective solutions, compared to previous strong baselines. The ablation study further highlights the significance of each component within mABC, with Agent Workflow, multi-agent, and blockchain-inspired voting being crucial for achieving optimal performance. mABC offers a comprehensive automated root cause analysis and resolution in micro-services architecture and achieves a significant improvement in the AIOps domain compared to existing baselines

[224]  arXiv:2404.12138 [pdf, other]
Title: Character is Destiny: Can Large Language Models Simulate Persona-Driven Decisions in Role-Playing?
Subjects: Artificial Intelligence (cs.AI)

Can Large Language Models substitute humans in making important decisions? Recent research has unveiled the potential of LLMs to role-play assigned personas, mimicking their knowledge and linguistic habits. However, imitative decision-making requires a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investigate whether LLMs can predict characters' decisions provided with the preceding stories in high-quality novels. Leveraging character analyses written by literary experts, we construct a dataset LIFECHOICE comprising 1,401 character decision points from 395 books. Then, we conduct comprehensive experiments on LIFECHOICE, with various LLMs and methods for LLM role-playing. The results demonstrate that state-of-the-art LLMs exhibit promising capabilities in this task, yet there is substantial room for improvement. Hence, we further propose the CHARMAP method, which achieves a 6.01% increase in accuracy via persona-based memory retrieval. We will make our datasets and code publicly available.

[225]  arXiv:2404.12139 [pdf, other]
Title: Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
Comments: 20 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable success in computer vision and particularly demonstrated superior robustness to distribution shifts of 2D images. However, their robustness under 3D viewpoint variations is still limited, which can hinder the development for real-world applications. This paper successfully addresses this concern while keeping VLPs' original performance by breaking through two primary obstacles: 1) the scarcity of training data and 2) the suboptimal fine-tuning paradigms. To combat data scarcity, we build the Multi-View Caption (MVCap) dataset -- a comprehensive collection of over four million multi-view image-text pairs across more than 100K objects, providing more potential for VLP models to develop generalizable viewpoint-invariant representations. To address the limitations of existing paradigms in performance trade-offs and training efficiency, we design a novel fine-tuning framework named Omniview-Tuning (OVT). Specifically, OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy, which effectively aligns representations of identical objects from diverse viewpoints without causing overfitting. Additionally, OVT fine-tunes VLP models in a parameter-efficient manner, leading to minimal computational cost. Extensive experiments on various VLP models with different architectures validate that OVT significantly improves the models' resilience to viewpoint shifts and keeps the original performance, establishing a pioneering standard for boosting the viewpoint invariance of VLP models.

[226]  arXiv:2404.12142 [pdf, other]
Title: SDIP: Self-Reinforcement Deep Image Prior Framework for Image Processing
Authors: Ziyu Shu, Zhixin Pan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Deep image prior (DIP) proposed in recent research has revealed the inherent trait of convolutional neural networks (CNN) for capturing substantial low-level image statistics priors. This framework efficiently addresses the inverse problems in image processing and has induced extensive applications in various domains. However, as the whole algorithm is initialized randomly, the DIP algorithm often lacks stability. Thus, this method still has space for further improvement. In this paper, we propose the self-reinforcement deep image prior (SDIP) as an improved version of the original DIP. We observed that the changes in the DIP networks' input and output are highly correlated during each iteration. SDIP efficiently utilizes this trait in a reinforcement learning manner, where the current iteration's output is utilized by a steering algorithm to update the network input for the next iteration, guiding the algorithm toward improved results. Experimental results across multiple applications demonstrate that our proposed SDIP framework offers improvement compared to the original DIP method and other state-of-the-art methods.

[227]  arXiv:2404.12143 [pdf, other]
Title: The Neutrality Fallacy: When Algorithmic Fairness Interventions are (Not) Positive Action
Journal-ref: 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24)
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Various metrics and interventions have been developed to identify and mitigate unfair outputs of machine learning systems. While individuals and organizations have an obligation to avoid discrimination, the use of fairness-aware machine learning interventions has also been described as amounting to 'algorithmic positive action' under European Union (EU) non-discrimination law. As the Court of Justice of the European Union has been strict when it comes to assessing the lawfulness of positive action, this would impose a significant legal burden on those wishing to implement fair-ml interventions. In this paper, we propose that algorithmic fairness interventions often should be interpreted as a means to prevent discrimination, rather than a measure of positive action. Specifically, we suggest that this category mistake can often be attributed to neutrality fallacies: faulty assumptions regarding the neutrality of fairness-aware algorithmic decision-making. Our findings raise the question of whether a negative obligation to refrain from discrimination is sufficient in the context of algorithmic decision-making. Consequently, we suggest moving away from a duty to 'not do harm' towards a positive obligation to actively 'do no harm' as a more adequate framework for algorithmic decision-making and fair ml-interventions.

[228]  arXiv:2404.12144 [pdf, other]
Title: Mushroom Segmentation and 3D Pose Estimation from Point Clouds using Fully Convolutional Geometric Features and Implicit Pose Encoding
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Modern agricultural applications rely more and more on deep learning solutions. However, training well-performing deep networks requires a large amount of annotated data that may not be available and in the case of 3D annotation may not even be feasible for human annotators. In this work, we develop a deep learning approach to segment mushrooms and estimate their pose on 3D data, in the form of point clouds acquired by depth sensors. To circumvent the annotation problem, we create a synthetic dataset of mushroom scenes, where we are fully aware of 3D information, such as the pose of each mushroom. The proposed network has a fully convolutional backbone, that parses sparse 3D data, and predicts pose information that implicitly defines both instance segmentation and pose estimation task. We have validated the effectiveness of the proposed implicit-based approach for a synthetic test set, as well as provided qualitative results for a small set of real acquired point clouds with depth sensors. Code is publicly available at https://github.com/georgeretsi/mushroom-pose.

[229]  arXiv:2404.12145 [pdf, other]
Title: From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.

[230]  arXiv:2404.12147 [pdf, other]
Title: Faster Optimization Through Genetic Drift
Subjects: Neural and Evolutionary Computing (cs.NE); Probability (math.PR)

The compact Genetic Algorithm (cGA), parameterized by its hypothetical population size $K$, offers a low-memory alternative to evolving a large offspring population of solutions. It evolves a probability distribution, biasing it towards promising samples. For the classical benchmark OneMax, the cGA has to two different modes of operation: a conservative one with small step sizes $\Theta(1/(\sqrt{n}\log n))$, which is slow but prevents genetic drift, and an aggressive one with large step sizes $\Theta(1/\log n)$, in which genetic drift leads to wrong decisions, but those are corrected efficiently. On OneMax, an easy hill-climbing problem, both modes lead to optimization times of $\Theta(n\log n)$ and are thus equally efficient.
In this paper we study how both regimes change when we replace OneMax by the harder hill-climbing problem DynamicBinVal. It turns out that the aggressive mode is not affected and still yields quasi-linear runtime $O(n\cdot polylog (n))$. However, the conservative mode becomes substantially slower, yielding a runtime of $\Omega(n^2)$, since genetic drift can only be avoided with smaller step sizes of $O(1/n)$. We complement our theoretical results with simulations.

[231]  arXiv:2404.12149 [pdf, other]
Title: AccidentBlip2: Accident Detection With Multi-View MotionBlip2
Subjects: Artificial Intelligence (cs.AI)

Multimodal Large Language Models (MLLMs) have shown outstanding capabilities in many areas of multimodal reasoning. Therefore, we use the reasoning ability of Multimodal Large Language Models for environment description and scene understanding in complex transportation environments. In this paper, we propose AccidentBlip2, a multimodal large language model that can predict in real time whether an accident risk will occur. Our approach involves feature extraction based on the temporal scene of the six-view surround view graphs and temporal inference using the temporal blip framework through the vision transformer. We then input the generated temporal token into the MLLMs for inference to determine whether an accident will occur or not. Since AccidentBlip2 does not rely on any BEV images and LiDAR, the number of inference parameters and the inference cost of MLLMs can be significantly reduced, and it also does not incur a large training overhead during training. AccidentBlip2 outperforms existing solutions on the DeepAccident dataset and can also provide a reference solution for end-to-end automated driving accident prediction.

[232]  arXiv:2404.12150 [pdf, other]
Title: Aligning language models with human preferences
Authors: Tomasz Korbak
Comments: PhD thesis
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content, falsehoods or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching but distributional matching is strictly more general. In chapter 4, I show how to extend the distribution matching to conditional language models. Finally, in chapter 5 I explore a different root: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.

[233]  arXiv:2404.12152 [pdf, other]
Title: FecTek: Enhancing Term Weight in Lexicon-Based Retrieval with Feature Context and Term-level Knowledge
Subjects: Computation and Language (cs.CL)

Lexicon-based retrieval has gained siginificant popularity in text retrieval due to its efficient and robust performance. To further enhance performance of lexicon-based retrieval, researchers have been diligently incorporating state-of-the-art methodologies like Neural retrieval and text-level contrastive learning approaches. Nonetheless, despite the promising outcomes, current lexicon-based retrieval methods have received limited attention in exploring the potential benefits of feature context representations and term-level knowledge guidance. In this paper, we introduce an innovative method by introducing FEature Context and TErm-level Knowledge modules(FecTek). To effectively enrich the feature context representations of term weight, the Feature Context Module (FCM) is introduced, which leverages the power of BERT's representation to determine dynamic weights for each element in the embedding. Additionally, we develop a term-level knowledge guidance module (TKGM) for effectively utilizing term-level knowledge to intelligently guide the modeling process of term weight. Evaluation of the proposed method on MS Marco benchmark demonstrates its superiority over the previous state-of-the-art approaches.

[234]  arXiv:2404.12154 [pdf, other]
Title: StyleBooth: Image Style Editing with Multimodal Instruction
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Given an original image, image editing aims to generate an image that align with the provided instruction. The challenges are to accept multimodal inputs as instructions and a scarcity of high-quality training data, including crucial triplets of source/target image pairs and multimodal (text and image) instructions. In this paper, we focus on image style editing and present StyleBooth, a method that proposes a comprehensive framework for image editing and a feasible strategy for building a high-quality style editing dataset. We integrate encoded textual instruction and image exemplar as a unified condition for diffusion model, enabling the editing of original image following multimodal instructions. Furthermore, by iterative style-destyle tuning and editing and usability filtering, the StyleBooth dataset provides content-consistent stylized/plain image pairs in various categories of styles. To show the flexibility of StyleBooth, we conduct experiments on diverse tasks, such as text-based style editing, exemplar-based style editing and compositional style editing. The results demonstrate that the quality and variety of training data significantly enhance the ability to preserve content and improve the overall quality of generated images in editing tasks. Project page can be found at https://ali-vilab.github.io/stylebooth-page/.

[235]  arXiv:2404.12155 [pdf, other]
Title: Low-rank alternating direction doubling algorithm for solving large-scale continuous time algebraic Riccati equations
Authors: Juan Zhang, Wenlu Xun
Comments: 14 pages, 2 figures
Subjects: Numerical Analysis (math.NA)

This paper proposes an effective low-rank alternating direction doubling algorithm (R-ADDA) for computing numerical low-rank solutions to large-scale sparse continuous-time algebraic Riccati matrix equations. The method is based on the alternating direction doubling algorithm (ADDA), utilizing the low-rank property of matrices and employing Cholesky factorization for solving. The advantage of the new algorithm lies in computing only the $2^k$-th approximation during the iterative process, instead of every approximation. Its efficient low-rank formula saves storage space and is highly effective from a computational perspective. Finally, the effectiveness of the new algorithm is demonstrated through theoretical analysis and numerical experiments.

[236]  arXiv:2404.12160 [pdf, other]
Title: A general alternating direction implicit iteration method for solving complex symmetric linear systems
Authors: Juan Zhang, Wenlu Xun
Comments: 24 pages, 7 figures
Subjects: Numerical Analysis (math.NA)

We have introduced the generalized alternating direction implicit iteration (GADI) method for solving large sparse complex symmetric linear systems and proved its convergence properties. Additionally, some numerical results have demonstrated the effectiveness of this algorithm. Furthermore, as an application of the GADI method in solving complex symmetric linear systems, we utilized the flattening operator and Kronecker product properties to solve Lyapunov and Riccati equations with complex coefficients using the GADI method. In solving the Riccati equation, we combined inner and outer iterations, first simplifying the Riccati equation into a Lyapunov equation using the Newton method, and then applying the GADI method for solution. Finally, we provided convergence analysis of the method and corresponding numerical results.

[237]  arXiv:2404.12165 [pdf, other]
Title: Stability Certificates for Receding Horizon Games
Subjects: Systems and Control (eess.SY)

Game-theoretic MPC (or Receding Horizon Games) is an emerging control methodology for multi-agent systems that generates control actions by solving a dynamic game with coupling constraints in a receding-horizon fashion. This control paradigm has recently received an increasing attention in various application fields, including robotics, autonomous driving, traffic networks, and energy grids, due to its ability to model the competitive nature of self-interested agents with shared resources while incorporating future predictions, dynamic models, and constraints into the decision-making process. In this work, we present the first formal stability analysis based on dissipativity and monotone operator theory that is valid also for non-potential games. Specifically, we derive LMI-based certificates that ensure asymptotic stability and are numerically verifiable. Moreover, we show that, if the agents have decoupled dynamics, the numerical verification can be performed in a scalable manner. Finally, we present tuning guidelines for the agents' cost function weights to fulfill the certificates and, thus, ensure stability.

[238]  arXiv:2404.12168 [pdf, other]
Title: Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization
Comments: CVPR2024 Camera-Ready
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

As recent advances in mobile camera technology have enabled the capability to capture high-resolution images, such as 4K images, the demand for an efficient deblurring model handling large motion has increased. In this paper, we discover that the image residual errors, i.e., blur-sharp pixel differences, can be grouped into some categories according to their motion blur type and how complex their neighboring pixels are. Inspired by this, we decompose the deblurring (regression) task into blur pixel discretization (pixel-level blur classification) and discrete-to-continuous conversion (regression with blur class map) tasks. Specifically, we generate the discretized image residual errors by identifying the blur pixels and then transform them to a continuous form, which is computationally more efficient than naively solving the original regression problem with continuous values. Here, we found that the discretization result, i.e., blur segmentation map, remarkably exhibits visual similarity with the image residual errors. As a result, our efficient model shows comparable performance to state-of-the-art methods in realistic benchmarks, while our method is up to 10 times computationally more efficient.

[239]  arXiv:2404.12169 [pdf, other]
Title: Shotit: compute-efficient image-to-video search engine for the cloud
Authors: Leslie Wong
Comments: Submitted to ACM ICMR 2024
Subjects: Multimedia (cs.MM); Information Retrieval (cs.IR)

With the rapid growth of information technology, users are exposed to a massive amount of data online, including image, music, and video. This has led to strong needs to provide effective corresponsive search services such as image, music, and video search services. Most of them are operated based on keywords, namely using keywords to find related image, music, and video. Additionally, there are image-to-image search services that enable users to find similar images using one input image. Given that videos are essentially composed of image frames, then similar videos can be searched by one input image or screenshot. We want to target this scenario and provide an efficient method and implementation in this paper.
We present Shotit, a cloud-native image-to-video search engine that tailors this search scenario in a compute-efficient approach. One main limitation faced in this scenario is the scale of its dataset. A typical image-to-image search engine only handles one-to-one relationships, colloquially, one image corresponds to another single image. But image-to-video proliferates. Take a 24-min length video as an example, it will generate roughly 20,000 image frames. As the number of videos grows, the scale of the dataset explodes exponentially. In this case, a compute-efficient approach ought to be considered, and the system design should cater to the cloud-native trend. Choosing an emerging technology - vector database as its backbone, Shotit fits these two metrics performantly. Experiments for two different datasets, a 50 thousand-scale Blender Open Movie dataset, and a 50 million-scale proprietary TV genre dataset at a 4 Core 32GB RAM Intel Xeon Gold 6271C cloud machine with object storage reveal the effectiveness of Shotit. A demo regarding the Blender Open Movie dataset is illustrated within this paper.

[240]  arXiv:2404.12171 [pdf, other]
Title: Stance Detection on Social Media with Fine-Tuned Large Language Models
Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Stance detection, a key task in natural language processing, determines an author's viewpoint based on textual analysis. This study evaluates the evolution of stance detection methods, transitioning from early machine learning approaches to the groundbreaking BERT model, and eventually to modern Large Language Models (LLMs) such as ChatGPT, LLaMa-2, and Mistral-7B. While ChatGPT's closed-source nature and associated costs present challenges, the open-source models like LLaMa-2 and Mistral-7B offers an encouraging alternative. Initially, our research focused on fine-tuning ChatGPT, LLaMa-2, and Mistral-7B using several publicly available datasets. Subsequently, to provide a comprehensive comparison, we assess the performance of these models in zero-shot and few-shot learning scenarios. The results underscore the exceptional ability of LLMs in accurately detecting stance, with all tested models surpassing existing benchmarks. Notably, LLaMa-2 and Mistral-7B demonstrate remarkable efficiency and potential for stance detection, despite their smaller sizes compared to ChatGPT. This study emphasizes the potential of LLMs in stance detection and calls for more extensive research in this field.

[241]  arXiv:2404.12172 [pdf, other]
Title: How to Benchmark Vision Foundation Models for Semantic Segmentation?
Comments: CVPR 2024 Workshop Proceedings for the Second Workshop on Foundation Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform the task of semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: https://tue-mps.github.io/benchmark-vfm-ss/.

[242]  arXiv:2404.12174 [pdf, other]
Title: Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?
Subjects: Computation and Language (cs.CL)

The increasing threat of disinformation calls for automating parts of the fact-checking pipeline. Identifying text segments requiring fact-checking is known as claim detection (CD) and claim check-worthiness detection (CW), the latter incorporating complex domain-specific criteria of worthiness and often framed as a ranking task. Zero- and few-shot LLM prompting is an attractive option for both tasks, as it bypasses the need for labeled datasets and allows verbalized claim and worthiness criteria to be directly used for prompting. We evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets from diverse domains, each utilizing a different worthiness criterion. We investigate two key aspects: (1) how best to distill factuality and worthiness criteria into a prompt and (2) what amount of context to provide for each claim. To this end, we experiment with varying the level of prompt verbosity and the amount of contextual information provided to the model. Our results show that optimal prompt verbosity is domain-dependent, adding context does not improve performance, and confidence scores can be directly used to produce reliable check-worthiness rankings.

[243]  arXiv:2404.12177 [pdf, ps, other]
Title: EuSQuAD: Automatically Translated and Aligned SQuAD2.0 for Basque
Comments: Under review in the journal of Procesamiento de Lenguaje Natural
Subjects: Computation and Language (cs.CL)

The widespread availability of Question Answering (QA) datasets in English has greatly facilitated the advancement of the Natural Language Processing (NLP) field. However, the scarcity of such resources for minority languages, such as Basque, poses a substantial challenge for these communities. In this context, the translation and alignment of existing QA datasets plays a crucial role in narrowing this technological gap. This work presents EuSQuAD, the first initiative dedicated to automatically translating and aligning SQuAD2.0 into Basque, resulting in more than 142k QA examples. We demonstrate EuSQuAD's value through extensive qualitative analysis and QA experiments supported with EuSQuAD as training data. These experiments are evaluated with a new human-annotated dataset.

[244]  arXiv:2404.12183 [pdf, other]
Title: Gait Recognition from Highly Compressed Videos
Comments: Accepted at 2nd Workshop on Learning with Few or without Annotated Face, Body and Gesture Data
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Surveillance footage represents a valuable resource and opportunities for conducting gait analysis. However, the typical low quality and high noise levels in such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for obtaining low quality videos that are annotated with poses in an automatic manner with the purpose of training the artifact correction model. We systematically evaluate the performance of our artifact correction model against a range of noisy surveillance data and demonstrate that our approach not only achieves improved pose estimation on low-quality surveillance footage, but also preserves the integrity of the pose estimation on high resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.

[245]  arXiv:2404.12185 [pdf, other]
Title: An Adaptive Metaheuristic Framework for Changing Environments
Authors: Bestoun S. Ahmed
Comments: Accepted in 2024 IEEE Congress on Evolutionary Computation (CEC)
Subjects: Artificial Intelligence (cs.AI)

The rapidly changing landscapes of modern optimization problems require algorithms that can be adapted in real-time. This paper introduces an Adaptive Metaheuristic Framework (AMF) designed for dynamic environments. It is capable of intelligently adapting to changes in the problem parameters. The AMF combines a dynamic representation of problems, a real-time sensing system, and adaptive techniques to navigate continuously changing optimization environments. Through a simulated dynamic optimization problem, the AMF's capability is demonstrated to detect environmental changes and proactively adjust its search strategy. This framework utilizes a differential evolution algorithm that is improved with an adaptation module that adjusts solutions in response to detected changes. The capability of the AMF to adjust is tested through a series of iterations, demonstrating its resilience and robustness in sustaining solution quality despite the problem's development. The effectiveness of AMF is demonstrated through a series of simulations on a dynamic optimization problem. Robustness and agility characterize the algorithm's performance, as evidenced by the presented fitness evolution and solution path visualizations. The findings show that AMF is a practical solution to dynamic optimization and a major step forward in the creation of algorithms that can handle the unpredictability of real-world problems.

[246]  arXiv:2404.12186 [pdf, other]
Title: Privacy-Preserving UCB Decision Process Verification via zk-SNARKs
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

With the increasingly widespread application of machine learning, how to strike a balance between protecting the privacy of data and algorithm parameters and ensuring the verifiability of machine learning has always been a challenge. This study explores the intersection of reinforcement learning and data privacy, specifically addressing the Multi-Armed Bandit (MAB) problem with the Upper Confidence Bound (UCB) algorithm. We introduce zkUCB, an innovative algorithm that employs the Zero-Knowledge Succinct Non-Interactive Argument of Knowledge (zk-SNARKs) to enhance UCB. zkUCB is carefully designed to safeguard the confidentiality of training data and algorithmic parameters, ensuring transparent UCB decision-making. Experiments highlight zkUCB's superior performance, attributing its enhanced reward to judicious quantization bit usage that reduces information entropy in the decision-making process. zkUCB's proof size and verification time scale linearly with the execution steps of zkUCB. This showcases zkUCB's adept balance between data security and operational efficiency. This approach contributes significantly to the ongoing discourse on reinforcing data privacy in complex decision-making processes, offering a promising solution for privacy-sensitive applications.

[247]  arXiv:2404.12187 [pdf, other]
Title: Stability-informed Bayesian Optimization for MPC Cost Function Learning
Comments: 7 pages, 3 figures, accepted for NMPC 2024
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

Designing predictive controllers towards optimal closed-loop performance while maintaining safety and stability is challenging. This work explores closed-loop learning for predictive control parameters under imperfect information while considering closed-loop stability. We employ constrained Bayesian optimization to learn a model predictive controller's (MPC) cost function parametrized as a feedforward neural network, optimizing closed-loop behavior as well as minimizing model-plant mismatch. Doing so offers a high degree of freedom and, thus, the opportunity for efficient and global optimization towards the desired and optimal closed-loop behavior. We extend this framework by stability constraints on the learned controller parameters, exploiting the optimal value function of the underlying MPC as a Lyapunov candidate. The effectiveness of the proposed approach is underlined in simulations, highlighting its performance and safety capabilities.

[248]  arXiv:2404.12190 [pdf, other]
Title: Estimating the Hessian Matrix of Ranking Objectives for Stochastic Learning to Rank with Gradient Boosted Trees
Comments: SIGIR2024 conference Short paper track
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Stochastic learning to rank (LTR) is a recent branch in the LTR field that concerns the optimization of probabilistic ranking models. Their probabilistic behavior enables certain ranking qualities that are impossible with deterministic models. For example, they can increase the diversity of displayed documents, increase fairness of exposure over documents, and better balance exploitation and exploration through randomization. A core difficulty in LTR is gradient estimation, for this reason, existing stochastic LTR methods have been limited to differentiable ranking models (e.g., neural networks). This is in stark contrast with the general field of LTR where Gradient Boosted Decision Trees (GBDTs) have long been considered the state-of-the-art.
In this work, we address this gap by introducing the first stochastic LTR method for GBDTs. Our main contribution is a novel estimator for the second-order derivatives, i.e., the Hessian matrix, which is a requirement for effective GBDTs. To efficiently compute both the first and second-order derivatives simultaneously, we incorporate our estimator into the existing PL-Rank framework, which was originally designed for first-order derivatives only. Our experimental results indicate that stochastic LTR without the Hessian has extremely poor performance, whilst the performance is competitive with the current state-of-the-art with our estimated Hessian. Thus, through the contribution of our novel Hessian estimation method, we have successfully introduced GBDTs to stochastic LTR.

[249]  arXiv:2404.12192 [pdf, other]
Title: Aligning Actions and Walking to LLM-Generated Textual Descriptions
Comments: Accepted at 2nd Workshop on Learning with Few or without Annotated Face, Body and Gesture Data
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, including data augmentation and synthetic data generation. This work explores the use of LLMs to generate rich textual descriptions for motion sequences, encompassing both actions and walking patterns. We leverage the expressive power of LLMs to align motion representations with high-level linguistic cues, addressing two distinct tasks: action recognition and retrieval of walking sequences based on appearance attributes. For action recognition, we employ LLMs to generate textual descriptions of actions in the BABEL-60 dataset, facilitating the alignment of motion sequences with linguistic representations. In the domain of gait analysis, we investigate the impact of appearance attributes on walking patterns by generating textual descriptions of motion sequences from the DenseGait dataset using LLMs. These descriptions capture subtle variations in walking styles influenced by factors such as clothing choices and footwear. Our approach demonstrates the potential of LLMs in augmenting structured motion attributes and aligning multi-modal representations. The findings contribute to the advancement of comprehensive motion understanding and open up new avenues for leveraging LLMs in multi-modal alignment and data augmentation for motion analysis. We make the code publicly available at https://github.com/Radu1999/WalkAndText

[250]  arXiv:2404.12195 [pdf, other]
Title: OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
Comments: 25 pages, 27 Figures, 8 Tables
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Instruction fine-tuning pretrained LLMs for diverse downstream tasks has demonstrated remarkable success and has captured the interest of both academics and practitioners. To ensure such fine-tuned LLMs align with human preferences, techniques such as RLHF and DPO have emerged. At the same time, there is increasing interest in smaller parameter counts for models. In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models. In this recipe: We first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model under three schemes based on: LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a seed dataset) and Orca (with the Flan Collection as a seed dataset), then filter these generations using GPT-4 as a human proxy. We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme. The resulting checkpoint is further fine-tuned with a subset of the HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to obtain the final checkpoint. Evaluation is done with the LM Eval Harness tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with Claude 2.1, with the finding that the final checkpoint, "OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at the 3B parameter scale, even outperforming the top model in one of the categories on the Huggingface Open LLM Leaderboard. We release "OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", "OpenBezoar-HH-RLHF-DPO" checkpoints, alongside our generated datasets on HuggingFace at https://huggingface.co/collections/SurgeGlobal/open-bezoar-6620a24923e12127e9e2b9cc and our codebase at https://bitbucket.org/paladinanalytics/workspace/projects/OP.

[251]  arXiv:2404.12203 [pdf, other]
Title: GraFIQs: Face Image Quality Assessment Using Gradient Magnitudes
Comments: Accepted at CVPR Workshop 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Face Image Quality Assessment (FIQA) estimates the utility of face images for automated face recognition (FR) systems. We propose in this work a novel approach to assess the quality of face images based on inspecting the required changes in the pre-trained FR model weights to minimize differences between testing samples and the distribution of the FR training dataset. To achieve that, we propose quantifying the discrepancy in Batch Normalization statistics (BNS), including mean and variance, between those recorded during FR training and those obtained by processing testing samples through the pretrained FR model. We then generate gradient magnitudes of pretrained FR weights by backpropagating the BNS through the pretrained model. The cumulative absolute sum of these gradient magnitudes serves as the FIQ for our approach. Through comprehensive experimentation, we demonstrate the effectiveness of our training-free and quality labeling-free approach, achieving competitive performance to recent state-of-theart FIQA approaches without relying on quality labeling, the need to train regression networks, specialized architectures, or designing and optimizing specific loss functions.

[252]  arXiv:2404.12208 [pdf, ps, other]
Title: The Explicit values of the UBCT, the LBCT and the DBCT of the inverse function
Comments: This manuscript was submitted to Finite Fields and Their Application on April 8, 2024. arXiv admin note: text overlap with arXiv:2309.01881
Subjects: Cryptography and Security (cs.CR); Information Theory (cs.IT)

Substitution boxes (S-boxes) play a significant role in ensuring the resistance of block ciphers against various attacks. The Upper Boomerang Connectivity Table (UBCT), the Lower Boomerang Connectivity Table (LBCT) and the Double Boomerang Connectivity Table (DBCT) of a given S-box are crucial tools to analyze its security concerning specific attacks. However, there are currently no related results for this research. The inverse function is crucial for constructing S-boxes of block ciphers with good cryptographic properties in symmetric cryptography. Therefore, extensive research has been conducted on the inverse function, exploring various properties related to standard attacks. Thanks to the recent advancements in boomerang cryptanalysis, particularly the introduction of concepts such as UBCT, LBCT, and DBCT, this paper aims to further investigate the properties of the inverse function $F(x)=x^{2^n-2}$ over $\gf_{2^n}$ for arbitrary $n$. As a consequence, by carrying out certain finer manipulations of solving specific equations over $\gf_{2^n}$, we give all entries of the UBCT, LBCT of $F(x)$ over $\gf_{2^n}$ for arbitrary $n$. Besides, based on the results of the UBCT and LBCT for the inverse function, we determine that $F(x)$ is hard when $n$ is odd. Furthermore, we completely compute all entries of the DBCT of $F(x)$ over $\gf_{2^n}$ for arbitrary $n$. Additionally, we provide the precise number of elements with a given entry by means of the values of some Kloosterman sums. Further, we determine the double boomerang uniformity of $F(x)$ over $\gf_{2^n}$ for arbitrary $n$. Our in-depth analysis of the DBCT of $F(x)$ contributes to a better evaluation of the S-box's resistance against boomerang attacks.

[253]  arXiv:2404.12209 [pdf, other]
Title: Partial-to-Partial Shape Matching with Geometric Consistency
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics and beyond. A prominent challenge are partial-to-partial shape matching settings, which occur when the shapes to match are only observed incompletely (e.g. from 3D scanning). Although partial-to-partial matching is a highly relevant setting in practice, it is rarely explored. Our work bridges the gap between existing (rather artificial) 3D full shape matching and partial-to-partial real-world settings by exploiting geometric consistency as a strong constraint. We demonstrate that it is indeed possible to solve this challenging problem in a variety of settings. For the first time, we achieve geometric consistency for partial-to-partial matching, which is realized by a novel integer non-linear program formalism building on triangle product spaces, along with a new pruning algorithm based on linear integer programming. Further, we generate a new inter-class dataset for partial-to-partial shape-matching. We show that our method outperforms current SOTA methods on both an established intra-class dataset and our novel inter-class dataset.

[254]  arXiv:2404.12210 [pdf, other]
Title: Observation, Analysis, and Solution: Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) in computer vision has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question if the extremely simple ViTs' fine-tuning performance with a small-scale architecture can also benefit from this pre-training paradigm, which is considerably less studied yet in contrast to the well-established lightweight architecture design methodology with sophisticated components introduced. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with the contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe different behaviors between MIM and CL with respect to the downstream fine-tuning data scales. Furthermore, we analyze the frozen features under linear probing evaluation and also the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training on higher layers, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding is naturally a guide to choosing appropriate distillation strategies during pre-training to solve the above deterioration problem. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and LaSOT visual tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all the current SOTA lightweight CPU-realtime trackers.

[255]  arXiv:2404.12215 [pdf, other]
Title: Quantifying Aleatoric and Epistemic Uncertainty with Proper Scoring Rules
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Uncertainty representation and quantification are paramount in machine learning and constitute an important prerequisite for safety-critical applications. In this paper, we propose novel measures for the quantification of aleatoric and epistemic uncertainty based on proper scoring rules, which are loss functions with the meaningful property that they incentivize the learner to predict ground-truth (conditional) probabilities. We assume two common representations of (epistemic) uncertainty, namely, in terms of a credal set, i.e. a set of probability distributions, or a second-order distribution, i.e., a distribution over probability distributions. Our framework establishes a natural bridge between these representations. We provide a formal justification of our approach and introduce new measures of epistemic and aleatoric uncertainty as concrete instantiations.

[256]  arXiv:2404.12216 [pdf, other]
Title: ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, the model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (\textit{ProTA}) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representation and maintain the feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn compact cross-modal distribution space. Based on extensive experiments, \textit{ProTA} achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).

[257]  arXiv:2404.12219 [pdf, other]
Title: A Quadrature Approach for General-Purpose Batch Bayesian Optimization via Probabilistic Lifting
Comments: 48 pages, 11 figures
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Parallelisation in Bayesian optimisation is a common strategy but faces several challenges: the need for flexibility in acquisition functions and kernel choices, flexibility dealing with discrete and continuous variables simultaneously, model misspecification, and lastly fast massive parallelisation. To address these challenges, we introduce a versatile and modular framework for batch Bayesian optimisation via probabilistic lifting with kernel quadrature, called SOBER, which we present as a Python library based on GPyTorch/BoTorch. Our framework offers the following unique benefits: (1) Versatility in downstream tasks under a unified approach. (2) A gradient-free sampler, which does not require the gradient of acquisition functions, offering domain-agnostic sampling (e.g., discrete and mixed variables, non-Euclidean space). (3) Flexibility in domain prior distribution. (4) Adaptive batch size (autonomous determination of the optimal batch size). (5) Robustness against a misspecified reproducing kernel Hilbert space. (6) Natural stopping criterion.

[258]  arXiv:2404.12220 [pdf, other]
Title: Hybrid Dynamics Modeling and Trajectory Planning for a Cable-Trailer System with a Quadruped Robot
Comments: 8 pages, 8 figures, submitted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Subjects: Robotics (cs.RO)

Inspired by the utilization of dogs in sled-pulling for transportation, we introduce a cable-trailer system with a quadruped robot. The motion planning of the proposed robot system presents challenges arising from the nonholonomic constraints of the trailer, system underactuation, and hybrid interaction through the cable. To tackle these challenges, we develop a hybrid dynamics model that accounts for the cable's taut/slack status. Since it is computationally intense to directly optimize the trajectory, we first propose a search algorithm to compute a sub-optimal trajectory as the initial solution. Then, a novel collision avoidance constraint based on the geometric shapes of objects is proposed to formulate the trajectory optimization problem for the hybrid system. The proposed trajectory planning method is implemented on a Unitree A1 quadruped robot with a customized cable-trailer and validated through experiments.

[259]  arXiv:2404.12224 [pdf, other]
Title: Length Generalization of Causal Transformers without Position Encoding
Subjects: Computation and Language (cs.CL)

Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms manipulating explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning for searching attention heads' best temperature hyper-parameters, which substantially expands NoPE's context size. Experiments on long sequence language modeling, the synthetic passkey retrieval task and real-world long context tasks show that NoPE can achieve competitive performances with state-of-the-art length generalization algorithms. The source code is publicly accessible

[260]  arXiv:2404.12226 [pdf, other]
Title: A cooperative strategy for diagnosing the root causes of quality requirement violations in multiagent systems
Subjects: Software Engineering (cs.SE)

Many modern software systems are built as a set of autonomous software components (also called agents) that collaborate with each other and are situated in an environment. To keep these multiagent systems operational under abnormal circumstances, it is crucial to make them resilient. Existing solutions are often centralised and rely on information manually provided by experts at design time, making such solutions rigid and limiting the autonomy and adaptability of the system. In this work, we propose a cooperative strategy focused on the identification of the root causes of quality requirement violations in multiagent systems. This strategy allows agents to cooperate with each other in order to identify whether these violations come from service providers, associated components, or the communication infrastructure. From this identification process, agents are able to adapt their behaviour in order to mitigate and solve existing abnormalities with the aim of normalising system operation. This strategy consists of an interaction protocol that, together with the proposed algorithms, allow agents playing the protocol roles to diagnose problems to be repaired. We evaluate our proposal with the implementation of a service-oriented system. The results demonstrate that our solution enables the correct identification of different sources of failures, favouring the selection of the most suitable actions to be taken to overcome abnormal situations.

[261]  arXiv:2404.12228 [pdf, other]
Title: Relationship Discovery for Drug Recommendation
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Medication recommendation systems are designed to deliver personalized drug suggestions that are closely aligned with individual patient needs. Previous studies have primarily concentrated on developing medication embeddings, achieving significant progress. Nonetheless, these approaches often fall short in accurately reflecting individual patient profiles, mainly due to challenges in distinguishing between various patient conditions and the inability to establish precise correlations between specific conditions and appropriate medications. In response to these issues, we introduce DisMed, a model that focuses on patient conditions to enhance personalization. DisMed employs causal inference to discern clear, quantifiable causal links. It then examines patient conditions in depth, recognizing and adapting to the evolving nuances of these conditions, and mapping them directly to corresponding medications. Additionally, DisMed leverages data from multiple patient visits to propose combinations of medications. Comprehensive testing on real-world datasets demonstrates that DisMed not only improves the customization of patient profiles but also surpasses leading models in both precision and safety.

[262]  arXiv:2404.12229 [pdf, other]
Title: A minimal base or a direct base? That is the question!
Subjects: Logic in Computer Science (cs.LO)

In this paper we revisit the problem of computing the closure of a set of attributes, given a set of Armstrong dependencies. This problem is of main interest in logics, in the relational database model, in lattice theory and in Formal Concept Analysis as well. We consider here three main closure algorithms, namely Closure, LinClosure and Wild's Closure, which are combined with implication bases which may have different characteristics, among which being "minimal", e.g., the Duquenne-Guigues basis, and being "direct", e.g., the Canonical Basis and the D-basis. The impacts of minimality and directness on the closure algorithms are then deeply studied also experimentally. The results are extensively analyzed and propose a different and fresh look at computing the closure of a set of attributes.
This paper has been submitted to the International Journal of Approximate Reasoning.

[263]  arXiv:2404.12230 [pdf, other]
Title: Estimates for the quantized tensor train ranks for the power functions
Comments: 6 pages, 1 figure, 20 references
Subjects: Numerical Analysis (math.NA)

In this work we provide theoretical estimates for the ranks of the power functions $f(k) = k^{-\alpha}$, $\alpha>1$ in the quantized tensor train (QTT) format for $k = 1, 2, 3, \ldots, 2^{d}$. Such functions and their several generalizations (e.~g. $f(k) = k^{-\alpha} \cdot e^{-\lambda k}, \lambda > 0$) play an important role in studies of the asymptotic solutions of the aggregation-fragmentation kinetic equations. In order to support the constructed theory we verify the values of QTT-ranks of these functions in practice with the use of the TTSVD procedure and show an agreement between the numerical and analytical results.

[264]  arXiv:2404.12235 [pdf, other]
Title: Beyond Average: Individualized Visual Scanpath Prediction
Comments: To appear in CVPR2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Understanding how attention varies across individuals has significant scientific and societal impacts. However, existing visual scanpath models treat attention uniformly, neglecting individual differences. To bridge this gap, this paper focuses on individualized scanpath prediction (ISP), a new attention modeling task that aims to accurately predict how different individuals shift their attention in diverse visual tasks. It proposes an ISP method featuring three novel technical components: (1) an observer encoder to characterize and integrate an observer's unique attention traits, (2) an observer-centric feature integration approach that holistically combines visual features, task guidance, and observer-specific characteristics, and (3) an adaptive fixation prioritization mechanism that refines scanpath predictions by dynamically prioritizing semantic feature maps based on individual observers' attention traits. These novel components allow scanpath models to effectively address the attention variations across different observers. Our method is generally applicable to different datasets, model architectures, and visual tasks, offering a comprehensive tool for transforming general scanpath models into individualized ones. Comprehensive evaluations using value-based and ranking-based metrics verify the method's effectiveness and generalizability.

[265]  arXiv:2404.12237 [pdf, ps, other]
Title: De-DSI: Decentralised Differentiable Search Index
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

This study introduces De-DSI, a novel framework that fuses large language models (LLMs) with genuine decentralization for information retrieval, particularly employing the differentiable search index (DSI) concept in a decentralized setting. Focused on efficiently connecting novel user queries with document identifiers without direct document access, De-DSI operates solely on query-docid pairs. To enhance scalability, an ensemble of DSI models is introduced, where the dataset is partitioned into smaller shards for individual model training. This approach not only maintains accuracy by reducing the number of data each model needs to handle but also facilitates scalability by aggregating outcomes from multiple models. This aggregation uses a beam search to identify top docids and applies a softmax function for score normalization, selecting documents with the highest scores for retrieval. The decentralized implementation demonstrates that retrieval success is comparable to centralized methods, with the added benefit of the possibility of distributing computational complexity across the network. This setup also allows for the retrieval of multimedia items through magnet links, eliminating the need for platforms or intermediaries.

[266]  arXiv:2404.12238 [pdf, other]
Title: Neural Networks with Causal Graph Constraints: A New Approach for Treatment Effects Estimation
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)

In recent years, there has been a growing interest in using machine learning techniques for the estimation of treatment effects. Most of the best-performing methods rely on representation learning strategies that encourage shared behavior among potential outcomes to increase the precision of treatment effect estimates. In this paper we discuss and classify these models in terms of their algorithmic inductive biases and present a new model, NN-CGC, that considers additional information from the causal graph. NN-CGC tackles bias resulting from spurious variable interactions by implementing novel constraints on models, and it can be integrated with other representation learning methods. We test the effectiveness of our method using three different base models on common benchmarks. Our results indicate that our model constraints lead to significant improvements, achieving new state-of-the-art results in treatment effects estimation. We also show that our method is robust to imperfect causal graphs and that using partial causal information is preferable to ignoring it.

[267]  arXiv:2404.12240 [pdf, other]
Title: A Time-Inhomogeneous Markov Model for Resource Availability under Sparse Observations
Comments: 11 pages, long version of a paper published at 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL 2018)
Journal-ref: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 460-463) 2018
Subjects: Artificial Intelligence (cs.AI)

Accurate spatio-temporal information about the current situation is crucial for smart city applications such as modern routing algorithms. Often, this information describes the state of stationary resources, e.g. the availability of parking bays, charging stations or the amount of people waiting for a vehicle to pick them up near a given location. To exploit this kind of information, predicting future states of the monitored resources is often mandatory because a resource might change its state within the time until it is needed. To train an accurate predictive model, it is often not possible to obtain a continuous time series on the state of the resource. For example, the information might be collected from traveling agents visiting the resource with an irregular frequency. Thus, it is necessary to develop methods which work on sparse observations for training and prediction. In this paper, we propose time-inhomogeneous discrete Markov models to allow accurate prediction even when the frequency of observation is very rare. Our new model is able to blend recent observations with historic data and also provide useful probabilistic estimates for future states. Since resources availability in a city is typically time-dependent, our Markov model is time-inhomogeneous and cyclic within a predefined time interval. To train our model, we propose a modified Baum-Welch algorithm. Evaluations on real-world datasets of parking bay availability show that our new method indeed yields good results compared to methods being trained on complete data and non-cyclic variants.

[268]  arXiv:2404.12241 [pdf, other]
Title: Introducing v0.5 of the AI Safety Benchmark from MLCommons
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.

[269]  arXiv:2404.12242 [pdf, other]
Title: CMNEE: A Large-Scale Document-Level Event Extraction Dataset based on Open-Source Chinese Military News
Comments: 13 pages, 7 figures, accepted to LREC-COLING 2024
Subjects: Computation and Language (cs.CL)

Extracting structured event knowledge, including event triggers and corresponding arguments, from military texts is fundamental to many applications, such as intelligence analysis and decision assistance. However, event extraction in the military field faces the data scarcity problem, which impedes the research of event extraction models in this domain. To alleviate this problem, we propose CMNEE, a large-scale, document-level open-source Chinese Military News Event Extraction dataset. It contains 17,000 documents and 29,223 events, which are all manually annotated based on a pre-defined schema for the military domain including 8 event types and 11 argument role types. We designed a two-stage, multi-turns annotation strategy to ensure the quality of CMNEE and reproduced several state-of-the-art event extraction models with a systematic evaluation. The experimental results on CMNEE fall shorter than those on other domain datasets obviously, which demonstrates that event extraction for military domain poses unique challenges and requires further research efforts. Our code and data can be obtained from https://github.com/Mzzzhu/CMNEE.

[270]  arXiv:2404.12244 [pdf, other]
Title: PyTOaCNN: Topology optimization using an adaptive convolutional neural network in Python
Comments: 24 pages
Subjects: Computational Engineering, Finance, and Science (cs.CE)

This paper introduces an adaptive convolutional neural network (CNN) architecture capable of automating various topology optimization (TO) problems with diverse underlying physics. The proposed architecture has an encoder-decoder-type structure with dense layers added at the bottleneck region to capture complex geometrical features. The network is trained using datasets obtained by the problem-specific open-source TO codes. Tensorflow and Keras are the main libraries employed to develop and to train the model. Effectiveness and robustness of the proposed adaptive CNN model are demonstrated through its performance in compliance minimization problems involving constant and design-dependent loads and in addressing bulk modulus optimization. Once trained, the model takes user's input of the volume fraction as an image and instantly generates an output image of optimized design. The proposed CNN produces high-quality results resembling those obtained via open-source TO codes with negligible performance and volume fraction errors. The paper includes complete associated Python code (Appendix A) for the proposed CNN architecture and explains each part of the code to facilitate reproducibility and ease of learning.

[271]  arXiv:2404.12246 [pdf, other]
Title: Blind Localization and Clustering of Anomalies in Textures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Anomaly detection and localization in images is a growing field in computer vision. In this area, a seemingly understudied problem is anomaly clustering, i.e., identifying and grouping different types of anomalies in a fully unsupervised manner. In this work, we propose a novel method for clustering anomalies in largely stationary images (textures) in a blind setting. That is, the input consists of normal and anomalous images without distinction and without labels. What contributes to the difficulty of the task is that anomalous regions are often small and may present only subtle changes in appearance, which can be easily overshadowed by the genuine variance in the texture. Moreover, each anomaly type may have a complex appearance distribution. We introduce a novel scheme for solving this task using a combination of blind anomaly localization and contrastive learning. By identifying the anomalous regions with high fidelity, we can restrict our focus to those regions of interest; then, contrastive learning is employed to increase the separability of different anomaly types and reduce the intra-class variation. Our experiments show that the proposed solution yields significantly better results compared to prior work, setting a new state of the art. Project page: https://reality.tf.fau.de/pub/ardelean2024blind.html.

[272]  arXiv:2404.12251 [pdf, other]
Title: Dynamic Modality and View Selection for Multimodal Emotion Recognition with Missing Modalities
Comments: 15 pages
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

The study of human emotions, traditionally a cornerstone in fields like psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). Multiple channels, such as speech (voice) and facial expressions (image), are crucial in understanding human emotions. However, AI's journey in multimodal emotion recognition (MER) is marked by substantial technical challenges. One significant hurdle is how AI models manage the absence of a particular modality - a frequent occurrence in real-world situations. This study's central focus is assessing the performance and resilience of two strategies when confronted with the lack of one modality: a novel multimodal dynamic modality and view selection and a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for MER. In the missing modalities scenarios, all dynamic selection-based methods outperformed the baseline. The study concludes by emphasizing the intricate interplay between audio and video modalities in emotion prediction, showcasing the adaptability of dynamic selection methods in handling missing modalities.

[273]  arXiv:2404.12252 [pdf, other]
Title: Deep Gaussian mixture model for unsupervised image segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The recent emergence of deep learning has led to a great deal of work on designing supervised deep semantic segmentation algorithms. As in many tasks sufficient pixel-level labels are very difficult to obtain, we propose a method which combines a Gaussian mixture model (GMM) with unsupervised deep learning techniques. In the standard GMM the pixel values with each sub-region are modelled by a Gaussian distribution. In order to identify the different regions, the parameter vector that minimizes the negative log-likelihood (NLL) function regarding the GMM has to be approximated. For this task, usually iterative optimization methods such as the expectation-maximization (EM) algorithm are used. In this paper, we propose to estimate these parameters directly from the image using a convolutional neural network (CNN). We thus change the iterative procedure in the EM algorithm replacing the expectation-step by a gradient-step with regard to the networks parameters. This means that the network is trained to minimize the NLL function of the GMM which comes with at least two advantages. As once trained, the network is able to predict label probabilities very quickly compared with time consuming iterative optimization methods. Secondly, due to the deep image prior our method is able to partially overcome one of the main disadvantages of GMM, which is not taking into account correlation between neighboring pixels, as it assumes independence between them. We demonstrate the advantages of our method in various experiments on the example of myocardial infarct segmentation on multi-sequence MRI images.

[274]  arXiv:2404.12253 [pdf, other]
Title: Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involves complex reasoning and planning. Recent work proposed advanced prompting techniques and the necessity of fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet, the efficacy of LLMs in self-refining its response, particularly in complex reasoning and planning task, remains dubious. In this paper, we introduce AlphaLLM for the self-improvements of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLM for self-improvement, including data scarcity, the vastness search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM is comprised of prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models for precise feedback. Our experimental results in mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.

[275]  arXiv:2404.12256 [pdf, other]
Title: An Online Spatial-Temporal Graph Trajectory Planner for Autonomous Vehicles
Comments: This is the accepted version and published in the "Early Access" area of IEEE Xplore for the IEEE Transactions on Intelligent Vehicles on 16 April 2024. Article statistics: 11 pages, 9 figures, 2 tables
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The autonomous driving industry is expected to grow by over 20 times in the coming decade and, thus, motivate researchers to delve into it. The primary focus of their research is to ensure safety, comfort, and efficiency. An autonomous vehicle has several modules responsible for one or more of the aforementioned items. Among these modules, the trajectory planner plays a pivotal role in the safety of the vehicle and the comfort of its passengers. The module is also responsible for respecting kinematic constraints and any applicable road constraints. In this paper, a novel online spatial-temporal graph trajectory planner is introduced to generate safe and comfortable trajectories. First, a spatial-temporal graph is constructed using the autonomous vehicle, its surrounding vehicles, and virtual nodes along the road with respect to the vehicle itself. Next, the graph is forwarded into a sequential network to obtain the desired states. To support the planner, a simple behavioral layer is also presented that determines kinematic constraints for the planner. Furthermore, a novel potential function is also proposed to train the network. Finally, the proposed planner is tested on three different complex driving tasks, and the performance is compared with two frequently used methods. The results show that the proposed planner generates safe and feasible trajectories while achieving similar or longer distances in the forward direction and comparable comfort ride.

[276]  arXiv:2404.12257 [pdf, other]
Title: Food Portion Estimation via 3D Object Scaling
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Image-based methods to analyze food images have alleviated the user burden and biases associated with traditional methods. However, accurate portion estimation remains a major challenge due to the loss of 3D information in the 2D representation of foods captured by smartphone cameras or wearable devices. In this paper, we propose a new framework to estimate both food volume and energy from 2D images by leveraging the power of 3D food models and physical reference in the eating scene. Our method estimates the pose of the camera and the food object in the input image and recreates the eating occasion by rendering an image of a 3D model of the food with the estimated poses. We also introduce a new dataset, SimpleFood45, which contains 2D images of 45 food items and associated annotations including food volume, weight, and energy. Our method achieves an average error of 31.10 kCal (17.67%) on this dataset, outperforming existing portion estimation methods.

[277]  arXiv:2404.12258 [pdf, ps, other]
Title: DeepLocalization: Using change point detection for Temporal Action Localization
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this study, we introduce DeepLocalization, an innovative framework devised for the real-time localization of actions tailored explicitly for monitoring driver behavior. Utilizing the power of advanced deep learning methodologies, our objective is to tackle the critical issue of distracted driving-a significant factor contributing to road accidents. Our strategy employs a dual approach: leveraging Graph-Based Change-Point Detection for pinpointing actions in time alongside a Video Large Language Model (Video-LLM) for precisely categorizing activities. Through careful prompt engineering, we customize the Video-LLM to adeptly handle driving activities' nuances, ensuring its classification efficacy even with sparse data. Engineered to be lightweight, our framework is optimized for consumer-grade GPUs, making it vastly applicable in practical scenarios. We subjected our method to rigorous testing on the SynDD2 dataset, a complex benchmark for distracted driving behaviors, where it demonstrated commendable performance-achieving 57.5% accuracy in event classification and 51% in event detection. These outcomes underscore the substantial promise of DeepLocalization in accurately identifying diverse driver behaviors and their temporal occurrences, all within the bounds of limited computational resources.

[278]  arXiv:2404.12259 [pdf, other]
Title: Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM
Comments: To appear at CHI 2024
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.

[279]  arXiv:2404.12260 [pdf, other]
Title: Alleviating Catastrophic Forgetting in Facial Expression Recognition with Emotion-Centered Models
Comments: 15 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Facial expression recognition is a pivotal component in machine learning, facilitating various applications. However, convolutional neural networks (CNNs) are often plagued by catastrophic forgetting, impeding their adaptability. The proposed method, emotion-centered generative replay (ECgr), tackles this challenge by integrating synthetic images from generative adversarial networks. Moreover, ECgr incorporates a quality assurance algorithm to ensure the fidelity of generated images. This dual approach enables CNNs to retain past knowledge while learning new tasks, enhancing their performance in emotion recognition. The experimental results on four diverse facial expression datasets demonstrate that incorporating images generated by our pseudo-rehearsal method enhances training on the targeted dataset and the source dataset while making the CNN retain previously learned knowledge.

[280]  arXiv:2404.12261 [pdf, ps, other]
Title: Design And Flight Testing Of LQRi Attitude Control For Quadcopter UAV
Comments: This research is still work under progress. The paper has been posted here for wider review by community
Subjects: Robotics (cs.RO)

This paper presents the design, implementation, and flight test results of linear quadratic integral regulator (LQRi) based attitude control for a quadcopter UAV. We present the derivation of the mathematical model for the kinematics and dynamics of the UAV, along with the linearized state space representation of the system about hover conditions. LQR and LQRi controllers are then designed to stabilize the UAV in hover conditions and to track desired attitude commands. The controllers are then implemented onboard the Pixhawk flight controller and flight test results are discussed. Finally, the code related to this paper has been published open-source for replication and further research

[281]  arXiv:2404.12262 [pdf, other]
Title: Tree-Based Nonlinear Reduced Modeling
Subjects: Numerical Analysis (math.NA)

This paper is concerned with model order reduction of parametric Partial Differential Equations (PDEs) using tree-based library approximations. Classical approaches are formulated for PDEs on Hilbert spaces and involve one single linear space to approximate the set of PDE solutions. Here, we develop reduced models relying on a collection of linear or nonlinear approximation spaces called a library, and which can also be formulated on general metric spaces. To build the spaces of the library, we rely on greedy algorithms involving different splitting strategies which lead to a hierarchical tree-based representation. We illustrate through numerical examples that the proposed strategies have a much wider range of applicability in terms of the parametric PDEs that can successfully be addressed. While the classical approach is very efficient for elliptic problems with strong coercivity, we show that the tree-based library approaches can deal with diffusion problems with weak coercivity, convection-diffusion problems, and with transport-dominated PDEs posed on general metric spaces such as the $L^2$-Wasserstein space.

[282]  arXiv:2404.12267 [pdf, other]
Title: Physics-integrated generative modeling using attentive planar normalizing flow based variational autoencoder
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Physics-integrated generative modeling is a class of hybrid or grey-box modeling in which we augment the the data-driven model with the physics knowledge governing the data distribution. The use of physics knowledge allows the generative model to produce output in a controlled way, so that the output, by construction, complies with the physical laws. It imparts improved generalization ability to extrapolate beyond the training distribution as well as improved interpretability because the model is partly grounded in firm domain knowledge. In this work, we aim to improve the fidelity of reconstruction and robustness to noise in the physics integrated generative model. To this end, we use variational-autoencoder as a generative model. To improve the reconstruction results of the decoder, we propose to learn the latent posterior distribution of both the physics as well as the trainable data-driven components using planar normalizng flow. Normalizng flow based posterior distribution harnesses the inherent dynamical structure of the data distribution, hence the learned model gets closer to the true underlying data distribution. To improve the robustness of generative model against noise injected in the model, we propose a modification in the encoder part of the normalizing flow based VAE. We designed the encoder to incorporate scaled dot product attention based contextual information in the noisy latent vector which will mitigate the adverse effect of noise in the latent vector and make the model more robust. We empirically evaluated our models on human locomotion dataset [33] and the results validate the efficacy of our proposed models in terms of improvement in reconstruction quality as well as robustness against noise injected in the model.

[283]  arXiv:2404.12268 [pdf, ps, other]
Title: How Population Diversity Influences the Efficiency of Crossover
Comments: pre-print, 30 pages
Subjects: Neural and Evolutionary Computing (cs.NE)

Our theoretical understanding of crossover is limited by our ability to analyze how population diversity evolves. In this study, we provide one of the first rigorous analyses of population diversity and optimization time in a setting where large diversity and large population sizes are required to speed up progress. We give a formal and general criterion which amount of diversity is necessary and sufficient to speed up the $(\mu+1)$ Genetic Algorithm on LeadingOnes. We show that the naturally evolving diversity falls short of giving a substantial speed-up for any $\mu=O(\sqrt{n}/\log^2 n)$. On the other hand, we show that even for $\mu=2$, if we simply break ties in favor of diversity then this increases diversity so much that optimization is accelerated by a constant factor.

[284]  arXiv:2404.12272 [pdf, other]
Title: Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
Comments: 16 pages, 4 figures, 2 tables
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to ``validate the validators'' -- aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub \emph{criteria drift}: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appears \emph{dependent} on the specific LLM outputs observed (rather than independent criteria that can be defined \emph{a priori}), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.

[285]  arXiv:2404.12273 [pdf, other]
Title: FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom
Comments: In Progress
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs). However, the integration of LLMs into FL introduces new challenges, particularly concerning the evaluation of LLMs. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers, thereby failing to accurately reflect the performance of LLMs on generative tasks. Meanwhile, although automatic evaluation methods that leverage advanced LLMs present potential, they face critical risks of data leakage due to the need to transmit data to external servers and suboptimal performance on downstream tasks due to the lack of domain knowledge. To address these issues, we propose a Federated Evaluation framework of Large Language Models, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without the reliance on labeled test sets and external tools, thus ensuring strong privacy-preserving capability. FedEval-LLM leverages a consortium of personalized LLMs from participants as referees to provide domain knowledge and collective evaluation capability, thus aligning to the respective downstream tasks and mitigating uncertainties and biases associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preference and RougeL-score on meticulously curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for the evaluation of LLMs within collaborative training scenarios.

[286]  arXiv:2404.12274 [pdf, other]
Title: Advancing the Robustness of Large Language Models through Self-Denoised Smoothing
Comments: Accepted by NAACL 2024. Jiabao, Bairu, Zhen, Guanhua contributed equally. This is an updated version of the paper: arXiv:2307.07171
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Although large language models (LLMs) have achieved significant success, their vulnerability to adversarial perturbations, including recent jailbreak attacks, has raised considerable concerns. However, the increasing size of these models and their limited access make improving their robustness a challenging task. Among various defense strategies, randomized smoothing has shown great potential for LLMs, as it does not require full access to the model's parameters or fine-tuning via adversarial training. However, randomized smoothing involves adding noise to the input before model prediction, and the final model's robustness largely depends on the model's performance on these noise corrupted data. Its effectiveness is often limited by the model's sub-optimal performance on noisy data. To address this issue, we propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions. We call this procedure self-denoised smoothing. Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility. Our experimental results indicate that our method surpasses existing methods in both empirical and certified robustness in defending against adversarial attacks for both downstream tasks and human alignments (i.e., jailbreak attacks). Our code is publicly available at https://github.com/UCSB-NLP-Chang/SelfDenoise

[287]  arXiv:2404.12278 [pdf, other]
Title: DF-DM: A foundational process model for multimodal data fusion in the artificial intelligence era
Comments: 6 figures, 5 tables
Subjects: Artificial Intelligence (cs.AI)

In the big data era, integrating diverse data modalities poses significant challenges, particularly in complex fields like healthcare. This paper introduces a new process model for multimodal Data Fusion for Data Mining, integrating embeddings and the Cross-Industry Standard Process for Data Mining with the existing Data Fusion Information Group model. Our model aims to decrease computational costs, complexity, and bias while improving efficiency and reliability. We also propose "disentangled dense fusion", a novel embedding fusion method designed to optimize mutual information and facilitate dense inter-modality feature interaction, thereby minimizing redundant information.
We demonstrate the model's efficacy through three use cases: predicting diabetic retinopathy using retinal images and patient metadata, domestic violence prediction employing satellite imagery, internet, and census data, and identifying clinical and demographic features from radiography images and clinical notes. The model achieved a Macro F1 score of 0.92 in diabetic retinopathy prediction, an R-squared of 0.854 and sMAPE of 24.868 in domestic violence prediction, and a macro AUC of 0.92 and 0.99 for disease prediction and sex classification, respectively, in radiological analysis.
These results underscore the Data Fusion for Data Mining model's potential to significantly impact multimodal data processing, promoting its adoption in diverse, resource-constrained settings.

[288]  arXiv:2404.12281 [pdf, other]
Title: RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective
Subjects: Robotics (cs.RO)

Precise robot manipulations require rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras, which are sensitive to camera view changes. Policies utilizing 3D point clouds usually predict keyframes rather than continuous actions, posing difficulty in dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning, which predicts continuous actions directly from single-view point clouds. It compresses the point cloud to tokens with a sparse 3D encoder. After adding sparse positional encoding, the tokens are featurized using a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations for each real-world task, RISE surpasses currently representative 2D and 3D policies by a large margin, showcasing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and robust to environmental change compared with previous baselines. Project website: rise-policy.github.io.

[289]  arXiv:2404.12282 [pdf, ps, other]
Title: Investigating Guiding Information for Adaptive Collocation Point Sampling in PINNs
Comments: 15 pages, 8 figures, 2 tables. Accepted for publication in the conference proceedings of the International Conference on Computational Science (ICCS) 2024
Subjects: Machine Learning (cs.LG)

Physics-informed neural networks (PINNs) provide a means of obtaining approximate solutions of partial differential equations and systems through the minimisation of an objective function which includes the evaluation of a residual function at a set of collocation points within the domain. The quality of a PINNs solution depends upon numerous parameters, including the number and distribution of these collocation points. In this paper we consider a number of strategies for selecting these points and investigate their impact on the overall accuracy of the method. In particular, we suggest that no single approach is likely to be ``optimal'' but we show how a number of important metrics can have an impact in improving the quality of the results obtained when using a fixed number of residual evaluations. We illustrate these approaches through the use of two benchmark test problems: Burgers' equation and the Allen-Cahn equation.

[290]  arXiv:2404.12283 [pdf, ps, other]
Title: Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting
Subjects: Computation and Language (cs.CL)

Embedding models are crucial for various natural language processing tasks but can be limited by factors such as limited vocabulary, lack of context, and grammatical errors. This paper proposes a novel approach to improve embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding process. By utilizing ChatGPT 3.5 to provide additional context, correct inaccuracies, and incorporate metadata, the proposed method aims to enhance the utility and accuracy of embedding models. The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counter-factual Classification. Results demonstrate significant improvements over the baseline model on the TwitterSemEval 2015 dataset, with the best-performing prompt achieving a score of 85.34 compared to the previous best of 81.52 on the Massive Text Embedding Benchmark (MTEB) Leaderboard. However, performance on the other two datasets was less impressive, highlighting the importance of considering domain-specific characteristics. The findings suggest that LLM-based text enrichment has shown promising results to improve embedding performance, particularly in certain domains. Hence, numerous limitations in the process of embedding can be avoided.

[291]  arXiv:2404.12284 [pdf, other]
Title: A hybrid boundary integral-PDE approach for the approximation of the demagnetization potential in micromagnetics
Subjects: Numerical Analysis (math.NA)

The demagnetization field in micromagnetism is given as the gradient of a potential which solves a partial differential equation (PDE) posed in R^d. In its most general form, this PDE is supplied with continuity condition on the boundary of the magnetic domain and the equation includes a discontinuity in the gradient of the potential over the boundary. Typical numerical algorithms to solve this problem relies on the representation of the potential via the Green's function, where a volume and a boundary integral terms need to be accurately approximated. From a computational point of view, the volume integral dominates the computational cost and can be difficult to approximate due to the singularities of the Green's function. In this article, we propose a hybrid model, where the overall potential can be approximated by solving two uncoupled PDEs posed in bounded domains, whereby the boundary conditions of one of the PDEs is obtained by a low cost boundary integral. Moreover, we provide a convergence analysis of the method under two separate theoretical settings; periodic magnetisation, and high-frequency magnetisation. Numerical examples are given to verify the convergence rates.

[292]  arXiv:2404.12285 [pdf, other]
Title: Performance Evaluation of Segment Anything Model with Variational Prompting for Application to Non-Visible Spectrum Imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The Segment Anything Model (SAM) is a deep neural network foundational model designed to perform instance segmentation which has gained significant popularity given its zero-shot segmentation ability. SAM operates by generating masks based on various input prompts such as text, bounding boxes, points, or masks, introducing a novel methodology to overcome the constraints posed by dataset-specific scarcity. While SAM is trained on an extensive dataset, comprising ~11M images, it mostly consists of natural photographic images with only very limited images from other modalities. Whilst the rapid progress in visual infrared surveillance and X-ray security screening imaging technologies, driven forward by advances in deep learning, has significantly enhanced the ability to detect, classify and segment objects with high accuracy, it is not evident if the SAM zero-shot capabilities can be transferred to such modalities. This work assesses SAM capabilities in segmenting objects of interest in the X-ray/infrared modalities. Our approach reuses the pre-trained SAM with three different prompts: bounding box, centroid and random points. We present quantitative/qualitative results to showcase the performance on selected datasets. Our results show that SAM can segment objects in the X-ray modality when given a box prompt, but its performance varies for point prompts. Specifically, SAM performs poorly in segmenting slender objects and organic materials, such as plastic bottles. We find that infrared objects are also challenging to segment with point prompts given the low-contrast nature of this modality. This study shows that while SAM demonstrates outstanding zero-shot capabilities with box prompts, its performance ranges from moderate to poor for point prompts, indicating that special consideration on the cross-modal generalisation of SAM is needed when considering use on X-ray/infrared imagery.

[293]  arXiv:2404.12289 [pdf, other]
Title: Resilience through Scene Context in Visual Referring Expression Generation
Subjects: Computation and Language (cs.CL)

Scene context is well known to facilitate humans' perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models' visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.

[294]  arXiv:2404.12291 [pdf, ps, other]
Title: Augmenting emotion features in irony detection with Large language modeling
Comments: 11 pages, 3 tables, 2 figures. Submitted to the 25th Chinese Lexical Semantics Workshop
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology augments the detection process by integrating subtle emotional cues, augmented through LLMs, into three benchmark pre-trained NLP models - BERT, T5, and GPT-2 - which are widely recognized as foundational in irony detection. We assessed our method using the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.

[295]  arXiv:2404.12292 [pdf, other]
Title: Reducing Bias in Pre-trained Models by Tuning while Penalizing Change
Comments: 12 pages, 12 figures, presented at VISAPP 2024
Journal-ref: Proceedings of the 19th International Joint Conference on Computer Vision (2024), Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP, ISBN 978-989-758-679-8, ISSN 2184-4321, SciTePress, pages 90-101
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep models trained on large amounts of data often incorporate implicit biases present during training time. If later such a bias is discovered during inference or deployment, it is often necessary to acquire new data and retrain the model. This behavior is especially problematic in critical areas such as autonomous driving or medical decision-making. In these scenarios, new data is often expensive and hard to come by. In this work, we present a method based on change penalization that takes a pre-trained model and adapts the weights to mitigate a previously detected bias. We achieve this by tuning a zero-initialized copy of a frozen pre-trained network. Our method needs very few, in extreme cases only a single, examples that contradict the bias to increase performance. Additionally, we propose an early stopping criterion to modify baselines and reduce overfitting. We evaluate our approach on a well-known bias in skin lesion classification and three other datasets from the domain shift literature. We find that our approach works especially well with very few images. Simple fine-tuning combined with our early stopping also leads to performance benefits for a larger number of tuning samples.

[296]  arXiv:2404.12293 [pdf, other]
Title: Singular-limit analysis of gradient descent with noise injection
Subjects: Machine Learning (cs.LG); Probability (math.PR)

We study the limiting dynamics of a large class of noisy gradient descent systems in the overparameterized regime. In this regime the set of global minimizers of the loss is large, and when initialized in a neighbourhood of this zero-loss set a noisy gradient descent algorithm slowly evolves along this set. In some cases this slow evolution has been related to better generalisation properties. We characterize this evolution for the broad class of noisy gradient descent systems in the limit of small step size. Our results show that the structure of the noise affects not just the form of the limiting process, but also the time scale at which the evolution takes place. We apply the theory to Dropout, label noise and classical SGD (minibatching) noise, and show that these evolve on different two time scales. Classical SGD even yields a trivial evolution on both time scales, implying that additional noise is required for regularization. The results are inspired by the training of neural networks, but the theorems apply to noisy gradient descent of any loss that has a non-trivial zero-loss set.

[297]  arXiv:2404.12295 [pdf, other]
Title: When Medical Imaging Met Self-Attention: A Love Story That Didn't Quite Work Out
Comments: 10 pages, 2 figures, 5 tables, presented at VISAPP 2024
Journal-ref: Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP (2024), ISBN 978-989-758-679-8, ISSN 2184-4321, SciTePress, pages 149-158
Subjects: Computer Vision and Pattern Recognition (cs.CV)

A substantial body of research has focused on developing systems that assist medical professionals during labor-intensive early screening processes, many based on convolutional deep-learning architectures. Recently, multiple studies explored the application of so-called self-attention mechanisms in the vision domain. These studies often report empirical improvements over fully convolutional approaches on various datasets and tasks. To evaluate this trend for medical imaging, we extend two widely adopted convolutional architectures with different self-attention variants on two different medical datasets. With this, we aim to specifically evaluate the possible advantages of additional self-attention. We compare our models with similarly sized convolutional and attention-based baselines and evaluate performance gains statistically. Additionally, we investigate how including such layers changes the features learned by these models during the training. Following a hyperparameter search, and contrary to our expectations, we observe no significant improvement in balanced accuracy over fully convolutional models. We also find that important features, such as dermoscopic structures in skin lesion images, are still not learned by employing self-attention. Finally, analyzing local explanations, we confirm biased feature usage. We conclude that merely incorporating attention is insufficient to surpass the performance of existing fully convolutional methods.

[298]  arXiv:2404.12296 [pdf, other]
Title: Long Duration Battery Sizing, Siting, and Operation Under Wildfire Risk Using Progressive Hedging
Subjects: Systems and Control (eess.SY)

Battery sizing and siting problems are computationally challenging due to the need to make long-term planning decisions that are cognizant of short-term operational decisions. This paper considers sizing, siting, and operating batteries in a power grid to maximize their benefits, including price arbitrage and load shed mitigation, during both normal operations and periods with high wildfire ignition risk. We formulate a multi-scenario optimization problem for long duration battery storage while considering the possibility of load shedding during Public Safety Power Shutoff (PSPS) events that de-energize lines to mitigate severe wildfire ignition risk. To enable a computationally scalable solution of this problem with many scenarios of wildfire risk and power injection variability, we develop a customized temporal decomposition method based on a progressive hedging framework. Extending traditional progressive hedging techniques, we consider coupling in both placement variables across all scenarios and state-of-charge variables at temporal boundaries. This enforces consistency across scenarios while enabling parallel computations despite both spatial and temporal coupling. The proposed decomposition facilitates efficient and scalable modeling of a full year of hourly operational decisions to inform the sizing and siting of batteries. With this decomposition, we model a year of hourly operational decisions to inform optimal battery placement for a 240-bus WECC model in under 70 minutes of wall-clock time.

[299]  arXiv:2404.12299 [pdf, other]
Title: Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair
Comments: 23 pages, 9 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

In Simultaneous Machine Translation (SiMT) systems, training with a simultaneous interpretation (SI) corpus is an effective method for achieving high-quality yet low-latency systems. However, it is very challenging to curate such a corpus due to limitations in the abilities of annotators, and hence, existing SI corpora are limited. Therefore, we propose a method to convert existing speech translation corpora into interpretation-style data, maintaining the original word order and preserving the entire source content using Large Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models in text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces latencies while maintaining the same level of quality as the models trained with offline datasets. The LLM-SI-Corpus is available at \url{https://github.com/yusuke1997/LLM-SI-Corpus}.

[300]  arXiv:2404.12300 [pdf, ps, other]
Title: Proactive Software Supply Chain Risk Management Framework (P-SSCRM) Version 1
Authors: Laurie Williams (North Carolina State University), Sammy Migues (Imbricate Security), Jamie Boote (Synopsys), Ben Hutchison (Synopsys)
Comments: 17 pages, 3 figures, 2 tables, will not be submitted to a conference
Subjects: Cryptography and Security (cs.CR)

The Proactive Software Supply Chain Risk Management Framework (P SSCRM) described in this document is designed to help you understand and plan a secure software supply chain risk management initiative. P SSCRM was created through a process of understanding and analyzing real world data from nine industry leading software supply chain risk management initiatives as well as through the analysis and unification of ten government and industry documents, frameworks, and standards. Although individual methodologies and standards differ, many initiatives and standards share common ground. P SSCRM describes this common ground and presents a model for understanding, quantifying, and developing a secure software supply chain risk management program and determining where your organization's existing efforts stand when contrasted with other real world software supply chain risk management initiatives.

[301]  arXiv:2404.12305 [pdf, other]
Title: SAFLA: Semantic-aware Full Lifecycle Assurance Designed for Intent-Driven Networks
Comments: 11 pages, 10 figures, 3 tables
Subjects: Networking and Internet Architecture (cs.NI)

Intent-driven Networks (IDNs) are crucial in enhancing network management efficiency by enabling the translation of high-level intents into executable configurations via a top-down approach. The escalating complexity of network architectures, however, has led to a semantic gap between these intents and their actual configurations, posing significant challenges to the accuracy and reliability of IDNs. While existing methodologies attempt to address this gap through a bottom-up analysis of network metadata, they often fall short, focusing primarily on intent extraction or reasoning without fully leveraging insights to tackle the inherent challenges of IDNs. To mitigate this, we introduce SAFLA, a semantic-aware framework specifically designed to assure the full lifecycle of intents within IDNs. By seamlessly integrating top-down and bottom-up approaches, SAFLA not only provides comprehensive intent assurance but also effectively bridges the semantic gap. This integration facilitates a self-healing mechanism, substantially reducing the need for manual intervention even in dynamically changing network environments. Experimental results demonstrate the framework's feasibility and efficiency, confirming its capacity to quickly adapt intents in response to network changes, thus marking an important advancement in the field of IDNs.

[302]  arXiv:2404.12306 [pdf, ps, other]
Title: Switchable Single/Dual Edge Registers for Pipeline Architecture
Subjects: Hardware Architecture (cs.AR)

The demand for low power processing is increasing due to mobile and portable devices. In a processor unit, an adder is an important building block since it is used in Floating Point Units (FPU) and Arithmetic Logic Units (ALU). Also, pipeline techniques are used extensively to improve the throughput of the processing unit. To implement a pipeline requires adding a register at each sub-stage that result in increasing the latency. Moreover, designing a low power pipeline adder with low latency has drawn a lot of attention. In a pipelined architecture that uses Dual Edge Triggered (DET) based registers can help in reducing the latency since they can capture input data at both clock edges. However, for high input activity, a DET flip-flop consumes more power than a Single-Edge Triggered (SET) flip-flop. Moreover, it is required to replace each Flip-Flop (FF) in the processor with Dual Edge Triggered (DET) Flip-Flop which will be a considerable area and power overhead. Therefore, it is desirable to have a switchable DET to SET depending on input activity or load condition to reduce the dynamic power consumption. In this paper, we are proposing a new shift register which imitates DET FF based shift register without the need of special DET FF. The proposed shift register improved the latency in a 4-bit pipelined adder by two-fold. Additionally, the power delay product was reduced by 44.16 %.

[303]  arXiv:2404.12308 [pdf, other]
Title: ASID: Active Exploration for System Identification in Robotic Manipulation
Comments: Project website at this https URL
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, such methods can be sample inefficient, making them impractical in many real-world domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer. Project website at https://weirdlabuw.github.io/asid

[304]  arXiv:2404.12309 [pdf, other]
Title: iRAG: An Incremental Retrieval Augmented Generation System for Videos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Retrieval augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots. Use of RAG for combined understanding of multimodal data such as text, images and videos is appealing but two critical limitations exist: one-time, upfront capture of all content in large multimodal data as text descriptions entails high processing times, and not all information in the rich multimodal data is typically in the text descriptions. Since the user queries are not known apriori, developing a system for multimodal to text conversion and interactive querying of multimodal data is challenging.
To address these limitations, we propose iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of large corpus of multimodal data. Unlike traditional RAG, iRAG quickly indexes large repositories of multimodal data, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the multimodal data to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long multimodal to text conversion times, overcomes information loss issues by doing on-demand query-specific extraction of details in multimodal data, and ensures high quality of responses to interactive user queries that are often not known apriori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of large, real-world multimodal data. Experimental results on real-world long videos demonstrate 23x to 25x faster video to text ingestion, while ensuring that quality of responses to interactive user queries is comparable to responses from a traditional RAG where all video data is converted to text upfront before any querying.

[305]  arXiv:2404.12312 [pdf, ps, other]
Title: A Mean-Field Analysis of Neural Gradient Descent-Ascent: Applications to Functional Conditional Moment Equations
Comments: 72 pages, submitted
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

We study minimax optimization problems defined over infinite-dimensional function classes. In particular, we restrict the functions to the class of overparameterized two-layer neural networks and study (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural network. As an initial step, we consider the minimax optimization problem stemming from estimating a functional equation defined by conditional expectations via adversarial estimation, where the objective function is quadratic in the functional space. For this problem, we establish convergence under the mean-field regime by considering the continuous-time and infinite-width limit of the optimization dynamics. Under this regime, gradient descent-ascent corresponds to a Wasserstein gradient flow over the space of probability measures defined over the space of neural network parameters. We prove that the Wasserstein gradient flow converges globally to a stationary point of the minimax objective at a $\mathcal{O}(T^{-1} + \alpha^{-1} ) $ sublinear rate, and additionally finds the solution to the functional equation when the regularizer of the minimax objective is strongly convex. Here $T$ denotes the time and $\alpha$ is a scaling parameter of the neural network. In terms of representation learning, our results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $\mathcal{O}(\alpha^{-1})$, measured in terms of the Wasserstein distance. Finally, we apply our general results to concrete examples including policy evaluation, nonparametric instrumental variable regression, and asset pricing.

[306]  arXiv:2404.12314 [pdf, other]
Title: Guided Discrete Diffusion for Electronic Health Record Generation
Comments: 24 pages, 9 figures, 12 tables
Subjects: Machine Learning (cs.LG)

Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentially concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using the discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining less membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.

[307]  arXiv:2404.12315 [pdf, other]
Title: Adjoint Sensitivities of Chaotic Flows without Adjoint Solvers: A Data-Driven Approach
Subjects: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)

In one calculation, adjoint sensitivity analysis provides the gradient of a quantity of interest with respect to all system's parameters. Conventionally, adjoint solvers need to be implemented by differentiating computational models, which can be a cumbersome task and is code-specific. To propose an adjoint solver that is not code-specific, we develop a data-driven strategy. We demonstrate its application on the computation of gradients of long-time averages of chaotic flows. First, we deploy a parameter-aware echo state network (ESN) to accurately forecast and simulate the dynamics of a dynamical system for a range of system's parameters. Second, we derive the adjoint of the parameter-aware ESN. Finally, we combine the parameter-aware ESN with its adjoint version to compute the sensitivities to the system parameters. We showcase the method on a prototypical chaotic system. Because adjoint sensitivities in chaotic regimes diverge for long integration times, we analyse the application of ensemble adjoint method to the ESN. We find that the adjoint sensitivities obtained from the ESN match closely with the original system. This work opens possibilities for sensitivity analysis without code-specific adjoint solvers.

[308]  arXiv:2404.12317 [pdf, ps, other]
Title: Large Language Models for Synthetic Participatory Planning of Shared Automated Electric Mobility Systems
Authors: Jiangbo Yu
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

Unleashing the synergies of rapidly evolving mobility technologies in a multi-stakeholder landscape presents unique challenges and opportunities for addressing urban transportation problems. This paper introduces a novel synthetic participatory method, critically leveraging large language models (LLMs) to create digital avatars representing diverse stakeholders to plan shared automated electric mobility systems (SAEMS). These calibratable agents collaboratively identify objectives, envision and evaluate SAEMS alternatives, and strategize implementation under risks and constraints. The results of a Montreal case study indicate that a structured and parameterized workflow provides outputs with high controllability and comprehensiveness on an SAEMS plan than generated using a single LLM-enabled expert agent. Consequently, the approach provides a promising avenue for cost-efficiently improving the inclusivity and interpretability of multi-objective transportation planning, suggesting a paradigm shift in how we envision and strategize for sustainable and equitable transportation systems.

[309]  arXiv:2404.12318 [pdf, other]
Title: Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Subjects: Computation and Language (cs.CL)

Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: cross-lingually aligned models are preferred by humans over unaligned models on up to >70% of evaluation instances. We moreover find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices when there is no language-specific data for even supervised finetuning, another component in alignment.

[310]  arXiv:2404.12322 [pdf, other]
Title: Generalizable Face Landmarking Guided by Conditional Face Warping
Comments: Accepted in CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

As a significant step for human face modeling, editing, and generation, face landmarking aims at extracting facial keypoints from images. A generalizable face landmarker is required in practice because real-world facial images, e.g., the avatars in animations and games, are often stylized in various ways. However, achieving generalizable face landmarking is challenging due to the diversity of facial styles and the scarcity of labeled stylized faces. In this study, we propose a simple but effective paradigm to learn a generalizable face landmarker based on labeled real human faces and unlabeled stylized faces. Our method learns the face landmarker as the key module of a conditional face warper. Given a pair of real and stylized facial images, the conditional face warper predicts a warping field from the real face to the stylized one, in which the face landmarker predicts the ending points of the warping field and provides us with high-quality pseudo landmarks for the corresponding stylized facial images. Applying an alternating optimization strategy, we learn the face landmarker to minimize $i)$ the discrepancy between the stylized faces and the warped real ones and $ii)$ the prediction errors of both real and pseudo landmarks. Experiments on various datasets show that our method outperforms existing state-of-the-art domain adaptation methods in face landmarking tasks, leading to a face landmarker with better generalizability. Code is available at https://plustwo0.github.io/project-face-landmarker}{https://plustwo0.github.io/project-face-landmarker.

[311]  arXiv:2404.12329 [pdf, other]
Title: Practical Considerations for Discrete-Time Implementations of Continuous-Time Control Barrier Function-Based Safety Filters
Comments: 7 pages, 4 figures, accepted for publication at the IEEE American Control Conference, 2024
Subjects: Systems and Control (eess.SY)

Safety filters based on control barrier functions (CBFs) have become a popular method to guarantee safety for uncertified control policies, e.g., as resulting from reinforcement learning. Here, safety is defined as staying in a pre-defined set, the safe set, that adheres to the system's state constraints, e.g., as given by lane boundaries for a self-driving vehicle. In this paper, we examine one commonly overlooked problem that arises in practical implementations of continuous-time CBF-based safety filters. In particular, we look at the issues caused by discrete-time implementations of the continuous-time CBF-based safety filter, especially for cases where the magnitude of the Lie derivative of the CBF with respect to the control input is zero or close to zero. When overlooked, this filter can result in undesirable chattering effects or constraint violations. In this work, we propose three mitigation strategies that allow us to use a continuous-time safety filter in a discrete-time implementation with a local relative degree. Using these strategies in augmented CBF-based safety filters, we achieve safety for all states in the safe set by either using an additional penalty term in the safety filtering objective or modifying the CBF such that those undesired states are not encountered during closed-loop operation. We demonstrate the presented issue and validate our three proposed mitigation strategies in simulation and on a real-world quadrotor.

[312]  arXiv:2404.12330 [pdf, other]
Title: A Perspective on Deep Vision Performance with Standard Image and Video Codecs
Comments: Accepted at CVPR 2024 Workshop on AI for Streaming (AIS)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Resource-constrained hardware, such as edge devices or cell phones, often rely on cloud servers to provide the required computational resources for inference in deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required to ensure interoperability. This paper aims to examine the implications of employing standardized codecs within deep vision pipelines. We find that using JPEG and H.264 coding significantly deteriorates the accuracy across a broad range of vision tasks and models. For instance, strong compression rates reduce semantic segmentation accuracy by more than 80% in mIoU. In contrast to previous findings, our analysis extends beyond image and action classification to localization and dense prediction tasks, thus providing a more comprehensive perspective.

[313]  arXiv:2404.12333 [pdf, other]
Title: Customizing Text-to-Image Diffusion with Camera Viewpoint Control
Comments: project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Model customization introduces new concepts to existing text-to-image models, enabling the generation of the new concept in novel contexts. However, such methods lack accurate camera view control w.r.t the object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of camera viewpoint for model customization. This allows us to modify object properties amongst various background scenes via text prompts, all while incorporating the target camera pose as additional control. This new task presents significant challenges in merging a 3D representation from the multi-view images of the new concept with a general, 2D text-to-image model. To bridge this gap, we propose to condition the 2D diffusion process on rendered, view-dependent features of the new object. During training, we jointly adapt the 2D diffusion modules and 3D feature predictions to reconstruct the object's appearance and geometry while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model personalization baselines in preserving the custom object's identity while following the input text prompt and the object's camera pose.

[314]  arXiv:2404.12335 [pdf, other]
Title: Normative Requirements Operationalization with Large Language Models
Subjects: Software Engineering (cs.SE)

Normative non-functional requirements specify constraints that a system must observe in order to avoid violations of social, legal, ethical, empathetic, and cultural norms. As these requirements are typically defined by non-technical system stakeholders with different expertise and priorities (ethicists, lawyers, social scientists, etc.), ensuring their well-formedness and consistency is very challenging. Recent research has tackled this challenge using a domain-specific language to specify normative requirements as rules whose consistency can then be analysed with formal methods. In this paper, we propose a complementary approach that uses Large Language Models to extract semantic relationships between abstract representations of system capabilities. These relations, which are often assumed implicitly by non-technical stakeholders (e.g., based on common sense or domain knowledge), are then used to enrich the automated reasoning techniques for eliciting and analyzing the consistency of normative requirements. We show the effectiveness of our approach to normative requirements elicitation and operationalization through a range of real-world case studies.

[315]  arXiv:2404.12336 [pdf, other]
Title: Combining Power and Arithmetic Optimization via Datapath Rewriting
Subjects: Hardware Architecture (cs.AR)

Industrial datapath designers consider dynamic power consumption to be a key metric. Arithmetic circuits contribute a major component of total chip power consumption and are therefore a common target for power optimization. While arithmetic circuit area and dynamic power consumption are often correlated, there is also a tradeoff to consider, as additional gates can be added to explicitly reduce arithmetic circuit activity and hence reduce power consumption. In this work, we consider two forms of power optimization and their interaction: circuit area reduction via arithmetic optimization, and the elimination of redundant computations using both data and clock gating. By encoding both these classes of optimization as local rewrites of expressions, our tool flow can simultaneously explore them, uncovering new opportunities for power saving through arithmetic rewrites using the e-graph data structure. Since power consumption is highly dependent upon the workload performed by the circuit, our tool flow facilitates a data dependent design paradigm, where an implementation is automatically tailored to particular contexts of data activity. We develop an automated RTL to RTL optimization framework, ROVER, that takes circuit input stimuli and generates power-efficient architectures. We evaluate the effectiveness on both open-source arithmetic benchmarks and benchmarks derived from Intel production examples. The tool is able to reduce the total power consumption by up to 33.9%.

[316]  arXiv:2404.12339 [pdf, other]
Title: SPOT: Point Cloud Based Stereo Visual Place Recognition for Similar and Opposing Viewpoints
Comments: Accepted to ICRA 2024, project website: this https URL
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Recognizing places from an opposing viewpoint during a return trip is a common experience for human drivers. However, the analogous robotics capability, visual place recognition (VPR) with limited field of view cameras under 180 degree rotations, has proven to be challenging to achieve. To address this problem, this paper presents Same Place Opposing Trajectory (SPOT), a technique for opposing viewpoint VPR that relies exclusively on structure estimated through stereo visual odometry (VO). The method extends recent advances in lidar descriptors and utilizes a novel double (similar and opposing) distance matrix sequence matching method. We evaluate SPOT on a publicly available dataset with 6.7-7.6 km routes driven in similar and opposing directions under various lighting conditions. The proposed algorithm demonstrates remarkable improvement over the state-of-the-art, achieving up to 91.7% recall at 100% precision in opposing viewpoint cases, while requiring less storage than all baselines tested and running faster than all but one. Moreover, the proposed method assumes no a priori knowledge of whether the viewpoint is similar or opposing, and also demonstrates competitive performance in similar viewpoint cases.

[317]  arXiv:2404.12341 [pdf, other]
Title: Measuring Feature Dependency of Neural Networks by Collapsing Feature Dimensions in the Data Manifold
Comments: Accepted and will be pulished in International Symposium on Biomedical Imaging (ISBI) 2024
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

This paper introduces a new technique to measure the feature dependency of neural network models. The motivation is to better understand a model by querying whether it is using information from human-understandable features, e.g., anatomical shape, volume, or image texture. Our method is based on the principle that if a model is dependent on a feature, then removal of that feature should significantly harm its performance. A targeted feature is "removed" by collapsing the dimension in the data distribution that corresponds to that feature. We perform this by moving data points along the feature dimension to a baseline feature value while staying on the data manifold, as estimated by a deep generative model. Then we observe how the model's performance changes on the modified test data set, with the target feature dimension removed. We test our method on deep neural network models trained on synthetic image data with known ground truth, an Alzheimer's disease prediction task using MRI and hippocampus segmentations from the OASIS-3 dataset, and a cell nuclei classification task using the Lizard dataset.

[318]  arXiv:2404.12342 [pdf, other]
Title: Large Language Models in Targeted Sentiment Analysis
Comments: Fine-tuned Flan-T5-xl outperforms the top #1 results of transformer-based classifier in RuSentNE-2023 competition, to appear in Lobachevskii Journal of Mathematics No.8/2024 proceedings
Subjects: Computation and Language (cs.CL)

In this paper we investigate the use of decoder-based generative transformers for extracting sentiment towards the named entities in Russian news articles. We study sentiment analysis capabilities of instruction-tuned large language models (LLMs). We consider the dataset of RuSentNE-2023 in our study. The first group of experiments was aimed at the evaluation of zero-shot capabilities of LLMs with closed and open transparencies. The second covers the fine-tuning of Flan-T5 using the "chain-of-thought" (CoT) three-hop reasoning framework (THoR). We found that the results of the zero-shot approaches are similar to the results achieved by baseline fine-tuned encoder-based transformers (BERT-base). Reasoning capabilities of the fine-tuned Flan-T5 models with THoR achieve at least 5% increment with the base-size model compared to the results of the zero-shot experiment. The best results of sentiment analysis on RuSentNE-2023 were achieved by fine-tuned Flan-T5-xl, which surpassed the results of previous state-of-the-art transformer-based classifiers. Our CoT application framework is publicly available: https://github.com/nicolay-r/Reasoning-for-Sentiment-Analysis-Framework

[319]  arXiv:2404.12347 [pdf, other]
Title: AniClipart: Clipart Animation with Text-to-Video Priors
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define B\'{e}zier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.

[320]  arXiv:2404.12349 [pdf, other]
Title: Evaluating AI for Law: Bridging the Gap with Open-Source Solutions
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

This study evaluates the performance of general-purpose AI, like ChatGPT, in legal question-answering tasks, highlighting significant risks to legal professionals and clients. It suggests leveraging foundational models enhanced by domain-specific knowledge to overcome these issues. The paper advocates for creating open-source legal AI systems to improve accuracy, transparency, and narrative diversity, addressing general AI's shortcomings in legal contexts.

[321]  arXiv:2404.12352 [pdf, other]
Title: Point-In-Context: Understanding Point Cloud via In-Context Learning
Comments: Project page: this https URL arXiv admin note: text overlap with arXiv:2306.08659
Subjects: Computer Vision and Pattern Recognition (cs.CV)

With the emergence of large-scale models trained on diverse datasets, in-context learning has emerged as a promising paradigm for multitasking, notably in natural language processing and image processing. However, its application in 3D point cloud tasks remains largely unexplored. In this work, we introduce Point-In-Context (PIC), a novel framework for 3D point cloud understanding via in-context learning. We address the technical challenge of effectively extending masked point modeling to 3D point clouds by introducing a Joint Sampling module and proposing a vanilla version of PIC called Point-In-Context-Generalist (PIC-G). PIC-G is designed as a generalist model for various 3D point cloud tasks, with inputs and outputs modeled as coordinates. In this paradigm, the challenging segmentation task is achieved by assigning label points with XYZ coordinates for each category; the final prediction is then chosen based on the label point closest to the predictions. To break the limitation by the fixed label-coordinate assignment, which has poor generalization upon novel classes, we propose two novel training strategies, In-Context Labeling and In-Context Enhancing, forming an extended version of PIC named Point-In-Context-Segmenter (PIC-S), targeting improving dynamic context labeling and model training. By utilizing dynamic in-context labels and extra in-context pairs, PIC-S achieves enhanced performance and generalization capability in and across part segmentation datasets. PIC is a general framework so that other tasks or datasets can be seamlessly introduced into our PIC through a unified data format. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks and segmenting multi-datasets. Our PIC-S is capable of generalizing unseen datasets and performing novel part segmentation by customizing prompts.

[322]  arXiv:2404.12353 [pdf, other]
Title: V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited amount of source videos, which hampers the effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39\%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.

[323]  arXiv:2404.12355 [pdf, other]
Title: Towards a Foundation Model for Partial Differential Equation: Multi-Operator Learning and Extrapolation
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Foundation models, such as large language models, have demonstrated success in addressing various language and image processing tasks. In this work, we introduce a multi-modal foundation model for scientific problems, named PROSE-PDE. Our model, designed for bi-modality to bi-modality learning, is a multi-operator learning approach which can predict future states of spatiotemporal systems while concurrently learning the underlying governing equations of the physical system. Specifically, we focus on multi-operator learning by training distinct one-dimensional time-dependent nonlinear constant coefficient partial differential equations, with potential applications to many physical applications including physics, geology, and biology. More importantly, we provide three extrapolation studies to demonstrate that PROSE-PDE can generalize physical features through the robust training of multiple operators and that the proposed model can extrapolate to predict PDE solutions whose models or data were unseen during the training. Furthermore, we show through systematic numerical experiments that the utilization of the symbolic modality in our model effectively resolves the well-posedness problems with training multiple operators and thus enhances our model's predictive capabilities.

[324]  arXiv:2404.12358 [pdf, other]
Title: From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
Subjects: Machine Learning (cs.LG)

Reinforcement Learning From Human Feedback (RLHF) has been a critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference, first we theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-tun dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.

[325]  arXiv:2404.12359 [pdf, other]
Title: Inverse Neural Rendering for Explainable Multi-Object Tracking
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze. This is true especially when attempting to predict 3D information based on 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an \emph{Inverse Rendering (IR)} problem, by optimizing via a differentiable rendering pipeline over the latent space of pre-trained 3D object representations and retrieve the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. We investigate not only an alternate take on tracking but our method also enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both these datasets are completely unseen to our method and do not require fine-tuning. Videos and code are available at https://light.princeton.edu/inverse-rendering-tracking/.

[326]  arXiv:2404.12361 [pdf, other]
Title: Learning the Domain Specific Inverse NUFFT for Accelerated Spiral MRI using Diffusion Models
Subjects: Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)

Deep learning methods for accelerated MRI achieve state-of-the-art results but largely ignore additional speedups possible with noncartesian sampling trajectories. To address this gap, we created a generative diffusion model-based reconstruction algorithm for multi-coil highly undersampled spiral MRI. This model uses conditioning during training as well as frequency-based guidance to ensure consistency between images and measurements. Evaluated on retrospective data, we show high quality (structural similarity > 0.87) in reconstructed images with ultrafast scan times (0.02 seconds for a 2D image). We use this algorithm to identify a set of optimal variable-density spiral trajectories and show large improvements in image quality compared to conventional reconstruction using the non-uniform fast Fourier transform. By combining efficient spiral sampling trajectories, multicoil imaging, and deep learning reconstruction, these methods could enable the extremely high acceleration factors needed for real-time 3D imaging.

[327]  arXiv:2404.12362 [pdf, other]
Title: Transformer tricks: Removing weights for skipless transformers
Authors: Nils Graef
Comments: 6 pages, 4 figures
Subjects: Machine Learning (cs.LG)

He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme is only applicable to MHA (multi-head attention), but not for MQA (multi-query attention) and GQA (grouped-query attention). The latter schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity). See arXiv:2402.13388 and https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.

[328]  arXiv:2404.12365 [pdf, other]
Title: When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes
Comments: Accepted to NAACL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

We present FastFit, a method, and a Python package design to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit utilizes a novel approach integrating batch contrastive learning and token-level similarity score. Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves multiclass classification performance in speed and accuracy across FewMany, our newly curated English benchmark, and Multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. The FastFit package is now available on GitHub and PyPi, presenting a user-friendly solution for NLP practitioners.

[329]  arXiv:2404.12366 [pdf, ps, other]
Title: Accounting for AI and Users Shaping One Another: The Role of Mathematical Models
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)

As AI systems enter into a growing number of societal domains, these systems increasingly shape and are shaped by user preferences, opinions, and behaviors. However, the design of AI systems rarely accounts for how AI and users shape one another. In this position paper, we argue for the development of formal interaction models which mathematically specify how AI and users shape one another. Formal interaction models can be leveraged to (1) specify interactions for implementation, (2) monitor interactions through empirical analysis, (3) anticipate societal impacts via counterfactual analysis, and (4) control societal impacts via interventions. The design space of formal interaction models is vast, and model design requires careful consideration of factors such as style, granularity, mathematical complexity, and measurability. Using content recommender systems as a case study, we critically examine the nascent literature of formal interaction models with respect to these use-cases and design axes. More broadly, we call for the community to leverage formal interaction models when designing, evaluating, or auditing any AI system which interacts with users.

[330]  arXiv:2404.12368 [pdf, other]
Title: Gradient-Regularized Out-of-Distribution Detection
Comments: Under review for the 18th European Conference on Computer Vision (ECCV) 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

One of the challenges for neural networks in real-life applications is the overconfident errors these models make when the data is not from the original training distribution.
Addressing this issue is known as Out-of-Distribution (OOD) detection.
Many state-of-the-art OOD methods employ an auxiliary dataset as a surrogate for OOD data during training to achieve improved performance.
However, these methods fail to fully exploit the local information embedded in the auxiliary dataset.
In this work, we propose the idea of leveraging the information embedded in the gradient of the loss function during training to enable the network to not only learn a desired OOD score for each sample but also to exhibit similar behavior in a local neighborhood around each sample.
We also develop a novel energy-based sampling method to allow the network to be exposed to more informative OOD samples during the training phase. This is especially important when the auxiliary dataset is large. We demonstrate the effectiveness of our method through extensive experiments on several OOD benchmarks, improving the existing state-of-the-art FPR95 by 4% on our ImageNet experiment.
We further provide a theoretical analysis through the lens of certified robustness and Lipschitz analysis to showcase the theoretical foundation of our work. We will publicly release our code after the review process.

[331]  arXiv:2404.12369 [pdf, other]
Title: KDk: A Defense Mechanism Against Label Inference Attacks in Vertical Federated Learning
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Vertical Federated Learning (VFL) is a category of Federated Learning in which models are trained collaboratively among parties with vertically partitioned data. Typically, in a VFL scenario, the labels of the samples are kept private from all the parties except for the aggregating server, that is the label owner. Nevertheless, recent works discovered that by exploiting gradient information returned by the server to bottom models, with the knowledge of only a small set of auxiliary labels on a very limited subset of training data points, an adversary can infer the private labels. These attacks are known as label inference attacks in VFL. In our work, we propose a novel framework called KDk, that combines Knowledge Distillation and k-anonymity to provide a defense mechanism against potential label inference attacks in a VFL scenario. Through an exhaustive experimental campaign we demonstrate that by applying our approach, the performance of the analyzed label inference attacks decreases consistently, even by more than 60%, maintaining the accuracy of the whole VFL almost unaltered.

[332]  arXiv:2404.12372 [pdf, other]
Title: MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and significant advancement in healthcare. It assists medical experts to swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamlining data preparation and build new benchmark MedVQA datasets R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method can achieve an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. Dataset and code will be released.

[333]  arXiv:2404.12376 [pdf, other]
Title: Matching the Statistical Query Lower Bound for k-sparse Parity Problems with Stochastic Gradient Descent
Comments: 36 pages, 7 figures, 3 tables
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

The $k$-parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-parity problem with stochastic gradient descent (SGD) on two-layer fully-connected neural networks. We demonstrate that SGD can efficiently solve the $k$-sparse parity problem on a $d$-dimensional hypercube ($k\le O(\sqrt{d})$) with a sample complexity of $\tilde{O}(d^{k-1})$ using $2^{\Theta(k)}$ neurons, thus matching the established $\Omega(d^{k})$ lower bounds of Statistical Query (SQ) models. Our theoretical analysis begins by constructing a good neural network capable of correctly solving the $k$-parity problem. We then demonstrate how a trained neural network with SGD can effectively approximate this good network, solving the $k$-parity problem with small statistical errors. Our theoretical results and findings are supported by empirical evidence, showcasing the efficiency and efficacy of our approach.

[334]  arXiv:2404.12377 [pdf, other]
Title: RoboDreamer: Learning Compositional World Models for Robot Imagination
Subjects: Robotics (cs.RO)

Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate environment simulation. However, one major issue in such models is generalization -- models are limited to synthesizing videos subject to language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a powerful world model to synthesize plans of unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this issue, we introduce RoboDreamer, an innovative approach for learning a compositional world model by factorizing the video generation. We leverage the natural compositionality of language to parse instructions into a set of lower-level primitives, which we condition a set of models on to generate videos. We illustrate how this factorization naturally enables compositional generalization, by allowing us to formulate a new natural language instruction as a combination of previously seen components. We further show how such a factorization enables us to add additional multimodal goals, allowing us to specify a video we wish to generate given both natural language instructions and a goal image. Our approach can successfully synthesize video plans on unseen goals in the RT-X, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.

[335]  arXiv:2404.12378 [pdf, other]
Title: 6Img-to-3D: Few-Image Large-Scale Outdoor Driving Scene Reconstruction
Comments: Joint first authorship. Project page: this https URL Code this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Current 3D reconstruction techniques struggle to infer unbounded scenes from a few images faithfully. Specifically, existing methods have high computational demands, require detailed pose information, and cannot reconstruct occluded regions reliably. We introduce 6Img-to-3D, an efficient, scalable transformer-based encoder-renderer method for single-shot image to 3D reconstruction. Our method outputs a 3D-consistent parameterized triplane from only six outward-facing input images for large-scale, unbounded outdoor driving scenarios. We take a step towards resolving existing shortcomings by combining contracted custom cross- and self-attention mechanisms for triplane parameterization, differentiable volume rendering, scene contraction, and image feature projection. We showcase that six surround-view vehicle images from a single timestamp without global pose information are enough to reconstruct 360$^{\circ}$ scenes during inference time, taking 395 ms. Our method allows, for example, rendering third-person images and birds-eye views. Our code is available at https://github.com/continental/6Img-to-3D, and more examples can be found at our website here https://6Img-to-3D.GitHub.io/.

[336]  arXiv:2404.12379 [pdf, other]
Title: Dynamic Gaussians Mesh: Consistent Mesh Reconstruction from Monocular Videos
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Modern 3D engines and graphics pipelines require mesh as a memory-efficient representation, which allows efficient rendering, geometry processing, texture editing, and many other downstream operations. However, it is still highly difficult to obtain high-quality mesh in terms of structure and detail from monocular visual observations. The problem becomes even more challenging for dynamic scenes and objects. To this end, we introduce Dynamic Gaussians Mesh (DG-Mesh), a framework to reconstruct a high-fidelity and time-consistent mesh given a single monocular video. Our work leverages the recent advancement in 3D Gaussian Splatting to construct the mesh sequence with temporal consistency from a video. Building on top of this representation, DG-Mesh recovers high-quality meshes from the Gaussian points and can track the mesh vertices over time, which enables applications such as texture editing on dynamic objects. We introduce the Gaussian-Mesh Anchoring, which encourages evenly distributed Gaussians, resulting better mesh reconstruction through mesh-guided densification and pruning on the deformed Gaussians. By applying cycle-consistent deformation between the canonical and the deformed space, we can project the anchored Gaussian back to the canonical space and optimize Gaussians across all time frames. During the evaluation on different datasets, DG-Mesh provides significantly better mesh reconstruction and rendering than baselines.

[337]  arXiv:2404.12382 [pdf, other]
Title: Lazy Diffusion Transformer for Interactive Image Editing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.

[338]  arXiv:2404.12383 [pdf, ps, other]
Title: G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis
Comments: accepted to CVPR2024; project page at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose G-HOP, a denoising diffusion based generative prior for hand-object interactions that allows modeling both the 3D object and a human hand, conditioned on the object category. To learn a 3D spatial diffusion model that can capture this joint distribution, we represent the human hand via a skeletal distance field to obtain a representation aligned with the (latent) signed distance field for the object. We show that this hand-object prior can then serve as generic guidance to facilitate other tasks like reconstruction from interaction clip and human grasp synthesis. We believe that our model, trained by aggregating seven diverse real-world interaction datasets spanning across 155 categories, represents a first approach that allows jointly generating both hand and object. Our empirical evaluations demonstrate the benefit of this joint prior in video-based reconstruction and human grasp synthesis, outperforming current task-specific baselines.
Project website: https://judyye.github.io/ghop-www

[339]  arXiv:2404.12385 [pdf, other]
Title: MeshLRM: Large Reconstruction Model for High-Quality Mesh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

We propose MeshLRM, a novel LRM-based approach that can reconstruct a high-quality mesh from merely four input images in less than one second. Different from previous large reconstruction models (LRMs) that focus on NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction and rendering within the LRM framework. This allows for end-to-end mesh reconstruction by fine-tuning a pre-trained NeRF LRM with mesh rendering. Moreover, we improve the LRM architecture by simplifying several complex designs in previous LRMs. MeshLRM's NeRF initialization is sequentially trained with low- and high-resolution images; this new LRM training strategy enables significantly faster convergence and thereby leads to better quality with less compute. Our approach achieves state-of-the-art mesh reconstruction from sparse-view inputs and also allows for many downstream applications, including text-to-3D and single-image-to-3D generation. Project page: https://sarahweiii.github.io/meshlrm/

[340]  arXiv:2404.12386 [pdf, other]
Title: SOHES: Self-supervised Open-world Hierarchical Entity Segmentation
Comments: ICLR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noises in pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES.github.io.

[341]  arXiv:2404.12387 [pdf, other]
Title: Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized values for their respective compute class. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g. MMMU, VQAv2), Core performs competitively to GPT4-V. Meanwhile, on multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively to other frontier models on a set of well-established benchmarks (e.g. MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at this http URL . A showcase of non cherry picked qualitative examples can also be found at this http URL .

[342]  arXiv:2404.12388 [pdf, other]
Title: VideoGigaGAN: Towards Detail-rich Video Super-Resolution
Comments: project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Video super-resolution (VSR) approaches have shown impressive temporal consistency in upsampled videos. However, these approaches tend to generate blurrier results than their image counterparts as they are limited in their generative capability. This raises a fundamental question: can we extend the success of a generative image upsampler to the VSR task while preserving the temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that can produce videos with high-frequency details and temporal consistency. VideoGigaGAN builds upon a large-scale image upsampler -- GigaGAN. Simply inflating GigaGAN to a video model by adding temporal modules produces severe temporal flickering. We identify several key issues and propose techniques that significantly improve the temporal consistency of upsampled videos. Our experiments show that, unlike previous VSR methods, VideoGigaGAN generates temporally consistent videos with more fine-grained appearance details. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with $8\times$ super-resolution.

[343]  arXiv:2404.12389 [pdf, other]
Title: Moving Object Segmentation: All You Need Is SAM (and Flow)
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful,and sometimes complex, approaches and training schemes including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks.

[344]  arXiv:2404.12390 [pdf, other]
Title: BLINK: Multimodal Large Language Models Can See but Not Perceive
Comments: Multimodal Benchmark, Project Url: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.

[345]  arXiv:2404.12391 [pdf, other]
Title: On the Content Bias in Fréchet Video Distance
Comments: CVPR 2024. Project webpage: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Fr\'echet Video Distance (FVD), a prominent metric for evaluating video generation models, is known to conflict with human perception occasionally. In this paper, we aim to explore the extent of FVD's bias toward per-frame quality over temporal realism and identify its sources. We first quantify the FVD's sensitivity to the temporal axis by decoupling the frame and motion quality and find that the FVD increases only slightly with large temporal corruption. We then analyze the generated videos and show that via careful sampling from a large set of generated videos that do not contain motions, one can drastically decrease FVD without improving the temporal quality. Both studies suggest FVD's bias towards the quality of individual frames. We further observe that the bias can be attributed to the features extracted from a supervised video classifier trained on the content-biased dataset. We show that FVD with features extracted from the recent large-scale self-supervised video models is less biased toward image quality. Finally, we revisit a few real-world examples to validate our hypothesis.

Cross-lists for Fri, 19 Apr 24

[346]  arXiv:2206.06267 (cross-list from eess.IV) [pdf, other]
Title: MMMNA-Net for Overall Survival Time Prediction of Brain Tumor Patients
Comments: Accepted EMBC 2022
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Overall survival (OS) time is one of the most important evaluation indices for gliomas situations. Multimodal Magnetic Resonance Imaging (MRI) scans play an important role in the study of glioma prognosis OS time. Several deep learning-based methods are proposed for the OS time prediction on multi-modal MRI problems. However, these methods usually fuse multi-modal information at the beginning or at the end of the deep learning networks and lack the fusion of features from different scales. In addition, the fusion at the end of networks always adapts global with global (eg. fully connected after concatenation of global average pooling output) or local with local (eg. bilinear pooling), which loses the information of local with global. In this paper, we propose a novel method for multi-modal OS time prediction of brain tumor patients, which contains an improved nonlocal features fusion module introduced on different scales. Our method obtains a relative 8.76% improvement over the current state-of-art method (0.6989 vs. 0.6426 on accuracy). Extensive testing demonstrates that our method could adapt to situations with missing modalities. The code is available at https://github.com/TangWen920812/mmmna-net.

[347]  arXiv:2404.11520 (cross-list from math.OC) [pdf, other]
Title: Equitably allocating wildfire resilience investments for power grids: The curse of aggregation and vulnerability indices
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)

Wildfires ignited by power systems infrastructure are among the most destructive wildfires; hence some utility companies in wildfire-prone regions have pursued a proactive policy of emergency power shutoffs. These shutoffs, while mitigating the risk of disastrous ignition events, result in power outages that could negatively impacts vulnerable communities. In this paper, we consider how to equitably allocate funds to underground and effectively de-risk power lines in transmission networks. We explore the impact of the 2021 White House resource allocation policy called the Justice40 initiative, which states that 40% of the benefits of federally-funded climate-related investments should go to socially vulnerable communities. The definition of what constitutes a vulnerable community varies by organization, and we consider two major recently proposed vulnerability indices: the Justice40 index created under the 2021 White House and the Social Vulnerability Index (SVI) developed by the Center for Disease Control and Prevention (CDC). We show that allocating budget according to these two indices fails to reduce power outages for indigenous communities and those subject to high wildfire ignition risk using a high-fidelity synthetic power grid dataset that matches the key features of the Texas transmission system. We discuss how aggregation of communities and "one size fits all" vulnerability indices might be the reasons for the misalignment between the goals of vulnerability indices and their realized impact in this particular case study. We provide a method of achieving an equitable investment plan by adding group-level protections on percentage of load that is shed across each population group of interest.

[348]  arXiv:2404.11619 (cross-list from eess.AS) [pdf, ps, other]
Title: Advancing Speech Translation: A Corpus of Mandarin-English Conversational Telephone Speech
Comments: 2 pages
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

This paper introduces a set of English translations for a 123-hour subset of the CallHome Mandarin Chinese data and the HKUST Mandarin Telephone Speech data for the task of speech translation. Paired source-language speech and target-language text is essential for training end-to-end speech translation systems and can provide substantial performance improvements for cascaded systems as well, relative to training on more widely available text data sets. We demonstrate that fine-tuning a general-purpose translation model to our Mandarin-English conversational telephone speech training set improves target-domain BLEU by more than 8 points, highlighting the importance of matched training data.

[349]  arXiv:2404.11624 (cross-list from math.GM) [pdf, ps, other]
Title: Token Space: A Category Theory Framework for AI Computations
Authors: Wuming Pan
Comments: 42 pages,5 tables
Subjects: General Mathematics (math.GM); Machine Learning (cs.LG)

This paper introduces the Token Space framework, a novel mathematical construct designed to enhance the interpretability and effectiveness of deep learning models through the application of category theory. By establishing a categorical structure at the Token level, we provide a new lens through which AI computations can be understood, emphasizing the relationships between tokens, such as grouping, order, and parameter types. We explore the foundational methodologies of the Token Space, detailing its construction, the role of construction operators and initial categories, and its application in analyzing deep learning models, specifically focusing on attention mechanisms and Transformer architectures. The integration of category theory into AI research offers a unified framework to describe and analyze computational structures, enabling new research paths and development possibilities. Our investigation reveals that the Token Space framework not only facilitates a deeper theoretical understanding of deep learning models but also opens avenues for the design of more efficient, interpretable, and innovative models, illustrating the significant role of category theory in advancing computational models.

[350]  arXiv:2404.11674 (cross-list from hep-lat) [pdf, other]
Title: Practical applications of machine-learned flows on gauge fields
Comments: 9 pages, 5 figures, proceedings of the 40th International Symposium on Lattice Field Theory (Lattice 2023)
Subjects: High Energy Physics - Lattice (hep-lat); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)

Normalizing flows are machine-learned maps between different lattice theories which can be used as components in exact sampling and inference schemes. Ongoing work yields increasingly expressive flows on gauge fields, but it remains an open question how flows can improve lattice QCD at state-of-the-art scales. We discuss and demonstrate two applications of flows in replica exchange (parallel tempering) sampling, aimed at improving topological mixing, which are viable with iterative improvements upon presently available flows.

[351]  arXiv:2404.11725 (cross-list from eess.IV) [pdf, ps, other]
Title: Postoperative glioblastoma segmentation: Development of a fully automated pipeline using deep convolutional neural networks and comparison with currently available models
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Accurately assessing tumor removal is paramount in the management of glioblastoma. We developed a pipeline using MRI scans and neural networks to segment tumor subregions and the surgical cavity in postoperative images. Our model excels in accurately classifying the extent of resection, offering a valuable tool for clinicians in assessing treatment effectiveness.

[352]  arXiv:2404.11741 (cross-list from physics.med-ph) [pdf, other]
Title: Diffusion Schrödinger Bridge Models for High-Quality MR-to-CT Synthesis for Head and Neck Proton Treatment Planning
Comments: International Conference on the use of Computers in Radiation therapy (ICCR)
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)

In recent advancements in proton therapy, MR-based treatment planning is gaining momentum to minimize additional radiation exposure compared to traditional CT-based methods. This transition highlights the critical need for accurate MR-to-CT image synthesis, which is essential for precise proton dose calculations. Our research introduces the Diffusion Schr\"odinger Bridge Models (DSBM), an innovative approach for high-quality MR-to-CT synthesis. DSBM learns the nonlinear diffusion processes between MR and CT data distributions. This method improves upon traditional diffusion models by initiating synthesis from the prior distribution rather than the Gaussian distribution, enhancing both generation quality and efficiency. We validated the effectiveness of DSBM on a head and neck cancer dataset, demonstrating its superiority over traditional image synthesis methods through both image-level and dosimetric-level evaluations. The effectiveness of DSBM in MR-based proton treatment planning highlights its potential as a valuable tool in various clinical scenarios.

[353]  arXiv:2404.11768 (cross-list from cond-mat.stat-mech) [pdf, ps, other]
Title: Tensor-Networks-based Learning of Probabilistic Cellular Automata Dynamics
Comments: 9 pages, 7 figures
Subjects: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Quantum Physics (quant-ph)

Algorithms developed to solve many-body quantum problems, like tensor networks, can turn into powerful quantum-inspired tools to tackle problems in the classical domain. In this work, we focus on matrix product operators, a prominent numerical technique to study many-body quantum systems, especially in one dimension. It has been previously shown that such a tool can be used for classification, learning of deterministic sequence-to-sequence processes and of generic quantum processes. We further develop a matrix product operator algorithm to learn probabilistic sequence-to-sequence processes and apply this algorithm to probabilistic cellular automata. This new approach can accurately learn probabilistic cellular automata processes in different conditions, even when the process is a probabilistic mixture of different chaotic rules. In addition, we find that the ability to learn these dynamics is a function of the bit-wise difference between the rules and whether one is much more likely than the other.

[354]  arXiv:2404.11811 (cross-list from physics.chem-ph) [pdf, ps, other]
Title: Physics-informed active learning for accelerating quantum chemical simulations
Subjects: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Quantum chemical simulations can be greatly accelerated by constructing machine learning potentials, which is often done using active learning (AL). The usefulness of the constructed potentials is often limited by the high effort required and their insufficient robustness in the simulations. Here we introduce the end-to-end AL for constructing robust data-efficient potentials with affordable investment of time and resources and minimum human interference. Our AL protocol is based on the physics-informed sampling of training points, automatic selection of initial data, and uncertainty quantification. The versatility of this protocol is shown in our implementation of quasi-classical molecular dynamics for simulating vibrational spectra, conformer search of a key biochemical molecule, and time-resolved mechanism of the Diels-Alder reaction. These investigations took us days instead of weeks of pure quantum chemical calculations on a high-performance computing cluster.

[355]  arXiv:2404.11843 (cross-list from eess.IV) [pdf, other]
Title: Computer-Aided Diagnosis of Thoracic Diseases in Chest X-rays using hybrid CNN-Transformer Architecture
Authors: Sonit Singh
Comments: 24 pages, 13 Figures, 13 Tables. arXiv admin note: text overlap with arXiv:1904.09925 by other authors
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Medical imaging has been used for diagnosis of various conditions, making it one of the most powerful resources for effective patient care. Due to widespread availability, low cost, and low radiation, chest X-ray is one of the most sought after radiology examination for the diagnosis of various thoracic diseases. Due to advancements in medical imaging technologies and increasing patient load, current radiology workflow faces various challenges including increasing backlogs, working long hours, and increase in diagnostic errors. An automated computer-aided diagnosis system that can interpret chest X-rays to augment radiologists by providing actionable insights has potential to provide second opinion to radiologists, highlight relevant regions in the image, in turn expediting clinical workflow, reducing diagnostic errors, and improving patient care. In this study, we applied a novel architecture augmenting the DenseNet121 Convolutional Neural Network (CNN) with multi-head self-attention mechanism using transformer, namely SA-DenseNet121, that can identify multiple thoracic diseases in chest X-rays. We conducted experiments on four of the largest chest X-ray datasets, namely, ChestX-ray14, CheXpert, MIMIC-CXR-JPG, and IU-CXR. Experimental results in terms of area under the receiver operating characteristics (AUC-ROC) shows that augmenting CNN with self-attention has potential in diagnosing different thoracic diseases from chest X-rays. The proposed methodology has the potential to support the reading workflow, improve efficiency, and reduce diagnostic errors.

[356]  arXiv:2404.11883 (cross-list from econ.TH) [pdf, ps, other]
Title: Testing the simplicity of strategy-proof mechanisms
Subjects: Theoretical Economics (econ.TH); Computer Science and Game Theory (cs.GT)

This paper experimentally evaluates four mechanisms intended to achieve the Uniform outcome in rationing problems (Sprumont, 1991). Our benchmark is the dominant-strategy, direct-revelation mechanism of the Uniform rule. A strategically equivalent mechanism that provides non-binding feedback during the reporting period greatly improves performance. A sequential revelation mechanism produces modest improvements despite not possessing dominant strategies. A novel, obviously strategy-proof mechanism, devised by Arribillaga et al. (2023), does not improve performance. We characterize each alternative to the direct mechanism, finding general lessons about the advantages of real-time feedback and sequentiality of play as well as the potential shortcomings of an obviously strategy-proof mechanism.

[357]  arXiv:2404.11889 (cross-list from eess.IV) [pdf, other]
Title: Multi-view X-ray Image Synthesis with Multiple Domain Disentanglement from CT Scans
Comments: 13 pages, 10 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

X-ray images play a vital role in the intraoperative processes due to their high resolution and fast imaging speed and greatly promote the subsequent segmentation, registration and reconstruction. However, over-dosed X-rays superimpose potential risks to human health to some extent. Data-driven algorithms from volume scans to X-ray images are restricted by the scarcity of paired X-ray and volume data. Existing methods are mainly realized by modelling the whole X-ray imaging procedure. In this study, we propose a learning-based approach termed CT2X-GAN to synthesize the X-ray images in an end-to-end manner using the content and style disentanglement from three different image domains. Our method decouples the anatomical structure information from CT scans and style information from unpaired real X-ray images/ digital reconstructed radiography (DRR) images via a series of decoupling encoders. Additionally, we introduce a novel consistency regularization term to improve the stylistic resemblance between synthesized X-ray images and real X-ray images. Meanwhile, we also impose a supervised process by computing the similarity of computed real DRR and synthesized DRR images. We further develop a pose attention module to fully strengthen the comprehensive information in the decoupled content code from CT scans, facilitating high-quality multi-view image synthesis in the lower 2D space. Extensive experiments were conducted on the publicly available CTSpine1K dataset and achieved 97.8350, 0.0842 and 3.0938 in terms of FID, KID and defined user-scored X-ray similarity, respectively. In comparison with 3D-aware methods ($\pi$-GAN, EG3D), CT2X-GAN is superior in improving the synthesis quality and realistic to the real X-ray images.

[358]  arXiv:2404.11929 (cross-list from eess.IV) [pdf, other]
Title: A Symmetric Regressor for MRI-Based Assessment of Striatal Dopamine Transporter Uptake in Parkinson's Disease
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Dopamine transporter (DAT) imaging is commonly used for monitoring Parkinson's disease (PD), where striatal DAT uptake amount is computed to assess PD severity. However, DAT imaging has a high cost and the risk of radiance exposure and is not available in general clinics. Recently, MRI patch of the nigral region has been proposed as a safer and easier alternative. This paper proposes a symmetric regressor for predicting the DAT uptake amount from the nigral MRI patch. Acknowledging the symmetry between the right and left nigrae, the proposed regressor incorporates a paired input-output model that simultaneously predicts the DAT uptake amounts for both the right and left striata. Moreover, it employs a symmetric loss that imposes a constraint on the difference between right-to-left predictions, resembling the high correlation in DAT uptake amounts in the two lateral sides. Additionally, we propose a symmetric Monte-Carlo (MC) dropout method for providing a fruitful uncertainty estimate of the DAT uptake prediction, which utilizes the above symmetry. We evaluated the proposed approach on 734 nigral patches, which demonstrated significantly improved performance of the symmetric regressor compared with the standard regressors while giving better explainability and feature representation. The symmetric MC dropout also gave precise uncertainty ranges with a high probability of including the true DAT uptake amounts within the range.

[359]  arXiv:2404.11953 (cross-list from quant-ph) [pdf, other]
Title: Tailoring Fault-Tolerance to Quantum Algorithms
Comments: 19 pages IEEE double column, 30 figures
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT)

The standard approach to universal fault-tolerant quantum computing is to develop a general purpose quantum error correction mechanism that can implement a universal set of logical gates fault-tolerantly. Given such a scheme, any quantum algorithm can be realized fault-tolerantly by composing the relevant logical gates from this set. However, we know that quantum computers provide a significant quantum advantage only for specific quantum algorithms. Hence, a universal quantum computer can likely gain from compiling such specific algorithms using tailored quantum error correction schemes. In this work, we take the first steps towards such algorithm-tailored quantum fault-tolerance. We consider Trotter circuits in quantum simulation, which is an important application of quantum computing. We develop a solve-and-stitch algorithm to systematically synthesize physical realizations of Clifford Trotter circuits on the well-known $[\![ n,n-2,2 ]\!]$ error-detecting code family. Our analysis shows that this family implements Trotter circuits with optimal depth, thereby serving as an illuminating example of tailored quantum error correction. We achieve fault-tolerance for these circuits using flag gadgets, which add minimal overhead. The solve-and-stitch algorithm has the potential to scale beyond this specific example and hence provide a principled approach to tailored fault-tolerance in quantum computing.

[360]  arXiv:2404.11965 (cross-list from stat.ML) [pdf, other]
Title: Multi-fidelity Gaussian process surrogate modeling for regression problems in physics
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)

One of the main challenges in surrogate modeling is the limited availability of data due to resource constraints associated with computationally expensive simulations. Multi-fidelity methods provide a solution by chaining models in a hierarchy with increasing fidelity, associated with lower error, but increasing cost. In this paper, we compare different multi-fidelity methods employed in constructing Gaussian process surrogates for regression. Non-linear autoregressive methods in the existing literature are primarily confined to two-fidelity models, and we extend these methods to handle more than two levels of fidelity. Additionally, we propose enhancements for an existing method incorporating delay terms by introducing a structured kernel. We demonstrate the performance of these methods across various academic and real-world scenarios. Our findings reveal that multi-fidelity methods generally have a smaller prediction error for the same computational cost as compared to the single-fidelity method, although their effectiveness varies across different scenarios.

[361]  arXiv:2404.11969 (cross-list from math.LO) [pdf, other]
Title: Lewis and Brouwer meet Strong Löb
Subjects: Logic (math.LO); Logic in Computer Science (cs.LO)

We study the principle phi implies box phi, known as `Strength' or `the Completeness Principle', over the constructive version of L\"ob's Logic. We consider this principle both for the modal language with the necessity operator and for the modal language with the Lewis arrow, where L\"ob's Logic is suitably adapted.
Central insights of provability logic, like the de Jongh-Sambin Theorem and the de Jongh-Sambin-Bernardi Theorem, take a simple form in the presence of Strength. We present these simple versions. We discuss the semantics of two salient systems and prove uniform interpolation for both. In addition, we sketch arithmetical interpretations of our systems. Finally, we describe the various connections of our subject with Computer Science.

[362]  arXiv:2404.11974 (cross-list from eess.IV) [pdf, other]
Title: Device (In)Dependence of Deep Learning-based Image Age Approximation
Comments: This work was accepted and presented in: 2022 ICPR-Workshop on Artificial Intelligence for Multimedia Forensics and Disinformation Detection. Montreal, Quebec, Canada. However, due to a technical issue on the publishing companies' side, the work does not appear in the workshop proceedings
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The goal of temporal image forensic is to approximate the age of a digital image relative to images from the same device. Usually, this is based on traces left during the image acquisition pipeline. For example, several methods exist that exploit the presence of in-field sensor defects for this purpose. In addition to these 'classical' methods, there is also an approach in which a Convolutional Neural Network (CNN) is trained to approximate the image age. One advantage of a CNN is that it independently learns the age features used. This would make it possible to exploit other (different) age traces in addition to the known ones (i.e., in-field sensor defects). In a previous work, we have shown that the presence of strong in-field sensor defects is irrelevant for a CNN to predict the age class. Based on this observation, the question arises how device (in)dependent the learned features are. In this work, we empirically asses this by training a network on images from a single device and then apply the trained model to images from different devices. This evaluation is performed on 14 different devices, including 10 devices from the publicly available 'Northumbria Temporal Image Forensics' database. These 10 different devices are based on five different device pairs (i.e., with the identical camera model).

[363]  arXiv:2404.12001 (cross-list from q-fin.CP) [pdf, ps, other]
Title: Internet sentiment exacerbates intraday overtrading, evidence from A-Share market
Authors: Peng Yifeng
Comments: 27 pages, 5 tables
Subjects: Computational Finance (q-fin.CP); Computational Engineering, Finance, and Science (cs.CE)

Market fluctuations caused by overtrading are important components of systemic market risk. This study examines the effect of investor sentiment on intraday overtrading activities in the Chinese A-share market. Employing high-frequency sentiment indices inferred from social media posts on the Eastmoney forum Guba, the research focuses on constituents of the CSI 300 and CSI 500 indices over a period from 01/01/2018, to 12/30/2022. The empirical analysis indicates that investor sentiment exerts a significantly positive impact on intraday overtrading, with the influence being more pronounced among institutional investors relative to individual traders. Moreover, sentiment-driven overtrading is found to be more prevalent during bull markets as opposed to bear markets. Additionally, the effect of sentiment on overtrading is observed to be more pronounced among individual investors in large-cap stocks compared to small- and mid-cap stocks.

[364]  arXiv:2404.12070 (cross-list from math.PR) [pdf, ps, other]
Title: Towards an Approximation Theory of Observable Operator Models
Authors: Wojciech Anyszka
Comments: 15 pages
Subjects: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)

Observable operator models (OOMs) offer a powerful framework for modelling stochastic processes, surpassing the traditional hidden Markov models (HMMs) in generality and efficiency. However, using OOMs to model infinite-dimensional processes poses significant theoretical challenges. This article explores a rigorous approach to developing an approximation theory for OOMs of infinite-dimensional processes. Building upon foundational work outlined in an unpublished tutorial [Jae98], an inner product structure on the space of future distributions is rigorously established and the continuity of observable operators with respect to the associated 2-norm is proven. The original theorem proven in this thesis describes a fundamental obstacle in making an infinite-dimensional space of future distributions into a Hilbert space. The presented findings lay the groundwork for future research in approximating observable operators of infinite-dimensional processes, while a remedy to the encountered obstacle is suggested.

[365]  arXiv:2404.12141 (cross-list from q-bio.BM) [pdf, other]
Title: MolCRAFT: Structure-Based Drug Design in Continuous Parameter Space
Comments: 19 pages, 11 figures
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)

Generative models for structure-based drug design (SBDD) have shown promising results in recent years. Existing works mainly focus on how to generate molecules with higher binding affinity, ignoring the feasibility prerequisites for generated 3D poses and resulting in false positives. We conduct thorough studies on key factors of ill-conformational problems when applying autoregressive methods and diffusion to SBDD, including mode collapse and hybrid continuous-discrete space. In this paper, we introduce \ours, the first SBDD model that operates in the continuous parameter space, together with a novel noise reduced sampling strategy. Empirical results show that our model consistently achieves superior performance in binding affinity with more stable 3D structure, demonstrating our ability to accurately model interatomic interactions. To our best knowledge, MolCRAFT is the first to achieve reference-level Vina Scores (-6.59 kcal/mol), outperforming other strong baselines by a wide margin (-0.84 kcal/mol). Code is available at https://github.com/AlgoMole/MolCRAFT.

[366]  arXiv:2404.12148 (cross-list from eess.SP) [pdf, other]
Title: Unknown Interference Modeling for Rate Adaptation in Cell-Free Massive MIMO Networks
Comments: 6 pages, Accepted at IEEE Wireless Communications and Networking Conference (WCNC), 2024. arXiv admin note: text overlap with arXiv:2305.07344
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)

Co-channel interference poses a challenge in any wireless communication network where the time-frequency resources are reused over different geographical areas. The interference is particularly diverse in cell-free massive multiple-input multiple-output (MIMO) networks, where a large number of user equipments (UEs) are multiplexed by a multitude of access points (APs) on the same time-frequency resources. For realistic and scalable network operation, only the interference from UEs belonging to the same serving cluster of APs can be estimated in real-time and suppressed by precoding/combining. As a result, the unknown interference arising from scheduling variations in neighboring clusters makes the rate adaptation hard and can lead to outages. This paper aims to model the unknown interference power in the uplink of a cell-free massive MIMO network. The results show that the proposed method effectively describes the distribution of the unknown interference power and provides a tool for rate adaptation with guaranteed target outage.

[367]  arXiv:2404.12163 (cross-list from eess.IV) [pdf, other]
Title: Unsupervised Microscopy Video Denoising
Comments: Accepted at CVPRW 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In this paper, we introduce a novel unsupervised network to denoise microscopy videos featured by image sequences captured by a fixed location microscopy camera. Specifically, we propose a DeepTemporal Interpolation method, leveraging a temporal signal filter integrated into the bottom CNN layers, to restore microscopy videos corrupted by unknown noise types. Our unsupervised denoising architecture is distinguished by its ability to adapt to multiple noise conditions without the need for pre-existing noise distribution knowledge, addressing a significant challenge in real-world medical applications. Furthermore, we evaluate our denoising framework using both real microscopy recordings and simulated data, validating our outperforming video denoising performance across a broad spectrum of noise scenarios. Extensive experiments demonstrate that our unsupervised model consistently outperforms state-of-the-art supervised and unsupervised video denoising techniques, proving especially effective for microscopy videos.

[368]  arXiv:2404.12170 (cross-list from eess.SP) [pdf, other]
Title: Secure Semantic Communication for Image Transmission in the Presence of Eavesdroppers
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)

Semantic communication (SemCom) has emerged as a key technology for the forthcoming sixth-generation (6G) network, attributed to its enhanced communication efficiency and robustness against channel noise. However, the open nature of wireless channels renders them vulnerable to eavesdropping, posing a serious threat to privacy. To address this issue, we propose a novel secure semantic communication (SemCom) approach for image transmission, which integrates steganography technology to conceal private information within non-private images (host images). Specifically, we propose an invertible neural network (INN)-based signal steganography approach, which embeds channel input signals of a private image into those of a host image before transmission. This ensures that the original private image can be reconstructed from the received signals at the legitimate receiver, while the eavesdropper can only decode the information of the host image. Simulation results demonstrate that the proposed approach maintains comparable reconstruction quality of both host and private images at the legitimate receiver, compared to scenarios without any secure mechanisms. Experiments also show that the eavesdropper is only able to reconstruct host images, showcasing the enhanced security provided by our approach.

[369]  arXiv:2404.12178 (cross-list from physics.soc-ph) [pdf, other]
Title: Designing a sector-coupled European energy system robust to 60 years of historical weather data
Subjects: Physics and Society (physics.soc-ph); Systems and Control (eess.SY)

As energy systems transform to rely on renewable energy and electrification, they encounter stronger year-to-year variability in energy supply and demand. However, most infrastructure planning is based on a single weather year, resulting in a lack of robustness. In this paper, we optimize energy infrastructure for a European energy system designed for net-zero CO$_2$ emissions in 62 different weather years. Subsequently, we fix the capacity layouts and simulate their operation in every weather year, to evaluate resource adequacy and CO$_2$ emissions abatement. We show that interannual weather variability causes variation of $\pm$10\% in total system cost. The most expensive capacity layout obtains the lowest net CO$_2$ emissions but not the highest resource adequacy. Instead, capacity layouts designed with years including compound weather events result in a more robust and cost-effective design. Deploying CO$_2$-emitting backup generation is a cost-effective robustness measure, which only increase CO$_2$ emissions marginally as the average CO$_2$ emissions remain less than 1\% of 1990 levels. Our findings highlight how extreme weather years drive investments in robustness measures, making them compatible with all weather conditions within six decades of historical weather data.

[370]  arXiv:2404.12205 (cross-list from math.OC) [pdf, other]
Title: Tracing Pareto-optimal points for multi-objective shape optimization applied to electric machines
Comments: 9 pages, 3 figures. SCEE2024 conference proceeding paper
Subjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)

In the context of the optimization of rotating electric machines, many different objective functions are of interest and considering this during the optimization is of crucial importance. While evolutionary algorithms can provide a Pareto front straightforwardly and are widely used in this context, derivative-based optimization algorithms can be computationally more efficient. In this case, a Pareto front can be obtained by performing several optimization runs with different weights. In this work, we focus on a free-form shape optimization approach allowing for arbitrary motor geometries. In particular, we propose a way to efficiently obtain Pareto-optimal points by moving along to the Pareto front exploiting a homotopy method based on second order shape derivatives.

[371]  arXiv:2404.12287 (cross-list from math.GT) [pdf, other]
Title: Lifting maps between graphs to embeddings
Authors: Alexey Gorelov
Comments: 38 pages, 8 figures
Subjects: Geometric Topology (math.GT); Computational Geometry (cs.CG); Discrete Mathematics (cs.DM); Combinatorics (math.CO)

In this paper, we study conditions for the existence of an embedding $\widetilde{f} \colon P \to Q \times \mathbb{R}$ such that $f = \mathrm{pr}_Q \circ \widetilde{f}$, where $f \colon P \to Q$ is a piecewise linear map between polyhedra. Our focus is on non-degenerate maps between graphs, where non-degeneracy means that the preimages of points are finite sets.
We introduce combinatorial techniques and establish necessary and sufficient conditions for the general case. Using these results, we demonstrate that the problem of the existence of a lifting reduces to testing the satisfiability of a 3-CNF formula. Additionally, we construct a counterexample to a result by V. Po\'{e}naru on lifting of smooth immersions to embeddings.
Furthermore, by establishing connections between the stated problem and the approximability by embeddings, we deduce that, in the case of generic maps from a tree to a segment, a weaker condition becomes sufficient for the existence of a lifting.

[372]  arXiv:2404.12290 (cross-list from stat.ML) [pdf, other]
Title: Debiased Distribution Compression
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)

Modern compression methods can summarize a target distribution $\mathbb{P}$ more succinctly than i.i.d. sampling but require access to a low-bias input sequence like a Markov chain converging quickly to $\mathbb{P}$. We introduce a new suite of compression methods suitable for compression with biased input sequences. Given $n$ points targeting the wrong distribution and quadratic time, Stein Kernel Thinning (SKT) returns $\sqrt{n}$ equal-weighted points with $\widetilde{O}(n^{-1/2})$ maximum mean discrepancy (MMD) to $\mathbb {P}$. For larger-scale compression tasks, Low-rank SKT achieves the same feat in sub-quadratic time using an adaptive low-rank debiasing procedure that may be of independent interest. For downstream tasks that support simplex or constant-preserving weights, Stein Recombination and Stein Cholesky achieve even greater parsimony, matching the guarantees of SKT with as few as $\operatorname{poly-log}(n)$ weighted points. Underlying these advances are new guarantees for the quality of simplex-weighted coresets, the spectral decay of kernel matrices, and the covering numbers of Stein kernel Hilbert spaces. In our experiments, our techniques provide succinct and accurate posterior summaries while overcoming biases due to burn-in, approximate Markov chain Monte Carlo, and tempering.

[373]  arXiv:2404.12294 (cross-list from stat.ML) [pdf, other]
Title: floZ: Evidence estimation from posterior samples with normalizing flows
Comments: 10 pages, 4 figures, 1 table
Subjects: Machine Learning (stat.ML); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)

We propose a novel method (floZ), based on normalizing flows, for estimating the Bayesian evidence (and its numerical uncertainty) from a set of samples drawn from the unnormalized posterior distribution. We validate it on distributions whose evidence is known analytically, up to 15 parameter space dimensions, and compare with two state-of-the-art techniques for estimating the evidence: nested sampling (which computes the evidence as its main target) and a k-nearest-neighbors technique that produces evidence estimates from posterior samples. Provided representative samples from the target posterior are available, our method is more robust to posterior distributions with sharp features, especially in higher dimensions. It has wide applicability, e.g., to estimate the evidence from variational inference, Markov-chain Monte Carlo samples, or any other method that delivers samples from the unnormalized posterior density.

[374]  arXiv:2404.12356 (cross-list from stat.ML) [pdf, other]
Title: Improving the interpretability of GNN predictions through conformal-based graph sparsification
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Graph Neural Networks (GNNs) have achieved state-of-the-art performance in solving graph classification tasks. However, most GNN architectures aggregate information from all nodes and edges in a graph, regardless of their relevance to the task at hand, thus hindering the interpretability of their predictions. In contrast to prior work, in this paper we propose a GNN \emph{training} approach that jointly i) finds the most predictive subgraph by removing edges and/or nodes -- -\emph{without making assumptions about the subgraph structure} -- while ii) optimizing the performance of the graph classification task. To that end, we rely on reinforcement learning to solve the resulting bi-level optimization with a reward function based on conformal predictions to account for the current in-training uncertainty of the classifier. Our empirical results on nine different graph classification datasets show that our method competes in performance with baselines while relying on significantly sparser subgraphs, leading to more interpretable GNN-based predictions.

[375]  arXiv:2404.12367 (cross-list from cond-mat.mtrl-sci) [pdf, other]
Title: Information theory unifies atomistic machine learning, uncertainty quantification, and materials thermodynamics
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)

An accurate description of information is relevant for a range of problems in atomistic modeling, such as sampling methods, detecting rare events, analyzing datasets, or performing uncertainty quantification (UQ) in machine learning (ML)-driven simulations. Although individual methods have been proposed for each of these tasks, they lack a common theoretical background integrating their solutions. Here, we introduce an information theoretical framework that unifies predictions of phase transformations, kinetic events, dataset optimality, and model-free UQ from atomistic simulations, thus bridging materials modeling, ML, and statistical mechanics. We first demonstrate that, for a proposed representation, the information entropy of a distribution of atom-centered environments is a surrogate value for thermodynamic entropy. Using molecular dynamics (MD) simulations, we show that information entropy differences from trajectories can be used to build phase diagrams, identify rare events, and recover classical theories of nucleation. Building on these results, we use this general concept of entropy to quantify information in datasets for ML interatomic potentials (IPs), informing compression, explaining trends in testing errors, and evaluating the efficiency of active learning strategies. Finally, we propose a model-free UQ method for MLIPs using information entropy, showing it reliably detects extrapolation regimes, scales to millions of atoms, and goes beyond model errors. This method is made available as the package QUESTS: Quick Uncertainty and Entropy via STructural Similarity, providing a new unifying theory for data-driven atomistic modeling and combining efforts in ML, first-principles thermodynamics, and simulations.

Replacements for Fri, 19 Apr 24

[376]  arXiv:1704.04370 (replaced) [pdf, other]
Title: Fast Similarity Sketching
Comments: The original version was directly based on a conference paper of the same title from FOCS'17. This new version is substantially revised with some cleaner and stronger theorems, particularly concerning the high probability domain. Moreover, there is one more author, Jakob Houen. In addition, one of the old authors, Mathias, has changed surname from Knudsen to Langhede
Subjects: Data Structures and Algorithms (cs.DS)
[377]  arXiv:1904.06866 (replaced) [pdf, ps, other]
Title: Predicting human decisions with behavioral theories and machine learning
Comments: This version includes a large and significant update on the previous (2019) version
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
[378]  arXiv:2001.05053 (replaced) [src]
Title: Tight Static Lower Bounds for Non-Adaptive Data Structures
Comments: This paper has been superceded and merged with arXiv:2308.16042
Subjects: Data Structures and Algorithms (cs.DS)
[379]  arXiv:2002.06910 (replaced) [pdf, other]
Title: t-viSNE: Interactive Assessment and Interpretation of t-SNE Projections
Comments: This manuscript is published in the IEEE Transactions on Visualization and Computer Graphics Journal (IEEE TVCG)
Journal-ref: IEEE TVCG 2020, 26(8), 2696-2714
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
[380]  arXiv:2012.01205 (replaced) [pdf, other]
Title: VisEvol: Visual Analytics to Support Hyperparameter Search through Evolutionary Optimization
Comments: This manuscript is accepted for publication in a special issue of Computer Graphics Forum (CGF)
Journal-ref: Computer Graphics Forum 2021, 40(3), 201-214
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
[381]  arXiv:2103.14539 (replaced) [pdf, other]
Title: FeatureEnVi: Visual Analytics for Feature Engineering Using Stepwise Selection and Semi-Automatic Extraction Approaches
Comments: This manuscript is accepted for publication in the IEEE Transactions on Visualization and Computer Graphics Journal (IEEE TVCG)
Journal-ref: IEEE TVCG 2022, 28(4), 1773-1791
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
[382]  arXiv:2106.09435 (replaced) [pdf, other]
Title: Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers
Comments: ICML 2021, 9 pages, coded implementation available in this https URL (jpsro.py in examples)
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
[383]  arXiv:2107.09847 (replaced) [pdf, other]
Title: CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding
Comments: 9 pages with 4 figures and 3 tables. This work has been accepted for presentation at CogSci 2024 and is currently under revision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[384]  arXiv:2109.07877 (replaced) [pdf, other]
Title: MFE-NER: Multi-feature Fusion Embedding for Chinese Named Entity Recognition
Authors: Jiatong Li, Kui Meng
Subjects: Computation and Language (cs.CL)
[385]  arXiv:2112.00334 (replaced) [pdf, other]
Title: VisRuler: Visual Analytics for Extracting Decision Rules from Bagged and Boosted Decision Trees
Comments: This manuscript is accepted for publication in the Information Visualization (IV) - SAGE Journals
Journal-ref: Information Visualization, 2021, 22(2), 115-139
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
[386]  arXiv:2202.09890 (replaced) [pdf, ps, other]
Title: Deterministic Patterns for Multiple Access with Latency and Reliability Guarantees
Subjects: Information Theory (cs.IT)
[387]  arXiv:2203.15753 (replaced) [pdf, other]
Title: HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques
Comments: This manuscript is accepted for publication in Computer Graphics Forum (CGF)
Journal-ref: Computer Graphics Forum 2023, 42(1), 135-154
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
[388]  arXiv:2206.08818 (replaced) [pdf, other]
Title: Projected distances for multi-parameter persistence modules
Comments: 56 pages, 6 figures, several minor corrections, accepted in Annales de l'institut Fourier
Subjects: Algebraic Topology (math.AT); Computational Geometry (cs.CG)
[389]  arXiv:2206.14068 (replaced) [pdf, other]
Title: FuSeBMC v4: Improving code coverage with smart seeds via BMC, fuzzing and static analysis
Comments: 24 pages, In The Formal Aspects of Computing Journal (FAC 2024)
Subjects: Software Engineering (cs.SE)
[390]  arXiv:2207.08015 (replaced) [pdf, ps, other]
Title: Parallel Best Arm Identification in Heterogeneous Environments
Comments: 15 pages (published in SPAA 2024)
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
[391]  arXiv:2207.12653 (replaced) [pdf, other]
Title: Incremental Measurement of Structural Entropy for Dynamic Graphs
Subjects: Information Theory (cs.IT)
[392]  arXiv:2207.14282 (replaced) [pdf, other]
Title: Geometric relative entropies and barycentric Rényi divergences
Comments: v5: Several minor errors corrected following referee feedback. 80 pages
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Mathematical Physics (math-ph); Functional Analysis (math.FA)
[393]  arXiv:2208.08690 (replaced) [pdf, other]
Title: A Survey on Open Information Extraction from Rule-based Model to Large Language Model
Comments: The first five authors contributed to this work equally. Names are ordered randomly
Subjects: Computation and Language (cs.CL)
[394]  arXiv:2208.10878 (replaced) [pdf, other]
Title: Transferability Ranking of Adversarial Examples
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
[395]  arXiv:2210.12100 (replaced) [pdf, other]
Title: Boomerang: Local sampling on image manifolds using diffusion models
Comments: Published in Transactions on Machine Learning Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
[396]  arXiv:2211.10280 (replaced) [pdf, other]
Title: TensAIR: Real-Time Training of Neural Networks from Data-streams
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
[397]  arXiv:2211.11368 (replaced) [pdf, other]
Title: Precise Asymptotics for Spectral Methods in Mixed Generalized Linear Models
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
[398]  arXiv:2211.13525 (replaced) [pdf, other]
Title: Evaluating Search-Based Software Microbenchmark Prioritization
Comments: 17 pages, 7 figures, 4 tables, 1 listing; accepted in IEEE Transactions on Software Engineering
Subjects: Software Engineering (cs.SE)
[399]  arXiv:2212.03338 (replaced) [pdf, other]
Title: Framework-agnostic Semantically-aware Global Reasoning for Segmentation
Comments: Published in WACV 2024
Journal-ref: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2024, pp. 988-998
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[400]  arXiv:2212.03539 (replaced) [pdf, other]
Title: MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels
Comments: This manuscript is accepted for publication in Proceedings of the 16th IEEE Pacific Visualization Symposium (PacificVis '23)
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
[401]  arXiv:2212.04282 (replaced) [pdf, other]
Title: Mitigating Spurious Correlations for Self-supervised Recommendation
Comments: Accepted by Machine Intelligence Research 2023. (MIR paper link this https URL)
Journal-ref: Machine Intelligence Research vol. 20, no. 6, pp. 263-275, 2023
Subjects: Information Retrieval (cs.IR)
[402]  arXiv:2212.11737 (replaced) [pdf, other]
Title: The State of the Art in Enhancing Trust in Machine Learning Models with the Use of Visualizations
Journal-ref: Computer Graphics Forum 2020, 39(3), 713-756
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
[403]  arXiv:2301.04517 (replaced) [pdf, other]
Title: A new dataset for measuring the performance of blood vessel segmentation methods under distribution shifts
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[404]  arXiv:2301.04872 (replaced) [pdf, other]
Title: Explainable Ponzi Schemes Detection on Ethereum
Comments: Accepted to ACM SAC'24
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
[405]  arXiv:2302.03098 (replaced) [pdf, other]
Title: One-shot Empirical Privacy Estimation for Federated Learning
Comments: Final revision, oral presentation at ICLR 2024
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
[406]  arXiv:2302.04143 (replaced) [pdf, other]
Title: Predicting Thrombectomy Recanalization from CT Imaging Using Deep Learning Models
Comments: Medical Imaging with Deep Learning 2022 accepted short paper Jun 2022
Journal-ref: Medical Imaging with Deep Learning 2022
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[407]  arXiv:2302.06836 (replaced) [pdf, other]
Title: COMET: Neural Cost Model Explanation Framework
Comments: Proceedings of the 5th MLSys Conference, Santa Clara, CA, USA, 2024
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
[408]  arXiv:2303.00055 (replaced) [pdf, other]
Title: Learning time-scales in two-layers neural networks
Comments: 64 pages, 15 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
[409]  arXiv:2303.01056 (replaced) [pdf, ps, other]
Title: Deep Learning Based Code Generation Methods: Literature Review
Comments: in Chinese language
Journal-ref: Ruan Jian Xue Bao/Journal of Software (in Chinese), 2024 . http://www.jos.org.cn/1000-9825/6981.htm
Subjects: Software Engineering (cs.SE)
[410]  arXiv:2303.06090 (replaced) [pdf, ps, other]
Title: Simple and efficient four-cycle counting on sparse graphs
Subjects: Data Structures and Algorithms (cs.DS)
[411]  arXiv:2303.12307 (replaced) [pdf, other]
Title: Predicting and Enhancing the Fairness of DNNs with the Curvature of Perceptual Manifolds
Comments: 17pages, Accepted by CVPR 2023, Submitted to TPAMI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[412]  arXiv:2303.12525 (replaced) [pdf, other]
Title: A survey on hardware-based malware detection approaches
Journal-ref: IEEE ACCESS 2024
Subjects: Cryptography and Security (cs.CR)
[413]  arXiv:2303.13959 (replaced) [pdf, other]
Title: Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion
Comments: IJCAI2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[414]  arXiv:2303.14176 (replaced) [pdf, other]
Title: A Hybrid ANN-SNN Architecture for Low-Power and Low-Latency Visual Perception
Journal-ref: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[415]  arXiv:2303.14467 (replaced) [pdf, other]
Title: A Survey on the Densest Subgraph Problem and Its Variants
Comments: Accepted to ACM Computing Surveys
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
[416]  arXiv:2304.00133 (replaced) [pdf, other]
Title: DeforestVis: Behavior Analysis of Machine Learning Models with Surrogate Decision Stumps
Comments: This manuscript is accepted for publication in Computer Graphics Forum (CGF)
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
[417]  arXiv:2304.04974 (replaced) [pdf, other]
Title: Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR
Comments: 12 pages, 7 figures, IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
[418]  arXiv:2304.10265 (replaced) [pdf, other]
Title: Replication in Requirements Engineering: the NLP for RE Case
Subjects: Software Engineering (cs.SE)
[419]  arXiv:2304.12530 (replaced) [pdf, ps, other]
Title: Resource Specifications for Resource-Manipulating Programs
Subjects: Programming Languages (cs.PL)
[420]  arXiv:2305.00220 (replaced) [pdf, other]
Title: Relaxed forced choice improves performance of visual quality assessment methods
Comments: 6 pages, 3 figures, accepted at the 2023 15th International Conference on Quality of Multimedia Experience (QoMEX). Database is publicly accessible at this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[421]  arXiv:2305.11474 (replaced) [pdf, other]
Title: Reciprocal Attention Mixing Transformer for Lightweight Image Restoration
Comments: CVPR 2024 Workshop - NTIRE. Codes are available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[422]  arXiv:2305.11527 (replaced) [pdf, other]
Title: InstructIE: A Bilingual Instruction-based Information Extraction Dataset
Comments: Work in progress; project homepage: this https URL dataset: this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
[423]  arXiv:2305.14497 (replaced) [pdf, other]
Title: Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement
Comments: Accepted to EMNLP 2023 Findings. Codes and prompts are available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[424]  arXiv:2305.18820 (replaced) [pdf, other]
Title: Robust Reinforcement Learning Objectives for Sequential Recommender Systems
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
[425]  arXiv:2306.00605 (replaced) [pdf, other]
Title: Stay on Track: A Frenet Wrapper to Overcome Off-road Trajectories in Vehicle Motion Prediction
Subjects: Robotics (cs.RO)
[426]  arXiv:2306.01162 (replaced) [pdf, other]
Title: Integrated Sensing-Communication-Computation for Edge Artificial Intelligence
Comments: This paper was accepted by IEEE Internet of Things Magazine on April-18-2024
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[427]  arXiv:2306.05923 (replaced) [pdf, other]
Title: When Authentication Is Not Enough: On the Security of Behavioral-Based Driver Authentication Systems
Subjects: Cryptography and Security (cs.CR)
[428]  arXiv:2306.10085 (replaced) [pdf, other]
Title: Current Trends in Digital Twin Development, Maintenance, and Operation: An Interview Study
Subjects: Software Engineering (cs.SE)
[429]  arXiv:2306.11371 (replaced) [pdf, other]
Title: Visually grounded few-shot word learning in low-resource settings
Comments: Accepted to TASLP. arXiv admin note: substantial text overlap with arXiv:2305.15937
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
[430]  arXiv:2306.17332 (replaced) [pdf, other]
Title: Designing Stable Neural Networks using Convex Analysis and ODEs
Comments: 34 pages, 6 figures. This is the accepted version of a paper published in Physica D: Nonlinear Phenomena
Subjects: Machine Learning (cs.LG)
[431]  arXiv:2306.17535 (replaced) [pdf, other]
Title: A multi-level analysis of data quality for formal software citation
Subjects: Digital Libraries (cs.DL)
[432]  arXiv:2307.10757 (replaced) [pdf, other]
Title: Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition
Comments: This paper was accepted by IEEE Transactions on Affective Computing 2024
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
[433]  arXiv:2307.15361 (replaced) [pdf, other]
Title: Confident Feature Ranking
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[434]  arXiv:2308.01055 (replaced) [pdf, other]
Title: Towards optimal sensor placement for inverse problems in spaces of measures
Comments: 34 pages, 8 figures
Journal-ref: Inverse Probl. 40.5 (2024) 055007
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
[435]  arXiv:2308.02022 (replaced) [pdf, other]
Title: Efficient Sentiment Analysis: A Resource-Aware Evaluation of Feature Extraction Techniques, Ensembling, and Deep Learning Models
Subjects: Computation and Language (cs.CL)
[436]  arXiv:2308.02465 (replaced) [pdf, other]
Title: Label Inference Attacks against Node-level Vertical Federated GNNs
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
[437]  arXiv:2308.06409 (replaced) [pdf, other]
Title: Through the Lens of Google CrUX: Dissecting Web Browsing Experience Across Devices and Countries
Comments: 9 pages, 6 figures and 1 table. Accepted for publication at 2024 IFIP Networking Conference
Subjects: Networking and Internet Architecture (cs.NI); Performance (cs.PF)
[438]  arXiv:2308.06981 (replaced) [pdf, other]
Title: The Sound Demixing Challenge 2023 $\unicode{x2013}$ Cinematic Demixing Track
Comments: Accepted for Transactions of the International Society for Music Information Retrieval
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
[439]  arXiv:2308.07203 (replaced) [pdf, other]
Title: Successive Refinement of Shannon Cipher System Under Maximal Leakage
Subjects: Information Theory (cs.IT)
[440]  arXiv:2308.07476 (replaced) [pdf, ps, other]
Title: Dependent rounding with strong negative-correlation, and scheduling on unrelated machines to minimize completion time
Authors: David G. Harris
Journal-ref: SODA 2024
Subjects: Data Structures and Algorithms (cs.DS)
[441]  arXiv:2308.10390 (replaced) [pdf, other]
Title: LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models
Subjects: Computation and Language (cs.CL)
[442]  arXiv:2308.10913 (replaced) [pdf, other]
Title: Automated mapping of virtual environments with visual predictive coding
Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
[443]  arXiv:2308.15426 (replaced) [pdf, ps, other]
Title: Combining swap structures: the case of Paradefinite Ivlev-like modal logics based on FDE
Comments: Definition of hyperintensionality on p.2 was corrected
Subjects: Logic in Computer Science (cs.LO); Logic (math.LO)
[444]  arXiv:2308.16512 (replaced) [pdf, other]
Title: MVDream: Multi-view Diffusion for 3D Generation
Comments: Reorganized for arXiv; Our project page is this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[445]  arXiv:2309.05153 (replaced) [pdf, other]
Title: Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[446]  arXiv:2309.08513 (replaced) [pdf, other]
Title: SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels
Comments: This work has been accepted by IJCV2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[447]  arXiv:2309.08909 (replaced) [pdf, other]
Title: CARLA-Loc: Synthetic SLAM Dataset with Full-stack Sensor Setup in Challenging Weather and Dynamic Environments
Subjects: Robotics (cs.RO)
[448]  arXiv:2309.10931 (replaced) [pdf, ps, other]
Title: A Family of Pretrained Transformer Language Models for Russian
Comments: to appear in LREC-COLING-2024
Subjects: Computation and Language (cs.CL)
[449]  arXiv:2309.11744 (replaced) [pdf, ps, other]
Title: Infinite Horizon Average Cost Optimality Criteria for Mean-Field Control
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
[450]  arXiv:2309.13461 (replaced) [pdf, other]
Title: Tight bounds on Pauli channel learning without entanglement
Comments: 20 pages, 2 figure; v2: add Fig.2 comparing our lower bound with noisy-entanglement-assisted upper bound, close to accepted version
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
[451]  arXiv:2309.16208 (replaced) [pdf, other]
Title: Low-rank tensor completion via tensor joint rank with logarithmic composite norm
Authors: Hongbing Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[452]  arXiv:2309.16388 (replaced) [pdf, other]
Title: Exposing Image Splicing Traces in Scientific Publications via Uncertainty-guided Refinement
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[453]  arXiv:2309.16408 (replaced) [pdf, other]
Title: Assessing the Solvency of Virtual Asset Service Providers: Are Current Standards Sufficient?
Subjects: General Finance (q-fin.GN); Cryptography and Security (cs.CR)
[454]  arXiv:2310.01884 (replaced) [pdf, other]
Title: Enhanced LFTSformer: A Novel Long-Term Financial Time Series Prediction Model Using Advanced Feature Engineering and the DS Encoder Informer Architecture
Comments: The methodology, experiments, and language of the original version have been completely updated. Detailed adjustments will be made in the future
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
[455]  arXiv:2310.02141 (replaced) [pdf, other]
Title: Adaptive Gait Modeling and Optimization for Principally Kinematic Systems
Comments: 7 pages, 4 figures
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
[456]  arXiv:2310.02233 (replaced) [pdf, other]
Title: Generalized Schrödinger Bridge Matching
Comments: ICLR 2024 Camera Ready
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
[457]  arXiv:2310.02670 (replaced) [pdf, other]
Title: Searching 2D-Strings for Matching Frames
Subjects: Data Structures and Algorithms (cs.DS)
[458]  arXiv:2310.05886 (replaced) [pdf, other]
Title: Streaming Anchor Loss: Augmenting Supervision with Temporal Significance
Comments: Published at IEEE ICASSP 2024, please see this https URL
Journal-ref: In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6110-6114). IEEE
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[459]  arXiv:2310.07108 (replaced) [pdf, other]
Title: An efficient saddle search method for ordered phase transitions involving translational invariance
Subjects: Numerical Analysis (math.NA)
[460]  arXiv:2310.08182 (replaced) [pdf, other]
Title: XIMAGENET-12: An Explainable AI Benchmark Dataset for Model Robustness Evaluation
Comments: Paper accepted by Synthetic Data for Computer Vision Workshop @ IEEE CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[461]  arXiv:2310.08475 (replaced) [pdf, other]
Title: Can We Edit Multimodal Large Language Models?
Comments: EMNLP 2023. Add the Exact Match/Accuracy results of Reliability and T-Generality
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
[462]  arXiv:2310.09271 (replaced) [pdf, other]
Title: Efficiency of Non-Truthful Auctions in Auto-bidding with Budget Constraints
Subjects: Computer Science and Game Theory (cs.GT)
[463]  arXiv:2310.10688 (replaced) [pdf, other]
Title: A decoder-only foundation model for time-series forecasting
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[464]  arXiv:2310.18784 (replaced) [pdf, other]
Title: High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise
Comments: 32 pages, 4 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
[465]  arXiv:2311.01188 (replaced) [pdf, other]
Title: Terrain-Informed Self-Supervised Learning: Enhancing Building Footprint Extraction from LiDAR Data with Limited Annotations
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[466]  arXiv:2311.01949 (replaced) [pdf, other]
Title: Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks
Comments: Accepted by ICASSP 2024
Subjects: Computation and Language (cs.CL)
[467]  arXiv:2311.06503 (replaced) [pdf, other]
Title: Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering
Comments: Work in progress. Code is available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[468]  arXiv:2311.06634 (replaced) [pdf, ps, other]
Title: Back to Basics: Fast Denoising Iterative Algorithm
Authors: Deborah Pereg
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[469]  arXiv:2311.07601 (replaced) [pdf, other]
Title: Online Advertisements with LLMs: Opportunities and Challenges
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
[470]  arXiv:2311.08329 (replaced) [pdf, other]
Title: KTRL+F: Knowledge-Augmented In-Document Search
Subjects: Computation and Language (cs.CL)
[471]  arXiv:2311.09104 (replaced) [pdf, other]
Title: Cross-view and Cross-pose Completion for 3D Human Understanding
Comments: CVPR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[472]  arXiv:2311.09590 (replaced) [pdf, other]
Title: MARformer: An Efficient Metal Artifact Reduction Transformer for Dental CBCT Images
Comments: under consideration of Computer Vision and Image Understanding journal
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[473]  arXiv:2311.11581 (replaced) [pdf, other]
Title: Generalized extrapolation methods based on compositions of a basic 2nd-order scheme
Comments: 17 figures
Subjects: Numerical Analysis (math.NA)
[474]  arXiv:2311.12554 (replaced) [pdf, other]
Title: Multiscale interpolative construction of quantized tensor trains
Authors: Michael Lindsey
Subjects: Numerical Analysis (math.NA)
[475]  arXiv:2311.13590 (replaced) [pdf, other]
Title: Triangle-free $2$-matchings
Authors: Katarzyna Paluch
Comments: A more polished version and a clearer explanation of \Delta-diminishing hinges and 1-vulnerable triangles
Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM)
[476]  arXiv:2311.14422 (replaced) [pdf, ps, other]
Title: Investigation on the Impact of Heat Waves on Distribution System Failures
Subjects: Systems and Control (eess.SY)
[477]  arXiv:2311.15260 (replaced) [pdf, other]
Title: NeuRAD: Neural Rendering for Autonomous Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[478]  arXiv:2311.16520 (replaced) [pdf, other]
Title: Value Approximation for Two-Player General-Sum Differential Games with State Constraints
Comments: Accepted to TRO
Subjects: Robotics (cs.RO); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
[479]  arXiv:2311.17116 (replaced) [pdf, other]
Title: REF$^2$-NeRF: Reflection and Refraction aware Neural Radiance Field
Comments: 10 pages, 8 figures, 2 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[480]  arXiv:2312.02027 (replaced) [pdf, other]
Title: Stochastic Optimal Control Matching
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
[481]  arXiv:2312.02734 (replaced) [pdf, ps, other]
Title: Geometric Data-Driven Dimensionality Reduction in MPC with Guarantees
Comments: This paper is presented at ECC 2024
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
[482]  arXiv:2312.04265 (replaced) [pdf, other]
Title: Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[483]  arXiv:2312.04519 (replaced) [pdf, other]
Title: Bootstrapping Autonomous Driving Radars with Self-Supervised Learning
Comments: 12 pages, 5 figures, to be published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[484]  arXiv:2312.04876 (replaced) [pdf, other]
Title: GVE-Louvain: Fast Louvain Algorithm for Community Detection in Shared Memory Setting
Authors: Subhajit Sahu
Comments: 11 pages, 8 figures, 2 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
[485]  arXiv:2312.05771 (replaced) [pdf, other]
Title: Hacking Task Confounder in Meta-Learning
Comments: Accepted by IJCAI 2024, 9 pages, 5 figures, 4 tables
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[486]  arXiv:2312.05916 (replaced) [pdf, ps, other]
Title: Switching Frequency Limitation with Finite Control Set Model Predictive Control via Slack Variables
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
[487]  arXiv:2312.06585 (replaced) [pdf, other]
[488]  arXiv:2312.08888 (replaced) [pdf, other]
Title: Read Between the Layers: Leveraging Intra-Layer Representations for Rehearsal-Free Continual Learning with Pre-Trained Models
Comments: Preprint under review
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[489]  arXiv:2312.11219 (replaced) [pdf, other]
Title: Generalised Adaptive Cross Approximation for Convolution Quadrature based Boundary Element Formulation
Subjects: Numerical Analysis (math.NA)
[490]  arXiv:2312.12871 (replaced) [pdf, other]
Title: Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[491]  arXiv:2312.16400 (replaced) [pdf, other]
Title: LGMRec: Local and Global Graph Learning for Multimodal Recommendation
Comments: Accepted by AAAI2024
Subjects: Information Retrieval (cs.IR)
[492]  arXiv:2312.16867 (replaced) [pdf, other]
Title: DualFluidNet: an Attention-based Dual-pipeline Network for FLuid Simulation
Comments: 14 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
[493]  arXiv:2401.06091 (replaced) [pdf, other]
Title: A Closer Look at AUROC and AUPRC under Class Imbalance
Authors: Matthew B. A. McDermott (1), Lasse Hyldig Hansen (2), Haoran Zhang (3), Giovanni Angelotti (4), Jack Gallifant (3) ((1) Harvard Medical School, (2) Aarhus University, (3) Massachusetts Institute of Technology, (4) IRCCS Humanitas Research Hospital)
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
[494]  arXiv:2401.06341 (replaced) [pdf, other]
Title: AffordanceLLM: Grounding Affordance from Vision Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[495]  arXiv:2401.09834 (replaced) [pdf, ps, other]
Title: Convergence of a spatial semidiscretization for a three-dimensional stochastic Allen-Cahn equation with multiplicative noise
Authors: Binjie Li, Qin Zhou
Subjects: Numerical Analysis (math.NA); Probability (math.PR)
[496]  arXiv:2401.10703 (replaced) [pdf, other]
Title: DRAT Proofs of Unsatisfiability for SAT Modulo Monotonic Theories
Subjects: Logic in Computer Science (cs.LO)
[497]  arXiv:2401.11609 (replaced) [pdf, other]
Title: Graph Edits for Counterfactual Explanations: A comparative study
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
[498]  arXiv:2401.12451 (replaced) [pdf, ps, other]
Title: Methods and strategies for improving the novel view synthesis quality of neural radiation field
Journal-ref: IEEE ACCESS 12 (2024) 50548-50555
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[499]  arXiv:2401.15355 (replaced) [pdf, ps, other]
Title: Improved bounds on the interactive capacity via error pattern analysis
Comments: Shorter version accepted at ISIT 2024
Subjects: Information Theory (cs.IT)
[500]  arXiv:2401.15663 (replaced) [pdf, other]
Title: Low-resolution Prior Equilibrium Network for CT Reconstruction
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
[501]  arXiv:2401.16116 (replaced) [src]
Title: Quantum Cheques
Comments: Construction 1 is insecure which is the building block for the rest of the paper. Therefore, we decided to withdraw it
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR)
[502]  arXiv:2401.16158 (replaced) [pdf, other]
Title: Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Comments: Accepted by ICLR 2024 Workshop in Large Language Model (LLM) Agents
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[503]  arXiv:2401.16553 (replaced) [pdf, other]
Title: SelectLLM: Can LLMs Select Important Instructions to Annotate?
Comments: First Authors: Ritik Sachin Parkar and Jaehyung Kim | Second Author: Jong Inn Park | PI: Dongyeop Kang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[504]  arXiv:2402.01805 (replaced) [pdf, other]
Title: Can LLMs perform structured graph reasoning?
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[505]  arXiv:2402.01858 (replaced) [pdf, other]
Title: Explaining latent representations of generative models with large multimodal models
Comments: ICLR 2024 Workshop on Reliable and Responsible Foundation Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[506]  arXiv:2402.02286 (replaced) [pdf, other]
Title: Multi-Level Aggregation and Recursive Alignment Architecture for Efficient Parallel Inference Segmentation Network
Comments: 15 pages, 9 figures and 12 Tables. Manuscript completed on April 30, 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[507]  arXiv:2402.06118 (replaced) [pdf, other]
Title: ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
Comments: 10 pages, 3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[508]  arXiv:2402.06244 (replaced) [pdf, other]
Title: Quantifying and Enhancing Multi-modal Robustness with Modality Preference
Comments: Accepted to ICLR 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[509]  arXiv:2402.06598 (replaced) [pdf, other]
Title: CigaR: Cost-efficient Program Repair with LLMs
Subjects: Software Engineering (cs.SE)
[510]  arXiv:2402.06618 (replaced) [pdf, other]
Title: Polynomial parametrisation of the canonical iterates to the solution of $-γg'= g^{-1}$
Authors: Roland Miyamoto
Comments: 7 pages, 1 figure, 2 tables
Subjects: Combinatorics (math.CO); Classical Analysis and ODEs (math.CA); Numerical Analysis (math.NA)
[511]  arXiv:2402.06764 (replaced) [pdf, other]
Title: GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding
Comments: Published in AAAI Spring Symposium: AAAI-MAKE 2024
Subjects: Artificial Intelligence (cs.AI)
[512]  arXiv:2402.06775 (replaced) [pdf, ps, other]
Title: Twenty Constructionist Things to Do with Artificial Intelligence and Machine Learning
Subjects: Computers and Society (cs.CY)
[513]  arXiv:2402.06885 (replaced) [pdf, other]
Title: DimVis: Interpreting Visual Clusters in Dimensionality Reduction With Explainable Boosting Machine
Comments: This manuscript is accepted for publication in EuroVis 2024 MLVis Workshop
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Computation (stat.CO)
[514]  arXiv:2402.07632 (replaced) [pdf, other]
Title: Overconfident and Unconfident AI Hinder Human-AI Collaboration
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
[515]  arXiv:2402.08171 (replaced) [pdf, ps, other]
Title: Epistemic Power in AI Ethics Labor: Legitimizing Located Complaints
Comments: Accepted to ACM FAccT 2024
Subjects: Computers and Society (cs.CY)
[516]  arXiv:2402.11515 (replaced) [pdf, other]
Title: Optimal Parallelization Strategies for Active Flow Control in Deep Reinforcement Learning-Based Computational Fluid Dynamics
Authors: Wang Jia, Hang Xu
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
[517]  arXiv:2402.13919 (replaced) [pdf, other]
Title: SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization
Comments: Equal contribution for the first two authors
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[518]  arXiv:2402.15473 (replaced) [pdf, other]
Title: Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization
Comments: 19 pages, 6 figures, 21 tables
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[519]  arXiv:2402.15584 (replaced) [pdf, other]
Title: State Space Models for Event Cameras
Comments: 18 pages, 5 figures, 6 tables, CVPR 2024 Camera Ready paper
Journal-ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[520]  arXiv:2402.15758 (replaced) [pdf, other]
Title: Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[521]  arXiv:2402.16078 (replaced) [pdf, other]
Title: Beyond Spatio-Temporal Representations: Evolving Fourier Transform for Temporal Graphs
Comments: Accepted as a full conference paper in the International Conference on Learning Representations 2024
Subjects: Machine Learning (cs.LG)
[522]  arXiv:2402.17152 (replaced) [pdf, other]
Title: Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
Comments: 26 pages, 13 figures. this https URL Full technical report to follow
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
[523]  arXiv:2402.17177 (replaced) [pdf, other]
Title: Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Comments: 37 pages, 18 figures; GitHub: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[524]  arXiv:2402.17402 (replaced) [pdf, other]
Title: Beacon, a lightweight deep reinforcement learning benchmark library for flow control
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Systems and Control (eess.SY)
[525]  arXiv:2402.19329 (replaced) [pdf, other]
Title: Social Links vs. Language Barriers: Decoding the Global Spread of Streaming Content
Comments: 8pages, 4 figures, and 1 table in manuscript, including 9 SI figures and 7 SI tables
Subjects: Physics and Society (physics.soc-ph); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
[526]  arXiv:2403.00028 (replaced) [pdf, ps, other]
Title: Lower Bounds for Differential Privacy Under Continual Observation and Online Threshold Queries
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
[527]  arXiv:2403.00199 (replaced) [pdf, other]
Title: Improving Socratic Question Generation using Data Augmentation and Preference Optimization
Comments: To appear at the 19th BEA Workshop co-located with NAACL-2024
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
[528]  arXiv:2403.01194 (replaced) [pdf, other]
Title: A Comparative Study of Rapidly-exploring Random Tree Algorithms Applied to Ship Trajectory Planning and Behavior Generation
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
[529]  arXiv:2403.01255 (replaced) [pdf, other]
Title: Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey
Journal-ref: Information Fusion, Elsevier, 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[530]  arXiv:2403.01888 (replaced) [pdf, other]
Title: Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost Benchmarks
Comments: Submitted to AutoML Conference 2024 ABCD Track
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[531]  arXiv:2403.04062 (replaced) [pdf, other]
Title: Chance-Constrained Control for Safe Spacecraft Autonomy: Convex Programming Approach
Authors: Kenshiro Oguri
Comments: Accepted for 2024 IEEE American Control Conference
Subjects: Optimization and Control (math.OC); Robotics (cs.RO); Systems and Control (eess.SY)
[532]  arXiv:2403.04923 (replaced) [pdf, other]
Title: Control-based Graph Embeddings with Data Augmentation for Contrastive Learning
Comments: Accepted in 2024 American Control Conference (ACC), July 8-12, 2024 in Toronto, ON, Canada
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
[533]  arXiv:2403.07760 (replaced) [pdf, other]
Title: Simplified Tight Bounds for Monotone Minimal Perfect Hashing
Authors: Dmitry Kosolobov
Comments: 13 pages, 4 figures
Subjects: Data Structures and Algorithms (cs.DS)
[534]  arXiv:2403.12005 (replaced) [pdf, other]
Title: Visualization for Trust in Machine Learning Revisited: The State of the Field in 2023
Comments: This manuscript is accepted for publication in the IEEE Computer Graphics and Applications Journal (IEEE CG&A)
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)
[535]  arXiv:2403.12213 (replaced) [pdf, ps, other]
Title: Private graphon estimation via sum-of-squares
Comments: 71 pages, accepted to STOC 2024
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Machine Learning (cs.LG); Machine Learning (stat.ML)
[536]  arXiv:2403.13846 (replaced) [pdf, other]
Title: A Clustering Method with Graph Maximum Decoding Information
Comments: 9 pages, 9 figures, IJCNN 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
[537]  arXiv:2403.14115 (replaced) [pdf, other]
Title: Training point-based deep learning networks for forest segmentation with synthetic data
Comments: 15 pages, 4 figures. Submitted to the International Conference on Pattern Recognition (ICPR) 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[538]  arXiv:2403.14707 (replaced) [pdf, other]
Title: Information Fusion in Multimodal IoT Systems for physical activity level monitoring
Subjects: Computers and Society (cs.CY)
[539]  arXiv:2403.15182 (replaced) [pdf, other]
Title: PDE-CNNs: Axiomatic Derivations and Applications
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
[540]  arXiv:2403.15297 (replaced) [pdf, other]
Title: Sphere Neural-Networks for Rational Reasoning
Subjects: Artificial Intelligence (cs.AI)
[541]  arXiv:2403.15919 (replaced) [pdf, other]
Title: Negotiating the Shared Agency between Humans & AI in the Recommender System
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
[542]  arXiv:2403.16038 (replaced) [pdf, other]
Title: Monotonic Paraphrasing Improves Generalization of Language Model Prompting
Comments: Under review at ARR 2024 April
Subjects: Computation and Language (cs.CL)
[543]  arXiv:2403.17893 (replaced) [pdf, other]
Title: A Survey on 3D Egocentric Human Pose Estimation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[544]  arXiv:2403.17924 (replaced) [pdf, other]
Title: AID: Attention Interpolation of Text-to-Image Diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
[545]  arXiv:2403.18933 (replaced) [pdf, other]
Title: SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages
Comments: SemEval 2024 Task Description Paper. arXiv admin note: text overlap with arXiv:2402.08638
Subjects: Computation and Language (cs.CL)
[546]  arXiv:2404.01300 (replaced) [pdf, other]
Title: NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields
Comments: 29 pages, 13 figures. Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[547]  arXiv:2404.02124 (replaced) [pdf, other]
Title: Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models
Comments: NAACL 2024 findings
Subjects: Computation and Language (cs.CL)
[548]  arXiv:2404.02230 (replaced) [pdf, other]
Title: "Against the Void": An Interview and Survey Study on How Rust Developers Use Unsafe Code
Comments: 12 pages with references, preprint
Subjects: Software Engineering (cs.SE)
[549]  arXiv:2404.02491 (replaced) [pdf, other]
Title: Measuring Social Norms of Large Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[550]  arXiv:2404.03027 (replaced) [pdf, other]
Title: JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[551]  arXiv:2404.03035 (replaced) [pdf, other]
Title: Global Convergence of High-Order Regularization Methods with Sums-of-Squares Taylor Models
Subjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)
[552]  arXiv:2404.03094 (replaced) [pdf, other]
Title: Low Frequency Sampling in Model Predictive Path Integral Control
Comments: Published to RA-L
Journal-ref: IEEE Robotics and Automation Letters, vol. 9, no. 5, pp.4543-4550, 2024
Subjects: Robotics (cs.RO); Optimization and Control (math.OC)
[553]  arXiv:2404.03914 (replaced) [pdf, other]
Title: Open vocabulary keyword spotting through transfer learning from speech synthesis
Subjects: Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[554]  arXiv:2404.05317 (replaced) [pdf, other]
Title: WebXR, A-Frame and Networked-Aframe as a Basis for an Open Metaverse: A Conceptual Architecture
Authors: Giuseppe Macario
Comments: minor fixes/rephrasing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
[555]  arXiv:2404.05673 (replaced) [pdf, other]
Title: CoReS: Orchestrating the Dance of Reasoning and Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[556]  arXiv:2404.05728 (replaced) [pdf, ps, other]
Title: A Large-Scale Exploration of $μ$-Transfer
Authors: Lucas Lingle
Comments: V3: Formatting, refs, extra data. V2: Formatting, SP experiment
Subjects: Machine Learning (cs.LG)
[557]  arXiv:2404.06211 (replaced) [pdf, other]
Title: Unified Physical-Digital Attack Detection Challenge
Comments: 11 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[558]  arXiv:2404.06714 (replaced) [pdf, other]
Title: Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness
Comments: 9 pages, 2 figures, 4 tables; accepted at LREC-COLING 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[559]  arXiv:2404.07831 (replaced) [pdf, ps, other]
Title: Protected QR Code-based Anti-counterfeit System for Pharmaceutical Manufacturing
Subjects: Cryptography and Security (cs.CR)
[560]  arXiv:2404.07982 (replaced) [pdf, other]
Title: Language Imbalance Can Boost Cross-lingual Generalisation
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[561]  arXiv:2404.08217 (replaced) [pdf, other]
Title: Escape with Your Self: Expressive Reachability Types with Sound and Decidable Bidirectional Type Checking
Subjects: Programming Languages (cs.PL)
[562]  arXiv:2404.08224 (replaced) [src]
Title: HCL-MTSAD: Hierarchical Contrastive Consistency Learning for Accurate Detection of Industrial Multivariate Time Series Anomalies
Comments: This paper is a manuscript that is still in the process of revision, including Table 1, Figure 2, problem definition in section III.B and method description proposed in section IV. In addition, the submitter has not been authorized by the first author and other co-authors to post the paper to arXiv
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Theory (cs.IT); Systems and Control (eess.SY)
[563]  arXiv:2404.08273 (replaced) [pdf, other]
Title: Struggle with Adversarial Defense? Try Diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
[564]  arXiv:2404.08277 (replaced) [pdf, other]
Title: FaceFilterSense: A Filter-Resistant Face Recognition and Facial Attribute Analysis Framework
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[565]  arXiv:2404.08354 (replaced) [pdf, other]
Title: Gaining More Insight into Neural Semantic Parsing with Challenging Benchmarks
Subjects: Computation and Language (cs.CL)
[566]  arXiv:2404.08786 (replaced) [pdf, other]
Title: NeuroLGP-SM: Scalable Surrogate-Assisted Neuroevolution for Deep Neural Networks
Comments: Accept for IEEE Congress on Evolutionary Computation (CEC) (CEC 2024), Yokohama, Japan, 8 pages, 5 figures, 2 tables
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
[567]  arXiv:2404.08838 (replaced) [pdf, ps, other]
Title: Predicting Traffic Congestion at Urban Intersections Using Data-Driven Modeling
Subjects: Machine Learning (cs.LG)
[568]  arXiv:2404.08977 (replaced) [pdf, other]
Title: RoNID: New Intent Discovery with Generated-Reliable Labels and Cluster-friendly Representations
Comments: DASFAA 2024
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
[569]  arXiv:2404.08995 (replaced) [pdf, other]
Title: Beyond Known Clusters: Probe New Prototypes for Efficient Generalized Class Discovery
Comments: 9 pages, 7 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[570]  arXiv:2404.09037 (replaced) [pdf, other]
Title: Intention-Aware Control Based on Belief-Space Specifications and Stochastic Expansion
Subjects: Systems and Control (eess.SY)
[571]  arXiv:2404.09354 (replaced) [pdf, other]
Title: A Paradigm For Collaborative Pervasive Fog Computing Ecosystems at the Network Edge
Comments: 6 pages
Journal-ref: IEEE International Conference on Communication, Networks, and Satellite (ComNetSat), 2023
Subjects: Networking and Internet Architecture (cs.NI)
[572]  arXiv:2404.09624 (replaced) [pdf, other]
Title: AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[573]  arXiv:2404.09683 (replaced) [pdf, other]
Title: Post-Training Network Compression for 3D Medical Image Segmentation: Reducing Computational Efforts via Tucker Decomposition
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[574]  arXiv:2404.09835 (replaced) [pdf, other]
Title: Hardness of Packing, Covering and Partitioning Simple Polygons with Unit Squares
Comments: 56 pages, 64 figures
Subjects: Computational Geometry (cs.CG)
[575]  arXiv:2404.10209 (replaced) [pdf, other]
Title: Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large Language Models
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[576]  arXiv:2404.10320 (replaced) [pdf, other]
Title: CARE to Compare: A real-world dataset for anomaly detection in wind turbine data
Comments: 28 pages, 3 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
[577]  arXiv:2404.10335 (replaced) [pdf, other]
Title: Efficiently Adversarial Examples Generation for Visual-Language Models under Targeted Transfer Scenarios using Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[578]  arXiv:2404.10496 (replaced) [pdf, other]
Title: Spiral of Silences: How is Large Language Model Killing Information Retrieval? -- A Case Study on Open Domain Question Answering
Subjects: Information Retrieval (cs.IR)
[579]  arXiv:2404.10520 (replaced) [pdf, ps, other]
Title: A Game-Theoretic Approach for PMU Deployment Against False Data Injection Attacks
Subjects: Systems and Control (eess.SY)
[580]  arXiv:2404.10662 (replaced) [pdf, other]
Title: Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
[581]  arXiv:2404.10715 (replaced) [pdf, other]
Title: Dynamic Frequency-Based Fingerprinting Attacks against Modern Sandbox Environments
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
[582]  arXiv:2404.10756 (replaced) [pdf, other]
Title: A High-Order Conservative Cut Finite Element Method for Problems in Time-Dependent Domains
Comments: 22 pages, 20 figures
Subjects: Numerical Analysis (math.NA)
[583]  arXiv:2404.10849 (replaced) [pdf, other]
Title: End-To-End Training and Testing Gamification Framework to Learn Human Highway Driving
Comments: Accepted by ITSC
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
[584]  arXiv:2404.10879 (replaced) [pdf, other]
Title: FlexMap Fusion: Georeferencing and Automated Conflation of HD Maps with OpenStreetMap
Comments: 7 pages
Subjects: Robotics (cs.RO)
[585]  arXiv:2404.10898 (replaced) [pdf, other]
Title: Nonnegative tensor train for the multicomponent Smoluchowski equation
Comments: 12 pages, 4 figures, 4 tables
Subjects: Numerical Analysis (math.NA); Soft Condensed Matter (cond-mat.soft); Optimization and Control (math.OC)
[586]  arXiv:2404.10944 (replaced) [pdf, other]
Title: Threat Behavior Textual Search by Attention Graph Isomorphism
Journal-ref: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024
Subjects: Information Retrieval (cs.IR)
[587]  arXiv:2404.11031 (replaced) [pdf, other]
Title: TaCOS: Task-Specific Camera Optimization with Simulation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
[588]  arXiv:2404.11052 (replaced) [pdf, other]
Title: Supervised Contrastive Vision Transformer for Breast Histopathological Image Classification
Comments: 8 pages, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
[589]  arXiv:2404.11086 (replaced) [pdf, other]
Title: ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models
Comments: arXiv admin note: text overlap with arXiv:2305.08322 by other authors
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
[590]  arXiv:2404.11098 (replaced) [pdf, other]
Title: LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[591]  arXiv:2404.11136 (replaced) [pdf, ps, other]
Title: On the Performance of RIS-assisted Networks with HQAM
Comments: 6 pages, 5 figures, to be presented in EuCNC & 6G Summit 2024
Subjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
[592]  arXiv:2404.11163 (replaced) [pdf, other]
Title: LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory
Comments: Published at IJCAI 2024
Subjects: Machine Learning (cs.LG)
[593]  arXiv:2404.11172 (replaced) [pdf, other]
Title: Deep Neural Networks via Complex Network Theory: a Perspective
Comments: IJCAI'24 (full paper, main track)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
[594]  arXiv:2404.11184 (replaced) [pdf, other]
Title: FIZZ: Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document
Subjects: Computation and Language (cs.CL)
[595]  arXiv:2404.11194 (replaced) [pdf, other]
Title: Simultaneous compensation of input delay and state/input quantization for linear systems via switched predictor feedback
Comments: 12 pages, 7 figures, submitted to Systems & Control Letters
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY); Analysis of PDEs (math.AP)
[596]  arXiv:2404.11326 (replaced) [pdf, other]
Title: Single-temporal Supervised Remote Change Detection for Domain Generalization
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[597]  arXiv:2404.11358 (replaced) [pdf, other]
Title: DeblurGS: Gaussian Splatting for Camera Motion Blur
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[598]  arXiv:2404.11410 (replaced) [pdf, other]
Title: SERENE: A Collusion Resilient Replication-based Verification Framework
Comments: 9 pages
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
[599]  arXiv:2404.11433 (replaced) [pdf, other]
Title: Runtime Analyses of NSGA-III on Many-Objective Problems
Comments: To appear at GECCO 2024
Subjects: Neural and Evolutionary Computing (cs.NE)
[600]  arXiv:2404.11459 (replaced) [pdf, other]
Title: Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
Authors: Wei Chen, Zhiyuan Li
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
[601]  arXiv:2404.11525 (replaced) [pdf, other]
Title: JointViT: Modeling Oxygen Saturation Levels with Joint Supervision on Long-Tailed OCTA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
[602]  arXiv:2404.11596 (replaced) [pdf, other]
Title: Urban highways are barriers to social ties
Comments: Main text: 8 pages, 4 figures, 1 table. Supplementary Information: 11 pages, 11 figures, 5 tables
Subjects: Physics and Society (physics.soc-ph); Computers and Society (cs.CY)
[603]  arXiv:2404.11603 (replaced) [pdf, other]
Title: Isoparametric Virtual Element Methods
Subjects: Numerical Analysis (math.NA)
[604]  arXiv:2404.11607 (replaced) [pdf, other]
Title: Private federated discovery of out-of-vocabulary words for Gboard
Subjects: Data Structures and Algorithms (cs.DS)
[605]  arXiv:2404.11614 (replaced) [pdf, other]
Title: Dynamic Typography: Bringing Text to Life via Video Diffusion Prior
Comments: Our demo page is available at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
[ total of 605 entries: 1-605 ]
[ showing up to 2000 entries per page: fewer | more ]

Disable MathJax (What is MathJax?)

Links to: arXiv, form interface, find, cs, recent, 2404, contact, help  (Access key information)